Hakan YÜKSEL hakan.yuksel@turkiyefinans.com.tr http://yukselis.wordpress.com Failover Cluster Troubleshooting 10.08.2011

Webcast - Failover Cluster Troubleshooting


Page 1: Webcast - Failover Cluster Troubleshooting

Hakan YÜKSEL
hakan.yuksel@turkiyefinans.com.tr
http://yukselis.wordpress.com

Failover Cluster Troubleshooting
10.08.2011

Page 2: Webcast - Failover Cluster Troubleshooting

Agenda

• Cluster concepts, requirements, architecture, log management
• Quorum model
• Troubleshooting
• Questions and answers

Page 3: Webcast - Failover Cluster Troubleshooting

Cluster Requirements

Review hardware and infrastructure requirements for a failover cluster.

• Servers: Microsoft supports a failover cluster solution only if all the hardware components are marked as "Certified for Windows Server 2008 R2." In addition, the complete configuration (servers, network, and storage) must pass all tests in the Validate a Configuration Wizard, which is included in the Failover Cluster Manager snap-in.

• Storage: You must use shared storage that is compatible with Windows Server 2008 R2.

• Network adapters and cable (for network communication): The network hardware, like other components in the failover cluster solution, must be marked as "Certified for Windows Server 2008 R2." If you use iSCSI, your network adapters should be dedicated to either network communication or iSCSI, not both.

• Account for administering the cluster: When you first create a cluster or add servers to it, you must be logged on to the domain with an account that has administrator rights and permissions on all servers in that cluster. The account does not need to be a Domain Admins account; it can be a Domain Users account that is in the Administrators group on each clustered server. In addition, if the account is not a Domain Admins account, the account (or the group that the account is a member of) must be delegated Create Computer Objects and Read All Properties permissions in the domain.

• Failover clustering requires the Enterprise or Datacenter edition of Windows Server 2008 R2; it cannot be enabled on Standard Edition servers.
• SCSI-3 Persistent Reservations (PRs) are required.
• Basic GPT and MBR disks are supported.
• Multipath I/O (MPIO) is recommended.

Page 4: Webcast - Failover Cluster Troubleshooting

Frequently Asked Questions

• Can I build a cluster on virtual machines?
  – Yes!
• Can physical and virtual servers be in the same cluster?
  – Yes!
• Do the servers have to have identical hardware?
  – No.

• If the configuration passes the validation tests, it is supported.

Page 5: Webcast - Failover Cluster Troubleshooting

Cluster Validate

• It is built into the product.
• It warns you when requirements are not met.
• It runs all checks on the servers and storage that make up the cluster.
• It must be run after every change:
  o Create a new cluster
  o Add a node, disk, or network
  o Update system software (drivers, firmware, service packs, MPIO)
  o Configure hardware (HBA, MPIO, network adapter, etc.)
  o Change any component in your solution
  o It's the very first thing you do!

http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx#BKMK_understanding_tests

Page 6: Webcast - Failover Cluster Troubleshooting

Quorum and Majority Node Set

• The quorum is the database that holds the cluster configuration and state information.

• Windows Server 2008 introduces a new quorum model (Node and Disk Majority) in which the quorum disk is used somewhat differently: the quorum disk counts as one vote alongside the node votes.

• Majority Node Set (MNS) is a democratic system. In the classic quorum-disk model there is a single vote, and whichever node owns it owns the cluster; in MNS the majority owns the cluster. For example, if a split-brain scenario occurs in a 5-node cluster, each node checks how many nodes in total it can communicate with. If a node can communicate with two other nodes, those 3 nodes form a majority of the 5 and take ownership of the cluster. The other two nodes recognize that they are in the minority and assume that the other 3 nodes can communicate with each other.

• In a split-brain scenario in a 2003 cluster, applications stayed active on whichever node owned the quorum disk; whether clients could actually reach that node did not matter.
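The 5-node split-brain example above comes down to a simple majority test. The following is a minimal sketch of that arithmetic only, not the cluster service's actual implementation; the function name and the partition labels are hypothetical.

```python
def has_majority(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True if a partition holds more than half of all node votes."""
    return reachable_nodes > total_nodes / 2


# Five-node cluster split into a partition of three and a partition of two.
total = 5
partitions = {"A": 3, "B": 2}  # hypothetical partition labels
for name, size in partitions.items():
    verdict = "owns the cluster" if has_majority(size, total) else "is in the minority"
    print(f"partition {name} ({size}/{total} nodes) {verdict}")
```

Note that an even split (for example 2 of 4 nodes) is not a majority, which is why even-sized clusters add a witness vote.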

Page 7: Webcast - Failover Cluster Troubleshooting

Quorum Overview

4 quorum types:

• Node Majority
• Node and Disk Majority
• Node and File Share Majority
• Disk Only (not recommended)

Majority is greater than 50% of the votes.

Possible voters: nodes (1 vote each), disk witness (1 max), file share witness (1 max)

Page 8: Webcast - Failover Cluster Troubleshooting

Choosing the Quorum Model

Considerations for choosing a quorum mode include:

• By default, failover clustering chooses:
  – Node Majority if there is an odd number of nodes in the cluster
  – Node and Disk Majority if there is an even number of nodes in the cluster

• Node and File Share Majority is recommended for geographically dispersed clusters

• No Majority: Disk Only is not recommended, because of the disk subsystem’s single point of failure

• Plan changes to the quorum mode carefully to avoid a mode that may result in loss of quorum
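The default selection rule and the vote arithmetic behind it can be sketched as follows. This is an illustrative model built from the rules on the quorum slides (one vote per node, at most one witness vote, majority means more than 50%); the function names are hypothetical and this is not an API of the cluster service.

```python
WITNESS_MODES = ("Node and Disk Majority", "Node and File Share Majority")


def default_quorum_mode(node_count: int) -> str:
    """Default mode: Node Majority for odd node counts, otherwise Node and Disk Majority."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return "Node Majority" if node_count % 2 == 1 else "Node and Disk Majority"


def total_votes(node_count: int, mode: str) -> int:
    """Each node has one vote; a disk or file share witness adds one more."""
    if mode == "No Majority: Disk Only":
        return 1  # only the disk counts
    return node_count + (1 if mode in WITNESS_MODES else 0)


def votes_needed(votes: int) -> int:
    """Smallest number of votes that is strictly greater than 50%."""
    return votes // 2 + 1


for nodes in (2, 3, 4, 5):
    mode = default_quorum_mode(nodes)
    votes = total_votes(nodes, mode)
    print(f"{nodes} nodes: {mode}, {votes} votes, {votes_needed(votes)} needed for quorum")
```

With the witness vote, an even-sized cluster ends up with an odd number of votes, so a clean 50/50 split cannot occur.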

Page 9: Webcast - Failover Cluster Troubleshooting

Failover Cluster Architecture

• Microsoft Cluster Service (MSCS) uses the shared-nothing model. This means that only one server at a time can own a given resource, such as a disk, a virtual server, or an IP address.

• The cluster database (clusdb) is loaded as the HKLM\Cluster registry hive. A copy is kept on every node and on the quorum, and it contains the most recent update information.
• The nodes replicate this registry data to each other over port 3343.
• A copy of clusdb is also placed on the File Share Witness.

• When the computer is started, the Cluster Disk Driver (Clusdisk.sys) reads the following local registry key to obtain a list of the signatures of the shared disks under cluster management: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures

• Recommendation: the private network should carry heartbeat traffic only; the public network should be configured for mixed use.

Page 10: Webcast - Failover Cluster Troubleshooting

Architecture (continued)

• If no heartbeat is received within 5 seconds, the Host Manager steps in and continues the checks over the public network.

• The Preferred Owners list determines which node a group moves to.

• The Possible Owners list determines whether a resource is allowed to move to a given node.
• All resources in a group must have the same owners.

• "Affect the group": if the resource fails, the whole group fails over.
• For disk resources, "Affect the group" is selected by default.

• Pause Node
  – When you pause a node, existing groups and resources stay online, but additional groups and resources cannot be brought online on the node. Pausing a node is usually done when applying software updates to the node.
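The interaction between Preferred Owners, Possible Owners, and paused nodes described above can be illustrated with a small sketch. The real placement logic in the cluster service is more involved; the function and node names here are hypothetical, and the fallback order for nodes outside the preferred list is an assumption.

```python
from typing import Collection, Optional, Sequence


def pick_failover_target(
    current: str,
    preferred_owners: Sequence[str],
    possible_owners: Collection[str],
    up_nodes: Sequence[str],
    paused_nodes: Collection[str] = frozenset(),
) -> Optional[str]:
    """Pick the node a group would move to, or None if no node qualifies."""
    # A candidate must be up, allowed by Possible Owners, not paused,
    # and different from the node that just failed.
    candidates = [
        node for node in up_nodes
        if node in possible_owners and node not in paused_nodes and node != current
    ]
    # Preferred Owners decides the order in which candidates are tried.
    for node in preferred_owners:
        if node in candidates:
            return node
    # Assumption: otherwise fall back to any remaining eligible node.
    return candidates[0] if candidates else None


target = pick_failover_target(
    current="N1",
    preferred_owners=["N1", "N3", "N2"],
    possible_owners={"N1", "N2", "N3"},
    up_nodes=["N2", "N3"],
)
print(target)
```

In this example the group moves to N3, because N3 appears before N2 in the preferred list and is both up and a possible owner.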

Page 11: Webcast - Failover Cluster Troubleshooting

SCSI Bus Reset, SCSI-3 Persistent Reservation

• Split-brain scenario: the two nodes lose network communication with each other. In this case the cluster disk driver (clusdisk.sys) uses the challenge/defense protocol with SCSI reserve commands: it first sends a reset command, then reserves the quorum disk with a reserve command and brings it online, and finally takes ownership and brings all resources online.

• Starting with Windows Server 2008, SCSI bus resets are no longer used; SCSI-3 persistent reservations are used instead. A SCSI bus reset affected not only the target disk but every disk on the same bus. Depending on the configuration, each node could send a bus reset for each disk. As a result, resources took longer to come online, could remain offline, and sometimes had to be brought online manually.

Page 12: Webcast - Failover Cluster Troubleshooting

Resource Monitor

• Resource monitors check whether the resource groups on the cluster are working correctly. A resource monitor consists of DLLs running under clussvc; dedicated DLLs exist for certain services (Exchange, SQL, etc.). In 2008 it is called RHS.exe.

• If "Turn on maintenance mode for this disk" is selected for a disk, the IsAlive and LooksAlive checks are not performed. The cluster service does not check the disk's status or access the disk (for example, by listing a directory on it) and assumes that the disk is always online.

The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly. This is accomplished by executing IsAlive and LooksAlive processes which are specific to the type of resource
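As a rough illustration of the two health checks and of maintenance mode, the sketch below decides which check would be due at a given second. The 5-second and 60-second intervals are assumed values chosen for the example, and the function is hypothetical; it does not reproduce how RHS schedules its checks internally.

```python
from typing import List


def checks_due(
    elapsed_s: int,
    looks_alive_every: int = 5,   # assumed interval for the quick check
    is_alive_every: int = 60,     # assumed interval for the thorough check
    maintenance: bool = False,
) -> List[str]:
    """Return the health checks that would run at the given second."""
    if maintenance:
        # In maintenance mode no checks run; the resource is assumed online.
        return []
    if elapsed_s % is_alive_every == 0:
        return ["IsAlive"]        # the thorough check supersedes the quick one
    if elapsed_s % looks_alive_every == 0:
        return ["LooksAlive"]
    return []


for second in (5, 7, 60):
    print(second, checks_due(second))
```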

Page 13: Webcast - Failover Cluster Troubleshooting

Failover Process

When two nodes cannot reach each other, each of them tries to access the quorum disk; this is called the arbitration process. Clusdisk.sys manages disk access so that the two nodes cannot both access the disks. With the MNS architecture, the quorum information is maintained through registry replication. These files can be found under \Windows\System32\config. When the cluster starts, the clusdb file is loaded from the registry and the cluster begins operating. This configuration file records which disks the node may access.

Page 14: Webcast - Failover Cluster Troubleshooting

Cluster Components

OBJECT MANAGER (OM) (clussvc.exe)
  Holds the current configuration.

HOST MANAGER (HM)
  Handles adding and removing hosts and detecting node failures, working together with the other modules. When the cluster comes up, it talks over port 3343 to whichever node answers.

MEMBERSHIP MANAGER (MM)
  Writes locally to the cluster key under HKLM and then passes the data to the Object Manager, which loads it into RAM. MM records join and evict events and makes sure the information is shared.

GLOBAL UPDATE MANAGER (GUM)
  Responsible for replicating all changes. For example, during a backup it tells the other nodes that VSS is running, which prevents changes from being made on them. It is responsible for all updates.

RESOURCE CONTROL MANAGER (RCM)
  Works with rhs.exe. Responsible for dependencies. The most important module.

TOPOLOGY MANAGER, NETWORK MANAGER (NM) / INTERFACE MANAGER (IM)
  NIC up / fail.

DATABASE MANAGER (DM)
  Responsible for replication, which it performs through the Global Update Manager. DM keeps the log and loads clusdb into the registry.

QUORUM MANAGER
  Determines whether quorum has been achieved and which quorum model is in use. It is responsible for selecting the correct replica. It can talk to RCM: when quorum cannot be formed, it engages RCM to ask, in effect, "we are about to form quorum and are one vote short; can you give us a vote?"

SECURITY MANAGER
  Encryption, Kerberos relationships.

Page 15: Webcast - Failover Cluster Troubleshooting

Microsoft Failover Cluster Virtual Adapter

• In cluster environments, Microsoft creates an interface named "Microsoft Failover Cluster Virtual Adapter". It is a hidden interface that represents NetFT (Network Fault Tolerant); it carries the communication between cluster nodes and provides redundancy for the heartbeat. The interface binds on top of the existing interfaces, and the traffic from SMB to the SAN is carried over this adapter. NetFT is visible in the output of ipconfig /all and assigns itself an APIPA address (169.254.1.2). No data is actually transferred over this IP address itself; because it is bound to a physical adapter, the utilization shows up on that adapter in Task Manager.

Page 16: Webcast - Failover Cluster Troubleshooting

Failover Cluster Installation Steps

• Failover Cluster Prerequisites
  – Establish a Network Naming Convention
  – TCP/IP Network Configuration
    o Public Network
    o Storage Network
    o Heartbeat Network

• Procedures
  – Prepare the Failover Cluster
    o Create a Domain User Account
    o Add Nodes to an Active Directory Domain
    o Expose Storage to Cluster Nodes
    o Install the Failover Cluster Feature
    o Run Cluster Validation
  – Create and Configure the Failover Cluster
    o Create a Cluster
    o Set Cluster Network Properties and Apply Naming Convention
  – Create a Highly Available Service
    -> Create a Highly Available iSCSI Target
    o Configuring Windows Firewall for Microsoft iSCSI Software Target
    o Installing the Microsoft iSCSI Software Target
    o Create the Failover iSCSI Target Resource Group
    o Create an iSCSI Target in the Microsoft iSCSI Target MMC
    o Create and Configure Virtual Disks
    o Connect Initiators
  – Testing Your Failover Cluster Configuration

• Server Core Installation Option of Windows Server 2008 Step-by-Step Guide: http://technet2.microsoft.com/windowsserver2008/en/library/47a23a74-e13c-46de-8d30-ad0afb1eaffc1033.mspx?mfr=true

Page 17: Webcast - Failover Cluster Troubleshooting

Troubleshooting

• Reviewing cluster events
• Reviewing hardware events
• Using the Validate a Configuration Wizard
• Reviewing storage/SAN events
• Troubleshooting methodologies for cluster issues, whether in Windows 2003 or Windows 2008, are fairly similar. Most of the typical support issues in the cluster category fall under the following categories:
  – Cluster Service fails to start.
  – Cluster resources in a failed state or fail to come online.
  – Determine root cause of cluster failure.
  – Initial configuration of the cluster.

• The Win 2003 legacy CLUSTER.LOG text file no longer exists. In Win 2008 the cluster log is handled by the Event Tracing for Windows (ETW) process. This is the same logging infrastructure that handles events for other aspects you are already well familiar with, such as the System or Application Event logs you view in Event Viewer.

• Command Line
  – c:\>cluster log /gen

• PowerShell
  – C:\PS> Get-ClusterLog

• ForceQuorum
  – net start clussvc /forcequorum (or /fq)

Page 18: Webcast - Failover Cluster Troubleshooting

Troubleshooting Tips

• When you encounter a problem, always, always, always start with Cluster Events
  1. Look at a cluster-wide view of the Cluster events
  2. Dig into all events in the System Event log
  3. Check the Application Event log

• Don't be distracted by symptoms; focus on root cause
  1. For example, if you see Cluster IP Address failures, don't waste lots of time looking at Cluster events
     o Instead look for other networking-related errors
  2. There may be multiple retries after a failure, producing more events. Look for what caused the first failure

• You don't always need to run a FULL validate
  – http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx

• Don't "assume" the cluster will work and skip Validate

Page 19: Webcast - Failover Cluster Troubleshooting

Cluster Events

• Cluster Events
  – Recent Cluster Events shows the events from the last 24 hours.

• Monitoring Cluster Events
  – Fully featured Failover Cluster Management Packs

• Cluster logging level
  – Set-ClusterLog -Level 3

Page 20: Webcast - Failover Cluster Troubleshooting

Configuring Debug Logging

• Logging is enabled by default

• Log files are stored as .ETL in: %WinDir%\System32\winevt\logs\Microsoft-Windows-FailoverClustering

• Default log size is 100 MB
  – Set-ClusterLog -Size 100

• Default log level is 3
  – Set-ClusterLog -Level 3

Cluster Output Levels

Level          Error  Warning  Info  Verbose  Debug
0 (disabled)
1              x
2              x      x
3 (default)    x      x        x
4              x      x        x     x
5              x      x        x     x        x      (can have performance impact)

• Up to three log files
  – This means log history can be kept for up to three reboots
  – The number of logs can be modified via the registry: HKLM\Software\Microsoft\Windows\CurrentVersion\WINEVT\Channels\Microsoft-Windows-FailoverClustering/Diagnostic\FileMax
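The output-level table is cumulative: each level adds one more severity to the ones below it. A minimal sketch of that mapping, with a hypothetical helper name, might look like this:

```python
from typing import Tuple

# Severities in the order the table adds them, from level 1 to level 5.
SEVERITIES = ("Error", "Warning", "Info", "Verbose", "Debug")


def categories_for_level(level: int) -> Tuple[str, ...]:
    """Return the severities written to the cluster log at a given level (0 = disabled)."""
    if not 0 <= level <= len(SEVERITIES):
        raise ValueError("cluster log level must be between 0 and 5")
    return SEVERITIES[:level]


for level in range(6):
    print(level, ", ".join(categories_for_level(level)) or "(disabled)")
```

At the default level 3 the log therefore contains errors, warnings, and informational entries, but no verbose or debug output.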

Page 22: Webcast - Failover Cluster Troubleshooting

Problems Connecting to Cluster Nodes

The 'Create Cluster Wizard', 'Validate a Configuration Wizard', and 'Add Node Wizard' all rely on WMI, so any of the following messages and warnings could be due to WMI issues:

· "RPC Server Unavailable" error.
· Access is Denied.
· The computer 'Node1' could not be reached.
· Failed to retrieve the maximum number of nodes for '{0}'.
· The computer 'Node1.contoso.com' does not have the Failover Clustering feature installed. Use Server Manager to install the feature on this computer.
  o Note: first confirm you have installed the Failover Clustering feature on this node

Troubleshooting Steps

1) Ensure it is not a DNS issue
2) Check that WMI is running on the node (wbemtest)
3) Check your firewall settings
4) Reboot the node
5) Rebuild a corrupt WMI repository
  · In the Services console, manually stop the WMI service to ensure that dependent services are stopped
  · Start the WMI service again
  · Launch an elevated CMD or PowerShell prompt
  · CMD/PS > winmgmt /salvagerepository
6) Patch WMI for performance improvements (KB 974930)
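The first and third troubleshooting steps (DNS and firewall) can be checked from a management workstation before touching the node itself. The sketch below is one possible way to do that with the Python standard library; the helper names are hypothetical, and probing TCP port 135 (the RPC endpoint mapper) is an assumption about what is worth testing first, not a complete WMI connectivity test.

```python
import socket


def resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Step 1: return True if the node name resolves to at least one address."""
    try:
        return bool(resolver(hostname, None))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def port_open(host: str, port: int, timeout: float = 3.0,
              connector=socket.create_connection) -> bool:
    """Step 3: return True if a TCP connection to host:port succeeds."""
    try:
        with connector((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failed_check(node: str) -> str:
    """Run the checks in order and name the first one that fails."""
    if not resolves(node):
        return "DNS"
    if not port_open(node, 135):  # assumed probe: RPC endpoint mapper
        return "firewall or RPC"
    return "none"
```

For example, `first_failed_check("Node1.contoso.com")` would return "DNS" if the name cannot be resolved. The `resolver` and `connector` parameters exist only so the checks can be exercised without a live network.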

Page 23: Webcast - Failover Cluster Troubleshooting

Antivirus Exclusion

Is your antivirus software cluster-aware? Antivirus software that is not cluster-aware may cause unexpected problems on a server that is running Cluster services. For example, you may experience resource failures or problems when you try to move a group to a different node.

If you are troubleshooting failover issues or general problems with Cluster services and antivirus software is installed, temporarily uninstall the antivirus software or check with the manufacturer of the software to determine whether the antivirus software works with Cluster services. Just disabling the antivirus software is insufficient in most cases. Even if you disable the antivirus software, the filter driver is still loaded when you restart the computer.

How can I disable the antivirus software on the system?

http://support.microsoft.com/kb/250355#appliesto

Exclusion List
• The Q:\ (quorum) disk.
• The %Systemroot%\Cluster folder.
• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning (Windows Server 2003).

Page 24: Webcast - Failover Cluster Troubleshooting

Cluster Log Error Meanings

status 170 - Means "The requested resource is in use." This could be related to Persistent Reservation problems; it can also be caused by MPIO, fibre/HBA drivers, or some type of lower-level file system driver or software such as antivirus, quota management, or an open file agent for backup software:
00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
00000c94.000008d4::<date and time>.616 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.

status 5 - Usually a permissions-related problem. In this case the Cluster Service Account (CSA) username/password were not synchronized between the nodes. This can also happen if the cluster loses its Secure Channel connection to the DC, which the CSA needs in order to be authenticated. Another situation in which this can occur is when one of the domain Group Policy Objects (GPO) or one of the Local Policy Objects is missing a User Rights Assignment needed for the CSA to function properly.
000014a0.00001460::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor <Cluster Name> is invalid, status 5.
000014a0.000017d0::::<date and time>.629 WARN [JOIN] Unable to get join version data from sponsor 10.7.47.100 using NTLM package, status 5.

status 1117 - Means ERROR_IO_DEVICE (The request could not be performed because of an I/O device error), seen when Event ID 1123 occurs:
000015a0.000014a8::<date and time>.511 WARN IP Address <IP Address resource name>: IP Interface 4 (address 10.101.160.65) failed LooksAlive check, status 1117, address 0x10119e0, instance 0xf74d6fb8.
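When a generated cluster log is long, it can help to pull out every line that carries a status code and attach its meaning. The sketch below does that for the three codes discussed on this slide; the regular expression and function name are illustrative assumptions, and the sample lines are the ones quoted above.

```python
import re
from typing import List, Tuple

# Win32 error codes discussed on this slide.
STATUS_MEANINGS = {
    5: "ERROR_ACCESS_DENIED: access is denied (usually permissions)",
    170: "ERROR_BUSY: the requested resource is in use",
    1117: "ERROR_IO_DEVICE: I/O device error",
}

STATUS_RE = re.compile(r"\bstatus (\d+)")


def explain_statuses(log_text: str) -> List[Tuple[int, str]]:
    """Return (code, meaning) for every log line that mentions a status code."""
    findings = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            code = int(match.group(1))
            findings.append((code, STATUS_MEANINGS.get(code, "not covered here")))
    return findings


sample = """\
00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
000014a0.00001460::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor <Cluster Name> is invalid, status 5.
"""
for code, meaning in explain_statuses(sample):
    print(code, meaning)
```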

Page 25: Webcast - Failover Cluster Troubleshooting

What Is a Cluster and Why Do We Use It

• Cluster Blog
  – http://blogs.msdn.com/b/clustering/

• Technet Failover Cluster
  – http://technet.microsoft.com/en-us/library/cc754482.aspx

• Configuring Auditing for a Windows Server 2008 Failover Cluster
  – http://blogs.technet.com/b/askcore/archive/2009/01/19/configuring-auditing-for-a-windows-server-2008-failover-cluster.aspx

• Top Issues for Microsoft Support for Windows 2008 Failover Clusters
  – http://blogs.technet.com/b/askcore/archive/2008/10/13/top-issues-for-microsoft-support-for-windows-2008-failover-clusters.aspx

• Checklist: Create a Clustered Virtual Machine
  – http://technet.microsoft.com/en-us/library/dd759220.aspx

• Failover Clusters in Windows Server 2008 R2
  – http://technet.microsoft.com/en-us/library/ff182338(WS.10).aspx

• Trouble Connecting to Cluster Nodes? Check WMI
  – http://blogs.msdn.com/b/clustering/archive/2010/11/23/10095621.aspx

Page 26: Webcast - Failover Cluster Troubleshooting

Questions & Thanks