Performance Scenario:
Diagnosing and resolving a sudden slowdown on a two-node RAC
Introduction
• Karl Arao, OCP-DBA, RHCT
• Senior Consultant at SQL*Wizard
• RAC user for 3 years
• 1st environment on VMware
• I “heart” performance
• Don’t like to guess when troubleshooting
Scenario
One Thursday… a client called…
There was a SUDDEN slowdown on ALL of the applications
…a big impact to the business
And it’s running on RAC
No changes on the RAC nodes and on the applications
Some of 10g Performance Features
• OEM Performance Page
• ADDM
• SQL Tuning Advisor
• AWR (DBA_HIST_)
• ASH
• Time Model (total time for all db calls)
• Wait Class (12 wait classes)
• Metrics (v$ performance metric deltas)
• Services
Setup
• Server and Storage: SunFire X4200 (2 CPUs, 12GB memory) with LUNs on EMC CX300
• OS: RHEL 4.3 ES
• Database and clusterware: Oracle 10.2.0.3
• Database files, Flash Recovery Area, OCR, and voting disk are located on OCFS2 filesystems
• Application: Forms and Reports (6i and also lower)
Troubleshooting Principle
Systematic/Layered approach..
Understand..
Then Fix..
Let’s get it on!
1. Measured the OS stack
• Monitored the following
– CPU (vmstat, top, mpstat)
– I/O (iostat)
– memory (vmstat, meminfo)
– network (netstat)
– process info (top, ps)
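The monitoring list above can be collected in one pass. The following is only a sketch of one way to do it, not the collection method used in the talk (the deck later mentions OS Watcher for that purpose). The function names `snapshot` and `collect_os_stats` are hypothetical, and the sample counts and intervals are arbitrary choices.

```shell
#!/bin/sh
# snapshot LABEL COMMAND [ARGS...]: run one monitoring command and print its
# output under a timestamped header. Hypothetical helper for illustration.
snapshot() {
    label=$1; shift
    printf '### %s %s\n' "$label" "$(date '+%Y-%m-%d %H:%M:%S')"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not installed)"
    fi
}

# collect_os_stats: one sample of each OS layer named on the slide (Linux).
collect_os_stats() {
    snapshot cpu      vmstat 1 3
    snapshot cpu_all  mpstat -P ALL 1 3
    snapshot io       iostat -x 1 3
    snapshot memory   cat /proc/meminfo
    snapshot network  netstat -s
    snapshot process  ps aux
}
```

Running `collect_os_stats >> os_snapshot.log` on each node at a fixed interval gives a timeline that can later be compared across server1 and server2.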
• CPU on server1
• CPU on server2
• Datafiles on server1
• Datafiles on server2
• OCR & voting disk on server1
• OCR & voting disk on server2
• Archivelogs on server1
• Archivelogs on server2
• Flash Recovery Area on server1
• Flash Recovery Area on server2
• Memory on server1
• Memory on server2
2. Checked the DB environment
• Compared my past and current RDA of the database
• Queried some v$ views: a query on v$session showed that server1 has more connections (89% of the total users)
This could be because:
1) The clients have lower versions (< SQL*Plus 8.1 or OCI8, see Note 97926.1) that may not support TAF (FAILOVER_MODE) and Load Balancing (LOAD_BALANCE)
OR
2) They are using TNS entries explicitly connecting to server1
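The connection skew can be expressed as each instance's share of the total sessions. The sketch below assumes the per-instance counts come from a query along the lines of `SELECT inst_id, COUNT(*) FROM gv$session WHERE type = 'USER' GROUP BY inst_id`; the exact query used in the talk is not shown, and the function name `connection_share` and the sample counts are hypothetical.

```shell
# connection_share: read "inst_id session_count" pairs on stdin and print
# each instance's share of the total connections.
connection_share() {
    awk '{ id[NR] = $1; n[NR] = $2; total += $2 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "instance %s: %d sessions (%.0f%%)\n", id[i], n[i], 100 * n[i] / total
         }'
}

# Hypothetical counts chosen to reproduce the 89% skew reported on the slide
printf '1 89\n2 11\n' | connection_share
```

A share far from 50% on a two-node cluster points at the client-side causes listed above rather than at the database itself.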
• Users don’t have FAILOVER capabilities
• Checked the application module usage on server1
• How about graphing it in Excel? Will the data be more meaningful?
.. YES, most of the users use the xxxlogin.fmx module
3. Checked instance‐wide DB performance
• Graphed the ASH data..
.. suffering from “gc cr block lost” and “gc cr multi block request” from 7am to 4pm
• Researched on Metalink for known issues.. Found Doc ID 563566.1, gc lost blocks diagnostics
• Was able to pinpoint the peak period from the graph, then generated ADDM and AWR reports for that peak period..
• ADDM
Elapsed Time: 60min
DB Time: 61.83min
AAS: 1.03
Max CPU: 2
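The AAS figure follows from the two times above it: average active sessions is DB time divided by elapsed time. A minimal sketch of that arithmetic, where the function name `aas` is hypothetical:

```shell
# aas DB_TIME ELAPSED_TIME: average active sessions = DB time / elapsed time
# (both arguments must be in the same unit, minutes here)
aas() {
    awk -v db="$1" -v el="$2" 'BEGIN { printf "%.2f\n", db / el }'
}

aas 61.83 60   # prints 1.03
```

With an AAS of about 1 against 2 CPUs, CPU capacity alone does not look like the bottleneck, which fits the decision to keep collecting facts before acting on the recommendations.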
• Should I follow these recommendations right away?
Nope, collect more facts, numbers, figures
• AWR
• Do we have a workload distribution problem? Nope, even with distributed users we still have a performance problem..
4. Checked session‐level DB performance
• The database has too much activity. Where do I start? Where to drill down?
• gv$session_longops and gv$session_wait output too many users and require repetitive monitoring
• In the spirit of Method-R…
“WORK FIRST TO REDUCE THE BIGGEST RESPONSE TIME COMPONENT OF A BUSINESS' MOST IMPORTANT USER ACTION”
• Went to the Accounting Department, checked on the desktop terminals
• Users PC1069 (with SID 601) and PC918 (with SID 483) are completely hung
• Checked on the
– performance/wait counters
– the current SQLs
• v$session_wait (SID 601)
• v$sesstat (SID 601)
• v$sql, v$sql_plan, v$sql_plan_statistics (SID 601)
• Running for 98 minutes
• Just 12.14 seconds on CPU
• v$sesstat (SID 483)
• v$sql, v$sql_plan, v$sql_plan_statistics (SID 483)
• Running for 3 hours
• Just 2.68 seconds on CPU
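The gap between run time and CPU time for the two hung sessions is easier to judge as a percentage. A small sketch of that calculation using the figures above, where the function name `cpu_share` is hypothetical:

```shell
# cpu_share CPU_SECONDS ELAPSED_SECONDS: percentage of elapsed time spent on CPU
cpu_share() {
    awk -v cpu="$1" -v el="$2" 'BEGIN { printf "%.2f%%\n", 100 * cpu / el }'
}

cpu_share 12.14 $((98 * 60))       # SID 601, 98 minutes elapsed: prints 0.21%
cpu_share 2.68  $((3 * 60 * 60))   # SID 483, 3 hours elapsed:    prints 0.02%
```

Both sessions spent well under 1% of their run time on CPU, so nearly all of their elapsed time went to something other than CPU work, which suggests the next step is to look at what they were waiting on.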
• Another graph of ASH
5. Drilled down on the network interconnect
• Generated a “cat & egrep” command to look for problems in the interconnect from the OS Watcher “netstat” output (from Metalink Doc ID 563566.1, gc lost blocks diagnostics)
$ cat server1_netstat.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
34096 fragments dropped after timeout
306030 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306268 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306574 packet reassembles failed
… output snipped …
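These counters are cumulative, so what matters is whether they keep growing between samples. A sketch of how the growth in "packet reassembles failed" could be extracted from the same file, assuming each netstat sample contributes one such line; the function name `reassembly_deltas` is hypothetical, and the demo data is the three samples shown above.

```shell
# reassembly_deltas FILE: print the increase in "packet reassembles failed"
# between consecutive netstat samples found in FILE.
reassembly_deltas() {
    grep -i 'packet reassembles failed' "$1" |
        awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Demo with the three samples from the output above
sample=$(mktemp)
cat > "$sample" <<'EOF'
34096 fragments dropped after timeout
306030 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306268 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306574 packet reassembles failed
EOF
reassembly_deltas "$sample"   # prints 238, then 306
rm -f "$sample"
```

A counter that keeps increasing from one sample to the next means reassembly failures are still occurring, which is what the gc lost blocks note says to look for on the interconnect.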
• Restarted the switch
STILL THERE IS A PERFORMANCE PROBLEM
• Replaced the switch
THEY GOT FAST
karao@karl:~/Desktop$ cat karlarao.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
• Another graph of ASH (stacked graph)
• Another graph of ASH (3D view)
Conclusion
You don’t have to guess..
Even if it’s a RAC environment..
It just takes facts, numbers, figures to solve a performance problem
References and Tools
• http://karlarao.wordpress.com
• http://blog.tanelpoder.com
– http://www.tanelpoder.com/files/TPT_public.zip
– http://www.tanelpoder.com/files/PerfSheet.zip
– Neil Gunther & Tanel Poder, Multidimensional Visualization of Oracle Performance using Barry007, http://arxiv.org/pdf/0809.2532
• http://ashmasters.com
• http://www.perfvision.com
• http://www.method-r.com
• Metalink Doc ID 97926.1 Failover Issues and Limitations [Connect-time failover and TAF]
• Metalink Doc ID 563566.1 gc lost blocks diagnostics
• Metalink Doc ID 301137.1 OS Watcher User Guide
Join Oracle Users – Philippines
• Facebook: http://www.facebook.com/home.php#/pages/Oracle-Users-Philippines/86773013086?ref=ts
• LinkedIn: http://www.linkedin.com/groups?home=&gid=2028295&trk=anet_ug_hm