Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Page 1: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Page 2: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Introduction…

Karl Arao, OCP‐DBA, RHCT

Senior Consultant at SQL*Wizard

RAC user for 3 years

1st environment on VMware

I “heart” performance

Don’t like to guess when troubleshooting

Page 3: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Scenario

One Thursday… a client called…

There was a SUDDEN slow down on ALL of the applications

… a big impact to the Business

Page 4: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

And it’s running on RAC

No changes on the RAC nodes and on the applications

Page 5: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Some of 10g Performance Features

• OEM Performance Page
• ADDM
• SQL Tuning Advisor
• AWR (DBA_HIST_)
• ASH
• Time Model (total time for all db calls)
• Wait Class (12 wait classes)
• Metrics (v$ performance metric deltas)
• Services

Page 6: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Setup

• Server and Storage: SunFire X4200 (2 CPUs, 12GB memory) with LUNs on EMC CX300
• OS: RHEL 4.3 ES
• Database and clusterware: Oracle 10.2.0.3
• Database Files, Flash Recovery Area, OCR, and Voting disk are located on OCFS2 filesystems
• Application: Forms and Reports (6i and also lower)

Page 7: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Troubleshooting Principle

Systematic/Layered approach..

Understand.. 

Then Fix.. 

Let’s get it on!

Page 8: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

1. Measured the OS stack

• Monitored the following
– CPU (vmstat, top, mpstat)
– I/O (iostat)
– memory (vmstat, meminfo)
– network (netstat)
– process info (top, ps)

Page 9: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• CPU on server1

• CPU on server2

Page 10: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• Datafiles on server1

• Datafiles on server2

Page 11: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• OCR & voting disk on server1

• OCR & voting disk on server2

Page 12: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• Archivelogs on server1

• Archivelogs on server2

Page 13: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• Flash Recovery Area on server1

• Flash Recovery Area on server2

Page 14: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

• Memory on server1

• Memory on server2

Page 15: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

2. Checked the DB environment

• Compared my past and current RDA of the database

• Queried some v$ views.. a query on v$session showed that server1 has more connections (89% of the total users)
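The slide does not show the actual v$session query, so the following is only a sketch of one way to get a per-instance connection split on RAC. It assumes the cluster-wide view gv$session and its inst_id and type columns; the function name and column aliases are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print a query that counts user sessions per RAC instance.
# The function name and the column aliases are illustrative assumptions.
session_distribution_sql() {
  # The quoted 'SQL' delimiter stops the shell from expanding $session.
  cat <<'SQL'
SELECT inst_id,
       COUNT(*) AS sessions,
       ROUND(100 * RATIO_TO_REPORT(COUNT(*)) OVER (), 1) AS pct_of_total
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY inst_id
ORDER  BY inst_id;
SQL
}

# Print the statement; pipe it to a SQL client to run it.
session_distribution_sql
```

On a host with the Oracle client installed, the output could be piped to something like `sqlplus -s "/ as sysdba"`. A heavily skewed pct_of_total (such as the 89% seen on server1) is the signal to look for.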

Page 16: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

2. Checked the DB environment

This could be because:

1) The clients have lower versions (< SQL*Plus 8.1 or OCI8, see Note 97926.1) that may not support TAF (FAILOVER_MODE) and Load Balancing (LOAD_BALANCE)

OR

2) They are using TNS entries explicitly connecting to server1
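To tell the two causes apart, a first step could be to check whether the client's tnsnames.ora mentions LOAD_BALANCE and FAILOVER_MODE at all. The sketch below is a plain keyword check, not a parser, and the sample entry (alias APPDB, hosts server1-vip and server2-vip, service appdb) is hypothetical, not taken from the client's files.

```shell
#!/usr/bin/env bash
# Sketch: report whether a tnsnames.ora file mentions the load-balancing and
# TAF keywords at all. It does not parse individual aliases.
check_tns_keywords() {
  local file="$1" kw
  for kw in LOAD_BALANCE FAILOVER_MODE; do
    if grep -qi "$kw" "$file"; then
      echo "$kw: present"
    else
      echo "$kw: missing"
    fi
  done
}

# Hypothetical entry for illustration; alias, hosts, and service are made up.
sample="$(mktemp)"
cat > "$sample" <<'TNS'
APPDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = yes)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = server1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = server2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = appdb)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
    )
  )
TNS

check_tns_keywords "$sample"
rm -f "$sample"
```

An entry that lists only server1 in its ADDRESS and lacks both keywords would match cause 2; old client versions (cause 1) need a separate check on the client software itself.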

Page 17: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

2. Checked the DB environment

• Users don’t have FAILOVER capabilities

Page 18: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

2. Checked the DB environment

• Checked the application module usage on server1

Page 19: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

2. Checked the DB environment

• How about I graph it in Excel? Will the data be more meaningful?

.. YES, most of the users use the xxxlogin.fmx module

Page 20: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• Graphed the ASH data..

.. suffering from “gc cr block lost” and “gc cr multi block request” from 7am to 4pm
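Before graphing, the ASH samples can be tallied per wait event to see which ones dominate. The sketch below assumes the samples were already exported as "sample_hour,event" CSV lines; that export format, the function name, and the sample rows are all made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count ASH samples per wait event from "sample_hour,event" CSV lines.
# The input format is an assumption, not what the original export looked like.
ash_top_events() {
  awk -F',' 'NF >= 2 { n[$2]++ } END { for (e in n) printf "%d,%s\n", n[e], e }' |
    sort -t',' -k1,1nr -k2,2
}

# Made-up sample rows, only to show the input shape.
ash_top_events <<'CSV'
07,gc cr block lost
07,gc cr multi block request
08,gc cr block lost
08,db file sequential read
09,gc cr block lost
CSV
```

Because ASH samples active sessions about once per second, each count roughly approximates seconds of DB time spent on that event.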

Page 21: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• Researched on Metalink for known issues.. Found Doc ID: 563566.1 gc lost blocks diagnostics

• Was able to pinpoint the peak period from the graph. Then generated ADDM and AWR reports on that peak period..

Page 22: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• ADDM

Elapsed Time: 60min

DB Time: 61.83min

AAS: 1.03

Max CPU: 2
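The AAS figure follows from the two numbers above it: average active sessions is DB time divided by elapsed time. A minimal sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Average active sessions = DB time / elapsed time (both in the same unit).
# The helper name "aas" is an illustrative assumption.
aas() {
  awk -v db="$1" -v el="$2" 'BEGIN { printf "%.2f\n", db / el }'
}

aas 61.83 60   # figures from the ADDM slide, in minutes; prints 1.03
```

With 2 CPUs available, an AAS of about 1 means the instance was not CPU-saturated on average, which is one reason to keep collecting facts before acting on any recommendation.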

Page 23: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• Should I follow these recommendations right away?

Nope. Collect more facts, numbers, figures

Page 24: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• AWR

Page 25: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

3. Checked instance‐wide  DB performance

• Do we have a workload distribution problem? Nope. Even with distributed users.. we still have a performance problem..

Page 26: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• The database has too much activity. Where do I start? Where to drill down?

• gv$session_longops & gv$session_wait output: too many users, and they require repetitive monitoring

• In the spirit of Method-R…

“WORK FIRST TO REDUCE THE BIGGEST RESPONSE TIME COMPONENT OF A BUSINESS' MOST IMPORTANT USER ACTION”

• Went to the Accounting Department, checked on the desktop terminals

Page 27: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• Users PC1069 (with SID 601) and PC918 (with SID 483) are completely hung

Page 28: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• Checked on the
– performance/wait counters
– the current SQLs

Page 29: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• v$session_wait (SID 601)

Page 30: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• v$sesstat (SID 601)

Page 31: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• v$sql, v$sql_plan, v$sql_plan_statistics (SID 601)

• Running for 98 minutes

• Just 12.14 seconds on CPU

Page 32: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• v$sesstat (SID 483)

Page 33: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• v$sql, v$sql_plan, v$sql_plan_statistics (SID 483)

• Running for 3 hours

• Just 2.68 seconds on CPU
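The CPU figures for the two hung sessions say more when expressed as a share of elapsed time. A small sketch of that arithmetic, using only the numbers from the slides (12.14 s in 98 minutes for SID 601, 2.68 s in 3 hours for SID 483); the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Share of elapsed time spent on CPU, in percent.
# Arguments: CPU seconds, elapsed seconds. The helper name is an assumption.
cpu_pct() {
  awk -v cpu="$1" -v el="$2" 'BEGIN { printf "%.2f\n", 100 * cpu / el }'
}

cpu_pct 12.14 $((98 * 60))       # SID 601; prints 0.21
cpu_pct 2.68  $((3 * 60 * 60))   # SID 483; prints 0.02
```

Both sessions spent well under 1% of their elapsed time on CPU, so nearly all of the response time was spent waiting.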

Page 34: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

4. Checked session‐level  DB performance

• Another graph of ASH

Page 35: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

• Generated a “cat & egrep” command to look for problems in the interconnect from the OS Watcher “netstat” output (from Metalink Doc ID: 563566.1 gc lost blocks diagnostics)

Page 36: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

$ cat server1_netstat.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
34096 fragments dropped after timeout
306030 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306268 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306574 packet reassembles failed
output snipped …
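These counters are cumulative, so what matters is whether they keep growing between samples. The sketch below prints the change in "packet reassembles failed" from one sample to the next, fed with the lines shown above; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the increase in "packet reassembles failed" between
# consecutive netstat samples. The function name is an assumption.
reassembly_deltas() {
  awk '/packet reassembles failed/ {
         if (seen) print $1 - prev
         prev = $1; seen = 1
       }'
}

reassembly_deltas <<'OUT'
34096 fragments dropped after timeout
306030 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306268 packet reassembles failed
15 packet receive errors
34096 fragments dropped after timeout
306574 packet reassembles failed
OUT
```

For the samples above this prints 238 and 306: reassembly failures were still accumulating while the other two counters stayed flat.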

Page 37: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

• Restarted the switch

STILL THERE IS A PERFORMANCE PROBLEM

Page 38: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

• Replaced the switch

THEY GOT FAST

Page 39: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

karao@karl:~/Desktop$ cat karlarao.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors

Page 40: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

• Another graph of ASH (Stacked graph)

Page 41: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

5. Drilled down on the network  interconnect

• Another graph of ASH (3d view)

Page 42: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Conclusion

You don’t have to guess.. 

Even if it’s a RAC environment..

It just takes facts, numbers, figures to solve a performance problem

Page 43: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

References and Tools

• http://karlarao.wordpress.com
• http://blog.tanelpoder.com
– http://www.tanelpoder.com/files/TPT_public.zip
– http://www.tanelpoder.com/files/PerfSheet.zip
– Neil Gunther & Tanel Poder, Multidimensional Visualization of Oracle Performance using Barry007, http://arxiv.org/pdf/0809.2532
• http://ashmasters.com
• http://www.perfvision.com
• http://www.method-r.com
• Metalink Doc ID 97926.1 Failover Issues and Limitations [Connect-time failover and TAF]
• Metalink Doc ID 563566.1 gc lost blocks diagnostics
• Metalink Doc ID 301137.1 OS Watcher User Guide

Page 44: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Join Oracle Users – Philippines

• Facebook: http://www.facebook.com/home.php#/pages/Oracle-Users-Philippines/86773013086?ref=ts

• LinkedIn: http://www.linkedin.com/groups?home=&gid=2028295&trk=anet_ug_hm

Page 45: Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC

Contact me through:

[email protected]

0919‐267‐3389

889‐6999