ver 2.0.1 1
Problem Solving Workshop: Critical Thinking and
Root Cause Analysis (RCA)
Graham Furnis, Senior Consultant, B Wyze Solutions
Workshop will begin at 1pm EST
Workshop Agenda
• Problem Solving
– Understanding Problem Solving
– Problem Solving Perspectives
– Problem Solving as a Structured Process
– Problem Solving within ITSM
• Root Cause Analysis
– RCA Methods and Techniques
– Getting to the True Root Cause
– Problem Options and Solutions
SECTION 1:
PROBLEM SOLVING
CONCEPTS
(ITSM) IT Service Management
• A discipline
– for managing IT systems and technology
– centered on the identification and delivery of IT Services used by the business
• Within ITSM,
– ITIL (IT Infrastructure Library) framework links Root Cause Analysis to the processes of
• Incident Management
• Problem Management
ITSM / ITIL Definitions
• Problem – The unknown cause of one or more incidents
• Incident – An unplanned event that is a deviation from normal
• Priority
– Impact - degree of positive or negative business effect
– Urgency - degree of response time required
• Service Level Agreement (SLA)
– A written or understood agreement for the Service
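The Priority definition above combines Impact and Urgency. One way to picture that combination is a small lookup, sketched here in Python. The three level names and the 1 to 5 scale are illustrative assumptions, not values given in the workshop; a real service desk would use the matrix agreed in its own SLA.

```python
# Hypothetical priority matrix: level names and the 1-5 scale are
# illustrative assumptions, not taken from the workshop material.
LEVELS = ("high", "medium", "low")


def priority(impact: str, urgency: str) -> int:
    """Return a priority from 1 (most pressing) to 5 (least pressing)."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError("impact and urgency must be one of: " + ", ".join(LEVELS))
    # Index 0 is "high"; the two indexes sum to 0..4, shifted to 1..5.
    return LEVELS.index(impact) + LEVELS.index(urgency) + 1
```

Under this sketch, an Incident with high impact and high urgency would be handled first, and one that is low on both would be handled last.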
Problem Solving
• A mental process for thinking and reasoning
– Not as easy as one may think…
– Something that can be continually learned, practiced and improved
Problem Finding
• Identifying the Problem is the first step to
good Problem Solving…
– it becomes the target that is being solved for
– identifying the problem is often more complex than actually solving the problem
• A key to good problem finding involves the use of Creative Thinking
Problem Shaping
• Problem shaping follows Problem Finding
– Questions need to be asked that shape the direction and findings of problem investigation
– Questions are iterative: each answer refines and shapes the further questions to be asked
• A key to good problem shaping involves the use of Critical Thinking
Adjust Your Thinking and Reasoning
• Be able to adjust your approach and
perspective to solving the problem
• Critical Thinking
– Familiar
• Creative Thinking
– Unfamiliar
Critical Thinking and Deductive Reasoning
• Critical thinking in Problem Solving can be
thought of as logical thinking
– Deductive Reasoning is based on a set of propositions and the subsequent investigation and factual discoveries
– Deductive Reasoning tends to be a top-down approach to Problem Solving
Critical Thinking and Deductive Reasoning
• Use when:
– A problem is familiar or of a familiar type
– A problem solver has sufficient skill & experience
• Example:
– We observe that a critical marketing application has several different user error messages across the marketing department. We have programming experience to know that each error message is triggered by application error trapping code. Therefore, we deduce that we should investigate the programming code related to the application modules that produced the error message to confirm the application logic.
Creative Thinking and Inductive Reasoning
• Creative thinking in Problem Solving can be
thought of as “thinking outside the box” of
common and tried solutions
– Inductive Reasoning produces assumptions
• they are not necessarily valid conclusions, but start points to be further investigated and validated
– Inductive Reasoning tends to be a bottom-up approach to Problem-Solving
Creative Thinking and Inductive Reasoning
• Use when:
– A problem is unfamiliar
– Deductive reasoning has reached a dead end
• Example:
– We observe that a critical marketing application has several different user error messages across the marketing department. We have no programming skill. We have observed in our past experience that shared applications are run from a central server. The marketing application is a shared application; and therefore we induce (assume) that the Problem must be based on a server. Our investigations will now take this path.
The Basic Problem Solving Process
Primary Goal
• Prevent Problems from ever recurring by taking effective corrective actions.
Steps:
(1) Correctly defining the problem,
(2) Finding the root cause(s) of the problem through Root Cause Analysis
(3) Determining the most effective corrective actions to take, and
(4) Implementing the solution to successfully manage the problem
Root Cause Analysis
• Root Cause Analysis is a sub-process of the larger Problem Solving process
• Each Root Cause Analysis approach shares a common aim to:
– avoid focusing on and solving the symptoms of the problem,
– Instead, to drill deeper to identify and solve the true root cause of the problem
Root Cause Analysis
Primary Goal
• Determine the lowest level “root” cause(s) of a Problem that supports taking the most effective corrective actions.
Primary Objective
• Find the correct root cause of the Problem, because without it we cannot determine what effective corrective actions must be taken.
Kepner-Tregoe RCA Method
1. Define the Problem
2. Assess the Problem
3. Establish Possible Causes
4. Explore Possible and Probable Causes
5. Verify Root Cause(s)
Problem Solving Plan
• It is recommended that a structured
problem solving plan should be created
when solving any Problem
– The plan is iterative…
SECTION 2:
- Hands on Activities -
RCA Methods and Techniques
RCA Methods and Techniques
• Journalism Standard
• Pareto analysis
• Cause and Effect Analysis
• Change Analysis
• Ishikawa diagram
• The 5 Whys
Journalism Standard
• “Just the Facts”…
– Research & list the basic facts of the situation
– Seek interviews & independent confirmation
– Evaluate using a neutral approach
• Avoid:
– eliminating possible causes due to assumptions
– missing possible causes due to tunnel vision
• Use the 5 W’s and How…
– Who, What, When, Where, How, Why
Which RCA method to start with?
• What are the tactics you will start with??
[Diagram: RCA method sequence so far: Pareto; the stage draws on Facts and Interview]
Case Study
• Read the following case study from your Exercise workbook
– The Marketing Client Support team of LITI Corporation has reported two related Incidents this week related to their Sales Management Service………
• Activity 1: Define the Problem
• Activity 2: Document the Facts
Facts Table
• Who
• What
• When
• Where
• How
• Why
– Familiar? Experience? Deductions?
– Unfamiliar? Guesses? Inductions?
Activity Answers
Problem Definition
• CRM application is failing at random when saving data
• Causing Marketing
– To lose data
– To lose productivity
Activity Answers
What
• Technology:
– CRM Desktop Application
– Desktop PC
– Local Network
– CRM Server
• Processes:
– Client history, Billing and Payments
Who
• People:
– Bob, Mary, “several others”
Activity Answers
How
• Client calls, CRM record opened for updates
– Bob’s case: 1 update, Client History module
– Mary’s case: 20 updates, Billing module
– Records not saved immediately
– CRM fails to save data
• Recovery is to reboot PC or refresh records
• L2 Support technician confirmed:
– PC network connectivity
– Desktop PC memory and disk space sufficient
– CRM application server up and running
– Re-entered 3 records and saved successfully
Activity Answers
Where
• In office
– Marketing Department
When
• Mary had several in the last month
– Two incidents this week
• Doesn’t happen every time
– Random occurrence
Activity Answers
Why
• Deductions:
– Desktop unlikely – sufficient memory & disk space
– Network unlikely – no other reports of connectivity issues
– Server unsure – running, but were there errors?
• Have you seen this before?
– Familiar? Think you know what it is?
  • Start to investigate based on experience
– Unfamiliar?
  • Choose an appropriate RCA technique
Pareto Analysis
• The “80/20 rule”
– 80% of the effects of something
• are a result of
– 20% of the inputs or causes
• Therefore;
– Problem causes accounting for 80% of problems should be investigated first
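As a rough illustration of the 80/20 rule, the sketch below counts how often each cause category appears and returns the categories to investigate first, in descending order, until they cover the chosen share of all occurrences. The function name `pareto_first` and the 0.8 default are assumptions made for this example.

```python
from collections import Counter


def pareto_first(causes, threshold=0.8):
    """Return the most frequent causes that together reach `threshold`
    of all occurrences, most frequent first."""
    counts = Counter(causes)
    total = sum(counts.values())
    selected, running = [], 0
    for cause, n in counts.most_common():
        # Stop once the causes already selected cover the threshold.
        if running / total >= threshold:
            break
        selected.append(cause)
        running += n
    return selected
```

Called with a list containing one entry per recorded occurrence (for example, the component named on each Incident record), it returns the short list of causes that deserve attention first.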
Activity 3: Pareto Analysis
• Complete the Pareto Table
• Make your deductions
– Likely or Unlikely
Activity 3: Answers
Research Information
Contact the Server Support Group
• The Server Support group assures you that the Server PC supporting the CRM Application is fully monitored and showed no processing overload according to Server log files, and currently has 50% available disk capacity after a data clean up last month.
Contact Bob and Mary
• Bob and Mary both respond that there’s nothing unusual or incorrect with their data as they re-entered and attached the same data after re-booting and successfully saved.
• While they are on the phone, you also find out:
– The Problem first appeared (but was not reported) 4 to 5 weeks ago. The first occurrence was several weeks after a CRM release introducing a Billing module, but just before the Billing bug fix.
– The “others” were almost all other Sales Reps, and they believe the frequency of these failed saves is increasing
– Sales Reps and the Marketing Manager believe that the application is used more heavily and stores more information as time goes on. It must be a failure to save the quantity of data. They demand this failure be addressed to allow them to store the critical information required.
Activity Answers
What
• Technology:
– CRM Desktop Application
– Desktop PC - memory and disk space
– Local Network - local connection
– CRM Server - up and running
• Processes:
– Client history, Billing and Payments
Who
• People:
– Bob, Mary, almost all Sales Reps
Activity Answers
What
• CRM Application fails to save data
– Bob’s case: 1 update, Client History module
– Mary’s case: 20 updates, Billing module
• Happened before, but unreported in the last month
• L2 Support technician report:
– network connectivity confirmed
– CRM application server up and running
– Desktop PC memory and disk space sufficient
– L2 technician re-entered 3 records and saved successfully
– Release Billing Module followed by Bug fix
– Invalid data types
– Too much data being saved
Activity Answers
Where
• In office
– Marketing Department
When
• Started 4-5 weeks ago
– Two incidents this week
– Mary had several in the last month
– Frequency is increasing
• Doesn’t happen every time = random
Activity Answers
How
• Client calls, Sales Rep opens client record
– Sales Rep updates records
  • May not be saved immediately
– User executes command to save
– CRM Application fails to save record
• Recovery
– CRM application rebooted or records refreshed
– Re-enter record(s) and save
Activity Answers
Why
• Deductions:
– Server unlikely – confirmed all cases
– Something new must be happening!
• Investigate the Application Release
Or
• Cause & Effect Analysis of components and events
Which RCA method to use next?
• What are the tactics you will start with??
[Diagram: RCA method sequence so far: Pareto, then Cause & Effect; each stage draws on Facts and Interview]
Cause and Effect Analysis
• Relationship Analysis … to determine a
cause and effect path
– first event is the cause (the trigger)
– second event is the effect (the consequence)
– Deductive, top-down, making use of critical thinking and deductive reasoning skills
Types of Cause and Effect Analysis
• Technical Relationships
• Fault Tree Analysis (FTA)
• IT Process Relationships
• End-User Interaction Relationships
• Chronological Event Relationships
– (chain of events)
Cause and Effect Conditions
• Necessary Conditions
• Sufficient Conditions
• Contributory Conditions
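The slide lists the three condition types without defining them. Using the usual meanings (a necessary condition must be present for the effect to occur, a sufficient condition always produces the effect when present, and a contributory condition makes the effect more likely without being either), the sketch below labels a suspected condition from a set of observations. The function name and the labels it returns are assumptions for illustration, and the result only says what the observations are consistent with; it proves nothing about conditions that were never observed.

```python
def classify_condition(observations):
    """Classify a suspected condition from (condition_present, effect_occurred)
    pairs, one pair per observed case."""
    effect_with = any(c and e for c, e in observations)
    effect_without = any(e and not c for c, e in observations)
    no_effect_with = any(c and not e for c, e in observations)

    if not effect_with:
        # The condition and the effect were never seen together.
        return "no observed link"
    necessary = not effect_without   # effect never occurred without it
    sufficient = not no_effect_with  # effect always occurred with it
    if necessary and sufficient:
        return "necessary and sufficient"
    if necessary:
        return "necessary"
    if sufficient:
        return "sufficient"
    return "contributory"
```

Each pair could come from one Incident record or one test run, for example "was a second user editing the same record?" paired with "did the save fail?".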
Activity 4 & 5
Activity 4: Technology Cause and Effect
• Complete the Technology Analysis Cause and Effects table
• GROUP ACTIVITY
Activity 5: Process Cause and Effect
• Complete the Process Analysis Cause and Effects table
• Make your deductions
Answers
[Table not recovered. Column headings: Necessary, Sufficient, Contributory. One surviving entry: MS Office]
Answers
Record | Occurrences | Last Time | Component | Description
Incident | 2 | 1 day ago | CRM Desktop | CRM Desktop application failure to save
Incident | 3 | 2 weeks ago | Desktop PC | Desktop network cable disconnected, desk-side reconnect
Maintenance | 2 | 2 weeks ago | Server PC | Server shut down and restart – standard Sunday maintenance window
Incident | 5 | 3 weeks ago | Desktop PC | Desktop PC performance degradation, close applications or reboot required
Incident | 3 | 3 weeks ago | CRM Desktop | CRM Desktop application performance degradation, close applications and reboot required
Incident | 1 | 1 month ago | Server PC | Server PC disk space alarm, historic data archived
Change | 1 | 1 month ago | CRM Server | Application Release Bug Fix Updates
Maintenance | 4 | 1 month ago | Server PC | Server disk clean up and tuning – standard Sunday maintenance window
Change | 1 | 2 months ago | CRM Desktop | Update Desktop CRM Application Drivers
Change | 1 | 2 months ago | CRM Desktop | Application Release functionality Update to Client Billing module
[Slide annotation, row not recoverable: likely]
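A chronological cause and effect pass over the records above can be sketched as a simple filter: keep only the Changes and Maintenance events that happened at or before the Problem first appeared, since a later event cannot be the trigger. The day counts are rough conversions assumed for this example (1 month = 30 days, 2 months = 60 days), and the function name is hypothetical.

```python
# (record type, component, approximate days ago, description)
# Day counts are assumed conversions of the "Last Time" column.
EVENTS = [
    ("Incident", "CRM Desktop", 1, "application failure to save"),
    ("Incident", "Desktop PC", 14, "network cable disconnected"),
    ("Maintenance", "Server PC", 14, "server shut down and restart"),
    ("Incident", "Desktop PC", 21, "performance degradation"),
    ("Incident", "CRM Desktop", 21, "application performance degradation"),
    ("Incident", "Server PC", 30, "disk space alarm"),
    ("Change", "CRM Server", 30, "application release bug fix updates"),
    ("Maintenance", "Server PC", 30, "disk clean up and tuning"),
    ("Change", "CRM Desktop", 60, "update application drivers"),
    ("Change", "CRM Desktop", 60, "Client Billing module release"),
]


def candidate_triggers(events, first_seen_days_ago):
    """Return Change and Maintenance events no more recent than the
    first appearance of the Problem."""
    return [
        e for e in events
        if e[0] in ("Change", "Maintenance") and e[2] >= first_seen_days_ago
    ]
```

With the Problem first seen about 4 weeks ago (`first_seen_days_ago=28`), the filter leaves the two releases to the CRM Desktop, the CRM Server bug fix, and the server disk clean up as events worth relating to the failure.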
Research Information
Contact the CRM Application Development and Support Group
• The Application Support group indicates the bug fix was released in response to an undersized data size field limit in the new Client Billing module. Since the bug fix there have been no further related Incidents.
• The Application Support group believes it must be a User data entry error as the Application has been fully tested.
• The Application Support group further explains that when data is updated by a User, it is held in memory on the User's PC. When the User saves this data, each record is written to the record on the Server PC Application database. There are no errors recorded in the database error log.
Activity Answers
What
• Technology:
– CRM Desktop Application
– Desktop PC
– Local Network
– CRM Server
• Processes:
– Client history, Billing and Payments
Who
• People:
– Bob, Mary, almost all Sales Reps
Activity Answers
What
• CRM Desktop Application → likely
– Too much data being saved
– Coding error possible
• CRM Server Application → not likely
– CRM bug fix unlikely as data has been retyped successfully
– CRM Server Application and Database appear to be operating successfully
Activity Answers
Where
• In office
– Marketing Department
When
• Started 4-5 weeks ago
– Two incidents this week
– Mary had several in the last month
– Frequency is increasing
• Doesn’t happen every time = random
Activity Answers
How
• Client calls, Sales Rep opens client record
– Sales Rep updates records
  • May not be saved immediately
– User executes command to save
– CRM Application fails to save record
• Recovery
– CRM application rebooted or records refreshed
– Re-enter record(s) and save
Activity Answers
Why
• Deductions:
– Something new must be happening!
  • Something must be happening at the User end
  • CRM Desktop Application is most likely
  • MS Office and Windows may be contributing
Which RCA method to use next?
• What are the tactics you will start with??
[Diagram: RCA method sequence so far: Pareto, then Cause & Effect, then Ishikawa; each stage draws on Facts and Interview]
Ishikawa Technique
• A founding tool for modern management
and is considered one of the seven basic
tools of quality control
• Forces a problem solver to think creatively
across several different categories…
– And to relate across categories
Ishikawa Causes
• Both a bottom-up and top-down approach
that can benefit from integration with other
problem solving methods
– Useful in a complex systems environment
– Apply Necessary, Sufficient, and Contributory conditions
Ishikawa Diagram (fishbone diagram)
Activity 6: Ishikawa Analysis
• Complete the Ishikawa diagram, brainstorming
to identify the primary and secondary possible
causes under each category
– Don’t worry about being right – get creative!!!
• Make your deductions
Answers
Research Information
Contact the Marketing Manager and Sales Reps
• The Marketing Manager insists the CRM application is not used for any new business activity; nor is it used by any other department, as only the Marketing Manager can approve new users. As the software is developed internally, there are no user limits. And there are no environmental factors that have changed (i.e., no office moves, etc.).
• The Marketing Manager further states that the CRM Application is better managed and used since starting a new quality review initiative where the Manager reviews and updates poorly documented or incomplete Client records. It’s critical to the Marketing Manager that these records are accurate as they drive the weekly Sales reports to upper management. This quality effort has been in place now for more than a month.
• Sales Reps insist that they are following Marketing procedures when updating records. There are no shortcuts taken. Only the Sales Reps have access to these records, and Sales Reps do not have the admin rights to share and update other Sales Reps' client records.
• Sales Reps do not think updates happen at the same time. Client calls are too random. However, there are often multiple records left open for periods of time when multiple Client calls are taken in succession. This practice is the norm, and Sales Reps will complete the Client Updates when call volumes lower and time permits.
Activity Answers
What
• Technology:
– CRM Desktop Application
– Desktop PC
– Local Network
– CRM Server
• Processes:
– Client history, Billing and Payments
– QA Manager Review & Update
Who
• People:
– Bob, Mary, almost all Sales Reps
– Marketing Manager
Activity Answers
What
• CRM Desktop Application → likely
– Too much data being saved
– Coding error possible
– Could be a time out issue
– Could be a data conflict (concurrency lock)
• CRM Server Application → not likely
– CRM bug fix unlikely as data has been retyped successfully
– CRM Server Application and Database appear to be operating successfully
Activity Answers
Where
• In office
– Marketing Department
When
• CRM Save failure started 4-5 weeks ago
– Two incidents this week
– Mary had several in the last month
– Frequency is increasing
– Doesn’t happen every time = random
• CRM QA Updates in place for more than 1 month
Activity Answers
How
• Client calls, Sales Rep opens CRM client record
– Sales Rep updates records
  • May not be saved immediately
– User executes command to save
– CRM Application fails to save record
– Recovery
  • CRM application rebooted or records refreshed
  • Re-enter record(s) and save
• Random check, Manager opens CRM record
– Review & update unclear or missing data
– Save record and continue to review other records
Activity Answers
Why
• Deductions:
– Something new must be happening!
  • Something must be happening at the User end
– Likely a record lock conflict
  • Caused by the Marketing Manager QA Updates
– Secondary or contributing issues might be
  • Time out
  • Too much data
Which RCA method to use next?
• What are the tactics you will start with??
[Diagram: RCA method sequence so far: Pareto, then Cause & Effect, then Ishikawa, then Hypothesis Testing & Validation; each stage draws on Facts and Interview]
SECTION 3:
Getting to the True Root Cause
Change Analysis
• Comparative Analysis (or trial and error)
• This technique is based on comparing all factors contributing to the situation where a problem does not exist, to the situation where the problem does exist
– Top-down approach
– Requiring full knowledge
– May involve re-enactment and observation, where a technician changes one factor at a time in an attempt to re-create the Problem
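The comparison at the heart of Change Analysis can be sketched as a difference between two snapshots of the contributing factors: one taken where the Problem does not exist and one where it does. The function name and the dictionary representation are assumptions made for this example.

```python
def changed_factors(working: dict, failing: dict) -> dict:
    """Return {factor: (value_when_working, value_when_failing)} for every
    factor whose value differs between the two situations.
    A factor missing from one snapshot is reported with None on that side."""
    keys = working.keys() | failing.keys()
    return {
        k: (working.get(k), failing.get(k))
        for k in sorted(keys)
        if working.get(k) != failing.get(k)
    }
```

Each factor returned is a candidate to change back, one at a time, when attempting to re-create the Problem. The technique is only as good as the snapshots, which is why the slide notes that it requires full knowledge of the factors.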
Hypothesis Testing and Validation
• A hypothesis is a proposed Cause for the
Problem and is then tested
• Testing falls into one of two types:
– controlled experiment or
– operational observation
• Propose a range of testing options
– assessed with a Risk Assessment … to avoid worsening the problem
Activity 7: Hypothesis Testing
• Complete the Hypothesis testing options table
• Make your recommendations
Answers
• Experiment: High quality; high risk of corrupted data on reversion
• Experiment: High quality; low risk of corrupted data on reversion
• Observation: Low risk; low quality, might miss conditions
• Observation: Medium risk, problem must recur; low quality, might miss conditions
Research Information
Testing Results:
• Testing has been arranged within the production CRM Application over the maintenance weekend for the creation of 5 test clients and 2 Desktop PCs. The following is observed:
– When a single client record is updated but not saved, and then the same record is opened and updated in a second PC, the second PC fails to save the record and appears to be “frozen”. The resolution is to refresh the Client record or close the CRM Application.
  • This same test for multiple records opened and just one record updated will also fail to save the block of records
– This appears to have duplicated the Problem and identified the Cause.
• However, it is prudent to test the two other possible contributing scenarios for their effect
– On creating batch record updates and saving, there were no application errors. This same scenario was repeated for opening the records on multiple desktop PCs, but only making updates and saving on one specific PC. No save errors resulted.
  • This same test was duplicated with large PDF file attachments to records. Again, no file save errors resulted.
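The failing test can be modelled with a toy record store in which the first session to edit a record holds a lock until it saves or refreshes. This is a hypothetical model for illustration only: the case study does not describe how the CRM application actually locks records, and the class and method names here are invented.

```python
class RecordStore:
    """Toy model of a store with per-record edit locks (an assumed
    mechanism, not the CRM application's documented behaviour)."""

    def __init__(self):
        self._data = {}
        self._locks = {}  # record_id -> session holding unsaved edits

    def update(self, session, record_id, value):
        # The first session to edit takes the lock; later editors
        # can still stage an edit locally but do not own the lock.
        self._locks.setdefault(record_id, session)
        return (record_id, value)

    def save(self, session, staged):
        record_id, value = staged
        holder = self._locks.get(record_id)
        if holder is not None and holder != session:
            return False  # another session has unsaved edits: save fails
        self._data[record_id] = value
        self._locks.pop(record_id, None)
        return True

    def refresh(self, session, record_id):
        # Discarding unsaved edits releases the lock held by this session.
        if self._locks.get(record_id) == session:
            del self._locks[record_id]
```

In this model, a save from the second PC fails while the first PC still has an unsaved update on the same record, and succeeds once the first PC saves or refreshes, which mirrors the observed test result.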
Activity Answers
Why
• Deductions:
– Something new must be happening!
  • Something must be happening at the User end
– It appears the Cause of the Problem is a concurrency issue caused by the Marketing Manager updating and saving records that are open and updated elsewhere, but not yet saved.
Which RCA method to use next?
• What are the tactics you will start with??
[Diagram: RCA method sequence so far: Pareto, then Cause & Effect, then Ishikawa, then Hypothesis Testing & Validation, then 5 Whys; each stage draws on Facts and Interview]
The 5 Whys
• A method for perseverance in Problem Shaping
– Don’t stop at superficial symptoms
– Ask “why did this happen” five times in succession
• Best used for
– simple problems or
– use in conjunction with other problem solving techniques
• Tips for using:
– Verify each “why” question before proceeding to the next
– Focus on making the last “why” question one of process
Activity 8: The 5 Whys
• 1. Why did the problem occur?
– Due to a record concurrency lock
• 2. Why did concurrency lock get into production?
– A failure to detect during testing
• 3. Why?
– Concurrency requirements not in test cases
• 4. Why?
– Business Analysts didn’t ask about sharing needs
• 5. Why? (process)
– No standards exist for concurrency
– True Root Cause
Which RCA method to use next?
• What are the tactics you will start with??
[Diagram: RCA method sequence: Pareto, then Cause & Effect, then Ishikawa, then Hypothesis Testing & Validation, then 5 Whys, ending at Root Cause; each stage draws on Facts and Interview]
The Basic Problem Solving Process
Primary Goal
• Prevent Problems from ever recurring by taking effective corrective actions.
Steps:
(1) Correctly defining the problem,
(2) Finding the root cause(s) of the problem through Root Cause Analysis
(3) Determining the most effective corrective actions to take (Range of Solution Options), and
(4) Implementing the solution to successfully manage the problem (Implement Best Options)
Options & Solutions
• Determine an effective range of solution
options to address the root cause(s)
• For each option:
– Assess from a business justified perspective
– Consider an assessment of risk
– Implement using an appropriate project plan
Activity 9: Options & Solutions
• Complete the Options and Solutions table
• Identify your recommended solution(s)
Answers
Conclusions
Lessons Learned:
• How can we do better?
• What worked well with the case study?
• What could have been done differently
for improvement in our Problem Solving
approach?
Post Workshop Evaluation
Thank-You!
• Please send Comments, Suggestions, and all Requests to: