Ten Data Analysis Tools You Can’t Afford to be Without

Neil W. Polhemus, CTO, StatPoint Technologies, Inc.

George H. Dyson, Director of Six Sigma Services

Copyright © 2010 by StatPoint Technologies, Inc.

Business Improvement Objectives

Businesses must create value. This output must be greater than the inputs needed to produce it.

If the output meets the customers’ needs, the business is effective.

If the business creates added value with minimum resources, the business is efficient.

The role of Six Sigma is to help a business produce the maximum value while using minimum resources (Pyzdek 2003).

Six Sigma Business Successes

• Cost reductions

• Market-share growth

• Defect reductions

• Culture changes

• Productivity improvements

• Customer relations improvements

• Product and service improvements

• Cycle-time reductions

(CSSBB Primer, 2001) / (Pande, 2000)

All of these successes have a common thread...

DATA!!!

Data drives analysis.

Analysis uses statistical models & tools of all kinds.

To extract meaningful information

To uncover signals in the presence of noise

To understand the past

To monitor the present

To forecast the future

This understanding reduces costs and saves money.

Deming: “Doesn’t anyone care about Profit?”

Examples

Product comparisons

Survey analysis

Distribution fitting

Comparison of multiple samples

Outlier detection

Curve fitting

Response surface modeling

Time series forecasting

Event rate modeling

Interactive maps


Problem #1: Product Comparisons

Consumer Reports 2010: Sedans (Family, Upscale, Luxury)

7 variables:

Price (dollars)

Road-test score (0 to 100)

Predicted reliability (1 to 5)

Owner satisfaction (1 to 5)

Owner cost (1 to 5)

Safety (1 to 5)

Fuel economy (overall mpg)

Data file: cars.sgd (n=30)

Multivariate Visualization

The data consist of 7 quantitative variables plus one categorical factor. Each car is thus a point in 7-dimensional space with one other feature. How can we visualize what’s going on?

1. 2-D scatterplot

2. 3-D scatterplot

3. Scatterplot matrix

4. Bubble chart

5. Parallel coordinates plot

6. Star glyphs

7. Chernoff faces

8. Radar/spider plot

2-D Scatterplot

Useful for plotting 2 dimensions.

[Figure: Plot of MPG vs Price (X 1000), with points coded by Class (Family, Luxury, Upscale).]

3-D Scatterplot

Useful for plotting 3 dimensions.

[Figure: Plot of MPG vs Price (X 1000) and Road Test, with points coded by Class (Family, Luxury, Upscale). The Toyota Camry Hybrid and Nissan Altima Hybrid are labeled.]

Bubble Chart

Uses the size of each bubble to illustrate a third dimension.

[Figure: Bubble Chart for Reliability, plotting MPG vs Price (X 1000), with bubbles coded by Class (Family, Luxury, Upscale).]

Scatterplot Matrix

Plots all pairs of variables.

[Figure: scatterplot matrix of Price, Road Test, Reliability, Owner Satisfaction, Owner Cost, Safety, and MPG, with points coded by Class (Family, Luxury, Upscale).]

Parallel Coordinates Plot

Each case is shown as a line connecting the values of the variables.

[Figure: Parallel Coordinates Plot on a 0 to 1 scale, with axes for MPG, Price, Road Test, Reliability, Owner Satisfaction, Safety, and Owner Cost, and lines coded by Class (Family, Luxury, Upscale).]

Star Glyphs

Each case is shown as a polygon with vertices scaled by the variables.

[Figure: star glyph key with rays for 1/Price, MPG, Road Test, Reliability, Owner Satisfaction, Safety, and Owner Cost.]

Star Glyphs

Variables are scaled so that larger area is better.

[Figure: one star glyph per car, coded by Class (Family, Luxury, Upscale), for the following 30 cars.]

Nissan Altima, Honda Accord, Toyota Camry XLE, Volkswagen Passat, Toyota Camry Hybrid, Hyundai Sonata, Chevrolet Malibu, Ford Fusion, Mercury Milan, Nissan Altima Hybrid, Ford Taurus, Kia Optima, Chevrolet Impala, Lexus ES, Toyota Avalon, Acura TL, Hyundai Azera, Lincoln MKZ, Buick Lucerne, Volvo S60, Infiniti M35 AWD, Infiniti M35, Audi A6, Acura RL, BMW 535i, Cadillac STS, Cadillac DTS, Lexus GS 300, Lexus GS 450 Hybrid, Volvo S80

Chernoff Faces

Each variable is assigned to a different feature of the faces.

[Figure: one Chernoff face per car, coded by Class (Family, Luxury, Upscale), for the same 30 cars shown in the star glyphs.]

Feature Assignments

Unfortunately, some features have a greater impact than others.

Radar/Spider Plot

Good for comparing a small number of cases.

[Figure: Radar/Spider Plot comparing four models (Volkswagen Passat, Chevrolet Impala, Lincoln MKZ, Cadillac STS) on these axes: Owner Cost (0.0-5.0), Price (25000.0-55000.0), Road Test (50.0-100.0), MPG (15.0-25.0), Owner Satisfaction (0.0-5.0), Reliability (0.0-5.0), Safety (0.0-5.0).]

Problem #2: Survey Analysis

The most commonly used statistical procedure is the calculation of a two-way table (a tabulation of responses that can be classified in 2 ways).

Such crosstabulations result in contingency tables that provide much useful information.

Example: On January 18-19, 2010, Rasmussen Reports asked 1000 likely voters: “How would you rate the U.S. healthcare system?”

Data file: healthcare.sgd


Mosaic Plot

Scales the area of each bar according to the counts in the table.

Chi-square Test

Tests whether the row and column classifications are independent. A small P-value indicates a lack of independence.
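As a rough illustration of the calculation behind a chi-square test of independence, the sketch below uses only the Python standard library. The 2x3 table of counts is made up for demonstration and is not the Rasmussen survey data; the function name `chi_square_statistic` is likewise hypothetical. The closed-form P-value used here holds only for 2 degrees of freedom.

```python
import math

# Hypothetical 2x3 table of counts (two respondent groups by three rating
# categories). These numbers are invented and are NOT the survey results.
observed = [
    [30, 50, 20],
    [20, 40, 40],
]

def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square_statistic(observed)

# With exactly 2 degrees of freedom the chi-square upper tail is exp(-x/2).
# This shortcut does not apply to other degrees of freedom.
p_value = math.exp(-stat / 2) if dof == 2 else None

print(f"chi-square = {stat:.3f}, df = {dof}, P-value = {p_value}")
```

For tables with other degrees of freedom, a statistics package would normally supply the P-value.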

Correspondence Analysis

Used to help visualize the important information in two-way tables.

Problem #3: Distribution Fitting

In many studies, determining the distribution of a quantitative variable is critical.

Common examples covered in Six Sigma include capability studies.

Distribution fitting is also quite critical in many design problems.


Data file: waves.sgd (n=26,304)

Source: www.iahr.net

Frequency Histogram

Shows the number of observations in non-overlapping intervals.

[Figure: Histogram of Height, with frequency on the vertical axis.]

Normal Distribution

The normal distribution is a poor model for this data.

[Figure: Histogram for Height with the fitted Normal distribution overlaid.]

Comparison of Distributions

Fits many distributions and sorts them by goodness of fit.

Lognormal Distribution

The 3-parameter lognormal distribution is much better.

[Figure: Histogram for Height with three fitted distributions overlaid: Largest Extreme Value, Lognormal (3-Parameter), and Normal.]

Problem #4: Multiple Samples

Data are frequently obtained from more than one sample.

Determining whether the samples differ significantly (or not) is an important application of data analysis.

Data file: thickness.sgd (n=480)

Box-and-Whisker Plots

A very useful plot for comparing samples (from John Tukey).

[Figure: annotated box-and-whisker plot of Length. From top to bottom, the annotations mark the following.]

Outlier (*): more than 1.5 x IQR below the 25th percentile, or more than 1.5 x IQR above the 75th percentile

Maximum observation (within 1.5 x IQR above the 75th percentile)

75th percentile

Median (50th percentile)

25th percentile

Minimum observation (within 1.5 x IQR below the 25th percentile)

The IQR (interquartile range) is the distance between the 25th and 75th percentiles.

A plus sign may be added to show the sample mean.
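The 1.5 x IQR rule described above can be sketched in a few lines of standard-library Python. The length values are hypothetical, the function name `tukey_fences` is an illustrative choice, and the "inclusive" quartile method is only one of several conventions, so a statistics package may report slightly different quartiles.

```python
import statistics

def tukey_fences(values):
    """Summarize a sample the way a box-and-whisker plot does."""
    # Quartiles via the "inclusive" convention (an assumption; packages differ).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the fences are drawn individually as outliers.
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

# Hypothetical lengths, invented for illustration only.
lengths = [57.2, 58.1, 58.4, 58.9, 59.0, 59.3, 59.6, 60.1, 60.4, 63.8]
summary = tukey_fences(lengths)
print(summary)
```

In this made-up sample only the largest value falls beyond the upper fence, so it would be plotted as a separate outlier point.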

Notched Box-and-Whisker Plots

Non-overlapping notches indicate significantly different medians.

HSD Intervals

Allow pairwise comparison of all level means.

Problem #5: Outlier Detection

Many data sets contain aberrant observations that don’t come from the same distribution as the others.

Identifying outliers and treating them separately often results in better models.


Data file: bodytemp.sgd (n=130)

Outlier Plot

Shows each data value with lines at 1, 2, 3, and 4 sigma.

Grubbs’ Test

A small P-value indicates that the extreme Studentized deviate (ESD) is highly unusual.
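The extreme Studentized deviate itself is simple to compute: it is the largest absolute deviation from the sample mean, divided by the sample standard deviation. The sketch below shows only that statistic, using hypothetical temperature readings rather than the bodytemp.sgd data. Converting it to a P-value requires Student's t distribution, which is left to a statistics package.

```python
import statistics

def extreme_studentized_deviate(values):
    """Return (ESD, suspect value): the largest |x - mean| / s and its x."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    suspect = max(values, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect

# Hypothetical body temperatures, invented for illustration only.
temps = [98.2, 98.6, 98.4, 98.0, 98.8, 98.3, 98.5, 100.8]
esd, suspect = extreme_studentized_deviate(temps)
print(f"most extreme value = {suspect}, ESD = {esd:.3f}")
```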

Problem #6: Curve Fitting

A common data analysis problem involves determining the relationship between a response variable Y and a predictor variable X.

If we can estimate a model where Y = f(X), then we can use that model to make predictions.


Data file: chlorine.sgd (n=44)

Simple Linear Regression

Fits a linear model of the form Y = mX + b.

[Figure: Plot of Fitted Model, chlorine = 0.48551 - 0.00271679*weeks, with 95% confidence limits for the mean and 95% prediction limits for new observations.]

Comparison of Alternative Models

Fits many transformable nonlinear models and sorts them by R-Squared.

Reciprocal X Model

A nonlinear model of the form Y = m/X + b is much better.

[Figure: Plot of Fitted Model, chlorine = 0.368053 + 1.02553/weeks.]
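A reciprocal-X model is "transformable" because substituting U = 1/X turns Y = m/X + b into the straight line Y = mU + b, which ordinary least squares can fit. The sketch below illustrates that idea with noise-free, made-up data generated from Y = 0.37 + 1.0/X. It is not the chlorine.sgd data, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical, noise-free data following y = 0.37 + 1.0 / x.
weeks = [8, 10, 12, 16, 20, 26, 32, 42]
chlorine = [0.37 + 1.0 / w for w in weeks]

# Straight-line fit on the original scale.
m_lin, b_lin = fit_line(weeks, chlorine)
r2_lin = r_squared(weeks, chlorine, m_lin, b_lin)

# Reciprocal-X fit: regress y on u = 1/x.
recip = [1.0 / w for w in weeks]
m_rec, b_rec = fit_line(recip, chlorine)
r2_rec = r_squared(recip, chlorine, m_rec, b_rec)

print(f"linear:     y = {b_lin:.4f} + {m_lin:.5f}*x, R-squared = {r2_lin:.4f}")
print(f"reciprocal: y = {b_rec:.4f} + {m_rec:.4f}/x, R-squared = {r2_rec:.4f}")
```

Because the made-up data follow the reciprocal form exactly, the transformed fit recovers the generating coefficients and has the higher R-squared. Real data would include noise.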

Problem #7: Response Surfaces

The MISSION: air dominance at the lowest possible price.

Use optimization models, built from performance data, to design the best aircraft engine.

Note: the data have been altered and are for demonstration purposes only.


Problem Statement

Optimize 3 response variables:

Minimize total fleet acquisition cost (Y1)

Maximize climb rate (Y2)

Maximize launch rate (Y3)

Input factors:

X1: Fan Pressure Ratio: 3.9 to 4.7

X2: Overall Pressure Ratio: 34 to 40

X3: Inlet airflow: 240 to 270 pps

Data file: engines.sgx (n=584)

Design Plot

Shows the location of the historical data within the factor space.

[Figure: Experimental Region, a 3-D plot with axes FPR (3.9 to 4.7), OPR (34 to 40), and Air Flow (240 to 270).]

Standardized Pareto Charts

Show the significant factors affecting each response.

[Figure: Standardized Pareto Charts for Y1 Cost, Y2 Ps, and Y3 Launch. Each chart plots the standardized effects of A:FPR, B:OPR, C:Air Flow, and the second-order terms AA, AB, AC, BB, BC, and CC, with positive and negative effects distinguished.]

Desirability Function

Quantifies the desirability of a joint response (Y1, Y2, Y3).

For responses to be minimized:

For responses to be maximized:

Combined desirability:

D=d(Y1)*d(Y2)*d(Y3)
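The individual desirability formulas did not survive in this copy of the slides. One common choice, assumed here purely for illustration, is a linear ramp between a "low" and a "high" limit, clipped to the range 0 to 1. The limits and response values below are hypothetical, and the combined value follows the product shown above (some formulations take a geometric mean instead).

```python
def desirability_minimize(y, low, high):
    """1 at or below `low`, 0 at or above `high`, linear in between (assumed form)."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return (high - y) / (high - low)

def desirability_maximize(y, low, high):
    """0 at or below `low`, 1 at or above `high`, linear in between (assumed form)."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return (y - low) / (high - low)

def combined_desirability(cost, climb_rate, launch_rate):
    # Hypothetical limits for each response; not taken from engines.sgx.
    d1 = desirability_minimize(cost, low=50.0, high=100.0)
    d2 = desirability_maximize(climb_rate, low=500.0, high=1000.0)
    d3 = desirability_maximize(launch_rate, low=0.0, high=10.0)
    # Product of the individual desirabilities, as in D = d(Y1)*d(Y2)*d(Y3).
    return d1 * d2 * d3

D = combined_desirability(cost=60.0, climb_rate=750.0, launch_rate=9.0)
print(f"combined desirability = {D:.2f}")
```

Because the individual values are multiplied, a single response with zero desirability drives the combined desirability to zero.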

Optimal Conditions

Found at the levels shown below:

Response Surface

Shows the estimated desirability throughout the experimental region.

[Figure: Desirability Plot (Air Flow=254.697), with axes FPR, OPR, and Air Flow and a desirability scale from 0.0 to 1.0.]

Problem #8: Time Series Data

Data recorded at equally spaced points in time is called a time series.

Time series models are used for various purposes:

Analysis of trends and seasonal effects

Forecasting

Control

Autocorrelation between adjacent observations requires special models.


Data file: customers.sgd (n=168)

Time Sequence Plot

Plots the data versus time.

[Figure: Time Series Plot for Customers (X 1000) by Month.]

Autocorrelation Function

Estimates the correlation between observations at different lags.

[Figure: Estimated Autocorrelations for Customers, plotted for lags 0 to 36 on a scale from -1 to 1.]
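The usual sample autocorrelation at lag k divides the sum of lagged cross-products of deviations from the mean by the total sum of squared deviations. The sketch below applies that estimator to a hypothetical monthly series, a pure 12-month sine wave, rather than to customers.sgd. The function name `autocorrelation` is an illustrative choice.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numer / denom

# Hypothetical series: four years of a pure 12-month seasonal cycle.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]

r6 = autocorrelation(series, 6)    # half a cycle apart: strongly negative
r12 = autocorrelation(series, 12)  # a full cycle apart: strongly positive
print(f"r(6) = {r6:.3f}, r(12) = {r12:.3f}")
```

A seasonal series shows this pattern of peaks at multiples of the seasonal period, which is one reason the autocorrelation plot is inspected before choosing a forecasting model.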

Seasonal Decomposition

Shows the average value during each season (scaled to 100).

[Figure: Seasonal Index Plot for Customers, showing the seasonal index for seasons 1 to 12.]

Seasonal Subseries Plot

Shows the seasonal averages and trend within each season.

[Figure: Seasonal Subseries Plot for Customers (X 1000) by Season.]

Annual Subseries Plot

Shows the seasonal effect separately for each cycle.

[Figure: Annual Subseries Plot for Customers (X 1000) by Season, with one line per cycle from 1996 to 2009.]

Seasonally Adjusted Data

Removes the seasonal effects from the data.

[Figure: Seasonally Adjusted Data Plot for Customers (X 1000) by Month.]

Automatic Forecasting

Fits many models and automatically selects the best.

ARIMA Model

The selected model is a rather complicated seasonal ARIMA model.

[Figure: Time Sequence Plot for Customers (X 1000) by Month, ARIMA(2,1,1)x(1,1,2)12, showing actual values, forecasts, and 95.0% limits.]

Problem #9: Event Rate Modeling

When the data to be analyzed consist of the time at which events occur, the process by which those events are generated is called a point process.

The most common type of point process is a Poisson process, in which the times between events are independent and follow a negative exponential distribution.

If the event rate is constant, the process is called homogeneous. If the event rate changes over time, then the process is called nonhomogeneous.


Data file: earthquakes.sgd (n=52)

Plot of Magnitude Versus Date

Shows an apparent increase in magnitude over the sampling period.

[Figure: Plot of Magnitude vs Date, with Date running from 1/1/08 to 12/31/10.]

Point Process Plot

Shows the dates of occurrence only.

[Figure: Events Plot for Date, marking each earthquake along a time axis from 1/1/08 to 12/31/10.]

Event Rate

The critical parameter in a point process is the rate of events per unit time (such as earthquakes per year). This parameter is usually called λ.

The rate parameter is also related to the mean time between events, with MTBE = 1 / λ.

λ may be constant (a homogeneous process) or vary over time (a nonhomogeneous process).
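For a homogeneous process, the rate can be estimated as the number of events divided by the length of the observation window, and the MTBE follows as its reciprocal. The sketch below uses hypothetical event times in years, not the earthquakes.sgd data, and the function name `estimate_rate` is an illustrative choice. It also evaluates the Poisson probability of seeing no events in an interval of length t, which is exp(-rate * t).

```python
import math

def estimate_rate(event_times, window_length):
    """Events per unit time, assuming a constant (homogeneous) rate."""
    return len(event_times) / window_length

# Hypothetical event times (in years) observed over a 4-year window.
event_times = [0.3, 0.9, 1.4, 2.2, 2.6, 3.1, 3.5, 3.9]
rate = estimate_rate(event_times, window_length=4.0)

# Mean time between events is the reciprocal of the rate.
mtbe = 1 / rate

# Under a homogeneous Poisson process, the chance of no events in an
# interval of length t is exp(-rate * t).
p_quiet_half_year = math.exp(-rate * 0.5)

print(f"rate = {rate} per year, MTBE = {mtbe} years")
print(f"P(no events in 0.5 years) = {p_quiet_half_year:.3f}")
```

If the rate changes over time, a single overall estimate like this one is only an average, which is why the cumulative events plot and a trend test are examined as well.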

Cumulative Events Plot

The slope of the line is related to the event rate.

[Figure: Mean Cumulative Events Plot for Date, showing the number of events against time from 1/1/08 to 12/31/10.]

Trend Test

A small P-value would indicate a significant trend.

Other Uses

Point process models are also very useful for estimating failure rates.

[Figure: Events Plot for Aircraft 1 through Aircraft 13 against time or distance (0 to 2500).]

Between Group Comparisons

Tests can be made to determine whether there are significant differences.

[Figure: Mean Cumulative Events Plot, showing the number of events against time or distance (0 to 2500) for Aircraft 1 through Aircraft 13.]

Problem #10: Interactive Maps

Graphics that allow the user to interact with the data are extremely useful.

Maps are one important example.


Data file: census2000.sgd (n=51)


U.S. Map Statlet

The slider changes the cutoff between the red and blue states.