Statistics for Non-Statisticians
A PRACTICAL APPROACH FOR VALIDATION AND QUALITY ENGINEERING PROFESSIONALS
RAUL SOTO, MSC, CQE
IVT CONFERENCE - JUNE 2016, PHILADELPHIA PA
(c) 2016 / Raul Soto 1
The contents of this presentation represent the opinions of the speaker, and not necessarily those of his present or past employers.
About the Author
• 20+ years of experience in the medical devices, pharmaceutical, biotechnology, and consumer electronics industries
– MS Biotechnology, emphasis in Biomedical Engineering
– BS Mechanical Engineering
– ASQ Certified Quality Engineer (CQE)
• Has led validation / qualification efforts in multiple scenarios:
– High-speed, high-volume automated manufacturing and packaging equipment; machine vision systems
– Laboratory information systems and instruments
– Enterprise resource planning applications (e.g. SAP)
– IT network infrastructure, Cognos & Business Objects reports
– Manufacturing Execution Systems (MES)
– Mobile apps
– Product improvements, material changes, vendor changes
• Contact information: Raul Soto, [email protected]
What this talk is about
• Introduce and describe the main statistical tools used for validation, process development, optimization, and control
• Understand basic concepts, underlying assumptions, and limitations
• Understand why we can't just plug numbers into Minitab without knowing the fundamental assumptions ("the fine print")
• We can't teach two semesters of statistics in 90 minutes …
Some Uses of Statistics in Validation
• Quantify how well a process can meet its specifications (new process, before/after changes)
• Determine if a process change had the intended effect
• Determine if a process change that was not supposed to impact the product actually had no impact
• Quantify the sources of variation in your process
• Compare equivalence of materials from different vendors
• Determine if multiple lines running the same product are equivalent or not
• Model your process outputs in terms of your process inputs
• Find the process input settings that optimize your process outputs
• Make claims about your process average, or about every unit made by your process
Tools that will be presented
• Process Capability Analysis
• Hypothesis Testing
• Simple Linear Regression
• Analysis of Variance (ANOVA)
• Design of Experiments (DoE)
• Confidence / Prediction / Tolerance intervals
Process Capability Analysis
HOW TO DETERMINE IF YOUR PROCESS IS CAPABLE
Process Capability
• Capable Process: we can make product that meets specifications
• Process Capability: Quantifies numerically how capable a process is of meeting its specifications
• Compares the actual process width vs the design width (USL – LSL)
• Assumes NORMALITY of the data
[Minitab Process Capability Report for Line 1 (95.0% confidence): LSL 125, Target 130, USL 135; sample mean 130.5, N = 100; StDev(Within) 0.704, StDev(Overall) 0.676; Cp 2.37, Cpk 2.13 (CI 1.80–2.46); Pp 2.47, Ppk 2.22 (CI 1.91–2.53); Cpm 0.79; expected PPM total 0.00.]
CPK vs Sigma Levels
• Sigma level: how many standard deviations can we fit between the process mean and the closest specification limit
• Cpk = the lower of CPU and CPL:

CPU = (USL − μ) / (3σ)
CPL = (μ − LSL) / (3σ)
http://www.six-sigma-material.com/Tables.html
Other capability indexes:

• Cp: ratio of the design spread to the process spread

Cp = (USL − LSL) / (6σ)

• Cpm: overall capability; penalizes when the process is off-center

Cpm = (USL − LSL) / (6 · √(σ² + (μ − T)²)), where T is the target
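These index formulas can be sketched in a few lines of Python (hypothetical data; note that with a single sample this uses the overall sample standard deviation, whereas Minitab's Cp/Cpk uses a within-subgroup sigma estimate):

```python
import statistics

def capability(data, lsl, usl):
    """Compute Cp and Cpk from one sample.

    Uses the overall sample standard deviation, so on a single
    sample these are really Pp/Ppk-style estimates; Minitab's
    Cp/Cpk uses a within-subgroup sigma instead.
    """
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    cp = (usl - lsl) / (6 * sd)        # design width / process width
    cpu = (usl - mean) / (3 * sd)      # upper one-sided index
    cpl = (mean - lsl) / (3 * sd)      # lower one-sided index
    return cp, min(cpu, cpl)           # Cpk = lower of CPU, CPL

# Hypothetical measurements against LSL = 4, USL = 16
cp, cpk = capability([9.0, 10.0, 11.0], lsl=4.0, usl=16.0)
# mean = 10, sd = 1  =>  Cp = 2.0, Cpk = 2.0
```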
Cpk vs Ppk
• The basic difference between Ppk and Cpk is the standard deviation used for the calculation. The formulas used are basically the same.
• For Cpk we use the within-lot variability.
• For Ppk we use the overall variability (all lots).
• Cpk will typically be higher than Ppk.
• As we reduce lot-to-lot variability, the process becomes stable, and Ppk approaches Cpk.
Short-term vs Long-Term Capability
• Short-term variation: the variation within a subgroup (for example, one shift, one operator, or one material batch)
• Overall (long-term) variation: the variation of all measurements, which is an estimate of the overall process variation
• Long-term capability of a process allows us to view the bigger picture
• LT capability is typically lower than the process capability of individual lots
Short-term vs Long-Term Capability
or "why is my process performing poorly, if my three validation batches were so good?"
• The distribution or the mean of a process may be stable for a single set of raw materials, operator, machine, etc.
• But long term variation may increase because of process shifts and drifts
• Possible causes of process shifts: different operators, raw material batches, machines, changes in the environment
• Possible causes of process drifts: equipment aging, part wear, environmental factors
• Some authors assume a short-term / long-term process shift of 1.5 sigma as a rule of thumb; that is, over the long term the process mean may shift by as much as 1.5 standard deviations from its short-term average.
Process Shifts, Drifts
• Looking only at short term distributions may give the impression that the process is in control and capable
• But if the process mean or the process variability are drifting over time, the long term distribution will show a different picture
• Usually when this happens, the validation of the process is questioned.
• Validation should not be seen as a one-shot deal, but as part of a continuous improvement process.
[Figure: "Process Mean Shifting Over Time" – individual value plot and histograms (normal fits) of Lots 1–7. Lot means shift between roughly 6.6 and 13.3 while within-lot StDevs stay near 1 (0.80–1.23, N = 30 each); overall: mean 9.947, StDev 2.492, N = 210.]
[Figure: "Process Mean Drifting Over Time" – individual value plot and histograms (normal fits) of Lots A–D. Lot means drift upward: Lot A 10.09, Lot B 14.83, Lot C 20.36, Lot D 24.82 (StDevs 0.63–1.14, N = 30 each); overall: mean 17.53, StDev 5.666, N = 120.]
• When you base your validation only on the Cpk of individual lots, you are not taking into account long-term variation
• This method does not provide a realistic view of how the process will perform in the long term.
• Long term process capability should also be calculated, by using the data from all the lots.
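That calculation can be sketched with made-up lot data (all values invented for illustration): pooling the within-lot variances gives the short-term sigma, while treating all lots as one sample gives the long-term sigma.

```python
import math
import statistics

# Hypothetical lots whose means shift over time (values invented)
lots = {
    "A": [29.8, 30.1, 29.9, 30.2, 30.0],
    "B": [31.0, 31.2, 30.9, 31.1, 31.0],
    "C": [28.9, 29.1, 29.0, 28.8, 29.2],
}
lsl, usl = 25.0, 35.0

# Short-term sigma: pool the variance inside each (equal-sized) lot
sigma_within = math.sqrt(statistics.fmean(
    statistics.variance(v) for v in lots.values()))

# Long-term sigma: treat all measurements as one sample
all_data = [x for v in lots.values() for x in v]
mean = statistics.fmean(all_data)
sigma_overall = statistics.stdev(all_data)

cpk = min(usl - mean, mean - lsl) / (3 * sigma_within)
ppk = min(usl - mean, mean - lsl) / (3 * sigma_overall)
# The shifting lot means inflate sigma_overall, so Ppk < Cpk
```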
Short-term vs Long-Term Capability
The 3-Consecutive-Lots Rule of Thumb?
Process Capability Example 1:

Calculate lot-to-lot (short-term) and process (long-term) capability using 4 validation lots
Descriptive Statistics: Lot A, Lot B, Lot C, Lot D

Variable  N   N*  Mean    SE Mean  StDev  Minimum  Q1      Median  Q3      Maximum
Lot A     60  0   29.748  0.129    0.999  27.271   28.965  29.737  30.331  31.944
Lot B     60  0   30.071  0.131    1.015  27.520   29.351  30.135  30.709  32.050
Lot C     60  0   30.192  0.132    1.022  27.560   29.548  30.203  30.763  33.302
Lot D     60  0   29.671  0.143    1.106  27.045   28.835  29.694  30.541  31.859
[Figure: boxplot of Lots A–D, all centered near 30 with similar spread.]
Consistent process
[Minitab capability report, Lot A (95.0% confidence): LSL 25, Target 30, USL 35; mean 30, N = 60; Cp 1.66, Cpk 1.66; Pp 1.67, Ppk 1.67; Cpm 1.62; expected PPM total 0.56 (overall).]
[Minitab capability report, Lot B (95.0% confidence): LSL 25, Target 30, USL 35; mean 30, N = 60; Cp 1.63, Cpk 1.63; Pp 1.64, Ppk 1.64; Cpm 1.64; expected PPM total 0.84 (overall).]
[Minitab capability report, Lot C (95.0% confidence): LSL 25, Target 30, USL 35; mean 30, N = 60; Cp 1.62, Cpk 1.62; Pp 1.63, Ppk 1.63; Cpm 1.60; expected PPM total 1.00 (overall).]
[Minitab capability report, Lot D (95.0% confidence): LSL 25, Target 30, USL 35; mean 30, N = 60; Cp 1.50, Cpk 1.50; Pp 1.51, Ppk 1.51; Cpm 1.44; expected PPM total 6.16 (overall).]
Short-term (lot-to-lot) results:

Lot A: Cpk = 1.66, expected defects per million = 0.56
Lot B: Cpk = 1.63, expected defects per million = 0.84
Lot C: Cpk = 1.62, expected defects per million = 1.0
Lot D: Cpk = 1.50, expected defects per million = 6.16

If our validation criterion were based solely on each individual lot's Cpk being equal to or greater than 1.0, we would pass.
Let’s look at Long-Term variation:
[Minitab capability report, Long Term – all lots (95.0% confidence): LSL 25, Target 30, USL 35; mean 30, N = 240; StDev(Within) 1.038, StDev(Overall) 1.053; Cp 1.61, Cpk 1.61; Pp 1.58, Ppk 1.58; expected PPM total 2.04 (overall).]
LONG TERM RESULTS
Ppk = 1.58
Expected defects per million units = 2.04
Process Capability Example 2:
Calculate lot-to-lot (short term) and process (long term) capabilityusing 3 validation lots
Descriptive Statistics: Lot 1, Lot 2, Lot 3

Variable  N   N*  Mean    SE Mean  StDev  Minimum  Q1      Median  Q3      Maximum
Lot 1     60  0   15.500  0.0754   0.584  14.138   15.020  15.513  15.807  16.992
Lot 2     60  0   11.914  0.0605   0.469  10.562   11.633  11.902  12.218  13.114
Lot 3     60  0   17.869  0.0810   0.627  16.299   17.401  17.959  18.232  19.513
[Figure: boxplot of Lot 1, Lot 2, Lot 3 – the lot means differ markedly (≈15.5, 11.9, 17.9).]
USL = 20, LSL = 10
Inconsistent process
[Minitab capability report, Lot 1 (95.0% confidence): LSL 10, Target 15, USL 20; mean 15, N = 60; Cp 2.95, Cpk 2.95; Pp 2.85, Ppk 2.85; Cpm 2.16; expected PPM total 0.00.]
[Minitab capability report, Lot 2 (95.0% confidence): LSL 10, Target 15, USL 20; mean 15, N = 60; Cp 3.33, Cpk 3.33; Pp 3.56, Ppk 3.56; Cpm 0.53; expected PPM total 0.00.]
[Minitab capability report, Lot 3 (95.0% confidence): LSL 10, Target 15, USL 20; mean 15, N = 60; Cp 2.80, Cpk 2.80; Pp 2.66, Ppk 2.66; Cpm 0.56; expected PPM total 0.00.]
Individually, based solely on short-term process variation, each lot looks good:

Lot 1: Cpk = 2.95, expected defects per million = 0
Lot 2: Cpk = 3.33, expected defects per million = 0
Lot 3: Cpk = 2.80, expected defects per million = 0

If our validation criterion were based solely on each individual lot's Cpk being equal to or greater than 1.0, we would pass.
But when we look at the long-term variation of the overall process, the picture changes:
[Minitab capability report, Long Term (95.0% confidence): LSL 10, Target 15, USL 20; mean 15, N = 180; StDev(Within) 0.565, StDev(Overall) 2.518; Cp 2.95, Cpk 2.95; Pp 0.66, Ppk 0.66; expected PPM total 47,091.95 (overall).]
The overall capability of the process is significantly lower than the capability of the individual lots, because the process mean is shifting from lot to lot.
Overall:
Ppk = 0.66expected defects per million = 47,092 (4.7%)
Look at your long term process capability
This will allow you to obtain a better picture of how the process will perform in the long term.
Hypothesis Testing
HOW TO MAKE STATISTICAL COMPARISONS BETWEEN TWO SAMPLES, OR BETWEEN A SAMPLE AND A SET VALUE
Hypothesis Testing
• Compare two samples
– before/after a change in the process– Raw materials from different vendors– Two plants / machines / tool sets / operators / shifts, etc.
• Compare a sample against a specific value

– Determine if a process / product change has an effect on the process mean or standard deviation
– Claim that your mean is equal to, less than, or greater than a specific value
• Hypothesis tests can be used to test claims of:
– Equivalency
– Superiority
– Non-inferiority
• Tests for quantitative data
– Tests for means (Z-test, t-test)
• Paired test for means (paired t-test)
– Tests for variation (F-test, Chi-squared test)
– Tests for proportions (Z-test)
• Tests for qualitative data
– Two-sample test, binary data: Fisher's Exact test
– Two-sample test, paired binary data: McNemar's test
Hypothesis Testing
• A hypothesis is a CLAIM we want to make, and substantiate with evidence

• NULL and ALTERNATE hypotheses

– NULL (Ho): what is assumed to be true
– ALTERNATE (Ha): typically what we want to prove

– You need a high level of certainty (typically 90, 95, or 99%) to prove the alternate hypothesis

– This means that IF the null hypothesis were true, the probability of observing a difference as extreme as (or more extreme than) the one observed would be at most 10% / 5% / 1% (a "frequentist" statement)
What is a Hypothesis?
• [1] Start by formulating your QUESTION OF INTEREST
– What claim do you want to make? What question do you want to answer?
– Be sure everyone on the team is aligned on what the question of interest is
– Select the correct type of hypothesis test to answer your question
• Test for means, for variance, for proportions
• One-sided vs two-sided
• [2] How accurate do you want your hypothesis test to be?
– This is necessary to calculate the sample size you need for your test
• [3] Determine your desired confidence level
– 99%, 95%, and 90% are the most commonly used
What is a Hypothesis?
• Question: Do Manufacturing Lines 1 and 2 make equivalent product with respect to a specific characteristic (weight, length, thickness, hardness, pull strength, etc.)
– NULL (Ho):Lines 1 and 2 make equivalent product with respect to the selected characteristic
– ALT (H1):Lines 1 and 2 do NOT make equivalent product with respect to the selected characteristic
Null vs Alternate Hypothesis
• Normality:– The populations from which you get your samples should follow a normal distribution– Check assumption with a normal probability plot, or a histogram of residuals
• Independence:– Assumes that one data point does not influence the value of the next data point– Randomization of runs helps– Check assumption with a Runs vs Order run chart
• Random:– Your sample units must be selected at random from the population
Hypothesis Testing: Assumptions
Independence
• Independence means that one value or reading from a process is not affected or dependent on the previous value
• Required by many statistical procedures
• If this is violated, results are likely to be severely biased
Using existing / historical data
• Always be careful when using historical data – understand the risks involved
• Look out for confounding or hidden factors (variables)
• Do you have the run order of the data?
• Is the data set complete?
• Were the instruments used to measure characteristics capable?
• When was the data collected? Has the process changed since?
p-values
• Used to determine whether to reject, or not reject, a null hypothesis

• Definition: the p-value is the probability that the test statistic will take on a value that is at least as extreme as the observed value of the statistic when the null hypothesis is true (Montgomery 1997)

• For a 95% confidence level:
– If p-value < 0.05, reject the null hypothesis and accept the alternate
– If p-value > 0.05, do not reject the null hypothesis

• This is the typical way to support your conclusions with statistical evidence
Montgomery, Douglas C. Design and Analysis of Experiments, 4th Edition. New York: Wiley, 1997. Print. [Chapter 2.4, page 37]
p-values
1. The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either.
2. The p-value is not the probability that a finding is "merely a fluke."
3. The p-value is not the probability of falsely rejecting the null hypothesis.
4. The p-value is not the probability that replicating the experiment would yield the same conclusion.
5. The significance level, such as 0.05, is not determined by the p-value.
6. The p-value does not indicate the size or importance of the observed effect.
Sterne, J. A. C.; Smith, G. Davey (2001). "Sifting the evidence–what's wrong with significance tests?". BMJ (Clinical research ed.) 322 (7280): 226–231.
Schervish, M. J. (1996). "P Values: What They Are and What They Are Not". The American Statistician 50 (3): 203.
Example
• We are testing a new set of tooling for a manufacturing process
• Cheaper, lasts longer
• Claim: no significant change in the product's critical quality characteristics, at 95% confidence
• We need to provide documented evidence that supports that claim

Product characteristic: pull strength (in lb)

State the null and alternate hypotheses:
Ho: μ1 = μ2 (no statistically significant shift in the process mean)
H1: μ1 ≠ μ2 (statistically significant shift in the process mean)

Select the appropriate test: two-sample hypothesis test for means
• Historically the process mean is 4.5 lb, with a standard deviation of approx 1 lb• We want our hypothesis test to be able to detect a difference of 0.5 lb• To calculate the required sample size (Minitab):
• You will need a sample of at least n = 32 units if you want your hypothesis test to be able to detect a difference of 0.5 lbs
What is "Power"?
• Power is the probability of rejecting a false null hypothesis (1 – β)
• Confidence is the probability of not rejecting a true null hypothesis (1 – α)
• Increasing power increases the required sample size
Power | How sample size is calculated
0.5   | Sample size is calculated using only the α error
0.8   | Using 80% probability of rejecting a false null hypothesis AND (1-α)% probability of accepting a true one
0.9   | Using 90% probability of rejecting a false null hypothesis AND (1-α)% probability of accepting a true one
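How power and confidence drive the sample size can be sketched with a normal-approximation formula (Minitab and other t-based tools will give slightly larger answers, and the result depends strongly on the power you choose):

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.9):
    """Approximate per-group n for a two-sided, two-sample test for
    means. Normal approximation; exact t-based software gives
    slightly larger values."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence part
    z_beta = NormalDist().inv_cdf(power)           # power part
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detect a 0.5 lb shift when sigma is about 1 lb:
n_80 = n_per_group(0.5, 1.0, power=0.8)   # 63 per group
n_90 = n_per_group(0.5, 1.0, power=0.9)   # 85 per group
```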
Two-Sample T-Test and CI: Old, New

Two-sample T for Old vs New

      N   Mean  StDev  SE Mean
Old   32  4.55   1.24     0.22
New   32  5.72   1.08     0.19

Difference = mu (Old) - mu (New)
Estimate for difference: -1.171
95% CI for difference: (-1.752, -0.590)
T-Test of difference = 0 (vs not =): T-Value = -4.03  P-Value = 0.000  DF = 60
(Minitab)
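The same kind of comparison can be sketched without Minitab, using only the Python standard library (hypothetical readings; the p-value here uses a normal approximation, so it is slightly optimistic versus the exact t-distribution value):

```python
import math
from statistics import NormalDist, fmean, stdev

def two_sample_t(a, b):
    """Welch t statistic with a normal-approximation two-sided
    p-value (reasonable for moderate-to-large samples)."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    t = (fmean(a) - fmean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical old vs new tooling pull-strength readings (lb)
old = [4.1, 4.6, 4.3, 4.9, 4.4, 4.7]
new = [5.6, 5.9, 5.4, 6.1, 5.7, 5.5]
t, p = two_sample_t(old, new)   # small p => reject Ho, means differ
```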
Statistical conclusion
p-value = 0.000
We reject Ho, and conclude that the population means are different.
Practical conclusion
• We have obtained documented evidence which provides a high degree of certainty (>95%) that the new tooling causes a statistically significant shift in the process mean.
• We may need to change process parameters if we want to bring the process mean back to its original target (see Design of Experiments).
• What do you call…
… a successful hypothesis test? A Measurement
… an unsuccessful hypothesis test? A Discovery
Hypothesis Testing
Simple Linear Regression
CREATE A MATHEMATICAL MODEL OF YOUR RESPONSES (EX: PRODUCT QUALITY CHARACTERISTICS) AS A FUNCTION OF YOUR FACTORS (EX: PROCESS CRITICAL PARAMETERS)
Simple Linear Regression (SLR)
• Develops a mathematical model that lets us estimate and predict the response as a function of one or more factors.
• Use for quantitative data, not qualitative(use Logistic Regression for qualitative data)
• Correlation does not imply causation!
• Cannot handle interactions easily
• ANOVA can handle interactions and qualitative data
• Normality:
– In the independent variable(s), in the dependent variable, and in the residuals
– Check assumption with a normal probability plot, or a histogram of residuals
• Independence:
– Assumes that one data point does not influence the value of the next data point
– Randomization of runs helps
– Check assumption with a runs-vs-order run chart
– FACTORS must be independent from one another (no confounding)
• Check using Pearson correlation and the Variance Inflation Factor (VIF)
• Homogeneity:
– All residuals follow the same distribution, with mean = 0 and constant variance
– Check assumption with a residuals-vs-fits plot
SLR: Assumptions
Replication vs Subsampling
• NOT the same thing!

• REPLICATION: when each treatment is applied to multiple experimental units (EUs)
• SUBSAMPLING: when a treatment is applied once and multiple observational units (OUs) are measured

• EXAMPLE: You are testing your process at Pressure = 300 psi, Temperature = 100°C
– If you run multiple product units at this setting => this is SUBSAMPLING
– If you run this setting multiple times => this is REPLICATION
• Each allows you to look at different types of variation in the process
Data Collection• When you document the results of your statistical tests:
• Include the RUN ORDER – which should be randomized
• When you document the run order it’s possible to analyze the data and determine if there are any time-related effects; for example if your variability is increasing, or decreasing, with time.
• Decreasing: learning curve effect?
• Increasing: human fatigue? Machine wear? Periodic equipment adjustment needed? Calibration issues?
If these effects are present, then the assumption of independence does NOT hold
• This is used to REDUCE bias – but it does not eliminate it
• The order of the various runs of an experiment or trials should be established at RANDOM
• When there are unknown factors affecting a process, randomization helps block / even out the effects of those factors
– This helps to AVERAGE OUT all unknown / extraneous factors– Also helps to average out time-related effects, such as learning curves or human fatigue
• How ?– Use software that randomizes runs– Use a random numbers table (from a statistics book) and sort
Randomization
Randomization
• EXAMPLE: You want to use regression to model a sealing process defined by three variables:
– Pressure range: 200 – 300 psi
– Temperature range: 100 – 140 °C
– Time: 0.9 – 1.1 sec

• You need to test 8 treatments to challenge the process envelope

• DON'T run them in order. Randomize the order
• Include REPLICATION: run each possible treatment more than once
Runs P (psi) T (°C) t (s)
1 200 100 0.9
2 200 100 1.1
3 200 140 0.9
4 200 140 1.1
5 300 100 0.9
6 300 100 1.1
7 300 140 0.9
8 300 140 1.1
Randomization
• Run order is randomized• 3 replicates per treatment
– This means each treatment is tested 3 times
• If we take multiple sample units at each treatment, that is subsampling.
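The randomized, replicated run plan described above can be generated as a sketch (factor levels taken from the sealing-process example):

```python
import itertools
import random

# Low/high levels from the sealing-process example
pressures = (200, 300)   # psi
temps = (100, 140)       # degrees C
times = (0.9, 1.1)       # seconds

# Full factorial: 2 x 2 x 2 = 8 treatments
treatments = list(itertools.product(pressures, temps, times))

run_order = treatments * 3     # 3 replicates of each treatment
random.shuffle(run_order)      # randomize the run order

for i, (p, t, s) in enumerate(run_order, start=1):
    print(f"Run {i:2d}: P={p} psi, T={t} C, t={s} s")
```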
Experimental vs Observational Units
• Experimental Unit (EU):– the smallest set of entities or units where we apply a treatment– EUs are key to determine sample size
• Observational Unit (OU):– entities or units where we measure one or more characteristics– OUs are important when there is variability in the EU reading
• EXAMPLES– You measure the sealing strength in two sides of a foil or blister package– EU: each individual blister, all sides get the same treatment– OU: 2 per EU
– You have a process where you make 4 packaging blisters at the same time using a 4-cavity forming die– EU: each 4-blister set– OU: each blister, 4 per EU
– You want to test an improved coating process. You take 100 units from each of 6 lots, randomize which of the 6 sets gets the control vs experimental process
– EU: each group of 100 units– OU: each unit, 100 OUs per EU
Experimental vs Observational Units
Residuals
• Residual: the difference between an observation and the predicted value, measured along the y-axis

• Least-squares regression finds the equation of the curve that minimizes the sum of the squared residuals
(a) Unbiased and homoscedastic. The residuals average to zero in each thin vertical strip and the SD is the same all across the plot.
(b) Biased and homoscedastic. The residuals show a linear pattern, probably due to a lurking variable not included in the experiment.
(c) Biased and homoscedastic. The residuals show a quadratic pattern, possibly because of a nonlinear relationship. Sometimes a variable transform will eliminate the bias.
(d) Unbiased, but heteroscedastic. The SD is small to the left of the plot and large to the right: the residuals are heteroscedastic.
(e) Biased and heteroscedastic. The pattern is linear.
(f) Biased and heteroscedastic. The pattern is quadratic.
Confounding / Multicollinearity
• CONFOUNDING is when you cannot distinguish between the effects of two or more factors

• Confounding introduces bias and increases variance in your data

• EXAMPLE:
– You want to test if raw materials from vendors A and B are equivalent. You run lots from vendor A on Line 1, and lots from vendor B on Line 2.
– If you see a difference, is it due to an actual difference in the materials, or a difference in the LINES?
– If you can't tell, then the effects of these two factors, lines and raw materials, are said to be confounded

• Can happen in poorly planned experiments

• Also happens when you analyze historical data, if the data collection run sheets do not capture the appropriate variables
Confounding / Multicollinearity
• Variance Inflation Factor (VIF) is a measure of confounding
• VIF = 1: there is no confounding in your model
• VIF < 3: partial confounding, but we can still trust p-values and conclusions
• VIF > 3: confounding; don't trust p-values or conclusions
[Regression output figures: a "BAD!" model and a "GOOD!" model]

In this example there was major confounding because 3 of the factors were NOT independent (weight, waist, bmi). Look at the VIFs and p-values.

Removing those factors from the model and re-computing using only one factor (weight), a good model was obtained.
Confounding / Multicollinearity

• Principal Components Analysis can be used to visually detect collinearity or correlation among the factors, and to identify factors that are highly collinear

• [Minitab] Stat > Multivariate > Principal Components; enter the variables, select Graphs, and check Loading Plot

• In this example, Principal Components Analysis shows high collinearity among 3 factors: weight, bmi, and waist
[Figure: Loading Plot of waist, height, weight, and bmi – the waist, weight, and bmi vectors point in nearly the same direction, indicating high collinearity.]
Outliers

[Scatterplot with three points flagged "Outlier?"]
Outliers
• Observation that lies an abnormal distance from other values in a random sample
• May indicate the presence of special-cause variation
• May provide important information about the manufacturing process and about the data gathering and recording process
• Outliers can have a big impact on the model equation
Use studentized residuals (a.k.a. standardized residuals) to detect outliers.

If the studentized residual for a specific value falls outside ±3, you are roughly 99.7% confident that the value is a potential outlier.
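That rule can be sketched as a simple screen over the residuals (this divides each raw residual by the residual standard deviation; true studentized residuals also correct for each point's leverage, which Minitab computes for you):

```python
from statistics import stdev

def flag_outliers(residuals, threshold=3.0):
    """Return indices whose standardized residual falls outside
    +/- threshold (a simplified stand-in for studentized residuals)."""
    s = stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r / s) > threshold]

# Hypothetical residuals: one wild value among small ones
res = [0.1, -0.1, 0.1, -0.1] * 5 + [10.0]
flagged = flag_outliers(res)   # flags only the last observation
```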
Regression Example

• Simple linear regression will be used to determine if there is a linear relationship between the time in hours a machine is used (input) and off-target values in millimeters (output), with 95% confidence.

• If there is a relationship, then we will determine what that relationship is. Simple linear regression is appropriate since both input and output are continuous variables.

• Random samples will be taken: 5 parts every 12 hours during a 1-week period, for a total of 60 parts.

• The assumptions of normality and equal variances will be checked by graphical means using residuals.

• Studentized residuals will be used to determine any potential outliers; any studentized residual greater than 3 or lower than −3 will be considered a potential outlier.
With a p-value < 0.0001 we are 95% confident that there is a linear relationship between the time (hrs) a machine is used and the off-target value (mm).

This relationship is modeled by the following equation (with R²adj = 79.96%):

Off-target value = 0.04233 + 0.000805 × HRS

With a p-value of 0.614 we are at least 95% confident that there is no lack of fit.

p-values for the factor and the response are both < 0.001; therefore both terms belong in the model.

Two data points were flagged as possible outliers: 42 and 60. Their studentized residuals are within ±3, therefore they will not be treated as outliers.

VIF = 1, therefore there is no confounding present in the model.

S = root mean squared error = standard deviation of the residuals.
[Residual plots for off-target value: normal probability plot of residuals, residuals vs. fitted values, histogram of residuals, residuals vs. observation order]
The scatterplot of off-center vs. hours shows that the relationship between the response and the factor is linear. The R²adj supports this statement.

Assumptions:
1. Normality: the normal probability plot and the histogram support the assumption of normality.
2. Homogeneity: the residuals vs. fits plot shows that the residuals are random. This supports the assumption of homogeneity.
3. Independence: the residuals vs. order plot does not show obvious trends or patterns; this supports the assumption of independence.
(c) 2016 / Raul Soto 74
Coefficient of Determination (R²)
• R² always increases as we add terms (variables) to the model… not good
• R²adj can be used instead: R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors
• R²adj actually decreases if unnecessary terms are added to the model equation
• If R² and R²adj differ significantly, there is a good chance the model includes non-significant terms
(c) 2016 / Raul Soto 75
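A quick numerical sketch of this behavior with simulated data. The helper name `r2_and_adj` is mine; here p counts every column of the design matrix including the intercept, so the adjustment factor is (n − 1)/(n − p):

```python
import numpy as np

def r2_and_adj(X, y):
    """Return R^2 and adjusted R^2 for a least-squares fit (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, p = X.shape                                   # p = number of terms incl. constant
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)        # penalizes extra terms
    return r2, r2_adj

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 40)
y = 1 + 2 * x + rng.normal(0, 1, 40)
X1 = np.column_stack([np.ones(40), x])               # true model: constant + x
X2 = np.column_stack([X1, rng.normal(size=40)])      # plus one unnecessary noise term
r2_1, adj_1 = r2_and_adj(X1, y)
r2_2, adj_2 = r2_and_adj(X2, y)
print(f"R2: {r2_1:.4f} -> {r2_2:.4f}   R2adj: {adj_1:.4f} -> {adj_2:.4f}")
```

R² can only go up when the noise column is added, while R²adj is penalized for the extra term.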
Analysis of Variance (ANOVA)
HOW TO COMPARE MULTIPLE DATA SETS
(c) 2016 / Raul Soto 76
ANOVA
• ANalysis Of VAriance
• Used to compare multiple groups and determine if at least one of them is different from the others
• Compares two types of variability: group-to-group vs. within-group
• How can I use this?
– Compare output from multiple manufacturing lines making the same product; determine if they are all equivalent
– Compare lots from multiple vendors
– Compare 1st, 2nd, and 3rd shifts
(c) 2016 / Raul Soto 77
ANOVA - Assumptions
• Equal variances
– Assumes that “within group” variances are equal
– Use Levene’s Test to check. If variances are unequal, use Welch’s ANOVA
• Normality (means)
– Assumes that the means of the groups follow a normal distribution
– Use a normal probability plot or a normality test
• Independence
– Assumes that one data point does not influence the value of the next data point
– Randomization of runs helps
(c) 2016 / Raul Soto 78
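A sketch of how these checks might look in code, using hypothetical weight data for three lines and SciPy’s `levene` and `f_oneway`:

```python
from scipy import stats

# hypothetical product weights from three manufacturing lines
line1 = [10.2, 10.4, 10.3, 10.5, 10.4, 10.3]
line2 = [10.0, 10.1, 9.9, 10.0, 10.1, 10.0]
line3 = [10.0, 9.9, 10.1, 10.0, 10.0, 10.1]

# Step 1: check the equal-variances assumption with Levene's test
lev_stat, lev_p = stats.levene(line1, line2, line3)

# Step 2: if variances look equal, run the one-way ANOVA
f_stat, anova_p = stats.f_oneway(line1, line2, line3)

print(f"Levene p = {lev_p:.3f}, ANOVA p = {anova_p:.4f}")
```

With this data, Levene’s test is not significant (equal variances are plausible) while the ANOVA p-value is small, suggesting at least one line mean differs.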
ANOVA - Example
• Question of interest:
– Is the mean product weight equivalent for Lines 1, 2, and 3?
(c) 2016 / Raul Soto 79
[Test for Equal Variances: weight vs line. Multiple comparison intervals for the standard deviation, α = 0.05. Multiple comparisons p-value = 0.817; Levene’s Test p-value = 0.991. If intervals do not overlap, the corresponding stdevs are significantly different.]

Levene’s test p-value of 0.991 => we have more than 95% confidence that the equal-variances assumption is valid
With a p-value of <0.001 we are at least 95% confident that at least one of the lines makes product with weight that is not equivalent to the others
(c) 2016 / Raul Soto 80
[Individual value plot of weight vs line]
Tukey’s comparisons show that we are at least 95% confident that:
• Lines 2 and 3 are equivalent
• The mean weight for Line 1 is greater than for Lines 2 and 3
(c) 2016 / Raul Soto 81
Regression vs ANOVA?

Regression:
• Response (dependent variable) is quantitative
• Factors (independent variables) are quantitative
• Goal is to determine if the response and factors are related; if they are, then to determine how they are related

ANOVA:
• Response (dependent variable) is quantitative
• Factors (independent variables) are categorical (qualitative)
• Goal is to determine if multiple factors give (on average) different or similar values of the response

(c) 2016 / Raul Soto 82
Design of Experiments
HOW TO PROPERLY DESIGN EXPERIMENTS, COMBINE REGRESSION AND ANOVA, AND CREATE MATHEMATICAL MODELS THAT CAN HELP YOU OPTIMIZE YOUR PROCESS
(c) 2016 / Raul Soto 83
Design of Experiments (DoE)
• “Design of Experiments is a test, or series of tests, in which purposeful changes are made to input variables of a process or system so that the changes in the output variables can be observed and identified” – Douglas C. Montgomery
• We deliberately make changes to the process Inputs, and observe the results on the process Outputs
(c) 2016 / Raul Soto 84
VERY powerful process optimization tool
DoE & Response Surface Methods
(c) 2016 / Raul Soto 85
[Diagram: PROCESS, with INPUTS (X’s, FACTORS) flowing in and OUTPUTS (Y’s, RESPONSES) flowing out]

Y = a0 + a1X1 + a2X2 + a3X3 + a4X1X2 + …

DOE allows us to characterize our process response as a function of the process inputs
DoE
(c) 2016 / Raul Soto 86
DoE is much more efficient than the typical one-factor-at-a-time (OFAT) approach practiced almost everywhere. It provides more information about the process with fewer runs, and it takes interactions between factors into account; OFAT does not address interactions.

DoE allows us to:
• determine which inputs (factors) have a significant impact on the process output (response)
• discover interactions between significant factors
• create a mathematical model of the response in terms of the significant factors
• include attribute data in the factors
• use this model to optimize the process, by determining the appropriate settings of the factors required to bring the response to a desired target
(c) 2016 / Raul Soto 87
Unbalanced OFAT experimental design

Coating          | Day 0 | Day 3 | Day 7
Collagen 5 mg    | 1     | 0.88  | 2.02
Collagen 10 mg   | 1     | 0.85  | 2.34
Fibrin 5 mg      | 1     | 0.90  | 2.07
Fibrin 10 mg     | 1     | 0.94  | 3.03
No Coat + Cells  | 1     | 1.19  | 2.34
No Coat - Cells  | 1     | 1.12  | 0.99
[Plot of the unbalanced design points on Collagen/Fibrin dose axes (0, 5, 10) for Days 3 and 7]
• Unbalanced design => no orthogonality
• No orthogonality => confounding of the effects of factors
• Leads to:
– unreliable p-values
– low R²adj fitness
– a poor mathematical model
Montgomery, Douglas C. Design and Analysis of Experiments. 4th. New York: Wiley, 1997. Print.
(c) 2016 / Raul Soto 88
Balanced Factorial Experimental design
• Better balance
• Orthogonal
• Less confounding
– Better p-values
– Better R²adj fitness
– Better mathematical model
• FEWER experimental treatments (8 instead of 15)
[Plot of the balanced design points on Collagen/Fibrin dose axes (0, 10) for Days 3 and 7]
Factorial Design
Run | A | B | C
 1  | + | + | +
 2  | + | + | -
 3  | + | - | +
 4  | + | - | -
 5  | - | + | +
 6  | - | + | -
 7  | - | - | +
 8  | - | - | -
(c) 2016 / Raul Soto 89
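The eight treatment combinations in the table above can be generated, and the design’s balance and orthogonality verified, with a few lines of code (a sketch using coded +1/−1 levels):

```python
from itertools import product

# generate all 2^3 = 8 treatment combinations for factors A, B, C
runs = list(product([1, -1], repeat=3))
for i, (a, b, c) in enumerate(runs, start=1):
    print(f"run {i}: A={a:+d} B={b:+d} C={c:+d}")

# balance: each factor is tested at + and - equally often (column sums are zero)
col_sums = [sum(run[j] for run in runs) for j in range(3)]

# orthogonality: any pair of factor columns has zero dot product
ab = sum(run[0] * run[1] for run in runs)
print(col_sums, ab)
```

Zero column sums and zero pairwise dot products are exactly what “balanced” and “orthogonal” mean for a coded two-level design.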
[Cube plot of the 2³ factorial design: factors A, B, C; corners from (-, -, -) to (+, +, +)]
+ = maximum setting of a factor
- = minimum setting of a factor
(c) 2016 / Raul Soto 90
Balanced 3-factor models

[Figures: Central Composite Design (CCD) and Box-Behnken Design]

Source: http://www.mathworks.com/help/toolbox/stats/f56635.html
Many Experimental Design models!
• Full factorial• Fractional Factorials• Central Composite Design (CCD)• Box-Behnken• Taguchi Methods• Latin Squares• Graeco-Latin Squares• Hyper-Graeco-Latin Squares• Optimal Designs (D, A, G, V – Optimals)
(c) 2016 / Raul Soto 91
Blocking / Covariates
• Often you will need to BLOCK a factor that is not controllable to add ROBUSTNESS
• Use for non-controllable factors that add variation to the process
• Typical factors blocked: raw materials variation from lot to lot, operators, shifts, machines.
• For example, to block the effect of raw-material lot-to-lot variation, use multiple lots in your validation, and know which lots are used in which tests
– You can determine statistically what % of your total variation is due to lot-to-lot variation (Variance Components, a.k.a. Random Effects Model)
(c) 2016 / Raul Soto 92
Interactions
• An interaction occurs when the effect of one input factor on the response depends on the level of another input factor
• Correct modeling of interactions between factors is key to building accurate mathematical models with predictive power.
• Interaction terms show up in your model equation as: Response Y = c0 + c1A + c2B + c3AB (where A and B are the factors and the cn are coefficients)
• DOE allows us to take into account interactions between factors. Simple linear regression does not.
(c) 2016 / Raul Soto 93
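A sketch of how an interaction column enters the model fit, using simulated data with arbitrarily chosen coefficients (5, 2, 1, and 4 for the constant, A, B, and A·B):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
A = rng.uniform(-1, 1, n)
B = rng.uniform(-1, 1, n)
# simulated process with a strong A*B interaction plus small noise
y = 5 + 2 * A + 1 * B + 4 * A * B + rng.normal(0, 0.1, n)

# the design matrix carries an explicit interaction column A*B
X = np.column_stack([np.ones(n), A, B, A * B])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))   # recovers roughly [5, 2, 1, 4]
```

Leaving the A·B column out of the design matrix would force the interaction effect into the residuals, which is why one-factor-at-a-time fitting misses it.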
Interactions

[Interaction plots: AB interaction, AC interaction, BC interaction]

(c) 2016 / Raul Soto 94
Masking / Blinding
• MASKING or BLINDING is when you don’t let testers know which treatments are being applied.
• If the experimenter does not know either, it’s a DOUBLE-BLIND test
• Used to eliminate human bias and preconceptions.
(c) 2016 / Raul Soto 95
DoE Sequence
• Screening Experiment
– Used to evaluate the effect of many factors on the desired response
– Determine which factor(s) have a significant effect on the response
• Modeling / Optimization Experiment
– Used to develop a mathematical model of the response, using the factors that were found to be significant in the screening experiment
– Determine which settings of the factors allow us to optimize the response
– Some software packages, such as Minitab, allow us to optimize multiple responses simultaneously
(c) 2016 / Raul Soto 96
DoE Example
• Optimize adhesive dispensing process
• Process factors:
– Pressure (P)
– Pump Voltage (V)
– A screening experiment was performed previously; it determined that these two factors (out of 8 total) had significant impact on the response
• Process response:
– Variable: Adhesive weight (W), target = 9 ± 1 mg
• DOE model selected: Central Composite Design
– 2 factors, 3 levels per factor (max, mid, min)
– CCD tests for nonlinear (quadratic) terms in the mathematical model
(c) 2016 / Raul Soto 97
(c) 2016 / Raul Soto 98
Goodness-of-Fit: very good. R-Sq = 98.0%, R-Sq(adj) = 97.6%

Terms with significant effect on the response:
• Voltage
• Pressure
• Voltage × Voltage quadratic term
• Constant

Terms with no significant effect on the response:
• Voltage × Pressure interaction
• Pressure × Pressure quadratic term
Mathematical Model for weight (uncoded units):

W = 2.9 + 2.8·V + 0.03·P + 0.3·V²

(c) 2016 / Raul Soto 99
Response Surface (Optimization):
The contour plot allows us to determine the settings of V and P necessary to dispense 9 ± 1 mg
This allows us to set process parameters based on actual scientific data, not guesswork.
(c) 2016 / Raul Soto 100
Confidence, Tolerance, and Prediction Intervals
HOW TO MAKE CLAIMS ABOUT YOUR DATA’S MEAN, OR ABOUT ALL UNITS
(c) 2016 / Raul Soto 101
Intervals
• Allow us to make statistically based claims or statements with a specified degree of confidence
• Can be used to Accept / Reject lots for quantitative product characteristics
• Easier for non-statisticians to understand intuitively than p-values or mean / stdev
o Weight: x̄ = 8.5 mg, s = 0.02   vs.   Weight: [8.44 – 8.56 mg]
• Different types of intervals allow you to make different claims
(c) 2016 / Raul Soto 102
Types of Intervals
• Confidence Interval
– Makes a claim about the mean, the variance, or a proportion of a sample
• “We are 95% confident that the mean capsule body length falls within this interval: 0.874 ± 0.018 in”
• “We are 99% confident that the mean capsule external diameter falls within this interval: 7.72 – 8.64 mm”
• Prediction Interval
– Range that contains the response value of a single observation, given specified settings of the predictors
• Used often in Regression
• When you use your regression model to predict the value of your response for a specific input value, a prediction interval gives you a range around that value based on a given % confidence
(c) 2016 / Raul Soto 103
Types of Intervals
• Tolerance Interval
– Range that contains a specified proportion (%) of the population
– “There is 95% confidence that 99% of the population meets the following specification: 2.000 ± .016 mm”
• This example is called a 95%/99% tolerance interval
• Can be used as the basis for product release if the interval is within specifications
• You can run your validation lots, compute a tolerance interval, and compare this interval against specifications. If the interval is within the specification, you can make the claim.

• Which one should I use?
http://www.qualitydigest.com/inside/quality-insider-column/when-should-i-use-confidence-intervals-prediction-intervals.html
(c) 2016 / Raul Soto 104
Confidence Interval
• Example:
• Mechanical attachment process; the input factor is Pressure (psi), the response is pull strength (lb)
• You want to know the mean pull strength, with 95% confidence
• Confidence Interval for the mean (Minitab):

One-Sample T: y
Variable   N   Mean   StDev   SE Mean   95% CI
y         12   2.792  0.421   0.122     (2.524, 3.059)

• Claim: We are 95% confident that the true mean pull strength is between 2.524 and 3.059 lb.
(c) 2016 / Raul Soto 105
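The same interval can be reproduced from the summary statistics alone (a sketch using SciPy’s t distribution; it matches the Minitab output up to rounding of the summary values):

```python
import math
from scipy import stats

# summary statistics for pull strength (from the Minitab output above)
n, mean, sd = 12, 2.792, 0.421

t_crit = stats.t.ppf(0.975, df=n - 1)        # two-sided 95%, 11 degrees of freedom
margin = t_crit * sd / math.sqrt(n)          # t * (standard error of the mean)
lo, hi = mean - margin, mean + margin
print(f"95% CI for the mean: ({lo:.3f}, {hi:.3f})")
```

The t critical value (rather than the normal z) is what accounts for the small sample size.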
Confidence Interval for the Mean
(c) 2016 / Raul Soto 106
Prediction Interval for a Forecasted Value
• Example:
• We perform a regression test to model how pull strength changes as a function of pressure
• We want to know what the pull strength will be at a pressure setting of 75 psi, with 95% confidence
• Using Minitab:
(c) 2016 / Raul Soto 107
• Claim: At a pressure setting of 75 psi, we are 95% confident that the pull strength will be between 1.97 and 2.98 lbs
Prediction Interval for a Forecasted Value
(c) 2016 / Raul Soto 108
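A sketch of the underlying computation for a simple-linear-regression prediction interval. The data here is made up (not the slide’s dataset) and the helper name `prediction_interval` is mine:

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x0, conf=0.95):
    """Prediction interval for a single new observation at x0 (simple linear regression)."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)                     # slope, intercept
    resid = y - (b0 + b1 * x)
    s = np.sqrt(resid @ resid / (n - 2))             # residual standard error
    sxx = float(((x - x.mean()) ** 2).sum())
    se = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    y0 = b0 + b1 * x0                                # point prediction
    return y0 - t * se, y0 + t * se

# made-up pressure / pull-strength data for illustration
rng = np.random.default_rng(5)
pressure = np.linspace(50, 100, 12)
strength = 0.5 + 0.03 * pressure + rng.normal(0, 0.2, 12)
lo, hi = prediction_interval(pressure, strength, 75)
print(f"95% PI at 75 psi: ({lo:.2f}, {hi:.2f})")
```

The leading “1 +” inside the square root is what makes a prediction interval wider than the confidence interval for the mean response at the same x0.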
Tolerance Interval (95/95)
• Claim: We are 95% confident that 95% of the population will have a pull strength between 1.455 and 4.128 lb.
(c) 2016 / Raul Soto 109
(Minitab)

[Tolerance interval plot for y: 95% tolerance interval, at least 95% of population covered.
Normal method: lower 1.455, upper 4.128
Nonparametric method: lower 2.100, upper 3.500 (achieved confidence 11.8%)
Statistics: N = 12, Mean = 2.792, StDev = 0.421. Normality test: AD = 0.148, p-value = 0.950]
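A sketch of how the two-sided normal tolerance factor can be approximated in code (Howe’s approximation, applied to the slide’s summary statistics; Minitab’s exact method gives a slightly different factor):

```python
import math
from scipy import stats

def tolerance_k(n, coverage=0.95, conf=0.95):
    """Two-sided normal tolerance factor via Howe's approximation."""
    z = stats.norm.ppf((1 + coverage) / 2)           # normal quantile for the coverage
    chi2 = stats.chi2.ppf(1 - conf, df=n - 1)        # lower chi-square quantile
    return math.sqrt((n - 1) * (1 + 1 / n) * z ** 2 / chi2)

n, mean, sd = 12, 2.792, 0.421                       # summary stats from the slide
k = tolerance_k(n)                                   # 95%/95% factor
lo, hi = mean - k * sd, mean + k * sd
print(f"95%/95% tolerance interval: ({lo:.3f}, {hi:.3f})")
```

Note that k is well above the naive 1.96: the factor inflates the interval to account for the uncertainty in the estimated mean and standard deviation.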
Tolerance Interval (95/99)
• Claim: We are 95% confident that 99% of the population will have a pull strength between 1.042 and 4.541 lb.
(c) 2016 / Raul Soto 110
(Minitab)

[Tolerance interval plot for y: 95% tolerance interval, at least 99% of population covered.
Normal method: lower 1.042, upper 4.541
Nonparametric method: lower 2.100, upper 3.500 (achieved confidence 0.6%)
Statistics: N = 12, Mean = 2.792, StDev = 0.421. Normality test: AD = 0.148, p-value = 0.950]
Other important statistical tools / techniques
• Variance Components (a.k.a. Random Effects Model)
– Determine the sources of variation in a process, and their relative contributions to the total variation
– Example: Gage R&R / Measurement Systems Capability Analysis
• Statistical Process Control (SPC)
– Monitor a process to detect if it moves out of statistical control
• Acceptance Sampling
– Accept or reject a lot of product based on a sample
• Logistic Regression
– Regression model where the dependent variable is categorical, not quantitative
• Stability Analysis
– Calculate product expiration dates
– Calculate the probability that a given product will survive to a predetermined expiration date
• Fisher’s Exact Test
– Compare two samples of binomial data
• McNemar’s Test
– Compare two paired samples of binomial data
(c) 2016 / Raul Soto 111
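As a sketch of one of these tools, Fisher’s exact test on two hypothetical vendor lots (defect counts invented for illustration), using SciPy:

```python
from scipy import stats

# hypothetical defect counts: [defective, acceptable] for two vendor lots
table = [[3, 47],     # vendor A: 3 defects out of 50
         [12, 38]]    # vendor B: 12 defects out of 50

odds_ratio, p_value = stats.fisher_exact(table)
print(f"Fisher's exact test p-value: {p_value:.4f}")
```

A small p-value here suggests the two vendors’ defect rates differ; unlike the chi-square test, Fisher’s exact test remains valid for small counts.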
Questions
(c) 2016 / Raul Soto 112
Extra Slides
(c) 2016 / Raul Soto 113
Descriptive vs Inferential Statistics
• Descriptive statistics provide information about a population or a sample
– Characterize and summarize the most prominent features of a data set
– Measures of central tendency: Mean, Mode, Median
– Measures of variability: Variance, Standard Deviation, Range
• Inferential statistics use information obtained from a sample to make inferences about the population that sample came from
– Use the sample mean and standard deviation to estimate the population mean and standard deviation
– Draw conclusions, make statements
(c) 2016 / Raul Soto 114
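The descriptive measures listed above are easy to compute with Python’s standard `statistics` module (the fill weights below are hypothetical):

```python
import statistics

# hypothetical fill weights (mg) from one sample
weights = [9.8, 10.0, 10.1, 10.0, 9.9, 10.2, 10.0]

print("mean  :", statistics.mean(weights))             # central tendency
print("median:", statistics.median(weights))
print("mode  :", statistics.mode(weights))
print("stdev :", round(statistics.stdev(weights), 3))  # variability (sample stdev)
print("range :", round(max(weights) - min(weights), 3))
```

`statistics.stdev` is the sample standard deviation (n − 1 denominator), which is the estimator used when inferring about a population from a sample.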
What is “Variability”
THE BASICS
(c) 2016 / Raul Soto 115
RAUL SOTO, MSC, CQEIVT CONFERENCE - MARCH 2016
SAN DIEGO, CA
[Histograms of Day 1 - Day 4 data (N = 30 each; means between 29.79 and 30.28, standard deviations between 0.8977 and 1.092).
Common Cause Variation: process output stable, predictable over time]
(c) 2016 / Raul Soto 116
Variability: Common vs Special Cause• Common cause variability (“natural” variability, general)
– Inherent in the process
– Caused by multiple small factors that act randomly
– Not controllable by operators
– Examples:
• Natural variations in raw materials
• Natural variations in ambient conditions such as temperature and humidity
– Confusing common cause with special cause => Over-adjustment
(c) 2016 / Raul Soto 117
(c) 2016 / Raul Soto 118
(c) 2016 / Raul Soto 119
[Histograms of Day 1 - Day 4 data, N = 60 each:
Day 1: Mean 29.75, StDev 0.9993
Day 2: Mean 30.07, StDev 1.015
Day 3: Mean 30.19, StDev 1.022
Day 4: Mean 30.67, StDev 3.199
Days 1 - 3 look normal; Day 4 shows special cause variation]
Variability: Common vs Special Cause• Special cause variability (“assignable” causes)
– Unusual events that, if detected, can be removed or adjusted
– Caused by one or more specific factors
– Examples:
• Tool wear
• Major changes in raw materials
• Equipment failure
– Confusing special cause with common cause => Under-adjustment
(c) 2016 / Raul Soto 120
Box Plots : Elements– Asterisk : Outlier - an unusually large or small
observation. Values beyond the whiskers are outliers.
– Top of the box : third quartile (Q3) - 75% of the data values are less than or equal to this value
– Upper whisker : the highest data value within the upper limit.
• Upper limit = Q3 + 1.5 (Q3 - Q1)
– Line in the middle of the box : Median, the middle of the data. Half the observations are less than or equal to it.
– Bottom of the box is the first quartile (Q1) - 25% of the data values are less than or equal to this value
– Lower whisker : the lowest value within the lower limit.
• Lower limit = Q1- 1.5 (Q3 - Q1)
(c) 2016 / Raul Soto 121
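The box-plot elements described above can be computed directly (a sketch with made-up data; note that NumPy’s default quartile interpolation can differ slightly from Minitab’s quartile method):

```python
import numpy as np

# hypothetical measurements with one unusually large value
data = np.array([2.1, 2.3, 2.2, 2.5, 2.4, 2.3, 2.2, 2.6, 2.4, 4.0])

q1, median, q3 = np.percentile(data, [25, 50, 75])   # box edges and center line
iqr = q3 - q1                                        # interquartile range
upper_limit = q3 + 1.5 * iqr                         # whisker limits
lower_limit = q1 - 1.5 * iqr
outliers = data[(data > upper_limit) | (data < lower_limit)]
print(f"Q1={q1}, median={median}, Q3={q3}, outliers={outliers}")
```

Only the 4.0 value falls beyond the 1.5 × IQR whisker limit, so it is the single point a box plot would draw as an asterisk.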