
Practical Significance of Meta-Analysis Findings Lipsey


Page 1: Practical Significance of Meta-Analysis Findings Lipsey


The Campbell Collaboration www.campbellcollaboration.org

Applied topics:

Interpreting the Practical Significance of Meta-Analysis Findings

Mark Lipsey

Co-Chair, The Campbell Collaboration Co-Editor-in-Chief, Campbell Systematic Reviews

Director, Peabody Research Institute, Vanderbilt University, USA


The problem

•  The effect size statistics that constitute the direct findings of a meta-analysis often provide little insight into the nature, magnitude, or practical significance of the effects they represent.

•  Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful in an applied context.

•  Example: The mean standardized mean difference effect size (Cohen's d or Hedges' g) for the effects of educational interventions with middle school students on standardized reading tests is about .15 and statistically significant.

–  Seems small: Is .15 large enough to have practical significance for improving the reading skills of middle school students?

•  Most important to recognize: There is no necessary relationship between the numerical magnitude of an effect size and the practical significance of the effect it represents!

Page 2: Practical Significance of Meta-Analysis Findings Lipsey


A widely used but inappropriate and misleading characterization of effect sizes

• Statistical effect sizes assessed by Cohen’s small (.20), medium (.50) and large (.80) categories

–  Impressionistic norms across a wide range of outcomes in social and behavioral research

–  Almost never are these the appropriate norms for the particular outcomes of a particular intervention

• Comparing an obtained mean effect size with norms can be informative, but those norms must be appropriate to the context, intervention, nature of the outcomes, etc. [more on this later]


Two approaches to review here

1.  Descriptive representations of intervention effect sizes:
–  Translations of effect sizes into forms that are more readily interpreted.
–  Supports better intuitions about the practical significance of the effect size.

2.  Direct assessment of practical significance:
–  Assessing statistical effect sizes in relationship to criteria that have recognized practical value in the context of application.
–  Requires that appropriate criteria be used; different criteria may yield different conclusions.

Page 3: Practical Significance of Meta-Analysis Findings Lipsey


Useful Descriptive Representations of Intervention Effect Sizes


Back translation to an original metric

•  Useful when the original metric is readily interpretable; not so useful when it is in arbitrary units.

•  Example: Mean Phi coefficient for effects of intervention on the reoffense rates of juvenile offenders < .20, allegedly trivial.

•  Computation of the Phi coefficient as an effect size:

        Reoffend (failure)   Don't Reoffend (success)
Tx      a = p                b = 1-p                     a+b = 1
Ct      c = q                d = 1-q                     c+d = 1
        a+c = p+q            b+d = (1-p)+(1-q)

Phi = (ad - bc) / SQRT((a+b)(c+d)(a+c)(b+d))
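As a numeric check on the formula above, the following minimal sketch (plain Python, standard library only) computes Phi for a hypothetical 2×2 table of row proportions:

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) for a 2x2 table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Rows as proportions, as in the table above: Tx = (p, 1-p), Ct = (q, 1-q).
# The .30/.70 vs .50/.50 values are illustrative.
phi = phi_coefficient(a=0.30, b=0.70, c=0.50, d=0.50)
print(round(abs(phi), 2))  # magnitude ~ .20; the sign depends on cell orientation
```

Note that with row proportions the numerator reduces to p - q, so Phi here directly reflects the difference between the two groups' failure rates.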

Page 4: Practical Significance of Meta-Analysis Findings Lipsey


Back translation to original metric: Phi coefficient example

•  Mean reoffense rate for the control groups in the studies was .50.
•  Some algebra (or trial & error in a spreadsheet) yields the reoffense rate of the average treatment group required to produce Phi = .20.
•  [Note: A similar procedure would work for an odds ratio effect size as well.]

      Reoffend (failure)   Don't Reoffend (success)
Tx    .30                  .70                         1.00
Ct    .50                  .50                         1.00
      .80                  1.20

Phi = .20 thus means an average .20 reduction in the reoffense rate from a .50 average baseline value; that is, a 40% decrease in the reoffense rate. Hardly trivial!
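The "trial & error in a spreadsheet" step can also be done in a few lines of code. A sketch, assuming a .50 control reoffense rate and searching by bisection for the treatment rate that produces Phi = .20:

```python
from math import sqrt

def phi(p, q=0.50):
    """|Phi| for a 2x2 table with row proportions Tx = (p, 1-p), Ct = (q, 1-q)."""
    return abs(p - q) / sqrt((p + q) * ((1 - p) + (1 - q)))

# Find the treatment reoffense rate p (below the .50 control baseline)
# that yields Phi = .20, by bisection on [0, .50].
lo, hi = 0.0, 0.50
for _ in range(60):
    mid = (lo + hi) / 2
    if phi(mid) > 0.20:   # Phi shrinks as p rises toward the control rate
        lo = mid
    else:
        hi = mid

print(round(lo, 2))  # ~.30: a .20 reduction from the .50 baseline rate
```

The recovered rate of about .30 matches the table above.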


Back translation to original metric: Standardized test example

•  Suppose the mean standardized mean difference effect size for intervention effects on vocabulary tests is .30

•  The most frequently used measure of vocabulary in the contributing studies was the Peabody Picture Vocabulary Test (PPVT)

•  The PPVT has a normed standard score of 100 with a standard deviation of 15. Differences in standard scores are readily understood by researchers and practitioners familiar with standardized tests

•  The control groups in the studies using the PPVT had a mean standard score of 87.

•  How much improvement in the PPVT standard score is represented by an effect size of .30?
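With the PPVT's normed standard deviation of 15, the back translation is one multiplication. A minimal sketch using the slide's numbers:

```python
# Back-translating d = .30 to PPVT standard-score points (normed SD = 15).
effect_size = 0.30
ppvt_sd = 15
control_mean = 87             # mean standard score of the control groups

gain = effect_size * ppvt_sd  # points gained on the standard-score scale
treatment_mean = control_mean + gain
print(gain, treatment_mean)   # a 4.5-point gain, moving the average from 87 to 91.5
```

So an effect size of .30 corresponds to moving the average treated student from a standard score of 87 to about 91.5.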

Page 5: Practical Significance of Meta-Analysis Findings Lipsey


Back translation to original metric: PPVT


Intervention effect sizes represented as percentiles on the normal distribution

[Figure: percentile values on the control distribution of the intervention effect, in standard deviation units]

Page 6: Practical Significance of Meta-Analysis Findings Lipsey


Translating effect sizes into percentiles from a table of areas under the normal curve


The percentage of the treatment group that is above the control group mean is Cohen’s U3 index

Effect Size   Proportion above the Control Mean   Additional Proportion above Original Mean
 .10          .54                                 .04
 .20          .58                                 .08
 .30          .62                                 .12
 .40          .66                                 .16
 .50          .69                                 .19
 .60          .73                                 .23
 .70          .76                                 .26
 .80          .79                                 .29
 .90          .82                                 .32
1.00          .84                                 .34
1.10          .86                                 .36
1.20          .88                                 .38
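Under the usual normality and equal-variance assumptions, the U3 values in the table are just the standard normal CDF evaluated at the effect size. A quick check, using only the standard library:

```python
from math import erf, sqrt

def u3(d):
    """Cohen's U3: proportion of the treatment group above the control mean,
    assuming normal distributions with equal SDs. U3 = Phi(d), the standard
    normal CDF at d."""
    return 0.5 * (1 + erf(d / sqrt(2)))

for d in (0.10, 0.50, 1.00):
    print(d, round(u3(d), 2))  # .54, .69, .84 -- matching the table
```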

Page 7: Practical Significance of Meta-Analysis Findings Lipsey


Rosenthal and Rubin Binomial Effect Size Display (BESD)

[Figure: BESD display for d = .80]


BESD representations of SMD and correlation ESs

Effect Size   r     Proportion of Control / Intervention Cases above the Grand Median   BESD (Difference between the Proportions)
 .10          .05   .47 / .52                                                           .05
 .20          .10   .45 / .55                                                           .10
 .30          .15   .42 / .57                                                           .15
 .40          .20   .40 / .60                                                           .20
 .50          .24   .38 / .62                                                           .24
 .60          .29   .35 / .64                                                           .29
 .70          .33   .33 / .66                                                           .33
 .80          .37   .31 / .68                                                           .37
 .90          .41   .29 / .70                                                           .41
1.00          .45   .27 / .72                                                           .45
1.10          .48   .26 / .74                                                           .48
1.20          .51   .24 / .75                                                           .51
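The BESD columns can be reproduced from the standard d-to-r conversion r = d / sqrt(d² + 4) (which assumes equal group sizes), with the display proportions taken as .50 ± r/2. A minimal sketch:

```python
from math import sqrt

def besd(d):
    """Rosenthal & Rubin's BESD: convert d to r (equal group sizes assumed),
    then display the control and intervention proportions above the grand
    median as .50 - r/2 and .50 + r/2."""
    r = d / sqrt(d * d + 4)
    return r, 0.5 - r / 2, 0.5 + r / 2

r, ctrl, tx = besd(0.50)
print(round(r, 2), round(ctrl, 2), round(tx, 2))  # .24, .38, .62 -- as in the table
```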

Page 8: Practical Significance of Meta-Analysis Findings Lipsey


Even better, use an inherently meaningful threshold

•  Suppose we have a mean standardized mean difference effect size of .23 for the effects of treatment for depression on outcome measures of depression.

•  For many measures of depression, a threshold score has been determined for the range that constitutes clinical levels of depression.

•  Suppose, then, that we can determine from at least a subset of representative studies that the average proportion of the control groups whose scores are in the clinical range is 64%.

•  Assuming that depression scores are normally distributed, we can then use this proportion and the effect size to determine the average proportion in the clinical range for the treatment groups.

•  From that we find the proportion of clinically depressed patients moved out of the clinical range by the treatment.
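The steps above can be sketched numerically. A minimal illustration, assuming standard normal scores (`statistics.NormalDist` is in the Python standard library):

```python
from statistics import NormalDist

nd = NormalDist()                      # standard normal distribution
p_control = 0.64                       # proportion of controls in the clinical range
es = 0.23                              # mean standardized mean difference

z_threshold = nd.inv_cdf(p_control)    # ~0.36: the clinical cutoff in z units
p_treatment = nd.cdf(z_threshold - es) # treatment distribution shifted up by es
print(round(z_threshold, 2), round(p_treatment, 2))  # 0.36, 0.55
```

The treated proportion in the clinical range drops from 64% to about 55%, the 9-point differential worked through on the following slides.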


Proportions of T and C samples above and below a meaningful reference value

[Figure: treatment and control distributions with a success threshold marking the proportion above and the proportion below it in each group]

Page 9: Practical Significance of Meta-Analysis Findings Lipsey


Using a table of areas under the normal curve

•  64% of the area of the normal curve is below Z = .36.
•  Subtracting ES = .23 SD from Z = .36 gives Z = .13, with 55% of the area of the normal curve below.


The mean effect size of .23 indicates that, on average, the intervention reduced the number of clinically depressed patients from 64% to 55%, a 9% differential.

[Figure: control distribution with 64% below and 36% above the clinical threshold; treatment distribution with 55% below and 45% above]

Page 10: Practical Significance of Meta-Analysis Findings Lipsey


The more general point

•  With some understanding of the nature of the effect size index you are working with …

•  and some understanding of the context of the intervention and what might be an interpretable representation of the magnitude of the intervention effect on the outcomes of interest,

•  it will almost always be possible to translate any effect size or mean effect size into a form that facilitates interpretation of its practical significance.


Direct Assessments of Practical Significance

Page 11: Practical Significance of Meta-Analysis Findings Lipsey


Assessing the practical significance of effect sizes requires a criterion from the context of application

•  Neither the numerical value of an effect size nor its statistical significance is a valid indicator of the practical significance of the effect.

•  Translating the numerical value into terms easier to understand facilitates an intuitive assessment of practical significance, but is inherently subjective.

•  A more direct assessment of practical significance can often be made by comparing the effect size with an appropriate criterion drawn from the context of application and, therefore, meaningful in that context.

•  The clinical and normative thresholds used as examples in the previous section are a step in that direction, but more can be learned from a more fully developed criterion framework.


Examples of some criterion frameworks that can be used to assess the practical significance of intervention effect sizes

E.g., compare the mean effect size found with:
•  Established normative expectations for change
•  Effects others have found on similar measures with similar interventions
•  Policy-relevant performance gaps
•  Intervention costs (not discussed here)

Some examples from education follow (this happens to be where we have done a lot of work recently).

Page 12: Practical Significance of Meta-Analysis Findings Lipsey


Benchmarking against normative expectations for change from test norming samples

Data compiled from national norms for standardized achievement tests:
•  Up to seven tests were used for reading, math, science, and social science.
•  The mean and standard deviation of the scores for each grade were obtained from the test manuals.
•  The standardized mean difference effect size across succeeding grades was computed.


Annual achievement gain: Mean effect sizes across 7 nationally-normed tests

Grade Transition   Reading   Math   Science   Social Studies
K – 1              1.52      1.14    --        --
1 – 2               .97      1.03    .58       .63
2 – 3               .60       .89    .48       .51
3 – 4               .36       .52    .37       .33
4 – 5               .40       .56    .40       .35
5 – 6               .32       .41    .27       .32
6 – 7               .23       .30    .28       .27
7 – 8               .26       .32    .26       .25
8 – 9               .24       .22    .22       .18
9 – 10              .19       .25    .19       .19
10 – 11             .19       .14    .15       .15
11 – 12             .06       .01    .04       .04

Adapted from Bloom, Hill, Black, and Lipsey (2008). Spring-to-spring differences. The means shown are the simple (unweighted) means of the effect sizes from all or a subset of seven tests: CAT5, SAT9, Terra Nova-CTBS, Gates-MacGinitie, MAT8, Terra Nova-CAT, and SAT10.

Page 13: Practical Significance of Meta-Analysis Findings Lipsey


Mean effect size relative to the effect size for achievement gain from pretest baseline

[Figure: Gain from the beginning to end of pre-K on a summary achievement measure for children who participated in pre-K compared to children who did not. The nonparticipants' pre-post gain corresponds to an effect size of .82 SD; the mean intervention effect size of .31 SD is a 38% increase over that baseline gain.]
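The "38% increase" annotation is simply the mean intervention effect size taken as a fraction of the normative control-group pre-post gain. A one-line check using the slide's numbers:

```python
# Benchmarking the mean intervention effect size (.31 SD) against the
# normative pre-post gain of the nonparticipant comparison group (.82 SD).
intervention_es = 0.31
control_gain_es = 0.82

relative_gain = intervention_es / control_gain_es
print(f"{relative_gain:.0%}")  # ~38% increase over the normative gain
```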


Benchmarking against effect sizes for achievement from random assignment studies of education interventions

Data in our current compilation:
•  124 random assignment studies
•  181 independent subject samples
•  829 effect size estimates

Page 14: Practical Significance of Meta-Analysis Findings Lipsey


Achievement effect sizes by grade level and type of achievement test

Grade Level & Achievement Measure   N of ES Estimates   Mean   SD
Elementary School                   693                 .28    .46
  Standardized test (broad)         89                  .08    .27
  Standardized test (narrow)        374                 .25    .42
  Specialized topic/test            230                 .40    .55
Middle School                       70                  .33    .38
  Standardized test (broad)         13                  .15    .33
  Standardized test (narrow)        30                  .32    .26
  Specialized topic/test            27                  .43    .48
High School                         66                  .23    .34
  Standardized test (broad)         --                  --     --
  Standardized test (narrow)        22                  .03    .07
  Specialized topic/test            43                  .34    .38


Page 15: Practical Significance of Meta-Analysis Findings Lipsey


Achievement effect sizes by target recipients

Target Recipients                   Number of ES Estimates   Mean ES   SD
Individual students (one-on-one)    252                      .40       .53
Small groups (not classrooms)       322                      .26       .40
Classroom of students               176                      .18       .41
Whole school                        35                       .10       .30
Mixed                               44                       .30       .33


Benchmarking against policy-relevant demographic performance gaps

•  Effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups

•  Effect size gaps for groups may vary across grades, years, tests, and districts

Page 16: Practical Significance of Meta-Analysis Findings Lipsey


Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes

Subject & Grade   Black-White   Hispanic-White   Eligible-Ineligible for FRPL (free/reduced-price lunch)
Reading
  Grade 4         1.09          1.03             .86
  Grade 8         1.02          1.14             .68
  Grade 12        1.11          1.16             .58
Math
  Grade 4          .95           .71             .68
  Grade 8         1.11          1.07             .58
  Grade 12        1.20          1.12             .51

Adapted from Bloom, Hill, Black, and Lipsey (2008). District local outcomes are based on SAT-9 scaled scores for tests administered in spring 2000, 2001, and 2002. SAT 9: Stanford Achievement Tests, 9th Edition (Harcourt Educational Measurement, 1996).


Benchmarking against performance gaps between "average" and "weak" schools

Main idea:
•  What is the performance gap (in effect size) for the same types of students in different schools?

Approach:
•  Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
•  Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution.

Page 17: Practical Significance of Meta-Analysis Findings Lipsey


Performance gaps between average (50th percentile) and weak (10th percentile) schools in 4 districts as effect sizes

Subject & Grade   District A   District B   District C   District D
Reading
  Grade 3         .31          .18          .16          .43
  Grade 5         .41          .18          .35          .31
  Grade 7         .25          .11          .30          NA
  Grade 10        .07          .11          NA           NA
Math
  Grade 3         .29          .25          .19          .41
  Grade 5         .27          .23          .36          .26
  Grade 7         .20          .15          .23          NA
  Grade 10        .14          .17          NA           NA

Adapted from Bloom, Hill, Black, and Lipsey (2008). "NA" indicates that a value is not available due to missing test score data. Means are regression-adjusted for test scores in the prior grade and students' demographic characteristics. The tests are the ITBS for District A, SAT9 for District B, MAT for District C, and SAT8 for District D.


Cost effectiveness as a framework for practical significance: Example for juvenile offender programs

Excerpted from Aos, Phipps, Barnoski, & Lieb, 2001

Page 18: Practical Significance of Meta-Analysis Findings Lipsey


In conclusion …

•  The numerical values of statistical effect size indices for intervention effects provide little understanding of the practical magnitude of those effects.

•  Translating effect sizes into a more descriptive and intuitive form makes them easier for practitioners, policymakers, and researchers to understand and assess.

•  There are a number of easily applied translations that could be routinely used in reporting intervention effect sizes.

•  Directly assessing the practical significance of those effects, however, requires that they be benchmarked against some criterion that is meaningful in the intervention context.

•  Assessing practical significance directly is more difficult, but there are approaches that may be appropriate depending on the intervention and outcome construct.


References

Aos, S., Phipps, P., Barnoski, R., & Lieb, R. (2001). The comparative costs and benefits of programs to reduce crime (Version 4.0). Washington State Institute for Public Policy.

Bloom, H. S., Hill, C. J., Black, A. B., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289-328.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 237-245.

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169.

Page 19: Practical Significance of Meta-Analysis Findings Lipsey


Contact Information

Campbell Collaboration
P.O. Box 7004 St. Olavs plass
0130 Oslo, Norway

E-mail: [email protected]
http://www.campbellcollaboration.org