dale-anderson
EC339: Lecture 3
Chapter 2: Correlation and univariate analysis: A step back
Histograms
Way more important than you might think!
Look at the data’s range (ex: sample has ages 18-64)
Divide into bins
There are formulas, but for all intents and purposes, make it look nice.
Bin widths should be round numbers if possible
Easier in SPSS than Excel (first, create a new dataset)
Go through ‘SATHist’ and ‘3DHist’
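The advice to use round bin widths can be sketched in Python. This is only an illustration, not the SPSS or Excel procedure: the function names `round_bin_edges` and `histogram_counts`, the target of roughly ten bins, and the 1-2-5 rounding rule are assumptions made here, and the ages are made-up values spanning 18-64.

```python
import math

def round_bin_edges(values, target_bins=10):
    """Bin edges whose width is 1, 2, or 5 times a power of ten (assumed rule).

    Requires at least two distinct values so that the range is positive.
    """
    lo, hi = min(values), max(values)
    raw_width = (hi - lo) / target_bins            # width if roundness were ignored
    magnitude = 10 ** math.floor(math.log10(raw_width))
    # Smallest "round" width that is at least as large as the raw width.
    width = next(m * magnitude for m in (1, 2, 5, 10) if m * magnitude >= raw_width)
    start = math.floor(lo / width) * width         # first edge at or below the minimum
    n_bins = math.floor((hi - start) / width) + 1  # last edge lands above the maximum
    return [start + i * width for i in range(n_bins + 1)]

def histogram_counts(values, edges):
    """Count how many values fall in each bin [edge, next edge)."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[int((v - edges[0]) // width)] += 1
    return counts

# Hypothetical sample of ages between 18 and 64.
ages = [18, 19, 22, 25, 31, 40, 47, 52, 64]
edges = round_bin_edges(ages)
counts = histogram_counts(ages, edges)
```

With these ages the raw width is 4.6 years, which the sketch rounds up to bins of 5 years starting at 15.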
Correlation Data in SPSS
Excel histograms are in the ‘SATHist’ sheet
Create an SPSS *.sav file with this data
Use Chartbuilder to create a histogram
Double-click on the result, and go to “Elements > Add distribution curve”
This histogram is done using “Histogram” in SPSS. The next two are done using “Histogram Percent”, which relates much better to the normal curve that we will be using repeatedly.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Only need to know the mean ($\mu$) and standard deviation ($\sigma$) to plot ANY normal curve
Points created in Excel using Normal equation. See my version of ‘SATScatter’
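As a rough Python counterpart to generating the curve points in Excel, the sketch below evaluates the normal density on an evenly spaced grid. The function names, the grid of 61 points, the span of three SDs on either side of the mean, and the mean of 600 and SD of 80 are all assumptions chosen for illustration; they are not values from the SAT data.

```python
import math

def normal_pdf(x, mean, sd):
    """Normal density at x for a given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def curve_points(mean, sd, n_points=61, width_in_sds=3.0):
    """(x, density) pairs from mean - 3 SD to mean + 3 SD by default."""
    lo = mean - width_in_sds * sd
    step = 2 * width_in_sds * sd / (n_points - 1)
    return [(lo + i * step, normal_pdf(lo + i * step, mean, sd))
            for i in range(n_points)]

# Hypothetical mean and SD, used only to show the call.
points = curve_points(mean=600, sd=80)
```

Only the mean and the SD enter the calculation, which is the point of the slide: changing either one shifts or stretches the same bell shape.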
Open Correlation.xls, SATHist tab
Comparing Verbal and Math SAT entrance scores for Wabash College
Do you think there should be a relationship? Positive or negative?
Which has the higher average?
What does the standard deviation tell you?
Correlation
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
Using this equation: Open the “Wine_Exercise.xls” Spreadsheet and complete this table. With this data, you should also create a scatter plot, show the trendline with regression line equation, and calculate the SD line slope and intercept. We will calculate the regression line slope and intercept when we go over chapter 4. You can do what the spreadsheet says now, and we will get back to it later. See the [Correlation.xls]corr sheet to walk through another example of calculating correlation coefficients.
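The calculation can be sketched in Python as the average product of standardized deviations. This sketch assumes the sample form of the formula (SDs and the final average both use n - 1), and the function name `correlation` and the five data pairs are made up for illustration; they are not the wine data.

```python
import statistics

def correlation(xs, ys):
    """Correlation coefficient r as the (n - 1)-averaged product of
    standardized deviations. Assumes sample SDs (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, not taken from any of the course spreadsheets.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
```

For these five points the deviations multiply to a sum of 6, and the sums of squared deviations are 10 and 6, so r works out to 6 divided by the square root of 60, about 0.775.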
Wine Exercise
SATScatter Sheet
What does the picture tell you about the relationship?
SD (Standard Deviations) Line
If you increase x by one SD, y changes by one SD along the line (up if r is positive, down if r is negative).
Slope has the sign of the correlation coefficient
Passes through the point of averages
Remember the “point-slope” method to find the equation of a line
Average x and y lines meet at the point of averages
$$\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$
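A minimal Python sketch of the SD line follows: the slope is SD(y)/SD(x) with the sign of r, and the intercept comes from forcing the line through the point of averages. The function name `sd_line`, the use of sample SDs, the choice to treat r = 0 as positive, and the data are assumptions for illustration only.

```python
import statistics

def sd_line(xs, ys):
    """Return (slope, intercept) of the SD line.

    The slope is SD(y)/SD(x), signed like the correlation; the line passes
    through (mean x, mean y). Treating r = 0 as positive is an assumption.
    """
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # The sign of r equals the sign of the summed products of deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sign = 1 if cross >= 0 else -1
    slope = sign * sd_y / sd_x
    intercept = mean_y - slope * mean_x   # point-slope form solved for the intercept
    return slope, intercept

# Hypothetical data, not the SAT scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept = sd_line(x, y)
```

Plugging the average of x back into the line returns the average of y, which is a quick check that the line goes through the point of averages.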
Extreme Sheet
-1 ≤ r ≤ 1: correlation is BOUNDED between -1 and 1
Sign of r: is the relationship positive or negative?
Size of r: how strong is the relationship?
Look for a “cigar-shaped” cloud
Use the Patterns sheet to play with different correlation coefficients. Note that you don’t get a pronounced “cigar-shaped” cloud until r is 0.9 or higher.
3D Histogram: Multivariate Analysis
Correlation
r is used to measure the degree of linear association between two variables, but it is not perfect. A high r should never be used to infer causation, and r may do a poor job of summarizing the relationship.
r is the sample estimate of ρ (“rho”), the population value
Association is not causation
Correlation Dangers
Twice the r does not mean twice as much clustering
r doesn’t tell you about the slope of a relationship
Misleading summary, in exactly the same way that the average and SD are sometimes not enough to describe a list of numbers (e.g., unsymmetrical histogram or outliers): see Patterns sheet
View misleading correlations in corr sheet
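Two of these dangers can be illustrated with a short Python sketch on made-up numbers (not the Patterns or corr sheets). The helper `correlation` is written here for illustration and assumes sample SDs. The first pair of series has very different slopes but the same r; the second has a perfect, but non-linear, relationship that r misses entirely.

```python
import statistics

def correlation(xs, ys):
    """Correlation coefficient r using sample SDs (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

x = [-3, -2, -1, 0, 1, 2, 3]

# Same r, different slopes: y rises 2 per unit of x in one case, 10 in the other.
r_shallow = correlation(x, [2 * v for v in x])
r_steep = correlation(x, [10 * v for v in x])

# Exact relationship, but not linear: y is the square of x.
r_curved = correlation(x, [v * v for v in x])
```

Both straight lines give an r of 1 despite the fivefold difference in slope, while the parabola gives an r of essentially zero even though y is completely determined by x.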
Misleading r: Patterns Worksheet
Show misleading r
Cycle through r
You can change parameters yourself to see how these values might change
SD Line: Has positive slope, if r is positive (negative if r is negative). The slope is std(math)/std(verbal) in this case. (std(y)/std(x) generally).
The SD line goes THROUGH the intersection of the means.
SPSS Version of SATScatter
Costa Rica Example (CRExample)
In this example, what does the SDLine column signify?
Does this SD Line look like a good ‘fit’ of the data?
Aggregation Problem a.k.a. Ecological Correlation
Ex: If you average data by some grouping, you almost always obtain a different correlation than if you take the correlation of individuals.
Correlation at group level suppresses individual variation
See [EcolCorr.xls] and [EcolCorrCPS.xls]
Walk through live sheet (F9 is the key here)
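The aggregation effect can be reproduced with a small simulation. This Python sketch does not use the workbook data; it invents 20 groups of 200 individuals whose x and y share a group-level effect but carry large individual noise. The function names, group sizes, noise level, and seed are all assumptions for illustration. Changing the seed plays roughly the role of pressing F9 in the live sheet.

```python
import random
import statistics

def correlation(xs, ys):
    """Correlation coefficient r using sample SDs (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def simulate(n_groups=20, per_group=200, noise_sd=3.0, seed=0):
    """Return (r for individuals, r for group averages) on simulated data."""
    rng = random.Random(seed)
    all_x, all_y, mean_xs, mean_ys = [], [], [], []
    for g in range(n_groups):
        # Group effects are spread evenly from -2 to 2 (an arbitrary choice).
        effect = -2.0 + 4.0 * g / (n_groups - 1)
        gx = [effect + rng.gauss(0, noise_sd) for _ in range(per_group)]
        gy = [effect + rng.gauss(0, noise_sd) for _ in range(per_group)]
        all_x += gx
        all_y += gy
        # Averaging within the group washes out most individual noise.
        mean_xs.append(statistics.mean(gx))
        mean_ys.append(statistics.mean(gy))
    return correlation(all_x, all_y), correlation(mean_xs, mean_ys)

r_individual, r_group = simulate()
```

Because averaging removes most of the individual variation, the group-level correlation comes out far higher than the individual-level one, even though both are computed from the same simulated people.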
EcolCorrCPS

Data averaged at the state level

Summary Statistics
            Educ     Age      Earnings
Average     13.50    43.45    32540
SD          0.33     0.58     4775
Max         14.22    45.38    43079
Min         12.88    41.91    23290

Correlation Table
            Educ     Age      Earnings
Educ        1
Age         -0.17    1
Earnings    0.63     -0.30    1

Individual Data

Summary Statistics
            Educ     Age      Earnings
Average     13.49    43.38    33356
SD          2.53     8.18     43753
Max         46       60       361315
Min         31       30       -9999

Correlation Table
            Educ     Age      Earnings
Educ        1
Age         0.003    1
Earnings    0.336    0.040    1
[Figure: scatter plot. Vertical axis: Earnings in 2001 ($), from -$50,000 to $400,000. Horizontal axis: 30 to 60.]
Note the vastly different correlations between earnings and education (0.63 for state averages vs. 0.336 for individuals) and especially between age and earnings (-0.30 vs. 0.040). Grouping your data can be VERY misleading.
Correlation Lab
Open CorrelationLab.doc
Use associated files: StockReturns.xls, Hitters1999.xls