Week 11: Basic Descriptive Quantitative Data Analysis: Tables, Graphs, & Summary Statistics

Intro to quant_analysis_students

Page 1: Intro to quant_analysis_students


Page 2: Intro to quant_analysis_students

Objectives

• Learn about basic descriptive quantitative analysis
• How to perform these tasks in Excel
• Starting point for 502B
• Excel knowledge and quantitative skills are highly desired by employers
• EC stream

Page 3: Intro to quant_analysis_students

Introduction

Without data, it is anyone’s opinion

Why use tables, graphs, summary stats?

“At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.”

Why learn how to design them appropriately?

“At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.”

Does your mindset match my dataset! http://www.ted.com/talks/hans_rosling_at_state.html

Page 4: Intro to quant_analysis_students

Quantitative Research Process

Page 4

Page 5: Intro to quant_analysis_students

Introduction

Page 5

Page 6: Intro to quant_analysis_students

Page 6

Presenting the Data

Page 7: Intro to quant_analysis_students

Frequency Distribution

Page 7

A convenient way of summarizing a lot of tabular data

What is a Frequency Distribution?

A frequency distribution is a list or a table …

containing class groupings (categories or ranges within which the data fall) ...

and the corresponding frequencies with which data fall within each class or category

For nominal/ordinal data

Page 8: Intro to quant_analysis_students

Introduction

Page 8

Page 9: Intro to quant_analysis_students

Table 1
Univariate Frequencies of Percentage of Sales Reported to Tax Authorities

% of Sales Reported   Frequency   Percent (%)
100%                       3307         40.56
90-99%                     1096         13.44
80-89%                      916         11.24
70-79%                      703          8.62
60-69%                      501          6.14
50-59%                      694          8.51
<50%                        936         11.48
Total                      8153        100

Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations
http://www.enterprisesurveys.org/
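The Percent column of Table 1 is each frequency divided by the total number of observations. A minimal Python sketch of that arithmetic follows; the frequencies are copied from Table 1, while the function name `relative_frequencies` is an illustrative assumption and not part of the course materials.

```python
# Frequencies copied from Table 1 (1999 WBES, missing observations excluded).
frequencies = {
    "100%": 3307,
    "90-99%": 1096,
    "80-89%": 916,
    "70-79%": 703,
    "60-69%": 501,
    "50-59%": 694,
    "<50%": 936,
}


def relative_frequencies(counts):
    """Return each category's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 2) for category, n in counts.items()}


percent = relative_frequencies(frequencies)

# Print a small frequency table: category, frequency, percent.
for category, share in percent.items():
    print(f"{category:>7}  {frequencies[category]:>5}  {share:>6.2f}")
print(f"{'Total':>7}  {sum(frequencies.values()):>5}")
```

In Excel the same column could be produced by dividing each frequency cell by the total cell.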

Page 10: Intro to quant_analysis_students

Contingency/Pivot/Cross Table

• May also want to produce a table with more categories
• Cross table, contingency table, or pivot table
• Suitable if you have two nominal/ordinal variables
• Simple extension to a univariate table
• Considers the relationship between two variables
• Row variable (dependent)
• Column variable (independent)

Page 11: Intro to quant_analysis_students

Table 2
Percentage of Sales Reported to Tax Authorities by Region

% of Sales    Africa   Transition    Asia     Latin    OECD    Former Soviet    Total
Reported                   Europe           America                Countries
100%             490          554     416       794     446              607    3,307
90-99%           266          196     142       119     145              228    1,096
80-89%           158          152     117       192      73              224      916
70-79%           162          117     103       153      43              125      703
60-69%           140           69      70       115      22               85      501
50-59%           140          105     141       118      16              174      694
<50%             100          106     283       296      25              126      936
Total          1,456        1,299   1,272     1,787     770            1,569    8,153

Source: 1999 World Bank World Business Environment Survey (WBES)

* Excludes missing observations
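Because the regions have different sample sizes, a cross table is often easier to compare after converting each column to percentages of its column total. The sketch below does this for the counts in Table 2; the names `table2`, `column_totals`, and `column_percentages` are illustrative assumptions.

```python
# Counts copied from Table 2; columns follow the order of `regions`.
regions = ["Africa", "Transition Europe", "Asia",
           "Latin America", "OECD", "Former Soviet Countries"]
table2 = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}


def column_totals(table):
    """Sum each column (region) over all row categories."""
    n_cols = len(next(iter(table.values())))
    return [sum(row[j] for row in table.values()) for j in range(n_cols)]


def column_percentages(table):
    """Express every cell as a whole-number percentage of its column total."""
    totals = column_totals(table)
    return {
        category: [round(100 * value / total) for value, total in zip(row, totals)]
        for category, row in table.items()
    }


shares = column_percentages(table2)
for category, row in shares.items():
    print(f"{category:>7}", *(f"{p:>4}%" for p in row))
```

Read column by column, the output shows how the share of firms reporting 100% of sales differs across regions.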

Page 12: Intro to quant_analysis_students

Features of a Table

• Title that accurately summarizes the data
• Simple, indicates major variables, and time frame (if applicable)
• Source: data set or origin of table
• Explanatory footnotes
• Easy to read & separated from text
• Properly formatted for style (see APA Rules)
• Necessary to advance analysis
• See Module 7 for APA Table Checklist

Reproduced from APA manual

Page 13: Intro to quant_analysis_students

Page 13

Presenting the Data

Page 14: Intro to quant_analysis_students

Bar Graph

Page 14

Often used to describe categorical data Ordinal/Nominal

Draws attention to the frequency of each category


Page 16: Intro to quant_analysis_students

Bar Graph

Page 16

Figure 1

Percentage of sales reported to tax authority

Source: 1999 World Bank World Business Environment Survey (WBES)

Note. Excludes missing observations. n = 8314

Page 17: Intro to quant_analysis_students

Relative Frequency Polygon

17

Page 18: Intro to quant_analysis_students

Pie Graph

• Emphasizes the proportion of each category
• Something that may be good for our tax evasion data
• Circle represents the total
• Segments represent the shares of the total
• Segment size is proportional to frequency

Page 19: Intro to quant_analysis_students

Pie Graph

19

Figure 1

Percentage of sales reported to tax authority

Source: 1999 World Bank World Business Environment Survey (WBES)

Note. Excludes missing observations. n = 8314


Page 22: Intro to quant_analysis_students

Charts in Excel I

22


Page 24: Intro to quant_analysis_students

Bar Graph

Page 24

Figure 1

Percentage of sales reported to tax authority

Source: 1999 World Bank World Business Environment Survey (WBES)

Note. Excludes missing observations. n = 8314

Page 25: Intro to quant_analysis_students


Segmented Bar Chart

Figure 1

Percentage of sales reported to tax authority

Source: 1999 World Bank World Business Environment Survey (WBES)

Note. Excludes missing observations. n = 8314

Page 26: Intro to quant_analysis_students

Pie Graph

[Figure: six pie charts, one per region, showing the share of firms in each reporting category; data labels in %]

% of Sales    Africa and    Transition    Asia     Latin    OECD    Former Soviet
Reported     Middle East        Europe           America                    Union
100%                  34            43      33        44      58               39
90-99%                18            15      11         7      19               15
80-89%                11            12       9        11       9               14
70-79%                11             9       8         9       6                8
60-69%                10             5       6         6       3                5
50-59%                10             8      11         7       2               11
<50%                   7             8      22        17       3                8

Source: World Business Environment Survey

Figure 2

Percentage of sales reported to tax authority by region

Source: 1999 World Bank World Business Environment Survey (WBES)

Note. Excludes missing observations. n = 8314

Page 27: Intro to quant_analysis_students

Vertical Bar Chart

27

Page 28: Intro to quant_analysis_students

Charts in Excel II


Page 29: Intro to quant_analysis_students

Time Series Graph

• Time series are often used in social sciences
• Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc.
• Examples include GDP, unemployment, university tuition
• Plot series of interest over time

Let’s look at a graph of the unemployment rate by gender and age

Page 30: Intro to quant_analysis_students

Line Graph

Page 30

Page 31: Intro to quant_analysis_students

Histogram

• Used for continuous data
• Frequency distribution for continuous data
• Summary graph showing the count of the data points falling in various ranges
• Rough approximation of the distribution of the data
• A histogram is a way to summarize data
• The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data
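To make "count of the data points falling in various ranges" concrete, here is a small Python sketch that bins continuous values into equal-width ranges and prints a text histogram. The data (`ages`), the bin settings, and the function name `histogram_counts` are hypothetical and chosen only for illustration.

```python
def histogram_counts(values, bin_start, bin_width, n_bins):
    """Count how many values fall in each bin [start, start + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - bin_start) // bin_width)
        if 0 <= index < n_bins:  # values outside the binned range are ignored
            counts[index] += 1
    return counts


# Hypothetical ages, not taken from the course data.
ages = [21, 23, 25, 31, 34, 38, 39, 42, 47, 55]
counts = histogram_counts(ages, bin_start=20, bin_width=10, n_bins=4)

# One row per bin; the bar length is the frequency in that range.
for i, count in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")
```

Changing `bin_width` changes the apparent shape, which is one reason a histogram is only a rough approximation of the distribution.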

Page 32: Intro to quant_analysis_students

Histogram

32

Page 33: Intro to quant_analysis_students


Scatter Graphs

Graphs relationship between two continuous variables

Page 34: Intro to quant_analysis_students

Scatter Graph

34

Page 35: Intro to quant_analysis_students

Principles of Graphical Excellence

• Well-designed presentation of interesting data
• Substance & design
• Simplicity of design, complexity of data
• Proportion and balance
• Clear, precise, efficient
• Know what you are trying to show (have a story) and make sure your graph shows it
• Well formatted, professional
• Choose a format that reflects your data and the story
• Informative and legible axes
• Fully labelled & legible
• Gets across the main point(s) in the shortest time with the least ink in the smallest space
• Adds information not otherwise available to the reader
• But supplemented with text describing the figure
• Tells the truth about the data
• Limits complexity and confusion
• Avoid chartjunk

Page 36: Intro to quant_analysis_students

Examples of Chartjunk

[Figure: two example charts with quarterly categories (1st Qtr to 4th Qtr) and legend series West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International]

Page 37: Intro to quant_analysis_students

Examples of Chartjunk

[Figure: quarterly chart (1st Qtr to 4th Qtr) annotated with the following callouts]

• Gridlines!
• Vibration
• Pointless fake 3-D effects
• Filled “floor”
• Clip art
• In or out?
• Filled “walls”
• Borders and fills galore
• Unintentional heavy or double lines
• Filled labels
• Serif font with thin & thick lines

Page 38: Intro to quant_analysis_students

Displaying Data: “Mistakes”

Graphs are also instruments of evil used for deceiving a naive viewer.

• Non-zero origin
• Omitting data that refutes your “evidence”
• Limiting scope of data

Page 39: Intro to quant_analysis_students

What is Wrong with this Graph?

Provincial Personal Income Taxes
Single individual with $45,000 in income claiming basic personal tax credits

Page 40: Intro to quant_analysis_students

The Real Story

40

Page 41: Intro to quant_analysis_students

Exaggerates a change in data

Source: Statistics Canada, CANSIM II, V31215364

Page 42: Intro to quant_analysis_students

Dr. Kendall

42

Page 43: Intro to quant_analysis_students

Worst Recession Since the Depression (?)

43

Page 44: Intro to quant_analysis_students

Page 44

Presenting the Data

Page 45: Intro to quant_analysis_students

Describing Data Numerically

• Central tendency: simple arithmetic mean, median, mode
• Variation: range, variance, standard deviation
• Association: covariance, correlation
• Shape of the distribution

Page 46: Intro to quant_analysis_students

Mode

• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode or several modes
• What are the modes for the displayed data?

[Figure: two dot plots, on scales of 0 to 14 and 0 to 6]

Page 47: Intro to quant_analysis_students

Mode

• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

[Figure: dot plot on a scale of 0 to 14; Mode = 9]

[Figure: dot plot on a scale of 0 to 6; No Mode]

Page 48: Intro to quant_analysis_students

Mode

There may be several modes

[Figure: dot plot on a scale of 0 to 14; Modes = 5 & 9]

Page 49: Intro to quant_analysis_students

Mode

Caution: the mode may not be representative of the data, e.g. {0.1, 0.1, 5000, 4900, 4500, 5200, …}
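The three cases on the mode slides (one mode, several modes, no mode) can be sketched in Python as below. The helper name `modes` and the convention of returning an empty list when every value occurs only once are assumptions made for illustration; the first example uses only the values visible in the cautionary set, and the other lists are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs exactly once
    return sorted(v for v, c in counts.items() if c == top)


# Visible part of the cautionary set: the mode is far from the bulk of the data.
print(modes([0.1, 0.1, 5000, 4900, 4500, 5200]))

# Hypothetical examples of several modes and of no mode.
print(modes([5, 5, 9, 9, 7]))
print(modes([0, 1, 2, 3, 4, 5, 6]))
```

In the first call the mode is 0.1 even though most observations are in the thousands, which is the caution the slide raises.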

Page 50: Intro to quant_analysis_students

Median

In an ordered list, the median is the “middle” number (50% above, 50% below)

[Figure: two dot plots on scales of 0 to 10]

Page 51: Intro to quant_analysis_students

Mean

The “balancing point” (centre of gravity) of the data

E.g. the data “balance” at 5

[Figure: values on a scale of 1 to 9 with deviations of -2, -1, and +3 from the balancing point at 5]

Page 52: Intro to quant_analysis_students

Arithmetic Mean

The arithmetic mean (mean) is the most common measure of central tendency

Calculated by summing the observed values and dividing by the number of observations

For a sample of size n:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where $x_1, \ldots, x_n$ are the observed values and $n$ is the number of observations

Page 53: Intro to quant_analysis_students

Arithmetic Mean

• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
• What is the mean for these examples?

[Figure: two dot plots on scales of 0 to 10]

Page 54: Intro to quant_analysis_students

Arithmetic Mean

54

The most common measure of central tendency

Mean = sum of values divided by the number of values

Affected by extreme values (outliers)

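[Figure: dot plot on a scale of 0 to 10; Mean = 3]

$$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$

[Figure: dot plot on a scale of 0 to 10; Mean = 4]

$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$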
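The effect of an outlier on the mean can be checked with Python's standard `statistics` module. The two lists below are the values that appear in the slide's worked sums (1 to 5, and the same values with 5 replaced by 10); the variable names are illustrative.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 5]
values_with_outlier = [1, 2, 3, 4, 10]

# The mean moves from 3 to 4 when one value becomes extreme...
print(mean(values), mean(values_with_outlier))

# ...while the median stays at 3 in both cases.
print(median(values), median(values_with_outlier))
```

This is the sense in which the mean is "affected by extreme values" while the median is not.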

Page 55: Intro to quant_analysis_students

Measures of Central Tendency

Overview

• Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: midpoint of ranked values (50% below, 50% above)
• Mode: most frequently observed value

Page 56: Intro to quant_analysis_students

The “Shape of a Distribution”

• Use information on mean, median, and mode to “visualize” the data
• A data distribution is said to be symmetric if its shape is the same on both sides of the median
• Symmetry implies that median = arithmetic mean
• If a distribution is unimodal and symmetric, then median = mean = mode

Page 57: Intro to quant_analysis_students

The “Shape of a Distribution”

[Figure: histogram of # of obs. against value (1 to 7); the median splits the data 50%/50%; the distribution is symmetric with a single peak]

Symmetric: Median = Mean

Symmetric & unimodal: Median = Mean = Mode

Page 58: Intro to quant_analysis_students

The “Shape of a Distribution”

[Figure: histogram of # of obs. against value (1 to 7); the median splits the data 50%/50%; the distribution is symmetric with two peaks]

Symmetric: Median = Mean

Symmetric & bimodal: Median = Mean ≠ Mode

Page 59: Intro to quant_analysis_students

The “Shape of a Distribution”

[Figure: histogram of # of obs. against values (1 to 7); the median splits the data 50%/50%; all bars are the same height]

Symmetric: Median = Mean

Symmetric & no mode: Median = Mean (uniform distribution)

Page 60: Intro to quant_analysis_students

The “Shape of a Distribution”

An asymmetric distribution is said to be skewed

1. Negatively if Mean < Median < Mode
2. Positively if Mean > Median > Mode

Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
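The comparison rule above can be written as a short Python sketch. It applies the slide's ordering rule literally and returns "inconclusive" when the ordering does not hold or when there is not exactly one mode; the function name `skew_from_centres` and both data sets are hypothetical.

```python
from statistics import mean, median, multimode


def skew_from_centres(values):
    """Classify skew by comparing mean, median and mode (rule of thumb only)."""
    modes = multimode(values)
    if len(modes) != 1:
        return "inconclusive"  # the rule needs a single mode
    m, med, mode = mean(values), median(values), modes[0]
    if m > med > mode:
        return "positive"
    if m < med < mode:
        return "negative"
    return "inconclusive"


# Hypothetical data with a long right tail (mode 2, median 3, mean about 3.2).
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 7]
# Hypothetical data with a long left tail (mode 6, median 5, mean about 4.8).
left_tailed = [1, 3, 4, 5, 5, 6, 6, 6, 7]

print(skew_from_centres(right_tailed))
print(skew_from_centres(left_tailed))
```

The ordering is a rule of thumb for reading a distribution from its summary statistics, not a formal test of skewness.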

Page 61: Intro to quant_analysis_students

The “Shape of a Distribution”

[Figure: histogram on values 1 to 8; Mode = 2, Median = 3 (50% of observations on each side), Mean = 3.2]

Mode < Median < Mean = positively skewed distribution

Page 62: Intro to quant_analysis_students

Example: Positively skewed variable

The Distribution of After-Tax Income shows the distribution of income across all Canadian households

Page 63: Intro to quant_analysis_students

Example: Positively skewed variable

63

The mode income is the most common income and was in the range from $15,000 to $19,999.

The median income is the level of income that separates the population into two groups of equal size and was $39,700.

The mean income is the average income and was $48,400.

Page 64: Intro to quant_analysis_students

Example: Positively skewed variable

64

A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.

The distribution of income in Canada is positively skewed.

The median is usually reported rather than the mean, since the long tail distorts the average.

Page 65: Intro to quant_analysis_students

Example: Positively skewed variable

• Volunteer hours
• Charitable contributions
• # of cigarette packs smoked (excluding 0)
• Collective bargaining agreement duration (in years)
• # of beers consumed on a Saturday night
• Duration of low income (in years)
• Number of children

Page 66: Intro to quant_analysis_students

The “Shape of a Distribution”

[Figure: histogram on values 0 to 7; Mode = 6, Median = 5 (50% of observations on each side), Mean = 4.7]

Mean < Median < Mode = negatively skewed distribution

Page 67: Intro to quant_analysis_students

Examples

• University grades
• Age
• Years in school
• Etc.

Page 68: Intro to quant_analysis_students

Describing Data Numerically

• Central tendency: simple arithmetic mean, median, mode
• Variation: range, variance, standard deviation
• Association: covariance, correlation
• Shape of the distribution

Page 69: Intro to quant_analysis_students

Same center, different variation

Measures of Dispersion/Variability

69

Variation

Variance Standard Deviation

Range

Measures of variation give information on the spread or variability of the data values.

Page 70: Intro to quant_analysis_students

Range

• Simplest measure of variation
• Difference between the largest and the smallest observations:

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$

Example:

[Figure: dot plot on a scale of 0 to 14]

Page 71: Intro to quant_analysis_students

Range

• Simplest measure of variation
• Difference between the largest and the smallest observations:

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$

Example:

[Figure: dot plot on a scale of 0 to 14]

Range = 14 - 1 = 13

Page 72: Intro to quant_analysis_students

The Range

• Problem
• Ignores all but two data points
• These values may be “outliers” (i.e. not representative)

Page 73: Intro to quant_analysis_students

Disadvantages of the Range

Ignores the way in which data are distributed

[Figure: two differently distributed dot plots on a scale of 7 to 12; in both, Range = 12 - 7 = 5]

Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
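The sensitivity of the range to a single outlier can be reproduced with the two lists shown on this slide. In the Python sketch below the data are copied from the slide, while the function name `value_range` is an illustrative assumption.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# The two data sets from the slide; they differ only in the last value.
without_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                   2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print(value_range(without_outlier))
print(value_range(with_outlier))
```

Changing one observation out of 25 moves the range from 4 to 119, even though the rest of the data are identical.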

Page 74: Intro to quant_analysis_students

The Variance

• A single summary measure of dispersion would be more helpful
• Takes account of all data values

Page 75: Intro to quant_analysis_students

The Variance

1. Variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Standard Deviation

$$s = \sqrt{s^2} = \sqrt{\text{variance}}$$

Page 76: Intro to quant_analysis_students

Measuring variation

76

Small standard deviation

Large standard deviation

Page 77: Intro to quant_analysis_students

Comparing Standard Deviations

[Figure: three dot plots on a common scale of 11 to 21]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
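The sample variance and standard deviation formulas can be sketched in Python as follows. The individual data points behind Data A, B, and C are not available in this transcript, so the list below is a hypothetical data set chosen only because its mean and standard deviation match the values reported for Data A; the function names are likewise illustrative.

```python
from math import sqrt
from statistics import stdev


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Standard deviation is the square root of the variance."""
    return sqrt(sample_variance(values))


# Hypothetical data (assumption): not taken from the slide.
illustrative = [11, 12, 13, 16, 16, 17, 18, 21]

print(sum(illustrative) / len(illustrative))  # mean
print(round(sample_std(illustrative), 3))     # standard deviation
print(round(stdev(illustrative), 3))          # same result from the standard library
```

The hand-written functions follow the n - 1 formula; `statistics.stdev` from the standard library uses the same divisor, so the two results agree.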

Page 78: Intro to quant_analysis_students

Describing Data Numerically

• Central tendency: simple arithmetic mean, median, mode
• Variation: range, variance, standard deviation
• Association: covariance, correlation
• Shape of the distribution

Page 79: Intro to quant_analysis_students

The Sample Covariance

The covariance measures the direction and, in the variables' own units, the strength of the linear relationship between two variables

The sample covariance:

$$\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Only concerned with the strength of the relationship

No causal effect is implied

Page 80: Intro to quant_analysis_students

Interpreting Covariance

80

Covariance between two variables:

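Cov(x,y) > 0: x and y tend to move in the same direction

Cov(x,y) < 0: x and y tend to move in opposite directions

Cov(x,y) = 0: x and y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence)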

Page 81: Intro to quant_analysis_students

Coefficient of Correlation

Measures the relative strength of the linear relationship between two variables

Sample correlation coefficient:

$$r = \frac{\text{Cov}(x, y)}{s_X s_Y}$$

Page 82: Intro to quant_analysis_students

Features of Correlation Coefficient, r

• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
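A short Python sketch can illustrate why r is called "unit free" while the covariance is not. It computes both from their definitions; the data (`hours`, `score`) and the function names are hypothetical and chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y); the variance is the covariance of x with itself."""
    s_x = sqrt(sample_covariance(xs, xs))
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical data: y rises exactly in step with x.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
score_rescaled = [100 * s for s in score]  # same data in different units

print(sample_covariance(hours, score), sample_covariance(hours, score_rescaled))
print(round(sample_correlation(hours, score), 6),
      round(sample_correlation(hours, score_rescaled), 6))
```

Rescaling one variable multiplies the covariance by the same factor but leaves r unchanged, which is why r is the easier of the two to interpret.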

Page 83: Intro to quant_analysis_students

Interpreting the Correlation Coefficient, r

83

Page 84: Intro to quant_analysis_students

Scatter Plots of Data with Various Correlation Coefficients

[Figure: six scatter plots of Y against X, labelled as follows]

• r = -1, Cov < 0
• r = -.6, Cov < 0
• r = 0, Cov = 0
• r = +.3, Cov > 0
• r = +1, Cov > 0
• r = 0, Cov = 0

Page 85: Intro to quant_analysis_students

502B

85

Page 86: Intro to quant_analysis_students

Fun with Graphs

Does your mindset match my dataset! http://www.ted.com/talks/hans_rosling_at_state.html

Page 87: Intro to quant_analysis_students

Looking ahead

• SRs to client (cc) and Turnitin on Wednesday by noon
• No class next week
• Work on 598 critiques
• 598 critiques due in class & Turnitin Nov. 30
• Comments on your SRs will be ready Nov. 30
• Final SRs (if required) due Dec. 8 @ 11:55 PM PST
• Note carefully the requirements
• Moodle site will be inaccessible sometime in December
• Final grades reported via usource once approved by the Director