
Ch25 Exercises


4/20/2007

Mix and Match

For each task or property of a regression model, indicate where or how to look to find the answer.

1. Test H0: β1 = β2 = 0                            a. t-statistic for b1

2. Test H0: β2 = 0                                 b. 1 − R²

3. Test H0: α = 0                                  c. sx2²

4. Effect of collinearity on se(b1)                d. VIF(X1)

5. Correlations among variables                    e. t-statistic for b2

6. Scatterplots among variables                    f. F-statistic

7. Percentage of variation in residuals            g. Scatterplot matrix

8. Test whether adding X1 improves fit of model    h. sx2² / VIF(X2)

9. All variation in X2                             i. t-statistic for b0

10. Unique variation in X2                         j. Correlation matrix

True/False

11. Excess percentage changes in the value of an investment subtract the cost of borrowing from the percentage changes in the value of the investment.

12. The Capital Asset Pricing Model indicates that the estimated intercept in a regression of excess percentage changes in a stock on those in the market is zero.

13. If a multiple regression has a large F-statistic but small t-statistics for each predictor (i.e., the t-statistics for the slopes are near zero), then collinearity is present in the model.

14. The F-statistic is statistically significant only if either t-statistic for a slope in multiple regression is statistically significant.

15. If the R2 of a multiple regression with 2 predictors is larger than 80%, then the regression explains a statistically significant fraction of the variance in y.

16. If the t-statistic for X2 is larger than 2 in absolute size, then adding X2 to the simple regression containing X1 produces a significant improvement in the fit of the model.

17. We can detect outliers by reviewing the summary of the associations in the scatterplot matrix.


18. A correlation matrix summarizes the same information in the data as given in a scatterplot matrix.

19. In order to calculate the VIF for an explanatory variable, we need to use the values of the response.

20. If VIF(X2) = 1, then we can be sure that collinearity has not inflated the standard error of the estimated partial slope for X2.

21. The best remedy for a regression model that has collinear predictors is to remove one of those that are correlated.

22. It is not appropriate to ignore the presence of collinearity because it violates one of the assumptions of the MRM.

Think About It

23. Collinearity is sometimes described as a problem with the data, not the model. Rather than having data that fill the scatterplot of x1 on x2, the data are concentrated along a diagonal. For example, this plot shows monthly excess percentage changes in the whole stock market and the S&P 500. The data span the same 120 months considered in the text, running from 1996 through 2005.

[Scatterplot: Market Excess %Chg (−15 to 5) versus S&P 500 Excess %Chg (−15 to 10)]

a) Data for two months (February 2000 on the left and March 2000 at the right edge of the plot) deviate from the pattern evident in other months. What makes these months unusual?

b) If you were to use both returns on the market and those on the S&P 500 as explanatory variables in the same regression, are these two months leveraged?

c) Would you want to use these months in the regression, or would you exclude them from the multiple regression?

24. Regression models that describe macro-economic properties in the US often have to deal with large amounts of collinearity. For example, suppose we want to use as explanatory variables the disposable income and the amount of household credit debt. Because the economy in the US continues to grow, both of these variables grow as well. Here are timeplots of the two. These data are quarterly, from 1959 through the first quarter of 2006. (Disposable income is in red and credit debt is in green. Yes, indeed, credit debt passed disposable income in 2001.)


[Timeplot: disposable income and household credit debt, in billions of dollars (0 to 12,000), quarterly from 1959 through 2007]

a) This plot shows timeplots of the two series. Do you think that they are correlated? Estimate the correlation.

b) If the variables are expressed on a log scale, will the transformation to logs increase, decrease, or not affect the correlation between these series?

c) If both variables are used as explanatory variables in a multiple regression, will you be able to separate the two?

d) Suggest some alternative approaches to using the information in both series in a multiple regression that avoid some of the issues caused by collinearity.

25. Simple regression fits a line to summarize the relationship between y and x. If you like geometry, then you can think of multiple regression with two predictors as fitting a 2-d plane to summarize the relationship between y and the pair x1 and x2. The data are points “floating” within a 3-d cube. The residuals are the vertical deviations above the plane surface. Here are two pictures of a multiple regression of the prices of used BMW cars on their age and mileage (see the 4M exercise in Chapter 24).

Estimated Price = 40,300 − 1850 Age − 0.12 Mileage

The left view shows the regression surface from the Age side of the cube, and the view on the right shows the fit from the Mileage side.

a) The intersection of two planes produces a line. What is the slope of the line where the regression plane intersects the Age × Price side of the cube (another plane)?

b) You can almost see a plot of the residuals on the fitted values by rotating this cube into the right position. What position approximates the scatterplot of e plotted on ŷ?

26. You can appreciate some of the effects of collinearity by taking a visual 3-d look at a regression model. (See Exercise 23.) These views show the multiple regression for percentage changes in IBM stock considered in this chapter. The fitted equation is

Est IBM Excess %Chg = 0.62 + 0.46 Market Excess %Chg + 0.99 DJ Excess %Chg

The data lie in a cigar-shaped cylinder.


a) Because the data lie in a cigar-shaped cylinder, is the orientation of the plane well-determined, or can we rotate the surface while preserving the fit to the data?

b) If we can move the surface while keeping the same fit to the data, are the slopes well-determined, or do they have large standard errors?

c) If you wanted to “nail down” the slope to a specific position, where would you put a data point? How would adding this point affect the correlation between the two explanatory variables?

27. The version of the CAPM studied in this chapter specifies a simple regression model as

100 (St - rft) = α + 100 β (Mt - rft) + ε

where Mt are the returns on the market, St are the returns on the stock, and rft are the returns on risk-free investments. Hence, 100 (Mt − rft) are the excess percentage changes on the market and 100 (St − rft) are the excess percentage changes on the stock.

What happens if we work with the excess returns themselves rather than the percentage changes? In particular, what happens to the t-statistic for the test of H0: α = 0?
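One way to see the answer is a quick simulation. The sketch below uses made-up data (the numbers and names are ours, not the text's stock series): scaling both variables by 100 rescales the intercept and its standard error by the same factor, so the t-statistic for H0: α = 0 is unchanged.

```python
import math
import random

def intercept_t(x, y):
    """t-statistic for H0: intercept = 0 in the simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                       # residual variance
    se_a = math.sqrt(s2 * (1 / n + mx * mx / sxx))
    return a / se_a

random.seed(1)
x = [random.gauss(0, 1) for _ in range(50)]            # "returns"
y = [0.5 + 1.2 * xi + random.gauss(0, 1) for xi in x]

# Returns versus percentage changes: multiply both variables by 100.
t_returns = intercept_t(x, y)
t_percent = intercept_t([100 * xi for xi in x], [100 * yi for yi in y])
# The two t-statistics agree (up to rounding).
```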

28. The following histograms summarize monthly returns on IBM, the whole stock market, and the risk-free asset over the 10 years 1996–2005. Having seen this comparison, explain why it makes little difference whether we subtract the risk-free rate from the variables in the CAPM regression.

[Three histograms of monthly returns, each on a scale of −0.2 to 0.3: IBM, the whole market, and the risk-free asset]

29. The following correlation matrix and scatterplot matrix show the same data, only we scrambled the order of the variables in the two views. If the labels X, Y, Z, and T are as given in the scatterplot matrix, label the rows and columns in the correlation matrix.

 1.0000  -0.4038   0.0204   0.6384
-0.4038   1.0000   0.8610  -0.5913
 0.0204   0.8610   1.0000  -0.3180
 0.6384  -0.5913  -0.3180   1.0000

[Scatterplot matrix of X, Y, Z, and T]

30. Identify the variable by matching the description to the data shown in the scatterplot matrix. The plot shows 75 observations.

a) Which variable is the sequence 1, 2, 3, …, 75?

b) Which variable has mean 102?

c) Which pair of variables is most highly positively correlated?

d) Which variable is nearly uncorrelated with Y?

e) Identify any outliers in these data. If you don’t find any, then say so.

[Scatterplot matrix of X (−3 to 2), Y (10 to 70), Z (395 to 420), and T (99 to 105)]


31. In the 4M exercise of this chapter, the data show correlation between the income and age of the customer. This produces collinearity and makes the analysis tricky to interpret. The marketing research group could have removed this collinearity by collecting data with these two factors made uncorrelated. For example, for each of the incomes $60,000, $70,000, through $120,000, they could have found two customers at each of, say, 25, 35, 45, 55, and 65 years old. That would have given them 70 observations.

a) Explain why Income and Age would be uncorrelated for these data.

b) Would the marginal slope be the same as the partial slope when analyzing these data?

c) Would the marginal slope for Age estimated from these data have a positive or negative sign?

32. To find out whether employees are interested in joining a union, a manufacturing company hired an employee relations firm to survey attitudes toward unionization. In addition to rating their agreement with the statement “I do not think we need a union at this company” (on a 1–7 Likert scale), the firm also recorded the number of years of experience and the salary of the employee. Both of these are typically positively correlated with agreement with the statement.

a) In building a multiple regression of the agreement variable on years of experience and salary, would you expect to find collinearity? Why?

b) Would you expect to find the partial slope for salary to be about the same as the marginal slope, or would you expect it to be noticeably larger or smaller?

33. Modern steel mills are very automated and need to monitor their substantial energy costs carefully to be competitive. In making cold-rolled steel (as used in making bodies of cars), it is known that temperature during rolling and the amount of expensive additives (expensive metals like manganese and nickel give steel desired properties) affect the number of pits per 20-foot section. A pit is a small flaw in the surface. To save on costs, a manager suggested the following plan for testing the results at various temperatures and amounts of additives:

90º, 0.5% additive
95º, 1.0% additive
100º, 1.5% additive
105º, 2.0% additive
110º, 2.5% additive

Multiple sections of steel for each combination would be produced and the number of pits counted.

a) If Temperature and Additive are to be used as predictors together in a multiple regression, will this approach yield useful data?

b) Would you stick to this plan, or can you offer an alternative that you think is better? What would that approach be?

34. A builder is interested in which types of homes earn a higher price. For a given number of square feet, the builder gathered prices of homes that use the space differently. In addition to price, the homes vary in the number of rooms devoted to personal use (such as bathrooms or private bedrooms) and rooms devoted to social use (enclosed decks or game rooms). Because the homes are of roughly equal size (equal numbers of square feet), the more space devoted to private use, the less


devoted to social use. The variable Private denotes the number of square feet used for private space and Social the number of square feet for social rooms.

a) Would you expect to find collinearity in a multiple regression of Price on Private and Social? Explain.

b) Rather than use both Private and Social as two variables in a multiple regression for Price, suggest an alternative that might in the end be simpler to interpret as well.

You Do It

We investigated the use of the MRM for inference in these examples in Chapter 24. For answering the questions shown here, assume unless indicated otherwise that you can use the MRM for inference. If you’re concerned that it’s not appropriate, see the analyses of these data in Chapter 24. For each data set, if your software supports it, prepare a scatterplot matrix as a first step in your analysis. Otherwise, you might want to look back at the individual scatterplots of these data that were used in the exercises of Chapter 24.


35. Gold chains (introduced in Chapter 24) These data give the prices (in dollars) for gold link chains at the web site of a discount jeweler. The data include the length of the chain (in inches) and its width (in millimeters). All of the chains are 14-karat gold in a similar link style. Use the price as the response. For one explanatory variable, use the width of the chain. For the second, calculate the “volume” of the chain as π times its length times the square of half the width, Volume = π × Metric Length × (Width/2)². To make the units of volume mm³, first convert the length to millimeters (25.4 mm = 1 inch).
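The volume calculation can be written as a small function; this is a sketch (the function name is ours), assuming length arrives in inches and width in millimeters:

```python
import math

def chain_volume_mm3(length_in, width_mm):
    """Volume of a chain modeled as a thin cylinder:
    pi * length * (width/2)^2, with length converted to mm
    (25.4 mm per inch) so the result is in cubic mm."""
    length_mm = length_in * 25.4
    return math.pi * length_mm * (width_mm / 2) ** 2

# For example, a 16-inch chain 1.0 mm wide:
# chain_volume_mm3(16, 1.0) is about 319.2 cubic mm
```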

a) The explanatory variable Volume includes Width. Are these explanatory variables perfectly correlated? Can we use them both in the same multiple regression?

b) Fit the multiple regression of Price on Width and Volume. Do both explanatory variables improve the fit of the model?

c) Find the variance inflation factor and interpret the value that you obtain.

d) What’s the interpretation of the coefficient of Volume?

e) The marginal correlation between Width and Price is 0.95, but its slope in the multiple regression is negative. How can this be?
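For a regression with exactly two explanatory variables, as in this exercise, the VIF for either predictor depends only on the correlation r between them: VIF = 1/(1 − r²). A minimal sketch (the function name is ours):

```python
def vif_two_predictors(r):
    """Variance inflation factor for either predictor in a
    two-predictor regression, where r is the correlation
    between the two explanatory variables."""
    return 1.0 / (1.0 - r * r)

# Uncorrelated predictors leave standard errors unchanged:
# vif_two_predictors(0.0) -> 1.0
# Strong collinearity inflates se(b) by sqrt(VIF):
# vif_two_predictors(0.9) -> about 5.26, so se is about 2.3 times larger
```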

36. Convenience shopping (introduced in Chapter 19) These data describe sales over time at a franchise outlet of a major US oil company. (The data file has values for two stations. For this exercise, use only the 283 cases for site 1.) Each row summarizes sales for one day. This particular station sells gas, and it also has a convenience store and a car wash. The response Sales gives the dollar sales of the convenience store. The explanatory variable Volume gives the number of gallons of gasoline sold and Washes gives the number of car washes sold at the station.


a) Fit the multiple regression of Sales on Volume and Washes. Do both explanatory variables improve the fit of the model?

b) Which explanatory variable is more important to the success of sales at the convenience store: gasoline sales or car washes? Do the slopes of these variables in the multiple regression provide the full answer?

c) Find the variance inflation factor and interpret the value that you obtain.

d) One of the explanatory variables is just barely statistically significant. Assuming the same estimated value, would a complete lack of collinearity have made this explanatory variable noticeably more statistically significant?

37. Download (introduced in Chapter 19) Before plunging into videoconferencing, a company tested its current internal computer network. The tests measured how rapidly data moved through its network given the current demand on the network. Eighty files ranging in size from 20 to 100 megabytes (MB) were transmitted over the network at various times of day, and the time to send each file was recorded. The time is given as the number of hours past 8 am on the day of the test.

a) Fit the multiple regression of Transfer Time on File Size and Hours past 8. Does the model, taken collectively, explain statistically significant variation in transfer time?

b) Does either explanatory variable improve the fit of the model that uses the other? Use a test statistic for each.

c) Find the variance inflation factors for both explanatory variables. Interpret the value that you obtain.

d) Can collinearity explain the paradoxical results found in “a” and “b”?

e) Would it have been possible to obtain data in this situation in a manner that would have avoided the effects of collinearity?

38. Production costs (introduced in Chapter 19) A manufacturer produces custom metal blanks that are used by its customers for computer-aided machining. The customer sends a design via computer, and the manufacturer replies with an estimated price per unit. This cost estimate determines a price for the customer. The data for the analysis were sampled from the accounting records of 195 orders that were filled during the previous 3 months.

a) Fit the multiple regression of Average Cost on Material Cost and Labor Hours. Both explanatory variables are per unit produced. Do both explanatory variables improve the fit of the model that uses the other?

b) The estimated slope for labor hours per unit is much larger than the slope for material cost per unit. Does this difference mean that labor costs form a larger proportion of production costs than material costs?

c) Find the variance inflation factors for both explanatory variables. Interpret the value that you obtain.


d) Suppose that one formulated this regression using total costs of each production run rather than average cost per unit. Would collinearity have been a problem in this model? Explain.

39. Home prices (introduced in Chapter 24) To help clients determine the price at which their house is likely to sell, a realtor gathered a sample of 150 purchase transactions in her area during a recent three-month period. The price of the home is measured in thousands of dollars. The number of square feet is also expressed in thousands, and the number of bathrooms is just that. Fit the multiple regression of Price on Square Feet and Bathrooms.

a) Thinking marginally for a moment, should there be a correlation between the square feet and the number of bathrooms in a home?

b) One of the two explanatory variables in this model does not explain statistically significant variation in the price. Had the two explanatory variables been uncorrelated (and produced these estimates), would it have been statistically significant? Use the VIF to see.

c) We can see the effects of collinearity by constructing a plot that shows the slope of the multiple regression. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress Price on Square Feet and save the residuals. Second, regress Bathrooms on Square Feet and save these residuals. Now, scatterplot the residuals from the regression of Price on Square Feet on the residuals from the regression of Bathrooms on Square Feet. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for Bathrooms in the multiple regression. Are they different?

d) Compare the scatterplot of Price on Bathrooms to the partial regression plot constructed in “c”. What’s changed?
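The residual-on-residual recipe used in this exercise (and in several that follow) can be sketched in pure Python. This is illustrative, not the text's software; the function and variable names are ours:

```python
def simple_ols(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(y, x_keep, x_remove):
    """Slope of the partial regression leverage plot for x_keep:
    regress y and x_keep each on x_remove, then regress the first
    set of residuals on the second. This slope matches the partial
    slope for x_keep in the multiple regression of y on both."""
    e_y = residuals(x_remove, y)
    e_x = residuals(x_remove, x_keep)
    return simple_ols(e_x, e_y)[1]

# Synthetic check: if y = 2*x1 + 3*x2 exactly, the partial slope
# for x2 (removing x1) is exactly 3.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]
```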

40. Leases (introduced in Chapter 19) This data table gives annual costs of 223 commercial leases. All of these leases provide office space in a Midwestern city in the US. The cost of the lease is measured in dollars per square foot, per year. The number of square feet is as labeled, and Parking counts the number of parking spots in an adjacent garage that the realtor will build into the cost of the lease. Fit the multiple regression of Cost per SqFt on 1/SqFt and Parking/SqFt. (Recall that the slope of 1/SqFt captures fixed costs of the lease, those present regardless of the number of square feet.)

a) Thinking marginally for a moment, should there be a correlation between the number of parking spots and the fixed cost of a lease?

b) Interpret the coefficient of Parking/SqFt. Once you figure out the units of the slope, you should be able to get the interpretation.

c) One of the two explanatory variables just barely explains statistically significant variation in the price. Had the two explanatory variables been uncorrelated (and produced these estimates), would it have been more impressively statistically significant? Use the VIF to see.


d) We can see the effects of collinearity by constructing a plot that shows the slope of the multiple regression. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress Cost per SqFt on Parking/SqFt and save the residuals. Second, regress 1/SqFt on Parking/SqFt and save these residuals. Now, scatterplot the residuals from the regression of Cost per SqFt on Parking/SqFt on the residuals from the regression of 1/SqFt on Parking/SqFt. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for 1/SqFt in the multiple regression. Are they different?

e) Compare the scatterplot of Cost per SqFt on 1/SqFt to the partial regression plot constructed in “d”. What’s changed?

41. R&D expenses (introduced in Chapter 19) This data table contains accounting and financial data that describe 493 companies operating in technology industries: software, systems design, and semiconductor manufacturing. The variables include the expenses on research and development (R&D), total assets of the company, and net sales. All columns are reported in millions of dollars (so 1000 = $1 billion). Use the natural logs of all variables and fit the regression of log R&D Expenses on log Assets and log Net Sales.

a) Thinking marginally for a moment, would you expect to find a correlation between the log of the total assets and the log of net sales?

b) Does the correlation between the explanatory variables change if you work with the data on the original scale rather than on a log scale? In which case is the correlation between the explanatory variables larger?

c) In which case does correlation provide a more useful summary of the association between the two explanatory variables?

d) What is the impact of the collinearity on the standard errors in the multiple regression using the variables on a log scale? Does the size of the VIF tell you that the two explanatory variables are not statistically significant?

e) We can see the effects of collinearity by constructing a plot that shows the slope of the multiple regression. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress Log R&D Expenses on Log Net Sales and save the residuals. Second, regress Log Assets on Log Net Sales and save these residuals. Now, scatterplot the residuals from the regression of Log R&D Expenses on Log Net Sales on the residuals from the regression of Log Assets on Log Net Sales. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for Log Assets in the multiple regression. Are they different?

f) Compare the scatterplot of Log R&D Expenses on Log Assets to the partial regression plot constructed in “e”. What’s changed?

42. Cars (Introduced in Chapter 19) This data table gives characteristics of 223 types of cars sold in the US during the 2003 and 2004 model years. Fit a multiple regression with the log10 of the base price


as the response and the log10 of the horsepower of the engine (HP) and the log10 of the weight of the car (given in thousands of pounds) as explanatory variables.

a) Does it seem natural to find correlation between these two explanatory variables, either on a log scale or in the original units?

b) How will collinearity on the log scale affect the standard errors of the slopes of these predictors in the multiple regression?

c) One of the explanatory variables in the multiple regression is not statistically significant. Can this be attributed to the effect of collinearity on the standard error of this estimate?

d) We can see the effects of collinearity by constructing a plot that shows the slope of the multiple regression. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress Log10 Price on Log10 HP and save the residuals. Second, regress Log10 Weight on Log10 HP and save these residuals. Now, scatterplot the residuals from the regression of Log10 Price on Log10 HP on the residuals from the regression of Log10 Weight on Log10 HP. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for Log10 Weight in the multiple regression. Are they different?

e) Compare the scatterplot of Log10 Price on Log10 Weight to the partial regression plot constructed in “d”. What’s changed?

43. OECD (introduced in Chapter 19) An analyst at the UN is developing a model that describes GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) among developed countries. For this analysis, she uses national data for 30 countries from the 2005 report of the Organization for Economic Co-operation and Development (OECD). Her current equation is

Estimated per capita GDP = β0 + β1 Trade Balance + β2 Waste per capita

Trade balance is measured as a percentage of GDP. Exporting countries tend to have large positive trade balances. Importers have negative balances. The other explanatory variable is the annual number of kilograms of municipal waste per person.

a) Is there any natural reason to expect these explanatory variables to be correlated? Suppose she had formulated her model using national totals as

GDP = β0 + β1 Net Export ($) + β2 Total Waste (kg)

Would this model have more or less collinearity? (You should not need to explicitly form these variables to answer this question.)

b) One nation is particularly leveraged in the marginal relationship between per capita GDP and Trade Balance. Which is it?

c) Does collinearity exert a strong influence on the standard errors of the estimates in her multiple regression?

d) Because multiple regression estimates the partial effect of an explanatory variable rather than its marginal effect, we cannot judge the effect of outliers on the partial


slope from their position in the scatterplot of y on x. We can, however, see their effect by constructing a plot that shows the partial slope. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress per capita GDP on per capita Waste and save the residuals. Second, regress Trade Balance on per capita Waste and save these residuals. These regressions remove the effects of waste from the other two variables. Now, scatterplot the residuals from the regression of per capita GDP on per capita Waste on the residuals from the regression of Trade Balance on per capita Waste. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for Trade Balance in the multiple regression. Are they different?

e) Which nation, if any, is leveraged in the partial regression leverage plot constructed in “d”? What would happen to the estimate for this partial slope if the outlier were excluded?

44. Hiring (introduced in Chapter 19) A firm operates a large, direct-to-consumer sales force. The firm would like to build a system to monitor the progress of new agents. The goal is to identify “superstar agents” as rapidly as possible, offer them incentives, and keep them with the firm. A key task for agents is to open new accounts; an account is a new customer to the business. The response of interest is the profit to the firm (in dollars) of contracts sold by agents over their first year. These data summarize the early performance of 464 agents. Among the possible explanations of performance are the number of new accounts developed by the agent during the first 3 months of work and the commission earned on early sales activity. An analyst at the firm is using an equation of the form (with natural logs)

Log Profit = β0 + β1 Log Accounts + β2 Log Early Commission

For cases that have value 0 for early commission, the analyst replaced zero with $1.

a) The analyst’s decision to fill in the 0 values of early commission with 1 so as to be able to take the log is common (you cannot take the log of 0). From the scatterplot of Log Profit on Log Early Commission, you can see the effect of what the analyst did. What’s the impact of these filled-in values on the marginal association?

b) Is there much collinearity between the explanatory variables? How does the presence of these filled-in values affect the collinearity?

c) Using all of the cases, does collinearity exert a strong influence on the standard errors of the estimates in her multiple regression?

d) Because multiple regression estimates the partial effect of an explanatory variable rather than its marginal effect, we cannot judge the effect of outliers on the partial slope from their position in the scatterplot of y on x. We can, however, see their effect by constructing a plot that shows the partial slope. To do this, we have to remove the effect of one of the explanatory variables from the other variables. Here’s how to make a so-called partial regression leverage plot for these data. First, regress Log Profit on Log Accounts and save the residuals. Second, regress Log Commission on Log Accounts and save these residuals. These regressions remove the effects of the number of accounts opened from the other two variables. Now,


scatterplot the residuals from the regression of Log Profit on Log Accounts on the residuals from the regression of Log Commission on Log Accounts. Fit the simple regression for this scatterplot, and compare the slope in this fit to the partial slope for Log Commission in the multiple regression. Are they different?
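The recipe in part (d) is the Frisch-Waugh-Lovell construction: the slope in the residual-on-residual regression reproduces the partial slope from the multiple regression exactly. A minimal sketch with simulated data standing in for the agent file (the sample size and coefficients here are invented for illustration, not taken from the data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # simulated agents (the real file has 464)
log_accounts = rng.normal(3.0, 0.5, n)
log_commission = 0.8 * log_accounts + rng.normal(0.0, 0.3, n)
log_profit = 1.0 + 0.5 * log_accounts + 0.7 * log_commission + rng.normal(0.0, 0.2, n)

def ols(X, y):
    """Least squares with an intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    return beta, y - A @ beta

# partial slope for Log Commission from the multiple regression
b_full, _ = ols(np.column_stack([log_accounts, log_commission]), log_profit)

# remove the effect of Log Accounts from the other two variables
_, r_y = ols(log_accounts, log_profit)      # residuals of y given accounts
_, r_x = ols(log_accounts, log_commission)  # residuals of x2 given accounts

# slope in the partial regression leverage plot
b_part, _ = ols(r_x, r_y)

print(abs(b_full[2] - b_part[1]) < 1e-8)  # True: the two slopes agree
```

Plotting `r_y` against `r_x` gives the partial regression leverage plot itself; the fitted line through that scatterplot has exactly the partial slope, which is why leveraged points in this plot reveal their influence on the multiple regression estimate.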

e) Do the filled-in cases remain leveraged in the partial regression leverage plot constructed in “d”? What does this view of the data suggest would happen to the estimate for this partial slope if these cases were excluded?

f) What do you think about filling in these cases with 1 so that we can take the log? Should something else be done with them?

The next two exercises use multiple regression with 3 explanatory variables. The analysis is much the same, only now you have to be on the lookout for more sources of collinearity.

45. Promotion (Introduced in Chapter 19) These data describe promotional spending by a pharmaceutical company for a cholesterol-lowering drug. The data cover 39 consecutive weeks and isolate the area around Boston. The variables in this collection are shares. Marketing research often describes the level of promotion in terms of voice. In place of the level of spending, voice is the share of advertising devoted to a specific product. The column Market Share is the ratio of sales of this product divided by total sales for such drugs in the Boston area. The column Detail Voice is the ratio of detailing for this drug to the amount of detailing for all cholesterol-lowering drugs in Boston. Detailing counts the number of promotional visits made by representatives of a pharmaceutical company to doctors’ offices. Similarly, Sample Voice is the share of samples in this market that are from this manufacturer.

a) Do any of these variables have linear patterns over time? Use timeplots of each one to see. (A scatterplot matrix becomes particularly useful.) Do any weeks stand out as unusual?

b) Fit the multiple regression of Market Share on three explanatory variables: Detail Voice, Sample Voice, and Week (which is a simple time trend, numbering the weeks of the study from 1 to 39). Does the multiple regression, taken as a whole, explain statistically significant variation in the response?

c) Does collinearity affect the estimated effects of these explanatory variables in the estimated equation? In particular, do the partial effects create a different sense of importance from what is suggested by marginal effects?

d) Which explanatory variable has the largest VIF?

e) What’s your substantive interpretation of the fitted equation? Take into account collinearity and statistical significance.

f) Should both of the explanatory variables that are not statistically significant be removed from the model at the same time? Explain why doing this would not be such a good idea, in general. (Hint: are they collinear?)
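The VIF asked about in part (d) is 1/(1 − R²_j), where R²_j comes from regressing predictor j on the other predictors. A sketch of the calculation with simulated weekly series (the trend coefficients below are made up; the real promotion data will give different values):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        yhat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
        r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
week = np.arange(1, 40, dtype=float)            # 39 weeks, as in the exercise
detail = 0.30 + 0.002 * week + rng.normal(0, 0.02, 39)   # hypothetical trends
sample = 0.25 + 0.003 * week + rng.normal(0, 0.02, 39)
X = np.column_stack([detail, sample, week])

vifs = vif(X)
print([round(v, 1) for v in vifs])
```

Because all three predictors trend with time, each R²_j is well above zero and every VIF exceeds 1; the predictor most nearly a linear combination of the others shows the largest inflation.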


46. Apple (Introduced in Chapter 19) These data track the monthly performance of stock in Apple Computer since the stock began trading in 1980. The data include 300 monthly returns on Apple Computer, as well as returns on the entire stock market, the S&P 500 index, stock in IBM, and Treasury Bills (short term, 30-day loans to Uncle Sam). (The column Whole Market Return is the return on a value-weighted portfolio that purchases stock in the 3 major US markets in proportion to the size of the company rather than one of each stock.) Formulate the regression with excess returns on Apple as the response and excess returns on the whole market, the S&P 500, and IBM as explanatory variables. (Excess returns are the same as excess percentage changes, only without being multiplied by 100. Just subtract the return on Treasury Bills from each.)

a) Do any of these excess returns have linear patterns over time? Use timeplots of each one to see. (A scatterplot matrix becomes particularly useful.) Do any months stand out as unusual?

b) Fit the indicated multiple regression. Does the estimated multiple regression explain statistically significant variation in the excess returns on Apple?

c) Does collinearity affect the estimated effects of these explanatory variables in the estimated equation? In particular, do the partial effects create a different sense of importance from what is suggested by marginal effects?

d) Which explanatory variable has the largest VIF?

e) How would you suggest improving this model, or would you just leave it as is?

f) Interpret substantively the fit of your model (which might be the one the question starts with).
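The setup for this exercise can be sketched as follows; the return series below are simulated placeholders, not the actual Apple, IBM, or S&P data, and the coefficients used to generate them are invented. Excess returns subtract the T-bill return from every series before fitting:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300                                   # 300 months, as in the exercise
tbill = np.full(n, 0.004)                 # hypothetical flat monthly T-bill return
market = rng.normal(0.008, 0.04, n)
sp500 = market + rng.normal(0.0, 0.01, n)  # nearly collinear with the market
ibm = 0.9 * market + rng.normal(0.0, 0.05, n)
apple = 0.005 + 1.3 * market + rng.normal(0.0, 0.10, n)

# excess returns: subtract the T-bill return from each series
y = apple - tbill
X = np.column_stack([np.ones(n), market - tbill, sp500 - tbill, ibm - tbill])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)  # intercept plus three partial slopes
```

Note how the simulation builds the S&P series as the market plus small noise: with predictors that close to collinear, the individual slopes for the market and the S&P become unstable even though the overall fit stays strong, which is exactly the pattern parts (b) through (d) ask you to look for.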


4M Budget Allocation Collinearity among the predictors is common in many applications, particularly those that track the growth of a new business over time. The problem is worse when the business has steadily grown – or fallen. Because the growth of a business affects many attributes of the business (such as assets, sales, number of employees and so forth; see Question 41), the simultaneous changes that take place make it hard to separate important factors from coincidences.

For this exercise, you’re the manager who allocates advertising dollars. You have a fixed total budget for advertising, and you have to decide how to spend it. We’ve simplified things so that you have two choices: printed ads or television. Managers today face many more choices, such as advertising via the Internet, but let’s focus on two so we can learn more about the effects of collinearity.

Your company in this exercise has been doing well. The past 2 years have been a time of growth, as you can see from the timeplot of weekly sales during this period. The data are in thousands of dollars, so you can see from the plot that weekly sales have grown from about $2.5 million up to around $4.5 million.

[Timeplot: weekly Sales ($M) from about 2000 to 4500, versus Week from 0 to 100.]

Other things have grown as well, namely your spending for advertising. This timeplot shows the two of them (TV in green, and printed ads in red), over the same 104 weeks.

[Timeplot: weekly advertising Spending ($M) from about 0 to 200, versus Week from 0 to 100; TV in green, printed ads in red.]

With everything getting larger over time, all three variables are highly correlated.


Motivation

(a) How are you going to decide how to allocate your budget between these two types of promotion?

Method

(b) Explain how you can use multiple regression to help decide how to allocate the advertising budget between printed ads and television ads.

(c) Why would it not be enough to work with several, more easily understood simple regression models, such as sales on spending for television ads or sales on spending on printed ads?

(d) Look at the scatterplot matrix of Sales, the two explanatory variables (TV Adv and Print Adv) and a time trend (Week). Do the relationships between the variables seem straight enough for fitting a multiple regression?

(e) Do you anticipate that collinearity will affect the estimates and standard errors in the multiple regression? Use the correlation matrix of these variables to help construct your answer.

Mechanics

Fit the multiple regression of Sales on TV Adv and Print Adv.

(f) Does the model satisfy the assumptions for the use of the MRM?

(g) Assuming that the model satisfies the conditions for the MRM, (i) Does the model as a whole explain statistically significant variation in Sales? (ii) Does each individual explanatory variable improve the fit of the model, beyond the variation explained by the other alone?

(h) Do the results from the multiple regression suggest a method for allocating your budget? Assume that your budget for the next week is $360,000.

(i) Does the fit of this model promise an accurate prediction of sales in the next week, accurate enough for you to think that you have nailed the right allocation?
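For part (h), note that once the fitted model is linear in the two spending variables, predicted sales for a fixed budget B are linear in the split between them, so the prediction is maximized at a corner: all of the money goes to the channel with the larger slope. A sketch with hypothetical coefficients (b0, b1, and b2 are made up for illustration, not estimates from the data):

```python
def predicted_sales(b0, b1, b2, tv, print_ads):
    """Predicted weekly sales ($M) from a fitted linear model."""
    return b0 + b1 * tv + b2 * print_ads

b0, b1, b2 = 250.0, 8.0, 5.0   # hypothetical estimates: sales per $M spent
B = 360.0                      # next week's budget, $M ($360,000)

# search splits of the budget in steps of 10
best_tv = max(range(0, 361, 10),
              key=lambda tv: predicted_sales(b0, b1, b2, tv, B - tv))
print(best_tv)  # 360: the channel with the larger slope takes the whole budget
```

The catch, and the point of this exercise, is that with collinear predictors the individual slope estimates are so imprecise that you may not be able to tell which channel really has the larger slope, so a corner allocation rests on shaky ground.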

Message

(j) Everyone at the budget meeting knows the information in the plots shown in the introduction to this exercise: sales and both types of advertising are up. Make a recommendation with enough justification to satisfy their concerns.

(k) Identify any important concerns or limitations that you feel should be understood to appreciate your recommendation.