Ch22 Answers


  • 7/28/2019 Ch22 Answers


    4/21/2008

    22 Answers

    Mix and Match

    1. i or f
    2. c
    3. f or i
    4. a
    5. b
    6. h
    7. e
    8. d
    9. j
    10. g

    True/False

    11. False. When the variance of the errors is not constant, the prediction intervals are likely to be too short in some cases (where the variance is large) and too long in others (where the variance is small).

    12. False. The predictions are correct on average, but the prediction intervals have the incorrect length around the prediction.

    13. True

    14. False. The Durbin-Watson statistic tests for dependence between adjacent errors.

    15. True

    16. False. The presence of an outlier reduces r² if the outlier deviates from the pattern in the rest of the data, but it need not be located in this way. In some cases, as in the text example, the presence of the outlier inflates r².

    17. False. Residuals look normally distributed when all of the omitted lurking factors are roughly of the same size, so that their net effect (the sum of their effects) tends to have bell-shaped variation.

    18. True. Without normality, we cannot rely on the usual "±2 implies 95%" heuristic.


    19. False. The decision to exclude data should take into account many things, particularly the substantive importance of the case. That unusual case might be the most interesting part of the data, telling you what you have omitted from the model.

    20. False. If D ≈ 2, then there's no evident dependence between adjacent residuals. Also note that the residuals could show other patterns; we cannot prove independence, only reject the null of independence.

    21. False. Because it removes the trend, it's easier to see changes in the variation in the plot of the residuals on x.

    22. True

    Think About It

    23. The data would most likely have unequal variation, with more variation among larger stores. Some large stores will do well (little competition, lots of people nearby) whereas others will not. Smaller locations have a smaller range of opportunities to do well, as well as a smaller downside. You might anticipate other problems as well.

    24. The association between income and education might be nonlinear, but certainly expect the variation to be larger for the data associated with small towns than for larger communities. The variation of the average (in particular, its standard error) goes down with the size of the sample.
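The point in Exercise 24 about averages can be made concrete with the usual standard error formula, SE = σ/√n. The sketch below is only an illustration: the income SD and the community sizes are made-up values, not numbers from the exercise.

```python
import math


def standard_error_of_mean(sigma: float, n: int) -> float:
    """Standard error of an average of n independent values, each with SD sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Hypothetical figures: same household-level SD, different community sizes.
sigma = 20_000.0  # assumed SD of individual incomes, in dollars
for n in (25, 400, 10_000):
    se = standard_error_of_mean(sigma, n)
    print(f"n = {n:>6}: SE of the average income = {se:,.0f}")
```

Averages from small towns (small n) therefore scatter more widely around the fitted line than averages from large communities, which is the unequal variation the answer anticipates.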

    25. The analyst was hasty because the analyst failed to realize that the problem with the regression is an evident lack of constant variation. These residuals should not be combined into one histogram. The variance is clearly smaller at the left than at the right.

    26. It is true that the regression line tracks the average of y as x increases. In general, even with dependence, predictions will be correct on average. The width of the prediction intervals, however, will in general be too long for small values of x and too narrow for large values of x.

    27. a) The slope will become closer to zero.
    b) The r² would change, but it is hard to say by how much. (In fact, it drops from 0.24 to 0.21.) Indeed, we must be very careful comparing r² for equations that do not describe the same variation. s_e would be smaller without this case, since it is the largest residual in the data.
    c) Yes, this case is leveraged because it is near the right-hand side of the plot.

    28. a) The slope will stay about the same. The outlier basically pulls the regression line down without tilting it to either side.
    b) Without the outlier, r² will be larger and s_e smaller. Again, be cautious interpreting differences in r² when the response changes.
    c) This case is not leveraged because it lies near x̄.

    29. a) The slope will increase, moving to near zero.
    b) r² will decrease, whereas s_e will stay about the same or be slightly smaller. r² falls because most of the variation that is explained is the distinction of this outlier from the rest of the cases (r² drops from about 30% to near 10%).
    c) This case is leveraged because it lies far below the other cases.

    30. a) The slope gets much steeper, fitting the evident pattern in the cluster of points at the left of the figure.
    b) Without the outlier, r² will be larger (it gets about 3 times larger) and s_e smaller. Be cautious interpreting differences in r² when the response changes.
    c) This case is extremely leveraged because it lies far outside the range of the other values of the explanatory variable.

    31. Answers will vary, but you can think of other macroeconomic factors that were also changing over this time period, such as trends in the stock market, interest rates, inflation, etc.

    32. These data are time series, and an obvious candidate to produce a pattern in the residuals is a day-of-the-week effect, with Mondays and Fridays perhaps being different from days in the middle of the week. Shipping schedules and delivery dates might introduce a sequential pattern as well.

    33. No. The Durbin-Watson statistic tests the assumption of independence. If we do not reject this hypothesis, it may still be false. We just failed to reject it. We have not proven that independence is true. Statistical tests never prove H0.

    34. The value of D in this example is small (indicating a problem) because we used a line to summarize data that show a nonlinear pattern. The Durbin-Watson statistic interprets the positive, negative, positive pattern in the residuals as a sign of dependence rather than as a sign that we have fit the wrong pattern. The Durbin-Watson statistic assumes you have fit the right equation so that the model is right on average. If not, it confuses this lack of fit for dependence.
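The confusion described in Exercise 34 is easy to reproduce. The sketch below fits a line to made-up, noise-free data that follow a curve and then computes the Durbin-Watson statistic from the residuals in x order. The data and the helper names `fit_line` and `durbin_watson` are assumptions for illustration, not the data or software behind the exercise.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def durbin_watson(residuals):
    """Sum of squared adjacent differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


# Made-up, noise-free data that bend: a line plus a quadratic term.
x = list(range(1, 31))
y = [2.0 + 0.5 * xi + 0.05 * (xi - 15.5) ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
d = durbin_watson(residuals)
print(f"D = {d:.3f}")  # far below 2, although no error here depends on another
```

The residuals run positive, then negative, then positive, so adjacent residuals are close to one another and D lands far below 2 even though the only problem is the missing curvature.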

    You Do It

    35. Diamond rings
    The price of the Hope Diamond comes to S$ 56 million.
    a) With this point added, the scaling on the plot is such that you can see only 2 points: one for the Hope Diamond, and one for the other 48 points.
    b) The fitted line essentially goes through these 2 points, as summarized below. The slope becomes much, much steeper. The intercept becomes even more negative.
    Without Hope Diamond (green, nearly horizontal in figure)
    Estimated Price (Singapore dollars) = -260 + 3721 Weight (carats)
    With Hope Diamond (red)
    Estimated Price (Singapore dollars) = -251696 + 1235667 Weight (carats)
    c) The value of R² grows from 0.978 to 0.999925 and s_e gets huge, swelling from S$ 32 to S$ 69,962. We should not directly compare R² because we've changed the response. It's grown so large because most of the variation in the new response is the difference between rings with little diamonds and this huge stone. s_e is larger because even a small error in fitting the Hope Diamond is large when compared to the costs of the other rings. The model is not even close to them now. (See the scatterplot on the right that shows the fit of the new equation to the small rings; the fit is the vertical line in the figure!)
    d) The point for the Hope Diamond is incredibly leveraged, with a value of x that is about 100 times heavier than any other in the data. The least squares regression has to fit this outlier, no matter what it does to the fit of the other data.
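To see why one extreme x-value dominates the fit, it can help to compute leverages directly. In simple regression the leverage of case i is 1/n + (x_i − x̄)²/Σ(x_j − x̄)². The sketch below uses made-up weights (48 small stones and one very large one); these are not the actual ring data, and `leverages` is a hypothetical helper name.

```python
def leverages(x):
    """Leverage of each case in a simple regression on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Hypothetical weights in carats: 48 small stones plus one huge stone.
small = [0.12 + 0.005 * i for i in range(48)]  # made-up values, 0.12 to 0.355
weights = small + [45.0]                       # made-up weight for the huge stone

h = leverages(weights)
print(f"largest leverage among small stones: {max(h[:-1]):.4f}")
print(f"leverage of the huge stone:          {h[-1]:.4f}")
print(f"sum of leverages:                    {sum(h):.4f}")  # always 2 in simple regression
```

Because the leverages sum to 2, a single case with leverage near 1 leaves almost nothing for the others: the line must pass essentially through that point.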

    [Figures: Price (Singapore dollars) versus Weight (carats). Left: all rings together with the Hope Diamond. Right: the small rings only.]

    Term        Estimate     Std Error   t Ratio   Prob>|t|
    Intercept   -251696.1    10148.5     -24.80


    this day (it's a Wednesday) given that it was not such a big day at the pumps.

    [Figures: Sales (Dollars) versus Volume (Gallons); Residual versus Row Number.]

    37. Download
    a) Neither plot suggests a problem. The fitted equation is
    Estimated Transfer Time (sec) = 7.2746633 + 0.3133071 File Size (MB)
    The residual plots versus both the explanatory variable and the time order seem fine at first glance. Both plots are shown below.
    b) The Durbin-Watson statistic is D = 2.67. For a sequence of this length, this is statistically significantly different from 2. Our software computes the p-value at 0.003.
    c) In this example, the pattern is one that we have not seen. Rather than showing the meandering pattern, these results flip sign. The residuals basically have the pattern positive/negative/positive/negative with alternating sign.
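A value of D above 2 corresponds to negative correlation between adjacent residuals, since D is roughly 2(1 − r), where r is the lag-one autocorrelation. The sketch below illustrates this with a short made-up sequence that flips sign from row to row; these are not the download residuals, and the function names are hypothetical.

```python
def durbin_watson(e):
    """Sum of squared adjacent differences divided by the sum of squared residuals."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v * v for v in e)


def lag1_autocorrelation(e):
    """Correlation between adjacent residuals (residuals assumed to average zero)."""
    return sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e)


# Made-up residuals that alternate in sign from one row to the next.
residuals = [3.0, -2.0, 4.0, -1.0, 1.5, -3.5, 2.5, -2.5, 0.5, -1.5, 2.0, -2.0]

d = durbin_watson(residuals)
r1 = lag1_autocorrelation(residuals)
print(f"D = {d:.2f}, lag-1 autocorrelation = {r1:.2f}, 2(1 - r) = {2 * (1 - r1):.2f}")
```

The approximation is not exact in short series because the first and last residuals enter the numerator only once, but the direction is clear: alternating signs push D toward 4, meandering residuals push it toward 0.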

    [Figures: Residual versus File Size (MB); Residuals Transfer Time (sec) versus Rows.]

    38. Production costs
    a) Neither the scatterplot of y on x nor the residual plot (shown below, left) suggests a problem. The relative sparseness of the data for larger material costs simply reflects the fact that most orders have relatively small material costs per unit.
    b) Typical economics and common sense suggest that other cost inputs would be needed to account for energy consumption, labor, and perhaps other fixed costs.
    c) The residual plot indicates a positive correlation between the residuals and the amount of labor. This is not simple variation; we can explain the variation in these residuals. Labor input appears to be a lurking variable.


    [Figures: Residual versus Material Cost ($/unit), left; Residuals Average Cost ($/unit) versus Labor Hours ($/unit), right.]

    39. Seattle homes
    a) The two fits are shown together in the plot below. The red, nearly horizontal line includes the outlier. The green line does not.
    b) Use the model without the outlier as a basis for setting the size of confidence intervals. The estimates with the outlier are not very close to those obtained without this home, but the slope nonetheless falls within the range of uncertainty indicated by the confidence intervals (we only have a small sample, so these intervals are wide). For the slope, the gap between the estimates is
    (5175.4905 - 57923.342) / 34515.8 ≈ -1.5
    and for the intercept, the gap is smaller in absolute size, but considerable on the standard error scale:
    (201.01784 - 155.72096) / 21.80695 ≈ 2.1
    c) The intercept represents variable costs (estimated to be $156 per square foot without the outlier). This estimate is more affected by the outlier than the slope. While the slope changes by more in absolute terms, the change is within the realms of plausibility. The intercept lies outside the confidence interval if the outlier is included.
    d) Yes, the lot for this home is more than 3 times larger than any other. You're getting a lot more land with this home than the others, helping to explain why this home costs 3 times as much as others of the same number of square feet of house. With the lot size taken into account, the cost of this property seems in line with that of the home in row 23 (which costs $575,000 for 2452 square feet of home on a lot with 248,000 square feet).

    [Figure: Price ($/SqFt) versus 1/Sq Ft, showing both fitted lines.]

               Without     With
    R²         0.097731    0.000186
    s_e        41.27091    88.49126
    n          28          29

    With 28
    Term        Estimate     Std Error   t Ratio   Prob>|t|
    Intercept   155.72096    21.80695    7.14
    [...]

    With 29
    Term        Estimate     Std Error   t Ratio   Prob>|t|
    Intercept   201.01784    45.718      4.40      0.0002
    1/Sq Ft     5175.4905    73121.26    0.07      0.9441
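The two gaps quoted in part (b) of Exercise 39 can be checked with a few lines of arithmetic. The sketch assumes, as the answer does, that the standard errors come from the fit without the outlier; the function name is hypothetical.

```python
def gap_in_standard_errors(estimate_with, estimate_without, se_without):
    """Change in an estimate when the outlier is added, in units of the baseline SE."""
    return (estimate_with - estimate_without) / se_without


# Values quoted in the answer to Exercise 39.
slope_gap = gap_in_standard_errors(5175.4905, 57923.342, 34515.8)
intercept_gap = gap_in_standard_errors(201.01784, 155.72096, 21.80695)

print(f"slope gap:     {slope_gap:.2f} standard errors")
print(f"intercept gap: {intercept_gap:.2f} standard errors")
```

The slope moves by about 1.5 standard errors, inside a rough ±2 band, whereas the intercept moves by slightly more than 2, which is why part (c) treats the intercept as the estimate more affected by the outlier.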

    40. Leases
    a) The 8 leases with large residuals are highlighted in the plot shown below. All 8 fall on the right-hand side of the scatterplot of y on x (as defined by the median of the explanatory variable). The chance of 8 tosses of a coin coming up heads in a row is 1/256 ≈ 0.004. It seems that these are not simple residuals; the residuals associated with smaller properties have more variation in the shown average costs than larger properties.
    b) Any time you talk about property, it's location, location, location. Other factors include the facilities of the building, the access to parking and transportation, age since renovation, and so forth.
    c) All of these 4 properties are right near the heart of the city. The plot (right) of the residuals versus Distance to City highlights these. Some are even expensive for their location, suggesting other factors at work as well. At some point between 2 and 3 miles from the city, being farther out makes no difference to the price. But there seems to be a premium to being in the heart of the city.
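The coin-toss argument in part (a) of Exercise 40 treats each large residual as equally likely to fall on either side of the median of x. Under that assumption the chance that all k land on one specified side is (1/2)^k. A minimal sketch with exact fractions (the function name is hypothetical):

```python
from fractions import Fraction


def prob_all_on_given_side(k: int) -> Fraction:
    """Chance that k independent fair coin tosses all come up heads."""
    return Fraction(1, 2) ** k


p8 = prob_all_on_given_side(8)
print(p8, f"= {float(p8):.4f}")
```

The same function covers any other count of one-sided cases, so long as the fair-coin assumption is a reasonable description of constant variance.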

    [Figures: Cost per Sq Foot versus 1/Sq Feet, left; Residuals Cost per Sq Foot versus Distance to City, right.]

    41. R&D expenses
    a) The plot has an odd flat-top appearance, with the variation above the fitted line being smaller and more compact than that below the line. Notice the scale in the residual plot. Negative deviations seem much more spread out than positive deviations.


    b) The normal distribution is symmetric.
    c) We counted 26 companies whose values lie outside the indicated prediction intervals. Of these, only 4 are positive. We'd expect half, or 13. The SD of a binomial with n = 26 and p = 1/2 is √(np(1 - p)) = √(26/4) ≈ 2.5. That means the observed count of 4 lies (4 - 13)/2.5 = -3.6 SDs below the mean. The central limit theorem (applied to the binomial) tells us that this is rather unusual. It seems that indeed the errors are not nearly normal. (The quantile plot confirms this impression, but it's nice to know some alternatives if you can't do the quantile plot easily.)
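The binomial calculation in part (c) of Exercise 41 can be reproduced as follows. The sketch uses the unrounded SD, so the z-score comes out slightly smaller in size than the -3.6 obtained with the rounded value 2.5; the exact binomial tail is added as a cross-check that does not rely on the normal approximation.

```python
import math

n, p, observed = 26, 0.5, 4  # 26 cases outside the intervals, 4 of them positive

mean = n * p
sd = math.sqrt(n * p * (1 - p))
z = (observed - mean) / sd

# Exact chance of 4 or fewer positives out of 26 if each sign were a fair coin toss.
exact_tail = sum(math.comb(n, k) for k in range(observed + 1)) / 2**n

print(f"mean = {mean:.0f}, SD = {sd:.2f}, z = {z:.2f}")
print(f"P(X <= {observed}) = {exact_tail:.6f}")
```

Either way the conclusion is the same: so few positive deviations would be very surprising if the errors were symmetric around the fitted line.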

    [Figures: Log 10 R&D Expense versus Log 10 Assets; Residual versus Log 10 Assets.]

    42. Cars
    a) The log-log specification is much closer to the conditions needed by the SRM. In the original units, the relationship is seen to bend and clearly does not have similar variances.
    b) The slopes do not mean the same thing. For price and horsepower, the slope says that the price rises by a fixed amount on average with increasing horsepower. For the log-log model, the slope is the elasticity. For each 1% increase in HP, we get a constant 1.4% increase in price, on average. Shown on the original scale (the curve in the figure on the left), the log-log model shows that added HP becomes increasingly expensive as the power of the engine goes up.
    c) We counted 11 cars at or very near this boundary, all on the right-hand side of the plot. The cars with this property are highlighted in the figure below. The chance of tossing a coin 11 times and getting heads every time is pretty small, only 0.5^11 = 1/2048 ≈ 0.0005. That's not a fair coin, and we have strong evidence that the variance of the errors is not constant.
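The elasticity reading in part (b) of Exercise 42 is an approximation that works well for small percentage changes. In a log-log fit, y is proportional to x raised to the slope, so the exact percentage change can be computed directly. The sketch below takes the slope of 1.4 quoted in the answer; the function name is hypothetical.

```python
def percent_change_in_y(elasticity: float, percent_change_in_x: float) -> float:
    """Exact percentage change in y implied by a log-log fit y = a * x**elasticity."""
    return ((1 + percent_change_in_x / 100) ** elasticity - 1) * 100


slope = 1.4  # elasticity quoted in the answer
for pct in (1, 10, 50):
    change = percent_change_in_y(slope, pct)
    print(f"{pct:>2}% more horsepower -> about {change:.1f}% higher price")
```

For a 1% change the result is essentially 1.4%, matching the rule of thumb; for larger changes the exact figure exceeds 1.4 times the percentage, which reflects the upward-bending curve on the original scale.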


    [Figures: Base Price MSRP versus Horsepower, left; Log 10 Price versus Log 10 HP, right.]

    43. OECD
    a) Visually, the fit does not change by very much, as shown in the plots below. The two fitted equations produce very similar fits to the data. More precisely, we can use the confidence interval for the fit based on all of the data. The fitted equations using all of the data and then without Luxembourg are
    All 30 countries
    Estimated GDP (per cap) = 26804 + 1617 Trade Bal (%GDP)
    Without Luxembourg
    Estimated GDP (per cap) = 26714 + 1441 Trade Bal (%GDP)
    The confidence interval for the slope using all of the data is
    1617.47 - 2 * 303.86 to 1617.47 + 2 * 303.86, that is, 1009.75 to 2225.19
    The slope without Luxembourg is well within the confidence interval.
    b) These summary statistics change quite a bit. As always, we have to be careful comparing the values of R² since we have changed the response by removing a case.

                          R²      s_e
    All                   0.503   11,298
    Without Luxembourg    0.369   11,336

    The change in R² is so much larger because Luxembourg is also the largest value on the response. When we remove it, we remove a large contributor to the variation on the y scale, variation that we had been explaining. s_e changes relatively little since the fit remains the same and the residual at Luxembourg was fairly typical of those at other points.
    c) No. The regression does not take into account the sizes of the countries. All are equally weighted. That's a problem in the sense that data for a small country might be more variable from year to year than that for a larger country. Think of the analogy to averages: averages of larger samples are more stable than averages of smaller samples.
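The interval check in part (a) of Exercise 43 amounts to a few lines of arithmetic. The sketch below uses the answer's rough multiplier of 2 rather than an exact t quantile; the function name is hypothetical.

```python
def approx_95_interval(estimate, std_error, multiplier=2.0):
    """Rough 95% interval: estimate plus or minus `multiplier` standard errors."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Slope and standard error from the fit to all 30 countries.
low, high = approx_95_interval(1617.47, 303.86)
slope_without_luxembourg = 1441

print(f"interval: {low:.2f} to {high:.2f}")
print("slope without Luxembourg inside:", low <= slope_without_luxembourg <= high)
```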

    44. Hiring
    a) The outlier highlighted below is row 169.
    b) The values of the columns Early Commission and Early Selling are zero for this woman, whereas most others are positive. This might be an explanation. This case is the only one in the data for which both of these columns are zero.
    c) The fit with this one point excluded is nearly identical. The results both with and without this case are shown below. If we use the model without the outlier as our baseline, the fitted model with the outlier produces estimates well within the confidence intervals. Both models yield very similar values of R² and s_e.
    d) The fits are so similar for two reasons: the large sample size and the fact that this observation is not so leveraged as to overwhelm the opinions of the other cases as to where the line should go.

    [Figure: Log Profit versus Log Accounts, with the outlier highlighted.]

               With       Without
    R²         0.176184   0.175831
    s_e        0.717014   0.693895
    n          464        463

    With all 464 cases
    Term        Estimate    Std Error   t Ratio   Prob>|t|
    Intercept   8.9444533   0.100374    89.11


    and for the slope
    (0.1313929 - 0.1607747) / 0.059047 = -0.4976 (b1 is smaller)
    Both changes are on the order of about half of a standard error, well within the range of plausibility suggested by the confidence intervals from the model without the outlier.
    c) Week 6 is highly leveraged, so it increases the variation in the explanatory variable. Without this case, we have less variation in x and hence get a larger standard error for the slope, even though s_e is smaller. We also have a smaller n without the outlier.
    d) The Durbin-Watson statistic is D = 2.02, and the timeplot of the residuals shows no pattern. There's no evidence of a lurking factor over time.
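The argument in part (c) rests on the formula for the standard error of the slope in simple regression, se(b1) = s_e / √Σ(x_i − x̄)². Adding a case far from the others inflates the denominator. The sketch below holds s_e fixed and uses made-up x values (38 ordinary weeks plus one leveraged week); it is not the data from the exercise, and the function name is hypothetical.

```python
import math


def slope_standard_error(x, s_e):
    """se(b1) = s_e / sqrt(sum of squared deviations of x from its mean)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return s_e / math.sqrt(sxx)


# Made-up explanatory values: 38 ordinary weeks and one leveraged week.
x_without = [1.0 + 0.1 * i for i in range(38)]
x_with = x_without + [12.0]

s_e = 0.0071  # assumed residual SD, held fixed to isolate the effect of x
print(f"se(b1) without the leveraged case: {slope_standard_error(x_without, s_e):.5f}")
print(f"se(b1) with the leveraged case:    {slope_standard_error(x_with, s_e):.5f}")
```

With s_e nearly unchanged, as in the exercise, the extra spread in x is what makes the slope estimate look more precise when the leveraged week is kept.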

               With       Without
    R²         0.14467    0.170771
    s_e        0.007102   0.007086
    n          39         38

    With all 39
    Term        Estimate    Std Error   t Ratio   Prob>|t|
    Intercept   0.2111775   0.004964    42.54
    [...]
    Term        Estimate    Std Error   t Ratio   Prob>|t|
    Intercept   0.2082504   0.005646    36.88


    and the iPod arrived in 2001.
    b) Retaining the outlier for October 1987 (at the left) keeps the t-statistic larger.
    c) The presence of October 1987 retains the t-statistic because it lies along the fit through the other cases (it has a small residual) and it's leveraged, meaning that the presence of this point adds variation to x, lowering the SE of the slope. Note in the output the change in the SE of the estimated slope when October 1987 is excluded.
    d) The change in the fit is small because of the large sample size and the lack of leverage at these two points. October 1987 is leveraged (the largest drop in the market), but is not many standard deviations of x from the rest of the explanatory variable.

    [Figure: Apple Return versus Market Return.]

               All        Without October 87   Without Sept 2000
    R²         0.205462   0.191704             0.20082
    s_e        0.133264   0.133479             0.130215
    n          300        299                  299

    All of the cases
    Term            Estimate    Std Error   t Ratio   Prob>|t|
    Intercept       0.0049275   0.007911    0.62      0.5338
    Market Return   1.5371567   0.175106    8.78

    Without October 87
    Term            Estimate    Std Error   t Ratio   Prob>|t|
    Intercept       0.0047156   0.007995    0.59      0.5558
    Market Return   1.548498    0.184503    8.39

    Without Sept 2000
    Term            Estimate    Std Error   t Ratio   Prob>|t|
    Intercept       0.0071919   0.007751    0.93      0.3543
    Market Return   1.483011    0.171666    8.64


    4M Do Fences Make Good Neighbors?

    a) Cost of the security fence is $35,000 per house, so the value added by reducing the perceived crime rate from 15 to 10 per 1000 has to be more than this.

    b) No, for two reasons. First, a significant effect does not equate to a cost-effective one. Second, the model only describes association. There could very well be other lurking factors that operate; this model is not causal.

    c) The plot appears straight enough to proceed. There are clearly several outliers.

    d) The linear equation shown in this figure is
    Estimated House Price ($) = 97921.6 + 1301.3762 × 1000/Crime
    Based on the fit of this equation, the average selling price at a crime rate of 15 per 1000 (1000/Crime = 1000/15) is (with a multiplier of 2 to account for doubling)
    2 * (97921.6 + 1301.3762 * 1000/15) ≈ $369,360
    and at 10 per 1000 (1000/Crime = 1000/10), the estimated price is
    2 * (97921.6 + 1301.3762 * 1000/10) ≈ $456,118
    The increase in average price (456,118 - 369,360 = 86,758), if this is real, certainly seems to be large enough to cover the $35,000 per home in costs to add the gate and fence.

    e) The three most leveraged communities are the 3 at the right of the scatterplot, with very low crime rates and hence very large values for 1000/Crime. The most leveraged is Upper Providence, with Northampton and Solebury close by. At the left side is Center City Philadelphia (shown as an o), with a very high crime rate. It's farthest to the left, but not so leveraged since most of the data are near this side of the plot.

    f) The 4 largest residuals (all positive) are Gladwyne, Villanova, Haverford, and Horsham. The prices in these areas are much larger than would be expected otherwise due to a lurking factor: location. The first 3 of these are located on the Main Line, a prestigious suburban area outside Philadelphia. The Main Line is named for a rail line that once took the local gentry to summer homes in the country.


    [Figure: Residual versus 1000/Crime, with Gladwyne, Haverford, Horsham, and Villanova labeled.]

    g) The data do not conform to the conditions specified by the SRM. The fit is straight enough, but the errors do not seem symmetrically spread around the fitted equation (partially because of the outliers). If you know the geography of this area, you can also identify the evident lurking factor. The model does not account for the prestigious Main Line, and so produces a cluster of positive residuals. But for this cluster of residuals, the errors seem to have similar variances. Even allowing for these outliers, the errors in the normal quantile plot are not nearly normal. The skewness is evident and systematic. Prediction intervals would thus be questionable. For inference about the parameters, however, we can resort to the skewness and kurtosis if we are not too concerned about dependence (such as from the adjacency of the communities). The skewness and kurtosis of the residuals are K3 = 1.9 and K4 = 4.7. We have more than 47 cases, so we can rely on the CLT for producing normally distributed sampling distributions for estimates of the slope and intercept.

    [Figure: Histogram (Count) and Normal Quantile Plot of the residuals.]

    h) From d, the estimated difference in average selling price is (the intercept drops out)
    2 * (b1 * 1000/10 - b1 * 1000/15) = 1000 b1 (2/10 - 2/15) = 66.667 b1
    The estimated change in average value is then
    66.667 * 1301.3762 ≈ $86,759
    with standard error
    66.667 * se(b1) = 66.667 * 287.982 ≈ $19,199
    The estimated improvement thus lies
    (86759 - 35000) / 19199 ≈ 2.70
    standard errors above the break-even point. If we ignore possible dependence due to lurking variables, we can signal the builder to go ahead.
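The calculations in parts d and h can be collected into one short script. The sketch below uses the coefficients and standard error quoted in the answer and keeps the factor of 2 that the answer applies to the fitted price; the variable and function names are assumptions for illustration.

```python
# Values quoted in the answer: fitted equation, SE of the slope, and fence cost.
b0, b1, se_b1 = 97921.6, 1301.3762, 287.982
fence_cost = 35_000.0


def estimated_price(crime_per_1000, multiplier=2.0):
    """Fitted average price at a given crime rate, scaled by the answer's multiplier."""
    return multiplier * (b0 + b1 * 1000.0 / crime_per_1000)


price_at_15 = estimated_price(15)
price_at_10 = estimated_price(10)
gain = price_at_10 - price_at_15

# The intercept cancels, so the gain is a fixed multiple of the slope.
scale = 2.0 * (1000.0 / 10 - 1000.0 / 15)  # about 66.667
se_gain = scale * se_b1
z = (gain - fence_cost) / se_gain

print(f"price at 15 per 1000: ${price_at_15:,.0f}")
print(f"price at 10 per 1000: ${price_at_10:,.0f}")
print(f"estimated gain: ${gain:,.0f} with SE ${se_gain:,.0f}")
print(f"standard errors above break-even: {z:.2f}")
```

Because the gain is just a multiple of b1, its standard error is the same multiple of se(b1); no covariance with the intercept is needed.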


    Term        Estimate   Std Error   t Stat   p-value
    Intercept   97920.6    15462.18    6.33