

All of Statistics: A Concise Course in Statistical Inference

Brief Contents

1. Introduction

Part I Probability
2. Probability
3. Random Variables
4. Expectation
5. Inequalities
6. Convergence of Random Variables

Part II Statistical Inference
7. Models, Statistical Inference and Learning
8. Estimating the CDF and Statistical Functionals
9. The Bootstrap
10. Parametric Inference
11. Hypothesis Testing and p-values
12. Bayesian Inference
13. Statistical Decision Theory

Part III Statistical Models and Methods
14. Linear Regression
15. Multivariate Models
16. Inference about Independence
17. Undirected Graphs and Conditional Independence
18. Loglinear Models
19. Causal Inference
20. Directed Graphs
21. Nonparametric Curve Estimation
22. Smoothing Using Orthogonal Functions
23. Classification
24. Stochastic Processes
25. Simulation Methods

Appendix Fundamental Concepts in Inference


1. Introduction

The goal of this book is to provide a broad background in probability and statistics for students in statistics, computer science (especially data mining and machine learning), mathematics, and related disciplines. This book covers a much wider range of topics than a typical introductory text on mathematical statistics. It includes modern topics like nonparametric curve estimation, bootstrapping and classification, topics that are usually relegated to follow-up courses. The reader is assumed to know calculus and a little linear algebra. No previous knowledge of probability and statistics is required. The text is suitable for advanced undergraduates and graduate students.

Statistics, data mining and machine learning are all concerned with collecting and analyzing data. For some time, statistics research was conducted in statistics departments while data mining and machine learning research was conducted in computer science departments. Statisticians thought that computer scientists were reinventing the wheel. Computer scientists thought that statistical theory didn't apply to their problems.

Things are changing. Statisticians now recognize that computer scientists are making novel contributions while computer scientists now recognize the generality of statistical theory and methodology. Clever data mining algorithms are more scalable than statisticians ever thought possible. Formal statistical theory is more pervasive than computer scientists had realized. All agree that students who deal with the analysis of data should be well grounded in basic probability and mathematical statistics. Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid.

But where can students learn basic probability and statistics quickly? Nowhere. At least, that was my conclusion when my computer science colleagues kept asking me: "Where can I send my students to get a good understanding of modern statistics quickly?" The typical mathematical statistics course spends too much time on tedious and uninspiring topics (counting methods, two dimensional integrals etc.) at the expense of covering modern concepts (bootstrapping, curve estimation, graphical models etc.). So I set out to redesign our undergraduate honors course on probability and mathematical statistics. This book arose from that course. Here is a summary of the main features of this book.

1. The book is suitable for honors undergraduates in math, statistics and computer science as well as graduate students in computer science and other quantitative fields.

2. I cover advanced topics that are traditionally not taught in a first course. For example, nonparametric regression, bootstrapping, density estimation and graphical models.

3. I have omitted topics in probability that do not play a central role in statistical inference. For example, counting methods are virtually absent.

4. In general, I try to avoid belaboring tedious calculations in favor of emphasizing concepts.

5. I cover nonparametric inference before parametric inference. This is the opposite of most statistics books but I believe it is the right way to do it. Parametric models are unrealistic and pedagogically unnatural. (How would we know everything about the distribution except for one or two parameters?) I introduce statistical functionals and bootstrapping very early and students find this quite natural.

6. I abandon the usual "First Term: Probability" and "Second Term: Statistics" approach. Some students only take the first half and it would be a crime if they did not see any statistical theory. Furthermore, probability is more engaging when students can see it put to work in the context of statistics.

7. The course moves very quickly and covers much material. My colleagues joke that I cover all of statistics in this course and hence the title. The course is demanding but I have worked hard to make the material as intuitive as possible so that the material is very understandable despite the fast pace. Anyway, slow courses are boring.

8. As Richard Feynman pointed out, rigor and clarity are not synonymous. I have tried to strike a good balance. To avoid getting bogged down in uninteresting technical details, many results are stated without proof. The bibliographic references at the end of each chapter point the student to appropriate sources.

9. On my website are files with R code which students can use for doing all the computing. However, the book is not tied to R and any computing language can be used.

The first part of the text is concerned with probability theory, the formal language of uncertainty which is the basis of statistical inference. The basic problem that we study in probability is:

Given a data generating process, what are the properties of the outcomes?

The second part of the book is about statistical inference and its close cousins, data mining and machine learning. The basic problem of statistical inference is the inverse of probability:

Given the outcomes, what can we say about the process that generated the data?

These ideas are illustrated in Figure 1.1. Prediction, classification, clustering and estimation are all special cases of statistical inference. Data analysis, machine learning and data mining are various names given to the practice of statistical inference, depending on the context.

[Figure 1.1: Probability and inference. A data generating process produces the observed data; probability reasons from the process to the data, while inference and data mining reason from the data back to the process.]

The second part of the book contains one more chapter on probability that covers stochastic processes including Markov chains.

I have drawn heavily on other books in many places. Most chapters contain a section called Bibliographic Remarks which serves to acknowledge my debt to other authors and to point readers to other useful references. I would especially like to mention the books by DeGroot and Schervish (2002) and Grimmett and Stirzaker (1982) from which I adapted many examples and exercises.

Statistics/Data Mining Dictionary

Statisticians and computer scientists often use different language for the same thing. Here is a dictionary that the reader may want to return to throughout the course.

Statistics                Computer Science         Meaning
estimation                learning                 using data to estimate an unknown quantity
classification            supervised learning      predicting a discrete Y from X
clustering                unsupervised learning    putting data into groups
data                      training sample          (X1, Y1), ..., (Xn, Yn)
covariates                features                 the Xi's
classifier                hypothesis               a map from covariates to outcomes
hypothesis                ---                      subset of a parameter space
confidence interval       ---                      interval that contains an unknown quantity with a prescribed frequency
directed acyclic graph    Bayes net                multivariate distribution with specified conditional independence relations
Bayesian inference        Bayesian inference       statistical methods for using data to update subjective beliefs
frequentist inference     ---                      statistical methods for producing point estimates and confidence intervals with guarantees on frequency behavior
large deviation bounds    PAC learning             uniform bounds on probability of errors

Notation

Symbol                       Meaning
R                            real numbers
inf_{x in A} f(x)            infimum: the largest number y such that y <= f(x) for all x in A (think of this as the minimum of f)
sup_{x in A} f(x)            supremum: the smallest number y such that y >= f(x) for all x in A (think of this as the maximum of f)
n!                           n(n-1)(n-2) ... 3 * 2 * 1
C(n, k) = n!/(k!(n-k)!)      "n choose k"
Gamma(alpha)                 gamma function, integral from 0 to infinity of y^(alpha-1) e^(-y) dy
omega                        outcome
Omega                        sample space (set of outcomes)
A                            event (subset of Omega)
I_A(omega)                   indicator function: 1 if omega is in A and 0 otherwise
P(A)                         probability of event A
|A|                          number of points in set A
F                            cumulative distribution function
f                            probability density (or mass) function
X ~ F                        X has distribution F
X ~ f                        X has density f
X =d Y                       X and Y have the same distribution
X1, ..., Xn ~ F              sample of size n from F
phi                          standard Normal probability density
Phi                          standard Normal distribution function
z_alpha                      upper alpha quantile of N(0,1), i.e. Phi^(-1)(1 - alpha)
E(X) = integral x dF(x)      expected value (mean) of random variable X
E(g(X)) = integral g(x)dF(x) expected value (mean) of g(X)
V(X)                         variance of random variable X
Cov(X, Y)                    covariance between X and Y
X1, ..., Xn                  data
n                            sample size
-->P                         convergence in probability
~~>                          convergence in distribution
-->qm                        convergence in quadratic mean
Xn ~ N(mu, sigma_n^2)        (Xn - mu)/sigma_n ~~> N(0,1)
statistical model F          a set of distribution functions, density functions or regression functions
theta                        parameter
theta-hat                    estimate of parameter
T(F)                         statistical functional (the mean, for example)
L_n(theta)                   likelihood function
x_n = o(a_n)                 x_n / a_n --> 0
x_n = O(a_n)                 |x_n / a_n| is bounded for large n
X_n = o_P(a_n)               X_n / a_n -->P 0
X_n = O_P(a_n)               |X_n / a_n| is bounded in probability for large n

Useful Math Facts

e^x = sum_{k=0}^infinity x^k / k!

a^n / n! --> 0 for fixed a, and lim_{n -> infinity} (1 + a/n)^n = e^a.

The Gamma function is defined by Gamma(alpha) = integral from 0 to infinity of y^(alpha-1) e^(-y) dy for alpha > 0. If alpha > 1 then Gamma(alpha) = (alpha - 1) Gamma(alpha - 1). If n is an integer then Gamma(n) = (n - 1)!. Some special values are: Gamma(1) = 1 and Gamma(1/2) = sqrt(pi).
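A quick numerical check of the Gamma function facts above (a minimal Python sketch for illustration; it is not part of the original text and uses only the standard library):

```python
import math

# Gamma(n) = (n - 1)! for positive integers n
for n in range(1, 6):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# Gamma(1/2) = sqrt(pi)
print(math.gamma(0.5), math.sqrt(math.pi))  # both ~1.7724538509
```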

Part I
Probability

2. Probability

2.1 Introduction

Probability is the mathematical language for quantifying uncertainty. We can apply probability theory to a diverse set of problems, from coin flipping to the analysis of computer algorithms. The starting point is to specify the sample space, the set of possible outcomes.

2.2 Sample Spaces and Events

The sample space Omega is the set of possible outcomes of an experiment. Points omega in Omega are called sample outcomes, realizations, or elements. Subsets of Omega are called events.

Example 2.1. If we toss a coin twice then Omega = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.

Example 2.2. Let omega be the outcome of a measurement of some physical quantity, for example, temperature. Then Omega = R = (-infinity, infinity). The event that the measurement is larger than 10 but less than or equal to 23 is A = (10, 23].

Example 2.3. If we toss a coin forever then the sample space is the infinite set
Omega = {omega = (omega_1, omega_2, omega_3, ...) : omega_i in {H, T}}.
Let E be the event that the first head appears on the third toss. Then
E = {(omega_1, omega_2, omega_3, ...) : omega_1 = T, omega_2 = T, omega_3 = H, omega_i in {H, T} for i > 3}.

Given an event A, let A^c = {omega in Omega : omega not in A} denote the complement of A. Informally, A^c can be read as "not A." The complement of Omega is the empty set. The union of events A and B is defined as A union B = {omega in Omega : omega in A or omega in B or omega in both}, which can be thought of as "A or B." If A1, A2, ... is a sequence of sets then
union_{i=1}^infinity Ai = {omega in Omega : omega in Ai for at least one i}.
The intersection of A and B is A intersect B = {omega in Omega : omega in A and omega in B}, read "A and B." Sometimes we write A intersect B as AB. If A1, A2, ... is a sequence of sets then
intersect_{i=1}^infinity Ai = {omega in Omega : omega in Ai for all i}.
Let A - B = {omega : omega in A, omega not in B}. If every element of A is also contained in B we write A subset B or, equivalently, B superset A. If A is a finite set, let |A| denote the number of elements in A. See Table 1 for a summary.

Table 1. Sample space and events.
Omega            sample space
omega            outcome
A                event (subset of Omega)
|A|              number of points in A (if A is finite)
A^c              complement of A (not A)
A union B        union (A or B)
A intersect B    intersection (A and B), also written AB
A - B            set difference (points in A that are not in B)
A subset B       set inclusion (A is a subset of or equal to B)
empty set        null event (always false)
Omega            true event (always true)

We say that A1, A2, ... are disjoint or mutually exclusive if Ai intersect Aj is empty whenever i != j. For example, A1 = [0, 1), A2 = [1, 2), A3 = [2, 3), ... are disjoint. A partition of Omega is a sequence of disjoint sets A1, A2, ... such that union_{i=1}^infinity Ai = Omega. Given an event A, define the indicator function of A by
I_A(omega) = I(omega in A) = 1 if omega in A, and 0 if omega not in A.
A sequence of sets A1, A2, ... is monotone increasing if A1 subset A2 subset ... and we define lim_n An = union_{i=1}^infinity Ai. A sequence of sets is monotone decreasing if A1 superset A2 superset ... and then we define lim_n An = intersect_{i=1}^infinity Ai. In either case, we will write An --> A.

Example 2.4. Let Omega = R and let Ai = [0, 1/i) for i = 1, 2, .... Then union_{i=1}^infinity Ai = [0, 1) and intersect_{i=1}^infinity Ai = {0}. If instead we define Ai = (0, 1/i) then union_{i=1}^infinity Ai = (0, 1) and intersect_{i=1}^infinity Ai is empty.

2.3 Probability

We want to assign a real number P(A) to every event A, called the probability of A. We also call P a probability distribution or a probability measure. (It is not always possible to assign a probability to every subset of the sample space; instead, probabilities are assigned to a limited class of sets called a sigma-field. See the technical appendix for details.) To qualify as a probability, P has to satisfy three axioms.

Definition 2.5. A function P that assigns a real number P(A) to each event A is a probability distribution or a probability measure if it satisfies the following three axioms:
Axiom 1: P(A) >= 0 for every A.
Axiom 2: P(Omega) = 1.
Axiom 3: If A1, A2, ... are disjoint then P(union_{i=1}^infinity Ai) = sum_{i=1}^infinity P(Ai).

There are many interpretations of P(A). The two common interpretations are frequencies and degrees of belief. In the frequency interpretation, P(A) is the long run proportion of times that A is true in repetitions. For example, if we say that the probability of heads is 1/2, we mean that if we flip the coin many times then the proportion of times we get heads tends to 1/2 as the number of tosses increases. An infinitely long, unpredictable sequence of tosses whose limiting proportion tends to a constant is an idealization, much like the idea of a straight line in geometry. The degree-of-belief interpretation is that P(A) measures an observer's strength of belief that A is true. In either interpretation, we require that Axioms 1 to 3 hold. The difference in interpretation will not matter much until we deal with statistical inference. There, the differing interpretations lead to two schools of inference: the frequentist and the Bayesian schools. We defer discussion until later.
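The frequency interpretation is easy to see empirically. Here is a minimal Python sketch (an illustration only, not part of the original text) that flips a simulated fair coin and watches the running proportion of heads settle near 1/2:

```python
import random

random.seed(0)
n = 10_000
heads = 0
for i in range(1, n + 1):
    heads += random.random() < 0.5        # one simulated fair coin flip
    if i in (10, 100, 1_000, 10_000):
        print(i, heads / i)               # running proportion of heads
```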

One can derive many properties of P from the axioms. Here are a few:

P(empty set) = 0
A subset B implies P(A) <= P(B)
0 <= P(A) <= 1
P(A^c) = 1 - P(A)
A intersect B empty implies P(A union B) = P(A) + P(B).     (2.1)

A less obvious property is given in the following lemma.

Lemma 2.6. For any events A and B, P(A union B) = P(A) + P(B) - P(AB).

Proof. Write A union B = (A B^c) union (AB) union (A^c B) and note that these events are disjoint. Hence, making repeated use of the fact that P is additive for disjoint events, we see that
P(A union B) = P(A B^c) + P(AB) + P(A^c B)
             = P(A B^c) + P(AB) + P(A^c B) + P(AB) - P(AB)
             = P((A B^c) union (AB)) + P((A^c B) union (AB)) - P(AB)
             = P(A) + P(B) - P(AB).

Example 2.7. Two coin tosses. Let H1 be the event that heads occurs on toss 1 and let H2 be the event that heads occurs on toss 2. If all outcomes are equally likely, that is, P({H1, H2}) = P({H1, T2}) = P({T1, H2}) = P({T1, T2}) = 1/4, then P(H1 union H2) = P(H1) + P(H2) - P(H1 H2) = 1/2 + 1/2 - 1/4 = 3/4.

Theorem 2.8 (Continuity of Probabilities). If An --> A then P(An) --> P(A) as n --> infinity.

� ®rÜZ"��׶Z¾� K � 7 F �8� l �������+7E5�/Z1B5bP=1 / ; 7OI�� F 7EA � A=1 � 1 /Z1 � 3I��� �B�A¼ � � ¼Â F¬%�­© $ZÏ ¼ �#Ò%� $\Ï ¼ Ò ¤¨£� � Ñ ¶

��� �!�#" $�<G?(N5Na &0Q4R+325$8+ ¼ �¦'*0\6� #=5 8+3 #=54H'*=59I�Q4;$80Q'*=(>¦0Q `+325$8+ ¼ â GÓ¼ ã G ����� KB�4I+ ¼ ® )*':6���� 6 ¼ � ® / 6æ98aâ ¼ æ K&%R4 -%=54 +/â® ¼ â­V�+Pã}® °;� ç { ëê� ç¼ ã�²!� "ç ¼ âeµEV�+Hä�® °;� ç { ë � ç ¼ ä�²!� "ç ¼ ã�²!� "ç ¼ âeµE²�å;å;å4$ +9;$&=ì"d40Q25 2T= +32%$¨+K+�â­² +Hã�²;å;å;åa$&�Q4`S('*0��! &'*=<+;V ¼ �b® / �æ98aâ ¼ æ�® / �æ 8aâ +cæD@B #�R4�$&9J2 � $&=5S/ 6æ98aâ +Pæ ® / 6æ98aâ ¼ æ«KcÏ <G4I4M4e~(9I4��!9�'*0!4 � K Ò%`%�Q #6 �c~G'* &6 Ü(V

$\Ï ¼ �#Ò�® $]o �7æ98aâ +Pæ p® �r æ98aâ $�Ï +cæ�Ò

$&=5S|254�=(9�4&V×?50!'*=5> �T~(': #6 Ü�$&><$8'*= V):'*6��� 6 $\Ï ¼ �#Ò�® ):'*6� � 6

�r æ98aâ $\Ï +PæFÒ�® 6r æ98aâ $\Ï +PæFÒ�® $ o 67æ98aâ +cæ p ® $\Ï ¼ Ò­å}¾��(' �Â���! mg� ��;k��I�#" �� )¯�; ��I�xl fg¦h j�kIl f�jmg¦�TlOn<G?5N5Nd #0Q4¦+Q2%$8+R+Q254¦0Q$&67N()*4i0!N%$&9I4Ã{ ®á°;��âe²;å�å;å�²!���[µÃ'*0�-5=5':+Q4&K�`% #�P4­~($&6�N5)*48V':@ 2\4/+3 &0Q0c$�S('*4R+ 2T'*9I4&V5+Q254�=}{r2%$&0cÜ *Ã4�):4�6�4�=<+30IëA{¯®½°GÏBíQ² ^[Ò !cíQ²V^bç�° � ²;å�å;å+*[µ<µEK

If each outcome is equally likely, then P(A) = |A|/36 where |A| denotes the number of elements in A. The probability that the sum of the dice is 11 is 2/36 since there are two outcomes that correspond to this event. In general, if Omega is finite and if each outcome is equally likely, then
P(A) = |A| / |Omega|,
which is called the uniform probability distribution. To compute probabilities, we need to count the number of points in an event A. Methods for counting points are called combinatorial methods. We needn't delve into these in any great detail. We will, however, need a few facts from counting theory that will be useful later. Given n objects, the number of ways of ordering these objects is n! = n(n-1)(n-2) ... 3 * 2 * 1. For convenience, we define 0! = 1. We also define
C(n, k) = n! / (k!(n - k)!),     (2.2)
read "n choose k," which is the number of distinct ways of choosing k objects from n. For example, if we have a class of 20 people and we want to select a committee of 3 students, then there are
C(20, 3) = 20!/(3! 17!) = (20 * 19 * 18)/(3 * 2 * 1) = 1140
possible committees. We note the following properties: C(n, 0) = C(n, n) = 1 and C(n, k) = C(n, n - k).
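The committee count and the symmetry property can be checked directly; a tiny Python sketch (illustrative only):

```python
import math

print(math.comb(20, 3))                        # 1140 possible committees
print(math.comb(20, 3) == math.comb(20, 17))   # C(n, k) = C(n, n - k)
```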

2.5 Independent Events

If we flip a fair coin twice, then the probability of two heads is (1/2) x (1/2). We multiply the probabilities because we regard the two tosses as independent. The formal definition of independence is as follows.

Definition 2.9. Two events A and B are independent if
P(AB) = P(A)P(B)     (2.3)
and we write A independent of B. A set of events {Ai : i in I} is independent if
P(intersect_{i in J} Ai) = product_{i in J} P(Ai)
for every finite subset J of I.

Independence can arise in two distinct ways. Sometimes, we explicitly assume that two events are independent. For example, in tossing a coin twice, we usually assume the tosses are independent, which reflects the fact that the coin has no memory of the first toss. In other instances, we derive independence by verifying that P(AB) = P(A)P(B) holds. For example, in tossing a fair die, let A = {2, 4, 6} and let B = {1, 2, 3, 4}. Then A intersect B = {2, 4}, P(AB) = 2/6 = P(A)P(B) = (1/2) x (2/3), and so A and B are independent. In this case, we didn't assume that A and B are independent; it just turned out that they were.
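The die calculation can be verified by brute-force enumeration of the six equally likely outcomes (a small Python sketch, for illustration):

```python
from fractions import Fraction

omega = range(1, 7)          # outcomes of a fair die
A = {2, 4, 6}
B = {1, 2, 3, 4}

def prob(event):
    # uniform probability: |event| / |omega|
    return Fraction(sum(1 for w in omega if w in event), 6)

print(prob(A & B) == prob(A) * prob(B))   # True: A and B are independent
```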

Suppose that A and B are disjoint events, each with positive probability. Can they be independent? No. This follows since P(A)P(B) > 0 yet P(AB) = P(empty set) = 0. Except in this special case, there is no way to judge independence by looking at the sets in a Venn diagram.

Example 2.10. Toss a fair coin 10 times. Let A = "at least one head." Let Tj be the event that tails occurs on the jth toss. Then
P(A) = 1 - P(A^c) = 1 - P(all tails) = 1 - P(T1 T2 ... T10)
     = 1 - P(T1)P(T2) ... P(T10)   (using independence)
     = 1 - (1/2)^10 which is approximately .999.

Example 2.11. Two people take turns trying to sink a basketball into a net. Person 1 succeeds with probability 1/3 while person 2 succeeds with probability 1/4. What is the probability that person 1 succeeds before person 2? Let E denote the event of interest. Let Aj be the event that the first success is by person 1 and that it occurs on trial number j. Note that A1, A2, ... are disjoint and that E = union_{j=1}^infinity Aj. Hence,
P(E) = sum_{j=1}^infinity P(Aj).
Now, P(A1) = 1/3. A2 occurs if we have the sequence "1 misses, 2 misses, 1 succeeds." This has probability P(A2) = (2/3)(3/4)(1/3) = (1/2)(1/3). Following this logic we see that P(Aj) = (1/2)^(j-1) (1/3). Hence,
P(E) = sum_{j=1}^infinity (1/3)(1/2)^(j-1) = (1/3) sum_{j=1}^infinity (1/2)^(j-1) = 2/3.
Here we used the fact that, if 0 < r < 1, then sum_{j=k}^infinity r^j = r^k / (1 - r).

Summary of Independence
1. A and B are independent if P(AB) = P(A)P(B).
2. Independence is sometimes assumed and sometimes derived.
3. Disjoint events with positive probability are not independent.

2.6 Conditional Probability

Assuming that P(B) > 0, we define the conditional probability of A given that B has occurred as follows.

Definition 2.12. If P(B) > 0 then the conditional probability of A given B is
P(A | B) = P(AB) / P(B).     (2.4)

Think of P(A | B) as the fraction of times A occurs among those in which B occurs. Here are some facts about conditional probabilities. For any fixed B such that P(B) > 0, P(. | B) is a probability, i.e. it satisfies the three axioms of probability. In particular, P(A | B) >= 0, P(Omega | B) = 1, and if A1, A2, ... are disjoint then P(union_{i=1}^infinity Ai | B) = sum_{i=1}^infinity P(Ai | B). But it is in general not true that P(A | B union C) = P(A | B) + P(A | C). The rules of probability apply to events on the left of the bar. In general it is not the case that P(A | B) = P(B | A). People get this confused all the time. For example, the probability of spots given you have measles is 1 but the probability that you have measles given that you have spots is not 1. In this case, the difference between P(A | B) and P(B | A) is obvious but there are cases where it is less obvious. This mistake is made often enough in legal cases that it is sometimes called the prosecutor's fallacy.

Example 2.13. A medical test for a disease D has outcomes + and -. The probabilities are:

           D        D^c
  +      .0081    .0900
  -      .0009    .9010

From the definition of conditional probability,
P(+ | D) = P(+, D)/P(D) = .0081/(.0081 + .0009) = .9
and
P(- | D^c) = P(-, D^c)/P(D^c) = .9010/(.9010 + .0900) which is approximately .9.
Apparently, the test is fairly accurate. Sick people yield a positive 90 percent of the time and healthy people yield a negative about 90 percent of the time. Suppose you go for a test and get a positive. What is the probability you have the disease? Most people answer .90. The correct answer is
P(D | +) = P(+, D)/P(+) = .0081/(.0081 + .0900) which is approximately .08.
The lesson here is that you need to compute the answer numerically. Don't trust your intuition.
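Since intuition fails here, it is worth doing the arithmetic by machine as well. A short Python sketch of the computation (illustrative only; it simply reuses the joint probabilities from the table above):

```python
# joint probabilities P(test outcome, disease status) from the table
p_pos_D, p_pos_Dc = 0.0081, 0.0900
p_neg_D, p_neg_Dc = 0.0009, 0.9010

p_D   = p_pos_D + p_neg_D      # P(D)
p_pos = p_pos_D + p_pos_Dc     # P(+)

print(p_pos_D / p_D)                        # P(+ | D)   = 0.9
print(p_neg_Dc / (p_neg_Dc + p_pos_Dc))     # P(- | D^c) ~ 0.909
print(p_pos_D / p_pos)                      # P(D | +)   ~ 0.08
```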

$\Ï ¼ L +`Ò�® $ZÏ ¼ +`Ò$�Ï +`Ò ® $ZÏ ¼ Ò $ZÏ +`Ò$ZÏ +`Ò ® $\Ï ¼ Ò­å<G `$&=( &+3254I�\':=E+Q4��!N5�Q4­+J$8+Q'* #=7 &@�'*=(S54�Nd4�=5S(4�=59I4O'*0�+Q2%$8+ �E=5 2T'*=5>%+ S5 G4I0Q=�� +�9J2%$8=5>#4+3254/N5�! #"%$&"('*)*',+.-b 8@ ¼ K

`×�! #6 +Q254 S54.-5=5':+Q'* #=Ó &@�9I #=5S5',+3'* &=%$&)HN(�Q #"%$8"5'*):':+.- 2\4 9�$&= 2T�!':+Q4 $ZÏ ¼ +`Ò�®$\Ï ¼ L +`Ò?$\Ï +`Òc$8=5S�$&):0Q $ZÏ ¼ +`Ò\®\$\Ï +'L ¼ Ò $�Ï ¼ Ò­K � @F+34�= Vd+Q254�0!4i@B #�Q6i?5) $&4/>#',U#4�?50$`9� #=<U#4I=5'*4I=<+�2\$�-+Q Ã9� #6�N5?(+Q4 $ZÏ ¼ +`Ò�2T254I= ¼ $&=(S +á$&�Q4O=( &+c'*=5S54INa4I=5S54I=E+�K

Page 22: All of Statistics: A Concise Course in Statistical Inference Brief

Ü Ú ������������ ������������������������

�E�&�<���<� �/�5� � � � º!¤&�  ª�\¡i¥Q¤&ºQ»¨£ �eºJ¡¨Å ¤¦»[�J¥ ��Ë1��§F B¬%¡¨Ä[ 1ºJ�.ÆdÈ,¤[¥J�­Åb�­©%  ¶ À��­ a¼ Á �T B¬%��­¸&�­©% � F¬(¤& 1 F¬%��¹�º3£­ ]»&º!¤&��§*£ eH¥J�H¡¢���DÈ Ä Á £c¤&©d»/È*�­  + Á �T F¬%�H�­¸&�­©% � F¬(¤& 1 F¬%�Z£��J¥J¡¨©d»»&º!¤&��§*£��ZÄ×�J�­©�¡¢� � §�¤&Åb¡¨©d»¨£ ¶M· ¬%�­© $\Ï ¼ ² +`Ò�® $ZÏ ¼ Ò $ZÏ +,L ¼ Ò�®�Ï � " �#Û#Ò � Ï � " � � Ò�¶¾

<[?56767$&�_-� &@ �\ #=5S5',+3': #=%$&) ���! #"%$&"('*)*',+.-� K�$ @ $�Ï +`ÒZî Ú +Q254�= $\Ï ¼ L +¦Ò�® $\Ï ¼ +¦Ò$�Ï +`Ò åÛ[K2$\Ï � L +¦Ò�0Q$8+3':0 -%4I0]+32(4Z$¨~G'* &6701 &@%N5�! #"%$&"('*)*',+.-#V�@B #��-(~G4�Sa+bK$.=¦>#4I=54��Q$&)«VM$ZÏ ¼ L � ÒS( G4I0c=( &+c03$8+Q'*0!@F-7+3254�$¨~G': #670� 8@xN5�! #"%$&"('*)*',+.-#VE@B #�N-(~G4�S ¼ KÜGK�$.=|>#4I=54��Q$&)«VE$\Ï ¼ L +¦Ò�]® $\Ï +,L ¼ Ò­K�(K ¼ $&=(S,+á$&�!4�':=5S54INa4I=5S54�=<+c':@x$&=5S| #=5),-�',@�$\Ï ¼ L +¦Ò�® $\Ï +`ÒeK

2.7 Bayes' Theorem

Bayes' theorem is a useful result that is the basis of "expert systems" and "Bayes' nets." First, we need a preliminary result.

Theorem 2.15 (The Law of Total Probability). Let A1, ..., Ak be a partition of Omega. Then, for any event B,
P(B) = sum_{i=1}^k P(B | Ai) P(Ai).

Proof. Define Cj = B Aj and note that C1, ..., Ck are disjoint and that B = union_{j=1}^k Cj. Hence,
P(B) = sum_j P(Cj) = sum_j P(B Aj) = sum_j P(B | Aj) P(Aj)
since P(B Aj) = P(B | Aj) P(Aj) from the definition of conditional probability.

Theorem 2.16 (Bayes' Theorem). Let A1, ..., Ak be a partition of Omega such that P(Ai) > 0 for each i. If P(B) > 0 then, for each i = 1, ..., k,
P(Ai | B) = P(B | Ai) P(Ai) / sum_j P(B | Aj) P(Aj).     (2.5)

Remark 2.17. We call P(Ai) the prior probability of A and P(Ai | B) the posterior probability of A.

Proof. We apply the definition of conditional probability twice, followed by the law of total probability:
P(Ai | B) = P(Ai B)/P(B) = P(B | Ai) P(Ai)/P(B) = P(B | Ai) P(Ai) / sum_j P(B | Aj) P(Aj).

Example 2.18. I divide my email into three categories: A1 = "spam," A2 = "low priority" and A3 = "high priority." From previous experience I find that P(A1) = .7, P(A2) = .2 and P(A3) = .1. Of course, .7 + .2 + .1 = 1. Let B be the event that the email contains the word "free." From previous experience, P(B | A1) = .9, P(B | A2) = .01, P(B | A3) = .01. (Note: .9 + .01 + .01 is not equal to 1.) I receive an email with the word "free." What is the probability that it is spam? Bayes' theorem yields
P(A1 | B) = (.9 x .7) / ((.9 x .7) + (.01 x .2) + (.01 x .1)) = .995.
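The same computation in a few lines of Python (an illustrative sketch, not part of the original text), which also returns the posterior for all three categories at once:

```python
priors = {"spam": 0.7, "low priority": 0.2, "high priority": 0.1}         # P(Ai)
likelihoods = {"spam": 0.9, "low priority": 0.01, "high priority": 0.01}  # P(B | Ai)

# law of total probability gives P(B); Bayes' theorem gives the posteriors
evidence = sum(priors[c] * likelihoods[c] for c in priors)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
print(posteriors)   # spam ~ 0.995
```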

2.8 Bibliographic Remarks

The material in this chapter is standard. Details can be found in any number of books. At the introductory level, there is DeGroot and Schervish (2002); at the intermediate level, Grimmett and Stirzaker (1982) and Karr (1993); and at the advanced level, Billingsley (1979) and Breiman (1992). I adapted many examples and problems from DeGroot and Schervish (2002) and Grimmett and Stirzaker (1982).

2.9 Technical Appendix

Generally, it is not feasible to assign probabilities to all subsets of a sample space Omega. Instead, one restricts attention to a set of events called a sigma-algebra or a sigma-field, which is a class A that satisfies:
(i) the empty set is in A,
(ii) if A1, A2, ... are in A then union_{i=1}^infinity Ai is in A, and
(iii) A in A implies that A^c is in A.
The sets in A are said to be measurable. We call (Omega, A) a measurable space. If P is a probability measure defined on A, then (Omega, A, P) is called a probability space. When Omega is the real line, we take A to be the smallest sigma-field that contains all the open subsets, which is called the Borel sigma-field.

2.10 Exercises

1. Fill in the details of the proof of Theorem 2.8. Also, prove the monotone decreasing case.

2. Prove the statements in equation (2.1).

3. Let Omega be a sample space and let A1, A2, ... be events. Define Bn = union_{i=n}^infinity Ai and Cn = intersect_{i=n}^infinity Ai.
(a) Show that B1 superset B2 superset ... and that C1 subset C2 subset ....
(b) Show that omega is in intersect_{n=1}^infinity Bn if and only if omega belongs to an infinite number of the events A1, A2, ....
(c) Show that omega is in union_{n=1}^infinity Cn if and only if omega belongs to all the events A1, A2, ... except possibly a finite number of those events.

4. Let {Ai : i in I} be a collection of events where I is an arbitrary index set. Show that
(union_{i in I} Ai)^c = intersect_{i in I} Ai^c   and   (intersect_{i in I} Ai)^c = union_{i in I} Ai^c.
Hint: First prove this for I = {1, ..., n}.

5. Suppose we toss a fair coin until we get exactly two heads. Describe the sample space S. What is the probability that exactly k tosses are required?

6. Let Omega = {0, 1, ...}. Prove that there does not exist a uniform distribution on Omega, i.e. if P(A) = P(B) whenever |A| = |B|, then P cannot satisfy the axioms of probability.

7. Let A1, A2, ... be events. Show that
P(union_{n=1}^infinity An) <= sum_{n=1}^infinity P(An).
Hint: Define Bn = An - union_{i=1}^{n-1} Ai. Then show that the Bn are disjoint and that union_{n=1}^infinity An = union_{n=1}^infinity Bn.

8. Suppose that P(Ai) = 1 for each i. Prove that P(intersect_{i=1}^infinity Ai) = 1.

9. For fixed B such that P(B) > 0, show that P(. | B) satisfies the axioms of probability.

10. You have probably heard it before. Now you can solve it rigorously. It is called the "Monty Hall Problem." A prize is placed at random behind one of three doors. You pick a door. To be concrete, let's suppose you always pick door 1. Now Monty Hall chooses one of the other two doors, opens it and shows you that it is empty. He then gives you the opportunity to keep your door or switch to the other unopened door. Should you stay or switch? Intuition suggests it doesn't matter. The correct answer is that you should switch. Prove it. It will help to specify the sample space and the relevant events carefully. Thus write Omega = {(omega_1, omega_2) : omega_i in {1, 2, 3}} where omega_1 is where the prize is and omega_2 is the door Monty opens.

11. Suppose that A and B are independent events. Show that A^c and B^c are independent events.

12. There are three cards. The first is green on both sides, the second is red on both sides and the third is green on one side and red on the other. We choose a card at random and we see one side (also chosen at random). If the side we see is green, what is the probability that the other side is also green? Many people intuitively answer 1/2. Show that the correct answer is 2/3.

13. Suppose that a fair coin is tossed repeatedly until both a head and tail have appeared at least once.
(a) Describe the sample space Omega.
(b) What is the probability that three tosses will be required?

14. Show that if P(A) = 0 or P(A) = 1 then A is independent of every other event. Show that if A is independent of itself then P(A) is either 0 or 1.

15. The probability that a child has blue eyes is 1/4. Assume independence between children. Consider a family with 5 children.
(a) If it is known that at least one child has blue eyes, what is the probability that at least three children have blue eyes?
(b) If it is known that the youngest child has blue eyes, what is the probability that at least three children have blue eyes?

16. Show that
P(ABC) = P(A | BC) P(B | C) P(C).

17. Suppose k events form a partition of the sample space Omega, i.e. they are disjoint and union_{i=1}^k Ai = Omega. Assume that P(B) > 0. Prove that if P(A1 | B) < P(A1) then P(Ai | B) > P(Ai) for some i = 2, ..., k.

18. Suppose that 30 percent of computer owners use a Macintosh, 50 percent use Windows and 20 percent use Linux. Suppose that 65 percent of the Mac users have succumbed to a computer virus, 82 percent of the Windows users get the virus and 50 percent of the Linux users get the virus. We select a person at random and learn that her system was infected with the virus. What is the probability that she is a Windows user?

19. A box contains 5 coins and each has a different probability of showing heads. Let p1, ..., p5 denote the probability of heads on each coin. Suppose that p1 = 0, p2 = 1/4, p3 = 1/2, p4 = 3/4 and p5 = 1. Let H denote "heads is obtained" and let Ci denote the event that coin i is selected.
(a) Select a coin at random and toss it. Suppose a head is obtained. What is the posterior probability that coin i was selected (i = 1, ..., 5)? In other words, find P(Ci | H) for i = 1, ..., 5.
(b) Toss the coin again. What is the probability of another head? In other words find P(H2 | H1) where Hj = "heads on toss j."
Now suppose that the experiment was carried out as follows: We select a coin at random and toss it until a head is obtained.
(c) Find P(Ci | B4) where B4 = "first head is obtained on toss 4."

20. (Computer Experiment.) Suppose a coin has probability p of falling heads up. If we flip the coin many times, we would expect the proportion of heads to be near p. We will make this formal later. Take p = .3 and n = 1,000 and simulate n coin flips. Plot the proportion of heads as a function of n. Repeat for p = .03.

21. (Computer Experiment.) Suppose we flip a coin n times and let p denote the probability of heads. Let X be the number of heads. We call X a binomial random variable, which is discussed in the next chapter. Intuition suggests that X will be close to np. To see if this is true, we can repeat this experiment many times and average the X values. Carry out a simulation and compare the average of the X's to np. Try this for p = .3 and n = 10, 100, 1000.

22. (Computer Experiment.) Here we will get some experience simulating conditional probabilities. Consider tossing a fair die. Let A = {2, 4, 6} and B = {1, 2, 3, 4}. Then, P(A) = 1/2, P(B) = 2/3 and P(AB) = 1/3. Since P(AB) = P(A)P(B), the events A and B are independent. Simulate draws from the sample space and verify that the empirical proportion of AB equals the product of the empirical proportions of A and B. Now find two events A and B that are not independent. Compute the empirical proportions of A, B and AB and compare the calculated values to their theoretical values. Report your results and interpret.

3. Random Variables

3.1 Introduction

Statistics and data mining are concerned with data. How do we link sample spaces and events to data? The link is provided by the concept of a random variable.

Definition 3.1. A random variable is a mapping X : Omega --> R that assigns a real number X(omega) to each outcome omega. (Technically, a random variable must be measurable; see the technical appendix for details.)

At a certain point in most probability courses, the sample space is rarely mentioned and we work directly with random variables. But you should keep in mind that the sample space is really there, lurking in the background.

Example 3.2. Flip a coin ten times. Let X(omega) be the number of heads in the sequence omega. For example, if omega = HHTHHTHHTT, then X(omega) = 6.

Example 3.3. Let Omega = {(x, y) : x^2 + y^2 <= 1} be the unit disc. Consider drawing a point at random from Omega. (We will make this idea more precise later.) A typical outcome is of the form omega = (x, y). Some examples of random variables are X(omega) = x, Y(omega) = y, Z(omega) = x + y, W(omega) = sqrt(x^2 + y^2).

Given a random variable X and a subset A of the real line, define X^{-1}(A) = {omega in Omega : X(omega) in A} and let
P(X in A) = P(X^{-1}(A)) = P({omega in Omega : X(omega) in A})
P(X = x) = P(X^{-1}(x)) = P({omega in Omega : X(omega) = x}).
X denotes the random variable and x denotes a possible value of X.

Example 3.4. Flip a coin twice and let X be the number of heads. Then P(X = 0) = P({TT}) = 1/4, P(X = 1) = P({HT, TH}) = 1/2 and P(X = 2) = P({HH}) = 1/4. The random variable and its distribution can be summarized as follows:

omega   P({omega})   X(omega)
TT         1/4          0
TH         1/4          1
HT         1/4          1
HH         1/4          2

x      P(X = x)
0        1/4
1        1/2
2        1/4

Try to generalize this to n flips.

���FE G .IH�$á& .7J+*`$'./(0" K *+"+,-$'./(0"LHNMÐ"+) O &'(?JPMQJ+.IRµ.>$TS K *+"+, U$'./(0"LH

�Ö7RaOK>?g4%J<4@?HABNOEYa64@JL7m4@iB[:K � ÒæPZKwABK��D?HKÔ46?+7:E0] NOJ;254@?c2Wl�ÑH?H=b2<7:N@?+=µ4@[G[:K/Aî2<SHK=/ÑBEÍÑH[m4�2<7Ga@KFAB7:9L28J87GiHÑB2<7GNO?gl�ÑH?H=>287:NO? ¥ NOJhAH7:9;2<JL7:iHÑB287:NO?gl�ÑH?H=b2<7GNO? § 7:?�28SHKFl�N@[:[:N PQ7G?HIPZ4 j@T

n'obp qOs t<s³uBqÍvHxWVX(°�D�ZY\[��][����_^X��� �_���a`b^ }����c[d^X��� �feg[��ZYh^X��� �ji>kZlnmZo �O�ì�p *Böµü7qQ©ªâ^���L�@�rã3©�� ���@�È���c£/ :�-� �:�Ðã3�srÜ�æ�8ã%£ �

mZoÖ¥�õ §Üé # ¥�� ûÛõ §%$

Page 31: All of Statistics: A Concise Course in Statistical Inference Brief

�����_� � � "!���� ���8Z�%�8�������Q��Z�%����"B����� � � ��� � � � �d�7�������Q��Z�%����" #�

* ü 1

ü

�� �

mZo¨¥�õ §

õT 1� T $

� 7GIOÑHJ8K # T ü@� i>k�l l�NOJ��D7:]B]H7:?HIÔ4Ô=/N@7:?k2­PQ7:=>K ¥���� 46EF]H[GK # T ð T §

� NOÑ`E07:IOSc2QPÓNO?HAHK>JïPQScjîPÓKÐirN@28SHK/JQ2<NFAHK��D?BKh2<SBK i>kZl T � NOÑîPQ7G[:[¡98K>KÐ[:462<K>J2<SD4�2W28SHK i>kZl 7:9Q4ÔÑH9LK>l�ÑH[°l�ÑH?B=>2<7GNO? � 7R2QK�� K/=>287Ga@K/[Gj%=/NO?c25467:?H9W4@[G[ 2<SBK^7G?Bl�NOJLE�46287:NO?4@irNOÑB2-2<SBK^J84@?HAHN@E�a64@J87:4@iH[:K6T×½Ë@´cÀhÊc¶ o^vHx��ÛÚÜ Ý�·�Þ�Öâ>�@���Ô«5©�������ÿ'� «5�0�@�rãk :�b�á� £b�w���D�Í�D¢3�0£b�b�F©ªâÖ�D�8�cã��/® (°�D�b�# ¥���é *c§'é # ¥�� é 1O§'é�ü-,2.0�@�rã # ¥�� éü § éôü-,�1H® (°�D�¤ã@�:�b�Â�È��£/¢3�Â� ©��QâÈ¢3�æ«b�Â� ©���:�

mZo_¥�õ §Üé���� ���* õ�� *ü0,2. *ÔûCõ�ñü# ,2. ü^ûCõ� 1ü õ�� 1�$

(°�D�?i>kZl �:�%�5�D©�ÿ'� ����ÚÜ�R�@¢¬�<�! ½® 9c®�{d Ý���D©�¢@���ê���3�:�M�Lç½�@�d�r :�k�:���b���d�r :�5è¨�b�¢Bã ����_«8�@�5��âÈ¢3 �  �@® iTk�l#" �%«8�@�º£b� �@�b� �M«5©��µâÈ¢½�b���3�c®%$^©��Â� «5�w���B�@�Q���D�¤âÈ¢3�æ«b�Â� ©��º�:�Í�È�R���3�«5©��D�Â���D¢D©�¢c�5èÖ�æ©��'&;ã3�5«b�<�8���b���3�+�@�rã+���B�@�¨���_�:�kã3�srÜ�æ�8ãÐ⵩��k�@ � #õ� �@�b�����D©�¢@���|���D��L�@�rã3©�� ���@�È���c£/ :�Í©��D  �F� ��c�5� ���@ Ý¢D�5��*Böµü%�@�rã 1H®�(Í© �c©�¢�� �5�^ÿ�� � mF¥;ü�$W.½§'é $ $ eò

f-SHKwl�NO[:[GNXPQ7G?HI�JLK/9LÑH[G2/ÒæPQSH7:=5S+PÓKÔAHNk?HN@2¨]BJ8N aOK6Ò�9LSHN PQ9¨2<SD462¨28SHK i>kZl =>NOE�)]H[:Kb2<K>[Gj%ABK>2<K>J8E07:?HK>9-2<SHKÖAH7G9L2<JL7:iHѬ2<7:N@?kN@lá4wJ84@?HAHNOE�a@46J87m46iH[:K6T¯W²Xo>uc»mo6À+*-,/.Và#�b�'� �B� �@� i>k�l m~�@�rã  :�b� ���B� �@�Bi>kZl 0F®%1�â mF¥�õ §ïé20�¥�õ §âµ©��h�@ � Bõ����D�b� # ¥�� ! � §'é # ¥ ��! � §Ü⵩��^�@ �  � ® ¯°o/±�²cqOs³±µ´c¶·¶ ¸/¹43'oÍuBqc¶ ¸

²�´XÁXoVt5²�´ t # ¥�� !� §ôé # ¥ � ! � §Ì�u½»½obÁXo6»G¸dÀïoµ´XÇbÆc»m´½Äc¶ oo>ÁXo6q t � x

¯W²Xo>uc»mo6À+*-,65V{�âÈ¢3�æ«b�Â� ©�� m��0�b�c�r���3�Í���D�_�<�8�@ å Ý���æ�_�­© p *Bö/ü7qá�:�¨� iTk�l¼âµ©��W� ©�����r�<©@£5�c£/�� Ý��� �Í���8���b¢3�<� # � â¨�@�rãF©��D  �Í� âï�����/�@���:�sr¤�5�d���D��⵩� � :©�ÿ'���3�w���3�<�5�h«5©��rã@����� ©��B�76

Page 32: All of Statistics: A Concise Course in Statistical Inference Brief

.\* ��������� ����������������� ����������� �! "

� � m¼�:�^�æ©��'&;ã3�5«b�5�8���b���3�w��®��/®Óõ � � õ ù ���d�r Ý� �5�¨���B�@��mF¥�õ � §¤û mF¥�õ ù §>®� ��� Bm �:�^�æ©��b�0�@ Ý�54��8ã\6 [:7GE�� � ��� mF¥�õ §Üé *+�@�rã [:7GE�� � � m�¥�õ § éôü½®� ����� Bm �:�Ö�È�R���3� &8«5©��D�����D¢å©�¢c�5è ��®��/® mF¥�õ §Üé mF¥�õ��#§á⵩��h�@ � Hõ°è ÿ��D�b�<�

mF¥�õ � § é [:7:E���� �� � mF¥�÷B§%$

������� l�� 1¬ÑH]H]rNO98Kw28SD462 m 7G9Ö4 i>kZl T��#Kb2ÖÑH9Ö98SHN P�2<SD462 ¥ 7:7G7 § SHNO[:AB9/T��¡Kb2 õirKÍ4FJLKµ4@[�?½ÑHEÍirK/Jd4@?HA`[GK>2 ÷ � ö8÷ ù öI$7$I$ i KÐ4F98K��½ÑHK/?H=>KwN@l'JLKµ4@[�?½ÑHEÍirK/J89d98ÑH=5SM28SD462÷ ��� ÷ ù ��� � � 4@?HA�[G7:E�! ÷ ! éCõ T"�#Kb2 �#! é ¥%$�&ñö<÷ ! q 46?HAî[GK>2 � é ¥%$�&ñöLõ q T('WN@2<K2<SH462 � é*) �!,+ � �#!�4@?HAî4@[G98N0?HN@2<K¨28SD462 � �.- � ù -/� � � TZÕZK>=µ4@ÑB98Kh28SHKÖK>a@K/?c2<9¨46J8KE0NO?HN@28NO?HK@ÒB[:7:E0! # ¥ �#! § é # ¥ ) ! �1! § T f-S½ÑH9>Ò

m�¥�õ § é # ¥ � §'é #32�4 ! �#!65 é [G7:E! # ¥ �1! §Üé [G7:E! m�¥�÷ ! §'é m�¥�õ � §%$1¬SHN PQ7G?HI ¥ 7 § 4@?HA ¥ 7G7 § 7G9h9L7:E07:[m46J/T�7'J8N a37:?HI 2<SHK0N@28SHK/JÖAH7GJ8K/=b2<7GNO?�?H4@EFK>[Gj@Ò#28SD462h7Glm 98462<7G9 �DK>9 ¥ 7 § Ò ¥ 7G7 § 4@?BA ¥ 7:7:7 § 28SHK/? 7G2Q7G9Q4 i>kZl l�N@JW9LNOE0K¨J<4@?HABNOE�a64@J87:4@iH[GK@ÒBÑB98K/998N@EFK¨AHK>K/] 2<N3NO[:9¤7G? 4@?D4@[Rj¬9L7:9>T ò8 Ç;o>t s³Ç�±/uBÆcq t;´cÄc¶ o

s Ì s t s Ç prqOs tªo u½»s t ±/´½q ÄOo ÊcÆ t s·q´�uBqXob¾ tªu6¾­u¬qXo|±/uc»Â»mob¾ÇÈÊOu¬qX¿co6qX±/o3Qs t5²ßt5²Xos·q tªo:9Oo�»mÇ>x�¯Q²XoMo>ÁXo�qqcÆcÀ^ÄOo�»mÇb¹ t5²Xo u3¿c¿qcÆcÀ^ÄOo�»mÇÜ´cqX¿^t5²Xo¨»m´ ¾t<s uBq�´c¶ Ç ´6»mo ±/uBÆcq t ¾´cÄc¶ o<;�t5²XoÔÇ;o>t¨u@ÌW»mo/´½¶qcÆcÀ^ÄOo�»mÇ`Ä@o>t 3'o>o6q>=´cqX¿êyîs³ÇFqXu@t^±/uBÆcq t ¾´cÄc¶ o@x

n'obp qOs t<s³uBqÍvHx@?ê� �:�ï��� `7YO}��\^ �+� âÖ��� � ��c�5�Í«5©�¢¬�D� �c£/  �0�0�@� �����@ Ý¢D�5�ó õ � ö8õ ù ö7$I$I$Ýý $

�`�hã3�srÜ�æ�¨���D�1A�}��å���H���Â���s^CB e�[��ZYh^X���r� ©��DA�}X�å���H�����Â� ^CBÞ�|� `7`�eg[��ZYh^X��� �⵩��-� £ � E

o_¥�õ §Üé # ¥�� éCõ §%$

f-S½ÑH9/ÒEo_¥�õ § � * l�NOJÐ4@[:[ õ !ì� 4@?HAGF ! E oÖ¥�õ ! §Íé ü T`f-SBK i>k�l N@l � 7:9

J8K>[m4628K/Ak2<NEo i½j

mZo_¥�õ §Üé # ¥�� ûÛõ §ÜéIH�CJLKM� E o¨¥�õ ! §%$1¬NOE0K>287:E0K/9ZPÓKÖPQJ87G28K

Eo 4@?BA m!o 987:E0]H[Rj%4@9

E4@?HA m T

Page 33: All of Statistics: A Concise Course in Statistical Inference Brief

�����_� � � "!���� ���8Z�%�8�������Q��Z�%����"B����� � � ��� � � � �d�7�������Q��Z�%����" .Hü

* ü 1

ü � � �Eo¨¥�õ §

õT 1� T

� 7:I@ÑHJ8K # T 1¬� 7'JLNOiD4@iH7G[:7R2­jFl�ÑH?B=>2<7GNO?kl�NOJ��D7G]H]H7:?BIÔ40=>NO7:?k2­PQ7:=>K ¥ ��� 4@EF]B[:K # T ð T §

×½Ë@´cÀhÊc¶ o^vHxRy = (°�D�¤�r�<©@£5�c£/�� Ý��� �WâÈ¢3�æ«b��� ©��Ð⵩����áç½�@�d�r :� ½®��Ô�:�

Eo_¥�õ §Üé

���� ���ü-,2. õké *ü-,�1 õké�üü-,2. õké 1* õ�,!|ó/*BöµüOö 13ý $

�°�5�_ÚÜ�R�@¢3�<� ½® @3®¤ò

n'o>prqOs t<s uBqwvHxzyOy|{ �8�@�rã3©�� ���@�È���c£/ :�Í� �:��YO� � ^X�Â�T[�� [Z`ß� âk���D�b�<�M�LçO�:�b���î�âÈ¢¬�æ«b��� ©��

Eo �b¢D«5�ß���B�@�

EoÖ¥�õ § � *F⵩��F�@ � °õ°è�� ���� E

o_¥�õ §��OõÞé üg�@�rãh⵩��� �@�b� �Fû��µè

# ¥ � � ���b§ é����� Eo_¥�õ §��Oõ!$ ¥ # T üX§

(°�D�'âÈ¢3�æ«b�Â� ©��Eo �:�w«8�@ � :�8ãF���D�#A�}X�å���B�����Â� ^CB����3�Z`µ� ^�B e�[��ZYh^X���r����� kZl�� ,

�M�¨�B� �@�¨���B�@�mZoÖ¥�õ §Üé � �

��� Eo_¥��L§����

�@�rãEo_¥�õ §Üé m��o ¥�õ §Í�@�Z�@ � ¬�D©����D���Qõ��@� ÿ��3� «5� m!o �:�Ðã@� �_�b�<�b�D�����c£/ :�/®

1¬NOE0K>2<7GEFK>9¤PZKÖ9LSD4@[:[°PQJL7G28K � E¥�õ §��Oõ NOJ-9L7:E0]H[Gj � E

28N0E0Kµ46? � ���� E¥�õ § �cõ T

Page 34: All of Statistics: A Concise Course in Statistical Inference Brief

. 1 ��������� ����������������� ����������� �! "

* ü

üm!o¨¥�õ §

õ� 7GIOÑHJLK # T # � i>k�l l�NOJ��ï?H7Rl�NOJ8E ¥ * Ò üX§ T

×½Ë@´cÀ^Êc¶ oÖvHxzy Ø �å¢X�c�D© � �Ö���B�@��� �B��� � k�lEo_¥�õ §Üé�� ü l�NOJ *0ûÛõ`ûVü

* N@2<SBK/JLPQ7G98K $þá :�8�@�È  �Xè

Eo¨¥�õ §�� *+�@�rã � E

o_¥�õ § �cõ�é�ü½®Q{Å�L�@�rã3©�� ���@�È���c£/ :�Öÿ'����� ���3�:�Íã3�b�B�b��� ��:�_�/�@��ã��­©w�B� �@�h� �W?H7Gl�NOJLE � < è 9 Fã@�:�b�Â�È��£/¢3�Â� ©��r®�(°�D��i>k�l �:�ï�@� �@�b�+£ �

m!o¨¥�õ §Üé�� � * õ� *

õ *ÔûÛõMûVüü õ � ü�$

�°�5�_ÚÜ�R�@¢3�<� ½®/ ½®¤ò

×½Ë@´cÀ^Êc¶ oÖvHxzy v �å¢X�c�D© � �Ö���B�@��� �B��� � k�lE¥�õ §'é � * l�N@J õ�� *

�� � � ����� N62<SHK>JLPQ7:9LK $�å���æ«5� � E

¥�õ § �cõkéü6è ���3�:�Ö�:�Ð��ÿZ�b �  &;ã3�srÜ�æ�8ã� kZlꮤò �B}6��������� NO?c2<7G?½ÑHNOÑH9ÔJ<46?HAHNOE a@46J87m46iH[:K>9Í=µ4@?ê[GKµ4@Aº28Nß=>NO?Bl�ÑH9L7:NO?æT � 7GJ89L2/Ò

?HN@28Kw2<SD462Ö7Rl � 7G9¨=/NO?c2<7G?½ÑHNOÑH9Ö2<SHK>? # ¥�� é õ §¨é * l�N@J¨K>aOK>JLj õ���� NO?��Ã2¨2<J;jM2<N2<SB7:?H\MN6l

E¥�õ § 4@9 # ¥�� é õ § TFf-SH7G9ÖNO?H[GjMSHNO[:AB9Öl�N@J¨AH7:9L=/J8Kb2<K0J<4@?BAHNOE�a64@JL7m4@iH[GK/9>T� K_IOK>2¤]HJ8NOiH4@iH7:[G7G287:K/9'l�JLNOE�4 � kZl i½j�7:?c2<K>IOJ<4�2<7:?BIHT Ï � kZl =/4@?kirK_iH7:I@IOK/J 2<SD4@?

ü ¥ ÑB?H[:7G\@Kw4kEF4@989dl�ÑB?H=>287:NO? § T � NOJ_K � 4@EF]B[:K@Òæ7GlE¥�õ §dé l�NOJ õ ! p *Böµü-,� -q 4@?HA *

Page 35: All of Statistics: A Concise Course in Statistical Inference Brief

�����_� � � "!���� ���8Z�%�8�������Q��Z�%����"B����� � � ��� � � � �d�7�������Q��Z�%����" . #N@2<SBK/JLPQ7G98K6Òá2<SHK>?

E¥�õ § �n* 4@?HA � E

¥�õ §��Oõ é ü 98NM2<SB7:9Ð7G9Í4îPÓK/[G[ ) AHK��D?BK/A � k�lK>a@K/?º2<SBNOÑHIOS

E¥�õ §Ðé 7:?ß98NOE0K�]H[:4@=/K>9/T��­?|l�4@=b2µÒá4 � k�l =µ46?ºi K0ÑH?½irNOÑH?HAHK>A°T� NOJhK � 4@EF]B[:K@Ò�7GlE¥�õ §^é ¥ 1�, # §ªõ � ����� l�N@J * � õ �¼ü 4@?HA

E¥�õ §hé * N@28SHK/J;PQ7:98K6Ò

2<SHK>? � E¥�õ § �Oõ�é�ü K>a@K/? 2<SHNOÑBIOS

E7G9-?HN@2Wi N@ÑH?HAHK>A°T

×½Ë@´cÀhÊc¶ o^vHxRy7& à#�b� E¥�õ §Üé�� * l�N@J õ� *

�� � � ��� N62<SHK>JLPQ7:9LK $(°�3�:�^�:�Ö�æ©��-� � kZl �b���æ«5� � E

¥�õ §��Oõ�é � �� �Oõ ,H¥;üÓúÞõ §Óé � �� ���T,�%é [GNOI ¥ &C§ é&V®¤ò

6o�ÀhÀd´wvHxzyIV à#�b��m�£b�¨���D��i>kZl�⵩��h�F�L�@�rã3©�� ���@�È���c£/ :�-��®?(°�D�b� 6� � # ¥���é õ §Üé mF¥�õ § $ mF¥�õ � §Öÿ��D�b�5��m�¥�õ � §Üé [:7GE �� � mF¥�÷B§5è� ��� # ¥�õ���� ûC÷B§'é mF¥�÷B§�$ mF¥�õ §5è� ����� # ¥�� � õ §Üéü $ m�¥�õ §5è� � � 1�â¤� �:�Ы5©��D�Â���D¢D©�¢½�^���D�b�# ¥ ��� ���b§ é # ¥��ûÛ� ���b§Üé # ¥ ��� û��b§Üé # ¥ FûÛ� û �b§%$

� 2ï7G9Q4@[:9LNÔÑH98Kbl�ÑH[°2<N0AHK��H?HK^28SHKÖ7:?caOK>J89LK i>kZl ¥ NOJD�½ÑD4@?c2<7G[:K_l�ÑH?B=>2<7GNO? § T

n'o>prqOs t<s uBqwvHxzy �Ûà#�b�#� £b�^�Ô�L�@�rã3©�� ���@�È���c£/ :�¨ÿ'����� iTk�l mF® (°�D�-�Â� �r�½}/`/�i>kZl ©��� T[��D� ^X�Â��� eg[��ZYh^X��� �C�:�Ðã3�srÜ�æ�8ã%£ �

m � � ¥��O§'é 7:?Bl�� õM� mF¥�õ §Zû����⵩���� ! p *¬öµü7q ® 1�â�m �:�¨�b�Â�È� «b��  �����æ«b�<�8���b���3���@�rã «5©��D�����D¢å©�¢c�^���D�b� m � � ¥��O§Ö�:����D�¨¢¬�D��ä/¢D�^�<�8�@ ¡�D¢¬�0£b�b�¤õÞ�b¢D«5�����B�@�cm�¥�õ §'é��B®

�ÃÌ%¸>uBÆ ´�»moñÆcq Ì�´cÀ^s·¶ ¾sô6» 3Qs t5²��;sÎq Ì��D¹��5ÆXǪtt5²OsÎq���u@ÌÛs t ´XÇ t5²XoÀ^s·qOsÎÀ^ÆcÀÍx

� K-=/4@[:[ m � �b¥;ü-,2.c§ 2<SHK �DJ89;2(�3ÑH4@JL287:[:K6Ò m � �>¥;ü0,�1O§ 28SHK-E0K/AH7:4@? ¥ N@Já98K>=/NO?HA��½ÑD4@J )2<7G[:K § 4@?HA m � � ¥ # ,2.½§ 2<SHK¨2<SB7:J8A �½ÑD4@J;2<7G[:K@T

Page 36: All of Statistics: A Concise Course in Statistical Inference Brief

.�. ��������� ����������������� ����������� �! "

f¤PZNgJ84@?HAHNOE a64@J87:4@iH[:K>9 � 4@?HA � 46J8K � >[#�D�w��� ���a`b^ }����c[d^X���r� � PQJL7G2828K/?� �é � � 7Rl m!o¨¥�õ §Óé m��Ó¥�õ § l�NOJQ4@[G[ õ TÜf-SH7:9QAHN3K/9Q?HN62ïE0Kµ46?�28SD462 � 4@?HA � 46J8KK:�½ÑD46[ T��ï4�2<SHK>J/ÒD7G2¤EFK/4@?H9-2<SD4�2d46[:[°]HJLNOiD4@iB7:[:7R2­j09L254�2<K/E0K/?c289d4@irNOÑB2 � 4@?HA � PQ7:[G[irKÖ28SHKÖ9<4@E0K@T

���5� �k(� !�� �|(0&á$�MÍ"%$ G .IHá,W& d$� � MÍ"+)M(�� � MÍ&Ü.%UM JLR� H

�B}6��������%�'� [d^��î� ^ ��^X���r� � � 2'7G9�2<J84@AH7R2<7:N@?D4@[c2<NdPQJL7G28K ��� m 28N¨7:?HAH7G=µ4628K2<SH462 � SD4@9-AH7G9L28J87:iBÑB2<7GNO? m T'f-SH7:9-7G9¤ÑH?Bl�N@JL2<ÑB?D462<K¨?BN@2546287:NO?�987:?B=/K_2<SBK¨9Lj3EÍirNO[ �7:9á4@[:9LN¨ÑH98K>A028N¨AHK/?HN62<KW4@?Ô4@]H]BJ8N � 7:EF462<7GNO?°T�f-SHK-?HN@2<462<7GNO? ��� m 7:9á9LNÖ] K>JLa64@987RaOK2<SH462ZPÓK¨4@JLK_9L28ÑH=5\FPQ7R2<S%7R2µT �QKµ4@A ��� m 4@9�� � SD4@9¤AH7G9L28J87:iBÑB2<7GNO? m��w�#� ^ 4@9 �7:9Q46]H]HJ8N � 7:EF4628K/[Gj m T�! �" ���$#&%('*),+.-/-10#&-2'��(#&354('(#6��% �h� SH4@9¨4%]rNO7:?c2_EF4@9L9_AH7:9;2<JL7:iHÑB287:NO?`462 ÒHPQJ87R282<K>? ���76 � ÒD7Rl # ¥�� é 3§ éü 7G?kPQSH7:=5S =µ4@9LK

m�¥�õ §Üé � * õ��ü õ� $

f-SHKÖ]HJLNOiD4@iH7G[:7R2­jFl�ÑH?B=>2<7GNO?�7:9E¥�õ §Üéü l�NOJ õ�é 4@?HA * N@2<SHK>JLPQ7G98K@T

�! �" 08#&- i � " ' ":9 %�# l � �(;<08#&-2'��(#&354('(#6��% � �¡K>2>= � ü i Kd4hIO7RaOK>?%7:?c2<K>IOK/J>T1¬ÑH]B] NO9LK¨2<SD462 � SD4@9Q]HJLNOiD4@iB7:[:7R2­j�EF4@9L9¤l�ÑH?H=>287:NO?�IO7RaOK>?îicjE

¥�õ §Üé�� ü-, = l�NOJ õ�éôüOöI$I$7$/ö =* N@28SHK/J;PQ7:98K $

� K^984µj%2<SD4�2 � SD469ï4wÑH?B7Gl�NOJLE�AH7G9L2<JL7:iHѬ2<7:N@?kNO? ó½üOöI$I$I$>ö = ý T�! �"@?A" �(% �$4�BCBC#�0#&-2'��(#&354('(#6��% � �¡K>2 � J8K/]BJ8K/9LK/?c2 4|=>NO7:?��D7:]°T f-SHK>?

# ¥���é�üX§'éED 46?HA # ¥�� é *c§'é�ü $:D l�NOJ-98NOE0K D ! p *¬öµü7q T � KÖ9<4µj%28SD462 � SD4@940ÕZK/JL?HNOÑH[G[:7°AH7G9L28J87:iBÑB2<7GNO?%PQJ87R282<K>? ��� ÕZK/JL?HNOÑH[G[:7 ¥FDr§ TÓf-SBKh]HJLNOiD4@iB7:[:7R2­jFl�ÑB?H=>287:NO?7:9

E¥�õ §ÜéED � ¥;ü $:Dr§ � � � l�NOJ õ !|ó/*Böµü6ý T�! �"G? #&% ��;H#I+.BJ08#&-2'��(#&354('(#6��% � 13ÑH]H]rNO98K+PÓK�SD4µa@K|4ê=/NO7G?CPQSH7:=5S l�4@[:[G9

SHK/4@AH9dPQ7R2<SM]HJLNOiD4@iB7:[:7R2­j D l�NOJW98NOE0K *�û,D�û ü T � [:7:]�2<SBKÍ=/NO7G? A 287:E0K/9d4@?HAM[GK>2

Page 37: All of Statistics: A Concise Course in Statistical Inference Brief

��� ��� ">��� �� � ������ ���� � � ">������ ����������� ����������� �! " . � irK�2<SHK�?3ÑBEÍi K>JÍN@lQSHK/4@AH9>T Ï 989LÑHE0K�2<SD4�2Ð2<SBK�2<NO9L98K>9Ô4@J8KF7:?HAHK>] K>?HAHK>?½2/T �#Kb2E¥�õ §Üé # ¥�� éCõ § i K¨2<SBK^EF4@9L9¤l�ÑH?H=>287:NO?æT � 2ï=/4@? i KÖ98SBNXPQ? 2<SD4�2E

¥�õ §'é������ ��� D � ¥ªü1$ Dr§ � � � l�NOJ õké *Bö7$I$I$/ö A* N@28SHK/J;PQ7:9LK $

Ï J<46?HAHNOE a64@J87:4@iH[:KgPQ7G28Sñ2<SBKºE�46989`l�ÑH?B=>2<7GNO? 7G9M=µ4@[G[:K>A 4CÕZ7G?HNOE07m4@[¨J84@?HAHNOEa64@J87:4@iH[:K 4@?BAÛPZK`PQJL7G28K � � ÕZ7G?HNOE07m4@[ ¥aA�ö Dr§ T�� l � � � ÕZ7:?HN@EF7:4@[ ¥aA�ö D � § 4@?HA� ù � ÕZ7G?HNOE07m4@[ ¥sA�ö D ù § 2<SHK>? � � úê� ù � ÕZ7:?HN@EF7:4@[ ¥aA�ö D � ú D ù § T

�H}����Â���� �¡K>2hÑH9^2<4@\@KÔ2<SB7:9^NO]H]rNOJL28ÑH?H7G2­j`2<N ]HJ8KbaOK/?c2w9LNOE0K0=>NO?Bl�ÑH9L7:NO?æT �7:9h4îJ84@?HAHNOE a64@JL7m4@iH[GK ø�õ AHK/?HN62<K/9Í4î]D4@J;2<7G=/ÑH[:4@J¨a@46[:ÑHK0N@l¤28SHK�J84@?HAHNOE a64@JL7m4@iH[GK øA 46?HA D 4@J8K A��H}X�D��� ^µ�3}/` Ò�2<SD4�2h7G9/Ò�� � K/AºJLKµ4@[Ü?½ÑHEÍirK/JL9/T�f-SHK�]H4@J<4@E0K>28K/J D 7G9ÑH98ÑH4@[:[Rj+ÑH?H\½?HN PQ?ê4@?HAßEÍÑH9;2ÐirK0K/9L287:EF462<K>Aßl�JLNOE AD462<4 ø 28SD462 �Ù9ÖPQSD462h9L2<462<7G9L2<7G=µ4@[7:?Bl�K>J8K>?H=/K_7G9Z46[:[r4@irNOÑB2µT �­?�EFNO9;2¤9;2546287:9L287:=/4@[åEFN3AHK/[G9/Òc2<SBK/J8K_4@J8KïJ<4@?HABNOE a@46J87m46iH[:K>94@?HA ]D4@J84@EFKb2<K>J89 � ABNO?��Ã2ï=>NO?Bl�ÑH9LK^28SHK/E T�! �"�" ��; " '��(# i 0#&-/' �(#&354('(#6�$% �_� SD469_4�IOK>NOE0K>2<JL7:=ÖAH7G9L2<JL7:iHѬ2<7:N@? PQ7R2<S

]D4@J84@EFKb2<K>J D !ê¥a*BöµüX§ ÒBPQJ87R2828K/? ��� �¨K/N@E ¥ Dr§ ÒB7Gl# ¥�� é = §ÜéED�¥ªü1$ Då§� � � ö = �Vü�$

� KÖSD4µa@K^28SD462 �H + � # ¥�� é = §Üé D �H + � ¥;ü $:Då§�ïé Dü $ ¥;ü $:Då§ éôü�$

f-SH7:?B\�N@l � 4@9¤28SHK¨?½ÑHEÍirK/J-N@l �D7G]H9¤?HK>K/AHK>AîÑH?c287:[ 28SHK �DJL9L2QSHK/4@AH9-PQSHK/? �D7G]H]H7G?HI4w=/NO7G?°T�! �" ����#&- - ��% 08#&-2'��(#&3 4�' #6�$% �ê� SD4@9Ô4 7�NO7:9L98NO?�AH7:9;2<JL7:iHÑB287:NO?�PQ7G2<Sê]D4 )

J<4@E0K>28K/J �¡ÒHPQJ87R282<K>? ��� 7�NO7:9L98NO? ¥ � § 7RlE¥�õ §'é�� ��� � �õ�� õ � *_$'ïN@28K¨2<SD462 �H ��+ �

E¥�õ §Üé�� ��� �H ��+ � � �õ�� é�� ��� � � é�ü�$

Page 38: All of Statistics: A Concise Course in Statistical Inference Brief

.cð ��������� ����������������� ����������� �! "

f-SHK 7�NO7:9L98NO?07:9áN6l�2<K/?ÔÑH9LK/AF4@9Ü4_E0N¬AHK>[¬l�NOJ�=/NOÑH?c289 N6læJ84@J8K¤KbaOK>?½289Ó[G7:\6K-J<4@AH7GNc4@=b2<7Ga@KAHK>=µ4µjî4@?HAî28J<4��0=Ð4@=/=>7:AHK>?½289/T�� l � � � 7�NO7G9898N@? ¥aA�ö � � § 4@?HA � ù � 7�NO7:9L98NO? ¥sA�ö � ù §2<SBK/? � � úÞ� ù � 7�NO7G989LNO? ¥aA�ö � � ú � ù § T �B}6�������� � KÍAHK��H?HK/AgJ<46?HAHNOE¼a64@J87:4@iH[GK/9Q2<N%irKÍEF4@]H]H7G?HIO9Ql�J8NOEY4�984@EF]B[:K98]H4@=/K � 2<N � iBÑB2ïPÓKÐAB7:A`?BN@2dE0K/?c287:NO?�2<SHK^9<46EF]H[GKÖ98]D4@=>KÐ7G?M4@?cj�N@lá28SHK^AH7:9;2<J87 )iHÑB287:NO?B9d4@irN aOK6T Ï 9 �ZEFK>?½287:NO?BK/A`K/4@J8[G7:K>J/ÒH2<SBKh984@EF]B[:K^98]D4@=>KÐN@l�28K/? �LAH7:984@]H]rKµ4@JL9 �iHÑB2d7G2ï7:9WJ8K/4@[:[Rjk2<SHK>J8KÍ7G?î2<SBKÍiD4@=5\½IOJLNOÑH?HA°T �¡K>2 � 9_=>NO?H9L28J8ÑH=b2Ö4�984@EF]B[:K^98]D4@=>KÍK � )]H[G7:=/7R2<[Rj¨l�NOJ'4ÖÕZK/JL?HNOÑH[G[:7¬J<4@?HABNOE�a64@JL7m4@iH[GK@T��¡K>2 �Ûé p *Bö/ü7q 4@?BA0ABK��D?HK # 2<NÖ9<46287:9;l�j# ¥ p Dö � q�§Üé���$ l�NOJ *Ôû��û��dûVü T � 7 � D ! p *Böµü7q 4@?HA AHK��D?BK

�ê¥�¦Z§'é � ü ¦Ûû D* ¦ � Dd$

f-SHK>? # ¥�� é üX§ïé # ¥m¦ôû Dr§Wé # ¥ p *Bö D qm§ïé D 4@?BA # ¥�� é *c§Wé¼ü $ D Thf-S½ÑH9/Ò� � ÕZK/JL?HNOÑH[G[:7 ¥FDr§ T � KÐ=/NOÑH[GAMAHN%28SH7:9Wl�NOJ_4@[G[�28SHKÍAH7G9L28J87:iBÑB2<7GNO?H9WAHK��D?BK/A�4@i N a@K@T�­?Í]HJ84@=>287:=>K@Ò6PÓKZ28SH7:?H\¨N6lr4ïJ<46?HAHNOEVa64@J87:4@iH[GK [G7:\6KZ4WJ<4@?HABNOEô?3ÑBEÍi K>J�iHѬ2�l�NOJLE�4@[G[Gj7G2¤7:9Q4ÔE�4@]B]H7:?HIwABK��D?HK>A`NO?�9LNOEFKÖ984@EF]B[:K_9L]D4@=/K6T����� �k(� !�� �|(0&'$cMÍ"%$���(0"%$'.µ"�*M(0*LH,� MÍ"+)M(�� � MÍ&Ü.%U

M JLR� H

�! �"19 %�# l � �(;70#&-2'��(#&354('(#6��% T � SD4@9'4 �W?H7Gl�N@J8E ¥ Dö �b§ AB7:9L28J87GiHÑB2<7GNO?°Ò�PQJ87G2 )2<K>? ��� �W?H7Gl�NOJLE ¥ Dö �b§ ÒH7RlE

¥�õ §Üé � �� � � l�NOJ õ�! p Dö � q* N@2<SBK/JLPQ7G98K

PQSHK>J8K ��� TÜf-SHKÖAH7:9;2<JL7:iHÑB287:NO?%l�ÑB?H=>287:NO?�7:9

m�¥�õ § é�� � * õ���� � �� � � õ�! p Dö � q

ü õ � �0$��� �(; +.B�� � +(4�- -/#I+.% �Ó� SD469Q4 'ïNOJLE�4@[ ¥ N@J �^46ÑH989L7m4@? § AH7:9;2<J87GiHÑB287:NO?�PQ7R2<S

]D4@J84@E0K>2<K>J89��ß4@?HA� áÒDAHK>?HN@28K/A i½j �����|¥ � ö ùb§ ÒH7RlE¥�õ §Üé ü �� 1�� K � ] �D$ ü

1 ù ¥�õ $ � § ù�� ö õ�!`�

Page 39: All of Statistics: A Concise Course in Statistical Inference Brief

������� ">��� �� � ������ ���� �8�8��Z� � � � �c"j����������� ����������� �! " . $

PQSHK/JLK � !Å� 4@?HA � * T �#4628K/J%PÓK+98SD46[:[d9LK/KM28SD462 �V7G9�2<SHK*�;=/K/?c28K/J �ì¥ NOJEFK/4@? § N@lÓ2<SHKÔAH7G9L2<JL7:iHѬ2<7:N@?�46?HA �7:9¨28SHK �;98]HJLKµ4@A �º¥ NOJÖ9;254@?HAH4@J8AßAHK>a37:462<7GNO? § N@l2<SHKîAH7:9;2<JL7:iHÑB287:NO?æTCf-SHK 'ïN@J8EF4@[-]H[m4µj39%46?Û7:E0] N@JL2546?½20JLNO[:K�7:?�]HJ8N@iD4@iH7G[:7G2­jÞ4@?HA9L2<462<7G9L2<7G=/9>T��`4@?cj�]HSHK>?HNOE0K/?D4Ô7:?�?D4628ÑHJ8KÖSD4µa@KÐ4@]B]HJ8N � 7:EF462<K>[Gj 'ïNOJLE�46[æAH7G9L2<JL7:iHÑ )2<7GNO?H9/T(�#4628K/J>ÒBPZKÖ9LSD4@[:[ 98K>K^28SD462¤28SHKÖAH7:9;2<J87GiHÑB287:NO?%N6l�4w9LÑHE N@l�J<46?HAHNOEÅa@46J87m46iH[:K>9=µ4@? irKh46]H]HJ8N � 7:EF4628K/A�icj�4�'ïNOJLE�4@[æAH7:9;2<J87GiHÑB287:NO? ¥ 2<SHK¨=>K/?c2<J84@[#[:7:E07G2 2<SBK/NOJLK/E § T� KÜ9<4µj_28SD462 � SD469�4 `%^ �D�����H}�� � �å}6���D�å���a`b^ }����c[d^X��� � 7Rl � é * 4@?BA éü Tf�J84@AH7R2<7:N@?FAH7G=>2<462<K>9Ó2<SH462¤4^9L2<4@?HAD4@JLA 'ïNOJLE�46[DJ<4@?HABNOE a@46J87m46iH[:KQ7:9 AHK>?HN@2<K>A%icj � Tf-SHK � k�l 4@?HA i>kZl N6l_4M9;254@?HAH4@J8A 'ïN@J8EF4@[Z46J8KkAHK>?HN@2<K>A�icj�� ¥��O§ 4@?HA� ¥��c§ Tf-SHK � k�l 7:9-]H[GN@2828K/A�7:? � 7GIOÑHJLK # T . T'f-SHK/JLKh7G9¤?HN0=/[GNO98K>A ) l�NOJLE�K � ]HJ8K>98987GNO?îl�N@J�dTUïK>J8K^4@J8KÖ9LNOEFK¨ÑB98K>l�ÑB[#l�46=>2<9 �¥ 7 § � l �����º¥ � ö °ù § 28SHK/? � é ¥�� $ � § , � �º¥ *¬öµüX§ T¥ 7G7 § � l � ���|¥ *Bö/üX§ 2<SHK>? � é � ú � ���|¥ � ö ùȧ T¥ 7G7:7 § � l � ! ���º¥ ��! ö ù! § Ò�� é�ü@öI$I$I$bö A 4@J8K¨7G?HAHK>] K>?HAHK/?c2W2<SHK>?

�H ! + � � ! ���� �H !,+ � ��! ö �H ! + � ù!�� $� 2Ql�NO[:[GNXPQ9Ól�J8NOE ¥ 7 § 28SD462Q7Rl �����º¥ � ö ùb§ 28SHK/?

# ¥� ��� � �b§ é # 0$ � � � � �"$ � � é � ��$ � � $ � 0$ � � $¥ # T 1O§

f-S3ÑB9�PÓKÜ=µ4@?^=/NOE0]HÑB28KÜ4@?cj¨]HJLNOiD4@iH7G[:7R2<7:K>9rPÓK'P¤46?½2�4@9¡[GNO?HIQ4@9°PÓKÜ=µ4@?^=/NOE0]HÑB28K2<SHK i>kZl � ¥��c§ N6l#4Ð9L2<4@?HAD46J8A 'WNOJ8EF4@[ T Ï [G[å9L254�2<7:9;2<7G=µ4@[D=>NOEF]BÑB2<7G?HIÐ]D46=5\646IOK/9 PQ7:[:[=/NOE0]HÑB28K�� ¥��c§ 4@?HA�� � � ¥��O§ T Ï [G[¤9L2<462<7G9L287:=/9Í28K � 2<9>ÒZ7G?H=/[GÑHAH7:?BIM2<SH7G9wNO?HK@Ò SD4µaOKî4254@iB[:K¨N@l�a64@[:ÑHK>9QN@l�� ¥��c§ T×½Ë@´cÀhÊc¶ o^vHxRy�� �å¢ �c�D© � �Ö���B�@�������º¥ # ö O§>®QÚÜ���rã # ¥�� � ü §/® (°�D�¨� ©� Ý¢3�Â� ©��`�:�

# ¥�� � ü §'é�üD$ # ¥�� � üX§'é�ü $ # � � ü $ #� � éü $ � ¥%$�*_$ �� .h.½§'é $ � ü�$

Page 40: All of Statistics: A Concise Course in Statistical Inference Brief

. � ��������� ����������������� ����������� �! "

* ü 1$^ü$�1EoÖ¥�õ §

õ� 7:IOÑBJ8K # T .H� � K/?B987G2­j%N@lá4w9;254@?HAH4@J8A 'ïNOJLE�46[ T

$^©�ÿ rÜ�rã ���b¢å«5�`���B�@� # ¥�� � �O§Zé $ 1H®�1È�Þ©����D�b�^ÿZ©��8ã��5è rÜ�rã �wé � � � ¥ $ 1O§>®��`�� ©�  �@�Ö���3�:�У �Fÿ'�È�������3�

$ 1Öé # ¥�� � �O§Üé # � � ��$ � � é � � $ � � $Úæ�5©������D��$^©��È�0�@ ¡� �c£/ :�5è � ¥ $B$ � .Bü ðc§Üé $ 1H® (°�D�b�<��⵩��<�5è$B$ � .Bü ðÐé ��$ � é ��$ #� �@�rãÔ�D�b�æ«5���^é # $ $ � .Bü ð � ^éü�$Gü@ü � ü�$ ò

��� � �$% " %�' #I+�B 0#&-2'��(#&354('(#6��% �¤� SD4@9ï4@? ��� ]rNO?HK/?c287m4@[#AH7G9L2<JL7:iHѬ2<7:N@?�PQ7R2<S]D4@J84@E0K>2<K>J��ÜÒHAHK/?BN@2<K>A`icj ��� � � ] ¥ � § ÒH7GlE

¥�õ §Üé ü�� � � ��� ö õ � *

PQSHK>J8K�� � * T-f-SHKhK � ]rNO?HK/?c287m4@[�AH7:9;2<JL7:iHÑB287:NO?�7G9WÑB98K/AM28NFE0N3AHK/[¡2<SBKh[G7Gl�K>287:E0K/9-N@lK/[GK/=b2<J8N@?H7:=¨=>NOEF]rNO?HK>?c2<9W4@?HAk2<SBKÖPZ4@7R2<7:?BIÔ287:E0K/9¤irK>2­PÓK/K>?MJ<46J8K¨K>a@K/?c2<9>T

� +.;H; +@08#&-2'��(#&3 4�' #6�$% � � NOJ� � * Ò#28SHK� �D�º���Ne�[���Yh^ ���r� 7:9ÖAHK �D?HK>Aicj � ¥ � §-é � �� ÷�� � � � � �c÷ T � SD469¨4 �^46EFEF40AB7:9L28J87GiHÑB2<7GNO?�PQ7G2<S`]D4@J<46EFKb2<K/JL9��4@?HA��ÜÒDAHK>?HN@2<K>Aîicj ��� �Ö4@EFEF4 ¥ � ö � § ÒB7RlE

¥�õ §Üé ü� � � ¥ � § õ � � � � � � ��� ö õ � *

Page 41: All of Statistics: A Concise Course in Statistical Inference Brief

������� ">��� �� � ������ ���� �8�8��Z� � � � �c"j����������� ����������� �! " . PQSHK/JLK � ö � � * T�f-SHKQK � ] N@?HK/?c2<7:4@[HAH7G9L28J87:iBÑB2<7GNO?Ð7G9 � ÑH9L2 4 �Ö4@E0E�4 ¥;ü@ö � § AH7G9L2<JL7:iHÑ )2<7GNO?°T � l � ! � �Ö4@EFEF4 ¥ ��! ö � § 46J8K'7:?BAHK/]rK/?HABK/?c2µÒX2<SHK>? F �! + � � ! � �Ö4@EFEF4 ¥ F �!,+ � ��! ö � § T�! �" ?A" ' + 08#&-2'��(#&3 4�' #6�$% � � SH4@9¤4@?�ÕZK>2<4ÐAH7G9L2<JL7:iHѬ2<7:N@?ÔPQ7R2<SF]D4@J<46EFKb2<K/JL9

� � * 4@?HA � � * ÒHAHK/?BN@2<K>A`icj ��� ÕZK>2<4 ¥ � ö � § ÒB7RlE¥�õ §Üé � ¥ � ú � §

� ¥ � § � ¥ � § õ � � � ¥;ü $ºõ § � � � ö *�� õ��ìü�$

� +�% k�� + 4 i �� 0#&-2'��(#&354('(#6��% �Ü� SD469¤4 � AH7:9;2<J87GiHÑB287:NO?wPQ7G28S��ÔAHK/IOJLK/K>9ZN@ll�J8K>K/AHNOE � PQJL7G2828K/? ������ � 7RlE

¥�õ § é � � � � �ù �� � � ù �

ü� üÓú ���� � � � � � � � ù $

f-SHK � AH7G9L2<JL7:iHѬ2<7:N@?�7:9Ó987GEF7G[m4@Já28Nw4�'ïN@J8EF4@[åiHÑB2-7R2¤SD4@9 2<SH7G=5\@K>J¤254@7G[:9/T �­?%l�4@=>2/Ò¬28SHK'ïNOJLE�46[D=/NOJLJ8K/9L] N@?HAH9Z28NÐ4 � PQ7G2<S� é�& Táf-SHK 4@ÑH=5Scj�AB7:9L28J87GiHÑB2<7GNO?Ô7:9Ó4h9L] K>=/7m46[=µ4@9LKhN6l�28SHK � AH7G9L28J87:iBÑB2<7GNO?k=/NOJLJ8K>98]rNO?HAH7G?HI028N�� é�ü T'f-SHKÖAHK/?H9L7G2­j�7G9E

¥�õ §'é ü�Ü¥;üÓúêõ ù § $f�N098K>K¨2<SD462Q28SH7:9-7G9-7:?HAHK>K/Aî4wAHK/?H9L7G2­j@Òr[GK>2 �Ù9-AHNÔ2<SBK^7G?c2<K/I@J<4@[ �� �

��� E¥�õ §��Oõ é ü� � �

��� �OõüÓúêõ ù é ü� � �

��� � 2546? � ��Oõé ü� � 2546? � � ¥ &C§�$ 2<4@? � � ¥%$�&C§��Wé ü��� � 1 $�� $ � 1���� éü�$

�! �"�� ù k #&-2'��(#&354('(#6��% �'� SD4@9�4 � ù AH7G9L2<JL7:iHѬ2<7:N@?_PQ7G2<S D AHK/I@J8K/K>9�N@l3l�J8K>K/AHN@E� PQJ87G2L2<K>? ��� � ù� � 7GlE

¥�õ § é ü� ¥FD ,�1@§ 1 � � ù õ � � � ù � � � � � � � ù öñõ � *_$

� l � � öI$7$I$/ö � � 4@J8K�7:?HAHK>] K>?HAHK>?½2�9;254@?HAH4@J8A 'ïN@J8EF4@[@J84@?HAHNOE a64@J87:4@iH[:K>9 28SHK/? F � !,+ � � ù! �� ù� T

Page 42: All of Statistics: A Concise Course in Statistical Inference Brief

h* ��������� ����������������� ����������� �! "

����� ��.��BMÍ&Ü.7Mh$$ G .IH�$á&Ü.IJ+*`$'./(0"LH�Ö7RaOK>?ì4º]H4@7:J%N6lhAB7:98=>J8Kb2<K+J<46?HAHNOE a@46J87m46iH[:K>9 � 4@?HA � ÒWAHK �D?HK+2<SHK�� � ��� ^

�|� `7`?e�[���Yh^ ���r� icjE¥�õ#ö<÷B§ é # ¥�� éìõ 4@?HA � é ÷H§ T � J8NOE�?HN PôNO?°ÒDPZK^PQJ87R2<K

# ¥���é õ 4@?HA �ôé ÷H§ 4@9 # ¥���é õ#ö �é ÷H§ T � KWPQJL7G2<KE4@9

Eo�� � PQSHK/?0PZKWP¤4@?c2

2<NÔirKÖEFNOJLK¨K � ]H[G7:=/7R2µT×½Ë@´cÀ^Êc¶ oÖvHxzy =h�b�<�`�:�g��£/� ���@�È���@�ª�Mã@�:�b���b��£/¢3��� ©��ß⵩�� �ÂÿZ©|�L�@�rã3©�� ���@�b���c£/ :�5�0��@�rã�� �8�3«5��� ������3� ���@ Ý¢D�5� <k©�� 9\6

�é * ��éü� � < 9%:�� @7:�� 9%: � � 9 @7:�� ;h:�� 9%:

9%: 9%: 9(°�3¢c�5è # ¥���é�üOö �ôéüX§'é

E¥ªüOöµüX§áé . , ®¤ò

n'obp qOs t<s³uBqÍvHxRy ? 1b�+���D�Ô«5©��D�Â���D¢D©�¢c�w«8��� �5è ÿZ�0«8�@ � ��dâÈ¢3�æ«b�Â� ©��E¥�õ#ö<÷B§Ð�h�BãLâ

⵩��k���D� �L�@�rã3©�� ���@�È���c£/ :�5� ¥��Mö �Ч � â � � E¥�õ#ö<÷B§ � * ⵩�� �@ � -¥�õ#ö<÷B§5è � ��� � �

��� � ���� E

¥�õ#ö8÷H§ �cõ �c÷ é üê�@�rã�èÓ⵩��`�@� �g� �b� � � ���ê�-è # ¥8¥��Mö �w§ !� § é�� �� E

¥�õ#ö<÷H§ �Oõ �½÷æ® 1b�%���D�dã@�:� «b�<�b�ª�¨©��_«5©��D�����D¢å©�¢c�_«8��� �WÿZ�ïã3�srÜ�æ�-���D�� ©����D�ci>kZl ����mZo�� �Ó¥�õ#ö8÷H§'é # ¥�� ûÛõ#ö � ûC÷B§/®

×½Ë@´cÀ^Êc¶ oÖvHx Ø =Ûà��b�¤¥��`ö �ͧh£b�Ö¢¬�D� ⵩��È��©��M���D�^¢3�D���Ü�/ä/¢B�@�<�/®?(°�D�b�BèE¥�õ#ö8÷H§áé � ü 7Gl *wûÛõMûVü@ö *wûC÷%ûVü

* N@28SHK/J;PQ7:98K $ÚÜ���rã # ¥�� �ìü-,�1¬ö �2� ü-,�1O§>® (°�D�d� �@�b�D� � éôó � �ñü-,�1¬ö �2�ìü-,�13ýF«5©��b�<�5�­�D©��rã���­©��0�b¢H£b� �b�Q©ªâÖ���D�^¢3�D��� �/ä/¢B�@�<�/® 1È�D�ª� �@�L�@�����3� E

©�@�b�Ö���3�:�^�b¢B£b� �b�-«5©��b�<�5�­�D©��rã��5èÜ������3�:�|«8��� �5è��ª©Û«5©��d�r¢3�����3�����D���@�<�8�C©ªâM���D�M� �b� � ÿ��3� «5�C�:�]9%:�;r® �°© è # ¥�� �ü-,�13ö �2�ñü-,h1O§'é�ü-,2.D®¤ò

Page 43: All of Statistics: A Concise Course in Statistical Inference Brief

����� �N� �0��� ������� � � "!���� ���8Z�%�8��" ¬ü×½Ë@´cÀhÊc¶ o^vHxÙجyêà#�b�Z¥��Mö �ͧ¨�B� �@�^ã3�b�B�b��� �E

¥�õ#ö<÷B§'é � õÔúê÷ 7Gl *wûCõMûVüOö�*0ûC÷�ûVü* N@28SHK/J;PQ7:98K $

(°�D�b� � �� � �

� ¥�õwú�÷B§��Oõ �c÷ é � ��

� � �� õ��Oõ�� �c÷¨ú � �

�� � �� ÷ �Oõ�� �c÷

é � ��

ü1 �½÷_ú�� �

� ÷ �½÷wé ü1 ú ü

1 éüÿ��3� «5� �@�b�È� r¤�5�Ö���B�@�Ó���3�:�^�:�Ð� � k�l��ïò×½Ë@´cÀhÊc¶ o^vHxÙØ@Ø 1�âÜ���D�¤ã@�:�b���b��£/¢3��� ©��w�:�-ã3�srÜ�æ�8㨩�@�b�Z�_�æ©��'&­�<�5«b� �@�3�@¢¬ R�@�'�<� �@� ©��Bèæ���D�b����D�Ô«8�@ :«b¢B�@�Â� ©��B�^�@�5�Ð��£/��� ��©��<�w«5©��d�r Ý� «8�@�ª�8ãc®�=^�b�<�Ð�:�Ð�@�+�D� «5�w�Lç½�@�d�r :�¨ÿ��3� «5� 1£b©��È�5©�ÿZ�8ãQâÈ�5©�� (Ð���¤�5© ©��Ó�@�rã��°«5�D�b� � �:�5� �g@_<\</@ @®-à��b�Z¥��`ö �ͧ_�B� �@�hã3�b�B�b��� �E

¥�õ#ö<÷H§�é���� õåù5÷ 7Gl õåùïûC÷%ûVü* N@28SHK/J;PQ7:98K $

$h©��­�8rÜ�<�b�W���B�@� $^ü%û�õ�ûÅü½® $^©�ÿ  :�b�Q¢c�8rÜ�rã ���D� ���@ Ý¢D�%©ªâ � ® (°�D�Í�Â�È� « �%�D�b�<��:�Í�­©k£b�F«8�@�<��âÈ¢3 á�c£b©�¢3�¤���D�Í�L�@�3�3�0©ªâÐ���D�­� �@�L�@�Â� ©��r® �`�ï�r� « ��©��æ�����@�b���c£/ :�5è�õê�/� �Xè�@�rãÍ :�b�á�����L�@�3�3�Ö©�@�b�ï����� ���@ Ý¢å�5�/® (°�D�b�Bè¬âµ©��¨�8�3«5��r�ç¬�8ã ���@ Ý¢å�Ö©ªâÜõ°è'ÿZ�ï :�b�æ÷ ���@� �©�@�b�������%�L�@�3�3�Fÿ��3� «5���:�^õåù û�÷�û ü½® 1b�d�0� � �D�b  � � â��c©�¢Þ :© © �M�@�cr��@¢3�5� ½®c®(°�3¢½�5è

ü é � � E¥�õ#ö<÷B§��c÷ �Oõ�é � � �

� �� ���� õ ù ÷ �c÷ �Oõ

é � � �� �

õ ù � � �� � ÷ �c÷ � �Oõ�é � � �� �

õ ù ü1$�õ��1 �cõké . �1¬ü $

=h�b�æ«5�5è � é 1¬ü0,2. $ $h©�ÿV :�b���«5©��d�r¢3�ª� # ¥�� � �w§>® (°�3�:�k«5©��È�<�5�­�D©��rã��Ô�­©î���D�� �b� � é�ó¬¥�õ#ö8÷H§Èø *�ûÅõìû üOö8õåùkû�÷�ûÅõ¡ý¬® �� ©�¢�«8�@�Þ� �5�����3�:�k£ �+ã@�8�@ÿ'���3�g�ã@��� �@�L�@�0®� �°© è

# ¥�� � �Ч~é 1¬ü. � �

� � ���� õ ù ÷ �½÷ �Oõ�é 1¬ü. � �

� õ ù � � ���� ÷ �½÷ � �cõé 1¬ü

. � �� õ ù õåù"$ºõ��

1 �cõké #1h* $|ò

Page 44: All of Statistics: A Concise Course in Statistical Inference Brief

�1 ��������� ����������������� ����������� �! "

* ü

ü

÷ÔéCõåù÷Ôé õ

õ

÷

� 7GIOÑHJ8K # T 3� f-SHKÔ[:7:I@S½2_98SD4@ABK/AßJ8K/I@7:NO?+7G9 õ ù û�÷�û�ü Twf-SHKÔAHK>?H987R2­j+7:9¨]rNO987R2<7RaOKN aOK>Jh2<SB7:9^J8K/I@7:NO?°T�f-SHKFSD4628=5SHK/AÞJLK/IO7GNO?�7G9^28SHK�KbaOK>?½2 � � � 7:?c28K/J89LK/=b2<K/AºPQ7R2<Sõ ù ûC÷%ûVü T

����� � MÍ&��0.µ"PMQR G .IH�$á& .7J+*`$'./(0"LH

n'obp qOs t<s³uBqÍvHxÙØ@v 1�âÜ¥��Mö �ͧ'�B� �@� � ©����D�°ã@�:�b���È��£/¢¬��� ©��wÿ'�����¨�0���5�åâÈ¢¬�æ«b��� ©��Eo � �'è

���D�b�+���D�ï���H} ���#�D�Z�|� `I` eg[��ZYh^X��� � e �å}�� �:�hã3�srÜ�æ�8ã�£ �Eo¨¥�õ §Üé # ¥�� é õ §Üé H # ¥�� éCõ#ö �éì÷H§'é H

E¥�õ#ö<÷B§ ¥ # T # §

�@�rãF���D�ï�|�H} r�����D�Ó�|� `I`�e�[���Yh^ ���r�Ne �å} �Y�:�hã3�srÜ�æ�8ã%£ �E�Ó¥�÷B§áé # ¥ ��é ÷H§'é H � # ¥���é õ#ö �é ÷B§'é H � E

¥�õ#ö<÷H§F$ ¥ # T .c§

×½Ë@´cÀ^Êc¶ oÖvHx Øh& �å¢X�c�D© � � ���B�@�Eo�� �+�:���@� �@�b�Í���w���D� � �c£/ :�'���B�@��⵩� � :©�ÿ��/® (°�D�Ü�0�@�Â�@���r�@ 

ã@�:�b���b��£/¢3��� ©��d⵩��á� «5©��È�<�5�­�D©��rã��Q�­©Ö���D�W�<©�ÿº�­©�� �@ ·�ï�@�rãh���D�Q�0�@�Â�@���r�@ åã@�:�b�Â�È��£/¢3�Â� ©��⵩�� � «5©��È�<�5�­�D©��rã��¨�­©0���D�w«5©� Ý¢3�w�B�Ö�­©�� �@ ·�/®

Page 45: All of Statistics: A Concise Course in Statistical Inference Brief

�����\� �D�����8� ��� � � � "!���� � �8Z�%�8� " #�é * �éü

� � < 9%:T9�< @7:T9�< b:T9�<� � 9 b:T9�< ;h:T9�< � :T9�<

;h:T9�< � :T9�< 9Ú�©��Ð�Lç½�@�d�r :�5è

Eo_¥ *c§'é # ,¬ü0*M�@�rã

EoÖ¥ªüX§'é $ ,¬ü0*D®¤ò

n'o>prqOs t<s uBqwvHx ØhVCÚ�©��-«5©��D�����D¢å©�¢c�¤�L�@�rã3©������@�È���c£/ :�5�5è°���D�-�0�@�Â�@���r�@ 3ã3�b�B�b����� �5��@�<� E

oÖ¥�õ §Üé � E¥�õ#ö<÷B§��c÷ ö 4@?BA

E�Ó¥�÷H§'é � E

¥�õ#ö<÷B§��Oõ!$ ¥ # T O§(°�D��«5©��b�<�5�­�D©��rã@���3�`�0�@�Â�@���r�@ Óã@�:�b�Â�È��£/¢3�Â� ©���âÈ¢3�æ«b�Â� ©��B�%�@�<��ã3�b�æ©��­�8ãg£ � mZo�@�rãQm��Ó®

×½Ë@´cÀhÊc¶ o^vHxÙØ � �å¢ �c�D© � �Ö���B�@� Eo�� �Ó¥�õ#ö8÷H§'é�� � � � � �

⵩��-õ#ö<÷ � *D® (°�D�b� EoÖ¥�õ §Üé�� � � � �� � � �c÷Ôé�� � � ®-ò

×½Ë@´cÀhÊc¶ o^vHxÙØ � �å¢ �c�D© � �Ö���B�@�E¥�õ#ö<÷B§'é � õÔúê÷ 7Gl *wûCõMûVüOö�*0ûC÷�ûVü

* N@28SHK/J;PQ7:98K $(°�D�b� E

�Ó¥�÷B§'é � �� ¥�õÔúê÷H§ �Oõ�é � �

� õ �cõÍú�� �� ÷ �c÷Ôé ü

1 ú�÷ $�ò

×½Ë@´cÀhÊc¶ o^vHxÙØ Cà#�b�Z¥��Mö �ͧ¨�B� �@�^ã3�b�B�b��� �E¥�õ#ö<÷B§áé � ù �� õåù5÷ 7Gl õåùïûC÷%ûVü

* N@2<SBK/JLPQ7G98K $(°�3¢½�5è E

oÖ¥�õ §Üé�� E¥�õ#ö<÷B§��c÷Íé 1¬ü

. õ ù � ���� ÷ �c÷wé 1¬ü� õ ù ¥ªü1$ºõ � §

⵩�� $hü^ûÛõMûVü��@�rãEoÖ¥�õ §Üé *ß©����D�b�Èÿ'�:� �/®¤ò

Page 46: All of Statistics: A Concise Course in Statistical Inference Brief

2. ��������� ����������������� ����������� �! "

����� !#"+) �� ¨"+) ¨"%$ � MÐ"+)M(�� � MÐ& .bM JLR� H

n'obp qOs t<s³uBqÍvHxÙØ ?X(#ÿZ© �L�@�rã3©�� ���@�b���c£/ :�5�Ö� �@�rã � �@�<�Ð�Â���#� A'�3�����3� ^�� â<è⵩��Í� �@�b� � � �@�rã��0è

# ¥�� ! � ö � !��w§Üé # ¥�� ! � § # ¥ ��!��w§F$ ¥ # T ðO§�`�^ÿ'�È���ª�Q��� �%®

�­?Ô]HJ87G?H=/7G]H[:K6Ò@2<N¨=5SHK>=5\wPQSHK>28SHK/J � 4@?HA � 4@JLK-7:?HAHK>] K>?HAHK>?½2ÜPÓKQ?HK/K>A028N¨=5SHK/=5\K:�½ÑD4�2<7:N@? ¥ # T ðc§ l�NOJÖ4@[:[�98ÑHiH9LK>289�� 4@?HA � T � NOJL28ÑH?D4628K/[Gj@Ò°PZK0SD4µaOKÔ2<SBKÔl�N@[:[:N PQ7G?HIJ8K>98ÑH[R2¨PQSH7:=5SgPÓK09;254628Kwl�NOJ¨=/NO?c287:?½ÑHNOÑH9ÖJ84@?HAHNOEYa64@J87:4@iH[:K>9d2<SBNOÑHIOSg7G2Ö7G9d2<JLÑHKwl�NOJAH7G98=/JLK>28K^J84@?HAHN@E�a64@J87:4@iH[:K>9Z2<N3NHT¯Q²Xo/uc»mo�À+* ,6*���à#�b�¡� �@�rã �¼�B� �@� � ©����D�°�BãLâ

Eo � �Z®?(°�D�b�0�� � � â^�@�rãk©��D  �0� âE

o � � ¥�õ#ö<÷B§'éEo_¥�õ §

E�Ó¥�÷B§á⵩��Ð�@ �  ���@ Ý¢å�5�Qõ��@�rãh÷殯Q²Xo Ǫt;´ tªo6Àïo6q t s³Ç

qXu6t »Âs 9@u½»:uBÆXÇ Ä@ob¾±/´½ÆXÇ;o%t5²Xo%¿co�qXÇÈs t¸gs³Ç¿cobp qXo>¿ uBqc¶ ¸�ÆcÊ tªuÇ;o>tªÇQu@Ì Àïoµ´XÇbÆc»:o�=Hx ×½Ë@´cÀ^Êc¶ oÖvHx vByÞà��b��� �@�rã �¼�B� �@�Ö���D�Ü⵩� � :©�ÿ'���3�Fã@�:�b�Â�È��£/¢3�Â� ©�� 6

�é * ��éü� � < 9%:�; 9%:�; 9%: @� � 9 9%:�; 9%:�; 9%: @

9%: @ 9%: @ 9

(°�D�b�BèEo¨¥a*c§Wé

Eo_¥;üX§Wé�ü-,�1ß�@�rã

E�Ó¥ *c§Qé

E�Ó¥ªüX§Qé�ü-,�1H®^� �@�rã�� �@�<�w���rã3� &

�D�b�rã3�b�D�ï£b�5«8�@¢½� �Eo_¥ *c§

E�Ó¥ *c§dé

E¥a*Bö *c§<è

Eo_¥ *c§

E�Ó¥;üX§dé

E¥a*BöµüX§<è

Eo¨¥;üX§

E� ¥ *O§¨éE

¥ªüOö *c§<èEoÖ¥ªüX§

E� ¥;üX§Öé

E¥;üOöµü §/® �å¢ �c�D© � �0���B�b�ª�8�cãî���B�@�Ó� �@�rã � �B� �@�0���D�Q⵩�  &

 :©�ÿ'���3�%ã@�:�b���È��£/¢¬��� ©�� 6�é * ��éü

� � < 9%: @ < 9%: @� � 9 < 9%: @ 9%: @

9%: @ 9%: @ 9

Page 47: All of Statistics: A Concise Course in Statistical Inference Brief

�����0��� ��� � ��� � ����������� ����������� �! " � (°�D�5� ���@�<�Þ�æ©��k���rã3�­�D�b�rã3�b�D�k£b�5«8�@¢½� �

Eo_¥ *c§

E�Ó¥;üX§Cé ¥;ü0,�1O§/¥ªü-,�1O§ é ü-,2.��c�b�E

¥ *¬öµüX§'é *H®-ò

×½Ë@´cÀhÊc¶ o^vHxÙv@Ø �å¢ �c�D© � �Q���B�@�å� �@�rã � �@�<�Q���rã3�­�D�b�rã3�b�D���@�rãÍ£b©����Í�B� �@�Q���D�-�/�@���ã3�b�B�b��� � E

¥�õ §Üé � 1�õ 7Rl *ÔûÛõ`ûôü* N62<SHK>JLPQ7:9LK $

à#�b�Ó¢c� rÜ�rã # ¥���ú � ûôüX§>®��¡�b���3�����rã3�­�D�b�rã3�b�æ«5�5è����D� � ©����D�¤ã3�b�B�b��� �F�:�E¥�õ#ö<÷B§'é

Eo_¥�õ §

E�Ó¥�÷B§'é � .@õå÷ 7Rl *wûÛõ`ûü@ö *wûC÷%ûVü

* N@28SHK/J;PQ7:9LK $$h©�ÿ�è

# ¥���ú � ûVüX§~é � � � � K � E ¥�õ#ö<÷B§��c÷ �Oõé . � �

� õ � � � � �� ÷ �c÷ � �Oõ

é . � �� õ ¥;ü $ºõ §;ù

1 �cõké üð $�ò

f-SHK^l�N@[:[:N PQ7G?HIÍJ8K>98ÑH[R2Q7:9-SHK>[:]Bl�ÑB[°l�NOJ¤aOK>J87Gl�j37G?HIÔ7:?HAHK>] K>?HAHK>?H=/K6T¯W²Xo>uc»mo6À+*-,6* * �å¢X�c�D© � �î���B�@�Í���D�`�L�@�3�3�g©ªâÔ� �@�rã � �:�M� �G�D© �5�b��£/  �|���IrÜ�D���­� �<�5«b� �@�3�@ :�/® 1�â E

¥�õ#ö<÷B§Óé��æ¥�õ §��#¥�÷B§Ü⵩��_� ©����ÜâÈ¢¬�æ«b��� ©��B���|�@�rã� � �æ©��Ó�æ�5«5�5�5�/�@�b��  ��r�<©@£5�c£/�� Ý��� ��ã3�b�B�b��� �-âÈ¢3�æ«b�Â� ©��B� 0���D�b��� �@�rã � �@�<�^���rã3�­�D�b�rã3�b�D� ®×½Ë@´cÀhÊc¶ o^vHxÙv2& à#�b�#� �@�rã�� �B� �@�^ã3�b�B�b��� �E

¥�õ#ö<÷B§'é � 1 � � � � �Hù � 7Rl õ � * 4@?HA ÷ � ** N62<SHK>JLPQ7:9LK $

(°�D�Ü�L�@�3�3�-©ªâ¡�~�@�rã ��:� ���D� �<�5«b� �@�3�@ :�Ó¥ *Bö &C§ �ï¥ *BöC&C§>® �M�-«8�@�wÿ'�È���­�E¥�õ#ö8÷H§áé

�æ¥�õ §�#¥�÷H§¨ÿ��D�b�<���æ¥�õ §'é 1 � � � �@�rã�#¥�÷B§'é�� � ù ®?(°�3¢½�5è¡��� �%®-ò

Page 48: All of Statistics: A Concise Course in Statistical Inference Brief

@ð ��������� ����������������� ����������� �! "

����� ��(0"+)+.>$á./(0"PM R G .7H�$'&Ü.IJ+*`$á./(0"LH� l � 4@?HA � 4@J8KÓAH7:9L=/JLK>2<K6Ò@2<SBK/?ÍPÓK-=µ4@?Í=>NOE0]HÑB2<K 2<SHKZ=/NO?HAB7G2<7GNO?D4@[cAB7:9L28J87GiHÑB2<7GNO?

N@l � IO7Ga@K/?g2<SD4�2_PZK0SD4µaOKÔNOiB98K/J;aOK>A �¼é ÷ TÔ1¬]rK/=/7 �D=µ46[:[Gj@Ò # ¥�� é õ�� �¼é ÷H§ïé# ¥�� é�õ#ö � é�÷H§ , # ¥ ��é�÷B§ T%f-SH7G9Ð[GKµ4@AH9^ÑH9Ö2<NîAHK��D?BKF2<SBK�=/N@?HAH7G287:NO?H4@['E�46989l�ÑH?H=b2<7GNO? 4@9-l�NO[:[GNXPQ9>T

n'obp qOs t<s³uBqÍvHxÙv2V (°�D��YO�r����� ^X��� �#�D�(A�}X�å���H�������s^CB��|� `7`�eg[��ZYh^X��� � � `Eo�� � ¥�õ�� ÷B§'é # ¥���é õ�� �éì÷H§�é # ¥���é õ#ö �é ÷B§

# ¥ �éì÷B§ éEo�� �Ó¥�õ#ö<÷B§E�Ó¥�÷B§

� âE�Z¥�÷H§ � *D®

� NOJ¨=/NO?c287:?½ÑHNOÑH9^AH7G9L2<JL7:iHѬ2<7:N@?H9dPÓKFÑH9LK028SHKÔ9<4@E0KwAHK �D?H7G287:NO?B9/TÔf-SHK07:?c2<K>J8]HJLK )�+o�´�»moÞt5»mo/´�¿Bs·q 9ôs·q¿co>o6Ê 3'´Xtªo�» ²Xo6»:oOx��²Xo6q 3'oô±>uBÀhÊcÆ tªo# ¥�� ! � � � é ÷B§s·q t5²Xo ±/u¬q t<sÎqcÆXuBÆXDZ/´�Ç;o 3'o ´�»moÛ±>uBqX¿Bs ¾t<s uBqOsÎq 9ÐuBq¨t5²Xo_o>ÁXo�q tó��é ÷rý 3W²Os ±6² ²�´�ÇÊ�»mu¬Ä�´½ÄOs·¶Ùs t�¸/=Hx �go´ ÁXu¬s ¿ t5²Os³Ç Ê�»muBÄc¶ o6ÀÄb¸ ¿co>prqOsÎq 9 t5²OsÎq 9OÇs·q tªo�» ÀïÇ u@Ì t5²Xo� kZlÞx ¯W²Xo Ì�´�±btt5²�´ tÐt5²Os Ç%¶ o/´�¿cÇÍtªuß´3áo6¶Ù¶ ¾­¿cobp qXo>¿ t5²Xo>u½»G¸s Ç�Ê�»mu@ÁXo>¿ s·q~Àduc»mo´X¿OÁ�´cqX±/o/¿ ±/uBÆc»:ÇLo>Ç/x�+o ÇÈs·ÀhÊc¶ ¸ t;´ �5o�s t´XÇQ´Ð¿cobp qOs t<s³uBq3x

254�2<7:N@?`AH7 �æK>J89 � 7G?`2<SHKÐAH7:9L=/J8Kb2<KÍ=/4@98K6ÒEo�� � ¥�õ�� ÷B§ 7:9 # ¥�� éôõ�� �Åé÷B§ iHѬ2_7:?î2<SHK

=/N@?½287:?½ÑHNOÑB9W=/4@98K6ÒDPZKÖEÍÑH9;2W7:?c2<K>IOJ<4628K_2<NÔIOKb2d4w]HJLNOiD4@iB7:[:7R2­jOT

n'obp qOs t<s³uBqÍvHxÙv � Ú�©��Ó«5©��D�Â���D¢D©�¢c� �8�@�rã3©�� ���@�b���c£/ :�5�5è ���D��YO�r�����s^ ���r�#�D� A�}X�å����H�����Â� ^CBê���3�Z`µ� ^�B e�[��ZYh^X���r�Û�a`E

o�� � ¥�õ�� ÷B§'éEo�� � ¥�õ#ö<÷B§E�Ó¥�÷B§

���5�b¢3�w���3�F���B�@�E�Z¥�÷H§ � *D®?(°�D�b�Bè# ¥�� ! � � �ôé ÷H§'é�� � E

o�� �Z¥�õ�� ÷B§��Oõ!$×½Ë@´cÀ^Êc¶ oÖvHx v �Ûà��b�'� �@�rã�� �B� �@�0��¢¬�D� ⵩��È� ã@�:�b���È��£/¢¬��� ©���©������D�Í¢¬�D���¤�/ä/¢B�@�5�/® �b�È� â �Ö���B�@�

Eo� � ¥�õ�� ÷B§'é�üÓ⵩�� *wûÛõMûVüw�@�rã�<w©����D�b�Èÿ'�:� �/® (°�3¢½�5èæ�@� �@�b� ��é ÷åè

� �:� �W?H7Gl�N@J8E � < è 9 @® �`�w«8�@�Mÿ'�È���ª�^���3�:�Ð���Q��� �ôéì÷ � �W?H7Gl ¥ *Bö/üX§/®¤ò� J8NOE�28SHKÔAHK��H?H7G287:NO?gN@lÓ2<SHKÔ=/N@?HAH7G287:NO?H4@[áAHK>?H987R2­jOÒ#PZK098K>K028SD462

Eo�� �Ó¥�õ#ö<÷H§dé

Page 49: All of Statistics: A Concise Course in Statistical Inference Brief

������� �8�8��� �7Z�%�8��� � � � "!���� ��� Z�%����" $Eo�� � ¥�õ�� ÷B§

E� ¥�÷B§^é

E� � o ¥�÷ � õ §

Eod¥�õ § Tîf-SH7G9Ð=/4@?º98N@EFKb2<7:E0K/9ÖirK�ÑH9LK>l�ÑH[Z4@9Ð7G?�28SHK

?HK � 2ïK � 4@E0]H[:K6T×½Ë@´cÀhÊc¶ o^vHxÙv Cà#�b�E

¥�õ#ö<÷B§'é � õÔúê÷ 7Gl *wûCõMûVüOö�*0ûC÷�ûVü* N@28SHK/J;PQ7:98K $

à#�b�h¢c� rÜ�rã # ¥�� �~ü-,2. � � é ü-, # §>® 1È� �Lç½�@�d�r :� ½® @ �`ÿZ�%�/�@ÿ ���B�@�E�Ó¥�÷B§Fé

÷¨úì¥;ü-,�1@§/®�=^�b�æ«5�5è Eo� � ¥�õ�� ÷H§'é

Eo�� �Ó¥�õ#ö8÷H§E�Ó¥�÷B§ é õÔú�÷

÷Öú �ù$

�°© è

� B� � ü.����

�é ü# � é � ��� �

�Eo� � Hõ �

���

ü# � �Oõ

é � ��� ��

õwú ���� ú �ù�Oõ�é �� ù ú ���� ú �ù

é ü7.# 1 $ºò

×½Ë@´cÀhÊc¶ o^vHxÙv ? �å¢ �c�D© � �Ö���B�@����� �ï?H7Rl ¥ *Böµü §/®¤{áâÈ�ª�b�Í©@£/� �@���D���3�%� ���@ Ý¢D�Í©ªâ-� ÿZ��3�b�æ�b�L�@�ª� � � ��éCõ1� �ï?H7Rl�NOJ8E ¥�õ#öµüX§>® �%�B�@�'�:�_���D�_�0�@�Â�@���r�@ æã@�:�b���È��£/¢¬��� ©��ß©ªâ ���

ÚÜ���<�b�Ó�æ©��­�^���B�@��è Eo¨¥�õ §Üé � ü 7Rl *0ûÛõ`ûVü

* N@28SHK/J;PQ7:9LK�@�rã E

� � o_¥�÷ � õ §áé � �� � � 7Gl *�� õ���÷ � ü* N@2<SHK>JLPQ7G98K $

�°© è Eo�� �Ó¥�õ#ö<÷B§'é

E� � o_¥�÷ � õ §

Eod¥�õ §Üé�� �� � � 7Rl * ��õ�� ÷ � ü

* N62<SHK>JLPQ7:9LK $(°�D�Ö�0�@�Â�@���r�@ @⵩�� � �:�E

�Ó¥�÷B§'é � �

Eo�� �Ó¥�õ#ö<÷H§ �Oõké �

��Oõ

ü $|õ é $ � � � �

���� é/$ [:N@I ¥;ü $Þ÷B§

⵩���* � ÷ �ñü½®-ò

Page 50: All of Statistics: A Concise Course in Statistical Inference Brief

� ��������� ����������������� ����������� �! "

×½Ë@´cÀ^Êc¶ oÖvHx'& =þÓ©��B�b��ã3�b�k���D�îã3�b�B�b��� �ß��� ��ç½�@�d�r :�! ½® @��½®Cà#�b� " � rÜ�rã E� � o ¥�÷ � õ §/®

���D�b�ß� é õ°èï÷Û�w¢c�b�d�/�@�Â�:��â �%õ ù û¼÷ û~ü½® �-�@�È Ý� �b�<è^ÿZ���/�@ÿ���B�@�EoÖ¥�õ §Ôé

¥ 13ü-, � §­õåùX¥;üD$ºõ��>§>®�=h�b�æ«5�5èD⵩��-õåùWûC÷%ûVü6èE� � o ¥�÷ � õ §áé

E¥�õ#ö<÷B§Eo¨¥�õ § é ù �� õåù<÷ù �� õ ù ¥ªü $ºõ � § é 16÷

ü $|õ � $$^©�ÿ  :�b�¨¢c�`«5©��d�r¢3�ª� # ¥ � � # ,2. � � é ü-,�1O§>®+(°�3�:� «8�@�C£b�`«5©��d�r¢3�­�8ã�£ � rÜ�<�b��æ©������3�F���B�@�

E� � o_¥�÷ �Îü-,�1@§áé # 1�÷ ,¬ü H®�(°�3¢c�5è

# ¥ � � # ,/. � � é�ü0,�1O§'é � �� � �

E¥�÷ �·ü-,�1O§ �½÷hé � �

� � �# 16÷ü �c÷wé $

ü $|ò

����� � *LR>$á.��BMÍ&Ü.7Mh$$ G .7H�$'&Ü.IJ+*`$á./(0"LHNMÍ"+) ����� � M!� �LR� H�¡K>2 � éY¥�� � öI$I$7$/ö8� � § PQSHK/JLK � � öI$7$I$>öL� � 4@J8KÔJ<46?HAHNOEYa@46J87m46iH[:K>9/T � Kw=µ4@[G[� 4 �8�@�rã3©�� �@�5«b�­©�� T �¡K>2 E

¥�õ � öI$I$7$>ö8õ � § ABK/?HN@28Kï28SHK_]rABlªT � 2¤7G9Ó]rNO9L987:iB[:KQ2<NÐABK��D?HK2<SBK/7:J�EF4@J8IO7G?D4@[G9/Ò�=/NO?HAB7G2<7GNO?D4@[G9#K>2<=6T�EÍÑB=5SÍ2<SHK¤984@E0KÓPZ4 jÍ4@9�7G?h28SHK¤iH7Ra@46J87m4�2<K =µ4@9LK@T� K^984µj%2<SD4�2 � � ö7$I$I$/öL� � 4@J8KÖ7:?BAHK/]rK/?HABK/?c2W7GlªÒBl�NOJ-KbaOK>JLj � � ö7$I$I$/ö � � Ò

# ¥�� � ! � � öI$I$I$>ö8� � ! � � §Üé � !,+ � # ¥�� ! ! �1! §F$ ¥ # T $ §

� 2Ü98Ñ �F=>K/9 2<N^=5SHK/=5\Í28SD462E¥�õ � öI$I$I$>ö8õ � §Üé� �!,+ � E o J ¥�õ ! § T � l � � öI$I$7$>ö8� � 4@JLK-7:?HABK )

]rK/?HAHK>?c2Q4@?HAkK/4@=5SkSD4@9Z28SHK_984@EFKïE�46J8IO7G?D4@[DAH7G9L2<JL7:iHѬ2<7:N@?0PQ7R2<S%AHK>?H987R2­jEÒ½PÓK¨9<4µj

2<SH462 � � ö7$I$I$bö8� � 46J8K # # kV¥ 7:?HAHK>] K>?HAHK>?½2_4@?HAî7:AHK>?½287:=/4@[:[Rj�AH7G9L28J87:iBÑB2<K>A § T � K^98SD4@[G[PQJ87R2<K^2<SH7G9_4@9 � � ö7$I$I$ª� � �

ENOJ/Ò 7:?`28K/JLEF9ïN@l'2<SHK i>k�l Ò � � öI$I$I$;� � � m Tïf-SH7:9

E0Kµ4@?H9^2<SD4�2 � � öI$7$I$/ö8� � 4@J8K07:?BAHK/]rK/?HABK/?c2wAHJ84 PQ9hl�J8NOE 2<SBK�9<46EFK0AH7:9;2<JL7:iHÑB287:NO?æT� Kh46[:98Nw=/4@[:[ � � ö7$I$I$bö8� � 4 �L�@�rã3©����/�@�d�r :� l�JLNOE m T�îÑH=5SMN6l�9;2546287:9L287:=/4@[æ2<SBK/NOJ;j 4@?HA ]HJ<4@=b2<7G=/KÖi K>IO7:?B9QPQ7R2<S # # k N@iH98K>JLa6462<7GNO?H9W4@?HAPÓK^9LSD4@[G[¡9L2<ÑBABj%2<SB7:9Q=µ4@9LKÖ7:?�AHK>2<4@7:[ PQSHK/? PZKÖAH7G98=/ÑB989W9L254�2<7:9;2<7G=/9/T��� � � ��� ( !�� �|(0&á$�MÐ"k$ � *LR>$á.��BMÍ&Ü.7M^$� G .7H�$'&Ü.IJ+*DU

$á./(0"LH

Page 51: All of Statistics: A Concise Course in Statistical Inference Brief

�������\�+�� � � � ���8��_� � �� ���dZ�0����������� � � "!���� ��� Z�%����" ) 4�B�'(#&% �$;H#I+.B TÜf-SHK_EÍÑB[G2<7Ra64@J87:462<K¤a@K/J89L7:NO?�N@l�4ÍÕZ7G?HNOE07m4@[H7G9Ó=/4@[:[GK/A�4 �îÑH[R2<7 )

?HNOE07m4@[ÂT NO?H987GAHK/J AHJ84 PQ7G?HI^4^iH4@[:[Bl�JLNOE 4@?0ÑHJL?0PQSH7G=5S�SD469 iH4@[:[G9áPQ7G2<S8=wAH76� K/JLK/?c2=/NO[GNOJ89�[m4@irK/[GK/AÛ=>NO[:NOJ ü ÒW=/NO[GNOJ 1 Ò^TRTGTÅÒï=>NO[:N@J�\åT �¡Kb2 D�é ¥ D � öI$I$7$/ö D § PQSHK>J8KD�� � * 4@?HA F � + � D��wé ü 46?HA�9LÑH]H]rNO98K02<SD4�2 D�� 7:9_2<SBKF]HJLNOiD4@iB7:[:7R2­j`N6l¤AHJ84 PQ7G?HI4%iD4@[G[�N@l =/N@[:NOJDT � J<4µP A 2<7GEFK>9 ¥ 7G?HAHK/]rK/?BAHK/?c2ÖAHJ84 PQ9¨PQ7R2<S+J8K>]H[m46=/K/E0K/?c2 § 4@?HA[:Kb2 � é ¥�� � öI$I$7$/ö8� § PQSBK/J8K ��� 7G9¤2<SHKÖ?½ÑHEÍirK/J-N@l�2<7GEFK>9¤2<SD4�2Q=/NO[GNOJ��46]H] K/4@J89>TUïK>?H=/K6Ò Aêé F � + � ��� T � Kw9<4µj+2<SD4�2 � SH4@9Ö4 �îÑH[G287:?HN@EF7:4@[ ¥sA Ò Dr§ AH7G9L2<JL7:iHѬ2<7:N@?PQJ87R282<K>? ��� �îÑB[G2<7G?HNOE07m4@[ ¥aA�ö Dr§ T'f-SHK^]HJLNOiD4@iH7G[:7R2­j0l�ÑB?H=>287:NO? 7:9E

¥�õ §Üé Aõ � $7$I$Lõ � D �� � � � � D ��� ¥ # T � §

PQSHK/JLK Aõ � $I$7$Lõ � é A �

õ � � � � � õ � $6o�ÀhÀd´wvHx'&Hy �å¢ �c�D© � �F���B�@�Ó� � � ÑH[G287:?HNOE07m46[ ¥aA�ö Dr§Fÿ��D�b�<�^� é~¥�� � öI$7$I$/ö8� §�@�rã D%é ¥FD � öI$I$I$>ö D §/®?(°�D�Ö�0�@�Â�@���r�@ #ã@�:�b���È��£/¢¬��� ©��|©ªâ¤���w�:���Q���æ©��w���@  � A#è D�� @®) 4�B�'(#�� +��(#I+ ' " ��� �(; +.B TFf-SBK0ÑB?H7Ga64@JL7m4628K 'ïNOJLE�46[áSD4@Ag2­PZN ]D4@J84@E0K>2<K>J89>Ò�Û4@?BA áT �­?|28SHK�EÐÑH[G287Ga64@J87:462<KhaOK/JL987GNO?°Ò��7:9h4�a@K/=b2<NOJw46?HA �7:9^J8K/]B[m4@=>K/A|icj�4

E�4�2<J87 ��� TÜf�NFirK/I@7:?°ÒH[GK>2� é

��� � �TTT�

����PQSHK/JLK � � öI$7$I$/ö � ���º¥a*BöµüX§ 46J8KÖ7:?HABK/]rK/?HAHK>?c2µTÜf-SHK^AHK/?H9L7G2­jkN6l � 7:9 �ÃÌ ´cqX¿ �`´6»:oMÁXo/±b¾

tªu½»:Ǽt5²Xo6q �� � éF !,+ � ! � ! xE¥��c§Üé !,+ �

E¥�� ! § é ü

¥ 1��#§ � ù K � ] � $ ü1H � + � � ù� � é ü

¥ 1 ��§ � ù K � ] � $ ü1 � � � � $

� Kd9<4µjF28SD462 � SD4@9¤4Ð9L2<4@?HAD4@JLA%EÍÑH[R2<7Ra@46J87m4�2<K 'ïN@J8EF4@[åAH7:9;2<J87GiHÑB287:NO?ÔPQJ87R282<K>? �*��º¥a*Bö"!¬§ PQSHK>J8KÔ7G2¨7G9_ÑH?HABK/J89;2<N3N¬A+28SD462 * JLK/]HJLK/98K>?c2<9h4ka@K/=b2<NOJ¨N@l =$#/K>J8N3K/9Ö4@?BA !7:9¤28SHKA= � =k7:ABK/?c2<7R2­j%EF462<JL7 � T�îNOJ8KFIOK/?BK/J<46[:[Gj@Ò�4�aOK>=>2<N@J � SH4@9Í4 EÍÑH[G287Ga64@JL7m4628K 'ïNOJLE�46['AH7:9;2<J87GiHÑB287:NO?°Ò#AHK )?HN@28K/A i½j �����|¥ � ö%�-§ ÒD7Gl�7G2QSH4@9QAHK/?H9L7G2­j � � � s Ç�t5²Xo_sÎq ÁXo�»mÇ;o-u@Ì

t5²XoÔÀd´ t5»Âs Ë&�_xE¥�õ#ø � ö%�-§Üé ü

¥ 1���§ � ù AHKb2 ¥'�-§ ��� ù K � ] �D$ ü1 ¥�õ $ � § � � � � ¥�õ $ � § � ¥ # T §

Page 52: All of Statistics: A Concise Course in Statistical Inference Brief

ð�* ��������� ����������������� ����������� �! "

PQSHK>J8KwAHKb2 ¥ � § AHK>?HN@2<K>9¨2<SHKÍABK>2<K>J8E07:?D46?½2_N@l 4%E�4�2<J87 � Ò �Þ7:9d4�aOK>=>2<N@J¨N@l [:K/?BI@2<S =4@?HA � 7:9Ô4 = � =º9Lj3E0EFKb2<J87G=@ÒÜ]rNO987R2<7Ga@K%AHK �D?H7G28K�E�4628J87 � Tº1¬K>2L2<7G?HI � é * 4@?HA8¼Àd´Xt5»�s Ë � s Ç`ÊOu@Ǫ¾

s t<s ÁXoº¿cobp qOs tªoÛs ̪¹hÌ�uc»´c¶·¶ qXuBqµ¾ � o�»mu�ÁXo>±>tªuc»mÇõ#¹Hõ � �Ó� � *Hx

� é ! IO7RaOK/9QiD46=5\�2<SHKÖ9;254@?HAH4@J8A 'ïNOJLE�46[ T1¬7:?B=/K � 7:9h9Lj3EFE0K>28J87G=%4@?HAÞ]rNO9L7G2<7RaOK%ABK��D?H7R2<K6Ò'7G2Í=/4@?ÞirKk98SHN PQ?Þ2<SD4�2Í2<SHK>J8K

K � 7G9L2<9F4gE�4628J87 � � ��� ù�� =µ46[:[:K>A�28SHK 9 �½ÑD4@JLKîJLN¬N@20N@l ��� PQ7G28S�28SHK l�N@[:[:N PQ7G?HI]HJLNO] K>JL287:K/9 �+¥ 7 § � ��� ù 7:9Ð9Lj3EFE0K>28J87G=@Ò ¥ 7G7 §���é � ��� ù�� ��� ù 4@?HA ¥ 7G7:7 § � ��� ù � � ��� ù%é� � ��� ù�� ��� ù-é ! PQSHK>J8K � � ��� ù-é ¥'� ��� ùb§ � � T¯Q²Xo/uc»mo�À+* ,���� 1�â � � �|¥ *Bö !B§ß�@�rãß� é � ú � ��� ù � ���D�b� � � �º¥ � ö%�-§>®þÓ©�� �@�b�<� �b  �Xè'� â¤�����|¥ � ö �-§5è ���D�b��� � ��� ù6¥�� $ � § ���º¥a*Bö"!¬§/®

1¬ÑH]H]rNO9LKFPÓK%]D4@J;2<7G287:NO?ß4 J<4@?HABNOE 'ïNOJLE�4@[áa@K/=b2<NOJ � 4@9 � é ¥�� � ö8� � § � K=µ46?î9L7:E07:[m46J8[GjÍ]D4@JL287G287:NO? � é ¥ � � ö � � § 4@?BA��é � � � � � �� � � � � � � $¯Q²Xo/uc»mo�À+* ,�� *à#�b����� �º¥ � ö%�-§>®?(°�D�b� 6

� 9 L(°�D�^�0�@�Â�@���r�@ ¡ã@�:�b���È��£/���Â� ©��|©ªâ¤� � �:�-� � ���º¥ � � ö � ��� §/®�g@ L(°�D�w«5©��rã@����� ©��r�@ ¡ã@�:�b���b��£/����� ©��|©ªâ¤� � �@� �@�b�%� � éCõ � �:�� � � � � éCõ � ��� � � � ú � � � � � �� � ¥�õ � $ � � §bö�� � � $ � � � � � ���� � � � � $

� � 1�â +�:�h� �@�5«b�­©��Ö���D�b� �r�����º¥��� � ö �� � ¬§>®� ; é ¥�� $ � § � � � � ¥�� $ � § � � ù ®

��� �Ô� ��& MÐ"LH��O(0& � Mh$á./(0"LHY( �J� MÐ"+)M(�� � MÐ& .bM JLR� H1¬ÑH]H]rNO9LK028SD462 � 7G9^4 J84@?HAHNOE a@46J87m46iH[:KÍPQ7R2<S � k�l E

o 4@?HA i>k�l mZo T �¡K>2� é��B¥��g§ irK%4�l�ÑH?H=>287:NO?�N@l � Ò�l�NOJhK � 4@EF]B[:K@Ò � é�� ùÐNOJ � é � o T � KF=µ4@[G[�é��B¥��ߧ 4¨2<J84@?H9Ll�N@J8EF462<7GNO?wN@l � T'UïN PCAHN^PZKW=/N@EF]HѬ2<KQ2<SHK � k�l 46?HA i>kZl N@l� e��­?�28SHKÖAH7:9L=/JLK>2<K¨=/4@98K6ÒH2<SHK^4@?B9LPÓK/JQ7:9¤K/4@9Lj@TÜf-SHKÖE�4@9L9¤l�ÑH?H=b2<7:N@? N6l � 7G9¤IO7RaOK/?icj

Page 53: All of Statistics: A Concise Course in Statistical Inference Brief

����� ���+���� ��" � �8���D��Z�%�8��" � � ����������� ����������� �! " ðBü

E� ¥�÷H§'é # ¥ �ôéì÷H§'é # ¥ �B¥��ߧ'éì÷B§áé # ¥­ó õ#ø �H¥�õ § é ÷rý6§áé # ¥�� ! � � � ¥�÷B§8§%$

×½Ë@´cÀhÊc¶ o^vHxW&h& �å¢ �c�D© � �^���B�@� # ¥�� é $hüX§Zé # ¥�� é ü §¤é ü-,2.ß�@�rã # ¥�� é+*c§Zéü-,�1H®Mà#�b� � é�� ù ® (°�D�b�Bè # ¥ � én*c§^é # ¥�� én*c§^é�ü-,�1Þ�@�rã # ¥ � é üX§hé# ¥���é�üX§¡ú # ¥���é $hü § éôü-,�1H® �墬�w�0�@�È�54/���3�\6

õEo¨¥�õ §& 9 9%:�;

< 9%: @9 9%:�;

÷E�Ó¥�÷H§

< 9%: @9 9%: @

� � ��c�5�Qâµ�bÿZ�b� ���@ Ý¢å�5�0���B�@�M� £b�5«8�@¢c� �0���D�0�Â�L�@�B��⵩��È�0�@�Â� ©��|�:�Ô�æ©��Ö©��æ� &­�­© &8©��æ�/®ò

f-SHKh=>NO?c2<7G?3ÑBNOÑH9Q=µ4@9LKh7G9-SD4@JLAHK/J>T f-SBK/J8K^4@JLK¨2<SHJLK/KÖ9L28K/]H9Ql�NOJ �D?HAB7:?HIE�'�

f-SBJ8K/KÖ9;2<K/]B9Wl�NOJ¤28J<4@?B9Ll�NOJLE�46287:NO?B9ü T � NOJ-K/4@=5S ÷ Ò �D?BAî2<SBK^9LK>2 � éó õM���H¥�õ §Zû ÷rý T1 T � 7G?HAk2<SHK i>kZl

m��Ó¥�÷H§ é # ¥ � û ÷B§áé # ¥ �H¥��ߧZûC÷H§é # ¥­ó õ#ø �B¥�õ §¤ûC÷åý6§áé � � � E o¨¥�õ § �cõ ¥ # T üI*c§

# T-f-SHK � k�l 7:9E�Z¥�÷H§'é m �� ¥�÷B§ T

×½Ë@´cÀhÊc¶ o^vHxW&hV à#�b�Eo¨¥�õ §`é � � � ⵩��0õ � *D® (°�D�b� mZo¨¥�õ §îé � �� E

oÖ¥�§ ��

éü $ � � � ®Qà#�b� �é��B¥��g§'é [:NOI �ê®?(°�D�b� � éôó õ`�%õ`û � ýî�@�rãm��Ó¥�÷B§'é # ¥ � ûC÷B§'é # ¥ [:NOI � ûC÷H§'é # ¥�� û � §'é m!o¨¥ � §Üéôü1$ � ��� � $

(°�D�b�<��⵩��5�5èE�Ó¥�÷B§áé � � ��� � ⵩��Q÷ !`�d®Öò

Page 54: All of Statistics: A Concise Course in Statistical Inference Brief

ð\1 ��������� ����������������� ����������� �! "

×½Ë@´cÀ^Êc¶ oÖvHx'& �Ûà��b�#��� �ï?H7Rl ¥%$hü@ö # §/®-ÚÜ���rã0���D�-�BãLâЩªâ �é �MùX®?(°�D�hã3�b�B�b��� �k©ªâ� �:� E

oÖ¥�õ § é � ü-,2. 7Gl $ ü ��õ�� #* N@2<SHK>JLPQ7G98K $

� «8�@�ñ©��D  �ß� ��c� ���@ Ý¢å�5� ��� ¥a*Bö §>® þÓ©��B�b��ã3�b�k�ÂÿZ©ê«8��� �76"� � D* ��÷ � üê�@�rã� ��� kü�û�÷ � ®FÚ�©��Ô«8��� � � � è � é p $ � ÷ ö � ÷\qd�@�rã?m.�Z¥�÷H§Wé � � � E o¨¥�õ §��Oõ|é¥;ü0,�1O§ � ÷ ®ÓÚ�©��Z«8��� � � ��� è � é p $^üOö � ÷\q¡�@�rã m.�Ó¥�÷B§'é �� � E o¨¥�õ §��Oõ�é ¥ªü-,2.½§>¥ � ÷¬úüX§>®�(Ö� �_�b�<�b�D�Â���@�����3�Qm¼ÿZ�d�3�b�

E�Z¥�÷H§'é

���� ������� 7Gl * � ÷ �ñü�� � 7Gl ü � ÷ � * N@2<SBK/JLPQ7G98K $ºò

� SHK/? � 7:9Ö9;2<J87G=>28[GjME0NO?HN62<NO?HKw7G?H=/JLKµ4@9L7:?HI�NOJÖ9;2<J87G=>28[GjME0NO?HN62<NO?HKwABK/=/JLKµ4@9L7:?HI2<SBK/? � SD4@9W4@? 7:?caOK>J89LK

�é � � � 4@?BAî7G?�28SH7:9-=/4@98KÖNO?BK^=/4@? 98SHN P 28SD462E�Ó¥�÷B§áé

Eo_¥

�¥�÷H§L§ ��

��

��¥�÷B§�c÷ �

���

$ ¥ # T üOüX§

��� �BE ��& MÐ"LH��O(0& � Mh$á./(0"LH ( � � ��¨& M R�� MÍ"+)M(�� � MÐ&Ü.%UMQJLR� H

�­?%98N@EFKW=µ4@9LK/9ZPÓK¨4@JLKï7G?c2<K/JLK/9;2<K/Ak7G?F28J<4@?H9;l�NOJ8EF46287:NO?ÔN@l¡9LK>a@K/J<46[°J<4@?BAHNOE a64@J87 )4@iH[GK/9>T � N@JWK � 4@EF]B[:K@Òr7Gl � 4@?BA � 4@J8KhIO7Ga@K/?`J84@?HAHN@E a@46J87m46iH[:K>9/ÒDPZKhEF7GIOSc2QP¤4@?c22<N`\3?BNXPÅ2<SBKkAH7:9;2<J87GiHÑB287:NO?�N@l � , � Ò � ú � Ò'EF4 �åó �`ö �0ý NOJhEF7G? ó �Mö �0ý T �¡K>2�¼é �H¥��Mö �Ч irKF2<SBK�l�ÑH?H=b2<7:N@?ºN@lQ7:?c28K/J8K>9L2/TMf-SHK%9;2<K/]B9Íl�NOJ �D?BAH7:?HI

E�

4@J8K02<SHK9<46EFKÖ4@9QirK>l�NOJLK �

Page 55: All of Statistics: A Concise Course in Statistical Inference Brief

����� �_�+���� ��" � �8���D��Z�%�8��" � � "_ �� ����� ��� ������� ��� ����� � �! " ð #

ü T � NOJ-K/4@=5S � Ò �H?HA 28SHKÖ98Kb2 ��� éó¬¥�õ#ö<÷B§Ó� �B¥�õ#ö<÷B§Óû �¬ý T1 T � 7G?HAk2<SHK i>kZl

m�¥��c§ é # ¥ � û �c§ é # ¥ �B¥��`ö �ͧ¤û �c§

é # ¥ªó¬¥�õ#ö8÷H§Èø �H¥�õ#ö<÷H§ û �3ý6§ é � � ��

Eo�� �Ó¥�õ#ö8÷H§ �Oõ �½÷ $

# T-f-SHK>?E�¥��c§Üé m �

�¥��c§ T

×½Ë@´cÀhÊc¶ o^vHxW& �Cà#�b�¤� � ö8� ù � �W?H7Gl ¥a*BöµüX§k£b�%���rã3�­�D�b�rã3�b�D� ®|ÚÜ���rã+���D��ã3�b�B�b��� �|©ªâ�é � � úÞ� ù ® (°�D� � ©����D�Óã3�b�B�b��� �k©ªâh¥�� � ö8� ù §Ö�:�E

¥�õ � öLõ ù §'é � ü *�� õ � � üOö�*%��õ ù �ñü* N@28SHK/J;PQ7:9LK $

à#�b� �B¥�õ � ö8õ ù § é õ � úÞõ ù ® $h©�ÿ�èm��Ó¥�÷H§ é # ¥ � ûC÷H§'é # ¥ �H¥�� � ö8� ù §ZûC÷H§

é # ¥­ó¬¥�õ � ö8õ ù §-� �B¥�õ � ö8õ ù §ZûC÷rý6§'é � � � � E ¥�õ � ö8õ ù § �cõ � �Oõ ù $$h©�ÿ«5©����5�Í���D�Í�B�@�Lãw�B�@�È� 68rÜ�rã@���3� � ®0ÚÜ���<�b�¤�b¢X�c�D© � �Í���B�@� * �V÷gû ü½® (°�D�b�� �:�%���D�����b���@�3�@ :�%ÿ'�������@�b�È�Â� «5�5�k¥ *¬ö *c§bö ¥�÷rö *c§��@�rãߥ *¬ö<÷H§>® �°�5�0ÚÜ�R�@¢3�<� ½®��@® 1b����3�:�+«8��� �5è�� �� � E ¥�õ � öLõ ù §��Oõ � �Oõ ù �:� ���D�`�@�<�8��©ªâ����3�:� �Â�È���@�3�@ :��ÿ��3� «5�ß÷3ù ,h1H® 1�âü �ô÷ � 1ß���D�b� � �:�F� �@�b� �����3���3�î���|���D�Ô¢3�D���-�/ä/¢B�@�<�%�L笫5�­�r�-���D�Ô�Â�È���@�3�@ :�Íÿ'������@�b�È�Â� «5�5�Ð¥ªüOö<÷ $CüX§böX¥ªüOöµüX§ÈöX¥�÷ $CüOö/üX§/® (°�3�:�d� �b�Ü�B���h�@�<�8��ü $º÷3ùF,�1H®�(°�D�b�5��⵩��<�5è

m.�Ó¥�÷B§'é������ �����

* ÷ � * �ù *wû ÷FûVüüD$ �ù üÖû ÷Fû 1ü ÷ � 1�$� ��ã@� �_�b�5�b�D�����@��� ©��BèÜ���D� � kZl �:�E

�Ó¥�÷H§'é��� ��

÷ *0ûC÷%ûVüüD$Þ÷ ühûC÷%û 1* N@2<SHK>JLPQ7G98K $ºò

Page 56: All of Statistics: A Concise Course in Statistical Inference Brief

ðh. ��������� ����������������� ����������� �! "

* ü*

ü

¥a*Bö<÷B§

¥�÷rö *c§

f-SH7:9Q7:9¤28SHKÖ=µ4@9LK *0ûC÷%ûVü T* ü*

ü

¥;ü@ö<÷0$Cü §

¥�÷�$ÛüOö/üX§

f-SH7G9W7G9¤2<SHKÖ=/4@98K ü � ÷%û 1 T

� 7:IOÑBJ8K # T ðB� f-SHK^98Kb2 � l�NOJ-K � 4@EF]B[:K # T . $ T

��� �Ö� �E¨,��+"+.µ,�M R�� � � ¨"+)+.���WK>=µ4@[G[�2<SH462h4�]BJ8NOiD46iH7:[G7G2­jîEFK/4@98ÑHJLK # 7:9ÖAHK �D?HK/A�NO?�4 4) �DK>[:A��N@lZ4�9<4@E�)

]H[GK`98]H4@=/K � T Ï J<4@?BAHNOE a@46J87m46iH[:K � 7G9%4 �|�½� `7[�}X�H����� EF4@] � �W�~� � T�îK/4@98ÑHJ84@iH[GK^E0Kµ46?H9¤2<SD4�2µÒHl�NOJ-KbaOK/J;j õ Ò óµ¦Û�Ô��¥m¦Z§ZûÛõ¡ý ! �kT��� � � ��î,>¨&Ü,ï.7H$ H

ü Tï1¬SBNXPì2<SD4�2# ¥���é õ §Üé mF¥�õ � § $ mF¥�õ � §

4@?BAmF¥�õ ù § $ mF¥�õ � §'é # ¥�� ûÛõ ù § $ # ¥�� ûÛõ � §%$

1 T1�¡K>2 � i Kß98ÑB=5Sñ2<SD4�2 # ¥�� é 1O§ßé # ¥�� é # §ßé ü-,¬ü0* 4@?BA # ¥�� é O§�é � ,¬üI* T 7'[:N@2Ô28SHK iTk�l m T �ï9LK m 28N �D?HA # ¥ 1 � � û . $ � § 4@?HA# ¥ 1ÔûÛ� û . $ � § T

# T17'J8N a@K��¡K/E0EF4 # T ü T. T1�¡K>2 � SD4µaOK^]HJLNOiD4@iH7G[:7R2­jFABK/?H9L7G2­jkl�ÑH?H=b2<7GNO?E

o_¥�õ §Üé�� � ü-,2. *�� õ�� ü# , � # � õ�� * N@28SHK/J;PQ7:9LK $

Page 57: All of Statistics: A Concise Course in Statistical Inference Brief

����� ���N�� �� � �c� "_ " ð ¥ 4 § � 7:?HAk28SHKÖ=/ÑHEÍÑB[m46287GaOK¨AB7:9L28J87GiHÑB2<7GNO?�l�ÑH?B=>2<7GNO?�N@l � T¥ i § �¡K>2 �Åé�ü-,�� T � 7:?BA`2<SHKw]BJ8NOiD46iH7:[G7G2­jkAHK>?H987R2­j`l�ÑH?B=>2<7GNO?

E� ¥�÷H§ l�NOJ � T

Uï7G?c2 � NO?H9L7:AHK>J-2<SHJLK/KÖ=µ4@9LK/9 � ��ûC÷%û �� Ò �� ûC÷%ûVü 4@?BA ÷ �Vü T

T1�¡K>2 � 46?HA � irKFAB7:98=>J8Kb2<K0J<4@?HABNOE�a64@J87:4@iH[GK/9/T�1¬SHN P 28SD462 � 4@?HA � 4@JLK7:?BAHK/]rK/?HABK/?c2d7Rl�4@?HA NO?H[Gj�7Gl

Eo � �Z¥�õ#ö<÷B§'é

Eo¨¥�õ §

E�Ó¥�÷B§ l�NOJQ4@[G[ õ 46?HA ÷ T

ð T1�¡K>2 � SD4µa@KWAB7:9L28J87GiHÑB2<7GNO? m 4@?HAwAHK>?H987R2­jÍl�ÑH?H=b2<7:N@?E4@?HAÔ[:Kb2 �ÛirKQ4_98ÑHiH9LK>2

N@l�2<SHKÖJLKµ4@[¡[G7:?HK6T �¡K>2 ! � ¥�õ § irK_2<SHKÖ7:?BAH7:=/462<NOJÓl�ÑH?H=>287:NO?kl�NOJ � �! � ¥�õ §Üé�� ü õ ! �* õ�,! � $�¡K>2 � é ! � ¥��ߧ T � 7:?HA 4@?îK � ]HJLK/989L7:NO?îl�NOJ-2<SHK^=/ÑHEÐÑH[m46287Ga@K^AB7:9L28J87GiHÑB2<7GNO?kN@l

� T ¥ Uï7:?c2 � �DJ89;2 �H?HA 28SHKÖ]HJ8N@iD4@iH7G[:7G2­jFEF4@989¤l�ÑH?B=>2<7GNO?kl�NOJ � T §$ T1�¡K>2 � 4@?BA � irKZ7G?HAHK>] K>?HAHK/?c2á4@?HAÐ9LÑH]H]rNO98KÓ2<SD462�Kµ46=5SÍSD4@9�4 �ï?B7Gl�NOJLE ¥ *Böµü §

AH7G9L2<JL7:iHѬ2<7:N@?°T �¡K>2 � é E07:? ó �Mö �Ôý T � 7:?HA%28SHKÖAHK/?B987G2­jE�¥��c§ l�NOJ � TÜUï7:?c2 �

� 2WEF7GIOSc2-irK^K/4@987GK/J¤28N �HJ89L2 �D?BA # ¥ � � �O§ T� T1�¡K>2 � SD4µaOK^=/ABl m T � 7:?HAk2<SBK^=>ABláN6l � � é EF4 �åó/*BöL�ºý T T1�¡K>2 ��� ��� ] ¥ � § T � 7G?HA m�¥�õ § 46?HA m � � ¥��O§ Tü0* T �¡K>2 � 4@?BA � i K¨7G?HAHK>] K>?HAHK/?c2/TZ13SHN Pì28SD462 �æ¥��g§ 7:9Ó7:?HAHK>] K>?HAHK>?½2WN@l �#¥ �Ч

PQSHK>J8K � 4@?HA � 46J8K¨l�ÑH?H=b2<7:N@?H9/TüOü TW1¬ÑH]B] NO9LK¨PZK_28NO989W4Í=/N@7:?kNO?H=>KÖ4@?HA�[:Kb2 D i K_28SHK¨]HJLNOiD4@iH7G[:7R2­jÔN@l�SBKµ4@AH9>T"�#Kb2

� AHK/?BN@2<KÖ2<SHKÖ?½ÑHEÍirK/JQN@l�SHK/4@AH9W4@?HA [:Kb2 � ABK/?HN@28K¨2<SHKÖ?½ÑHEÍirK/JQN@l�254@7G[:9>T¥ 4 § 7ÜJLN aOKÖ2<SD462 � 4@?HA � 4@JLK^ABK/]rK/?HAHK>?c2µT¥ i § �#Kb2 � � 7�N@7:989LNO? ¥ � § 4@?BA�9LÑH]H]rNO98KwPÓK028NO989^4k=/NO7G? � 287:E0K/9>T �¡K>2 �4@?HA � irKî2<SBK`?½ÑHEÍirK/JFN@l¨SHK/4@AH9k4@?BA 254@7G[:9>T 1¬SHN P 2<SD462 � 4@?BA � 4@JLK7:?BAHK/]rK/?HABK/?c2µT

ü-1 T 7'J8N aOK^f-SHK>NOJ8K>E # T # # T

Page 58: All of Statistics: A Concise Course in Statistical Inference Brief

ðOð ��������� ����������������� ����������� �! "

ü # T1�¡K>2 �����º¥a*BöµüX§ 46?HAî[GK>2 �é�� o T¥ 4 § � 7:?BA%2<SHKÖ]rABl�l�NOJ � T"7Ü[GN@2-7R2µT

¥ i §+¥��æuBÀ^ÊcÆ tªo6»0×½Ë3Ê@o6»Âs·Àdo�q tbx·§ �¨K/?HK>J<4628Kg4Þa@K/=b2<NOJ õ é ¥�õ � öI$I$I$bö8õ � � � � � � §=>NO?H987G9L287:?HIïN@l ü0* Ò *�*�* J<4@?BAHNOEô9L2<4@?HAD4@JLA�'ïNOJLE�46[:9/T �¡K>2 ÷Ôé ¥�÷ � ö7$I$I$/ö8÷ � � � � � � §PQSHK>J8K ÷ ! é � �CJ T � J<4µP 4^SH7:9;2<NOIOJ84@E N@l ÷ 4@?HA�=/NOE0]D4@JLKd7G2 2<Nh2<SHK � kZl jONOÑl�NOÑB?HA�7:?�]D4@J;2 ¥ 4 § T

üI. T1�¡K>2 ¥��Mö �Ч i K-ÑB?H7Gl�NOJLEF[Rj^AH7:9;2<J87GiHÑB28K/AÔNO?w2<SHK-ÑB?H7G2ÜAH7G98= ó¬¥�õ#ö<÷H§Z�¨õ ù úk÷ ù ûü@ý T(�¡Kb2�� é � � ù ú � ù T � 7:?HAk2<SBK^=>ABl'46?HA�] ABl�N@l��ÍT

ü T ¥ 8 ÆcqOs ÁXo6»mÇL´c¶^»m´½qX¿cu¬À qcÆcÀhÄ@o6» 9Oo6qXo�»�´ tªuc»­xÙ§ �¡K>2 � SH4 a@K 4 =/N@?½287:?½ÑHNOÑB9/Ò9;2<J87G=>28[GjÖ7:?B=/J8K/4@987G?HI iTk�lnm T �¡K>2 �é m�¥��ߧ T � 7G?HAh28SHKZAHK>?H987R2­jÐN6l � T�f-SH7:97G9 =/4@[:[GK/Aw2<SBKï]HJLNOiD4@iB7:[:7R2­jÍ7G?½28K/IOJ84@[H2<J84@?H9;l�NOJ8E T 'WNXPÛ[:K>2�� � �W?H7Gl�N@J8E ¥ *¬öµüX§4@?BAß[GK>2 � é m � � ¥ � § TÔ1¬SHN P2<SD4�2 � �Xm T�'WN P PQJ87R2<KÔ4%]HJLNOIOJ84@E 28SD4622<4@\@K>9 �ï?H7Rl�NOJ8E ¥ * Ò üX§ J<4@?HABNOE¼a@46J87m46iH[:K>9d4@?BA`IOK>?HK/J8462<K>9_J<4@?BAHNOE¼a64@J87:4@iH[:K>9l�JLNOE 4@? ��� ] N@?HK/?c2<7:4@[ ¥ � § AH7G9L28J87:iBÑB2<7GNO?°T

ü ð T1�¡K>2 ��� 7�NO7:9L98NO? ¥ � § 4@?HA �@� 7�NO7G989LNO? ¥ � § 4@?HA04@9L98ÑHE0KZ2<SH462 � 4@?HA � 46J8K7G?HAHK/]rK/?BAHK/?c2µTÍ1¬SHN P2<SD462d2<SHKwAH7G9L28J87:iBÑB2<7GNO?`N@l � IO7RaOK>?g28SD462 �¼ú ��é�A7G9QÕZ7:?HN@EF7:4@[ ¥aA�ö ��§ PQSHK>J8K �+é � ,B¥ � ú � § TUW7:?c2 üO� � NOÑgEF4µjMÑH9LK028SHKÍl�NO[:[GN PQ7:?HI�l�46=>2 � � l � � 7�NO7G9898N@? ¥ � § 4@?HA � �7�NO7:9L98NO? ¥ � § ÒB4@?HA � 4@?HA � 46J8K_7G?HAHK/]rK/?BAHK/?c2µÒ¬2<SHK>? � ú � � 7�NO7:9L98NO? ¥ � ú� § TUW7:?c2 1¬� 'ïN62<K¨2<SD4�2 ó � é õ#öW��ú ��é A�ýÖéVó � éCõ#ö �é A $�õ¡ý T

ü $ T1�¡K>2 Eo�� �Ó¥�õ#ö<÷B§'é ��� ¥�õwú�÷ ù §3*0ûÛõ`ûVü 46?HA *0ûC÷%ûVü

* N@2<SBK/JLPQ7G98K $� 7G?HA � � � � �ù ���é �ù � T

ü � T1�¡K>2 � � �º¥ # öµü ðO§ T 1¬N@[GaOK�2<SBKîl�NO[G[:N PQ7:?HI+ÑB987:?BI�2<SBK 'WNOJ8EF4@[¤2546iH[:K 4@?HAÑH9L7:?HI04w=/N@EF]HѬ2<K/JQ]D46=5\646IOK@T¥ 4 § � 7:?HA # ¥�� � $ § T

Page 59: All of Statistics: A Concise Course in Statistical Inference Brief

����� ���N�� �� � �c� "_ " ð $¥ i § � 7G?HA # ¥�� � $�1O§ T¥ = § � 7G?HA õ 9LÑH=5Sî28SD462 # ¥�� � õ §Üé $ * T¥ A § � 7G?HA # ¥ *ÔûÛ� �N.½§ T¥ K § � 7G?HA õ 9LÑH=5Sî28SD462 # ¥ � � � � � õ��³§'é $ * T

ü T 7'J8N aOKÖl�NOJLEÍÑH[m4 ¥ # T üOüX§ T1h* T �¡K>2 �`ö � � �ï?H7Rl ¥a*BöµüX§ irKÍ7:?HABK/]rK/?HAHK>?c2µT � 7:?BA`2<SHK � kZl l�NOJ � $ � 4@?HA

� , � T1¬ü T �¡K>2 � � öI$I$I$>ö8� � � ��� ] ¥ � § i K # # k T �¡Kb2 � é E�4 �åó � � ö7$I$I$/öL� � ý T � 7G?HA2<SBK � kZl N@l � TÜUW7:?c2 � � ûC÷ 7Rl�46?HAîN@?H[Gj�7Gl � ! û ÷ l�NOJ � é�üOöI$I$7$>ö A T1�1 T �¡K>2 � 46?HA � i K^J<46?HAHNOE�a64@JL7m4@iB[:K/9>T-1¬ÑB]H] N@98KÖ2<SD462�� ¥ � � �ߧ éñ� T-1¬SBNXP

2<SH462 �°u6ÁH¥��Mö �ͧ'é��0¥��ߧ T1 # T �¡K>2 ��� �ï?H7Rl�NOJ8E ¥a*BöµüX§ T �¡K>2 *��� � � �ìü T(�¡K>2

�é � ü *�� õ����* N@2<SBK/JLPQ7G98K

4@?HA [:Kb2� é � ü ��õ�� ü

* N@28SHK/J;PQ7:9LK¥ 4 § Ï J8K � 4@?HA � 7:?HABK/]rK/?HAHK>?c2Èe � Scj�� � S½j�?BN@2Èe¥ i §Í¥;ü0* ] NO7G?c2<9 § � 7:?HA�� ¥ � � �d§ TÖUï7G?c2 � � SD462_a64@[GÑHK/9 � =/4@? � 254@\6K e 'WNXP�D?HA�� ¥ � � � é �c§ T

Page 60: All of Statistics: A Concise Course in Statistical Inference Brief
Page 61: All of Statistics: A Concise Course in Statistical Inference Brief

� � ��� ���� � � � �� ����������������� ���! �"$#&%(')%+*-,/. ,/01' 23'4.65�,/7 8�':9;*-'4<6=-"

>&?A@B@DCFEG@-HJILKMILNPORQTSUOWVYX/@ZKWQ\[]OW^_K`VLKMQAaAORXcbMKWVdNeKWfAge@ihjNPkYIL?A@:K�bW@-VdKWlR@:bMKWgemn@OW^oh�p+>&?A@i^qORVdXrKWgsan@JtuQANvILNPORQwNek]KWk&^qORgPgeO�x]k-py+zJ{G|R} ~L} �n|:�u�������u�4�����+���W�������A�u�U�o�A�����4�����u�(�4���i���M D�¡��¢G�����£�W�)�¥¤h ¦e§4¨ �q©;ªs� ¨r« �`¬D�

­$Sqh®[+¯±°³²Y´�µ`S¶²£[+¯³·³¸1¹ ²£º»Sq²£[ NP^]hjNPk¼anNekdHJVd@DIL@½ ²£º(Sq²£[¾´R² NP^]hjNPk¼HJORQ�ILNPQ�mnORmAk SU¿npPÀÁ[ §Ã§DÄFÅB¦ ª�Æ « �  «Ç« �u� §DÄ�ÅÉÈ ��� ¦ ª « �ÊÆW� ÂWË Ì ¦e§ÎÍ � ˶ËÐÏ ¨ �q©;ªs� ¨�Ñ`Ò � Ä�§ � « �u�(¤Z� Ë¶Ë � Í Ï¦ ª�Ærªs� «  «Ó¦ ��ª « � ¨ �Dªs� « � « �u�B�ÕÔJÖu�Ã× « � ¨rØ ÂWË Ä �:�¥¤ hTÙ­ÎSqh®[+¯Ú­Çhɯ۰ܲδ�µrSq²£[+¯1Ý!¯ÚÝsÞYß SU¿npáàR[>&?A@4@âCnEG@-HDIÃKMIdNeORQãNekäKåORQA@âæçQ�mAX:fG@-VäkdmAX/XrKWVÕè`OW^»Id?A@�anNekÕIdVdNPfAmnILNPORQ�p;>&?ANPQAéêOW^­iS¶h®[ëKWk]IL?A@)KZbR@JVLKWlW@)bMKWgPmA@$èRORmãxìORmngeaãORfnIÃKMNeQwNP^oèRORmãH-OWXrEAmFIL@-aãIL?A@iQ�mAX/@-VÕNeHZKMgKZbR@-VdKWlR@îíðïuñ ¸1òóÐô ñ h ó OM^YK�geKWVdlW@rQ�mAX:fG@-VBOM^¼õÓõ¶ö÷aAVdK�x]k:h ñâø ßZßZß ø h ò p6>&?A@ê^UKWHDIIL?uK�I¼­ÎSqh®[ìùí ïuñ ¸ òó�ô ñ h ó NPkÎKWHDILmuKMgegPèêXrORVÕ@)Id?uKWQúK`?A@-mnVdNek¾ILNPHWûëNvIYNPkYKåIL?A@JORVd@JXHZKWgPge@Ja�IL?A@úgüKZxcOW^igüKMVdlR@!Q�mnX:f£@JVdkêId?uKMI`x_@�x]NegPg¼anNekdHJmAkdkîgeKMIL@JV-p±>&?n@�QAOWILKMILNPORQ

ýRþ

Page 62: All of Statistics: A Concise Course in Statistical Inference Brief

��� ������� ��������������������������

½ ²Y´�µ`S¶²£[oaA@Jkd@JVÕbR@Jk;kdORX/@ëH-OWXrX/@-Q�IZp ��@ëmAkd@&NvI»X/@-VÕ@-gvè4KMk+KYH-OWQ�bW@-QANP@-Q�IÇmAQANv^¶èFNPQAlQAOWILKMILNPORQ6kdOwx_@raAOWQ"! Ii?AK�bW@/IdOãx]VÕNPId@ ¸ ¹ ²£º»Sq²£[Y^qORV$aANPkdH-VÕ@JId@/VdKWQAaAOWX�bMKWVdNeKWfAge@JkKWQAa ½ ²£º(S¶²£[Õ´R²å^qORV»HJORQ�ILNeQ�mAOWmAk(VLKMQAaAORX�bMKWVdNeKWfAge@Jk�fnmnI(èROWmBkd?AORmngea:fG@&KZxëKMVd@ìId?uKMI½ ²Y´�µ`S¶²£[ì?uKWkäKBEAVd@JH-NPkd@)Xr@-KWQANeQnl:IL?uKMI]NPk&aANekÕH-mAkÕkd@Ja�NeQîVÕ@ZKWgðKWQngPè�kdNPk&H-ORmAVÕkd@Jk-p>�O]@JQAkdmAVÕ@+IL?uKMI�­ÎS¶h®[£Nek�x_@-gPgRaA@DtuQA@-a$#�x_@;kLKZèYId?uKMI�­ÎSqh®[s@âCFNekÕIdk�NP^ ½ ¹ % ² % ´�µðÞ$S¶²£[�&' p�(YIL?A@JVÕx]NPkd@$xì@)kLKZèêId?uKMI]IL?n@)@âCFE£@JHJIÃK�ILNeOWQãanOF@JkäQnOWIä@DCFNek¾IZp

)+*�,.-0/.1 z$�u�3254ð� «(h 687ì@-VÕQAORmAgPgeNUS:9G[-Ñ ���u�Dª ­ÎS¶h®[¼¯ ¸ ñ¹Dô<; ²£º(S¶²£[Y¯cS �>= S¥À�?9G[Õ[A@ÛS¾À = 9\[ǯB9oÑ�C)+*�,.-0/.1 z$�u�3D5E Ë ¦ Ö Â ¤  ¦ �Y×Ã� ¦ ª «UÍ � «U¦¶Å � §-Ñ 4ð� «uh ¬D� « �u�äª ÄFÅ ¬D�D�Y�¥¤_�u�  ¨�§-Ñ ���u�Dªn�­ÎSqh®[]¯ ½ ²G´�µoÞiS¶²£[]¯ ¸ ¹ ²£ºÁÞiS¶²£[¼¯³S �F= º(S � [d[ @ S¾À = º»S¾ÀÁ[Õ[ @ SÓà = º(SÊàR[Õ[]¯S �G= S¥ÀIHM¿�[Õ["@ÛS¥À = S¾ÀIHRàW[d["@1SÊà = S¾ÀIH�¿�[d[+¯�ÀRß"C)+*�,.-0/.1 z$�u� �J4ð� «�hK6MLäQANP^ZSN?)À øPO [-Ñ ���u�Dªn� ­$S¶h®[+¯ ½ ²G´�µðÞiS¶²£[ǯ ½ ²£º�ÞiSq²£[¾´R²w¯ñQ ½SRïuñ ²Y´�²ê¯ À�ÑTC)+*�,.-0/.1 z$�u�VUJW]�Ã× ÂWË¶Ë « �  «  �  ª ¨ � Å Ø Â � ¦  ¬ Ë �ì�  § ÂFX;Â Ä ×Ã��Y ¨W¦e§D« � ¦ ¬ Ä�«Ó¦ ��ª ¦ ¤ ¦¶« �  §¨ �Dª §D¦¶« Y ºÁÞiS¶²£[¼¯[Z]\ÇS¥À�@�²_^â[P` ïuñ Ñba�§D¦ ª�Æ ¦ ª « �ÊÆW�  «Ó¦ ��ª ¬cY4Ö Â � «q§ � ÈÓ§ � «Sd®¯ ²  ª ¨e ¯ÚIÃKWQ ïuñ ² Ì �

° % ² % ´�µrSq²£[;¯ à\ °�f

;²Y´R²Àg@ ² ^ ¯ihá² ILKWQ ïuñ S¶²£[kj f; ? °lf

;ILKWQ ïuñ ²Î´R²î¯ '

§ � « �u� Å �  ª ¨ ��� § ªs� « �ÕÔ ¦e§D«ÊÑnm ¤oY�� Ä�§D¦¶ÅBÄ Ëv « � Â5X;Â Ä ×Ã��Y ¨W¦e§D« � ¦ ¬ Ä�«Ó¦ ��ª Å Â ªpY«U¦¶Å � §  ª ¨/« Ârq � « �u� Â Ø �D�  Æ��Ã��Y�� ÄwÍ+¦ Ë¶Ë § �Ã� « �  «;« �u� Â Ø �D�  Æ��iªs� Ø �D� § � «U« Ë � §4¨ � Í ª Ñ��� ¦e§`¦e§ ¬D�Ã× Â Ä�§ � « �u� X;Â Ä ×Ã��Yú�  §î« � ¦ × q «  ¦ Ë §  ª ¨ �u�Dªs×Ã�ã�ÕÔ « �Ã� Å �!�W¬ § �D� Ø Â «Ó¦ ��ª §Â �L�:×Ã� ÅBÅ ��ª ÑsCtuVdORX QAO�xÜORQ"#Çx]?A@-Qn@JbR@JVrx_@waANekÕH-mAkÕkr@âCnEG@-HDIÃKMIdNeORQnkc#_x_@ãNeX/EAgeNPH-NPIdgPè¡KMkdkdmnXr@IL?AKMI&IL?A@Dèw@DCFNek¾IZpu�@JIwv¯JxnSqh¡[DpSyäOÁx1aAO4x_@$H-OWXrEAmFIL@ä­$Szv4[|{}(ÎQA@¼xëKZèêNPk_IdO4tuQAaãºI~ìS��A[;KWQAaIL?n@-QãH-ORX/EAmnId@$­ÎS�v:[;¯ ½ �AºI~ÇS��n[Õ´.�£p�7ìmnI&IL?n@-Vd@)Nek]KMQã@-KWkdNP@-V&xìK�èWp

Theorem 4.6 (The Rule of the Lazy Statistician) Let Y = r(X). Then

E(Y) = E(r(X)) = ∫ r(x) dF_X(x).     (4.3)

This result makes intuitive sense. Think of playing a game where we draw X at random and then I pay you Y = r(X). Your average income is r(x) times the chance that X = x, summed (or integrated) over all values of x. Here is a special case. Let A be an event and let r(x) = I_A(x) where I_A(x) = 1 if x ∈ A and I_A(x) = 0 if x ∉ A. Then

E(I_A(X)) = ∫ I_A(x) f_X(x) dx = ∫_A f_X(x) dx = P(X ∈ A).

In other words, probability is a special case of expectation.

Example 4.7 Let X ~ Unif(0, 1). Let Y = r(X) = e^X. Then,

E(Y) = ∫_0^1 e^x f(x) dx = ∫_0^1 e^x dx = e − 1.

Alternatively, you could find f_Y(y), which turns out to be f_Y(y) = 1/y for 1 < y < e. Then, E(Y) = ∫_1^e y f(y) dy = e − 1.

Example 4.8 Take a stick of unit length and break it at random. Let Y be the length of the longer piece. What is the mean of Y? If X is the break point then X ~ Unif(0, 1) and Y = r(X) = max{X, 1 − X}. Thus, r(x) = 1 − x when 0 < x < 1/2 and r(x) = x when 1/2 ≤ x < 1. Hence,

E(Y) = ∫ r(x) dF(x) = ∫_0^{1/2} (1 − x) dx + ∫_{1/2}^1 x dx = 3/4.

Functions of several variables are handled in a similar way. If Z = r(X, Y) then

E(Z) = E(r(X, Y)) = ∫∫ r(x, y) dF(x, y).     (4.4)

Example 4.9 Let (X, Y) have a jointly uniform distribution on the unit square. Let Z = r(X, Y) = X² + Y². Then,

E(Z) = ∫∫ r(x, y) dF(x, y) = ∫_0^1 ∫_0^1 (x² + y²) dx dy = ∫_0^1 x² dx + ∫_0^1 y² dy = 2/3.
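A short simulation sketch (assuming Python with NumPy is available) that checks Example 4.9 numerically, using the rule of the lazy statistician: average the values of r(x, y) = x² + y² over many uniform draws from the unit square.

    import numpy as np

    # Monte Carlo check of Example 4.9: for (X, Y) uniform on the unit square
    # and Z = X^2 + Y^2, the rule of the lazy statistician gives E(Z) = 2/3.
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=1_000_000)
    y = rng.uniform(0.0, 1.0, size=1_000_000)
    print(np.mean(x**2 + y**2))   # close to 0.6667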

The kth moment of X is defined to be E(X^k), assuming that E(|X|^k) < ∞. We shall rarely make much use of moments beyond k = 2.

4.2 Properties of Expectations

Theorem 4.10 If X1, ..., Xn are random variables and a1, ..., an are constants, then

E(Σ_i a_i Xi) = Σ_i a_i E(Xi).     (4.5)

Example 4.11 Let X ~ Binomial(n, p). What is the mean of X? We could try to appeal to the definition:

E(X) = ∫ x dF_X(x) = Σ_x x f(x) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^{n−x}

but this is not an easy sum to evaluate. Instead, note that X = Σ_{i=1}^n Xi where Xi = 1 if the ith toss is heads and Xi = 0 otherwise. Then E(Xi) = (p × 1) + ((1 − p) × 0) = p and E(X) = E(Σ_i Xi) = Σ_i E(Xi) = np.

Theorem 4.12 Let X1, ..., Xn be independent random variables. Then,

E(Π_{i=1}^n Xi) = Π_i E(Xi).     (4.6)

Notice that the summation rule does not require independence but the multiplication rule does.

4.3 Variance and Covariance

The variance measures the "spread" of a distribution. (We can't use E(X − µ) as a measure of spread since E(X − µ) = E(X) − µ = µ − µ = 0. We can, and sometimes do, use E|X − µ| as a measure of spread, but more often we use the variance.)

Definition 4.13 Let X be a random variable with mean µ. The variance of X, denoted by σ², σ²_X, or V(X), is defined by

σ² = E(X − µ)² = ∫ (x − µ)² dF(x)     (4.7)

assuming this expectation exists. The standard deviation is sd(X) = √V(X) and is also denoted by σ and σ_X.

Theorem 4.14 Assuming the variance is well defined, it has the following properties:
1. V(X) = E(X²) − µ².
2. If a and b are constants then V(aX + b) = a² V(X).
3. If X1, ..., Xn are independent and a1, ..., an are constants, then

V(Σ_{i=1}^n a_i Xi) = Σ_{i=1}^n a_i² V(Xi).     (4.8)

Example 4.15 Let X ~ Binomial(n, p). We write X = Σ_i Xi where Xi = 1 if toss i is heads and Xi = 0 otherwise, and the random variables X1, ..., Xn are independent. Also, P(Xi = 1) = p and P(Xi = 0) = 1 − p. Recall that E(Xi) = (p × 1) + ((1 − p) × 0) = p. Now, E(Xi²) = (p × 1²) + ((1 − p) × 0²) = p. Therefore, V(Xi) = E(Xi²) − p² = p − p² = p(1 − p). Finally, V(X) = V(Σ_i Xi) = Σ_i V(Xi) = Σ_i p(1 − p) = np(1 − p). Notice that V(X) = 0 if p = 1 or p = 0. Make sure you see why this makes intuitive sense.

If X1, ..., Xn are random variables then we define the sample mean to be

X̄n = (1/n) Σ_{i=1}^n Xi     (4.9)

and the sample variance to be

S²n = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².     (4.10)

Theorem 4.16 Let X1, ..., Xn be iid and let µ = E(Xi), σ² = V(Xi). Then

E(X̄n) = µ,  V(X̄n) = σ²/n  and  E(S²n) = σ².

If X and Y are random variables, then the covariance and correlation between X and Y measure how strong the linear relationship is between X and Y.

Definition 4.17 Let X and Y be random variables with means µ_X and µ_Y and standard deviations σ_X and σ_Y. Define the covariance between X and Y by

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]     (4.11)

and the correlation by

ρ = ρ_{X,Y} = ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y).     (4.12)

Theorem 4.18 The covariance satisfies

Cov(X, Y) = E(XY) − E(X)E(Y).

The correlation satisfies −1 ≤ ρ(X, Y) ≤ 1. If Y = a + bX for some constants a and b then ρ(X, Y) = 1 if b > 0 and ρ(X, Y) = −1 if b < 0. If X and Y are independent, then Cov(X, Y) = ρ = 0. The converse is not true in general.

Theorem 4.19 V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y) and V(X − Y) = V(X) + V(Y) − 2 Cov(X, Y). More generally, for random variables X1, ..., Xn,

V(Σ_i a_i Xi) = Σ_i a_i² V(Xi) + 2 ΣΣ_{i<j} a_i a_j Cov(Xi, Xj).

4.4 Expectation and Variance of Important Random Variables

Here we record the expectation and variance of some important random variables.

Distribution                  Mean            Variance
Point mass at a               a               0
Bernoulli(p)                  p               p(1 − p)
Binomial(n, p)                np              np(1 − p)
Geometric(p)                  1/p             (1 − p)/p²
Poisson(λ)                    λ               λ
Uniform(a, b)                 (a + b)/2       (b − a)²/12
Normal(µ, σ²)                 µ               σ²
Exponential(β)                β               β²
Gamma(α, β)                   αβ              αβ²
Beta(α, β)                    α/(α + β)       αβ/((α + β)²(α + β + 1))
t_ν                           0 (if ν > 1)    ν/(ν − 2) (if ν > 2)
χ²_p                          p               2p
Multinomial(n, p)             np              see below
Multivariate Normal(µ, Σ)     µ               Σ

We derived E(X) and V(X) for the binomial in the previous section. The calculations for some of the others are in the exercises.

The last two entries in the table are multivariate models which involve a random vector X of the form X = (X1, ..., Xk)ᵀ. The mean of a random vector X is defined by

µ = (µ1, ..., µk)ᵀ = (E(X1), ..., E(Xk))ᵀ.

The variance-covariance matrix Σ is defined to be the k × k matrix V(X) whose (i, j) entry is Cov(Xi, Xj); in particular, the diagonal entries are the variances V(X1), ..., V(Xk).

If X ~ Multinomial(n, p) then E(X) = np = n(p1, ..., pk) and V(X) is the matrix with diagonal entries n p_i(1 − p_i) and off-diagonal entries −n p_i p_j. To see this, note that the marginal distribution of any one component of the vector is binomial, that is, Xi ~ Binomial(n, p_i). Thus, E(Xi) = n p_i and V(Xi) = n p_i(1 − p_i). Note also that Xi + Xj ~ Binomial(n, p_i + p_j), so V(Xi + Xj) = n(p_i + p_j)(1 − [p_i + p_j]). On the other hand, using the formula for the variance of a sum, we have that V(Xi + Xj) = V(Xi) + V(Xj) + 2 Cov(Xi, Xj) = n p_i(1 − p_i) + n p_j(1 − p_j) + 2 Cov(Xi, Xj). If you equate this formula with n(p_i + p_j)(1 − [p_i + p_j]) and solve, you get Cov(Xi, Xj) = −n p_i p_j.

Finally, here is a lemma that can be useful for finding means and variances of linear combinations of multivariate random vectors.

Lemma 4.20 If a is a vector and X is a random vector with mean µ and variance Σ, then E(aᵀX) = aᵀµ and V(aᵀX) = aᵀΣa. If A is a matrix then E(AX) = Aµ and V(AX) = AΣAᵀ.

4.5 Conditional Expectation

Suppose that X and Y are random variables. What is the mean of X among those times when Y = y? The answer is that we compute the mean of X as before but we substitute f_{X|Y}(x|y) for f_X(x) in the definition of expectation.

Definition 4.21 The conditional expectation of X given Y = y is

E(X | Y = y) = Σ_x x f_{X|Y}(x|y) in the discrete case, and ∫ x f_{X|Y}(x|y) dx in the continuous case.     (4.13)

If r(x, y) is a function of x and y then

E(r(X, Y) | Y = y) = Σ_x r(x, y) f_{X|Y}(x|y) in the discrete case, and ∫ r(x, y) f_{X|Y}(x|y) dx in the continuous case.     (4.14)

Whereas E(X) is a number, E(X | Y = y) is a function of y. Before we observe Y, we don't know the value of E(X | Y = y), so it is a random variable which we denote E(X | Y). In other words, E(X | Y) is the random variable whose value is E(X | Y = y) when Y = y. Similarly, E(r(X, Y) | Y) is the random variable whose value is E(r(X, Y) | Y = y) when Y = y. This is a very confusing point so let us look at an example.

Example 4.22 Suppose we draw X ~ Unif(0, 1). After we observe X = x, we draw Y | X = x ~ Unif(x, 1). Intuitively, we expect that E(Y | X = x) = (1 + x)/2. In fact, f_{Y|X}(y|x) = 1/(1 − x) for x < y < 1 and

E(Y | X = x) = ∫_x^1 y f_{Y|X}(y|x) dy = (1/(1 − x)) ∫_x^1 y dy = (1 + x)/2

as expected. Thus, E(Y | X) = (1 + X)/2. Notice that E(Y | X) = (1 + X)/2 is a random variable whose value is the number E(Y | X = x) = (1 + x)/2 once X = x is observed.

Theorem 4.23 (The Rule of Iterated Expectations) For random variables X and Y, assuming the expectations exist, we have that

E[E(Y | X)] = E(Y)  and  E[E(X | Y)] = E(X).     (4.15)

More generally, for any function r(x, y) we have

E[E(r(X, Y) | X)] = E(r(X, Y)).     (4.16)

Proof. We'll prove the first equation. Using the definition of conditional expectation and the fact that f(x, y) = f(x) f(y|x),

E[E(Y | X)] = ∫ E(Y | X = x) f_X(x) dx = ∫∫ y f(y|x) dy f(x) dx = ∫∫ y f(y|x) f(x) dx dy = ∫∫ y f(x, y) dx dy = E(Y).

Example 4.24 Consider Example 4.22. How can we compute E(Y)? One method is to find the joint density f(x, y) and then compute E(Y) = ∫∫ y f(x, y) dx dy. An easier way is to do this in two steps. First, we already know that E(Y | X) = (1 + X)/2. Thus,

E(Y) = E[E(Y | X)] = E[(1 + X)/2] = (1 + E(X))/2 = (1 + (1/2))/2 = 3/4.
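A short simulation sketch (assuming Python with NumPy) that checks Example 4.22 and the rule of iterated expectations numerically:

    import numpy as np

    # X ~ Uniform(0,1), Y | X = x ~ Uniform(x, 1), so E(Y | X) = (1 + X)/2
    # and E(Y) = E[E(Y | X)] = E[(1 + X)/2] = 3/4.
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, size=1_000_000)
    y = rng.uniform(x, 1.0)              # draw Y | X = x ~ Uniform(x, 1)
    print(np.mean(y))                    # close to 0.75
    print(np.mean((1.0 + x) / 2.0))      # same limit, via E(Y | X)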

Definition 4.25 The conditional variance is defined as

V(Y | X = x) = ∫ (y − µ(x))² f(y|x) dy     (4.17)

where µ(x) = E(Y | X = x).

Theorem 4.26 For random variables X and Y,

V(Y) = E V(Y | X) + V E(Y | X).

Example 4.27 Draw a county at random from the United States. Then draw n people at random from the county. Let X be the number of those people who have a certain disease. If Q denotes the proportion of people in that county with the disease, then Q is also a random variable since it varies from county to county. Given Q = q, we have that X ~ Binomial(n, q). Thus, E(X | Q = q) = nq and V(X | Q = q) = nq(1 − q). Suppose that the random variable Q has a Uniform(0, 1) distribution. Then, E(X) = E E(X | Q) = E(nQ) = n E(Q) = n/2. Let us compute the variance of X. Now, V(X) = E V(X | Q) + V E(X | Q). Let's compute these two terms. First, E V(X | Q) = E[nQ(1 − Q)] = n E(Q(1 − Q)) = n ∫ q(1 − q) f(q) dq = n ∫_0^1 q(1 − q) dq = n/6. Next, V E(X | Q) = V(nQ) = n² V(Q) = n² ∫ (q − (1/2))² dq = n²/12. Hence, V(X) = (n/6) + (n²/12).
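A quick numerical check of Example 4.27 (a sketch assuming Python with NumPy):

    import numpy as np

    # Q ~ Uniform(0,1) and X | Q = q ~ Binomial(n, q), so
    # E(X) = n/2 and V(X) = n/6 + n^2/12.
    rng = np.random.default_rng(2)
    n, reps = 20, 1_000_000
    q = rng.uniform(0.0, 1.0, size=reps)
    x = rng.binomial(n, q)
    print(np.mean(x), n / 2)                  # both about 10
    print(np.var(x), n / 6 + n**2 / 12)       # both about 36.67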

4.6 Technical Appendix

4.6.1 Expectation as an Integral

The integral of a measurable function r(x) is defined as follows. First suppose that r is simple, meaning that it takes finitely many values a1, ..., ak over a partition A1, ..., Ak. Then ∫ r(x) dF(x) = Σ_{i=1}^k a_i P(r(X) ∈ A_i). The integral of a positive measurable function r is defined by ∫ r(x) dF(x) = lim_i ∫ r_i(x) dF(x) where r_i is a sequence of simple functions such that r_i(x) ≤ r(x) and r_i(x) → r(x) as i → ∞. This does not depend on the particular sequence. The integral of a measurable function r is defined to be ∫ r(x) dF(x) = ∫ r⁺(x) dF(x) − ∫ r⁻(x) dF(x), assuming both integrals are finite, where r⁺(x) = max{r(x), 0} and r⁻(x) = −min{r(x), 0}.

4.6.2 Moment Generating Functions

Definition 4.28 The moment generating function (mgf), or Laplace transform, of X is defined by

ψ_X(t) = E(e^{tX}) = ∫ e^{tx} dF(x)

where t varies over the real numbers.

In what follows, we assume that the mgf is well defined for all t in some small neighborhood of 0. A related function is the characteristic function, defined by E(e^{itX}) where i = √−1; this function is always well defined for all t. The mgf is useful for several reasons. First, it helps us compute the moments of a distribution. Second, it helps us find the distribution of sums of random variables. Third, it is used to prove the central limit theorem which we discuss later.

When the mgf is well defined, it can be shown that we can interchange the operations of differentiation and "taking expectation." This leads to

ψ′(0) = [d/dt E(e^{tX})]_{t=0} = E[d/dt e^{tX}]_{t=0} = E[X e^{tX}]_{t=0} = E(X).

By taking k derivatives we conclude that ψ^{(k)}(0) = E(X^k). This gives us a method for computing the moments of a distribution.

Example 4.29 Let X ~ Exp(1). For any t < 1,

ψ_X(t) = E(e^{tX}) = ∫_0^∞ e^{tx} e^{−x} dx = ∫_0^∞ e^{(t−1)x} dx = 1/(1 − t).

The integral is divergent if t ≥ 1. So ψ_X(t) = 1/(1 − t) for all t < 1. Now, ψ′(0) = 1 and ψ″(0) = 2. Hence, E(X) = 1 and V(X) = E(X²) − µ² = 2 − 1 = 1.

Lemma 4.30 Properties of the mgf.
(1) If Y = aX + b then ψ_Y(t) = e^{bt} ψ_X(at).
(2) If X1, ..., Xn are independent and Y = Σ_i Xi then ψ_Y(t) = Π_i ψ_i(t) where ψ_i is the mgf of Xi.

Example 4.31 Let X ~ Binomial(n, p). As before, we know that X = Σ_i Xi where P(Xi = 1) = p and P(Xi = 0) = 1 − p. Now ψ_i(t) = E(e^{Xi t}) = (p × e^t) + (1 − p) = p e^t + q where q = 1 − p. Thus, ψ_X(t) = Π_i ψ_i(t) = (p e^t + q)^n.

Theorem 4.32 Let X and Y be random variables. If ψ_X(t) = ψ_Y(t) for all t in an open interval around 0, then X and Y have the same distribution.

Example 4.33 Let X1 ~ Binomial(n1, p) and X2 ~ Binomial(n2, p) be independent. Let Y = X1 + X2. Now

ψ_Y(t) = ψ_1(t) ψ_2(t) = (p e^t + q)^{n1} (p e^t + q)^{n2} = (p e^t + q)^{n1+n2}

and we recognize the latter as the mgf of a Binomial(n1 + n2, p) distribution. Since the mgf characterizes the distribution (i.e. there can't be another random variable which has the same mgf), we conclude that Y ~ Binomial(n1 + n2, p).

Moment Generating Functions for Some Common Distributions

Distribution       mgf
Bernoulli(p)       p e^t + (1 − p)
Binomial(n, p)     (p e^t + (1 − p))^n
Poisson(λ)         e^{λ(e^t − 1)}
Normal(µ, σ)       exp{µt + σ²t²/2}
Gamma(α, β)        (1/(1 − βt))^α for t < 1/β
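A small sketch (assuming Python with NumPy) that checks the Binomial entry of this table by Monte Carlo and recovers the mean from a numerical derivative of the mgf at 0:

    import numpy as np

    # psi_X(t) = (p * e^t + 1 - p)^n for X ~ Binomial(n, p).
    rng = np.random.default_rng(3)
    n, p, t = 10, 0.3, 0.5
    x = rng.binomial(n, p, size=1_000_000)
    print(np.mean(np.exp(t * x)))           # Monte Carlo estimate of E(e^{tX})
    print((p * np.exp(t) + 1 - p) ** n)     # closed form, about 5.9

    # psi'(0) = E(X): approximate the derivative at 0 by a central difference.
    def psi(s):
        return (p * np.exp(s) + 1 - p) ** n
    h = 1e-4
    print((psi(h) - psi(-h)) / (2 * h), n * p)   # both about 3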

4.7 Exercises

1. Suppose we play a game where we start with c dollars. On each play of the game you either double or halve your money, with equal probability. What is your expected fortune after n trials?

2. Show that V(X) = 0 if and only if there is a constant c such that P(X = c) = 1.

3. Let X1, ..., Xn ~ Uniform(0, 1) and let Yn = max{X1, ..., Xn}. Find E(Yn).

4. A particle starts at the origin of the real line and moves along the line in jumps of one unit. For each jump the probability is p that the particle will jump one unit to the left and the probability is 1 − p that the particle will jump one unit to the right. Let Xn be the position of the particle after n units. Find E(Xn) and V(Xn). (This is known as a random walk.)

5. A fair coin is tossed until a head is obtained. What is the expected number of tosses that will be required?

6. Prove Theorem 4.6 for discrete random variables.

7. Let X be a continuous random variable with cdf F. Suppose that P(X > 0) = 1 and that E(X) exists. Show that E(X) = ∫_0^∞ P(X > x) dx. Hint: Consider integrating by parts. The following fact is helpful: if E(X) exists then lim_{x→∞} x[1 − F(x)] = 0.

8. Prove Theorem 4.16.

9. (Computer Experiment.) Let X1, X2, ..., Xn be N(0, 1) random variables and let X̄n = (1/n) Σ_{i=1}^n Xi. Plot X̄n versus n for n = 1, ..., 10,000. Repeat for X1, X2, ..., Xn ~ Cauchy. Explain why there is such a difference.

10. Let X ~ N(0, 1) and let Y = e^X. Find E(Y) and V(Y).

11. (Computer Experiment: Simulating the Stock Market.) Let Y1, Y2, ... be independent random variables such that P(Yi = 1) = P(Yi = −1) = 1/2. Let Xn = Σ_{i=1}^n Yi. Think of Yi = 1 as "the stock price increased by one dollar," Yi = −1 as "the stock price decreased by one dollar," and Xn as the value of the stock on day n.
(a) Find E(Xn) and V(Xn).
(b) Simulate Xn and plot Xn versus n for n = 1, 2, ..., 10,000. Repeat the whole simulation several times. Notice two things. First, it's easy to "see" patterns in the sequence even though it is random. Second, you will find that the four runs look very different even though they were generated the same way. How do the calculations in (a) explain the second observation?

12. Prove the formulas given in the table at the beginning of Section 4.4 for the Bernoulli, Poisson, Uniform, Exponential, Gamma and Beta. Here are some hints. For the mean of the Poisson, use the fact that e^a = Σ_{x=0}^∞ a^x/x!. To compute the variance, first compute E(X(X − 1)). For the mean of the Gamma, it will help to multiply and divide by Γ(α + 1)/β^{α+1} and use the fact that a Gamma density integrates to 1. For the Beta, multiply and divide by Γ(α + 1)Γ(β)/Γ(α + β + 1).

13. Suppose we generate a random variable X in the following way. First we flip a fair coin. If the coin is heads, take X to have a Unif(0, 1) distribution. If the coin is tails, take X to have a Unif(3, 4) distribution.
(a) Find the mean of X.
(b) Find the standard deviation of X.

14. Let X1, ..., Xm and Y1, ..., Yn be random variables and let a1, ..., am and b1, ..., bn be constants. Show that

Cov(Σ_{i=1}^m a_i Xi, Σ_{j=1}^n b_j Yj) = Σ_{i=1}^m Σ_{j=1}^n a_i b_j Cov(Xi, Yj).

15. Let

f_{X,Y}(x, y) = (1/3)(x + y) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and 0 otherwise.

Find V(2X − 3Y + 8).

16. Let r(x) be a function of x and let s(y) be a function of y. Show that

E(r(X)s(Y) | X) = r(X) E(s(Y) | X).

Also, show that E(r(X) | X) = r(X).

17. Prove that

V(Y) = E V(Y | X) + V E(Y | X).

Hint: Let m = E(Y) and let b(x) = E(Y | X = x). Note that E(b(X)) = E E(Y | X) = E(Y) = m. Bear in mind that b is a function of x. Now write V(Y) = E(Y − m)² = E((Y − b(X)) + (b(X) − m))². Expand the square and take the expectation. You then have to take the expectation of three terms. In each case, use the rule of the iterated expectation: i.e. E(stuff) = E(E(stuff | X)).

18. Show that if E(X | Y = y) = c for some constant c, then X and Y are uncorrelated.

19. This question is to help you understand the idea of a sampling distribution. Let X1, ..., Xn be iid with mean µ and variance σ². Let X̄n = (1/n) Σ_{i=1}^n Xi. Then X̄n is a statistic, that is, a function of the data. Since X̄n is a random variable, it has a distribution. This distribution is called the sampling distribution of the statistic. Recall from Theorem 4.16 that E(X̄n) = µ and V(X̄n) = σ²/n. Don't confuse the distribution of the data f_X and the distribution of the statistic f_{X̄n}. To make this clear, let X1, ..., Xn ~ Uniform(0, 1). Let f_X be the density of the Uniform(0, 1). Plot f_X. Now find E(X̄n) and V(X̄n) and plot them as a function of n. Comment. Now simulate the distribution of X̄n for n = 1, 5, 25, 100. Check that the simulated values of E(X̄n) and V(X̄n) agree with your theoretical calculations. What do you notice about the sampling distribution of X̄n as n increases?

20. Prove Lemma 4.20.

� � ��� ����

� � ����� �������������

����� !#"%$'&)( !#* + ,�-/.1032546-/.7( 89*/.;:�</!#=?>A@B>C.1D

EGFIHCJLKNMPORQRSTQUHCVWMPXTHYKIVTH[Z\KIO6Z\]^X`_a]^KbFIcIQUFbd/JLKNMPFeSfQgSfQUHAVWSfhIMiS#jkQUd^heS`]PSfhbHCXmlnQRVTH�_oHhNMPXmcpST]'qC]^jkrIKbSTHPs�tuhIH[v3lnQUOROwMiOUVT]W_aH1KIVmHCcpQUF3SThIH;SfhbHC]^XxvY]PZyqA]^Fez^HCXmd^HCFbqCH{lnhIQUq|hQUVucIQRVTqAKIVTVmHCc/QRF3SfhIH{FIH~}�S�q|hNMPr�SfHCXAs��1KbXn�IXTVmS�QRFIHCJLKNMPORQRSGv�QUV���MPXm�P]�zo��V�QUFbHCJLKNMPORQRSGv^s

�����A�e���i���������A�'�i���|�i�L�� {¡£¢��C¤I¥¦�e§�¨ ©«ªL��¬®­y¯[°)± ²[¯�³µ´w¶¦´N·G´w¯�¸e³P°�¹»ºP¯®¼m³P´o½¾¶¦¿º¦³P¼~¹�³e²CÀU¯)³P´o½kÁ[Â�ÃeÃN¶�Á�¯{°\Äb³P°�Å1Æ\±®ÇW¯mÈ^¹UÁ[°\ÁCÉ�Êy¶¦¼{³P´NË{ÌÎÍÐÏLÑ

Ò Æ\±ÓÍÔÌmÇÎÕ Å;Æ\±ÖÇÌ × Æ�Ø sRÙ Ç

Ú;ÛwÜ�ÜÞÝwß Å;Æ\±®Ç�àâá�ãäæåaç Æ å Çxè å àâáÞéäBåaç Æ å Çxè åaê á%ãé åaç Æ å Çxè å ë á�ãé åaç Æ å Çmè å/ëÌ á ãé ç Æ å Çxè å àìÌ Ò Æ»±�ÍíÌmÇ s�î

ï Ø

Page 78: All of Statistics: A Concise Course in Statistical Inference Brief

� ���������� �������������������������!

�n���C�e���¦�����#"Ô�%$Þ���'&Aª� ~���A�L�  7¨ ¢��A¤I¥¦�L§ ¨ ©�ªL��¬ ­9¯[°�( àÅ{Æ»±®Ç ³P´o½�)+*1à-, Æ\±ÖÇCÉ. ÄN¯[´bÑ

Ò Æ%/ ±102(3/ ë ÌmÇuÕ)4*Ì * MPFIc Ò Æ5/76�/ ë98 ÇÎÕ

Ù8 *

Æ«Ø s;: Ç< ÄN¯[¼f¯=6 à Æ\±>0?(�ÇA@B)%ÉDC[´ Ãb³P¼~°�¹FE[Â�Àg³P¼fÑ Ò Æ%/G6�/�Í : Ç`Õ Ù @'H�³P´o½ Ò Æ%/G6�/�ÍI ÇÎÕ Ù @BJNÉ

Ú;ÛwÜ�ÜÞÝaß�K H{KIVmH#��MiXT�P]�zo� V�QRFIHCJLKNMPORQRSGv�Sf] qC]^FbqCOUKbcIH1SfhNMiS

Ò Æ%/ ±L0M(N/ ë ÌmÇ%à Ò Æ%/ ±L0M(N/ * ë Ì * ÇÎÕÅ;Æ\±10M(9ÇO*

Ì * à )4*Ì * ×

tuhIH{VTHAqC]^Fbc�rNMPXxSuZ\]^OUOR]�lnVu_Lv3VTHASmSfQRFId ÌBà 8 ) s6îP�QP�e�SRe§ �UTWVYX9Z Â�ÃeÃN¶�Á�¯ < ¯`° ¯|Á[°Î³WÃo¼f¯T½P¹FE[°�¹�¶¦´ ¿Y¯[°\ÄN¶C½¦Ñγ3´w¯[¾¼T³PÀ9´w¯[°\[?¶¦¼ ¯mÈL³P¿7ÃoÀU¯|Ѷ¦´ ³#Á�¯[° ¶�[N]�´w¯ < ° ¯|Á[°!ET³¦Á�¯|ÁCÉu­9¯[°w±_^wà Ù ¹ [�°»ÄN¯ Ão¼|¯T½P¹FE[°G¶¦¼7¹UÁ < ¼f¶¦´¾¸ ³P´o½;±_^wà Ϲ [)°\ÄN¯nÃo¼f¯T½P¹FE[°G¶¦¼`¹UÁ`¼~¹g¸¦Ä¾° É . ÄN¯[´ ±a` àb]�cWdfe `^hg d ±_^n¹UÁ`°\ÄN¯ ¶P²[Á�¯[¼~ºP¯T½5¯[¼~¼f¶¦¼W¼m³P° ¯CÉi ³jE|Ä�±_^;¿k³PË ²[¯k¼f¯�¸e³P¼m½¾¯T½�³¦ÁY³lk;¯[¼~´w¶¦Â¾À»À£¹ < ¹»°»Ä®Â�´nm¦´w¶ < ´�¿Y¯T³P´poyÉ?q�¯ < ¶¦Â�Àg½À£¹#me¯'°G¶rm¦´w¶ < °\ÄN¯#°«¼~ÂN¯|Ñ1²C¾°nÂ�´nm¦´w¶ < ´�¯[¼~¼f¶¦¼'¼T³P°G¯soyÉ_C~´N°«Â¾¹»°«¹»ºP¯[À£Ë�Ñ < ¯Y¯mÈAÃN¯5E[°u°»Äb³P°±t`pÁ|ÄN¶¦Â�Àg½�²[¯uE[ÀU¶�Á�¯)°G¶�oyÉ�v`¶ < À£¹#me¯[À£Ëk¹UÁ ±t`/°G¶k´w¶¦°u²[¯ < ¹»°»Ä¾¹»´lw�¶�[xozy{q/¯1Äb³PºP¯°»Äb³P°�, Æ ±t`^Ç�à|, Æ\± d Ç}@B]4*6à�oyÆ Ù 0tooÇA@B]µ³P´o½

Ò Æ5/ ±10~o�/¾Í�w|ÇÎÕ ,kÆ ±t`eÇw * à oyÆ Ù 0tooÇ

]+w * Õ ÙH�]4w *

Á[¹»´4E|¯xoyÆ Ù 0~ooÇÎÕ d� [?¶¦¼`³PÀ»À�oyÉ�Êy¶¦¼zw6à × : ³P´o½S]�à Ù Ï^Ï °»ÄN¯W²[¶¦Â¾´o½�¹UÁWÉ����'�j�eÉ î

5.2 Hoeffding's Inequality

Hoeffding's inequality is similar in spirit to Markov's inequality but it is a sharper inequality. We present the result here in two parts. The proofs are in the technical appendix.

Theorem 5.4 Let Y1, ..., Yn be independent observations such that E(Yi) = 0 and a_i ≤ Yi ≤ b_i. Let ε > 0. Then, for any t > 0,

P(Σ_{i=1}^n Yi ≥ ε) ≤ e^{−tε} Π_{i=1}^n e^{t²(b_i − a_i)²/8}.     (5.3)

Theorem 5.5 Let X1, ..., Xn ~ Bernoulli(p). Then, for any ε > 0,

P(|X̄n − p| > ε) ≤ 2 e^{−2nε²}     (5.4)

where X̄n = (1/n) Σ_{i=1}^n Xi.

Example 5.6 Let X1, ..., Xn ~ Bernoulli(p). Let n = 100 and ε = .2. We saw that Chebyshev's inequality yielded

P(|X̄n − p| > ε) ≤ .0625.

According to Hoeffding's inequality,

P(|X̄n − p| > .2) ≤ 2 e^{−2(100)(.2)²} = .00067

which is much smaller than .0625.

Hoeffding's inequality gives us a simple way to create a confidence interval for a binomial parameter p. We will discuss confidence intervals in detail later, but here is the basic idea. Fix α > 0 and let

ε_n = √{(1/(2n)) log(2/α)}.

By Hoeffding's inequality,

P(|X̄n − p| > ε_n) ≤ 2 e^{−2nε_n²} = α.

Let C = (X̄n − ε_n, X̄n + ε_n). Then, P(p ∉ C) = P(|X̄n − p| > ε_n) ≤ α. Hence, P(p ∈ C) ≥ 1 − α; that is, the random interval C traps the true parameter value p with probability 1 − α. We call C a 1 − α confidence interval. More on this later.
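A brief numerical sketch (assuming Python with NumPy) comparing the two bounds from Example 5.6 and computing the half-width ε_n of the Hoeffding confidence interval:

    import numpy as np

    # Bounds on P(|Xbar_n - p| > eps) for Bernoulli data, n = 100, eps = 0.2.
    n, eps = 100, 0.2
    chebyshev = 1.0 / (4 * n * eps**2)           # uses p(1 - p) <= 1/4
    hoeffding = 2.0 * np.exp(-2 * n * eps**2)
    print(chebyshev, hoeffding)                  # 0.0625 vs about 0.00067

    # Half-width of the 1 - alpha Hoeffding confidence interval.
    alpha = 0.05
    eps_n = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    print(eps_n)                                 # about 0.136; C = Xbar_n +/- eps_n

The Hoeffding bound is far tighter here, which is why it is the one used to build the confidence interval.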

5.3 Cauchy-Schwarz and Jensen Inequalities

This section contains two inequalities on expected values that are often useful.

Theorem 5.7 (Cauchy-Schwarz inequality) If X and Y have finite variances then

E|XY| ≤ √{E(X²) E(Y²)}.     (5.5)

Recall that a function g is convex if for each x, y and each α ∈ [0, 1],

g(αx + (1 − α)y) ≤ α g(x) + (1 − α) g(y).

If g is twice differentiable, then convexity reduces to checking that g″(x) ≥ 0 for all x. It can be shown that if g is convex then it lies above any line that touches g at some point, called a tangent line. A function g is concave if −g is convex. Examples of convex functions are g(x) = x² and g(x) = e^x. Examples of concave functions are g(x) = −x² and g(x) = log x.

Theorem 5.8 (Jensen's inequality) If g is convex then

E g(X) ≥ g(E X).     (5.6)

If g is concave then

E g(X) ≤ g(E X).     (5.7)

Proof. Let L(x) = a + bx be a line, tangent to g(x) at the point E(X). Since g is convex, it lies above the line L(x). So,

E g(X) ≥ E L(X) = E(a + bX) = a + b E(X) = L(E(X)) = g(E X).

From Jensen's inequality we see that E(X²) ≥ (E X)² and E(1/X) ≥ 1/E(X). Since log is concave, E(log X) ≤ log E(X). For example, suppose that X ~ N(3, 1). Then E(1/X) ≥ 1/3.

5.4 Technical Appendix: Proof of Hoeffding's Inequality

We will make use of the exact form of Taylor's theorem: if g is a smooth function, then there is a number ξ ∈ (0, u) such that g(u) = g(0) + u g′(0) + (u²/2) g″(ξ).

Proof of Theorem 5.4. For any t > 0, we have, from Markov's inequality, that

P(Σ_i Yi ≥ ε) = P(t Σ_i Yi ≥ tε) = P(e^{t Σ_i Yi} ≥ e^{tε}) ≤ e^{−tε} E(e^{t Σ_i Yi}) = e^{−tε} Π_i E(e^{t Yi}).     (5.8)

Since a_i ≤ Yi ≤ b_i, we can write Yi as a convex combination of a_i and b_i, namely, Yi = α b_i + (1 − α) a_i where α = (Yi − a_i)/(b_i − a_i). So, by the convexity of e^{ty} we have

e^{t Yi} ≤ ((Yi − a_i)/(b_i − a_i)) e^{t b_i} + ((b_i − Yi)/(b_i − a_i)) e^{t a_i}.

Take expectations of both sides and use the fact that E(Yi) = 0 to get

E(e^{t Yi}) ≤ (−a_i/(b_i − a_i)) e^{t b_i} + (b_i/(b_i − a_i)) e^{t a_i} = e^{g(u)}     (5.9)

where u = t(b_i − a_i), g(u) = −γu + log(1 − γ + γe^u) and γ = −a_i/(b_i − a_i). Note that g(0) = g′(0) = 0. Also, g″(u) ≤ 1/4 for all u > 0. By Taylor's theorem, there is a ξ ∈ (0, u) such that

g(u) = g(0) + u g′(0) + (u²/2) g″(ξ) = (u²/2) g″(ξ) ≤ u²/8 = t²(b_i − a_i)²/8.

Hence,

E(e^{t Yi}) ≤ e^{g(u)} ≤ e^{t²(b_i − a_i)²/8}.

The result follows from (5.8).

Proof of Theorem 5.5. Let Yi = (1/n)(Xi − p). Then E(Yi) = 0 and a ≤ Yi ≤ b where a = −p/n and b = (1 − p)/n. Also, (b − a)² = 1/n². Applying the last theorem we get

P(X̄n − p ≥ ε) = P(Σ_i Yi ≥ ε) ≤ e^{−tε} e^{t²/(8n)}.

The above holds for any t > 0. In particular, take t = 4nε and we get P(X̄n − p ≥ ε) ≤ e^{−2nε²}. By a similar argument we can show that P(X̄n − p ≤ −ε) ≤ e^{−2nε²}. Putting these together we get P(|X̄n − p| > ε) ≤ 2 e^{−2nε²}.

tuhIH Mi_a]�z^H�hI]^OUcbV�Z\]^XYMPFev Ì ÍÓÏ s EGFÐrNMPXxSfQUqAKIO�MiX g�SfMP�PH Ì à H�]+w MPFbcÔl6H dPHASÒ Æ ±a` 0 o�Í�w|ÇÎÕ� Bc * ` ! # s 5 v'M;VmQUjkQUOUMPXyMPXTd^Kbj�HAFLSÞl6Hnq?MiFkVmhI]�l SThNMiS�Ò Æ ±t` 0 o��0�w|ÇÎÕ� c * ` ! # s���KbSmSfQRFId'SfhIHAVTH{Sf]^d^H[SfhIHAXul6H)d^HASuÒ 7 / ±t`�0~o�/¾Í?w 9`Õ : c * ` ! # s�î���~� � >?0 =?>C&_�k"�!�� - >?4 � .�� !#"%$�D07F{H[}�qAHCOUORHCFeSyXTH[Z\HCXTHAFIqCH%]^F{rIXm]^_NMP_bQUOUQgSGv�QUFIHAJ¾KIMPOUQgSfQUHAV�MPFbc{SThIHCQRX�KIVTH%QRF1VmS|M¦SfQUVxSfQRqCV

MPFIc rNMiSmSfHCXmF�XmHCqC]Pd^FIQRSTQU]^FpQRV;H[z¾XT]�v^H g��;v� ]PXm� MiFIc ��KId^]PVTQ Æ Ù J�J � Ç s�tuhIHWrbXT]¾]PZB]PZ��]¾H��acIQRFIdI� VuQUFIHAJ¾KIMPOUQgSGvYQRVnZ\Xm]^j�SThNMiSuSTH[}¾S?s����� ��� 4n.1"%4�>?DÞ.1DÙ^s ��HAS ± 3 3Þ}�ro]^FIHCFeSTQ�MPO Æ���Ç s�^ÞQUFbc`Ò Æ%/ ± 0 (��=/ ë98 )��uÇ Z\]^X 8 Í Ù^s��6]Pj�rNMiXTHSThIQUVÎST] SThIH{_a]PKIFIcpv^]^K5d^HASnZ\Xm]^j��6hIHA_ev�VmhIHAzo� V1QUFIHAJLKNMPOUQgSGv^s

:�s ��HAS ± 3 �y]^QUVmVT]^F Æ��aÇ s���VTH��6hbHC_ev¾VThIH[za� V1QUFbHCJLKNMPORQRSGvYST]YVmhI]�lâSfhNM¦S�Ò Æ»± ë: �oÇ6Õ Ù @�� sI s ��HAS ± d�� ×?×?× � ±_`�365 HCXmFI]^KIOROUQ Æ ooÇ MPFbc ±t`)à ]�cWdWe `^ g d ±_^ s 5 ]^KbFIc#Ò Æ%/ ±a`�0o�/bÍ wfÇ KIVmQUFId �6hIHC_ev¾VThIH[zo��V)QUFIHAJLKNMPOUQgSGvpMPFIc�KIVTQRFIdr��]�H �wcIQRFIdI� V7QRFIHCJLKNMPORQRSGvPs

Page 83: All of Statistics: A Concise Course in Statistical Inference Brief

��������� �=���� U� �O n�! J Ù� hI]�l SThNMiS gylnhIHAF ] QUV{OUMPXTd^H g�SfhIHk_o]^KIFIc®Z\XT]^j ��]¾H��acIQUFIdb��V{QUFbHCJLKNMPORQRSGv/QRVVTj�MPOROUHCX�SfhNMPF�SThIH{_a]^KbFIc5Z\Xm]^j��6hbHC_ev¾VThIH[za� V1QUFbHCJLKNMPORQRSGv^s

H s ��HAS ± d�� ×?×C× � ±_` 365 HAXTFI]PKIOUORQ Æ ooÇ sÆ M Ç �9H[S _ ÍÐÏ _aH1�b}�HCc�MPFIcpcIH[�NFIH

w}``à Ù: ] OR]^d c :_edW×

��HAS��of``à9] cWd e `^hg d ±_^ s ;HA�NFIH�� `)à�Æ �of`U02w}` � �of` ê w}`^Ç s ��VTH!��]�H �wcIQRFIdI� VQUFbHCJLKNMPORQRSGvYST] VThI]�l SThNMiSÒ Æ � ` qC]PFLSfMPQUFbV ooÇ ë Ù 0 _ ×

K�HWq?MiOUO � ` M Ù 0 _ E|¶¦´���½¾¯[´4E|¯)¹»´N° ¯[¼~º¦³PÀ�[?¶¦¼�oyÉ EGF�rbXfMPq[SfQUqAH g l6H)SfXmKIFIq?M¦SfHSfhbH)QRFeSfHCXxziMPO�VT]kQRSncb]�HAV�Fb]PSnd^] _aHAOU]�l Ï ]^XnMi_a]�z^H Ù^sÆ _ Ç)Æ $����=Re¥�© �¦� P�QjRP�i��¨ ���i¢�© V�Ç ��HAS?� V�H[}bMPjkQUFIH1SfhbH`rIXm]^roHCXmSTQUHAV�]iZÞSfhbQUVnqC]^F����cIHAFIqCH)QUFeSfHAXmziMPO«s �9H[S _ à Ï × Ï^Ø MiFIc o3à Ï × H s �6]^FIcbKIqAS7MkVmQUj#KbO�MiSTQU]^F3VxSfKIc�vSf]�VTHAHWhI]�l ]PZ»STHCF5SfhIH)QUFeSfHAXmziMPO9qC]PFLSfMPQUFbV o�Æ q?MiOUOUHAc�SfhIH)qC]�z^HAXfMPdPH Ç s ;] SfhIQRVZ\]^XnziMPXTQR]^KIVuziMPOUKIHAV�]iZ ] _oHASGl�HCHAF�Ù#MPFbc Ù Ï^ÏPÏ^Ï s ��OU]iS�SThIH`qA]�z^HCXTMPd^HWzPHCXmVTKIV] sÆ q Ç ��OU]PSySfhbHÎOUHAFIdPSfhW]PZNSfhIH6QUFeSfHAXmziMPOLz^HCXmVTKIV ] s � KbrIra]PVTHÎl�H6lÎMiFLS�SThIH6ORHCFIdiSfh]PZySfhIH{QRFLSTHCXxzPMiOwSf]k_aH1Fb]�jk]^XTH;SThNMPF × ÏeØ s���]�l OUMPXTdPH;VThI]PKIOUc ] _aH��


6 Convergence of Random Variables

6.1 Introduction

The most important aspect of probability theory concerns

the behavior of sequences of random variables. This part of

probability is called large sample theory or limit theory

or asymptotic theory. This material is extremely important

for statistical inference. The basic question is this: what can we

say about the limiting behavior of a sequence of random vari-

ables X1, X2, X3, . . .? Since statistics and data mining are all

about gathering data, we will naturally be interested in what

happens as we gather more and more data.

In calculus we say that a sequence of real numbers xn con-

verges to a limit x if, for every ε > 0, |xn − x| < ε for all

large n. In probability, convergence is more subtle. Going back

to calculus for a moment, suppose that xn = x for all n. Then,

trivially, limn xn = x. Consider a probabilistic version of this

example. Suppose that X1, X2, . . . is a sequence of random vari-

ables which are independent and suppose each has a N(0, 1)

distribution. Since these all have the same distribution, we are


tempted to say that Xn "converges" to X ∼ N(0, 1). But this
can't quite be right since P(Xn = X) = 0 for all n. (Two continuous
random variables are equal with probability zero.)

Here is another example. Consider X1, X2, . . . where Xn ∼ N(0, 1/n).
Intuitively, Xn is very concentrated around 0 for large n. But
P(Xn = 0) = 0 for all n. This chapter develops appropriate methods
of discussing convergence of random variables.

There are two main ideas in this chapter:

1. The law of large numbers says that the sample average
X̄n = (1/n) Σ_{i=1}^n Xi converges in probability to the expectation
µ = E(X).

2. The central limit theorem says that √n (X̄n − µ) converges in
distribution to a Normal(0, σ²) distribution, where σ² = V(X). In
other words, the sample average has approximately a Normal
distribution for large n.

6.2 Types of Convergence

The two main types of convergence are defined as follows.


Definition 6.1 Let X1, X2, . . . be a sequence of random variables
and let X be another random variable. Let Fn denote the cdf of Xn
and let F denote the cdf of X.

1. Xn converges to X in probability, written Xn →P X, if, for
every ε > 0,

P(|Xn − X| > ε) → 0     (6.1)

as n → ∞.

2. Xn converges to X in distribution, written Xn ⇝ X, if

lim_{n→∞} Fn(t) = F(t)     (6.2)

at all t for which F is continuous.

There is another type of convergence which we introduce

mainly because it is useful for proving convergence in proba-

bility.

Definition 6.2 Xn converges to X in quadratic mean (also called
convergence in L2), written Xn →qm X, if

E(Xn − X)² → 0     (6.3)

as n → ∞.

If X is a point mass at c (that is, P(X = c) = 1) we write
Xn →qm c instead of Xn →qm X. Similarly, we write Xn →P c and
Xn ⇝ c.

Example 6.3 Let Xn ∼ N(0, 1/n). Intuitively, Xn is concentrating
at 0 so we would like to say that Xn ⇝ 0. Let's see if this is
true. Let F be the distribution function for a point mass at 0.
Note that √n Xn ∼ N(0, 1). Let Z denote a standard normal random
variable. For t < 0, Fn(t) = P(Xn < t) = P(√n Xn < √n t) =
P(Z < √n t) → 0 since √n t → −∞. For t > 0, Fn(t) = P(Xn < t) =
P(√n Xn < √n t) = P(Z < √n t) → 1 since √n t → ∞. Hence,
Fn(t) → F(t) for all t ≠ 0 and so Xn ⇝ 0. But notice that
Fn(0) = 1/2 ≠ F(0) = 1, so convergence fails at t = 0. That doesn't
matter because t = 0 is not a continuity point of F and the
definition of convergence in distribution only requires convergence
at continuity points. ■
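A short simulation sketch (assuming Python with NumPy) illustrating how Xn ∼ N(0, 1/n) concentrates at 0 as n grows:

    import numpy as np

    # For X_n ~ N(0, 1/n), P(|X_n| > eps) -> 0 as n increases.
    rng = np.random.default_rng(5)
    eps = 0.1
    for n in (1, 10, 100, 1000):
        x = rng.normal(0.0, 1.0 / np.sqrt(n), size=100_000)
        print(n, np.mean(np.abs(x) > eps))   # roughly 0.92, 0.75, 0.32, 0.002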

The next theorem gives the relationship between the types of

convergence. The results are summarized in Figure 6.1.

Theorem 6.4 The following relationships hold:
(a) Xn →qm X implies that Xn →P X.
(b) Xn →P X implies that Xn ⇝ X.
(c) If Xn ⇝ X and if P(X = c) = 1 for some real number c, then
Xn →P X.

In general, none of the reverse implications hold except the
special case in (c).

Proof. We start by proving (a). Suppose that Xn →qm X. Fix ε > 0.
Then, using Chebyshev's inequality,

P(|Xn − X| > ε) = P(|Xn − X|² > ε²) ≤ E|Xn − X|²/ε² → 0.

Proof of (b). This proof is a little more complicated. You may skip
it if you wish. Fix ε > 0 and let x be a continuity point of F.
Then

Fn(x) = P(Xn ≤ x) = P(Xn ≤ x, X ≤ x + ε) + P(Xn ≤ x, X > x + ε)
      ≤ P(X ≤ x + ε) + P(|Xn − X| > ε)
      = F(x + ε) + P(|Xn − X| > ε).

Also,

F(x − ε) = P(X ≤ x − ε) = P(X ≤ x − ε, Xn ≤ x) + P(X ≤ x − ε, Xn > x)


≤ Fn(x) + P(|Xn −X| > ε).

Hence,

F(x − ε) − P(|Xn − X| > ε) ≤ Fn(x) ≤ F(x + ε) + P(|Xn − X| > ε).

Take the limit as n → ∞ to conclude that

F(x − ε) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x + ε).

This holds for all ε > 0. Take the limit as ε → 0 and use the fact
that F is continuous at x to conclude that lim_n Fn(x) = F(x).

Proof of (c). Fix ε > 0. Then,

P(|Xn − c| > ε) = P(Xn < c − ε) + P(Xn > c + ε)
               ≤ P(Xn ≤ c − ε) + P(Xn > c + ε)
               = Fn(c − ε) + 1 − Fn(c + ε)
               → F(c − ε) + 1 − F(c + ε)
               = 0 + 1 − 0 = 0.

Let us now show that the reverse implications do not hold.

Convergence in probability does not imply convergence in quadratic
mean. Let U ∼ Unif(0, 1) and let Xn = √n I(0,1/n)(U). Then
P(|Xn| > ε) = P(√n I(0,1/n)(U) > ε) = P(0 ≤ U < 1/n) = 1/n → 0.
Hence, Xn →P 0. But E(Xn²) = n ∫_0^{1/n} du = 1 for all n, so Xn
does not converge in quadratic mean.

Convergence in distribution does not imply convergence in
probability. Let X ∼ N(0, 1). Let Xn = −X for n = 1, 2, 3, . . .;
hence Xn ∼ N(0, 1). Xn has the same distribution function as X for
all n so, trivially, lim_n Fn(x) = F(x) for all x. Therefore,
Xn ⇝ X. But P(|Xn − X| > ε) = P(|2X| > ε) = P(|X| > ε/2) ≠ 0. So Xn
does not tend to X in probability. ■


FIGURE 6.1. Relationship between the types of convergence: quadratic
mean implies probability, probability implies distribution, and
convergence in distribution to a point mass implies convergence in
probability.

Warning! One might conjecture that if Xn →P b then E(Xn) → b. This
is not true. (We can conclude that E(Xn) → b if Xn is uniformly
integrable; see the technical appendix.) Let Xn be a random variable
defined by P(Xn = n²) = 1/n and P(Xn = 0) = 1 − (1/n). Now,
P(|Xn| < ε) = P(Xn = 0) = 1 − (1/n) → 1. Hence, Xn →P 0. However,
E(Xn) = [n² × (1/n)] + [0 × (1 − (1/n))] = n. Thus, E(Xn) → ∞.

Summary. Stare at Figure 6.1.

Some convergence properties are preserved under transforma-

tions.

Theorem 6.5 Let Xn, X, Yn, Y be random variables. Let g be a
continuous function.
(a) If Xn →P X and Yn →P Y, then Xn + Yn →P X + Y.
(b) If Xn →qm X and Yn →qm Y, then Xn + Yn →qm X + Y.
(c) If Xn ⇝ X and Yn ⇝ c, then Xn + Yn ⇝ X + c.
(d) If Xn →P X and Yn →P Y, then Xn Yn →P XY.
(e) If Xn ⇝ X and Yn ⇝ c, then Xn Yn ⇝ cX.
(f) If Xn →P X then g(Xn) →P g(X).
(g) If Xn ⇝ X then g(Xn) ⇝ g(X).

6.3 The Law of Large Numbers

Now we come to a crowning achievement in probability, the law

of large numbers. This theorem says that the mean of a large

sample is close to the mean of the distribution. For example,

the proportion of heads of a large number of tosses is expected

to be close to 1/2. We now make this more precise.


Let X1, X2, . . . be an iid sample and let µ = E(X1) and σ² = V(X1).
(Note that µ = E(Xi) is the same for all i, so we can define
µ = E(Xi) for any i; by convention, we often write µ = E(X1).)
Recall that the sample mean is defined as X̄n = (1/n) Σ_{i=1}^n Xi
and that E(X̄n) = µ and V(X̄n) = σ²/n.

Theorem 6.6 (The Weak Law of Large Numbers (WLLN)) If X1, . . . , Xn
are iid, then X̄n →P µ.

Interpretation of the WLLN: The distribution of X̄n becomes more
concentrated around µ as n gets large.

Proof. Assume that σ < ∞. This is not necessary but it simplifies
the proof. Using Chebyshev's inequality,

P(|X̄n − µ| > ε) ≤ V(X̄n)/ε² = σ²/(nε²)

which tends to 0 as n → ∞. ■ (There is a stronger theorem in the
appendix called the strong law of large numbers.)

Example 6.7 Consider flipping a coin for which the probability of
heads is p. Let Xi denote the outcome of a single toss (0 or 1).
Hence, p = P(Xi = 1) = E(Xi). The fraction of heads after n tosses
is X̄n. According to the law of large numbers, X̄n converges to p in
probability. This does not mean that X̄n will numerically equal p.
It means that, when n is large, the distribution of X̄n is tightly
concentrated around p. Suppose that p = 1/2. How large should n be
so that P(.4 ≤ X̄n ≤ .6) ≥ .7? First, E(X̄n) = p = 1/2 and
V(X̄n) = σ²/n = p(1 − p)/n = 1/(4n). From Chebyshev's inequality,

P(.4 ≤ X̄n ≤ .6) = P(|X̄n − µ| ≤ .1)
                = 1 − P(|X̄n − µ| > .1)
                ≥ 1 − 1/(4n(.1)²) = 1 − 25/n.

The last expression will be larger than .7 if n = 84. ■
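A short simulation sketch (assuming Python with NumPy) that estimates the probability in Example 6.7 directly; the Chebyshev bound is conservative, so the true probability is well above .7:

    import numpy as np

    # Flip a fair coin n = 84 times and estimate P(0.4 <= Xbar_n <= 0.6).
    rng = np.random.default_rng(6)
    n, reps = 84, 200_000
    xbar = rng.binomial(n, 0.5, size=reps) / n
    print(np.mean((xbar >= 0.4) & (xbar <= 0.6)))   # about 0.94, above the 0.7 bound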


6.4 The Central Limit Theorem

Suppose that X1, . . . , Xn are iid with mean µ and variance σ².
The central limit theorem (CLT) says that X̄n = (1/n) Σ_i Xi has a
distribution which is approximately Normal with mean µ and variance
σ²/n. This is remarkable since nothing is assumed about the
distribution of Xi, except the existence of the mean and variance.

Theorem 6.8 (The Central Limit Theorem (CLT)) Let X1, . . . , Xn be
iid with mean µ and variance σ². Let X̄n = (1/n) Σ_{i=1}^n Xi. Then

Zn ≡ √n (X̄n − µ)/σ ⇝ Z

where Z ∼ N(0, 1). In other words,

lim_{n→∞} P(Zn ≤ z) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx.

Interpretation: Probability statements about X̄n can be approximated
using a Normal distribution. It's the probability statements that we
are approximating, not the random variable itself.

In addition to Zn ⇝ N(0, 1), there are several forms of notation to
denote the fact that the distribution of Zn is converging to a
Normal. They all mean the same thing. Here they are:

Zn ≈ N(0, 1)
X̄n ≈ N(µ, σ²/n)
X̄n − µ ≈ N(0, σ²/n)
√n (X̄n − µ) ≈ N(0, σ²)
√n (X̄n − µ)/σ ≈ N(0, 1).

Example 6.9 Suppose that the number of errors per computer program
has a Poisson distribution with mean 5. We get 125 programs. Let
X1, . . . , X125 be the number of errors in the programs. We want to
approximate P(X̄n < 5.5). Let µ = E(X1) = λ = 5 and
σ² = V(X1) = λ = 5. Then,

P(X̄n < 5.5) = P( √n (X̄n − µ)/σ < √n (5.5 − µ)/σ ) ≈ P(Z < 2.5) = .9938.
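A quick simulation sketch (assuming Python with NumPy) checking the normal approximation in Example 6.9:

    import numpy as np

    # With X_1, ..., X_125 iid Poisson(5), the CLT gives P(Xbar < 5.5) ~= 0.9938.
    # The sum of 125 iid Poisson(5) variables is Poisson(625), so we can
    # simulate the sample mean directly from the sum.
    rng = np.random.default_rng(7)
    xbar = rng.poisson(5 * 125, size=200_000) / 125
    print(np.mean(xbar < 5.5))   # close to 0.9938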

The central limit theorem tells us that Zn = √n (X̄n − µ)/σ is
approximately N(0,1). However, we rarely know σ. We can estimate σ²
from X1, . . . , Xn by

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².

This raises the following question: if we replace σ with Sn is the

central limit theorem still true? The answer is yes.

Theorem 6.10 Assume the same conditions as the CLT. Then,

√n (X̄n − µ)/Sn ⇝ N(0, 1).

You might wonder how accurate the normal approximation is. The
answer is given in the Berry-Esseen theorem.

Theorem 6.11 (Berry-Esseen) Suppose that E|X1|³ < ∞. Then

sup_z |P(Zn ≤ z) − Φ(z)| ≤ (33/4) E|X1 − µ|³ / (√n σ³).     (6.4)

There is also a multivariate version of the central limit theo-

rem.


Theorem 6.12 (Multivariate Central Limit Theorem) Let X1, . . . , Xn
be iid random vectors where

Xi = (X1i, X2i, . . . , Xki)ᵀ

with mean

µ = (µ1, µ2, . . . , µk)ᵀ = (E(X1i), E(X2i), . . . , E(Xki))ᵀ

and variance matrix Σ. Let

X̄ = (X̄1, X̄2, . . . , X̄k)ᵀ

where X̄j = (1/n) Σ_{i=1}^n Xji. Then

√n (X̄ − µ) ⇝ N(0, Σ).

6.5 The Delta Method

If Yn has a limiting Normal distribution then the delta method

allows us to find the limiting distribution of g(Yn) where g is

any smooth function.

Theorem 6.13 (The Delta Method) Suppose that√n(Yn − µ)

σà N(0, 1)

and that g is a differentiable function such that g′(µ) 6= 0. Then√n(g(Yn)− g(µ))|g′(µ)|σ Ã N(0, 1).


In other words,

Yn ≈ N(µ, σ²/n)  implies  g(Yn) ≈ N(g(µ), (g′(µ))² σ²/n).

Example 6.14 Let X1, . . . , Xn be iid with finite mean µ and finite
variance σ². By the central limit theorem, √n (X̄n − µ)/σ ⇝ N(0, 1).
Let Wn = e^{X̄n}. Thus, Wn = g(X̄n) where g(s) = e^s. Since
g′(s) = e^s, the delta method implies that Wn ≈ N(e^µ, e^{2µ} σ²/n). ■
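A short simulation sketch (assuming Python with NumPy) checking the delta-method approximation of Example 6.14 in the special case of Normal data, where X̄n is exactly N(µ, σ²/n):

    import numpy as np

    # Delta method: W_n = exp(Xbar_n) is approximately N(e^mu, e^{2 mu} sigma^2 / n).
    rng = np.random.default_rng(8)
    mu, sigma, n = 1.0, 2.0, 400
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=500_000)
    w = np.exp(xbar)
    print(np.mean(w), np.exp(mu))                       # about 2.73 vs 2.72
    print(np.var(w), np.exp(2 * mu) * sigma**2 / n)     # about 0.075 vs 0.074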

There is also a multivariate version of the delta method.

Theorem 6.15 (The Multivariate Delta Method) Suppose that
Yn = (Yn1, . . . , Ynk) is a sequence of random vectors such that

√n (Yn − µ) ⇝ N(0, Σ).

Let g : Rᵏ → R and let

∇g(y) = (∂g/∂y1, . . . , ∂g/∂yk)ᵀ.

Let ∇µ denote ∇g(y) evaluated at y = µ and assume that the elements
of ∇µ are non-zero. Then

√n (g(Yn) − g(µ)) ⇝ N(0, ∇µᵀ Σ ∇µ).

Example 6.16 Let

(X11, X21)ᵀ, (X12, X22)ᵀ, . . . , (X1n, X2n)ᵀ

be iid random vectors with mean µ = (µ1, µ2)ᵀ and variance Σ. Let

X̄1 = (1/n) Σ_{i=1}^n X1i,   X̄2 = (1/n) Σ_{i=1}^n X2i

and define Yn = X̄1 X̄2. Thus, Yn = g(X̄1, X̄2) where g(s1, s2) = s1 s2.
By the central limit theorem,

√n (X̄1 − µ1, X̄2 − µ2)ᵀ ⇝ N(0, Σ).

Now

∇g(s) = (∂g/∂s1, ∂g/∂s2)ᵀ = (s2, s1)ᵀ

and so

∇µᵀ Σ ∇µ = (µ2  µ1) [ σ11  σ12 ; σ12  σ22 ] (µ2, µ1)ᵀ
         = µ2² σ11 + 2 µ1 µ2 σ12 + µ1² σ22.

Therefore,

√n (X̄1 X̄2 − µ1 µ2) ⇝ N(0, µ2² σ11 + 2 µ1 µ2 σ12 + µ1² σ22). ■

6.6 Bibliographic Remarks

Convergence plays a central role in modern probability the-

ory. For more details, see Grimmett and Stirzaker (1982), Karr

(1993) and Billingsley (1979). Advanced convergence theory is

explained in great detail in van der Vaart and Wellner (1996)

and van der Vaart (1998).

6.7 Technical Appendix

6.7.1 Almost Sure and L1 Convergence

We say that Xn converges almost surely to X, written Xn →as X, if

P({s : Xn(s) → X(s)}) = 1.

We say that Xn converges in L1 to X, written Xn →L1 X, if

E|Xn − X| → 0


as n→∞.

Theorem 6.17 Let Xn and X be random variables. Then:
(a) Xn →as X implies that Xn →P X.
(b) Xn →qm X implies that Xn →L1 X.
(c) Xn →L1 X implies that Xn →P X.

The weak law of large numbers says that X̄n converges to E(X1) in
probability. The strong law asserts that this is also true almost
surely.

Theorem 6.18 (The Strong Law of Large Numbers) Let X1, X2, . . . be
iid. If E|X1| < ∞ and µ = E(X1), then X̄n →as µ.

A sequence Xn is asymptotically uniformly integrable if

lim_{M→∞} lim sup_{n→∞} E(|Xn| I(|Xn| > M)) = 0.

If Xn →P b and Xn is asymptotically uniformly integrable, then
E(Xn) → b.

6.7.2 Proof of the Central Limit Theorem

If X is a random variable, define its moment generating function
(mgf) by ψX(t) = E(e^{tX}). Assume in what follows that the mgf is
finite in a neighborhood around t = 0.

Lemma 6.19 Let Z1, Z2, . . . be a sequence of random variables. Let
ψn be the mgf of Zn. Let Z be another random variable and denote its
mgf by ψ. If ψn(t) → ψ(t) for all t in some open interval around 0,
then Zn ⇝ Z.

Proof of the central limit theorem. Let Yi = (Xi − µ)/σ. Then
Zn = n^{−1/2} Σ_i Yi. Let ψ(t) be the mgf of Yi. The mgf of Σ_i Yi is
(ψ(t))ⁿ and the mgf of Zn is [ψ(t/√n)]ⁿ ≡ ξn(t).


Now ψ′(0) = E(Y1) = 0 and ψ″(0) = E(Y1²) = V(Y1) = 1. So,

ψ(t) = ψ(0) + t ψ′(0) + (t²/2!) ψ″(0) + (t³/3!) ψ‴(0) + · · ·
     = 1 + 0 + t²/2 + (t³/3!) ψ‴(0) + · · ·
     = 1 + t²/2 + (t³/3!) ψ‴(0) + · · ·

Now,

ξn(t) = [ψ(t/√n)]ⁿ
      = [ 1 + t²/(2n) + (t³/(3! n^{3/2})) ψ‴(0) + · · · ]ⁿ
      = [ 1 + ( t²/2 + (t³/(3! n^{1/2})) ψ‴(0) + · · · ) / n ]ⁿ
      → e^{t²/2}

which is the mgf of a N(0,1). The result follows from the previous
lemma. In the last step we used the fact that, if an → a, then

(1 + an/n)ⁿ → e^a.

6.8 Exercises

1. Let X1, . . . , Xn be iid with finite mean µ = E(X1) and finite variance σ² = V(X1). Let X̄n be the sample mean and let Sn² be the sample variance.

(a) Show that E(Sn²) = σ².

(b) Show that Sn² →P σ². Hint: Show that Sn² = cn n⁻¹ Σ_{i=1}^n Xi² − dn X̄n², where cn → 1 and dn → 1. Apply the law of large numbers to n⁻¹ Σ_{i=1}^n Xi² and to X̄n. Then use part (e) of Theorem 6.5.


2. Let X1, X2, . . . be a sequence of random variables. Show that Xn →qm b if and only if

    lim_{n→∞} E(Xn) = b  and  lim_{n→∞} V(Xn) = 0.

3. Let X1, . . . , Xn be iid and let µ = E(X1). Suppose that the variance is finite. Show that X̄n →qm µ.

4. Let X1, X2, . . . be a sequence of random variables such that

    P(Xn = 1/n) = 1 − 1/n²  and  P(Xn = n) = 1/n².

Does Xn converge in probability? Does Xn converge in quadratic mean?

5. Let X1, . . . , Xn ∼ Bernoulli(p). Prove that

    (1/n) Σ_{i=1}^n Xi² →P p  and  (1/n) Σ_{i=1}^n Xi² →qm p.

6. Suppose that the height of men has mean 68 inches and

standard deviation 4 inches. We draw 100 men at ran-

dom. Find (approximately) the probability that the aver-

age height of men in our sample will be at least 68 inches.

7. Let λn = 1/n for n = 1, 2, . . .. Let Xn ∼ Poisson(λn).

(a) Show that Xn →P 0.

(b) Let Yn = nXn. Show that Yn →P 0.

8. Suppose we have a computer program consisting of n = 100 pages of code. Let Xi be the number of errors on the ith page of code. Suppose that the Xi's are Poisson with mean 1 and that they are independent. Let Y = Σ_{i=1}^n Xi be the total number of errors. Use the central limit theorem to approximate P(Y < 90).


9. Suppose that P(X = 1) = P(X = −1) = 1/2. Define

    Xn = X with probability 1 − 1/n,  and  Xn = e^n with probability 1/n.

Does Xn converge to X in probability? Does Xn converge to X in distribution? Does E(X − Xn)² converge to 0?

10. Let Z ∼ N(0, 1). Let t > 0.

(a) Show that, for any k > 0,

    P(|Z| > t) ≤ E|Z|^k / t^k.

(b) (Mill's inequality.) Show that

    P(|Z| > t) ≤ √(2/π) e^{−t²/2} / t.

Hint: Note that P(|Z| > t) = 2P(Z > t). Now write out what P(Z > t) means and note that x/t > 1 whenever x > t.

11. Suppose that Xn ∼ N(0, 1/n) and let X be a random

variable with distribution F (x) = 0 if x < 0 and F (x) = 1

if x ≥ 0. Does Xn converge to X in probability? (Prove or

disprove). Does Xn converge to X in distribution? (Prove

or disprove).

12. Let X, X1, X2, X3, . . . be random variables that are positive and integer valued. Show that Xn ⇝ X if and only if

    lim_{n→∞} P(Xn = k) = P(X = k)

for every integer k.


13. Let Z1, Z2, . . . be iid random variables with density f. Suppose that P(Zi > 0) = 1 and that λ = lim_{x↓0} f(x) > 0. Let

    Xn = n min{Z1, . . . , Zn}.

Show that Xn ⇝ Z where Z has an exponential distribution with mean 1/λ.

14. Let X1, . . . , Xn ∼ Uniform(0, 1). Let Yn = X̄n². Find the limiting distribution of Yn.

15. Let

    (X11, X21)ᵀ, (X12, X22)ᵀ, . . . , (X1n, X2n)ᵀ

be iid random vectors with mean µ = (µ1, µ2)ᵀ and variance Σ. Let

    X̄1 = (1/n) Σ_{i=1}^n X1i,   X̄2 = (1/n) Σ_{i=1}^n X2i

and define Yn = X̄1/X̄2. Find the limiting distribution of Yn.


Part II

Statistical Inference



7

Models, Statistical Inference and Learning

7.1 Introduction

Statistical inference, or "learning" as it is called in computer science, is the process of using data to infer the distribution that generated the data. The basic statistical inference problem is this:

    We observe X1, . . . , Xn ∼ F. We want to infer (or estimate or learn) F or some feature of F such as its mean.

7.2 Parametric and Nonparametric Models

A statistical model is a set of distributions (or a set of densities) F. A parametric model is a set F that can be parameterized by a finite number of parameters. For example, if we assume that the data come from a Normal distribution, then the model is

    F = { f(x; µ, σ) = (1/(σ√(2π))) exp{ −(1/(2σ²))(x − µ)² } : µ ∈ R, σ > 0 }.     (7.1)

This is a two-parameter model. We have written the density as f(x; µ, σ) to show that x is a value of the random variable whereas µ and σ are parameters. In general, a parametric model takes the form

    F = { f(x; θ) : θ ∈ Θ }     (7.2)

where θ is an unknown parameter (or vector of parameters) that can take values in the parameter space Θ. If θ is a vector but we are only interested in one component of θ, we call the remaining parameters nuisance parameters. A nonparametric model is a set F that cannot be parameterized by a finite number of parameters. For example, F_ALL = {all cdf's} is nonparametric. (The distinction between parametric and nonparametric is more subtle than this, but we don't need a rigorous definition for our purposes.)

Example 7.1 (One-dimensional Parametric Estimation.) Let X1, . . . , Xn be independent Bernoulli(p) observations. The problem is to estimate the parameter p. ■

Example 7.2 (Two-dimensional Parametric Estimation.) Suppose that X1, . . . , Xn ∼ F and we assume that the pdf f ∈ F where F is given in (7.1). In this case there are two parameters, µ and σ. The goal is to estimate the parameters from the data. If we are only interested in estimating µ, then µ is the parameter of interest and σ is a nuisance parameter. ■

Example 7.3 (Nonparametric estimation of the cdf.) Let X1, . . . , Xn be independent observations from a cdf F. The problem is to estimate F assuming only that F ∈ F_ALL = {all cdf's}. ■

Example 7.4 (Nonparametric density estimation.) Let X1, . . . , Xn be independent observations from a cdf F and let f = F′ be the pdf. Suppose we want to estimate the pdf f. It is not possible to estimate f assuming only that F ∈ F_ALL. We need to assume some smoothness on f. For example, we might assume that f ∈ F = F_DENS ∩ F_SOB where F_DENS is the set of all probability density functions and

    F_SOB = { f : ∫ (f″(x))² dx < ∞ }.

The class F_SOB is called a Sobolev space; it is the set of functions that are not "too wiggly." ■

Example 7.5 (Nonparametric estimation of functionals.) Let X1, . . . , Xn ∼ F. Suppose we want to estimate µ = E(X1) = ∫ x dF(x), assuming only that µ exists. The mean µ may be thought of as a function of F: we can write µ = T(F) = ∫ x dF(x). In general, any function of F is called a statistical functional. Other examples of functionals are the variance T(F) = ∫ x² dF(x) − (∫ x dF(x))² and the median T(F) = F⁻¹(1/2). ■

Example 7.6 (Regression, prediction and classification.) Suppose we observe pairs of data (X1, Y1), . . . , (Xn, Yn). Perhaps Xi is the blood pressure of subject i and Yi is how long they live. X is called a predictor or regressor or feature or independent variable. Y is called the outcome or the response variable or the dependent variable. We call r(x) = E(Y | X = x) the regression function. If we assume that r ∈ F where F is finite dimensional (the set of straight lines, for example), then we have a parametric regression model. If we assume that r ∈ F where F is not finite dimensional, then we have a nonparametric regression model. The goal of predicting Y for a new patient based on their X value is called prediction. If Y is discrete (for example, live or die) then prediction is instead called classification. If our goal is to estimate the function r, then we call this regression or curve estimation. Regression models are sometimes written as

    Y = r(X) + ε     (7.3)

where E(ε) = 0. We can always rewrite a regression model this way. To see this, define ε = Y − r(X) and hence Y = Y + r(X) − r(X) = r(X) + ε. Moreover, E(ε) = E(E(ε | X)) = E(E(Y − r(X) | X)) = E(E(Y | X) − r(X)) = E(r(X) − r(X)) = 0. ■

What's Next? It is traditional in most introductory courses

to start with parametric inference. Instead, we will start with

nonparametric inference and then we will cover parametric in-

ference. In some respects, nonparametric inference is easier to

understand and is more useful than parametric inference.

Frequentists and Bayesians. There are many approaches

to statistical inference. The two dominant approaches are called

frequentist inference and Bayesian inference. We'll cover

both but we will start with frequentist inference. We'll postpone

a discussion of the pros and cons of these two until later.

Some Notation. If F = {f(x; θ) : θ ∈ Θ} is a parametric model, we write P_θ(X ∈ A) = ∫_A f(x; θ) dx and E_θ(r(X)) = ∫ r(x) f(x; θ) dx. The subscript θ indicates that the probability or expectation is with respect to f(x; θ); it does not mean we are averaging over θ. Similarly, we write V_θ for the variance.

7.3 Fundamental Concepts in Inference

Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing. We will treat all of these problems in detail in the rest of the book. Here, we give a brief introduction to the ideas.

7.3.1 Point Estimation

Point estimation refers to providing a single "best guess" of some quantity of interest. The quantity of interest could be a parameter in a parametric model, a cdf F, a probability density function f, a regression function r, or a prediction for a future value Y of some random variable.

By convention, we denote a point estimate of θ by θ̂. Remember that θ is a fixed, unknown quantity. The estimate θ̂ depends on the data, so θ̂ is a random variable.

More formally, let X1, . . . , Xn be n iid data points from some distribution F. A point estimator θ̂n of a parameter θ is some function of X1, . . . , Xn:

    θ̂n = g(X1, . . . , Xn).

We define

    bias(θ̂n) = E_θ(θ̂n) − θ     (7.4)

to be the bias of θ̂n. We say that θ̂n is unbiased if E(θ̂n) = θ. Unbiasedness used to receive much attention but these days it is not considered very important; many of the estimators we will use are biased. A point estimator θ̂n of a parameter θ is consistent if θ̂n →P θ. Consistency is a reasonable requirement for estimators. The distribution of θ̂n is called the sampling distribution. The standard deviation of θ̂n is called the standard error, denoted by se:

    se = se(θ̂n) = √V(θ̂n).     (7.5)

Often, it is not possible to compute the standard error but usually we can estimate the standard error. The estimated standard error is denoted by ŝe.

Example 7.7 Let X1, . . . , Xn ∼ Bernoulli(p) and let p̂n = n⁻¹ Σ_i Xi. Then E(p̂n) = n⁻¹ Σ_i E(Xi) = p, so p̂n is unbiased. The standard error is se = √V(p̂n) = √(p(1 − p)/n). The estimated standard error is ŝe = √(p̂(1 − p̂)/n). ■
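The following R sketch (not from the text) computes the estimate and estimated standard error of Example 7.7 on simulated coin flips; the sample size and true p are arbitrary choices.

# Plug-in estimate of p and its estimated standard error (Example 7.7).
set.seed(2)
n <- 200; p <- 0.3
x <- rbinom(n, size = 1, prob = p)
p.hat <- mean(x)
se.hat <- sqrt(p.hat * (1 - p.hat) / n)
c(p.hat = p.hat, se.hat = se.hat)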

The quality of a point estimate is sometimes assessed by the mean squared error, or MSE, defined by

    MSE = E_θ(θ̂n − θ)².

Recall that E_θ(·) refers to expectation with respect to the distribution

    f(x1, . . . , xn; θ) = Π_{i=1}^n f(xi; θ)

that generated the data. It does not mean we are averaging over a density for θ.

Theorem 7.8 The MSE can be written as

    MSE = bias(θ̂n)² + V_θ(θ̂n).     (7.6)

Proof. Let θ̄n = E_θ(θ̂n). Then

    E_θ(θ̂n − θ)² = E_θ(θ̂n − θ̄n + θ̄n − θ)²
                 = E_θ(θ̂n − θ̄n)² + 2(θ̄n − θ)E_θ(θ̂n − θ̄n) + E_θ(θ̄n − θ)²
                 = (θ̄n − θ)² + E_θ(θ̂n − θ̄n)²
                 = bias² + V(θ̂n). ■

Theorem 7.9 If bias → 0 and se → 0 as n → ∞, then θ̂n is consistent, that is, θ̂n →P θ.

Proof. If bias → 0 and se → 0 then, by Theorem 7.8, MSE → 0. It follows that θ̂n →qm θ. (Recall Definition 6.3.) The result follows from part (b) of Theorem 6.4. ■

Example 7.10 Returning to the coin flipping example, we have that E_p(p̂n) = p so that bias = p − p = 0, and se = √(p(1 − p)/n) → 0. Hence, p̂n →P p, that is, p̂n is a consistent estimator. ■

Many of the estimators we will encounter will turn out to have, approximately, a Normal distribution.

Definition 7.11 An estimator is asymptotically Normal if

    (θ̂n − θ)/se ⇝ N(0, 1).     (7.7)

7.3.2 Confidence Sets

A 1 − α confidence interval for a parameter θ is an interval Cn = (a, b) where a = a(X1, . . . , Xn) and b = b(X1, . . . , Xn) are functions of the data such that

    P_θ(θ ∈ Cn) ≥ 1 − α   for all θ ∈ Θ.     (7.8)

In words, (a, b) traps θ with probability 1 − α. We call 1 − α the coverage of the confidence interval. Commonly, people use 95 per cent confidence intervals, which corresponds to choosing α = 0.05. Note: Cn is random and θ is fixed! If θ is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

Warning! There is much confusion about how to interpret a confidence interval. A confidence interval is not a probability statement about θ, since θ is a fixed quantity, not a random variable. Some texts interpret confidence intervals as follows: if I repeat the experiment over and over, the interval will contain the parameter 95 per cent of the time. This is correct but useless since we rarely repeat the same experiment over and over. A better interpretation is this:

    On day 1, you collect data and construct a 95 per cent confidence interval for a parameter θ1. On day 2, you collect new data and construct a 95 per cent confidence interval for an unrelated parameter θ2. On day 3, you collect new data and construct a 95 per cent confidence interval for an unrelated parameter θ3. You continue this way, constructing confidence intervals for a sequence of unrelated parameters θ1, θ2, . . . Then 95 per cent of your intervals will trap the true parameter value. There is no need to introduce the idea of repeating the same experiment over and over.

Example 7.12 Every day, newspapers report opinion polls. For example, they might say that "83 per cent of the population favor arming pilots with guns." Usually, you will see a statement like "this poll is accurate to within 4 points 95 per cent of the time." They are saying that 83 ± 4 is a 95 per cent confidence interval for the true but unknown proportion p of people who favor arming pilots with guns. If you form a confidence interval this way every day for the rest of your life, 95 per cent of your intervals will contain the true parameter. This is true even though you are estimating a different quantity (a different poll question) every day. ■

Later, we will discuss Bayesian methods in which we treat θ as if it were a random variable and we do make probability statements about θ. In particular, we will make statements like "the probability that θ is in Cn, given the data, is 95 per cent." However, these Bayesian intervals refer to degree-of-belief probabilities. These Bayesian intervals will not, in general, trap the parameter 95 per cent of the time.

Example 7.13 In the coin flipping setting, let Cn = (p̂n − εn, p̂n + εn) where εn² = log(2/α)/(2n). From Hoeffding's inequality (5.4) it follows that

    P(p ∈ Cn) ≥ 1 − α

for every p. Hence, Cn is a 1 − α confidence interval. ■

As mentioned earlier, point estimators often have a limiting Normal distribution, meaning that equation (7.7) holds, that is, θ̂n ≈ N(θ, ŝe²). In this case we can construct (approximate) confidence intervals as follows.

Theorem 7.14 (Normal-based Confidence Interval.) Suppose that θ̂n ≈ N(θ, ŝe²). Let Φ be the cdf of a standard Normal and let z_{α/2} = Φ⁻¹(1 − (α/2)), that is, P(Z > z_{α/2}) = α/2 and P(−z_{α/2} < Z < z_{α/2}) = 1 − α where Z ∼ N(0, 1). Let

    Cn = (θ̂n − z_{α/2} ŝe, θ̂n + z_{α/2} ŝe).     (7.9)

Then

    P_θ(θ ∈ Cn) → 1 − α.     (7.10)

Proof. Let Zn = (θ̂n − θ)/ŝe. By assumption, Zn ⇝ Z where Z ∼ N(0, 1). Hence,

    P_θ(θ ∈ Cn) = P_θ(θ̂n − z_{α/2} ŝe < θ < θ̂n + z_{α/2} ŝe)
                = P_θ(−z_{α/2} < (θ̂n − θ)/ŝe < z_{α/2})
                → P(−z_{α/2} < Z < z_{α/2})
                = 1 − α. ■

For 95 per cent confidence intervals, α = 0.05 and z_{α/2} = 1.96 ≈ 2, leading to the approximate 95 per cent confidence interval θ̂n ± 2 ŝe. We will discuss the construction of confidence intervals in more generality in the rest of the book.

Example 7.15 Let X1, . . . , Xn ∼ Bernoulli(p) and let p̂n = n⁻¹ Σ_{i=1}^n Xi. Then V(p̂n) = n⁻² Σ_{i=1}^n V(Xi) = n⁻² Σ_{i=1}^n p(1 − p) = n⁻² n p(1 − p) = p(1 − p)/n. Hence, se = √(p(1 − p)/n) and ŝe = √(p̂n(1 − p̂n)/n). By the Central Limit Theorem, p̂n ≈ N(p, ŝe²). Therefore, an approximate 1 − α confidence interval is

    p̂n ± z_{α/2} ŝe = p̂n ± z_{α/2} √(p̂n(1 − p̂n)/n).

Compare this with the confidence interval in the previous example. The Normal-based interval is shorter but it only has approximately (large sample) correct coverage. ■
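As a concrete comparison of Examples 7.13 and 7.15, here is a small R sketch (not from the text); n, p, and α are arbitrary illustrative choices.

# Hoeffding interval (Example 7.13) versus Normal-based interval (Example 7.15).
set.seed(3)
n <- 100; p <- 0.4; alpha <- 0.05
x <- rbinom(n, size = 1, prob = p)
p.hat <- mean(x)
eps <- sqrt(log(2 / alpha) / (2 * n))            # Hoeffding half-width
hoeffding <- c(p.hat - eps, p.hat + eps)
se.hat <- sqrt(p.hat * (1 - p.hat) / n)
normal <- p.hat + c(-1, 1) * qnorm(1 - alpha / 2) * se.hat
rbind(hoeffding, normal)                         # the Normal interval is shorter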

7.3.3 Hypothesis Testing

In hypothesis testing, we start with some default theory, called a null hypothesis, and we ask if the data provide sufficient evidence to reject the theory. If not, we retain the null hypothesis. (The term "retaining the null hypothesis" is due to Chris Genovese. Other terminology is "accepting the null" or "failing to reject the null.")

Example 7.16 (Testing if a Coin is Fair.) Suppose X1, . . . , Xn ∼ Bernoulli(p) denote n independent coin flips. Suppose we want to test if the coin is fair. Let H0 denote the hypothesis that the coin is fair and let H1 denote the hypothesis that the coin is not fair. H0 is called the null hypothesis and H1 is called the alternative hypothesis. We can write the hypotheses as

    H0 : p = 1/2  versus  H1 : p ≠ 1/2.

It seems reasonable to reject H0 if T = |p̂n − (1/2)| is large. When we discuss hypothesis testing in detail, we will be more precise about how large T should be to reject H0. ■

7.4 Bibliographic Remarks

Statistical inference is covered in many texts. Elementary

texts include DeGroot and Schervish (2001) and Larsen and

Marx (1986). At the intermediate level I recommend Casella

and Berger (2002) and Bickel and Doksum (2001). At the ad-

vanced level, Lehmann and Casella (1998), Lehmann (1986) and

van der Vaart (1998).

7.5 Technical Appendix

Our definition of a confidence interval requires that P_θ(θ ∈ Cn) ≥ 1 − α for all θ ∈ Θ. A pointwise asymptotic confidence interval requires that liminf_{n→∞} P_θ(θ ∈ Cn) ≥ 1 − α for all θ ∈ Θ. A uniform asymptotic confidence interval requires that liminf_{n→∞} inf_{θ∈Θ} P_θ(θ ∈ Cn) ≥ 1 − α. The approximate Normal-based interval is a pointwise asymptotic confidence interval. In general, it might not be a uniform asymptotic confidence interval.


8

Estimating the cdf and Statistical Functionals

The first inference problem we will consider is nonparametric estimation of the cdf F and of functions of the cdf.

8.1 The Empirical Distribution Function

Let X1, . . . , Xn ∼ F be iid where F is a distribution function on the real line. We will estimate F with the empirical distribution function, which is defined as follows.

Definition 8.1 The empirical distribution function F̂n is the cdf that puts mass 1/n at each data point Xi. Formally,

    F̂n(x) = Σ_{i=1}^n I(Xi ≤ x) / n
          = (number of observations less than or equal to x) / n     (8.1)

where

    I(Xi ≤ x) = 1 if Xi ≤ x, and 0 if Xi > x.
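Here is a minimal R sketch (not from the text) of Definition 8.1, computing F̂n by hand and comparing it with base R's ecdf(); the simulated data are an arbitrary illustration, not the nerve data of Example 8.2.

# Empirical distribution function, by hand and via ecdf().
set.seed(4)
x <- rexp(25, rate = 2)
Fn <- function(t) mean(x <= t)     # F_n(t) = (1/n) * sum of I(X_i <= t)
Fn(0.5)
ecdf(x)(0.5)                       # same value from the built-in ecdf
plot(ecdf(x), main = "Empirical distribution function")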

Example 8.2 (Nerve Data.) Cox and Lewis (1966) reported 799 waiting times between successive pulses along a nerve fibre. The first plot in Figure 8.1 shows a "toothpick plot" where each toothpick shows the location of one data point. The second plot shows the empirical cdf F̂n. Suppose we want to estimate the fraction of waiting times between .4 and .6 seconds. The estimate is F̂n(.6) − F̂n(.4) = .93 − .84 = .09. ■

The following theorems give some properties of F̂n(x).

Theorem 8.3 At any fixed value of x,

    E(F̂n(x)) = F(x)  and  V(F̂n(x)) = F(x)(1 − F(x))/n.

Thus,

    MSE = F(x)(1 − F(x))/n → 0

and hence F̂n(x) →P F(x). (In fact, sup_x |F̂n(x) − F(x)| converges to 0 almost surely.)

Theorem 8.4 (Glivenko-Cantelli Theorem.) Let X1, . . . , Xn ∼ F. Then

    sup_x |F̂n(x) − F(x)| →P 0.

8.2 Statistical Functionals

A statistical functional T(F) is any function of F. Examples are the mean µ = ∫ x dF(x), the variance σ² = ∫ (x − µ)² dF(x) and the median m = F⁻¹(1/2).

Definition 8.5 The plug-in estimator of θ = T(F) is defined by

    θ̂n = T(F̂n).

In other words, just plug in F̂n for the unknown F.

[Figure 8.1 appears here: a toothpick plot of the waiting times and, below it, a plot of the cdf against time.]

FIGURE 8.1. Nerve data. The solid line in the middle is the empirical distribution function. The lines above and below the middle line are a 95 per cent confidence band. The confidence band is explained in the appendix.

A functional of the form ∫ r(x) dF(x) is called a linear functional. Recall that ∫ r(x) dF(x) is defined to be ∫ r(x) f(x) dx in the continuous case and Σ_j r(xj) f(xj) in the discrete case. The empirical cdf F̂n(x) is discrete, putting mass 1/n at each Xi. Hence, if T(F) = ∫ r(x) dF(x) is a linear functional, then we have:

The plug-in estimator for a linear functional T(F) = ∫ r(x) dF(x) is:

    T(F̂n) = ∫ r(x) dF̂n(x) = (1/n) Σ_{i=1}^n r(Xi).     (8.2)

Sometimes we can find the estimated standard error ŝe of T(F̂n) by doing some calculations. However, in other cases it is not obvious how to estimate the standard error. In the next chapter, we will discuss a general method for finding ŝe. For now, let us just assume that somehow we can find ŝe. In many cases, it turns out that

    T(F̂n) ≈ N(T(F), ŝe²).

By equation (7.10), an approximate 1 − α confidence interval for T(F) is then

    T(F̂n) ± z_{α/2} ŝe.     (8.3)

We will call this the Normal-based interval. For a 95 per cent confidence interval, z_{α/2} = z_{.025} = 1.96 ≈ 2, so the interval is

    T(F̂n) ± 2 ŝe.

Example 8.6 (The mean.) Let µ = T(F) = ∫ x dF(x). The plug-in estimator is µ̂ = ∫ x dF̂n(x) = X̄n. The standard error is se = √V(X̄n) = σ/√n. If σ̂ denotes an estimate of σ, then the estimated standard error is σ̂/√n. (In the next example, we shall see how to estimate σ.) A Normal-based confidence interval for µ is X̄n ± z_{α/2} ŝe. ■

Example 8.7 (The Variance.) Let σ² = T(F) = V(X) = ∫ x² dF(x) − (∫ x dF(x))². The plug-in estimator is

    σ̂² = ∫ x² dF̂n(x) − (∫ x dF̂n(x))²
       = (1/n) Σ_{i=1}^n Xi² − ((1/n) Σ_{i=1}^n Xi)²
       = (1/n) Σ_{i=1}^n (Xi − X̄n)².

Another reasonable estimator of σ² is the sample variance

    Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².

In practice, there is little difference between σ̂² and Sn² and you can use either one. Returning to our last example, we now see that the estimated standard error of the estimate of the mean is ŝe = σ̂/√n. ■

Example 8.8 (The Skewness.) Let µ and σ² denote the mean and variance of a random variable X. The skewness is defined to be

    κ = E(X − µ)³ / σ³ = ∫ (x − µ)³ dF(x) / { ∫ (x − µ)² dF(x) }^{3/2}.

The skewness measures the lack of symmetry of a distribution. To find the plug-in estimate, first recall that µ̂ = n⁻¹ Σ_i Xi and σ̂² = n⁻¹ Σ_i (Xi − µ̂)². The plug-in estimate of κ is

    κ̂ = ∫ (x − µ̂)³ dF̂n(x) / { ∫ (x − µ̂)² dF̂n(x) }^{3/2} = (1/n) Σ_i (Xi − µ̂)³ / σ̂³. ■
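The following R sketch (not from the text) computes the plug-in skewness estimate of Example 8.8; the Exponential(1) sample is an arbitrary illustration (its true skewness is 2).

# Plug-in skewness estimate (Example 8.8).
set.seed(5)
x <- rexp(500)
mu.hat <- mean(x)
sigma.hat <- sqrt(mean((x - mu.hat)^2))      # plug-in sd (divides by n, not n - 1)
kappa.hat <- mean((x - mu.hat)^3) / sigma.hat^3
kappa.hat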

Example 8.9 (Correlation.) Let Z = (X, Y) and let ρ = T(F) = E[(X − µX)(Y − µY)]/(σX σY) denote the correlation between X and Y, where F(x, y) is bivariate. We can write T(F) = a(T1(F), T2(F), T3(F), T4(F), T5(F)) where

    T1(F) = ∫ x dF(z),  T2(F) = ∫ y dF(z),  T3(F) = ∫ xy dF(z),
    T4(F) = ∫ x² dF(z),  T5(F) = ∫ y² dF(z)

and

    a(t1, . . . , t5) = (t3 − t1 t2) / √((t4 − t1²)(t5 − t2²)).

Replace F with F̂n in T1(F), . . . , T5(F), and take

    ρ̂ = a(T1(F̂n), T2(F̂n), T3(F̂n), T4(F̂n), T5(F̂n)).

We get

    ρ̂ = Σ_i (Xi − X̄n)(Yi − Ȳn) / { √(Σ_i (Xi − X̄n)²) √(Σ_i (Yi − Ȳn)²) }

which is called the sample correlation. ■

Example 8.10 (Quantiles.) Let F be strictly increasing with density f. Let T(F) = F⁻¹(p) be the pth quantile. The estimate of T(F) is F̂n⁻¹(p). We have to be a bit careful since F̂n is not invertible. To avoid ambiguity we define F̂n⁻¹(p) = inf{x : F̂n(x) ≥ p}. We call F̂n⁻¹(p) the pth sample quantile. ■

Only in the first example did we compute a standard error or a confidence interval. How shall we handle the other examples? When we discuss parametric methods, we will develop formulae for standard errors and confidence intervals. But in our nonparametric setting we need something else. In the next chapter, we will introduce two methods, the jackknife and the bootstrap, for getting standard errors and confidence intervals.

Example 8.11 (Plasma Cholesterol.) Figure 8.2 shows histograms for plasma cholesterol (in mg/dl) for 371 patients with chest pain (Scott et al. 1978). The histograms show the percentage of patients in 10 bins. The first histogram is for 51 patients who had no evidence of heart disease while the second histogram is for 320 patients who had narrowing of the arteries. Is the mean cholesterol different in the two groups? Let us regard these data as samples from two distributions F1 and F2. Let µ1 = ∫ x dF1(x) and µ2 = ∫ x dF2(x) denote the means of the two populations. The plug-in estimates are µ̂1 = ∫ x dF̂n,1(x) = X̄n,1 = 195.27 and µ̂2 = ∫ x dF̂n,2(x) = X̄n,2 = 216.19. Recall that the standard error of the sample mean µ̂ = (1/n) Σ_{i=1}^n Xi is

    se(µ̂) = √V((1/n) Σ_{i=1}^n Xi) = √((1/n²) Σ_{i=1}^n V(Xi)) = √(nσ²/n²) = σ/√n

which we estimate by

    ŝe(µ̂) = σ̂/√n

where

    σ̂ = √((1/n) Σ_{i=1}^n (Xi − X̄)²).

For the two groups this yields ŝe(µ̂1) = 5.0 and ŝe(µ̂2) = 2.4. Approximate 95 per cent confidence intervals for µ1 and µ2 are µ̂1 ± 2 ŝe(µ̂1) = (185, 205) and µ̂2 ± 2 ŝe(µ̂2) = (211, 221).

Now consider the functional θ = T(F2) − T(F1), whose plug-in estimate is θ̂ = µ̂2 − µ̂1 = 216.19 − 195.27 = 20.92. The standard error of θ̂ is

    se = √V(µ̂2 − µ̂1) = √(V(µ̂2) + V(µ̂1)) = √((se(µ̂1))² + (se(µ̂2))²)

and we estimate this by

    ŝe = √((ŝe(µ̂1))² + (ŝe(µ̂2))²) = 5.55.

An approximate 95 per cent confidence interval for θ is θ̂ ± 2 ŝe = (9.8, 32.0). This suggests that cholesterol is higher among those with narrowed arteries. We should not jump to the conclusion (from these data) that cholesterol causes heart disease. The leap from statistical evidence to causation is very subtle and is discussed later in this text. ■
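The R sketch below mirrors the plug-in analysis of Example 8.11, but on simulated stand-in data: the Scott et al. cholesterol measurements are not reproduced here, so the two Normal samples are hypothetical, chosen only to have roughly the reported group means.

# Plug-in difference of means with an approximate 95 per cent interval.
set.seed(6)
x1 <- rnorm(51, mean = 195, sd = 35)    # stand-in for the "no heart disease" group
x2 <- rnorm(320, mean = 216, sd = 43)   # stand-in for the "narrowed arteries" group
se.hat <- function(x) sqrt(mean((x - mean(x))^2) / length(x))   # sigma.hat / sqrt(n)
theta.hat <- mean(x2) - mean(x1)
se.diff <- sqrt(se.hat(x1)^2 + se.hat(x2)^2)
c(estimate = theta.hat, lower = theta.hat - 2 * se.diff, upper = theta.hat + 2 * se.diff)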

8.3 Technical Appendix

In this appendix we explain how to construct a confidence band for the cdf.

Theorem 8.12 (Dvoretzky-Kiefer-Wolfowitz (DKW) inequality.) Let X1, . . . , Xn be iid from F. Then, for any ε > 0,

    P(sup_x |F(x) − F̂n(x)| > ε) ≤ 2e^{−2nε²}.     (8.4)

From the DKW inequality, we can construct a confidence set. Let εn² = log(2/α)/(2n), L(x) = max{F̂n(x) − εn, 0} and U(x) = min{F̂n(x) + εn, 1}. It follows from (8.4) that for any F,

    P(F ∈ Cn) ≥ 1 − α.

Thus, Cn is a nonparametric 1 − α confidence set for F. A better name for Cn is a confidence band. To summarize:

A 1 − α nonparametric confidence band for F is (L(x), U(x)) where

    L(x) = max{F̂n(x) − εn, 0}
    U(x) = min{F̂n(x) + εn, 1}
    εn = √((1/(2n)) log(2/α)).
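Here is a minimal R sketch (not from the text) of the DKW confidence band just summarized; the N(0, 1) sample and α = 0.05 are arbitrary illustrative choices.

# DKW 1 - alpha confidence band for F.
set.seed(7)
x <- rnorm(100); n <- length(x); alpha <- 0.05
eps <- sqrt(log(2 / alpha) / (2 * n))
xs <- sort(x)
Fn <- ecdf(x)(xs)
L <- pmax(Fn - eps, 0)
U <- pmin(Fn + eps, 1)
plot(xs, Fn, type = "s", ylim = c(0, 1), xlab = "x", ylab = "cdf")
lines(xs, L, type = "s", lty = 2)
lines(xs, U, type = "s", lty = 2)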

[Figure 8.2 appears here: two histograms of plasma cholesterol (mg/dl), for patients without heart disease and for patients with heart disease, over the range 100 to 400.]

FIGURE 8.2. Plasma cholesterol for 51 patients with no heart disease and 320 patients with narrowing of the arteries.

Example 8.13 The dashed lines in Figure 8.1 give a 95 per cent confidence band using εn = √((1/(2n)) log(2/.05)) = .048. ■

8.4 Bibliographic Remarks

The Glivenko-Cantelli theorem is the tip of the iceberg. The

theory of distribution functions is a special case of what are

called empirical processes which underlie much of modern statis-

tical theory. Some references on empirical processes are Shorack

and Wellner (1986) and van der Vaart and Wellner (1996).

8.5 Exercises

1. Prove Theorem 8.3.

2. Let X1, . . . , Xn ∼ Bernoulli(p) and let Y1, . . . , Ym ∼ Bernoulli(q). Find the plug-in estimator and estimated standard error for p. Find an approximate 90 per cent confidence interval for p. Find the plug-in estimator and estimated standard error for p − q. Find an approximate 90 per cent confidence interval for p − q.

3. (Computer Experiment.) Generate 100 observations from a N(0,1) distribution. Compute a 95 per cent confidence band for the cdf F. Repeat this 1000 times and see how often the confidence band contains the true distribution function. Repeat using data from a Cauchy distribution.

4. Let X1, . . . , Xn ∼ F and let F̂n(x) be the empirical distribution function. For a fixed x, use the central limit theorem to find the limiting distribution of F̂n(x).

5. Let x and y be two distinct points. Find Cov(F̂n(x), F̂n(y)).

6. Let X1, . . . , Xn ∼ F and let F̂n be the empirical distribution function. Let a < b be fixed numbers and define θ = T(F) = F(b) − F(a). Let θ̂ = T(F̂n) = F̂n(b) − F̂n(a). Find the estimated standard error of θ̂. Find an expression for an approximate 1 − α confidence interval for θ.

7. Data on the magnitudes of earthquakes near Fiji are available on the course website. Estimate the cdf F(x). Compute and plot a 95 per cent confidence envelope for F. Find an approximate 95 per cent confidence interval for F(4.9) − F(4.3).

8. Get the data on eruption times and waiting times between eruptions of the Old Faithful geyser from the course website. Estimate the mean waiting time and give a standard error for the estimate. Also, give a 90 per cent confidence interval for the mean waiting time. Now estimate the median waiting time. In the next chapter we will see how to get the standard error for the median.

9. 100 people are given a standard antibiotic to treat an infection and another 100 are given a new antibiotic. In the first group, 90 people recover; in the second group, 85 people recover. Let p1 be the probability of recovery under the standard treatment and let p2 be the probability of recovery under the new treatment. We are interested in estimating θ = p1 − p2. Provide an estimate, standard error, an 80 per cent confidence interval, and a 95 per cent confidence interval for θ.

10. In 1975, an experiment was conducted to see if cloud seed-

ing produced rainfall. 26 clouds were seeded with silver

nitrate and 26 were not. The decision to seed or not was

made at random. Get the data from

http://lib.stat.cmu.edu/DASL/Stories/CloudSeeding.html

Let θ be the difference in the median precipitation from the two groups. Estimate θ. Estimate the standard error of the estimate and produce a 95 per cent confidence interval.


9

The Bootstrap

The bootstrap is a nonparametric method for estimating standard errors and computing confidence intervals. Let

    Tn = g(X1, . . . , Xn)

be a statistic, that is, any function of the data. Suppose we want to know V_F(Tn), the variance of Tn. We have written V_F to emphasize that the variance usually depends on the unknown distribution function F. For example, if Tn = n⁻¹ Σ_{i=1}^n Xi then V_F(Tn) = σ²/n where σ² = ∫ (x − µ)² dF(x) and µ = ∫ x dF(x). The bootstrap idea has two parts. First we estimate V_F(Tn) with V_{F̂n}(Tn). Thus, for Tn = n⁻¹ Σ_{i=1}^n Xi we have V_{F̂n}(Tn) = σ̂²/n where σ̂² = n⁻¹ Σ_{i=1}^n (Xi − X̄n)². However, in more complicated cases we cannot write down a simple formula for V_{F̂n}(Tn). This leads us to the second step, which is to approximate V_{F̂n}(Tn) using simulation. Before proceeding, let us discuss the idea of simulation.

9.1 Simulation

Suppose we draw an iid sample Y1, . . . , YB from a distribution G. By the law of large numbers,

    Ȳn = (1/B) Σ_{j=1}^B Yj →P ∫ y dG(y) = E(Y)

as B → ∞. So if we draw a large sample from G, we can use the sample mean Ȳn to approximate E(Y). In a simulation, we can make B as large as we like, in which case the difference between Ȳn and E(Y) is negligible. More generally, if h is any function with finite mean then

    (1/B) Σ_{j=1}^B h(Yj) →P ∫ h(y) dG(y) = E(h(Y))

as B → ∞. In particular,

    (1/B) Σ_{j=1}^B (Yj − Ȳ)² = (1/B) Σ_{j=1}^B Yj² − ((1/B) Σ_{j=1}^B Yj)² →P ∫ y² dG(y) − (∫ y dG(y))² = V(Y).

Hence, we can use the sample variance of the simulated values to approximate V(Y).

9.2 Bootstrap Variance Estimation

According to what we just learned, we can approximate V_{F̂n}(Tn) by simulation. Now V_{F̂n}(Tn) means "the variance of Tn if the distribution of the data is F̂n." How can we simulate from the distribution of Tn when the data are assumed to have distribution F̂n? The answer is to simulate X1*, . . . , Xn* from F̂n and then compute Tn* = g(X1*, . . . , Xn*). This constitutes one draw from the distribution of Tn. The idea is illustrated in the following diagram:

    Real world:       F    ⟹  X1, . . . , Xn    ⟹  Tn = g(X1, . . . , Xn)
    Bootstrap world:  F̂n   ⟹  X1*, . . . , Xn*  ⟹  Tn* = g(X1*, . . . , Xn*)

How do we simulate X1*, . . . , Xn* from F̂n? Notice that F̂n puts mass 1/n at each data point X1, . . . , Xn. Therefore, drawing an observation from F̂n is equivalent to drawing one point at random from the original data set. Thus, to simulate X1*, . . . , Xn* ∼ F̂n it suffices to draw n observations with replacement from X1, . . . , Xn. Here is a summary:

Bootstrap Variance Estimation

1. Draw X1*, . . . , Xn* ∼ F̂n.

2. Compute Tn* = g(X1*, . . . , Xn*).

3. Repeat steps 1 and 2, B times, to get Tn,1*, . . . , Tn,B*.

4. Let

    v_boot = (1/B) Σ_{b=1}^B ( Tn,b* − (1/B) Σ_{r=1}^B Tn,r* )².     (9.1)

Example 9.1 The following code shows how to use the bootstrap to estimate the standard error of the median.

Bootstrap for The Median

Given data X = (X[1], ..., X[n]):

T <- median(X)
B <- 1000                                # number of bootstrap replications
Tboot <- numeric(B)
for (i in 1:B) {
  Xstar <- sample(X, size = length(X), replace = TRUE)  # draw n points from Fn-hat
  Tboot[i] <- median(Xstar)
}
se <- sqrt(var(Tboot))

The following schematic diagram will remind you that we are using two approximations:

    V_F(Tn)  ≈  V_{F̂n}(Tn)  ≈  v_boot,

where the error in the first approximation is not so small, and the error in the second is small.

Example 9.2 Consider the nerve data. Let θ = T(F) = ∫ (x − µ)³ dF(x)/σ³ be the skewness. The skewness is a measure of asymmetry. A Normal distribution, for example, has skewness 0. The plug-in estimate of the skewness is

    θ̂ = T(F̂n) = ∫ (x − µ̂)³ dF̂n(x) / σ̂³ = (1/n) Σ_{i=1}^n (Xi − X̄n)³ / σ̂³ = 1.76.

To estimate the standard error with the bootstrap we follow the same steps as with the median example, except we compute the skewness from each bootstrap sample. When applied to the nerve data, the bootstrap, based on B = 1000 replications, yields a standard error for the estimated skewness of .16.
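The R sketch below carries out the bootstrap of Example 9.2 on simulated stand-in data, since the nerve data are not reproduced here; B = 1000 follows the text.

# Bootstrap standard error for the plug-in skewness.
set.seed(8)
x <- rexp(799)                              # hypothetical stand-in for the 799 waiting times
skew <- function(z) mean((z - mean(z))^3) / mean((z - mean(z))^2)^(3/2)
B <- 1000
Tboot <- replicate(B, skew(sample(x, replace = TRUE)))
c(theta.hat = skew(x), se.boot = sd(Tboot))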

9.3 Bootstrap Confidence Intervals

There are several ways to construct bootstrap confidence intervals. Here we discuss three methods.

Normal Interval. The simplest is the Normal interval

    Tn ± z_{α/2} ŝe_boot     (9.2)

where ŝe_boot is the bootstrap estimate of the standard error. This interval is not accurate unless the distribution of Tn is close to Normal.

Pivotal Intervals. Let θ = T(F) and θ̂n = T(F̂n), and define the pivot Rn = θ̂n − θ. Let θ̂*_{n,1}, . . . , θ̂*_{n,B} denote bootstrap replications of θ̂n. Let H(r) denote the cdf of the pivot:

    H(r) = P_F(Rn ≤ r).     (9.3)

Define Cn* = (a, b) where

    a = θ̂n − H⁻¹(1 − α/2)  and  b = θ̂n − H⁻¹(α/2).     (9.4)

It follows that

    P(a ≤ θ ≤ b) = P(a − θ̂n ≤ θ − θ̂n ≤ b − θ̂n)
                 = P(θ̂n − b ≤ θ̂n − θ ≤ θ̂n − a)
                 = P(θ̂n − b ≤ Rn ≤ θ̂n − a)
                 = H(θ̂n − a) − H(θ̂n − b)
                 = H(H⁻¹(1 − α/2)) − H(H⁻¹(α/2))
                 = 1 − α/2 − α/2 = 1 − α.

Hence, Cn* is an exact 1 − α confidence interval for θ. Unfortunately, a and b depend on the unknown distribution H, but we can form a bootstrap estimate of H:

    Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)     (9.5)

where R*_{n,b} = θ̂*_{n,b} − θ̂n. Let r*_β denote the β sample quantile of (R*_{n,1}, . . . , R*_{n,B}) and let θ̂*_β denote the β sample quantile of (θ̂*_{n,1}, . . . , θ̂*_{n,B}). Note that r*_β = θ̂*_β − θ̂n. It follows that an approximate 1 − α confidence interval is Cn = (â, b̂) where

    â = θ̂n − Ĥ⁻¹(1 − α/2) = θ̂n − r*_{1−α/2} = 2θ̂n − θ̂*_{1−α/2}
    b̂ = θ̂n − Ĥ⁻¹(α/2) = θ̂n − r*_{α/2} = 2θ̂n − θ̂*_{α/2}.

In summary:

    The 1 − α bootstrap pivotal confidence interval is

        Cn = ( 2θ̂n − θ̂*_{1−α/2}, 2θ̂n − θ̂*_{α/2} ).     (9.6)

Typically, this is a pointwise, asymptotic confidence interval.

Theorem 9.3 Under weak conditions on T(F), P_F(T(F) ∈ Cn) → 1 − α as n → ∞, where Cn is given in (9.6).

Percentile Intervals. The bootstrap percentile interval is defined by

    Cn = ( θ̂*_{α/2}, θ̂*_{1−α/2} ).

The justification for this interval is given in the appendix.
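As a compact worked illustration of the three intervals just described, here is an R sketch (not from the text) that computes the Normal, pivotal, and percentile intervals for the median of an arbitrary simulated sample.

# Normal, pivotal, and percentile bootstrap intervals for the median.
set.seed(9)
x <- rexp(100); theta.hat <- median(x)
B <- 2000; alpha <- 0.05
theta.star <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sd(theta.star)
q <- quantile(theta.star, c(alpha / 2, 1 - alpha / 2))
normal     <- theta.hat + c(-1, 1) * qnorm(1 - alpha / 2) * se.boot
pivotal    <- c(2 * theta.hat - q[2], 2 * theta.hat - q[1])
percentile <- c(q[1], q[2])
rbind(normal, pivotal, percentile)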

Example 9.4 For estimating the skewness of the nerve data, here are the various confidence intervals.

    Method       95% Interval
    Normal       (1.44, 2.09)
    Percentile   (1.42, 2.03)
    Pivotal      (1.48, 2.11)

All these confidence intervals are approximate. The probability that T(F) is in the interval is not exactly 1 − α. All three intervals have the same level of accuracy. There are more accurate bootstrap confidence intervals, but they are more complicated and we will not discuss them here.

Example 9.5 (The Plasma Cholesterol Data.) Let us return to the cholesterol data. Suppose we are interested in the difference of the medians. The bootstrap analysis, in R, is as follows (given the two samples x1 and x2):

x1 <- ...   # first sample (no heart disease)
x2 <- ...   # second sample (narrowing of the arteries)
n1 <- length(x1)
n2 <- length(x2)
th.hat <- median(x2) - median(x1)
B <- 1000
Tboot <- numeric(B)
for (i in 1:B) {
  xx1 <- sample(x1, size = n1, replace = TRUE)
  xx2 <- sample(x2, size = n2, replace = TRUE)
  Tboot[i] <- median(xx2) - median(xx1)
}
se <- sqrt(var(Tboot))
Normal     <- c(th.hat - 2 * se, th.hat + 2 * se)
percentile <- c(quantile(Tboot, .025), quantile(Tboot, .975))
pivotal    <- c(2 * th.hat - quantile(Tboot, .975), 2 * th.hat - quantile(Tboot, .025))

The point estimate is 18.5, the bootstrap standard error is 7.42 and the resulting approximate 95 per cent confidence intervals are as follows:

    Method       95% Interval
    Normal       (3.7, 33.3)
    Percentile   (5.0, 33.3)
    Pivotal      (5.0, 34.0)

Since these intervals exclude 0, it appears that the second group has higher cholesterol, although there is considerable uncertainty about how much higher, as reflected in the width of the intervals.

The next two examples are based on small sample sizes. In practice, statistical methods based on very small sample sizes might not be reliable. We include the examples for their pedagogical value, but we do want to sound a note of caution: interpret the results with some skepticism.

Example 9.6 Here is an example that was one of the first used to illustrate the bootstrap by Bradley Efron, the inventor of the bootstrap. The data are LSAT scores (for entrance to law school) and GPA.

    LSAT  576  635  558  578  666  580  555  661  651  605  653  575  545  572  594
    GPA  3.39 3.30 2.81 3.03 3.44 3.07 3.00 3.43 3.36 3.13 3.12 2.74 2.76 2.88 2.96

Each data point is of the form Xi = (Yi, Zi) where Yi = LSATi and Zi = GPAi. The law school is interested in the correlation

    θ = ∫∫ (y − µY)(z − µZ) dF(y, z) / √( ∫ (y − µY)² dF(y) ∫ (z − µZ)² dF(z) ).

The plug-in estimate is the sample correlation

    θ̂ = Σ_i (Yi − Ȳ)(Zi − Z̄) / √( Σ_i (Yi − Ȳ)² Σ_i (Zi − Z̄)² ).

The estimated correlation is θ̂ = .776. The bootstrap based on B = 1000 gives ŝe = .137. Figure 9.1 shows the data and a histogram of the bootstrap replications θ̂*_1, . . . , θ̂*_B. This histogram is an approximation to the sampling distribution of θ̂. The Normal-based 95 per cent confidence interval is .78 ± 2ŝe = (.51, 1.00) while the percentile interval is (.46, .96). In large samples, the two methods will show closer agreement.
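The following R sketch reproduces the bootstrap analysis of Example 9.6 using the LSAT/GPA data listed above; B = 1000 follows the text, and the exact numbers will differ slightly from run to run.

# Bootstrap standard error and percentile interval for the sample correlation.
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594)
gpa  <- c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96)
theta.hat <- cor(lsat, gpa)                     # approximately .776, as in the text
set.seed(10)
B <- 1000
theta.star <- replicate(B, {
  idx <- sample(length(lsat), replace = TRUE)   # resample (Yi, Zi) pairs jointly
  cor(lsat[idx], gpa[idx])
})
c(theta.hat = theta.hat, se.boot = sd(theta.star))
quantile(theta.star, c(.025, .975))             # percentile interval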

Example 9.7 This example is borrowed from An Introduction to the Bootstrap by B. Efron and R. Tibshirani. When drug companies introduce new medications, they are sometimes required to show bioequivalence. This means that the new drug is not substantially different from the current treatment. Here are data on eight subjects who used medical patches to infuse a hormone into the blood. Each subject received three treatments: placebo, old-patch, new-patch.

subject placebo old new old-placebo new-old

______________________________________________________________

1 9243 17649 16449 8406 -1200

2 9671 12013 14614 2342 2601

3 11792 19979 17274 8187 -2705

4 13357 21816 23798 8459 1982

5 9055 13850 12560 4795 -1290

6 6290 9806 10157 3516 351

7 12412 17208 16570 4796 -638

8 18806 29044 26325 10238 -2719

Let Z = old − placebo and Y = new − old. The Food and Drug Administration (FDA) requirement for bioequivalence is that |θ| ≤ .20 where

    θ = E_F(Y) / E_F(Z).

[Figure 9.1 appears here: a scatterplot of GPA against LSAT and a histogram of the bootstrap samples of the correlation.]

FIGURE 9.1. Law school data.

The estimate of θ is

    θ̂ = Ȳ/Z̄ = −452.3/6342 = −.0713.

The bootstrap standard error is ŝe = .105. To answer the bioequivalence question, we compute a confidence interval. From B = 1000 bootstrap replications we get the 95 per cent interval (−.24, .15). This is not quite contained in [−.20, .20], so at the 95 per cent level we have not demonstrated bioequivalence. Figure 9.2 shows the histogram of the bootstrap values.

9.4 Bibliographic Remarks

The bootstrap was invented by Efron (1979). There are several

books on these topics including Efron and Tibshirani (1993),

Davison and Hinkley (1997), Hall (1992), and Shao and Tu

(1995). Also, see section 3.6 of van der Vaart and Wellner (1996).

9.5 Technical Appendix

9.5.1 The Jackknife

There is another method for computing standard errors called the jackknife, due to Quenouille (1949). It is less computationally expensive than the bootstrap but is less general. Let Tn = T(X1, . . . , Xn) be a statistic and let T_(−i) denote the statistic with the ith observation removed. Let T̄n = n⁻¹ Σ_{i=1}^n T_(−i). The jackknife estimate of Var(Tn) is

    v_jack = ((n − 1)/n) Σ_{i=1}^n (T_(−i) − T̄n)²

and the jackknife estimate of the standard error is ŝe_jack = √v_jack. Under suitable conditions on T, it can be shown that v_jack consistently estimates Var(Tn), in the sense that v_jack / Var(Tn) →P 1. However, unlike the bootstrap, the jackknife does not produce consistent estimates of the standard error of sample quantiles.

[Figure 9.2 appears here: a histogram of the bootstrap samples of θ̂ for the patch data.]

FIGURE 9.2. Patch data.
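Here is a minimal R sketch (not from the text) of the jackknife standard error defined above, applied to the sample mean of an arbitrary simulated sample; for the mean, the jackknife reproduces the usual formula s/√n.

# Jackknife estimate of the standard error.
set.seed(11)
x <- rexp(50); n <- length(x)
T.minus.i <- sapply(1:n, function(i) mean(x[-i]))   # statistic with observation i removed
v.jack <- (n - 1) / n * sum((T.minus.i - mean(T.minus.i))^2)
c(se.jack = sqrt(v.jack), se.formula = sd(x) / sqrt(n))   # these agree for the mean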

9.5.2 Justification For The Percentile Interval

Suppose there exists a monotone transformation U = m(T) such that U ∼ N(φ, c²) where φ = m(θ). We do not suppose we know the transformation, only that one exists. Let U*_b = m(θ̂*_{n,b}). Let u*_β be the β sample quantile of the U*_b's. Since a monotone transformation preserves quantiles, we have that u*_{α/2} = m(θ̂*_{α/2}). Also, since U ∼ N(φ, c²), the α/2 quantile of U is φ − z_{α/2} c. Hence u*_{α/2} = φ − z_{α/2} c. Similarly, u*_{1−α/2} = φ + z_{α/2} c. Therefore,

    P(θ̂*_{α/2} ≤ θ ≤ θ̂*_{1−α/2}) = P(m(θ̂*_{α/2}) ≤ m(θ) ≤ m(θ̂*_{1−α/2}))
                                  = P(u*_{α/2} ≤ φ ≤ u*_{1−α/2})
                                  = P(U − c z_{α/2} ≤ φ ≤ U + c z_{α/2})
                                  = P(−z_{α/2} ≤ (U − φ)/c ≤ z_{α/2})
                                  = 1 − α.

An exact normalizing transformation will rarely exist but there may exist approximate normalizing transformations.

9.6 Exercises

1. Consider the data in Example 9.6. Find the plug-in estimate of the correlation coefficient. Estimate the standard error using the bootstrap. Find a 95 per cent confidence interval using all three methods.

2. (Computer Experiment.) Conduct a simulation to compare the three bootstrap confidence interval methods. Let n = 50 and let T(F) = ∫ (x − µ)³ dF(x)/σ³ be the skewness. Draw Y1, . . . , Yn ∼ N(0, 1) and set Xi = e^{Yi}, i = 1, . . . , n. Construct the three types of bootstrap 95 per cent intervals for T(F) from the data X1, . . . , Xn. Repeat this whole thing many times and estimate the true coverage of the three intervals.

3. Let X1, . . . , Xn ∼ t3 where n = 25. Let θ = T(F) = (q_{.75} − q_{.25})/1.34 where q_p denotes the pth quantile. Do a simulation to compare the coverage and length of the following confidence intervals for θ: (i) the Normal interval with standard error from the bootstrap, and (ii) the bootstrap percentile interval.

Remark: The jackknife does not give a consistent estimator of the variance of a quantile.

4. Let X1, . . . , Xn be distinct observations (no ties). Show that there are

    (2n − 1 choose n)

distinct bootstrap samples. Hint: Imagine putting n balls into n buckets.

5. Let X1, . . . , Xn be distinct observations (no ties). Let X1*, . . . , Xn* denote a bootstrap sample and let X̄n* = n⁻¹ Σ_{i=1}^n Xi*. Find: E(X̄n* | X1, . . . , Xn), V(X̄n* | X1, . . . , Xn), E(X̄n*) and V(X̄n*).

6. (Computer Experiment.) Let X1, . . . , Xn ∼ Normal(µ, 1). Let θ = e^µ and let θ̂ = e^X̄ be the mle. Create a data set (using µ = 5) consisting of n = 100 observations.

(a) Use the bootstrap to get the se and a 95 per cent confidence interval for θ.

(b) Plot a histogram of the bootstrap replications for the parametric and nonparametric bootstraps. These are estimates of the distribution of θ̂. Compare this to the true sampling distribution of θ̂.

7. Let X1, . . . , Xn ∼ Unif(0, θ). The mle is θ̂ = X_max = max{X1, . . . , Xn}. Generate a data set of size 50 with θ = 1.

(a) Find the distribution of θ̂. Compare the true distribution of θ̂ to the histograms from the parametric and nonparametric bootstraps.

(b) This is a case where the nonparametric bootstrap does very poorly. In fact, we can prove that this is the case. Show that, for the parametric bootstrap, P(θ̂* = θ̂) = 0, but for the nonparametric bootstrap, P(θ̂* = θ̂) ≈ .632. Hint: Show that P(θ̂* = θ̂) = 1 − (1 − (1/n))^n, then take the limit as n gets large.

8. Let Tn = X̄n², µ = E(X1), αk = ∫ |x − µ|^k dF(x) and α̂k = n⁻¹ Σ_{i=1}^n |Xi − X̄n|^k. Show that

    v_boot = 4X̄n² α̂2 / n + 4X̄n α̂3 / n² + α̂4 / n³.

Page 144: All of Statistics: A Concise Course in Statistical Inference Brief

144 9. The Bootstrap

Page 145: All of Statistics: A Concise Course in Statistical Inference Brief

10

Parametric Inference

We now turn our attention to parametric models, that is, models of the form

    F = { f(x; θ) : θ ∈ Θ }     (10.1)

where Θ ⊂ R^k is the parameter space and θ = (θ1, . . . , θk) is the parameter. The problem of inference then reduces to the problem of estimating the parameter θ.

Students learning statistics often ask: how would we ever know that the distribution that generated the data is in some parametric model? This is an excellent question. Indeed, we would rarely have such knowledge, which is why nonparametric methods are preferable. Still, studying methods for parametric models is useful for two reasons. First, there are some cases where background knowledge suggests that a parametric model provides a reasonable approximation. For example, counts of traffic accidents are known from prior experience to follow approximately a Poisson model. Second, the inferential concepts for parametric models provide background for understanding certain nonparametric methods.

We will begin with a brief discussion about parameters of interest and nuisance parameters in the next section; then we will discuss two methods for estimating θ, the method of moments and the method of maximum likelihood.

10.1 Parameter of Interest

Often, we are only interested in some function T(θ). For example, if X ∼ N(µ, σ²) then the parameter is θ = (µ, σ). If our goal is to estimate µ then µ = T(θ) is called the parameter of interest and σ is called a nuisance parameter. The parameter of interest can be a complicated function of θ, as in the following example.

Example 10.1 Let X1, . . . , Xn ∼ Normal(µ, σ²). The parameter is θ = (µ, σ) and the parameter space is Θ = {(µ, σ) : µ ∈ R, σ > 0}. Suppose that Xi is the outcome of a blood test and suppose we are interested in τ, the fraction of the population whose test score is larger than 1. Let Z denote a standard Normal random variable. Then

    τ = P(X > 1) = 1 − P(X < 1) = 1 − P((X − µ)/σ < (1 − µ)/σ)
      = 1 − P(Z < (1 − µ)/σ) = 1 − Φ((1 − µ)/σ).

The parameter of interest is τ = T(µ, σ) = 1 − Φ((1 − µ)/σ). ■

Example 10.2 Recall that X has a Gamma(α, β) distribution if

    f(x; α, β) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β},  x > 0,

where α, β > 0 and

    Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy

is the Gamma function. The parameter is θ = (α, β). The Gamma distribution is sometimes used to model lifetimes of people, animals, and electronic equipment. Suppose we want to estimate the mean lifetime. Then T(α, β) = E_θ(X1) = αβ. ■

10.2 The Method of Moments

The first method for generating parametric estimators that we will study is called the method of moments. We will see that these estimators are not optimal but they are often easy to compute. They are also useful as starting values for other methods that require iterative numerical routines.

Suppose that the parameter θ = (θ1, . . . , θk) has k components. For 1 ≤ j ≤ k, define the jth moment

    αj ≡ αj(θ) = E_θ(X^j) = ∫ x^j dF_θ(x)     (10.2)

and the jth sample moment

    α̂j = (1/n) Σ_{i=1}^n Xi^j.     (10.3)

Definition 10.3 The method of moments estimator θ̂n is defined to be the value of θ such that

    α1(θ̂n) = α̂1
    α2(θ̂n) = α̂2
        ⋮
    αk(θ̂n) = α̂k.     (10.4)

Formula (10.4) defines a system of k equations with k unknowns.

Example 10.4 Let X1, . . . , Xn ∼ Bernoulli(p). Then α1 = E_p(X) = p and α̂1 = n⁻¹ Σ_{i=1}^n Xi. By equating these we get the estimator

    p̂n = (1/n) Σ_{i=1}^n Xi. ■

Example 10.5 Let X1, . . . , Xn ∼ Normal(µ, σ²). Then α1 = E_θ(X1) = µ and α2 = E_θ(X1²) = V_θ(X1) + (E_θ(X1))² = σ² + µ². (Recall that V(X) = E(X²) − (E(X))²; hence, E(X²) = V(X) + (E(X))².) We need to solve the equations

    µ̂ = (1/n) Σ_{i=1}^n Xi
    σ̂² + µ̂² = (1/n) Σ_{i=1}^n Xi².

This is a system of 2 equations with 2 unknowns. The solution is

    µ̂ = X̄n
    σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄n)². ■
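A quick numerical check of Example 10.5 in R (not from the text); the true values µ = 2, σ = 3 and the sample size n = 1000 are arbitrary choices.

# Method of moments for the Normal model.
set.seed(12)
n <- 1000
x <- rnorm(n, mean = 2, sd = 3)
mu.hat <- mean(x)                       # solves alpha1(theta) = alpha1-hat
sigma2.hat <- mean(x^2) - mu.hat^2      # solves alpha2(theta) = alpha2-hat
c(mu.hat = mu.hat, sigma.hat = sqrt(sigma2.hat))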

Theorem 10.6 Let θ̂n denote the method of moments estimator. Under the conditions given in the appendix, the following statements hold:

1. The estimate θ̂n exists with probability tending to 1.

2. The estimate is consistent: θ̂n →P θ.

3. The estimate is asymptotically Normal:

    √n(θ̂n − θ) ⇝ N(0, Σ)

where

    Σ = g E_θ(Y Yᵀ) gᵀ,

Y = (X, X², . . . , X^k)ᵀ, g = (g1, . . . , gk) and gj = ∂α_j⁻¹(θ)/∂θ.

The last statement in the theorem above can be used to find standard errors and confidence intervals. However, there is an easier way: the bootstrap. We defer discussion of this until the end of the chapter.

10.3 Maximum Likelihood

The most common method for estimating parameters in a parametric model is the maximum likelihood method. Let X1, . . . , Xn be iid with pdf f(x; θ).

Definition 10.7 The likelihood function is defined by

    Ln(θ) = Π_{i=1}^n f(Xi; θ).     (10.5)

The log-likelihood function is defined by ℓn(θ) = log Ln(θ).

The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter θ. Thus Ln : Θ → [0, ∞). The likelihood function is not a density function: in general, it is not true that Ln(θ) integrates to 1.

Definition 10.8 The maximum likelihood estimator (mle), denoted by θ̂n, is the value of θ that maximizes Ln(θ).

The maximum of ℓn(θ) occurs at the same place as the maximum of Ln(θ), so maximizing the log-likelihood leads to the same answer as maximizing the likelihood. Often, it is easier to work with the log-likelihood.

Remark 10.9 If we multiply Ln(θ) by any positive constant c (not depending on θ) then this will not change the mle. Hence, we shall often be sloppy about dropping constants in the likelihood function.

[Figure 10.1 appears here: a plot of Ln(p) against p, with the maximum at p̂n.]

FIGURE 10.1. Likelihood function for Bernoulli with n = 20 and Σ_{i=1}^n Xi = 12. The mle is p̂n = 12/20 = 0.6.

Example 10.10 Suppose that X1, . . . , Xn ∼ Bernoulli(p). The probability function is f(x; p) = p^x (1 − p)^{1−x} for x = 0, 1. The unknown parameter is p. Then,

    Ln(p) = Π_{i=1}^n f(Xi; p) = Π_{i=1}^n p^{Xi} (1 − p)^{1−Xi} = p^S (1 − p)^{n−S}

where S = Σ_i Xi. Hence,

    ℓn(p) = S log p + (n − S) log(1 − p).

Take the derivative of ℓn(p) and set it equal to 0 to find that the mle is p̂n = S/n. See Figure 10.1. ■
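The following R sketch (not from the text) verifies Example 10.10 numerically: maximizing the Bernoulli log-likelihood gives S/n. The data, with n = 20 and S = 12, match Figure 10.1.

# Numerical maximization of the Bernoulli log-likelihood.
x <- c(rep(1, 12), rep(0, 8))              # n = 20 observations with S = 12
loglik <- function(p) sum(dbinom(x, size = 1, prob = p, log = TRUE))
opt <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
c(numerical.mle = opt$maximum, closed.form = mean(x))   # both approximately 0.6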

Example 10.11 Let X1, . . . , Xn ∼ N(µ, σ²). The parameter is θ = (µ, σ) and the likelihood function is (ignoring some constants)

    Ln(µ, σ) = Π_i (1/σ) exp{ −(1/(2σ²))(Xi − µ)² }
             = σ^{−n} exp{ −(1/(2σ²)) Σ_i (Xi − µ)² }
             = σ^{−n} exp{ −nS²/(2σ²) } exp{ −n(X̄ − µ)²/(2σ²) }

where X̄ = n⁻¹ Σ_i Xi is the sample mean and S² = n⁻¹ Σ_i (Xi − X̄)². The last equality above follows from the fact that Σ_i (Xi − µ)² = nS² + n(X̄ − µ)², which can be verified by writing Σ_i (Xi − µ)² = Σ_i (Xi − X̄ + X̄ − µ)² and then expanding the square. The log-likelihood is

    ℓ(µ, σ) = −n log σ − nS²/(2σ²) − n(X̄ − µ)²/(2σ²).

Solving the equations

    ∂ℓ(µ, σ)/∂µ = 0  and  ∂ℓ(µ, σ)/∂σ = 0

we conclude that µ̂ = X̄ and σ̂ = S. It can be verified that these are indeed global maxima of the likelihood. ■

Example 10.12 (A Hard Example.) Here is the example that confuses everyone. Let $X_1, \ldots, X_n \sim \text{Unif}(0, \theta)$. Recall that
$$f(x; \theta) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}$$
Consider a fixed value of $\theta$. Suppose $\theta < X_i$ for some $i$. Then $f(X_i; \theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_i f(X_i; \theta) = 0$. It follows that $\mathcal{L}_n(\theta) = 0$ if any $X_i > \theta$. Therefore, $\mathcal{L}_n(\theta) = 0$ if $\theta < X_{(n)}$ where $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Now consider any $\theta \ge X_{(n)}$. For every $X_i$ we then have that $f(X_i; \theta) = 1/\theta$ so that $\mathcal{L}_n(\theta) = \prod_i f(X_i; \theta) = \theta^{-n}$. In conclusion,
$$\mathcal{L}_n(\theta) = \begin{cases} (1/\theta)^n & \theta \ge X_{(n)} \\ 0 & \theta < X_{(n)}. \end{cases}$$
See Figure 10.2. Now $\mathcal{L}_n(\theta)$ is strictly decreasing over the interval $[X_{(n)}, \infty)$. Hence, $\hat\theta_n = X_{(n)}$. $\blacksquare$
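The conclusion of Example 10.12 is easy to verify numerically; a small sketch with simulated Uniform(0, 1) data follows (the grid and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)      # simulated Unif(0, theta) data with theta = 1

def likelihood(theta, x):
    # L_n(theta) = theta^(-n) if theta >= X_(n), and 0 otherwise
    return 0.0 if theta < x.max() else theta ** (-len(x))

grid = np.linspace(0.5, 2.0, 1501)
values = np.array([likelihood(t, x) for t in grid])
print(grid[np.argmax(values)], x.max())   # the maximizer sits at (essentially) X_(n)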

[FIGURE 10.2. Likelihood function for Uniform$(0, \theta)$. The vertical lines show the observed data. The first three plots show $f(x; \theta)$ for three different values of $\theta$ ($\theta = 0.75$, $\theta = 1$ and $\theta = 1.25$). When $\theta < X_{(n)} = \max\{X_1, \ldots, X_n\}$, as in the first plot, $f(X_{(n)}; \theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta) = 0$. Otherwise $f(X_i; \theta) = 1/\theta$ for each $i$ and hence $\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta) = (1/\theta)^n$. The last plot shows the likelihood function.]

10.4 Properties of Maximum Likelihood Estimators

Under certain conditions on the model, the maximum likelihood estimator $\hat\theta_n$ possesses many properties that make it an appealing choice of estimator. The main properties of the mle are:

(1) It is consistent: $\hat\theta_n \xrightarrow{P} \theta_\star$ where $\theta_\star$ denotes the true value of the parameter $\theta$;

(2) It is equivariant: if $\hat\theta_n$ is the mle of $\theta$ then $g(\hat\theta_n)$ is the mle of $g(\theta)$;

(3) It is asymptotically Normal: $(\hat\theta_n - \theta_\star)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$, where $\widehat{\mathrm{se}}$ can be computed analytically;

(4) It is asymptotically optimal or efficient: roughly, this means that among all well-behaved estimators, the mle has the smallest variance, at least for large samples;

(5) The mle is approximately the Bayes estimator. (To be explained later.)

We will spend some time explaining what these properties mean and why they are good things. In sufficiently complicated problems, these properties will no longer hold and the mle will no longer be a good estimator. For now we focus on the simpler situations where the mle works well. The properties we discuss only hold if the model satisfies certain regularity conditions. These are essentially smoothness conditions on $f(x; \theta)$. Unless otherwise stated, we shall tacitly assume that these conditions hold.

10.5 Consistency of Maximum Likelihood Estimators

Consistency means that the mle converges in probability to the true value. To proceed, we need a definition. If $f$ and $g$ are pdf's, define the Kullback-Leibler distance between $f$ and $g$ to be
$$D(f, g) = \int f(x)\log\left(\frac{f(x)}{g(x)}\right)dx. \qquad (10.6)$$
It can be shown that $D(f, g) \ge 0$ and $D(f, f) = 0$. (This is not a distance in the formal sense.) For any $\theta, \psi \in \Theta$ write $D(\theta, \psi)$ to mean $D(f(x; \theta), f(x; \psi))$. We will assume that $\theta \ne \psi$ implies that $D(\theta, \psi) > 0$.

Let $\theta_\star$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is equivalent to maximizing
$$M_n(\theta) = \frac{1}{n}\sum_i \log\frac{f(X_i; \theta)}{f(X_i; \theta_\star)}.$$
By the law of large numbers, $M_n(\theta)$ converges to
$$\mathbb{E}_{\theta_\star}\left(\log\frac{f(X_i; \theta)}{f(X_i; \theta_\star)}\right) = \int \log\left(\frac{f(x; \theta)}{f(x; \theta_\star)}\right)f(x; \theta_\star)\,dx = -\int \log\left(\frac{f(x; \theta_\star)}{f(x; \theta)}\right)f(x; \theta_\star)\,dx = -D(\theta_\star, \theta).$$
Hence, $M_n(\theta) \approx -D(\theta_\star, \theta)$, which is maximized at $\theta_\star$ since $-D(\theta_\star, \theta_\star) = 0$ and $-D(\theta_\star, \theta) < 0$ for $\theta \ne \theta_\star$. Hence, we expect that the maximizer will tend to $\theta_\star$. To prove this formally, we need more than $M_n(\theta) \xrightarrow{P} -D(\theta_\star, \theta)$. We need this convergence to be uniform over $\theta$. We also have to make sure that the function $D(\theta_\star, \theta)$ is well behaved. Here are the formal details.

Theorem 10.13 Let $\theta_\star$ denote the true value of $\theta$. Define
$$M_n(\theta) = \frac{1}{n}\sum_i \log\frac{f(X_i; \theta)}{f(X_i; \theta_\star)}$$
and $M(\theta) = -D(\theta_\star, \theta)$. Suppose that
$$\sup_{\theta\in\Theta}|M_n(\theta) - M(\theta)| \xrightarrow{P} 0 \qquad (10.7)$$
and that, for every $\epsilon > 0$,
$$\sup_{\theta:\,|\theta - \theta_\star| \ge \epsilon} M(\theta) < M(\theta_\star). \qquad (10.8)$$
Let $\hat\theta_n$ denote the mle. Then $\hat\theta_n \xrightarrow{P} \theta_\star$.

Proof. See appendix. $\blacksquare$

10.6 Equivariance of the mle

Theorem 10.14 Let $\tau = g(\theta)$ be a one-to-one function of $\theta$. Let $\hat\theta_n$ be the mle of $\theta$. Then $\hat\tau_n = g(\hat\theta_n)$ is the mle of $\tau$.

Proof. Let $h = g^{-1}$ denote the inverse of $g$. Then $\hat\theta_n = h(\hat\tau_n)$. For any $\tau$, $\mathcal{L}_n(\tau) = \prod_i f(x_i; h(\tau)) = \prod_i f(x_i; \theta) = \mathcal{L}_n(\theta)$ where $\theta = h(\tau)$. Hence, for any $\tau$, $\mathcal{L}_n(\tau) = \mathcal{L}_n(\theta) \le \mathcal{L}_n(\hat\theta) = \mathcal{L}_n(\hat\tau)$. $\blacksquare$

Example 10.15 Let $X_1, \ldots, X_n \sim N(\theta, 1)$. The mle for $\theta$ is $\hat\theta_n = \overline{X}_n$. Let $\tau = e^\theta$. Then, the mle for $\tau$ is $\hat\tau = e^{\hat\theta} = e^{\overline{X}}$. $\blacksquare$

10.7 Asymptotic Normality

It turns out that $\hat\theta_n$ is approximately Normal and we can compute its variance analytically. To explore this, we first need a few definitions.

Definition 10.16 The score function is defined to be
$$s(X; \theta) = \frac{\partial\log f(X; \theta)}{\partial\theta}. \qquad (10.9)$$
The Fisher information is defined to be
$$I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i; \theta)\right) = \sum_{i=1}^n \mathbb{V}_\theta\left(s(X_i; \theta)\right). \qquad (10.10)$$
For $n = 1$ we will sometimes write $I(\theta)$ instead of $I_1(\theta)$.

It can be shown that $\mathbb{E}_\theta(s(X; \theta)) = 0$. It then follows that $\mathbb{V}_\theta(s(X; \theta)) = \mathbb{E}_\theta(s^2(X; \theta))$. In fact a further simplification of $I_n(\theta)$ is given in the next result.

Theorem 10.17 $I_n(\theta) = nI(\theta)$ and
$$I(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2\log f(X; \theta)}{\partial\theta^2}\right) = -\int\left(\frac{\partial^2\log f(x; \theta)}{\partial\theta^2}\right)f(x; \theta)\,dx. \qquad (10.11)$$

Theorem 10.18 (Asymptotic Normality of the mle.) Under appropriate regularity conditions, the following hold:

1. Let $\mathrm{se} = \sqrt{1/I_n(\theta)}$. Then,
$$\frac{\hat\theta_n - \theta}{\mathrm{se}} \rightsquigarrow N(0, 1). \qquad (10.12)$$

2. Let $\widehat{\mathrm{se}} = \sqrt{1/I_n(\hat\theta_n)}$. Then,
$$\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1). \qquad (10.13)$$

The proof is in the appendix. The first statement says that $\hat\theta_n \approx N(\theta, \mathrm{se}^2)$ where the standard error of $\hat\theta_n$ is $\mathrm{se} = \sqrt{1/I_n(\theta)}$. The second statement says that this is still true even if we replace the standard error by its estimated standard error $\widehat{\mathrm{se}} = \sqrt{1/I_n(\hat\theta_n)}$. Informally, the theorem says that the distribution of the mle can be approximated with $N(\theta, \widehat{\mathrm{se}}^2)$. From this fact we can construct an (asymptotic) confidence interval.

Theorem 10.19 Let
$$C_n = \left(\hat\theta_n - z_{\alpha/2}\,\widehat{\mathrm{se}},\ \hat\theta_n + z_{\alpha/2}\,\widehat{\mathrm{se}}\right).$$
Then, $\mathbb{P}_\theta(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.

Proof. Let $Z$ denote a standard normal random variable. Then,
$$\mathbb{P}_\theta(\theta \in C_n) = \mathbb{P}_\theta\left(\hat\theta_n - z_{\alpha/2}\,\widehat{\mathrm{se}} \le \theta \le \hat\theta_n + z_{\alpha/2}\,\widehat{\mathrm{se}}\right) = \mathbb{P}_\theta\left(-z_{\alpha/2} \le \frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} \le z_{\alpha/2}\right) \to \mathbb{P}(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha. \quad\blacksquare$$

For $\alpha = .05$, $z_{\alpha/2} = 1.96 \approx 2$, so
$$\hat\theta_n \pm 2\,\widehat{\mathrm{se}} \qquad (10.14)$$
is an approximate 95 percent confidence interval.

When you read an opinion poll in the newspaper, you often see a statement like: the poll is accurate to within one point, 95 percent of the time. They are simply giving a 95 percent confidence interval of the form $\hat\theta_n \pm 2\,\widehat{\mathrm{se}}$.

Example 10.20 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The mle is $\hat{p}_n = \sum_i X_i/n$ and $f(x; p) = p^x(1-p)^{1-x}$, $\log f(x; p) = x\log p + (1-x)\log(1-p)$, $s(X; p) = \frac{X}{p} - \frac{1-X}{1-p}$, and $-s'(X; p) = \frac{X}{p^2} + \frac{1-X}{(1-p)^2}$. Thus,
$$I(p) = \mathbb{E}_p(-s'(X; p)) = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}.$$
Hence,
$$\widehat{\mathrm{se}} = \frac{1}{\sqrt{I_n(\hat{p}_n)}} = \frac{1}{\sqrt{nI(\hat{p}_n)}} = \left(\frac{\hat{p}(1-\hat{p})}{n}\right)^{1/2}.$$
An approximate 95 percent confidence interval is
$$\hat{p}_n \pm 2\left(\frac{\hat{p}(1-\hat{p})}{n}\right)^{1/2}.$$
Compare this with the Hoeffding interval. $\blacksquare$
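Here is a minimal sketch of the interval of Example 10.20 in Python (the data are simulated; NumPy is assumed):

import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=200)                 # simulated Bernoulli(p) data with p = 0.3
n = len(x)

p_hat = x.mean()                                   # the mle
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)          # 1 / sqrt(n I(p_hat))

print(p_hat, (p_hat - 2 * se_hat, p_hat + 2 * se_hat))   # approximate 95 percent interval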

Example 10.21 Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known. The score function is $s(X; \theta) = (X - \theta)/\sigma^2$ and $s'(X; \theta) = -1/\sigma^2$ so that $I_1(\theta) = 1/\sigma^2$. The mle is $\hat\theta_n = \overline{X}_n$. According to Theorem 10.18, $\hat\theta_n \approx N(\theta, \sigma^2/n)$. In fact, in this case, the distribution is exact. $\blacksquare$

Example 10.22 Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Then $\hat\lambda_n = \overline{X}_n$ and some calculations show that $I_1(\lambda) = 1/\lambda$, so
$$\widehat{\mathrm{se}} = \frac{1}{\sqrt{nI_1(\hat\lambda_n)}} = \sqrt{\frac{\hat\lambda_n}{n}}.$$
Therefore, an approximate $1 - \alpha$ confidence interval for $\lambda$ is $\hat\lambda_n \pm z_{\alpha/2}\sqrt{\hat\lambda_n/n}$. $\blacksquare$

10.8 Optimality

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The mle is $\hat\theta_n = \overline{X}_n$. Another reasonable estimator is the sample median $\tilde\theta_n$. The mle satisfies
$$\sqrt{n}(\hat\theta_n - \theta) \rightsquigarrow N(0, \sigma^2).$$
It can be proved that the median satisfies
$$\sqrt{n}(\tilde\theta_n - \theta) \rightsquigarrow N\left(0, \frac{\pi\sigma^2}{2}\right).$$
This means that the median converges to the right value but has a larger variance than the mle.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that
$$\sqrt{n}(T_n - \theta) \rightsquigarrow N(0, t^2)$$
and that
$$\sqrt{n}(U_n - \theta) \rightsquigarrow N(0, u^2).$$
We define the asymptotic relative efficiency of $U$ to $T$ by $\text{ARE}(U, T) = t^2/u^2$. In the Normal example, $\text{ARE}(\tilde\theta_n, \hat\theta_n) = 2/\pi = .63$. The interpretation is that if you use the median, you are only using about 63 percent of the available data.

Theorem 10.23 If $\hat\theta_n$ is the mle and $\tilde\theta_n$ is any other estimator then
$$\text{ARE}(\tilde\theta_n, \hat\theta_n) \le 1.$$
(The result is actually more subtle than this but we needn't worry about the details.) Thus, the mle has the smallest (asymptotic) variance and we say that the mle is efficient or asymptotically optimal.

This result is predicated upon the assumed model being correct. If the model is wrong, the mle may no longer be optimal. We will discuss optimality in more generality when we discuss decision theory.

10.9 The Delta Method

Let $\tau = g(\theta)$ where $g$ is a smooth function. The maximum likelihood estimator of $\tau$ is $\hat\tau = g(\hat\theta)$. Now we address the following question: what is the distribution of $\hat\tau$?

Theorem 10.24 (The Delta Method) If $\tau = g(\theta)$ where $g$ is differentiable and $g'(\theta) \ne 0$ then
$$\frac{\hat\tau_n - \tau}{\widehat{\mathrm{se}}(\hat\tau_n)} \rightsquigarrow N(0, 1) \qquad (10.15)$$
where $\hat\tau_n = g(\hat\theta_n)$ and
$$\widehat{\mathrm{se}}(\hat\tau_n) = |g'(\hat\theta_n)|\,\widehat{\mathrm{se}}(\hat\theta_n). \qquad (10.16)$$
Hence, if
$$C_n = \left(\hat\tau_n - z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\tau_n),\ \hat\tau_n + z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\tau_n)\right) \qquad (10.17)$$
then $\mathbb{P}_\tau(\tau \in C_n) \to 1 - \alpha$ as $n \to \infty$.

Example 10.25 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\psi = g(p) = \log(p/(1-p))$. The Fisher information function is $I(p) = 1/(p(1-p))$ so the standard error of the mle $\hat{p}_n$ is $\widehat{\mathrm{se}} = \{\hat{p}_n(1-\hat{p}_n)/n\}^{1/2}$. The mle of $\psi$ is $\hat\psi = \log(\hat{p}/(1-\hat{p}))$. Since $g'(p) = 1/(p(1-p))$, according to the delta method
$$\widehat{\mathrm{se}}(\hat\psi_n) = |g'(\hat{p}_n)|\,\widehat{\mathrm{se}}(\hat{p}_n) = \frac{1}{\sqrt{n\hat{p}_n(1-\hat{p}_n)}}.$$
An approximate 95 percent confidence interval is
$$\hat\psi_n \pm \frac{2}{\sqrt{n\hat{p}_n(1-\hat{p}_n)}}. \quad\blacksquare$$
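A short sketch of the delta-method interval for the log-odds of Example 10.25 (again with simulated data; NumPy assumed):

import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)                # simulated Bernoulli data
n, p_hat = len(x), x.mean()

psi_hat = np.log(p_hat / (1 - p_hat))             # mle of psi = log(p/(1-p)), by equivariance
se_psi = 1.0 / np.sqrt(n * p_hat * (1 - p_hat))   # |g'(p_hat)| * se(p_hat)

print(psi_hat, (psi_hat - 2 * se_psi, psi_hat + 2 * se_psi))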

Example 10.26 Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Suppose that $\mu$ is known, $\sigma$ is unknown and that we want to estimate $\psi = \log\sigma$. The log-likelihood is $\ell(\sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$. Differentiate, set equal to 0 and conclude that
$$\hat\sigma = \left(\frac{\sum_i (X_i - \mu)^2}{n}\right)^{1/2}.$$
To get the standard error we need the Fisher information. First,
$$\log f(X; \sigma) = -\log\sigma - \frac{(X - \mu)^2}{2\sigma^2}$$
with second derivative
$$\frac{1}{\sigma^2} - \frac{3(X - \mu)^2}{\sigma^4}$$
and hence
$$I(\sigma) = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}.$$
Hence, $\widehat{\mathrm{se}} = \hat\sigma_n/\sqrt{2n}$. Let $\psi = g(\sigma) = \log\sigma$. Then $\hat\psi_n = \log\hat\sigma_n$. Since $g' = 1/\sigma$,
$$\widehat{\mathrm{se}}(\hat\psi_n) = \frac{1}{\hat\sigma}\cdot\frac{\hat\sigma}{\sqrt{2n}} = \frac{1}{\sqrt{2n}}$$
and an approximate 95 percent confidence interval is $\hat\psi_n \pm 2/\sqrt{2n}$. $\blacksquare$

10.10 Multiparameter Models

These ideas can directly be extended to models with several parameters. Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the mle. Let $\ell_n = \sum_{i=1}^n \log f(X_i; \theta)$,
$$H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2} \quad\text{and}\quad H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\,\partial\theta_k}.$$
Define the Fisher Information Matrix by
$$I_n(\theta) = -\begin{bmatrix} \mathbb{E}_\theta(H_{11}) & \mathbb{E}_\theta(H_{12}) & \cdots & \mathbb{E}_\theta(H_{1k}) \\ \mathbb{E}_\theta(H_{21}) & \mathbb{E}_\theta(H_{22}) & \cdots & \mathbb{E}_\theta(H_{2k}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E}_\theta(H_{k1}) & \mathbb{E}_\theta(H_{k2}) & \cdots & \mathbb{E}_\theta(H_{kk}) \end{bmatrix}. \qquad (10.18)$$
Let $J_n(\theta) = I_n^{-1}(\theta)$ be the inverse of $I_n$.

Theorem 10.27 Under appropriate regularity conditions,
$$(\hat\theta - \theta) \approx N(0, J_n).$$
Also, if $\hat\theta_j$ is the $j$th component of $\hat\theta$, then
$$\frac{\hat\theta_j - \theta_j}{\widehat{\mathrm{se}}_j} \rightsquigarrow N(0, 1)$$
where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ is the $j$th diagonal element of $J_n$. The approximate covariance of $\hat\theta_j$ and $\hat\theta_k$ is $\text{Cov}(\hat\theta_j, \hat\theta_k) \approx J_n(j, k)$.

There is also a multiparameter delta method. Let $\tau = g(\theta_1, \ldots, \theta_k)$ be a function and let
$$\nabla g = \begin{pmatrix} \frac{\partial g}{\partial\theta_1} \\ \vdots \\ \frac{\partial g}{\partial\theta_k} \end{pmatrix}$$
be the gradient of $g$.

Theorem 10.28 (Multiparameter delta method) Suppose that $\nabla g$ evaluated at $\hat\theta$ is not 0. Let $\hat\tau = g(\hat\theta)$. Then
$$\frac{\hat\tau - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \rightsquigarrow N(0, 1)$$
where
$$\widehat{\mathrm{se}}(\hat\tau) = \sqrt{(\hat\nabla g)^T\,\hat{J}_n\,(\hat\nabla g)}, \qquad (10.19)$$
$\hat{J}_n = J_n(\hat\theta_n)$ and $\hat\nabla g$ is $\nabla g$ evaluated at $\theta = \hat\theta$.

Example 10.29 Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let $\tau = g(\mu, \sigma) = \sigma/\mu$. In homework question 8 you will show that
$$I_n(\mu, \sigma) = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{2n}{\sigma^2} \end{pmatrix}.$$
Hence,
$$J_n = I_n^{-1}(\mu, \sigma) = \frac{1}{n}\begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix}.$$
The gradient of $g$ is
$$\nabla g = \begin{pmatrix} -\frac{\sigma}{\mu^2} \\ \frac{1}{\mu} \end{pmatrix}.$$
Thus,
$$\widehat{\mathrm{se}}(\hat\tau) = \sqrt{(\hat\nabla g)^T\,\hat{J}_n\,(\hat\nabla g)} = \frac{1}{\sqrt{n}}\left(\frac{\hat\sigma^4}{\hat\mu^4} + \frac{\hat\sigma^2}{2\hat\mu^2}\right)^{1/2}. \quad\blacksquare$$

10.11 The Parametric Bootstrap

For parametric models, standard errors and confidence intervals may also be estimated using the bootstrap. There is only one change. In the nonparametric bootstrap, we sampled $X_1^*, \ldots, X_n^*$ from the empirical distribution $\hat{F}_n$. In the parametric bootstrap we sample instead from $f(x; \hat\theta_n)$. Here, $\hat\theta_n$ could be the mle or the method of moments estimator.

Example 10.30 Consider Example 10.29. To get the bootstrap standard error, simulate $X_1^*, \ldots, X_n^* \sim N(\hat\mu, \hat\sigma^2)$, compute $\hat\mu^* = n^{-1}\sum_i X_i^*$ and $\hat\sigma^{2*} = n^{-1}\sum_i (X_i^* - \hat\mu^*)^2$. Then compute $\hat\tau^* = g(\hat\mu^*, \hat\sigma^*) = \hat\sigma^*/\hat\mu^*$. Repeating this $B$ times yields bootstrap replications $\hat\tau_1^*, \ldots, \hat\tau_B^*$ and the estimated standard error is
$$\widehat{\mathrm{se}}_{\text{boot}} = \sqrt{\frac{\sum_{b=1}^B (\hat\tau_b^* - \hat\tau)^2}{B}}. \quad\blacksquare$$
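The parametric bootstrap of Example 10.30 translates almost line by line into code. The following is a sketch only (the data are simulated and the sample size and B = 2000 replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=100)      # the "observed" data, simulated here
n = len(x)

mu_hat, sigma_hat = x.mean(), x.std()             # mle's (np.std divides by n)
tau_hat = sigma_hat / mu_hat                      # tau = g(mu, sigma) = sigma / mu

B = 2000
tau_star = np.empty(B)
for b in range(B):
    xs = rng.normal(mu_hat, sigma_hat, size=n)    # sample from f(x; theta_hat)
    tau_star[b] = xs.std() / xs.mean()            # tau_hat^* for this replication

se_boot = np.sqrt(np.mean((tau_star - tau_hat) ** 2))
print(tau_hat, se_boot)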

The bootstrap is much easier than the delta method. On the

other hand, the delta method has the advantage that it gives a

closed form expression for the standard error.

10.12 Technical Appendix

10.12.1 Proofs

Proof of Theorem 10.13. Since $\hat\theta_n$ maximizes $M_n(\theta)$, we have $M_n(\hat\theta_n) \ge M_n(\theta_\star)$. Hence,
$$M(\theta_\star) - M(\hat\theta_n) = M_n(\theta_\star) - M(\hat\theta_n) + M(\theta_\star) - M_n(\theta_\star) \le M_n(\hat\theta_n) - M(\hat\theta_n) + M(\theta_\star) - M_n(\theta_\star) \le \sup_\theta|M_n(\theta) - M(\theta)| + M(\theta_\star) - M_n(\theta_\star) \xrightarrow{P} 0$$
where the last step follows from (10.7). It follows that, for any $\delta > 0$,
$$\mathbb{P}\left(M(\hat\theta_n) < M(\theta_\star) - \delta\right) \to 0.$$
Pick any $\epsilon > 0$. By (10.8), there exists $\delta > 0$ such that $|\theta - \theta_\star| \ge \epsilon$ implies that $M(\theta) < M(\theta_\star) - \delta$. Hence,
$$\mathbb{P}(|\hat\theta_n - \theta_\star| > \epsilon) \le \mathbb{P}\left(M(\hat\theta_n) < M(\theta_\star) - \delta\right) \to 0. \quad\blacksquare$$

Next we want to prove Theorem 10.18. First we need a Lemma.

Lemma 10.31 The score function satisfies
$$\mathbb{E}_\theta[s(X; \theta)] = 0.$$
Proof. Note that $1 = \int f(x; \theta)\,dx$. Differentiate both sides of this equation to conclude that
$$0 = \frac{\partial}{\partial\theta}\int f(x; \theta)\,dx = \int \frac{\partial}{\partial\theta}f(x; \theta)\,dx = \int \frac{\partial f(x; \theta)/\partial\theta}{f(x; \theta)}f(x; \theta)\,dx = \int \frac{\partial\log f(x; \theta)}{\partial\theta}f(x; \theta)\,dx = \int s(x; \theta)f(x; \theta)\,dx = \mathbb{E}_\theta[s(X; \theta)]. \quad\blacksquare$$

Proof of Theorem 10.18. Let $\ell(\theta) = \log\mathcal{L}(\theta)$. Then,
$$0 = \ell'(\hat\theta) \approx \ell'(\theta) + (\hat\theta - \theta)\ell''(\theta).$$
Rearrange the above equation to get $\hat\theta - \theta = -\ell'(\theta)/\ell''(\theta)$ or, in other words,
$$\sqrt{n}(\hat\theta - \theta) = \frac{\frac{1}{\sqrt{n}}\ell'(\theta)}{-\frac{1}{n}\ell''(\theta)} \equiv \frac{\text{TOP}}{\text{BOTTOM}}.$$
Let $Y_i = \partial\log f(X_i; \theta)/\partial\theta$. Recall that $\mathbb{E}(Y_i) = 0$ from the previous Lemma and also $\mathbb{V}(Y_i) = I(\theta)$. Hence,
$$\text{TOP} = n^{-1/2}\sum_i Y_i = \sqrt{n}\,\overline{Y} = \sqrt{n}(\overline{Y} - 0) \rightsquigarrow W \sim N(0, I(\theta))$$
by the central limit theorem. Let $A_i = -\partial^2\log f(X_i; \theta)/\partial\theta^2$. Then $\mathbb{E}(A_i) = I(\theta)$ and
$$\text{BOTTOM} = \overline{A} \xrightarrow{P} I(\theta)$$
by the law of large numbers. Apply Theorem 6.5 part (e) to conclude that
$$\sqrt{n}(\hat\theta - \theta) \rightsquigarrow \frac{W}{I(\theta)} \stackrel{d}{=} N\left(0, \frac{1}{I(\theta)}\right).$$
Assuming that $I(\theta)$ is a continuous function of $\theta$, it follows that $I(\hat\theta_n) \xrightarrow{P} I(\theta)$. Now
$$\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} = \sqrt{n}\,I^{1/2}(\hat\theta_n)(\hat\theta_n - \theta) = \left\{\sqrt{n}\,I^{1/2}(\theta)(\hat\theta_n - \theta)\right\}\left\{\frac{I(\hat\theta_n)}{I(\theta)}\right\}^{1/2}.$$
The first term tends in distribution to $N(0, 1)$. The second term tends in probability to 1. The result follows from Theorem 6.5 part (e). $\blacksquare$

Outline of Proof of Theorem 10.24. Write,
$$\hat\tau = g(\hat\theta) \approx g(\theta) + (\hat\theta - \theta)g'(\theta) = \tau + (\hat\theta - \theta)g'(\theta).$$
Thus,
$$\sqrt{n}(\hat\tau - \tau) \approx \sqrt{n}(\hat\theta - \theta)g'(\theta)$$
and hence
$$\frac{\sqrt{nI(\theta)}(\hat\tau - \tau)}{g'(\theta)} \approx \sqrt{nI(\theta)}(\hat\theta - \theta).$$
Theorem 10.18 tells us that the right hand side tends in distribution to a $N(0, 1)$. Hence,
$$\frac{\sqrt{nI(\theta)}(\hat\tau - \tau)}{g'(\theta)} \rightsquigarrow N(0, 1)$$
or, in other words, $\hat\tau \approx N(\tau, \mathrm{se}^2(\hat\tau_n))$ where
$$\mathrm{se}^2(\hat\tau_n) = \frac{(g'(\theta))^2}{nI(\theta)}.$$
The result remains true if we substitute $\hat\theta$ for $\theta$ by Theorem 6.5 part (e). $\blacksquare$

10.12.2 Sufficiency

A statistic is a function $T(X^n)$ of the data. A sufficient statistic is a statistic that contains all the information in the data. To make this more formal, we need some definitions.

Definition 10.32 Write $x^n \leftrightarrow y^n$ if $f(x^n; \theta) = c\,f(y^n; \theta)$ for some constant $c$ that might depend on $x^n$ and $y^n$ but not $\theta$. A statistic $T(x^n)$ is sufficient if $T(x^n) = T(y^n)$ implies that $x^n \leftrightarrow y^n$.

Notice that if $x^n \leftrightarrow y^n$ then the likelihood function based on $x^n$ has the same shape as the likelihood function based on $y^n$. Roughly speaking, a statistic is sufficient if we can calculate the likelihood function knowing only $T(X^n)$.

Example 10.33 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then $\mathcal{L}(p) = p^S(1-p)^{n-S}$ where $S = \sum_i X_i$, so $S$ is sufficient. $\blacksquare$

Example 10.34 Let $X_1, \ldots, X_n \sim N(\mu, \sigma)$ and let $T = (\overline{X}, S)$. Then,
$$f(X^n; \mu, \sigma) = \prod_i f(X_i; \mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2\right\} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left\{-\frac{nS^2}{2\sigma^2}\right\}\exp\left\{-\frac{n(\overline{X} - \mu)^2}{2\sigma^2}\right\}.$$

The last expression depends on the data only through $T$ and therefore $T = (\overline{X}, S)$ is a sufficient statistic. Note that $U = (17\overline{X}, S)$ is also a sufficient statistic. If I tell you the value of $U$ then you can easily figure out $T$ and then compute the likelihood. Sufficient statistics are far from unique. Consider the following statistics for the $N(\mu, \sigma^2)$ model:
$$T_1(X^n) = (X_1, \ldots, X_n) \qquad T_2(X^n) = (\overline{X}, S) \qquad T_3(X^n) = \overline{X} \qquad T_4(X^n) = (\overline{X}, S, X_3).$$
The first statistic is just the whole data set. This is sufficient. The second is also sufficient, as we proved above. The third is not sufficient: you can't compute $\mathcal{L}(\mu, \sigma)$ if I only tell you $\overline{X}$. The fourth statistic $T_4$ is sufficient. The statistics $T_1$ and $T_4$ are sufficient but they contain redundant information. Intuitively, there is a sense in which $T_2$ is a "more concise" sufficient statistic than either $T_1$ or $T_4$. We can express this formally by noting that $T_2$ is a function of $T_1$ and similarly, $T_2$ is a function of $T_4$. For example, $T_2 = g(T_4)$ where $g(a_1, a_2, a_3) = (a_1, a_2)$. $\blacksquare$

Definition 10.35 A statistic $T$ is minimally sufficient if (i) it is sufficient and (ii) it is a function of every other sufficient statistic.

Theorem 10.36 $T$ is minimal sufficient if $T(x^n) = T(y^n)$ if and only if $x^n \leftrightarrow y^n$.

A statistic induces a partition on the set of outcomes. We can think of sufficiency in terms of these partitions.

Example 10.37 Let $X_1, X_2 \sim \text{Bernoulli}(\theta)$. Let $V = X_1$, $T = \sum_i X_i$ and $U = (T, X_1)$. Here is the set of outcomes and the statistics:

X_1  X_2    V    T     U
 0    0     0    0   (0,0)
 0    1     0    1   (1,0)
 1    0     1    1   (1,1)
 1    1     1    2   (2,1)

The partitions induced by these statistics are:

V: {(0,0),(0,1)}, {(1,0),(1,1)}
T: {(0,0)}, {(0,1),(1,0)}, {(1,1)}
U: {(0,0)}, {(0,1)}, {(1,0)}, {(1,1)}.

Then $V$ is not sufficient but $T$ and $U$ are sufficient. $T$ is minimal sufficient; $U$ is not minimal since if $x^n = (1, 0)$ and $y^n = (0, 1)$, then $x^n \leftrightarrow y^n$ yet $U(x^n) \ne U(y^n)$. The statistic $W = 17T$ generates the same partition as $T$. It is also minimal sufficient. $\blacksquare$

Example 10.38 For a $N(\mu, \sigma^2)$ model, $T = (\overline{X}, S)$ is a minimally sufficient statistic. For the Bernoulli model, $T = \sum_i X_i$ is a minimally sufficient statistic. For the Poisson model, $T = \sum_i X_i$ is a minimally sufficient statistic. Check that $T = (\sum_i X_i, X_1)$ is sufficient but not minimal sufficient. Check that $T = X_1$ is not sufficient. $\blacksquare$

I did not give the usual definition of sufficiency. The usual definition is this: $T$ is sufficient if the distribution of $X^n$ given $T(X^n) = t$ does not depend on $\theta$.

Example 10.39 Two coin flips. Let $X = (X_1, X_2) \sim \text{Bernoulli}(p)$. Then $T = X_1 + X_2$ is sufficient. To see this, we need the distribution of $(X_1, X_2)$ given $T = t$. Since $T$ can take 3 possible values, there are 3 conditional distributions to check. They are:

(i) the distribution of $(X_1, X_2)$ given $T = 0$:
$$\mathbb{P}(X_1 = 0, X_2 = 0 \mid T = 0) = 1, \quad \mathbb{P}(X_1 = 0, X_2 = 1 \mid T = 0) = 0,$$
$$\mathbb{P}(X_1 = 1, X_2 = 0 \mid T = 0) = 0, \quad \mathbb{P}(X_1 = 1, X_2 = 1 \mid T = 0) = 0;$$

(ii) the distribution of $(X_1, X_2)$ given $T = 1$:
$$\mathbb{P}(X_1 = 0, X_2 = 0 \mid T = 1) = 0, \quad \mathbb{P}(X_1 = 0, X_2 = 1 \mid T = 1) = \tfrac{1}{2},$$
$$\mathbb{P}(X_1 = 1, X_2 = 0 \mid T = 1) = \tfrac{1}{2}, \quad \mathbb{P}(X_1 = 1, X_2 = 1 \mid T = 1) = 0;$$

(iii) the distribution of $(X_1, X_2)$ given $T = 2$:
$$\mathbb{P}(X_1 = 0, X_2 = 0 \mid T = 2) = 0, \quad \mathbb{P}(X_1 = 0, X_2 = 1 \mid T = 2) = 0,$$
$$\mathbb{P}(X_1 = 1, X_2 = 0 \mid T = 2) = 0, \quad \mathbb{P}(X_1 = 1, X_2 = 1 \mid T = 2) = 1.$$

None of these depend on the parameter $p$. Thus, the distribution of $(X_1, X_2) \mid T$ does not depend on $p$, so $T$ is sufficient. $\blacksquare$

Theorem 10.40 (Factorization Theorem) $T$ is sufficient if and only if there are functions $g(t, \theta)$ and $h(x)$ such that $f(x^n; \theta) = g(t(x^n); \theta)h(x^n)$.

Example 10.41 Return to the two coin flips. Let $t = x_1 + x_2$. Then
$$f(x_1, x_2; \theta) = f(x_1; \theta)f(x_2; \theta) = \theta^{x_1}(1-\theta)^{1-x_1}\theta^{x_2}(1-\theta)^{1-x_2} = g(t; \theta)h(x_1, x_2)$$
where $g(t; \theta) = \theta^t(1-\theta)^{2-t}$ and $h(x_1, x_2) = 1$. Therefore, $T = X_1 + X_2$ is sufficient. $\blacksquare$

Now we discuss an implication of sufficiency in point estimation. Let $\hat\theta$ be an estimator of $\theta$. The Rao-Blackwell theorem says that an estimator should only depend on the sufficient statistic, otherwise it can be improved. Let $R(\theta, \hat\theta) = \mathbb{E}_\theta[(\theta - \hat\theta)^2]$ denote the mse of the estimator.

Theorem 10.42 (Rao-Blackwell) Let $\hat\theta$ be an estimator and let $T$ be a sufficient statistic. Define a new estimator by
$$\tilde\theta = \mathbb{E}(\hat\theta \mid T).$$
Then, for every $\theta$, $R(\theta, \tilde\theta) \le R(\theta, \hat\theta)$.

Example 10.43 Consider flipping a coin twice. Let $\hat\theta = X_1$. This is a well defined (and unbiased) estimator. But it is not a function of the sufficient statistic $T = X_1 + X_2$. However, note that $\tilde\theta = \mathbb{E}(X_1 \mid T) = (X_1 + X_2)/2$. By the Rao-Blackwell Theorem, $\tilde\theta$ has mse at least as small as $\hat\theta = X_1$. The same applies with $n$ coin flips. Again define $\hat\theta = X_1$ and $T = \sum_i X_i$. Then $\tilde\theta = \mathbb{E}(X_1 \mid T) = n^{-1}\sum_i X_i$ has improved mse. $\blacksquare$

10.12.3 Exponential Families

Most of the parametric models we have studied so far are special cases of a general class of models called exponential families. We say that $\{f(x; \theta) : \theta \in \Theta\}$ is a one-parameter exponential family if there are functions $\eta(\theta)$, $B(\theta)$, $T(x)$ and $h(x)$ such that
$$f(x; \theta) = h(x)e^{\eta(\theta)T(x) - B(\theta)}.$$
It is easy to see that $T(X)$ is sufficient. We call $T$ the natural sufficient statistic.

Example 10.44 Let $X \sim \text{Poisson}(\lambda)$. Then
$$f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}e^{x\log\lambda - \lambda}$$
and hence, this is an exponential family with $\eta(\lambda) = \log\lambda$, $B(\lambda) = \lambda$, $T(x) = x$, $h(x) = 1/x!$. $\blacksquare$

Example 10.45 Let $X \sim \text{Bin}(n, \theta)$. Then
$$f(x; \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{n}{x}\exp\left\{x\log\left(\frac{\theta}{1-\theta}\right) + n\log(1-\theta)\right\}.$$
In this case,
$$\eta(\theta) = \log\left(\frac{\theta}{1-\theta}\right), \qquad B(\theta) = -n\log(1-\theta)$$
and
$$T(x) = x, \qquad h(x) = \binom{n}{x}. \quad\blacksquare$$

We can rewrite an exponential family as
$$f(x; \eta) = h(x)e^{\eta T(x) - A(\eta)}$$
where $\eta = \eta(\theta)$ is called the natural parameter and
$$A(\eta) = \log\int h(x)e^{\eta T(x)}\,dx.$$
For example, a Poisson can be written as $f(x; \eta) = e^{\eta x - e^\eta}/x!$ where the natural parameter is $\eta = \log\lambda$.

Let $X_1, \ldots, X_n$ be iid from an exponential family. Then $f(x^n; \theta)$ is an exponential family:
$$f(x^n; \theta) = h_n(x^n)e^{\eta(\theta)T_n(x^n) - B_n(\theta)}$$
where $h_n(x^n) = \prod_i h(x_i)$, $T_n(x^n) = \sum_i T(x_i)$ and $B_n(\theta) = nB(\theta)$. This implies that $\sum_i T(X_i)$ is sufficient.

Example 10.46 Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Then
$$f(x^n; \theta) = \frac{1}{\theta^n}I(x_{(n)} \le \theta)$$
where $I$ is 1 if the term inside the brackets is true and 0 otherwise, and $x_{(n)} = \max\{x_1, \ldots, x_n\}$. Thus $T(X^n) = \max\{X_1, \ldots, X_n\}$ is sufficient. But since $T(X^n) \ne \sum_i T(X_i)$, this cannot be an exponential family. $\blacksquare$

Theorem 10.47 Let $X$ have density in an exponential family. Then,
$$\mathbb{E}(T(X)) = A'(\eta), \qquad \mathbb{V}(T(X)) = A''(\eta).$$

If $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, then we say that $f(x; \theta)$ has exponential family form if
$$f(x; \theta) = h(x)\exp\left\{\sum_{j=1}^k \eta_j(\theta)T_j(x) - B(\theta)\right\}.$$
Again, $T = (T_1, \ldots, T_k)$ is sufficient, and a sample of $n$ iid observations also has exponential form with sufficient statistic $(\sum_i T_1(X_i), \ldots, \sum_i T_k(X_i))$.

Example 10.48 Consider the normal family with $\theta = (\mu, \sigma)$. Now,
$$f(x; \theta) = \exp\left\{\frac{\mu}{\sigma^2}x - \frac{x^2}{2\sigma^2} - \frac{1}{2}\left(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\right)\right\}.$$
This is exponential with
$$\eta_1(\theta) = \frac{\mu}{\sigma^2}, \quad T_1(x) = x, \qquad \eta_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_2(x) = x^2,$$
$$B(\theta) = \frac{1}{2}\left(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\right), \qquad h(x) = 1.$$
Hence, with $n$ iid samples, $(\sum_i X_i, \sum_i X_i^2)$ is sufficient. $\blacksquare$

As before we can write an exponential family as
$$f(x; \eta) = h(x)\exp\left\{\eta^T T(x) - A(\eta)\right\}$$
where $A(\eta) = \log\int h(x)e^{\eta^T T(x)}\,dx$. It can be shown that
$$\mathbb{E}(T(X)) = \dot{A}(\eta), \qquad \mathbb{V}(T(X)) = \ddot{A}(\eta)$$
where the first expression is the vector of partial derivatives and the second is the matrix of second derivatives.

10.13 Exercises

1. Let $X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta)$. Find the method of moments estimator for $\alpha$ and $\beta$.

2. Let $X_1, \ldots, X_n \sim \text{Uniform}(a, b)$ where $a$ and $b$ are unknown parameters and $a < b$.
(a) Find the method of moments estimators for $a$ and $b$.
(b) Find the mle $\hat{a}$ and $\hat{b}$.
(c) Let $\tau = \int x\,dF(x)$. Find the mle of $\tau$.
(d) Let $\hat\tau$ be the mle from (c). Let $\tilde\tau$ be the nonparametric plug-in estimator of $\tau = \int x\,dF(x)$. Suppose that $a = 1$, $b = 3$ and $n = 10$. Find the mse of $\hat\tau$ by simulation. Find the mse of $\tilde\tau$ analytically. Compare.

3. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let $\tau$ be the .95 percentile, i.e. $\mathbb{P}(X < \tau) = .95$.
(a) Find the mle of $\tau$.
(b) Find an expression for an approximate $1 - \alpha$ confidence interval for $\tau$.
(c) Suppose the data are:

3.23  -2.50   1.88  -0.68   4.43   0.17
1.03  -0.07  -0.01   0.76   1.76   3.18
0.33  -0.31   0.30  -0.61   1.52   5.43
1.54   2.28   0.42   2.33  -1.03   4.00
0.39

Find the mle $\hat\tau$. Find the standard error using the delta method. Find the standard error using the parametric bootstrap.

4. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that the mle is consistent. Hint: Let $Y = \max\{X_1, \ldots, X_n\}$. For any $c$, $\mathbb{P}(Y < c) = \mathbb{P}(X_1 < c, X_2 < c, \ldots, X_n < c) = \mathbb{P}(X_1 < c)\mathbb{P}(X_2 < c)\cdots\mathbb{P}(X_n < c)$.

5. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Find the method of moments estimator, the maximum likelihood estimator and the Fisher information $I(\lambda)$.

6. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Define
$$Y_i = \begin{cases} 1 & \text{if } X_i > 0 \\ 0 & \text{if } X_i \le 0. \end{cases}$$
Let $\psi = \mathbb{P}(Y_1 = 1)$.
(a) Find the maximum likelihood estimate $\hat\psi$ of $\psi$.
(b) Find an approximate 95 percent confidence interval for $\psi$.
(c) Define $\tilde\psi = (1/n)\sum_i Y_i$. Show that $\tilde\psi$ is a consistent estimator of $\psi$.
(d) Compute the asymptotic relative efficiency of $\tilde\psi$ to $\hat\psi$. Hint: Use the delta method to get the standard error of the mle. Then compute the standard error (i.e. the standard deviation) of $\tilde\psi$.
(e) Suppose that the data are not really normal. Show that $\hat\psi$ is not consistent. What, if anything, does $\hat\psi$ converge to?

7. (Comparing two treatments.) $n_1$ people are given treatment 1 and $n_2$ people are given treatment 2. Let $X_1$ be the number of people on treatment 1 who respond favorably to the treatment and let $X_2$ be the number of people on treatment 2 who respond favorably. Assume that $X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_2 \sim \text{Binomial}(n_2, p_2)$. Let $\psi = p_1 - p_2$.
(a) Find the mle $\hat\psi$ for $\psi$.
(b) Find the Fisher information matrix $I(p_1, p_2)$.
(c) Use the multiparameter delta method to find the asymptotic standard error of $\hat\psi$.
(d) Suppose that $n_1 = n_2 = 200$, $X_1 = 160$ and $X_2 = 148$. Find $\hat\psi$. Find an approximate 90 percent confidence interval for $\psi$ using (i) the delta method and (ii) the parametric bootstrap.

8. Find the Fisher information matrix for Example 10.29.

9. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$. Let $\theta = e^\mu$ and let $\hat\theta = e^{\overline{X}}$ be the mle. Create a data set (using $\mu = 5$) consisting of $n = 100$ observations.
(a) Use the delta method to get $\widehat{\mathrm{se}}$ and a 95 percent confidence interval for $\theta$. Use the parametric bootstrap to get $\widehat{\mathrm{se}}$ and a 95 percent confidence interval for $\theta$. Use the nonparametric bootstrap to get $\widehat{\mathrm{se}}$ and a 95 percent confidence interval for $\theta$. Compare your answers.
(b) Plot a histogram of the bootstrap replications for the parametric and nonparametric bootstraps. These are estimates of the distribution of $\hat\theta$. The delta method also gives an approximation to this distribution, namely $\text{Normal}(\hat\theta, \widehat{\mathrm{se}}^2)$. Compare these to the true sampling distribution of $\hat\theta$ (which you can get by simulation). Which approximation, parametric bootstrap, nonparametric bootstrap, or delta method is closer to the true distribution?

10. Let $X_1, \ldots, X_n \sim \text{Unif}(0, \theta)$. The mle is $\hat\theta = X_{(n)} = \max\{X_1, \ldots, X_n\}$. Generate a data set of size 50 with $\theta = 1$.
(a) Find the distribution of $\hat\theta$ analytically. Compare the true distribution of $\hat\theta$ to the histograms from the parametric and nonparametric bootstraps.
(b) This is a case where the nonparametric bootstrap does very poorly. Show that, for the parametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) = 0$ but for the nonparametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) \approx .632$. Hint: show that $\mathbb{P}(\hat\theta^* = \hat\theta) = 1 - (1 - (1/n))^n$ then take the limit as $n$ gets large. What is the implication of this?

11 Hypothesis Testing and p-values

Suppose we want to know if a certain chemical causes cancer. We take some rats and randomly divide them into two groups. We expose one group to the chemical and then we compare the cancer rate in the two groups. Consider the following two hypotheses:

The Null Hypothesis: The cancer rate is the same in the two groups.

The Alternative Hypothesis: The cancer rate is not the same in the two groups.

If the exposed group has a much higher rate of cancer than the unexposed group then we will reject the null hypothesis and conclude that the evidence favors the alternative hypothesis; in other words we will conclude that there is evidence that the chemical causes cancer. This is an example of hypothesis testing.

More formally, suppose that we partition the parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test
$$H_0: \theta \in \Theta_0 \quad\text{versus}\quad H_1: \theta \in \Theta_1. \qquad (11.1)$$

We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.

Let $X$ be a random variable and let $\mathcal{X}$ be the range of $X$. We test a hypothesis by finding an appropriate subset of outcomes $R \subset \mathcal{X}$ called the rejection region. If $X \in R$ we reject the null hypothesis, otherwise, we do not reject the null hypothesis:
$$X \in R \implies \text{reject } H_0$$
$$X \notin R \implies \text{retain (do not reject) } H_0.$$
Usually the rejection region $R$ is of the form
$$R = \{x : T(x) > c\}$$
where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate cutoff value $c$.

Warning! There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well defined hypothesis.

Hypothesis testing is like a legal trial. We assume someone is innocent unless the evidence strongly suggests that he is guilty. Similarly, we retain $H_0$ unless there is strong evidence to reject $H_0$. There are two types of errors we can make. Rejecting $H_0$ when $H_0$ is true is called a type I error. Retaining $H_0$ when $H_1$ is true is called a type II error. The possible outcomes for hypothesis testing are summarized in the table below:

              Retain Null       Reject Null
H_0 true      correct           type I error
H_1 true      type II error     correct

Summary of outcomes of hypothesis testing.

Definition 11.1 The power function of a test with rejection region $R$ is defined by
$$\beta(\theta) = \mathbb{P}_\theta(X \in R). \qquad (11.2)$$
The size of a test is defined to be
$$\alpha = \sup_{\theta\in\Theta_0}\beta(\theta). \qquad (11.3)$$
A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.

A hypothesis of the form $\theta = \theta_0$ is called a simple hypothesis. A hypothesis of the form $\theta > \theta_0$ or $\theta < \theta_0$ is called a composite hypothesis. A test of the form
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \ne \theta_0$$
is called a two-sided test. A test of the form
$$H_0: \theta \le \theta_0 \quad\text{versus}\quad H_1: \theta > \theta_0$$
or
$$H_0: \theta \ge \theta_0 \quad\text{versus}\quad H_1: \theta < \theta_0$$
is called a one-sided test. The most common tests are two-sided.

Example 11.2 Let $X_1, \ldots, X_n \sim N(\mu, \sigma)$ where $\sigma$ is known. We want to test $H_0: \mu \le 0$ versus $H_1: \mu > 0$. Hence, $\Theta_0 = (-\infty, 0]$ and $\Theta_1 = (0, \infty)$. Consider the test:
$$\text{reject } H_0 \text{ if } T > c$$
where $T = \overline{X}$. The rejection region is $R = \{x^n : T(x^n) > c\}$. Let $Z$ denote a standard normal random variable. The power function is
$$\beta(\mu) = \mathbb{P}_\mu\left(\overline{X} > c\right) = \mathbb{P}_\mu\left(\frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma}\right) = \mathbb{P}\left(Z > \frac{\sqrt{n}(c - \mu)}{\sigma}\right) = 1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right).$$
This function is increasing in $\mu$. Hence
$$\text{size} = \sup_{\mu\le 0}\beta(\mu) = \beta(0) = 1 - \Phi\left(\frac{\sqrt{n}\,c}{\sigma}\right).$$
To get a size $\alpha$ test, set this equal to $\alpha$ and solve for $c$ to get
$$c = \frac{\sigma\,\Phi^{-1}(1 - \alpha)}{\sqrt{n}}.$$
So we reject when $\overline{X} > \sigma\Phi^{-1}(1 - \alpha)/\sqrt{n}$. Equivalently, we reject when
$$\frac{\sqrt{n}(\overline{X} - 0)}{\sigma} > z_\alpha. \quad\blacksquare$$
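The power function in Example 11.2 is easy to evaluate numerically. A sketch follows (SciPy's Normal cdf and quantile function are assumed; the values of n, sigma and the alternative mean are arbitrary):

import numpy as np
from scipy.stats import norm

n, sigma, alpha = 25, 1.0, 0.05
c = sigma * norm.ppf(1 - alpha) / np.sqrt(n)      # critical value for a size-alpha test

def power(mu):
    # beta(mu) = 1 - Phi(sqrt(n) * (c - mu) / sigma)
    return 1 - norm.cdf(np.sqrt(n) * (c - mu) / sigma)

print(power(0.0))   # the size of the test: equals alpha = 0.05
print(power(0.5))   # the power at mu = 0.5 is much larger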

Finding most powerful tests is hard and, in many cases, most powerful tests don't even exist. Instead of searching for most powerful tests, we'll just consider three widely used tests: the Wald test, the $\chi^2$ test and the permutation test. A fourth test, the likelihood ratio test, is discussed in the appendix.

11.1 The Wald Test

Let $\theta$ be a scalar parameter, let $\hat\theta$ be an estimate of $\theta$ and let $\widehat{\mathrm{se}}$ be the estimated standard error of $\hat\theta$. (The test is named after Abraham Wald (1902-1950), who was a very influential mathematical statistician.)

Definition 11.3 The Wald Test

Consider testing
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \ne \theta_0.$$
Assume that $\hat\theta$ is asymptotically Normal:
$$\frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1).$$
The size $\alpha$ Wald test is: reject $H_0$ when $|W| > z_{\alpha/2}$ where
$$W = \frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}}.$$

Theorem 11.4 Asymptotically, the Wald test has size $\alpha$, that is,
$$\mathbb{P}_{\theta_0}\left(|W| > z_{\alpha/2}\right) \to \alpha$$
as $n \to \infty$.

Proof. Under $\theta = \theta_0$, $(\hat\theta - \theta_0)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$. Hence, the probability of rejecting when the null $\theta = \theta_0$ is true is
$$\mathbb{P}_{\theta_0}\left(|W| > z_{\alpha/2}\right) = \mathbb{P}_{\theta_0}\left(\frac{|\hat\theta - \theta_0|}{\widehat{\mathrm{se}}} > z_{\alpha/2}\right) \to \mathbb{P}\left(|N(0, 1)| > z_{\alpha/2}\right) = \alpha. \quad\blacksquare$$

Remark 11.5 Most texts define the Wald test slightly differently. They use the standard error computed at $\theta = \theta_0$ rather than at the estimated value $\hat\theta$. Both versions are valid.

Let us consider the power of the Wald test when the null hypothesis is false.

Theorem 11.6 Suppose the true value of $\theta$ is $\theta_\star \ne \theta_0$. The power $\beta(\theta_\star)$, the probability of correctly rejecting the null hypothesis, is given (approximately) by
$$1 - \Phi\left(\frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} + z_{\alpha/2}\right) + \Phi\left(\frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} - z_{\alpha/2}\right). \qquad (11.4)$$

Recall that $\widehat{\mathrm{se}}$ tends to 0 as the sample size increases. Inspecting (11.4) closely we note that: (i) the power is large if $\theta_\star$ is far from $\theta_0$ and (ii) the power is large if the sample size is large.

Example 11.7 (Comparing Two Prediction Algorithms) We test a prediction algorithm on a test set of size $m$ and we test a second prediction algorithm on a second test set of size $n$. Let $X$ be the number of incorrect predictions for algorithm 1 and let $Y$ be the number of incorrect predictions for algorithm 2. Then $X \sim \text{Binomial}(m, p_1)$ and $Y \sim \text{Binomial}(n, p_2)$. To test the null hypothesis that $p_1 = p_2$ write
$$H_0: \delta = 0 \quad\text{versus}\quad H_1: \delta \ne 0$$
where $\delta = p_1 - p_2$. The mle is $\hat\delta = \hat{p}_1 - \hat{p}_2$ with estimated standard error
$$\widehat{\mathrm{se}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{m} + \frac{\hat{p}_2(1-\hat{p}_2)}{n}}.$$
The size $\alpha$ Wald test is to reject $H_0$ when $|W| > z_{\alpha/2}$ where
$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{m} + \frac{\hat{p}_2(1-\hat{p}_2)}{n}}}.$$
The power of this test will be largest when $p_1$ is far from $p_2$ and when the sample sizes are large.

What if we used the same test set to test both algorithms? The two samples are no longer independent. Instead we use the following strategy. Let $X_i = 1$ if algorithm 1 is correct on test case $i$ and $X_i = 0$ otherwise. Let $Y_i = 1$ if algorithm 2 is correct on test case $i$ and $Y_i = 0$ otherwise. A typical data set will look something like this:

Test Case   X_i   Y_i   D_i = X_i - Y_i
    1        1     0          1
    2        1     1          0
    3        1     1          0
    4        0     1         -1
    5        0     0          0
   ...      ...   ...        ...
    n        0     1         -1

Let
$$\delta = \mathbb{E}(D_i) = \mathbb{E}(X_i) - \mathbb{E}(Y_i) = \mathbb{P}(X_i = 1) - \mathbb{P}(Y_i = 1).$$
Then $\hat\delta = \overline{D} = n^{-1}\sum_{i=1}^n D_i$ and $\widehat{\mathrm{se}}(\hat\delta) = S/\sqrt{n}$ where $S^2 = n^{-1}\sum_{i=1}^n (D_i - \overline{D})^2$. To test $H_0: \delta = 0$ versus $H_1: \delta \ne 0$ we use $W = \hat\delta/\widehat{\mathrm{se}}$ and reject $H_0$ if $|W| > z_{\alpha/2}$. This is called a paired comparison. $\blacksquare$
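To make the paired comparison concrete, here is a sketch with hypothetical indicator data (the accuracy levels 0.80 and 0.72 and the 300 test cases are invented for illustration; SciPy is assumed for the Normal quantile):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.80, size=300)   # 1 if algorithm 1 is correct on test case i
y = rng.binomial(1, 0.72, size=300)   # 1 if algorithm 2 is correct on the same case

d = x - y
delta_hat = d.mean()                  # estimate of delta = P(X_i = 1) - P(Y_i = 1)
se_hat = d.std() / np.sqrt(len(d))    # S / sqrt(n), with S^2 = n^{-1} sum (D_i - Dbar)^2

W = delta_hat / se_hat
print(W, abs(W) > norm.ppf(0.975))    # reject H0 at level .05 when |W| > z_{alpha/2}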

Example 11.8 (Comparing Two Means.) Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be two independent samples from populations with means $\mu_1$ and $\mu_2$, respectively. Let's test the null hypothesis that $\mu_1 = \mu_2$. Write this as $H_0: \delta = 0$ versus $H_1: \delta \ne 0$ where $\delta = \mu_1 - \mu_2$. Recall that the nonparametric plug-in estimate of $\delta$ is $\hat\delta = \overline{X} - \overline{Y}$ with estimated standard error
$$\widehat{\mathrm{se}} = \sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}$$
where $s_1^2$ and $s_2^2$ are the sample variances. The size $\alpha$ Wald test rejects $H_0$ when $|W| > z_{\alpha/2}$ where
$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\overline{X} - \overline{Y}}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}}. \quad\blacksquare$$

Example 11.9 (Comparing Two Medians.) Consider the previous example again but let us test whether the medians of the two distributions are the same. Thus, $H_0: \delta = 0$ versus $H_1: \delta \ne 0$ where $\delta = \nu_1 - \nu_2$ and $\nu_1$ and $\nu_2$ are the medians. The nonparametric plug-in estimate of $\delta$ is $\hat\delta = \hat\nu_1 - \hat\nu_2$ where $\hat\nu_1$ and $\hat\nu_2$ are the sample medians. The estimated standard error $\widehat{\mathrm{se}}$ of $\hat\delta$ can be obtained from the bootstrap. The Wald test statistic is $W = \hat\delta/\widehat{\mathrm{se}}$. $\blacksquare$

There is a relationship between the Wald test and the $1 - \alpha$ asymptotic confidence interval $\hat\theta \pm \widehat{\mathrm{se}}\,z_{\alpha/2}$.

Theorem 11.10 The size $\alpha$ Wald test rejects $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ if and only if $\theta_0 \notin C$ where
$$C = (\hat\theta - \widehat{\mathrm{se}}\,z_{\alpha/2},\ \hat\theta + \widehat{\mathrm{se}}\,z_{\alpha/2}).$$
Thus, testing the hypothesis is equivalent to checking whether the null value is in the confidence interval.

11.2 p-values

Reporting "reject $H_0$" or "retain $H_0$" is not very informative. Instead, we could ask, for every $\alpha$, whether the test rejects at that level. Generally, if the test rejects at level $\alpha$ it will also reject at level $\alpha' > \alpha$. Hence, there is a smallest $\alpha$ at which the test rejects and we call this number the p-value.

Definition 11.11 Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test with rejection region $R_\alpha$. Then,
$$\text{p-value} = \inf\left\{\alpha : T(X^n) \in R_\alpha\right\}.$$
That is, the p-value is the smallest level at which we can reject $H_0$.

Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$. Typically, researchers use the following evidence scale:

p-value        evidence
< .01          very strong evidence against H_0
.01 - .05      strong evidence against H_0
.05 - .10      weak evidence against H_0
> .1           little or no evidence against H_0

Warning! A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons: (i) $H_0$ is true or (ii) $H_0$ is false but the test has low power.

But do not confuse the p-value with $\mathbb{P}(H_0 \mid \text{Data})$. (We discuss quantities like $\mathbb{P}(H_0 \mid \text{Data})$ in the chapter on Bayesian inference.) The p-value is not the probability that the null hypothesis is true.

The following result explains how to compute the p-value.

Theorem 11.12 Suppose that the size $\alpha$ test is of the form
$$\text{reject } H_0 \text{ if and only if } T(X^n) \ge c_\alpha.$$
Then,
$$\text{p-value} = \sup_{\theta\in\Theta_0}\mathbb{P}_\theta\left(T(X^n) \ge T(x^n)\right).$$
In words, the p-value is the probability (under $H_0$) of observing a value of the test statistic as or more extreme than what was actually observed.

For a Wald test, $W$ has an approximate $N(0, 1)$ distribution under $H_0$. Hence, the p-value is
$$\text{p-value} \approx \mathbb{P}\left(|Z| > |w|\right) = 2\mathbb{P}\left(Z < -|w|\right) = 2\Phi(-|w|) \qquad (11.5)$$
where $Z \sim N(0, 1)$ and $w = (\hat\theta - \theta_0)/\widehat{\mathrm{se}}$ is the observed value of the test statistic.
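Equation (11.5) translates directly into a one-line function; a sketch using SciPy's Normal cdf (the numbers in the final line are arbitrary):

from scipy.stats import norm

def wald_p_value(theta_hat, theta0, se_hat):
    # two-sided p-value 2 * Phi(-|w|) with w = (theta_hat - theta0) / se_hat
    w = (theta_hat - theta0) / se_hat
    return 2 * norm.cdf(-abs(w))

print(wald_p_value(0.62, 0.50, 0.05))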

Here is an important property of p-values.

Theorem 11.13 If the test statistic has a continuous distribution, then under $H_0: \theta = \theta_0$, the p-value has a Uniform(0, 1) distribution.

If the p-value is less than .05 then people often report that "the result is statistically significant at the 5 percent level." This just means that the null would be rejected if we used a size $\alpha = 0.05$ test. In general, the size $\alpha$ test rejects if and only if p-value $< \alpha$.

Example 11.14 Recall the cholesterol data from Example 8.11. To test whether the means are different we compute
$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\overline{X} - \overline{Y}}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}} = \frac{216.2 - 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78.$$
To compute the p-value, let $Z \sim N(0, 1)$ denote a standard Normal random variable. Then,
$$\text{p-value} = \mathbb{P}(|Z| > 3.78) = 2\mathbb{P}(Z < -3.78) = .0002$$
which is very strong evidence against the null hypothesis. To test if the medians are different, let $\hat\nu_1$ and $\hat\nu_2$ denote the sample medians. Then,
$$W = \frac{\hat\nu_1 - \hat\nu_2}{\widehat{\mathrm{se}}} = \frac{212.5 - 194}{7.7} = 2.4$$
where the standard error 7.7 was found using the bootstrap. The p-value is
$$\text{p-value} = \mathbb{P}(|Z| > 2.4) = 2\mathbb{P}(Z < -2.4) = .02$$
which is strong evidence against the null hypothesis. $\blacksquare$

Warning! A result might be statistically significant and yet the size of the effect might be small. In such a case we have a result that is statistically significant but not practically significant. It is wise to report a confidence interval as well.

11.3 The $\chi^2$ distribution

Let $Z_1, \ldots, Z_k$ be independent, standard normals. Let $V = \sum_{i=1}^k Z_i^2$. Then we say that $V$ has a $\chi^2$ distribution with $k$ degrees of freedom, written $V \sim \chi^2_k$. The probability density of $V$ is
$$f(v) = \frac{v^{(k/2)-1}e^{-v/2}}{2^{k/2}\Gamma(k/2)}$$
for $v > 0$. It can be shown that $\mathbb{E}(V) = k$ and $\mathbb{V}(V) = 2k$. We define the upper $\alpha$ quantile $\chi^2_{k,\alpha} = F^{-1}(1-\alpha)$ where $F$ is the cdf. That is, $\mathbb{P}(\chi^2_k > \chi^2_{k,\alpha}) = \alpha$.

11.4 Pearson's $\chi^2$ Test For Multinomial Data

Pearson's $\chi^2$ test is used for multinomial data. Recall that $X = (X_1, \ldots, X_k)$ has a multinomial distribution if
$$f(x_1, \ldots, x_k; p) = \binom{n}{x_1\,\ldots\,x_k}p_1^{x_1}\cdots p_k^{x_k}$$
where
$$\binom{n}{x_1\,\ldots\,x_k} = \frac{n!}{x_1!\cdots x_k!}.$$
The mle of $p$ is $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k) = (X_1/n, \ldots, X_k/n)$.

Let $(p_{01}, \ldots, p_{0k})$ be some fixed set of probabilities and suppose we want to test
$$H_0: (p_1, \ldots, p_k) = (p_{01}, \ldots, p_{0k}) \quad\text{versus}\quad H_1: (p_1, \ldots, p_k) \ne (p_{01}, \ldots, p_{0k}).$$
Pearson's $\chi^2$ statistic is
$$T = \sum_{j=1}^k \frac{(X_j - np_{0j})^2}{np_{0j}} = \sum_{j=1}^k \frac{(O_j - E_j)^2}{E_j}$$
where $O_j = X_j$ is the observed data and $E_j = \mathbb{E}(X_j) = np_{0j}$ is the expected value of $X_j$ under $H_0$.

Theorem 11.15 Under $H_0$, $T \rightsquigarrow \chi^2_{k-1}$. Hence the test: reject $H_0$ if $T > \chi^2_{k-1,\alpha}$ has asymptotic level $\alpha$. The p-value is $\mathbb{P}(\chi^2_{k-1} > t)$ where $t$ is the observed value of the test statistic.

Example 11.16 (Mendel's peas.) Mendel bred peas with round yellow seeds and wrinkled green seeds. There are four types of progeny: round yellow, wrinkled yellow, round green and wrinkled green. The number of each type is multinomial with probability $(p_1, p_2, p_3, p_4)$. His theory of inheritance predicts that
$$p = \left(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16}\right) \equiv p_0.$$
In $n = 556$ trials he observed $X = (315, 101, 108, 32)$. Since $np_{01} = 312.75$, $np_{02} = np_{03} = 104.25$ and $np_{04} = 34.75$, the test statistic is
$$\chi^2 = \frac{(315 - 312.75)^2}{312.75} + \frac{(101 - 104.25)^2}{104.25} + \frac{(108 - 104.25)^2}{104.25} + \frac{(32 - 34.75)^2}{34.75} = 0.47.$$
The $\alpha = .05$ critical value for a $\chi^2_3$ is 7.815. Since 0.47 is not larger than 7.815 we do not reject the null. The p-value is
$$\text{p-value} = \mathbb{P}(\chi^2_3 > .47) = .93$$
which is not evidence against $H_0$. Hence, the data do not contradict Mendel's theory. Interestingly, there is some controversy about whether Mendel's results are "too good." $\blacksquare$
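The computation in Example 11.16 can be reproduced in a few lines; a sketch using SciPy's chi-square survival function for the tail probability:

import numpy as np
from scipy.stats import chi2

observed = np.array([315, 101, 108, 32])
p0 = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * p0                      # n * p_{0j}

T = np.sum((observed - expected) ** 2 / expected)   # Pearson's statistic
p_value = chi2.sf(T, df=len(observed) - 1)          # P(chi^2_3 > T)
print(T, p_value)                                   # about 0.47 and 0.93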

11.5 The Permutation Test

The permutation test is a nonparametric method for testing whether two distributions are the same. This test is "exact," meaning that it is not based on large sample theory approximations. Suppose that $X_1, \ldots, X_m \sim F_X$ and $Y_1, \ldots, Y_n \sim F_Y$ are two independent samples and $H_0$ is the hypothesis that the two samples are identically distributed. This is the type of hypothesis we would consider when testing whether a treatment differs from a placebo. More precisely we are testing
$$H_0: F_X = F_Y \quad\text{versus}\quad H_1: F_X \ne F_Y.$$
Let $T(x_1, \ldots, x_m, y_1, \ldots, y_n)$ be some test statistic, for example,
$$T(X_1, \ldots, X_m, Y_1, \ldots, Y_n) = |\overline{X}_m - \overline{Y}_n|.$$
Let $N = m + n$ and consider forming all $N!$ permutations of the data $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. For each permutation, compute the test statistic $T$. Denote these values by $T_1, \ldots, T_{N!}$. Under the null hypothesis, each of these values is equally likely. (Under the null hypothesis, given the ordered data values, $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ is uniformly distributed over the $N!$ permutations of the data.) The distribution $P_0$ that puts mass $1/N!$ on each $T_j$ is called the permutation distribution of $T$. Let $t_{\text{obs}}$ be the observed value of the test statistic. Assuming we reject when $T$ is large, the p-value is
$$\text{p-value} = P_0(T > t_{\text{obs}}) = \frac{1}{N!}\sum_{j=1}^{N!} I(T_j > t_{\text{obs}}).$$

Example 11.17 Here is a toy example to make the idea clear. Suppose the data are: $(X_1, X_2, Y_1) = (1, 9, 3)$. Let $T(X_1, X_2, Y_1) = |\overline{X} - \overline{Y}| = 2$. The permutations are:

permutation   value of T   probability
(1,9,3)            2           1/6
(9,1,3)            2           1/6
(1,3,9)            7           1/6
(3,1,9)            7           1/6
(3,9,1)            5           1/6
(9,3,1)            5           1/6

The p-value is $\mathbb{P}(T > 2) = 4/6$. $\blacksquare$

Usually, it is not practical to evaluate all $N!$ permutations. We can approximate the p-value by sampling randomly from the set of permutations. The fraction of times $T_j > t_{\text{obs}}$ among these samples approximates the p-value.

Algorithm for Permutation Test

1. Compute the observed value of the test statistic $t_{\text{obs}} = T(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$.

2. Randomly permute the data. Compute the statistic again using the permuted data.

3. Repeat the previous step $B$ times and let $T_1, \ldots, T_B$ denote the resulting values.

4. The approximate p-value is
$$\frac{1}{B}\sum_{j=1}^B I(T_j > t_{\text{obs}}).$$
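Here is a sketch of the algorithm above for the statistic T = |mean(X) - mean(Y)| (the number of replications B and the seed are arbitrary). On the toy data of Example 11.17 the estimate is close to the exact value 4/6.

import numpy as np

def permutation_test(x, y, B=10000, seed=0):
    # approximate permutation p-value for T = |mean(x) - mean(y)|
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    m = len(x)
    t_obs = abs(x.mean() - y.mean())
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)              # step 2: randomly permute the data
        if abs(perm[:m].mean() - perm[m:].mean()) > t_obs:
            count += 1
    return count / B                                # step 4: fraction of T_j > t_obs

x = np.array([1.0, 9.0])
y = np.array([3.0])
print(permutation_test(x, y))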

Example 11.18 DNA microarrays allow researchers to measure the expression levels of thousands of genes. The data are the levels of messenger RNA (mRNA) of each gene, which is thought to provide a measure of how much protein that gene produces. Roughly, the larger the number, the more active the gene. The table below, reproduced from Efron et al. (JASA, 2001, p. 1160), shows the expression levels for genes from two types of liver cancer cells. There are 2,638 genes in this experiment but here we show just the first two. The data are log-ratios of the intensity levels of two different color dyes used on the arrays.

                        Type I                              Type II
Patient      1       2         3       4     5      6      7       8      9     10
Gene 1     230.0  -1,350  -1,580.0   -400  -760    970    110     -50   -190   -200
Gene 2     470.0    -850      -.8    -280   120    390  -1730   -1360    -.8   -330
  ...

Let's test whether the median level of gene 1 is different between the two groups. Let $\nu_1$ denote the median level of gene 1 of Type I and let $\nu_2$ denote the median level of gene 1 of Type II. The absolute difference of sample medians is $T = |\hat\nu_1 - \hat\nu_2| = 710$. Now we estimate the permutation distribution by simulation and we find that the estimated p-value is .045. Thus, if we use an $\alpha = .05$ level of significance, we would say that there is evidence to reject the null hypothesis of no difference. $\blacksquare$

In large samples, the permutation test usually gives similar results to a test that is based on large sample theory. The permutation test is thus most useful for small samples.

11.6 Multiple Testing

In some situations we may conduct many hypothesis tests. In Example 11.18, there were actually 2,638 genes. If we tested for a difference for each gene, we would be conducting 2,638 separate hypothesis tests. Suppose each test is conducted at level $\alpha$. For any one test, the chance of a false rejection of the null is $\alpha$. But the chance of at least one false rejection is much higher.

This is the multiple testing problem. The problem comes up in many data mining situations where one may end up testing thousands or even millions of hypotheses. There are many ways to deal with this problem. Here we discuss two methods.

Consider $m$ hypothesis tests:
$$H_{0i} \text{ versus } H_{1i}, \quad i = 1, \ldots, m$$
and let $P_1, \ldots, P_m$ denote the $m$ p-values for these tests.

The Bonferroni Method

Given p-values $P_1, \ldots, P_m$, reject null hypothesis $H_{0i}$ if $P_i < \alpha/m$.

Theorem 11.19 Using the Bonferroni method, the probability of falsely rejecting any null hypothesis is less than or equal to $\alpha$.

Proof. Let $R$ be the event that at least one null hypothesis is falsely rejected. Let $R_i$ be the event that the $i$th null hypothesis is falsely rejected. Recall that if $A_1, \ldots, A_k$ are events then $\mathbb{P}(\bigcup_{i=1}^k A_i) \le \sum_{i=1}^k \mathbb{P}(A_i)$. Hence,
$$\mathbb{P}(R) = \mathbb{P}\left(\bigcup_{i=1}^m R_i\right) \le \sum_{i=1}^m \mathbb{P}(R_i) = \sum_{i=1}^m \frac{\alpha}{m} = \alpha$$
from Theorem 11.13. $\blacksquare$

Example 11.20 In the gene example, using $\alpha = .05$, we have that $.05/2638 = .00001895375$. Hence, for any gene with p-value less than .00001895375, we declare that there is a significant difference.

The Bonferroni method is very conservative because it is trying to make it unlikely that you would make even one false rejection. Sometimes, a more reasonable idea is to control the false discovery rate (FDR), which is defined as the mean of the number of false rejections divided by the number of rejections.

Suppose we reject all null hypotheses whose p-values fall below some threshold. Let $m_0$ be the number of null hypotheses that are true and let $m_1 = m - m_0$ be the number of null hypotheses that are false. The tests can be categorized in a $2 \times 2$ table as follows:

             H_0 Not Rejected   H_0 Rejected   Total
H_0 True            U                V          m_0
H_0 False           T                S          m_1
Total             m - R              R           m

Define the False Discovery Proportion (FDP)
$$\text{FDP} = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0. \end{cases}$$
The FDP is the proportion of rejections that are incorrect. Next define $\text{FDR} = \mathbb{E}(\text{FDP})$.

The Benjamini-Hochberg (BH) Method

1. Let $P_{(1)} < \cdots < P_{(m)}$ denote the ordered p-values.

2. Define
$$\ell_i = \frac{i\alpha}{C_m m}, \quad\text{and}\quad R = \max\left\{i : P_{(i)} < \ell_i\right\} \qquad (11.6)$$
where $C_m$ is defined to be 1 if the p-values are independent and $C_m = \sum_{i=1}^m (1/i)$ otherwise.

3. Let $t = P_{(R)}$; we call $t$ the BH rejection threshold.

4. Reject all null hypotheses $H_{0i}$ for which $P_i \le t$.
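A sketch of the BH procedure follows (the p-values in the example call are made up; with independent p-values $C_m = 1$):

import numpy as np

def bh_reject(pvalues, alpha=0.05, independent=True):
    # returns a boolean array: True for the hypotheses rejected by the BH method
    p = np.asarray(pvalues)
    m = len(p)
    Cm = 1.0 if independent else np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    ell = alpha * np.arange(1, m + 1) / (Cm * m)    # the thresholds l_i of (11.6)
    below = p[order] < ell
    if not below.any():
        return np.zeros(m, dtype=bool)              # no p-value falls below its line
    R = np.nonzero(below)[0].max()                  # index of the last undercrossing
    t = p[order][R]                                 # the BH rejection threshold
    return p <= t

print(bh_reject([0.001, 0.008, 0.012, 0.041, 0.2, 0.6, 0.9]))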

Theorem 11.21 (Benjamini and Hochberg) If the procedure above is applied, then regardless of how many nulls are true and regardless of the distribution of the p-values when the null hypothesis is false,
$$\text{FDR} = \mathbb{E}(\text{FDP}) \le \frac{m_0}{m}\alpha \le \alpha.$$

Example 11.22 Figure 11.1 shows 7 ordered p-values plotted as vertical lines. If we tested at level $\alpha$ without doing any correction for multiple testing, we would reject all hypotheses whose p-values are less than $\alpha$. In this case, the 5 hypotheses corresponding to the 5 smallest p-values are rejected. The Bonferroni method rejects all hypotheses whose p-values are less than $\alpha/m$. In this example, this leads to no rejections. The BH threshold corresponds to the last p-value that falls under the line with slope $\alpha$. This leads to three hypotheses being rejected in this case. $\blacksquare$

Summary. The Bonferroni method controls the probability of a single false rejection. This is very strict and leads to low power when there are many tests. The FDR method controls the fraction of false discoveries, which is a more reasonable criterion when there are many tests.

11.7 Technical Appendix

11.7.1 The Neyman-Pearson Lemma

In the special case of a simple null $H_0: \theta = \theta_0$ and a simple alternative $H_1: \theta = \theta_1$ we can say precisely what the most powerful test is.

Theorem 11.23 (Neyman-Pearson.) Suppose we test $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$. Let
$$T = \frac{\mathcal{L}(\theta_1)}{\mathcal{L}(\theta_0)} = \frac{\prod_{i=1}^n f(x_i; \theta_1)}{\prod_{i=1}^n f(x_i; \theta_0)}.$$
Suppose we reject $H_0$ when $T > k$. If we choose $k$ so that $\mathbb{P}_{\theta_0}(T > k) = \alpha$ then this test is the most powerful, size $\alpha$ test. That is, among all tests with size $\alpha$, this test maximizes the power $\beta(\theta_1)$.

[FIGURE 11.1. Schematic illustration of the Benjamini-Hochberg procedure. The ordered p-values are plotted as vertical lines between 0 and 1, together with the lines of slope $\alpha$ and $\alpha/m$ and the BH rejection threshold $t$. All hypotheses corresponding to the last undercrossing are rejected.]

11.7.2 Power of the Wald Test

Proof of Theorem 11.6. Let $Z \sim N(0, 1)$. Then,
$$\begin{aligned}
\text{Power} = \beta(\theta_\star) &= \mathbb{P}_{\theta_\star}(\text{Reject } H_0) \\
&= \mathbb{P}_{\theta_\star}(|W| > z_{\alpha/2}) \\
&= \mathbb{P}_{\theta_\star}\left(\frac{|\hat\theta - \theta_0|}{\widehat{\mathrm{se}}} > z_{\alpha/2}\right) \\
&= \mathbb{P}_{\theta_\star}\left(\frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}} > z_{\alpha/2}\right) + \mathbb{P}_{\theta_\star}\left(\frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}} < -z_{\alpha/2}\right) \\
&= \mathbb{P}_{\theta_\star}\left(\hat\theta > \theta_0 + \widehat{\mathrm{se}}\,z_{\alpha/2}\right) + \mathbb{P}_{\theta_\star}\left(\hat\theta < \theta_0 - \widehat{\mathrm{se}}\,z_{\alpha/2}\right) \\
&= \mathbb{P}_{\theta_\star}\left(\frac{\hat\theta - \theta_\star}{\widehat{\mathrm{se}}} > \frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} + z_{\alpha/2}\right) + \mathbb{P}_{\theta_\star}\left(\frac{\hat\theta - \theta_\star}{\widehat{\mathrm{se}}} < \frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} - z_{\alpha/2}\right) \\
&\approx \mathbb{P}\left(Z > \frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} + z_{\alpha/2}\right) + \mathbb{P}\left(Z < \frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} - z_{\alpha/2}\right) \\
&= 1 - \Phi\left(\frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} + z_{\alpha/2}\right) + \Phi\left(\frac{\theta_0 - \theta_\star}{\widehat{\mathrm{se}}} - z_{\alpha/2}\right). \quad\blacksquare
\end{aligned}$$

11.7.3 The t-test

To test $H_0: \mu = \mu_0$ where $\mu$ is the mean, we can use the Wald test. When the data are assumed to be Normal and the sample size is small, it is common instead to use the t-test. A random variable $T$ has a t-distribution with $k$ degrees of freedom if it has density
$$f(t) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\left(\frac{k}{2}\right)}\left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}.$$
When the degrees of freedom $k \to \infty$, this tends to a Normal distribution. When $k = 1$ it reduces to a Cauchy.

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\theta = (\mu, \sigma^2)$ are both unknown. Suppose we want to test $\mu = \mu_0$ versus $\mu \ne \mu_0$. Let
$$T = \frac{\sqrt{n}(\overline{X}_n - \mu_0)}{S_n}$$
where $S_n^2$ is the sample variance. For large samples $T \approx N(0, 1)$ under $H_0$. The exact distribution of $T$ under $H_0$ is $t_{n-1}$. Hence if we reject when $|T| > t_{n-1,\alpha/2}$ then we get a size $\alpha$ test.

11.7.4 The Likelihood Ratio Test

Let $\theta = (\theta_1, \ldots, \theta_q, \theta_{q+1}, \ldots, \theta_r)$ and suppose that $\Theta_0$ consists of all parameter values $\theta$ such that $(\theta_{q+1}, \ldots, \theta_r) = (\theta_{0,q+1}, \ldots, \theta_{0,r})$.

Definition 11.24 Define the likelihood ratio statistic by
$$\lambda = 2\log\left(\frac{\sup_{\theta\in\Theta}\mathcal{L}(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}(\theta)}\right) = 2\log\left(\frac{\mathcal{L}(\hat\theta)}{\mathcal{L}(\hat\theta_0)}\right)$$
where $\hat\theta$ is the mle and $\hat\theta_0$ is the mle when $\theta$ is restricted to lie in $\Theta_0$. The likelihood ratio test is: reject $H_0$ when $\lambda(x^n) > \chi^2_{r-q,\alpha}$.

For example, if $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$ and we want to test the null hypothesis that $\theta_3 = \theta_4 = 0$ then the limiting distribution has $4 - 2 = 2$ degrees of freedom.


Theorem 11.25 Under H0, λ(x^n) converges in distribution to χ²_{r−q}. Hence, asymptotically, the LR test is level α.
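As a concrete illustration (my example, not the text's): testing H0: μ = 0 for N(μ, σ²) data restricts one of r = 2 parameters, so λ is compared with a χ²_1 distribution. A minimal Python/SciPy sketch:

import numpy as np
from scipy import stats

def normal_loglik(x, mu, sigma2):
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

def lrt_mean_zero(x, alpha=0.05):
    # Likelihood ratio test of H0: mu = 0 with sigma^2 unknown (r - q = 1).
    x = np.asarray(x, dtype=float)
    mu_hat, s2_hat = x.mean(), x.var()        # unrestricted mle
    s2_0 = np.mean(x ** 2)                    # mle of sigma^2 when mu is fixed at 0
    lam = 2 * (normal_loglik(x, mu_hat, s2_hat) - normal_loglik(x, 0.0, s2_0))
    pval = stats.chi2.sf(lam, df=1)
    return lam, pval, lam > stats.chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(1)
print(lrt_mean_zero(rng.normal(loc=0.5, scale=1.0, size=50)))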

11.8 Bibliographic Remarks

The most complete book on testing is Lehmann (1986). See also Casella and Berger (1990, Chapter 8). The FDR method is due to Benjamini and Hochberg (1995).

11.9 Exercises

1. Prove Theorem 11.13.

2. Prove Theorem 11.10.

3. Let X1, …, Xn ∼ Uniform(0, θ) and let Y = max{X1, …, Xn}. We want to test

H0: θ = 1/2 versus H1: θ > 1/2.

The Wald test is not appropriate since Y does not converge

to a Normal. Suppose we decide to test this hypothesis by

rejecting H0 when Y > c.

(a) Find the power function.

(b) What choice of c will make the size of the test .05?

(c) In a sample of size n = 20 with Y=0.48 what is the

p-value? What conclusion about H0 would you make?

(d) In a sample of size n = 20 with Y=0.52 what is the

p-value? What conclusion about H0 would you make?

4. There is a theory that people can postpone their death

until after an important event. To test the theory, Phillips


and King (1988) collected data on deaths around the Jew-

ish holiday Passover. Of 1919 deaths, 922 died the week

before the holiday and 997 died the week after. Think

of this as a binomial and test the null hypothesis that θ = 1/2. Report and interpret the p-value. Also construct a confidence interval for θ.

Reference:

Phillips, D.P. and King, E.W. (1988).

Death takes a holiday: Mortality surrounding major social

occasions.

The Lancet, 2, 728-732.

5. In 1861, 10 essays appeared in the New Orleans Daily

Crescent. They were signed "Quintus Curtius Snodgrass"

and some people suspected they were actually written by

Mark Twain. To investigate this, we will consider the pro-

portion of three letter words found in an author's work.

From eight Twain essays we have:

.225 .262 .217 .240 .230 .229 .235 .217

From 10 Snodgrass essays we have:

.209 .205 .196 .210 .202 .207 .224 .223 .220 .201

(source: Rice xxxx)

(a) Perform a Wald test for equality of the means. Use

the nonparametric plug-in estimator. Report the p-value

and a 95 per cent confidence interval for the difference of

means. What do you conclude?

(b) Now use a permutation test to avoid the use of large

sample methods. What is your conclusion?

6. Let X1, …, Xn ∼ N(θ, 1). Consider testing

H0: θ = 0 versus θ = 1.


Let the rejection region be R = {x^n : T(x^n) > c} where T(x^n) = n^{−1} Σ_{i=1}^n X_i.

(a) Find c so that the test has size α.

(b) Find the power under H1, i.e., find β(1).

(c) Show that β(1) → 1 as n → ∞.

7. Let θ̂ be the mle of a parameter θ and let ŝe = {n I(θ̂)}^{−1/2} where I(θ) is the Fisher information. Consider testing

H0: θ = θ0 versus θ ≠ θ0.

Consider the Wald test with rejection region R = {x^n : |Z| > z_{α/2}} where Z = (θ̂ − θ0)/ŝe. Let θ1 > θ0 be some alternative. Show that β(θ1) → 1.

8. Here are the number of elderly Jewish and Chinese women

who died just before and after the Chinese Harvest Moon

Festival.

Week Chinese Jewish

-2 55 141

-1 33 145

1 70 139

2 49 161

Compare the two mortality patterns.

9. A randomized, double-blind experiment was conducted to

assess the effectiveness of several drugs for reducing post-

operative nausea. The data are as follows.

Number of Patients Incidence of Nausea

Placebo 80 45

Chlorpromazine 75 26

Dimenhydrinate 85 52

Pentobarbital (100 mg) 67 35

Pentobarbital (150 mg) 85 37


(source: )

(a) Test each drug versus the placebo at the 5 per cent

level. Also, report the estimated odds-ratios. Summarize

your findings.

(b) Use the Bonferroni and the FDR method to adjust for

multiple testing.

10. Let X1, …, Xn ∼ Poisson(λ).

(a) Let λ0 > 0. Find the size α Wald test for

H0: λ = λ0 versus H1: λ ≠ λ0.

(b) (Computer Experiment.) Let λ0 = 1, n = 20 and α = .05. Simulate X1, …, Xn ∼ Poisson(λ0) and perform the Wald test. Repeat many times and count how often you reject the null. How close is the type I error rate to .05?


12

Bayesian Inference

12.1 The Bayesian Philosophy

The statistical theory and methods that we have discussed

so far are known as frequentist (or classical) inference. The

frequentist point of view is based on the following postulates:

(F1) Probability refers to limiting relative frequencies. Proba-

bilities are objective properties of the real world.

(F2) Parameters are fixed (usually unknown) constants. Because they are not fluctuating, no probability statements can

be made about parameters.

(F3) Statistical procedures should be designed to have well defined long run frequency properties. For example, a 95 per cent confidence interval should trap the true value of the parameter with limiting frequency at least 95 per cent.

There is another approach to inference called Bayesian inference. The Bayesian approach is based on the following postulates:


(B1) Probability describes degree of belief, not limiting fre-

quency. As such, we can make probability statements about lots

of things, not just data which are subject to random variation.

For example, I might say that "the probability that Albert Einstein drank a cup of tea on August 1, 1948" is .35. This does not refer to any limiting frequency. It reflects my strength of belief

that the proposition is true.

(B2) We can make probability statements about parameters,

even though they are �xed constants.

(B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this

distribution.

Bayesian inference is a controversial approach because it in-

herently embraces a subjective notion of probability. In general,

Bayesian methods provide no guarantees on long run perfor-

mance. The field of Statistics puts more emphasis on frequentist methods, although Bayesian methods certainly have a presence. Certain data mining and machine learning communities seem to

embrace Bayesian methods very strongly. Let's put aside philo-

sophical arguments for now and see how Bayesian inference is

done. We'll conclude this chapter with some discussion on the

strengths and weaknesses of each approach.

12.2 The Bayesian Method

Bayesian inference is usually carried out in the following way.

1. We choose a probability density f(θ), called the prior distribution, that expresses our degrees of belief about a parameter θ before we see any data.


2. We choose a statistical model f(x | θ) that reflects our beliefs about x given θ. Notice that we now write this as f(x | θ) instead of f(x; θ).

3. After observing data X1, …, Xn, we update our beliefs and form the posterior distribution f(θ | X1, …, Xn).

To see how the third step is carried out, first suppose that θ is discrete and that there is a single, discrete observation X. We should use a capital letter now to denote the parameter since we are treating it like a random variable, so let Θ denote the parameter. Now, in this discrete setting,

P(Θ = θ | X = x) = P(X = x, Θ = θ) / P(X = x) = P(X = x | Θ = θ) P(Θ = θ) / Σ_θ P(X = x | Θ = θ) P(Θ = θ)

which you may recognize from earlier in the course as Bayes' theorem. The version for continuous variables is obtained by using density functions:

f(θ | x) = f(x | θ) f(θ) / ∫ f(x | θ) f(θ) dθ.    (12.1)

If we have n iid observations X1, …, Xn, we replace f(x | θ) with f(x1, …, xn | θ) = ∏_{i=1}^n f(xi | θ). Let us write X^n to mean (X1, …, Xn) and x^n to mean (x1, …, xn). Then

f(θ | x^n) = f(x^n | θ) f(θ) / ∫ f(x^n | θ) f(θ) dθ = Ln(θ) f(θ) / ∫ Ln(θ) f(θ) dθ ∝ Ln(θ) f(θ).    (12.2)

In the right hand side of the last equation, we threw away the denominator ∫ Ln(θ) f(θ) dθ, which is a constant that does not depend on θ; we call this quantity the normalizing constant. We can summarize all this by writing:

"posterior is proportional to likelihood times prior."    (12.3)

You might wonder, doesn't it cause a problem to throw away the constant ∫ Ln(θ) f(θ) dθ? The answer is that we can always


recover the constant since we know that ∫ f(θ | x^n) dθ = 1.

Hence, we often omit the constant until we really need it.

What do we do with the posterior? First, we can get a point

estimate by summarizing the center of the posterior. Typically,

we use the mean or mode of the posterior. The posterior mean

is

θ̄n = ∫ θ f(θ | x^n) dθ = ∫ θ Ln(θ) f(θ) dθ / ∫ Ln(θ) f(θ) dθ.    (12.4)

We can also obtain a Bayesian interval estimate. Define a and b by ∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2. Let C = (a, b). Then

P(θ ∈ C | x^n) = ∫_a^b f(θ | x^n) dθ = 1 − α

so C is a 1 − α posterior interval.

Example 12.1 Let X1, …, Xn ∼ Bernoulli(p). Suppose we take the uniform distribution f(p) = 1 as a prior. By Bayes' theorem the posterior has the form

f(p | x^n) ∝ f(p) Ln(p) = p^s (1 − p)^{n−s} = p^{(s+1)−1} (1 − p)^{(n−s+1)−1}

where s = Σ_i x_i is the number of heads. Recall that a random variable has a Beta distribution with parameters α and β if its density is

f(p; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] p^{α−1} (1 − p)^{β−1}.

We see that the posterior for p is a Beta distribution with parameters s + 1 and n − s + 1. That is,

f(p | x^n) = [Γ(n + 2) / (Γ(s + 1) Γ(n − s + 1))] p^{(s+1)−1} (1 − p)^{(n−s+1)−1}.

We write this as

p | x^n ∼ Beta(s + 1, n − s + 1).


Notice that we have figured out the normalizing constant without actually doing the integral ∫ Ln(p) f(p) dp. The mean of a Beta(α, β) is α/(α + β), so the Bayes estimator is

p̄ = (s + 1)/(n + 2).

It is instructive to rewrite the estimator as

p̄ = λn p̂ + (1 − λn) p̃

where p̂ = s/n is the mle, p̃ = 1/2 is the prior mean and λn = n/(n + 2) ≈ 1 for large n. A 95 per cent posterior interval can be obtained by numerically finding a and b such that ∫_a^b f(p | x^n) dp = .95.

Suppose that instead of a uniform prior, we use the prior p ∼ Beta(α, β). If you repeat the calculations above, you will see that p | x^n ∼ Beta(α + s, β + n − s). The flat prior is just the special case with α = β = 1. The posterior mean is

p̄ = (α + s)/(α + β + n) = [ n/(α + β + n) ] p̂ + [ (α + β)/(α + β + n) ] p̄0

where p̄0 = α/(α + β) is the prior mean. ■
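A minimal sketch of this calculation (not from the text), in Python with NumPy/SciPy; the Bernoulli data below are made up.

import numpy as np
from scipy import stats

def beta_posterior(x, a=1.0, b=1.0):
    # Posterior for p under X_i ~ Bernoulli(p) with a Beta(a, b) prior
    # (a = b = 1 is the flat prior used in the example).
    x = np.asarray(x)
    s, n = x.sum(), len(x)
    post = stats.beta(a + s, b + n - s)
    mean = (a + s) / (a + b + n)              # posterior mean
    interval = post.ppf([0.025, 0.975])       # 95 per cent posterior interval
    return mean, interval

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # hypothetical coin flips
print(beta_posterior(x))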

In the previous example, the prior was a Beta distribution

and the posterior was a Beta distribution. When the prior and

the posterior are in the same family, we say that the prior is

conjugate.

Example 12.2 Let X1, …, Xn ∼ N(θ, σ²). For simplicity, let us assume that σ is known. Suppose we take as a prior θ ∼ N(a, b²). In problem 1 in the homework, it is shown that the posterior for θ is

θ | X^n ∼ N(θ̄, τ²)    (12.5)

where

θ̄ = w X̄ + (1 − w) a,


where

w = (1/se²) / ( 1/se² + 1/b² )    and    1/τ² = 1/se² + 1/b²,

and se = σ/√n is the standard error of the mle X̄. This is another example of a conjugate prior. Note that w → 1 and τ/se → 1 as n → ∞. So, for large n, the posterior is approximately N(θ̂, se²). The same is true if n is fixed but b → ∞, which corresponds to letting the prior become very flat.

Continuing with this example, let us find C = (c, d) such that P(θ ∈ C | X^n) = .95. We can do this by choosing c such that P(θ < c | X^n) = .025 and P(θ > d | X^n) = .025. So, we want to find c such that

P(θ < c | X^n) = P( (θ − θ̄)/τ < (c − θ̄)/τ | X^n ) = P( Z < (c − θ̄)/τ ) = .025.

Now, we know that P(Z < −1.96) = .025. So

(c − θ̄)/τ = −1.96

implying that c = θ̄ − 1.96 τ. By similar arguments, d = θ̄ + 1.96 τ. So a 95 per cent Bayesian interval is θ̄ ± 1.96 τ. Since θ̄ ≈ θ̂ and τ ≈ se, the 95 per cent Bayesian interval is approximated by θ̂ ± 1.96 se, which is the frequentist confidence interval. ■
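A minimal sketch of this conjugate Normal calculation (not from the text), in Python with NumPy; the data and the prior parameters a and b are hypothetical.

import numpy as np

def normal_posterior(x, sigma, a, b):
    # Posterior for theta under X_i ~ N(theta, sigma^2) with prior theta ~ N(a, b^2).
    x = np.asarray(x, dtype=float)
    se2 = sigma ** 2 / len(x)                  # se^2 of the mle X-bar
    w = (1 / se2) / (1 / se2 + 1 / b ** 2)     # weight on the data
    post_mean = w * x.mean() + (1 - w) * a
    post_var = 1 / (1 / se2 + 1 / b ** 2)      # tau^2
    return post_mean, post_var

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=1.0, size=25)
m, v = normal_posterior(x, sigma=1.0, a=0.0, b=10.0)
print(m, m - 1.96 * np.sqrt(v), m + 1.96 * np.sqrt(v))   # 95 per cent posterior interval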

12.3 Functions of Parameters

How do we make inferences about a function τ = g(θ)? Remember in Chapter 3 we solved the following problem: given the density f_X for X, find the density for Y = g(X). We now simply apply the same reasoning. The posterior cdf for τ is

H(τ | x^n) = P(g(θ) ≤ τ) = ∫_A f(θ | x^n) dθ


where A = {θ : g(θ) ≤ τ}. The posterior density is h(τ | x^n) = H′(τ | x^n).

Example 12.3 Let X1, …, Xn ∼ Bernoulli(p) and f(p) = 1 so that p | X^n ∼ Beta(s + 1, n − s + 1) with s = Σ_{i=1}^n x_i. Let ψ = log(p/(1 − p)). Then

H(ψ | x^n) = P(Ψ ≤ ψ | x^n) = P( log(P/(1 − P)) ≤ ψ | x^n )
  = P( P ≤ e^ψ/(1 + e^ψ) | x^n )
  = ∫_0^{e^ψ/(1+e^ψ)} f(p | x^n) dp
  = [Γ(n + 2) / (Γ(s + 1) Γ(n − s + 1))] ∫_0^{e^ψ/(1+e^ψ)} p^s (1 − p)^{n−s} dp

and

h(ψ | x^n) = H′(ψ | x^n)
  = [Γ(n + 2) / (Γ(s + 1) Γ(n − s + 1))] (e^ψ/(1 + e^ψ))^s (1/(1 + e^ψ))^{n−s} ∂/∂ψ ( e^ψ/(1 + e^ψ) )
  = [Γ(n + 2) / (Γ(s + 1) Γ(n − s + 1))] (e^ψ/(1 + e^ψ))^s (1/(1 + e^ψ))^{n−s} ( e^ψ/(1 + e^ψ)² )
  = [Γ(n + 2) / (Γ(s + 1) Γ(n − s + 1))] (e^ψ/(1 + e^ψ))^{s+1} (1/(1 + e^ψ))^{n−s+1}

for ψ ∈ ℝ. ■

12.4 Simulation

The posterior can often be approximated by simulation. Suppose we draw θ1, …, θB ∼ p(θ | x^n). Then a histogram of θ1, …, θB approximates the posterior density p(θ | x^n). An approximation


to the posterior mean θ̄n = E(θ | x^n) is B^{−1} Σ_{j=1}^B θj. The posterior 1 − α interval can be approximated by (θ_{α/2}, θ_{1−α/2}) where θ_{α/2} is the α/2 sample quantile of θ1, …, θB.

Once we have a sample θ1, …, θB from f(θ | x^n), let τi = g(θi). Then τ1, …, τB is a sample from f(τ | x^n). This avoids the need to do any analytical calculations. Simulation is discussed in more detail later in the book.

Example 12.4 Consider again Example 12.3. We can approximate the posterior for ψ without doing any calculus. Here are the steps:

1. Draw P1, …, PB ∼ Beta(s + 1, n − s + 1).

2. Let ψi = log(Pi/(1 − Pi)) for i = 1, …, B.

Now ψ1, …, ψB are iid draws from h(ψ | x^n). A histogram of these values provides an estimate of h(ψ | x^n). ■
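These two steps translate directly into code. A minimal sketch (not from the text), in Python with NumPy; the Bernoulli data are simulated only for illustration.

import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=30)                      # hypothetical Bernoulli data
s, n, B = x.sum(), len(x), 10_000

p_draws = rng.beta(s + 1, n - s + 1, size=B)         # step 1: draw from the posterior of p
psi_draws = np.log(p_draws / (1 - p_draws))          # step 2: transform each draw

print(psi_draws.mean())                              # approximate posterior mean of psi
print(np.quantile(psi_draws, [0.025, 0.975]))        # approximate 95 per cent posterior interval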

12.5 Large Sample Properties of Bayes' Procedures

In the Bernoulli and Normal examples we saw that the pos-

terior mean was close to the mle. This is true in greater gen-

erality.

Theorem 12.5 Under appropriate regularity conditions, we have that the posterior is approximately N(θ̂n, ŝe²) where θ̂n is the mle and ŝe = 1/√( n I(θ̂n) ). Hence, θ̄n ≈ θ̂n. Also, if Cn = (θ̂n − z_{α/2} ŝe, θ̂n + z_{α/2} ŝe) is the asymptotic frequentist 1 − α confidence interval, then Cn is also an approximate 1 − α Bayesian posterior interval:

P(θ ∈ Cn | X^n) → 1 − α.

There is also a Bayesian delta method. Let τ = g(θ). Then

τ | X^n ≈ N(τ̂, s̃e²)


where τ̂ = g(θ̂) and s̃e = ŝe |g′(θ̂)|.

12.6 Flat Priors, Improper Priors and "Noninformative" Priors

A big question in Bayesian inference is: where do you get the prior f(θ)? One school of thought, called "subjectivism," says that the prior should reflect our subjective opinion about θ before the data are collected. This may be possible in some cases but seems impractical in complicated problems, especially if there are many parameters. An alternative is to try to define some sort of "noninformative prior." An obvious candidate for a noninformative prior is to use a "flat" prior f(θ) ∝ constant. In the Bernoulli example, taking f(p) = 1 leads to p | X^n ∼ Beta(s + 1, n − s + 1), as we saw earlier, which seemed very reasonable. But unfettered use of flat priors raises some questions.

Improper Priors. Consider the N(θ, 1) example. Suppose we adopt a flat prior f(θ) ∝ c where c > 0 is a constant. Note that ∫ f(θ) dθ = ∞, so this is not a real probability density in the usual sense. We call such a prior an improper prior. Nonetheless, we can still carry out Bayes' theorem and compute the posterior density f(θ | X^n) ∝ Ln(θ) f(θ) ∝ Ln(θ). In the normal example, this gives θ | X^n ∼ N(X̄, 1/n), and the resulting point and interval estimators agree exactly with their frequentist counterparts. In general, improper priors are not a problem as long as the resulting posterior is a well defined probability distribution.

Flat Priors Are Not Invariant. Go back to the Bernoulli example and consider using the flat prior f(p) = 1. Recall that a flat prior presumably represents our lack of information about p before the experiment. Now let ψ = log(p/(1 − p)). This is a transformation and we can compute the resulting distribution


for ψ. It turns out that

f(ψ) = e^ψ / (1 + e^ψ)².

But one could argue that if we are ignorant about p then we are also ignorant about ψ, so shouldn't we use a flat prior for ψ? This contradicts the prior f(ψ) for ψ that is implied by using a flat prior for p. In short, the notion of a flat prior is not well-defined, because a flat prior on a parameter does not imply a flat prior on a transformed version of the parameter. Flat priors are not transformation invariant.

Jeffreys' Prior. Jeffreys came up with a "rule" for creating priors. The rule is: take f(θ) ∝ I(θ)^{1/2} where I(θ) is the Fisher information function. This rule turns out to be transfor-

mation invariant. There are various reasons for thinking that

this prior might be a useful prior but we will not go into details

here.

Example 12.6 Consider the Bernoulli(p). Recall that

I(p) = 1 / ( p(1 − p) ).

Jeffreys' rule says to use the prior

f(p) ∝ √I(p) = p^{−1/2} (1 − p)^{−1/2}.

This is a Beta(1/2, 1/2) density, which is very close to a uniform density.

In a multiparameter problem, the Jeffreys prior is defined to be f(θ) ∝ √det I(θ) where det(A) denotes the determinant of a matrix A.

12.7 Multiparameter Problems


In principle, multiparameter problems are handled the same way. Suppose that θ = (θ1, …, θp). The posterior density is still given by

p(θ | x^n) ∝ Ln(θ) f(θ).

The question now arises of how to extract inferences about one parameter. The key is to find the marginal posterior density for the parameter of interest. Suppose we want to make inferences about θ1. The marginal posterior for θ1 is

f(θ1 | x^n) = ∫ ⋯ ∫ f(θ1, …, θp | x^n) dθ2 … dθp.

In practice, it might not be feasible to do this integral. Simulation can help. Draw randomly from the posterior:

θ¹, …, θ^B ∼ f(θ | x^n)

where the superscripts index the different draws. Each θ^j is a vector θ^j = (θ^j_1, …, θ^j_p). Now collect together the first component of each draw:

θ¹_1, …, θ^B_1.

These are a sample from f(θ1 | x^n) and we have avoided doing any integrals.

Example 12.7 (Comparing two binomials.) Suppose we have n1 control patients and n2 treatment patients and that X1 control patients survive while X2 treatment patients survive. We want to estimate τ = g(p1, p2) = p2 − p1. Then,

X1 ∼ Binomial(n1, p1)   and   X2 ∼ Binomial(n2, p2).

Suppose we take f(p1, p2) = 1. The posterior is

f(p1, p2 | x1, x2) ∝ p1^{x1} (1 − p1)^{n1−x1} p2^{x2} (1 − p2)^{n2−x2}.

Notice that (p1, p2) live on a rectangle (a square, actually) and that

f(p1, p2 | x1, x2) = f(p1 | x1) f(p2 | x2)


where

f(p1 | x1) ∝ p1^{x1} (1 − p1)^{n1−x1}   and   f(p2 | x2) ∝ p2^{x2} (1 − p2)^{n2−x2},

which implies that p1 and p2 are independent under the posterior. Also, p1 | x1 ∼ Beta(x1 + 1, n1 − x1 + 1) and p2 | x2 ∼ Beta(x2 + 1, n2 − x2 + 1). If we simulate P_{1,1}, …, P_{1,B} ∼ Beta(x1 + 1, n1 − x1 + 1) and P_{2,1}, …, P_{2,B} ∼ Beta(x2 + 1, n2 − x2 + 1) then τ_b = P_{2,b} − P_{1,b}, b = 1, …, B, is a sample from f(τ | x1, x2). ■
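A minimal sketch of this simulation (not from the text), in Python with NumPy; the counts are hypothetical.

import numpy as np

def posterior_tau(x1, n1, x2, n2, B=10_000, seed=0):
    # Simulate the posterior of tau = p2 - p1 under the flat prior f(p1, p2) = 1.
    rng = np.random.default_rng(seed)
    p1 = rng.beta(x1 + 1, n1 - x1 + 1, size=B)
    p2 = rng.beta(x2 + 1, n2 - x2 + 1, size=B)
    tau = p2 - p1
    return tau.mean(), np.quantile(tau, [0.05, 0.95])   # posterior mean and 90 per cent interval

print(posterior_tau(x1=20, n1=40, x2=28, n2=40))        # hypothetical survival counts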

12.8 Strengths and Weaknesses of Bayesian

Inference

Bayesian inference is appealing when prior information is avail-

able since Bayes' theorem is a natural way to combine prior in-

formation with data. Some people �nd Bayesian inference psy-

chologically appealing because it allows us to make probability

statements about parameters. In contrast, frequentist inference

provides confidence sets Cn which trap the parameter 95 per cent of the time, but we cannot say that P(θ ∈ Cn | X^n) is .95. In the frequentist approach we can make probability statements about Cn, not θ. However, psychological appeal is not really an

argument for using one type of inference over another.

In parametric models, with large samples, Bayesian and fre-

quentist methods give approximately the same inferences. In

general, they need not agree. Consider the following example.

Example 12.8 Let X ∼ N(θ, 1) and suppose we use the prior θ ∼ N(0, τ²). From (12.5), the posterior is

θ | x ∼ N( x / (1 + 1/τ²), 1 / (1 + 1/τ²) ) = N(cx, c)

where c = τ²/(τ² + 1). A 1 − α posterior interval is C = (a, b) where

a = cx − √c z_{α/2}   and   b = cx + √c z_{α/2}.


Thus, P(θ ∈ C | X) = 1 − α. We can now ask, from a frequentist perspective, what is the coverage of C; that is, how often will this interval contain the true value? The answer is

P_θ(a < θ < b) = P_θ( cX − z_{α/2}√c < θ < cX + z_{α/2}√c )
  = P_θ( (θ − z_{α/2}√c − cθ)/c < X − θ < (θ + z_{α/2}√c − cθ)/c )
  = P_θ( (θ(1 − c) − z_{α/2}√c)/c < Z < (θ(1 − c) + z_{α/2}√c)/c )
  = Φ( (θ(1 − c) + z_{α/2}√c)/c ) − Φ( (θ(1 − c) − z_{α/2}√c)/c )

where Z ∼ N(0, 1). Figure 12.1 shows the coverage as a function of θ for τ = 1 and α = .05. Unless the true value of θ is close to 0, the coverage can be very small. Thus, upon repeated use, the Bayesian 95 per cent interval might contain the true value with frequency near 0! In contrast, a confidence interval has 95 per cent coverage no matter what the true value of θ is. ■
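A minimal sketch (not from the text) that evaluates this coverage formula in Python with SciPy, for τ = 1, α = .05 and a few values of θ.

import numpy as np
from scipy import stats

def coverage(theta, tau=1.0, alpha=0.05):
    # Frequentist coverage of the 1 - alpha Bayesian interval cx ± z√c, with c = tau²/(tau² + 1).
    c = tau ** 2 / (tau ** 2 + 1)
    z = stats.norm.ppf(1 - alpha / 2)
    hi = (theta * (1 - c) + z * np.sqrt(c)) / c
    lo = (theta * (1 - c) - z * np.sqrt(c)) / c
    return stats.norm.cdf(hi) - stats.norm.cdf(lo)

for theta in [0.0, 1.0, 2.0, 3.0, 5.0]:
    print(theta, coverage(theta))   # coverage falls far below .95 as theta moves away from 0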

What should we conclude from all this? The important thing

is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. It is worth remarking that it is possible to develop nonparametric Bayesian methods similar to plug-in estimation and the bootstrap. Be forewarned, how-

ever, that the frequency properties of nonparametric Bayesian

methods can sometimes be quite poor.

12.9 Appendix

Proof of Theorem 12.5.

It can be shown that the effect of the prior diminishes as n increases so that f(θ | X^n) ∝ Ln(θ) f(θ) ≈ Ln(θ). Hence,


[Figure: coverage (vertical axis, 0 to 1) plotted against the true value θ (horizontal axis, 0 to 5).]

FIGURE 12.1. Frequentist coverage of the 95 per cent Bayesian posterior interval as a function of the true value θ. The dotted line marks the 95 per cent level.


log f(θ | X^n) ≈ ℓ(θ). Now, ℓ(θ) ≈ ℓ(θ̂) + (θ − θ̂) ℓ′(θ̂) + [(θ − θ̂)²/2] ℓ″(θ̂) = ℓ(θ̂) + [(θ − θ̂)²/2] ℓ″(θ̂) since ℓ′(θ̂) = 0. Exponentiating, we get approximately that

f(θ | X^n) ∝ exp{ −(1/2) (θ − θ̂)² / σn² }

where σn² = −1/ℓ″(θ̂n). So the posterior of θ is approximately Normal with mean θ̂ and variance σn². Let ℓi = log f(Xi | θ); then

σn^{−2} = −ℓ″(θ̂n) = Σ_i −ℓ″_i(θ̂n) = n [ (1/n) Σ_i −ℓ″_i(θ̂n) ] ≈ n E_θ[ −ℓ″_i(θ̂n) ] = n I(θ̂n)

and hence σn ≈ ŝe(θ̂). ■

12.10 Bibliographic Remarks

Some references on Bayesian inference include Carlin and

Louis (1996), Gelman, Carlin, Stern and Rubin (1995), Lee

(1997), Robert (1994) and Schervish (1995). See Cox (1997), Di-

aconis and Freedman (1996), Freedman (2001), Barron, Schervish

and Wasserman (1999), Ghosal, Ghosh and van der Vaart (2001),

Shen and Wasserman (2001) and Zhao (2001) for discussions of

some of the technicalities of nonparametric Bayesian inference.

12.11 Exercises

1. Verify (12.5).

2. Let X1, …, Xn ∼ Normal(θ, 1).

(a) Simulate a data set (using θ = 5) consisting of n = 100 observations.

(b) Take f(θ) = 1 and find the posterior density. Plot the density.


(c) Simulate 1000 draws from the posterior. Plot a histogram of the simulated values and compare the histogram to the answer in (b).

(d) Let τ = e^θ. Find the posterior density for τ analytically and by simulation.

(e) Find a 95 per cent posterior interval for τ.

(f) Find a 95 per cent confidence interval for τ.

3. Let X1, …, Xn ∼ Uniform(0, θ). Let f(θ) ∝ 1/θ. Find the posterior density.

4. Suppose that 50 people are given a placebo and 50 are

given a new treatment. 30 placebo patients show improve-

ment while 40 treated patients show improvement. Let τ = p2 − p1 where p2 is the probability of improving under

treatment and p1 is the probability of improving under

placebo.

(a) Find the mle of τ. Find the standard error and 90 per cent confidence interval using the delta method.

(b) Find the standard error and 90 per cent confidence interval using the parametric bootstrap.

(c) Use the prior f(p1, p2) = 1. Use simulation to find the posterior mean and posterior 90 per cent interval for τ.

(d) Let

ψ = log( (p1/(1 − p1)) / (p2/(1 − p2)) )

be the log-odds ratio. Note that ψ = 0 if p1 = p2. Find the mle of ψ. Use the delta method to find a 90 per cent confidence interval for ψ.

(e) Use simulation to find the posterior mean and posterior 90 per cent interval for ψ.


5. Consider the Bernoulli(p) observations

0 1 0 1 0 0 0 0 0 0

Plot the posterior for p using these priors: Beta(1/2,1/2),

Beta(1,1), Beta(10,10), Beta(100,100).

6. Let X1, …, Xn ∼ Poisson(λ).

(a) Let λ ∼ Gamma(α, β) be the prior. Show that the posterior is also a Gamma. Find the posterior mean.

(b) Find the Jeffreys prior. Find the posterior.

[Chapter 13: Statistical Decision Theory. The text of this chapter is garbled beyond recovery in this extraction and is omitted here.]

Part III

Statistical Models and

Methods


14

Linear Regression

Regression is a method for studying the relationship between a response variable Y and a covariate X. The covariate is also called a predictor variable or a feature. Later, we will generalize and allow for more than one covariate. (The term "regression" is due to Sir Francis Galton (1822-1911), who noticed that tall and short men tend to have sons with heights closer to the mean. He called this "regression towards the mean.") The data are of the form

(Y1, X1), …, (Yn, Xn).

One way to summarize the relationship between X and Y is

through the regression function

r(x) = E (Y jX = x) =

Zy f(yjx)dy: (14.1)

Most of this chapter is concerned with estimating the regression

function.

14.1 Simple Linear Regression

The simplest version of regression is when Xi is simple (a

scalar not a vector) and r(x) is assumed to be linear:

r(x) = �0 + �1x:

This model is called the simple linear regression model. Let εᵢ = Yᵢ − (β₀ + β₁Xᵢ). Then,

E(εᵢ | Xᵢ) = E(Yᵢ − (β₀ + β₁Xᵢ) | Xᵢ) = E(Yᵢ | Xᵢ) − (β₀ + β₁Xᵢ)
           = r(Xᵢ) − (β₀ + β₁Xᵢ)
           = (β₀ + β₁Xᵢ) − (β₀ + β₁Xᵢ)
           = 0.

Let σ²(x) = V(εᵢ | X = x). We will make the further simplifying assumption that σ²(x) = σ² does not depend on x. We can thus write the linear regression model as follows.

Definition 14.1 The Linear Regression Model

Yᵢ = β₀ + β₁Xᵢ + εᵢ    (14.2)

where E(εᵢ | Xᵢ) = 0 and V(εᵢ | Xᵢ) = σ².

Example 14.2 Figure 14.1 shows a plot of log surface temperature (Y) versus log light intensity (X) for some nearby stars. Also on the plot is an estimated linear regression line which will be explained shortly.

The unknown parameters in the model are the intercept β₀ and the slope β₁ and the variance σ². Let β̂₀ and β̂₁ denote estimates of β₀ and β₁. The fitted line is defined to be

r̂(x) = β̂₀ + β̂₁x.    (14.3)

The predicted values or fitted values are Ŷᵢ = r̂(Xᵢ) and the residuals are defined to be

ε̂ᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ).    (14.4)

The residual sums of squares or rss is defined by

rss = Σᵢ₌₁ⁿ ε̂ᵢ².

[Figure 14.1 appears here: a scatterplot of log surface temperature (Y) against log light intensity (X) with the fitted regression line.]

FIGURE 14.1. Data on nearby stars.

The quantity rss measures how well the fitted line fits the data.

Definition 14.3 The least squares estimates are the values β̂₀ and β̂₁ that minimize rss = Σᵢ₌₁ⁿ ε̂ᵢ².

Theorem 14.4 The least squares estimates are given by

β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)(Yᵢ − Ȳₙ) / Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²,    (14.5)

β̂₀ = Ȳₙ − β̂₁X̄ₙ.    (14.6)

An unbiased estimate of σ² is

σ̂² = ( 1/(n−2) ) Σᵢ₌₁ⁿ ε̂ᵢ².    (14.7)

Example 14.5 Consider the star data from Example 14.2. The least squares estimates are β̂₀ = 3.58 and β̂₁ = 0.166. The fitted line r̂(x) = 3.58 + 0.166x is shown in Figure 14.1. ■
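The estimates in Theorem 14.4 are easy to compute directly. The following Python sketch (not part of the text) illustrates the formulas on simulated data; the sample size, true parameters and variable names are hypothetical choices of the sketch, not taken from the star data.

import numpy as np

rng = np.random.default_rng(0)

# simulated data (hypothetical): Y = beta0 + beta1*X + eps
n, beta0, beta1, sigma = 50, 3.0, 0.2, 0.1
X = rng.uniform(4.0, 5.5, size=n)
Y = beta0 + beta1 * X + rng.normal(0.0, sigma, size=n)

# least squares estimates, equations (14.5) and (14.6)
Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar

# fitted values, residuals and the unbiased variance estimate (14.7)
resid = Y - (b0 + b1 * X)
sigma2_hat = np.sum(resid ** 2) / (n - 2)
print(b0, b1, sigma2_hat)

With real data one would simply replace the simulated X and Y by the observed covariate and response vectors.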

Example 14.6 (The 2000 Presidential Election.) Figure 14.2 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

β̂₀ = 66.0991,  ŝe(β̂₀) = 17.2926
β̂₁ = 0.0035,   ŝe(β̂₁) = 0.0002.

The fitted line is

Buchanan = 66.0991 + 0.0035 Bush.

(We will see later how the standard errors were computed.) Figure 14.2 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers. Based on the residual plot, this is not the case in this example. If we repeat the analysis replacing votes with log(votes) we get

β̂₀ = −2.3298,  ŝe(β̂₀) = 0.3529
β̂₁ = 0.7303,   ŝe(β̂₁) = 0.0358.

This gives the fit

log(Buchanan) = −2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address two interesting questions: (1) how do we see if Palm Beach County has a statistically plausible outcome? (2) how do we do this problem nonparametrically? ■

14.2 Least Squares and Maximum Likelihood

Suppose we add the assumption that εᵢ | Xᵢ ∼ N(0, σ²), that is,

Yᵢ | Xᵢ ∼ N(μᵢ, σ²)

where μᵢ = β₀ + β₁Xᵢ. The likelihood function is

Πᵢ₌₁ⁿ f(Xᵢ, Yᵢ) = Πᵢ₌₁ⁿ f_X(Xᵢ) f_{Y|X}(Yᵢ | Xᵢ)
               = Πᵢ₌₁ⁿ f_X(Xᵢ) × Πᵢ₌₁ⁿ f_{Y|X}(Yᵢ | Xᵢ)
               = L₁ × L₂

where L₁ = Πᵢ₌₁ⁿ f_X(Xᵢ) and

L₂ = Πᵢ₌₁ⁿ f_{Y|X}(Yᵢ | Xᵢ).    (14.8)

[Figure 14.2 appears here: four panels showing Buchanan versus Bush votes, the residuals versus Bush votes, log Buchanan versus log Bush votes, and the residuals versus log Bush votes.]

FIGURE 14.2. Voting Data for Election 2000.

The term L₁ does not involve the parameters β₀ and β₁. We shall focus on the second term L₂ which is called the conditional likelihood, given by

L₂ ≡ L(β₀, β₁, σ) = Πᵢ₌₁ⁿ f_{Y|X}(Yᵢ | Xᵢ) ∝ σ⁻ⁿ exp{ −(1/(2σ²)) Σᵢ (Yᵢ − μᵢ)² }.

The conditional log-likelihood is

ℓ(β₀, β₁, σ) = −n log σ − (1/(2σ²)) Σᵢ₌₁ⁿ ( Yᵢ − (β₀ + β₁Xᵢ) )².    (14.9)

To find the mle of (β₀, β₁) we maximize ℓ(β₀, β₁, σ). From (14.9) we see that maximizing the likelihood is the same as minimizing the rss Σᵢ₌₁ⁿ ( Yᵢ − (β₀ + β₁Xᵢ) )². Therefore, we have shown the following.

Theorem 14.7 Under the assumption of Normality, the least squares estimator is also the maximum likelihood estimator.

We can also maximize ℓ(β₀, β₁, σ) over σ yielding the mle

σ̂² = (1/n) Σᵢ ε̂ᵢ².    (14.10)

This estimator is similar to, but not identical to, the unbiased estimator. Common practice is to use the unbiased estimator (14.7).

14.3 Properties of the Least Squares Estimators

We now record the standard errors and limiting distribution

of the least squares estimator. In regression problems, we usu-

ally focus on the properties of the estimators conditional on

Xn = (X1; : : : ; Xn). Thus, we state the means and variances as

conditional means and variances.

Theorem 14.8 Let β̂ = (β̂₀, β̂₁)ᵀ denote the least squares estimators. Then,

E(β̂ | Xⁿ) = (β₀, β₁)ᵀ,

V(β̂ | Xⁿ) = ( σ² / (n s²_X) ) [ (1/n) Σᵢ₌₁ⁿ Xᵢ²   −X̄ₙ
                                −X̄ₙ               1   ]    (14.11)

where s²_X = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)².

The estimated standard errors of β̂₀ and β̂₁ are obtained by taking the square roots of the corresponding diagonal terms of V(β̂ | Xⁿ) and inserting the estimate σ̂ for σ. Thus,

ŝe(β̂₀) = ( σ̂ / (s_X √n) ) √( Σᵢ₌₁ⁿ Xᵢ² / n ),    (14.12)

ŝe(β̂₁) = σ̂ / (s_X √n).    (14.13)

We should really write these as ŝe(β̂₀ | Xⁿ) and ŝe(β̂₁ | Xⁿ) but we will use the shorter notation ŝe(β̂₀) and ŝe(β̂₁).

Theorem 14.9 Under appropriate conditions we have:

1. (Consistency): β̂₀ → β₀ and β̂₁ → β₁ in probability.

2. (Asymptotic Normality):

(β̂₀ − β₀)/ŝe(β̂₀) ⇝ N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) ⇝ N(0, 1).

3. Approximate 1 − α confidence intervals for β₀ and β₁ are

β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁).    (14.14)

The Wald test for testing H₀: β₁ = 0 versus H₁: β₁ ≠ 0 is: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁).

Example 14.10 For the election data, on the log scale, a 95 per cent confidence interval for β₁ is 0.7303 ± 2(0.0358) = (0.66, 0.80), an interval that excludes 0. The Wald statistic for testing H₀: β₁ = 0 versus H₁: β₁ ≠ 0 is W = |0.7303 − 0|/0.0358 = 20.40 with a p-value of P(|Z| > 20.40) ≈ 0. This is strong evidence that the true slope is not 0. ■

14.4 Prediction

Suppose we have estimated a regression model r̂(x) = β̂₀ + β̂₁x from data (X₁, Y₁), …, (Xₙ, Yₙ). We observe the value X = x* of the covariate for a new subject and we want to predict their outcome Y*. An estimate of Y* is

Ŷ* = β̂₀ + β̂₁x*.    (14.15)

Using the formula for the variance of the sum of two random variables,

V(Ŷ*) = V(β̂₀ + β̂₁x*) = V(β̂₀) + x*² V(β̂₁) + 2x* Cov(β̂₀, β̂₁).

Theorem 14.8 gives the formulas for all the terms in this equation. The estimated standard error ŝe(Ŷ*) is the square root of this variance, with σ̂² in place of σ². However, the confidence interval for Y* is not of the usual form Ŷ* ± z_{α/2} ŝe. The appendix explains why. The correct form of the confidence interval is given in the following theorem. We call the interval a prediction interval.

Theorem 14.11 (Prediction Interval) Let

ξ̂ₙ² = ŝe²(Ŷ*) + σ̂² = σ̂² ( Σᵢ₌₁ⁿ (Xᵢ − x*)² / ( n Σᵢ (Xᵢ − X̄)² ) + 1 ).    (14.16)

An approximate 1 − α prediction interval for Y* is

Ŷ* ± z_{α/2} ξ̂ₙ.    (14.17)

Example 14.12 (Election Data Revisited.) On the log scale, our linear regression gives the following prediction equation: log(Buchanan) = −2.3298 + 0.7303 log(Bush). In Palm Beach, Bush had 152954 votes and Buchanan had 3467 votes. On the log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is −2.3298 + 0.7303(11.93789) = 6.388441. Now 8.151045 is bigger than 6.388441 but is it "significantly" bigger? Let us compute a confidence interval. We find that ξ̂ₙ = 0.093775 and the approximate 95 per cent confidence interval is (6.200, 6.578) which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from Ŷ*. Going back to the vote scale by exponentiating, the confidence interval is (493, 717) compared to the actual number of votes which is 3467. ■
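The quantities in Theorem 14.11 can be computed directly from the data. The Python sketch below is not from the text; the function name is hypothetical, and scipy is used only for the Normal quantile.

import numpy as np
from scipy.stats import norm

def prediction_interval(X, Y, x_star, alpha=0.05):
    """Approximate 1 - alpha prediction interval (14.16)-(14.17) for a new Y at X = x_star."""
    n = len(X)
    Xbar, Ybar = X.mean(), Y.mean()
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    b0 = Ybar - b1 * Xbar
    sigma2 = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)          # unbiased estimate (14.7)
    # xi_n^2 = sigma2 * [ sum_i (X_i - x*)^2 / (n * sum_i (X_i - Xbar)^2) + 1 ], (14.16)
    xi2 = sigma2 * (np.sum((X - x_star) ** 2) / (n * np.sum((X - Xbar) ** 2)) + 1.0)
    y_hat = b0 + b1 * x_star
    z = norm.ppf(1 - alpha / 2)
    return y_hat, (y_hat - z * np.sqrt(xi2), y_hat + z * np.sqrt(xi2))

Applied to the log-vote data with x_star equal to log Bush votes in Palm Beach, this gives the kind of interval reported in Example 14.12.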

14.5 Multiple Regression

Now suppose we have k covariates X₁, …, Xₖ. The data are of the form

(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Yₙ, Xₙ)

where

Xᵢ = (Xᵢ₁, …, Xᵢₖ).

Here, Xᵢ is the vector of k covariate values for the ith observation. The linear regression model is

Yᵢ = Σⱼ₌₁ᵏ βⱼ Xᵢⱼ + εᵢ    (14.18)

for i = 1, …, n, where E(εᵢ | X₁ᵢ, …, Xₖᵢ) = 0. Usually we want to include an intercept in the model which we can do by setting Xᵢ₁ = 1 for i = 1, …, n. At this point it will be more convenient to express the model in matrix notation. The outcomes will be denoted by

Y = (Y₁, Y₂, …, Yₙ)ᵀ

and the covariates will be denoted by

X = [ X₁₁ X₁₂ … X₁ₖ
      X₂₁ X₂₂ … X₂ₖ
      ⋮    ⋮        ⋮
      Xₙ₁ Xₙ₂ … Xₙₖ ].

Each row is one observation; the columns correspond to the k covariates. Thus, X is a (n × k) matrix. Let

β = (β₁, …, βₖ)ᵀ and ε = (ε₁, …, εₙ)ᵀ.

Then we can write (14.18) as

Y = Xβ + ε.    (14.19)

Theorem 14.13 Assuming that the (k × k) matrix XᵀX is invertible, the least squares estimate is

β̂ = (XᵀX)⁻¹XᵀY.    (14.20)

The estimated regression function is

r̂(x) = Σⱼ₌₁ᵏ β̂ⱼ xⱼ.    (14.21)

The variance-covariance matrix of β̂ is

V(β̂ | Xⁿ) = σ²(XᵀX)⁻¹.

Under appropriate conditions,

β̂ ≈ N(β, σ²(XᵀX)⁻¹).

An unbiased estimate of σ² is

σ̂² = ( 1/(n−k) ) Σᵢ₌₁ⁿ ε̂ᵢ²

where ε̂ = Xβ̂ − Y is the vector of residuals. An approximate 1 − α confidence interval for βⱼ is

β̂ⱼ ± z_{α/2} ŝe(β̂ⱼ)    (14.22)

where ŝe²(β̂ⱼ) is the jth diagonal element of the matrix σ̂²(XᵀX)⁻¹.
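A minimal Python sketch (not from the text) of the matrix computations in Theorem 14.13. It assumes the design matrix already contains a column of ones if an intercept is wanted; the function name and the 95 per cent intervals with the factor 2 are choices of the sketch.

import numpy as np

def ols(X, Y):
    """Least squares for Y = X beta + eps (Theorem 14.13)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y                    # (14.20)
    resid = Y - X @ beta_hat
    sigma2_hat = np.sum(resid ** 2) / (n - k)       # unbiased estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # estimated standard errors
    wald = beta_hat / se                            # Wald statistics for H0: beta_j = 0
    ci = np.column_stack([beta_hat - 2 * se, beta_hat + 2 * se])   # approx 95% intervals (14.22)
    return beta_hat, se, wald, ci

The Wald statistics returned here are what a regression program reports as "t values" in output like that of Example 14.14 below.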

Example 14.14 Crime data on 47 states in 1960 can be obtained at http://lib.stat.cmu.edu/DASL/Stories/USCrime.html. If we fit a linear regression of crime rate on 10 variables we get the following:

Covariate | Least Squares Estimate | Estimated Standard Error | t value | p-value

(Intercept) -589.39 167.59 -3.51 0.001 **

Age 1.04 0.45 2.33 0.025 *

Southern State 11.29 13.24 0.85 0.399

Education 1.18 0.68 1.7 0.093

Expenditures 0.96 0.25 3.86 0.000 ***

Labor 0.11 0.15 0.69 0.493

Number of Males 0.30 0.22 1.36 0.181

Population 0.09 0.14 0.65 0.518

Unemployment (14-24) -0.68 0.48 -1.4 0.165

Unemployment (25-39) 2.15 0.95 2.26 0.030 *

Wealth -0.08 0.09 -0.91 0.367

This table is typical of the output of a multiple regression pro-

gram. The \t-value" is the Wald test statistic for testing H0 :

�j = 0 versus H1 : �j 6= 0. The asterisks denote \degree of

signi�cance" with more asterisks being signi�cant at a smaller

level. The example raises several important questions. In partic-

ular: (1) should we eliminate some variables from this model?

(2) should we interpret this relationships as causal? For exam-

ple, should we conclude that low crime prevention expenditures

cause high crime rates? We will address question (1) in the next

section. We will not address question (2) until a later Chapter.

14.6 Model Selection

Example 14.14 illustrates a problem that often arises in mul-

tiple regression. We may have data on many covariates but we

may not want to include all of them in the model. A smaller

model with fewer covariates has two advantages: it might give

better predictions than a big model and it is more parsimonious

(simpler). Generally, as you add more variables to a regression,


the bias of the predictions decreases and the variance increases.

Too few covariates yields high bias; too many covariates yields

high variance. Good predictions result from achieving a good

balance between bias and variance.

In model selection there are two problems: (i) assigning a

\score" to each model which measures, in some sense, how good

the model is and (ii) searching through all the models to �nd

the model with the best score.

Let us first discuss the problem of scoring models. Let S ⊂ {1, …, k} and let X_S = {Xⱼ : j ∈ S} denote a subset of the covariates. Let β_S denote the coefficients of the corresponding set of covariates and let β̂_S denote the least squares estimate of β_S. Also, let X_S denote the X matrix for this subset of covariates and define r̂_S(x) to be the estimated regression function from (14.21). The predicted values from model S are denoted by Ŷᵢ(S) = r̂_S(Xᵢ). The prediction risk is defined to be

R(S) = Σᵢ₌₁ⁿ E( Ŷᵢ(S) − Yᵢ* )²    (14.23)

where Yᵢ* denotes the value of a future observation of Yᵢ at covariate value Xᵢ. Our goal is to choose S to make R(S) small. The training error is defined to be

R̂_tr(S) = Σᵢ₌₁ⁿ ( Ŷᵢ(S) − Yᵢ )².

This estimate is very biased and under-estimates R(S).

Theorem 14.15 The training error is a downward biased estimate of the prediction risk:

E( R̂_tr(S) ) < R(S).

In fact,

bias( R̂_tr(S) ) = E( R̂_tr(S) ) − R(S) = −2 Σᵢ₌₁ⁿ Cov(Ŷᵢ, Yᵢ).    (14.24)


The reason for the bias is that the data are being used twice:

to estimate the parameters and to estimate the risk. When

we �t a complex model with many parameters, the covariance

Cov(bYi; Yi) will be large and the bias of the training error gets

worse. In summary, the training error is a poor estimate of risk.

Here are some better estimates.

Mallows' Cp statistic is defined by

R̂(S) = R̂_tr(S) + 2|S|σ̂²    (14.25)

where |S| denotes the number of terms in S and σ̂² is the estimate of σ² obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction. This estimate is named in honor of Colin Mallows who invented it. The first term in (14.25) measures the fit of the model while the second measures the complexity of the model. Think of the Cp statistic as:

lack of fit + complexity penalty.

Thus, finding a good model involves trading off fit and complexity.

A related method for estimating risk is AIC (Akaike Information Criterion). The idea is to choose S to maximize

ℓ_S − |S|    (14.26)

where ℓ_S is the log-likelihood of the model evaluated at the mle. This can be thought of as "goodness of fit" minus "complexity." (Some texts use a slightly different definition of AIC which involves multiplying the definition here by 2 or −2. This has no effect on which model is selected.) In linear regression with Normal errors, maximizing AIC is equivalent to minimizing Mallows' Cp; see Exercise 8.

Yet another method for estimating risk is leave-one-out cross-validation. In this case, the risk estimator is

R̂_CV(S) = Σᵢ₌₁ⁿ ( Yᵢ − Ŷ₍ᵢ₎ )²    (14.27)

where Ŷ₍ᵢ₎ is the prediction for Yᵢ obtained by fitting the model with Yᵢ omitted. It can be shown that

R̂_CV(S) = Σᵢ₌₁ⁿ ( ( Yᵢ − Ŷᵢ(S) ) / ( 1 − Uᵢᵢ(S) ) )²    (14.28)

where Uᵢᵢ(S) is the ith diagonal element of the matrix

U(S) = X_S (X_Sᵀ X_S)⁻¹ X_Sᵀ.    (14.29)

Thus, one need not actually drop each observation and re-fit the model. A generalization is k-fold cross-validation. Here we divide the data into k groups; often people take k = 10. We omit one group of data and fit the models to the remaining data. We use the fitted model to predict the data in the group that was omitted. We then estimate the risk by Σᵢ (Yᵢ − Ŷᵢ)² where the sum is over the data points in the omitted group. This process is repeated for each of the k groups and the resulting risk estimates are averaged.
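The identity (14.28) makes leave-one-out cross-validation cheap to compute. The Python sketch below (not from the text) computes the training error, Mallows' Cp and the leave-one-out score for a submodel; the function name and the convention that S is a list of column indices of the full design matrix are assumptions of the sketch.

import numpy as np

def risk_scores(X_full, Y, S):
    """Training error, Mallows Cp (14.25) and leave-one-out CV (14.28) for the submodel
    that uses the columns in S. sigma^2 is estimated from the full model."""
    n, k = X_full.shape
    beta_full = np.linalg.solve(X_full.T @ X_full, X_full.T @ Y)
    sigma2 = np.sum((Y - X_full @ beta_full) ** 2) / (n - k)

    XS = X_full[:, S]
    U = XS @ np.linalg.inv(XS.T @ XS) @ XS.T          # hat matrix U(S), (14.29)
    Yhat = U @ Y
    R_tr = np.sum((Y - Yhat) ** 2)                    # training error
    Cp = R_tr + 2 * len(S) * sigma2                   # Mallows Cp
    R_cv = np.sum(((Y - Yhat) / (1 - np.diag(U))) ** 2)   # leave-one-out CV
    return R_tr, Cp, R_cv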

For linear regression, Mallows Cp and cross-validation often

yield essentially the same results so one might as well use Mal-

lows' method. In some of the more complex problems we will

discuss later, cross-validation will be more useful.

Another scoring method is BIC (Bayesian information criterion). Here we choose a model to maximize

BIC(S) = ℓ_S − (|S|/2) log n.    (14.30)

The BIC score has a Bayesian interpretation. Let S = {S₁, …, Sₘ} denote a set of models. Suppose we assign the prior P(Sⱼ) = 1/m over the models. Also, assume we put a smooth prior on the parameters within each model. It can be shown that the posterior probability for a model is approximately,

P(Sⱼ | data) ≈ e^{BIC(Sⱼ)} / Σᵣ e^{BIC(Sᵣ)}.

Hence, choosing the model with highest BIC is like choosing the

model with highest posterior probability. The BIC score also has

an information-theoretic interpretation in terms of something

called minimum description length. The BIC score is identical

to Mallows Cp except that it puts a more severe penalty for

complexity. It thus leads one to choose a smaller model than

the other methods.

Now let us turn to the problem of model search. If there are k

covariates then there are 2k possible models. We need to search

through all these models, assign a score to each one, and choose

the model with the best score. If k is not too large we can do

a complete search over all the models. When k is large, this is

infeasible. In that case we need to search over a subset of all

the models. Two common methods are forward and back-

ward stepwise regression. In forward stepwise regression, we

start with no covariates in the model. We then add the one vari-

able that leads to the best score. We continue adding variables

one at a time until the score does not improve. Backwards step-

wise regression is the same except that we start with the biggest

model and drop one variable at a time. Both are greedy searches; neither is guaranteed to find the model with the best score. An-

other popular method is to do random searching through the

set of all models. However, there is no reason to expect this to

be superior to a deterministic search.

Example 14.16 We apply backwards stepwise regression to the

crime data using AIC. The following was obtained from the pro-

gram R. This program uses minus our version of AIC. Hence, we are seeking the smallest possible AIC. This is the same as minimizing Mallows' Cp.

Page 258: All of Statistics: A Concise Course in Statistical Inference Brief

262 14. Linear Regression

The full model (which includes all covariates) has AIC= 310.37.

The AIC scores, in ascending order, for deleting one variable are

as follows:

variable Pop Labor South Wealth Males U1 Educ. U2 Age Expend

AIC 308 309 309 309 310 310 312 314 315 324

For example, if we dropped Pop from the model and kept the

other terms, then the AIC score would be 308. Based on this in-

formation we drop \population" from the model and the current

AIC score is 308. Now we consider dropping a variable from the

current model. The AIC scores are:

variable South Labor Wealth Males U1 Education U2 Age Expend

AIC 308 308 308 309 309 310 313 313 329

We then drop \Southern" from the model. This process is con-

tinued until there is no gain in AIC by dropping any variables.

In the end, we are left with the following model:

Crime = 1.2 Age + 0.75 Education + 0.87 Expenditure + 0.34 Males − 0.86 U1 + 2.31 U2.

Warning! This does not yet address the question of which vari-

ables are causes of crime.

14.7 The Lasso

There is an easier model search method although it addresses

a slightly di�erent question. The method, due to Tibshirani, is

called the Lasso. In this section we assume that the covari-

ates have all been rescaled to have the same variance. This

puts each covariate on the same scale. Consider estimating β = (β₁, …, βₖ) by minimizing the loss function

Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² + λ Σⱼ₌₁ᵏ |βⱼ|    (14.31)

where λ > 0. The idea is to minimize the sums of squares but we include a penalty that gets large if any of the βⱼ's are large. The solution β̂ = (β̂₁, …, β̂ₖ) can be found numerically and will depend on the choice of λ. It can be shown that some of the β̂ⱼ's will be 0. We interpret this as meaning that the jth covariate is omitted from the model. Hence, we are doing estimation and model selection simultaneously. We need to choose a value of λ. We can do this by estimating the prediction risk R(λ) as a function of λ and choosing λ to minimize the estimated risk. For example, we can estimate the risk using leave-one-out cross-validation.
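The text only says that the lasso solution is found numerically; one common way to do this, sketched below in Python and not taken from the text, is proximal gradient descent (iterative soft-thresholding). The function names, the step-size rule and the fixed iteration count are choices of the sketch, and the covariates are assumed to have been rescaled to the same variance as described above.

import numpy as np

def soft(u, c):
    # soft-thresholding operator
    return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

def lasso(X, Y, lam, n_iter=5000):
    """Minimize sum_i (Y_i - x_i^T beta)^2 + lam * sum_j |beta_j|, equation (14.31),
    by proximal gradient descent."""
    n, k = X.shape
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of the gradient
    step = 1.0 / L
    beta = np.zeros(k)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - Y)
        beta = soft(beta - step * grad, step * lam)
    return beta

Running this over a grid of λ values and scoring each fit by cross-validation gives curves of the kind shown in Figure 14.3.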

Example 14.17 Returning to the crime data, Figure 14.3 shows the results of the lasso. The first plot shows the leave-one-out cross-validation score as a function of λ. The minimum occurs at λ = 0.7. The second plot shows the estimated coefficients as a function of λ. You can see how some estimated parameters are zero until λ gets larger. At λ = 0.7, all the β̂ⱼ are non-zero so the Lasso chooses the full model.

14.8 Technical Appendix

Why is the prediction interval of a different form than the other confidence intervals we have seen? The reason is that the quantity we want to estimate, Y*, is not a fixed parameter, it is a random variable. To understand this point better, let θ = β₀ + β₁x* and let θ̂ = β̂₀ + β̂₁x*. Thus, Ŷ* = θ̂ while Y* = θ + ε. Now, θ̂ ≈ N(θ, se²) where

se² = V(θ̂) = V(β̂₀ + β̂₁x*).

Note that V(θ̂) is the same as V(Ŷ*). Now, θ̂ ± 2√V(θ̂) is a 95 per cent confidence interval for θ = β₀ + β₁x* using the usual

[Figure 14.3 appears here: two panels, the leave-one-out cross-validation score as a function of λ, and the estimated coefficients β̂(λ) as a function of λ.]

FIGURE 14.3. The Lasso applied to the crime data.


argument for a confidence interval. It is not a valid confidence interval for Y*. To see why, let us compute the probability that Ŷ* ± 2√V(Ŷ*) contains Y*. Let s = √V(Ŷ*). Then,

P( Ŷ* − 2s < Y* < Ŷ* + 2s )
  = P( −2 < (Ŷ* − Y*)/s < 2 )
  = P( −2 < (θ̂ − θ − ε)/s < 2 )
  = P( −2 < (θ̂ − θ)/s − ε/s < 2 )
  ≈ P( −2 < N(0, 1) − N(0, σ²/s²) < 2 )
  = P( −2 < N(0, 1 + σ²/s²) < 2 )
  ≠ 0.95.

The problem is that the quantity of interest Y* is equal to a parameter θ plus a random variable. We can fix this by defining

ξₙ² = V(Ŷ*) + σ² = ( Σᵢ (xᵢ − x*)² / ( n Σᵢ (xᵢ − x̄)² ) + 1 ) σ².

In practice, we substitute σ̂ for σ and we denote the resulting quantity by ξ̂ₙ. Now consider the interval Ŷ* ± 2ξ̂ₙ. Then,

P( Ŷ* − 2ξ̂ₙ < Y* < Ŷ* + 2ξ̂ₙ )
  = P( −2 < (Ŷ* − Y*)/ξ̂ₙ < 2 )
  = P( −2 < (θ̂ − θ − ε)/ξ̂ₙ < 2 )
  ≈ P( −2 < N(0, s² + σ²)/ξ̂ₙ < 2 )
  ≈ P( −2 < N(0, s² + σ²)/ξₙ < 2 )
  = P( −2 < N(0, 1) < 2 ) = 0.95.

Of course, a 1 − α interval is given by Ŷ* ± z_{α/2} ξ̂ₙ.

14.9 Exercises

1. Prove Theorem 14.4.

2. Prove the formulas for the standard errors in Theorem 14.8. You should regard the Xᵢ's as fixed constants.

3. Consider the regression through the origin model:

Yᵢ = βXᵢ + εᵢ.

Find the least squares estimate for β. Find the standard error of the estimate. Find conditions that guarantee that the estimate is consistent.

4. Prove equation (14.24).

5. In the simple linear regression model, construct a Wald test for H₀: β₁ = 17β₀ versus H₁: β₁ ≠ 17β₀.

6. Get the passenger car mileage data from

http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a) Fit a simple linear regression model to predict MPG

(miles per gallon) from HP (horsepower). Summarize your

analysis including a plot of the data with the �tted line.

(b) Repeat the analysis but use log(MPG) as the response.

Compare the analyses.

7. Get the passenger car mileage data from

http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a) Fit a multiple linear regression model to predict MPG

(miles per gallon) from HP (horsepower). Summarize your

analysis.


(b) Use Mallows' Cp to select a best sub-model. To search through the models try (i) all possible models, (ii) for-

ward stepwise, (iii) backward stepwise. Summarize your

�ndings.

(c) Repeat (b) but use BIC. Compare the results.

(d) Now use the Lasso and compare the results.

8. Assume that the errors are Normal. Show that the model

with highest AIC (equation (14.26)) is the model with the

lowest Mallows Cp statistic.

9. In this question we will take a closer look at the AIC method. Let X₁, …, Xₙ be iid observations. Consider two models M₀ and M₁. Under M₀ the data are assumed to be N(0, 1) while under M₁ the data are assumed to be N(θ, 1) for some unknown θ ∈ ℝ:

M₀ : X₁, …, Xₙ ∼ N(0, 1)
M₁ : X₁, …, Xₙ ∼ N(θ, 1), θ ∈ ℝ.

This is just another way to view the hypothesis testing problem: H₀: θ = 0 versus H₁: θ ≠ 0. Let ℓₙ(θ) be the log-likelihood function. The AIC score for a model is the log-likelihood at the mle minus the number of parameters. (Some people multiply this score by 2 but that is irrelevant.) Thus, the AIC score for M₀ is AIC₀ = ℓₙ(0) and the AIC score for M₁ is AIC₁ = ℓₙ(θ̂) − 1. Suppose we choose the model with the highest AIC score. Let Jₙ denote the selected model:

Jₙ = 0 if AIC₀ > AIC₁, and Jₙ = 1 if AIC₁ > AIC₀.

(a) Suppose that M₀ is the true model, i.e. θ = 0. Find

limₙ→∞ P(Jₙ = 0).

Now compute limₙ→∞ P(Jₙ = 0) when θ ≠ 0.

(b) The fact that limₙ→∞ P(Jₙ = 0) ≠ 1 when θ = 0 is why some people say that AIC "overfits." But this is not quite true as we shall now see. Let φ_θ(x) denote a Normal density function with mean θ and variance 1. Define

f̂ₙ(x) = φ₀(x) if Jₙ = 0, and f̂ₙ(x) = φ_θ̂(x) if Jₙ = 1.

If θ = 0, show that D(φ₀, f̂ₙ) → 0 in probability as n → ∞ where

D(f, g) = ∫ f(x) log( f(x)/g(x) ) dx

is the Kullback-Leibler distance. Show also that D(φ_θ, f̂ₙ) → 0 in probability if θ ≠ 0. Hence, AIC consistently estimates the true density even if it "overshoots" the correct model.

REMARK: If you are feeling ambitious, repeat this analysis for BIC which is the log-likelihood minus (p/2) log n where p is the number of parameters and n is sample size.


15

Multivariate Models

In this chapter we revisit the Multinomial model and the multivariate Normal, as we will need them in future chapters.

Let us first review some notation from linear algebra. Recall that if x and y are vectors then xᵀy = Σⱼ xⱼyⱼ. If A is a matrix then det(A) denotes the determinant of A, Aᵀ denotes the transpose of A and A⁻¹ denotes the inverse of A (if the inverse exists). The trace of a square matrix A, denoted by tr(A), is the sum of its diagonal elements. The trace satisfies tr(AB) = tr(BA) and tr(A + B) = tr(A) + tr(B). Also, tr(a) = a if a is a scalar (i.e. a real number). A matrix Σ is positive definite if xᵀΣx > 0 for all non-zero vectors x. If a matrix Σ is symmetric and positive definite, there exists a matrix Σ^{1/2}, called the square root of Σ, with the following properties:

1. Σ^{1/2} is symmetric;

2. Σ = Σ^{1/2}Σ^{1/2};

3. Σ^{1/2}Σ^{−1/2} = Σ^{−1/2}Σ^{1/2} = I where Σ^{−1/2} = (Σ^{1/2})⁻¹.

15.1 Random Vectors

Multivariate models involve a random vector X of the form

X = (X₁, …, Xₖ)ᵀ.

The mean of a random vector X is defined by

μ = (μ₁, …, μₖ)ᵀ = ( E(X₁), …, E(Xₖ) )ᵀ.    (15.1)

The covariance matrix Σ is defined to be

Σ = V(X) = [ V(X₁)        Cov(X₁, X₂)  …  Cov(X₁, Xₖ)
             Cov(X₂, X₁)  V(X₂)        …  Cov(X₂, Xₖ)
             ⋮             ⋮                  ⋮
             Cov(Xₖ, X₁)  Cov(Xₖ, X₂)  …  V(Xₖ)       ].    (15.2)

This is also called the variance matrix or the variance-covariance matrix.

Theorem 15.1 Let a be a vector of length k and let X be a random vector of the same length with mean μ and variance Σ. Then E(aᵀX) = aᵀμ and V(aᵀX) = aᵀΣa. If A is a matrix with k columns then E(AX) = Aμ and V(AX) = AΣAᵀ.

Now suppose we have a random sample of n vectors:

(X₁₁, X₂₁, …, Xₖ₁)ᵀ, (X₁₂, X₂₂, …, Xₖ₂)ᵀ, …, (X₁ₙ, X₂ₙ, …, Xₖₙ)ᵀ.    (15.3)

The sample mean X̄ is a vector defined by

X̄ = (X̄₁, …, X̄ₖ)ᵀ

where X̄ᵢ = n⁻¹ Σⱼ₌₁ⁿ Xᵢⱼ. The sample variance matrix, also called the covariance matrix or the variance-covariance matrix, is

S = [ s₁₁ s₁₂ … s₁ₖ
      s₁₂ s₂₂ … s₂ₖ
      ⋮    ⋮        ⋮
      s₁ₖ s₂ₖ … sₖₖ ]    (15.4)

where

s_ab = ( 1/(n−1) ) Σⱼ₌₁ⁿ (X_aj − X̄_a)(X_bj − X̄_b).

It follows that E(X̄) = μ and E(S) = Σ.

15.2 Estimating the Correlation

Consider n data points from a bivariate distribution:

(X₁₁, X₂₁), (X₁₂, X₂₂), …, (X₁ₙ, X₂ₙ).

Recall that the correlation between X₁ and X₂ is

ρ = E( (X₁ − μ₁)(X₂ − μ₂) ) / (σ₁σ₂).    (15.5)

The sample correlation (the plug-in estimator) is

ρ̂ = Σᵢ₌₁ⁿ (X₁ᵢ − X̄₁)(X₂ᵢ − X̄₂) / (s₁s₂).    (15.6)

We can construct a confidence interval for ρ by applying the delta method as usual. However, it turns out that we get a more accurate confidence interval by first constructing a confidence interval for a function θ = f(ρ) and then applying the inverse function f⁻¹. The method, due to Fisher, is as follows. Define

f(r) = (1/2)( log(1 + r) − log(1 − r) )

and let θ = f(ρ). The inverse of f is

g(z) ≡ f⁻¹(z) = (e^{2z} − 1) / (e^{2z} + 1).

Now do the following steps:

Approximate Confidence Interval for the Correlation

1. Compute

θ̂ = f(ρ̂) = (1/2)( log(1 + ρ̂) − log(1 − ρ̂) ).

2. Compute the approximate standard error of θ̂ which can be shown to be

ŝe(θ̂) = 1/√(n − 3).

3. An approximate 1 − α confidence interval for θ = f(ρ) is

(a, b) ≡ ( θ̂ − z_{α/2}/√(n − 3), θ̂ + z_{α/2}/√(n − 3) ).

4. Apply the inverse transformation f⁻¹(z) to get a confidence interval for ρ:

( (e^{2a} − 1)/(e^{2a} + 1), (e^{2b} − 1)/(e^{2b} + 1) ).
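The boxed procedure translates directly into code. The following Python sketch is not from the text; the function name is hypothetical and scipy is used only for the Normal quantile.

import numpy as np
from scipy.stats import norm

def correlation_ci(x1, x2, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the correlation via Fisher's transformation."""
    n = len(x1)
    r = np.corrcoef(x1, x2)[0, 1]                  # sample correlation (plug-in estimate)
    theta = 0.5 * (np.log(1 + r) - np.log(1 - r))  # theta_hat = f(rho_hat)
    z = norm.ppf(1 - alpha / 2)
    a, b = theta - z / np.sqrt(n - 3), theta + z / np.sqrt(n - 3)
    g = lambda u: (np.exp(2 * u) - 1) / (np.exp(2 * u) + 1)   # inverse transformation f^{-1}
    return r, (g(a), g(b))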

15.3 Multinomial

Let us now review the Multinomial distribution. Consider drawing a ball from an urn which has balls with k different colors labeled color 1, color 2, …, color k. Let p = (p₁, …, pₖ) where pⱼ ≥ 0 and Σⱼ₌₁ᵏ pⱼ = 1 and suppose that pⱼ is the probability of drawing a ball of color j. Draw n times (independent draws with replacement) and let X = (X₁, …, Xₖ) where Xⱼ is the number of times that color j appeared. Hence, n = Σⱼ₌₁ᵏ Xⱼ. We say that X has a Multinomial(n, p) distribution. The probability function is

f(x; p) = ( n choose x₁ … xₖ ) p₁^{x₁} ⋯ pₖ^{xₖ}

where

( n choose x₁ … xₖ ) = n! / (x₁! ⋯ xₖ!).

Theorem 15.2 Let X ∼ Multinomial(n, p). Then the marginal distribution of Xⱼ is Xⱼ ∼ Binomial(n, pⱼ). The mean and variance of X are

E(X) = (np₁, …, npₖ)ᵀ

and

V(X) = [ np₁(1−p₁)   −np₁p₂      …  −np₁pₖ
         −np₁p₂      np₂(1−p₂)   …  −np₂pₖ
         ⋮            ⋮                ⋮
         −np₁pₖ      −np₂pₖ      …  npₖ(1−pₖ) ].

Proof. That Xⱼ ∼ Binomial(n, pⱼ) follows easily. Hence, E(Xⱼ) = npⱼ and V(Xⱼ) = npⱼ(1−pⱼ). To compute Cov(Xᵢ, Xⱼ) we proceed as follows. Notice that Xᵢ + Xⱼ ∼ Binomial(n, pᵢ + pⱼ) and so V(Xᵢ + Xⱼ) = n(pᵢ + pⱼ)(1 − pᵢ − pⱼ). On the other hand,

V(Xᵢ + Xⱼ) = V(Xᵢ) + V(Xⱼ) + 2Cov(Xᵢ, Xⱼ)
           = npᵢ(1−pᵢ) + npⱼ(1−pⱼ) + 2Cov(Xᵢ, Xⱼ).

Equating this last expression with n(pᵢ + pⱼ)(1 − pᵢ − pⱼ) implies that Cov(Xᵢ, Xⱼ) = −npᵢpⱼ. ■

Theorem 15.3 The maximum likelihood estimator of p is

p̂ = (p̂₁, …, p̂ₖ)ᵀ = (X₁/n, …, Xₖ/n)ᵀ = X/n.

Proof. The log-likelihood (ignoring the constant) is

ℓ(p) = Σⱼ₌₁ᵏ Xⱼ log pⱼ.

When we maximize ℓ we have to be careful since we must enforce the constraint that Σⱼ pⱼ = 1. We use the method of Lagrange multipliers and instead maximize

A(p) = Σⱼ₌₁ᵏ Xⱼ log pⱼ + λ( Σⱼ pⱼ − 1 ).

Now

∂A(p)/∂pⱼ = Xⱼ/pⱼ + λ.

Setting ∂A(p)/∂pⱼ = 0 yields p̂ⱼ = −Xⱼ/λ. Since Σⱼ p̂ⱼ = 1 we see that λ = −n and hence p̂ⱼ = Xⱼ/n as claimed. ■

Next we would like to know the variability of the mle. We can either compute the variance matrix of p̂ directly or we can approximate the variability of the mle by computing the Fisher information matrix. These two approaches give the same answer in this case. The direct approach is easy: V(p̂) = V(X/n) = n⁻²V(X) and so

V(p̂) = (1/n) [ p₁(1−p₁)   −p₁p₂       …  −p₁pₖ
               −p₁p₂      p₂(1−p₂)    …  −p₂pₖ
               ⋮           ⋮                ⋮
               −p₁pₖ      −p₂pₖ       …  pₖ(1−pₖ) ].
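A short Python sketch (not from the text) illustrating Theorem 15.3 and the variance formula above on simulated counts; the sample size and cell probabilities are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical example: n draws from an urn with k = 3 colors
n, p = 100, np.array([0.2, 0.3, 0.5])
X = rng.multinomial(n, p)          # counts X = (X_1, ..., X_k)

p_hat = X / n                      # mle, Theorem 15.3
# V(p_hat) = (1/n) * (diag(p) - p p^T), estimated by plugging in p_hat
V_hat = (np.diag(p_hat) - np.outer(p_hat, p_hat)) / n
se = np.sqrt(np.diag(V_hat))       # standard errors of the p_hat_j
print(p_hat, se)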

15.4 Multivariate Normal

Let us recall how the multivariate Normal distribution is defined. To begin, let

Z = (Z₁, …, Zₖ)ᵀ

where Z₁, …, Zₖ ∼ N(0, 1) are independent. The density of Z is

f(z) = (2π)^{−k/2} exp{ −(1/2) Σⱼ₌₁ᵏ zⱼ² } = (2π)^{−k/2} exp{ −(1/2) zᵀz }.    (15.7)

The variance matrix of Z is the identity matrix I. We write Z ∼ N(0, I) where it is understood that 0 denotes a vector of k zeroes. We say that Z has a standard multivariate Normal distribution.

More generally, a vector X has a multivariate Normal distribution, denoted by X ∼ N(μ, Σ), if its density is

f(x; μ, Σ) = (2π)^{−k/2} det(Σ)^{−1/2} exp{ −(1/2)(x − μ)ᵀΣ⁻¹(x − μ) }    (15.8)

where det(Σ) denotes the determinant of a matrix, μ is a vector of length k and Σ is a k × k symmetric, positive definite matrix. Then E(X) = μ and V(X) = Σ. Setting μ = 0 and Σ = I gives back the standard Normal.

Theorem 15.4 The following properties hold:

1. If Z ∼ N(0, I) and X = μ + Σ^{1/2}Z then X ∼ N(μ, Σ).

2. If X ∼ N(μ, Σ), then Σ^{−1/2}(X − μ) ∼ N(0, I).

3. If X ∼ N(μ, Σ) and a is a vector of the same length as X, then aᵀX ∼ N(aᵀμ, aᵀΣa).

4. Let

V = (X − μ)ᵀΣ⁻¹(X − μ).

Then V ∼ χ²ₖ.

Suppose we partition a random Normal vector X into two parts X = (X_a, X_b). We can similarly partition the mean μ = (μ_a, μ_b) and the variance

Σ = [ Σ_aa  Σ_ab
      Σ_ba  Σ_bb ].

Theorem 15.5 Let X ∼ N(μ, Σ). Then:

(1) The marginal distribution of X_a is X_a ∼ N(μ_a, Σ_aa).

(2) The conditional distribution of X_b given X_a = x_a is

X_b | X_a = x_a ∼ N( μ(x_a), Σ(x_a) )

where

μ(x_a) = μ_b + Σ_ba Σ_aa⁻¹ (x_a − μ_a),    (15.9)
Σ(x_a) = Σ_bb − Σ_ba Σ_aa⁻¹ Σ_ab.    (15.10)
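Property 1 of Theorem 15.4 gives a way to simulate from N(μ, Σ), and (15.9)-(15.10) give the conditional distribution. The Python sketch below is not from the text; it uses the Cholesky factor in place of the symmetric square root Σ^{1/2}, which yields the same distribution, and the function names and index conventions are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def rmvnorm(n, mu, Sigma):
    """Draw n vectors from N(mu, Sigma) via X = mu + L Z, L the Cholesky factor of Sigma."""
    k = len(mu)
    Z = rng.normal(size=(n, k))
    L = np.linalg.cholesky(Sigma)
    return mu + Z @ L.T

def conditional(mu, Sigma, ia, ib, xa):
    """Mean and variance of X_b | X_a = xa for X ~ N(mu, Sigma), equations (15.9)-(15.10).
    ia and ib are the index sets of the partition."""
    Saa = Sigma[np.ix_(ia, ia)]
    Sab = Sigma[np.ix_(ia, ib)]
    Sba = Sigma[np.ix_(ib, ia)]
    Sbb = Sigma[np.ix_(ib, ib)]
    A = Sba @ np.linalg.inv(Saa)
    return mu[ib] + A @ (xa - mu[ia]), Sbb - A @ Sab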

Theorem 15.6 Given a random sample of size n from a N(μ, Σ), the log-likelihood (up to a constant not depending on μ or Σ) is given by

ℓ(μ, Σ) = −(n/2)(X̄ − μ)ᵀΣ⁻¹(X̄ − μ) − (n/2) tr(Σ⁻¹S) − (n/2) log det(Σ).

The mle is

μ̂ = X̄ and Σ̂ = ( (n−1)/n ) S.    (15.11)

Page 273: All of Statistics: A Concise Course in Statistical Inference Brief

15.5 Appendix 277

15.5 Appendix

Proof of Theorem 15.6. Denote the ith random vector by Xi.

The log-likelihood is

`(�;�) =Xi

f(X i; �;�) = �kn

2log(2�)�

n

2log det(�)�

1

2

Xi

(X i��)T��1(X i��):

Now,Xi

(X i � �)T��1(X i � �) =Xi

[(X i �X) + (X � �)]T��1[(X i �X) + (X � �)]

=Xi

�(X i �X)T��1(X i �X)

�+ n(X � �)T��1(X � �)

sinceP

i(Xi � X)��1(X � �) = 0. Also, notice that (X i �

�)T��1(X i � �) is a scalar, soXi

(X i � �)T��1(X i � �) =Xi

tr�(X i � �)T��1(X i � �)

�=

Xi

tr���1(X i � �)(X i � �)T

�= tr

"��1

Xi

(X i � �)(X i � �)T

#= n tr

���1S

�and the conclusion follows.

15.6 Exercises

1. Prove Theorem 15.1.

2. Find the Fisher information matrix for the mle of a Multinomial.

3. Prove Theorem 15.5.

4. (Computer Experiment.) Write a function to generate nsim observations from a Multinomial(n, p) distribution.

5. (Computer Experiment.) Write a function to generate nsim observations from a Multivariate Normal with given mean μ and covariance matrix Σ.

6. (Computer Experiment.) Generate 1000 random vectors from a N(μ, Σ) distribution where

μ = (3, 8)ᵀ,  Σ = [ 2 6
                    6 2 ].

Plot the simulation as a scatterplot. Find the distribution of X₂ | X₁ = x₁ using Theorem 15.5. In particular, what is the formula for E(X₂ | X₁ = x₁)? Plot E(X₂ | X₁ = x₁) on your scatterplot. Find the correlation ρ between X₁ and X₂. Compare this with the sample correlations from your simulation. Find a 95 per cent confidence interval for ρ. Estimate the covariance matrix Σ.

7. Generate 100 random vectors from a multivariate Normal with mean (0, 2)ᵀ and variance

[ 1 3
  3 1 ].

Find a 95 per cent confidence interval for the correlation ρ. What is the true value of ρ?


16

Inference about Independence

In this chapter we address the following questions:

(1) How do we test if two random variables are independent?

(2) How do we estimate the strength of dependence between two random variables?

When Y and Z are not independent, we say that they are dependent or associated or related. (Recall that we write Y ⫫ Z to mean that Y and Z are independent.) If Y and Z are associated, it does not imply that Y causes Z or that Z causes Y. If Y does cause Z then changing Y will change the distribution of Z, otherwise it will not change the distribution of Z.

For example, quitting smoking Y will reduce your probability of heart disease Z. In this case Y does cause Z. As another example, owning a TV Y is associated with having a lower incidence of starvation Z. This is because if you own a TV you are less likely to live in an impoverished nation. But giving a starving person a TV will not cause them to stop being hungry. In this case, Y and Z are associated but the relationship is not causal.

We defer detailed discussion of the important question of causation until a later chapter.

16.1 Two Binary Variables

Suppose that Y and Z are both binary. Consider a data set (Y₁, Z₁), …, (Yₙ, Zₙ). Represent the data as a two-by-two table:

          Y = 0   Y = 1
Z = 0     X₀₀     X₀₁     X₀·
Z = 1     X₁₀     X₁₁     X₁·
          X·₀     X·₁     n = X··

where the Xᵢⱼ represent counts:

Xᵢⱼ = number of observations for which Z = i and Y = j.

The dotted subscripts denote sums. For example, Xᵢ· = Σⱼ Xᵢⱼ. This is a convention we use throughout the remainder of the book. Denote the corresponding probabilities by:

          Y = 0   Y = 1
Z = 0     p₀₀     p₀₁     p₀·
Z = 1     p₁₀     p₁₁     p₁·
          p·₀     p·₁     1

where pᵢⱼ = P(Z = i, Y = j). Let X = (X₀₀, X₀₁, X₁₀, X₁₁) denote the vector of counts. Then X ∼ Multinomial(n, p) where p = (p₀₀, p₀₁, p₁₀, p₁₁). It is now convenient to introduce two new parameters.

Definition 16.1 The odds ratio is defined to be

ψ = p₀₀p₁₁ / (p₀₁p₁₀).    (16.1)

The log odds ratio is defined to be

γ = log(ψ).    (16.2)

Theorem 16.2 The following statements are equivalent:

1. Y ⫫ Z.
2. ψ = 1.
3. γ = 0.
4. For i, j ∈ {0, 1},

pᵢⱼ = pᵢ· p·ⱼ.    (16.3)

Now consider testing

H₀: Y ⫫ Z versus H₁: Y is not independent of Z.

First we consider the likelihood ratio test. Under H₁, X ∼ Multinomial(n, p) and the mle is p̂ = X/n. Under H₀, we again have that X ∼ Multinomial(n, p) but p is subject to the constraint

pᵢⱼ = pᵢ· p·ⱼ,  i, j = 0, 1.

This leads to the following test.

Theorem 16.3 (Likelihood Ratio Test for Independence in a 2-by-2 table) Let

T = 2 Σᵢ₌₀¹ Σⱼ₌₀¹ Xᵢⱼ log( XᵢⱼX·· / (Xᵢ·X·ⱼ) ).    (16.4)

Under H₀, T ⇝ χ²₁. Thus, an approximate level α test is obtained by rejecting H₀ when T > χ²₁,α.

Another popular test for independence is Pearson's χ² test.

Theorem 16.4 (Pearson's χ² test for Independence in a 2-by-2 table) Let

U = Σᵢ₌₀¹ Σⱼ₌₀¹ (Xᵢⱼ − Eᵢⱼ)² / Eᵢⱼ    (16.5)

where

Eᵢⱼ = Xᵢ·X·ⱼ / n.

Under H₀, U ⇝ χ²₁. Thus, an approximate level α test is obtained by rejecting H₀ when U > χ²₁,α.

Here is the intuition for the Pearson test. Under H₀, pᵢⱼ = pᵢ·p·ⱼ, so the maximum likelihood estimator of pᵢⱼ under H₀ is

p̂ᵢⱼ = p̂ᵢ· p̂·ⱼ = (Xᵢ·/n)(X·ⱼ/n).

Thus, the expected number of observations in the (i, j) cell is

Eᵢⱼ = np̂ᵢⱼ = Xᵢ·X·ⱼ / n.

The statistic U compares the observed and expected counts.

Example 16.5 The following data from Johnson and Johnson

(1972) relate tonsillectomy and Hodgkins disease. (The data are

actually from a case-control study; we postpone discussion of

this point until the next section.)

Hodgkins Disease No Disease

Tonsillectomy 90 165 255

No Tonsillectomy 84 307 391

Total 174 472 646

We would like to know if tonsillectomy is related to Hodgkins disease. The likelihood ratio statistic is T = 14.75 and the p-value is P(χ²₁ > 14.75) = 0.0001. The χ² statistic is U = 14.96 and the p-value is P(χ²₁ > 14.96) = 0.0001. We reject the null hypothesis of independence and conclude that tonsillectomy is associated with Hodgkins disease. This does not mean that tonsillectomies cause Hodgkins disease. Suppose, for example, that doctors gave tonsillectomies to the most seriously ill patients. Then the association between tonsillectomies and Hodgkins disease may be due to the fact that those with tonsillectomies were the most ill patients and hence more likely to have a serious disease.

We can also estimate the strength of dependence by estimating the odds ratio ψ and the log-odds ratio γ.

Theorem 16.6 The mle's of ψ and γ are

ψ̂ = X₀₀X₁₁ / (X₀₁X₁₀),  γ̂ = log ψ̂.    (16.6)

The asymptotic standard errors (computed from the delta method) are

ŝe(γ̂) = √( 1/X₀₀ + 1/X₀₁ + 1/X₁₀ + 1/X₁₁ ),    (16.7)

ŝe(ψ̂) = ψ̂ ŝe(γ̂).    (16.8)

Remark 16.7 For small sample sizes, ψ̂ and γ̂ can have a very large variance. In this case, we often use the modified estimator

ψ̂ = ( (X₀₀ + ½)(X₁₁ + ½) ) / ( (X₀₁ + ½)(X₁₀ + ½) ).    (16.9)

Yet another test for independence is the Wald test for γ = 0 given by W = (γ̂ − 0)/ŝe(γ̂). A 1 − α confidence interval for γ is γ̂ ± z_{α/2} ŝe(γ̂). A 1 − α confidence interval for ψ can be obtained in two ways. First, we could use ψ̂ ± z_{α/2} ŝe(ψ̂). Second, since ψ = e^γ we could use

exp{ γ̂ ± z_{α/2} ŝe(γ̂) }.    (16.10)

This second method is usually more accurate.


Example 16.8 In the previous example,

ψ̂ = (90 × 307) / (165 × 84) = 1.99

and

γ̂ = log(1.99) = 0.69.

So tonsillectomy patients were twice as likely to have Hodgkins disease. The standard error of γ̂ is

√( 1/90 + 1/84 + 1/165 + 1/307 ) = 0.18.

The Wald statistic is W = 0.69/0.18 = 3.84 whose p-value is P(|Z| > 3.84) = 0.0001, the same as the other tests. A 95 per cent confidence interval for γ is γ̂ ± 2(0.18) = (0.33, 1.05). A 95 per cent confidence interval for ψ is (e^{0.33}, e^{1.05}) = (1.39, 2.86).
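The tests and interval estimates of this section are simple to compute. The following Python sketch (not from the text) reproduces the calculations for the tonsillectomy table of Example 16.5; scipy is used only for the chi-squared tail probability and the Normal quantile.

import numpy as np
from scipy.stats import chi2, norm

# Tonsillectomy / Hodgkins data from Example 16.5, rows = tonsillectomy status
X = np.array([[90.0, 165.0],
              [84.0, 307.0]])
n = X.sum()
E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n        # expected counts X_i. X_.j / n

T = 2 * np.sum(X * np.log(X / E))                     # likelihood ratio statistic (16.4)
U = np.sum((X - E) ** 2 / E)                          # Pearson chi-squared statistic (16.5)
print(chi2.sf(T, 1), chi2.sf(U, 1))                   # p-values with 1 degree of freedom

psi_hat = (X[0, 0] * X[1, 1]) / (X[0, 1] * X[1, 0])   # odds ratio
gamma_hat = np.log(psi_hat)                           # log odds ratio
se_gamma = np.sqrt((1.0 / X).sum())                   # (16.7)
z = norm.ppf(0.975)
ci_gamma = (gamma_hat - z * se_gamma, gamma_hat + z * se_gamma)
ci_psi = tuple(np.exp(ci_gamma))                      # interval for psi via (16.10)
print(psi_hat, gamma_hat, ci_gamma, ci_psi)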

16.2 Interpreting The Odds Ratios

Suppose event A has probability P(A). The odds of A are defined as

odds(A) = P(A) / (1 − P(A)).

It follows that

P(A) = odds(A) / (1 + odds(A)).

Let E be the event that someone is exposed to something (smoking, radiation, etc.) and let D be the event that they get a disease. The odds of getting the disease given that you are exposed are

odds(D|E) = P(D|E) / (1 − P(D|E))

and the odds of getting the disease given that you are not exposed are

odds(D|Eᶜ) = P(D|Eᶜ) / (1 − P(D|Eᶜ)).

The odds ratio is defined to be

ψ = odds(D|E) / odds(D|Eᶜ).

If ψ = 1 then the disease probability is the same for exposed and unexposed. This implies that these events are independent. Recall that the log-odds ratio is defined as γ = log(ψ). Independence corresponds to γ = 0.

Consider this table of probabilities:

          Dᶜ      D
Eᶜ        p₀₀     p₀₁     p₀·
E         p₁₀     p₁₁     p₁·
          p·₀     p·₁     1

Denote the data by

          Dᶜ      D
Eᶜ        X₀₀     X₀₁     X₀·
E         X₁₀     X₁₁     X₁·
          X·₀     X·₁     X··

Now

P(D|E) = p₁₁ / (p₁₀ + p₁₁) and P(D|Eᶜ) = p₀₁ / (p₀₀ + p₀₁)

and so

odds(D|E) = p₁₁/p₁₀ and odds(D|Eᶜ) = p₀₁/p₀₀

and therefore,

ψ = p₁₁p₀₀ / (p₀₁p₁₀).

To estimate the parameters, we have to first consider how the data were collected. There are three methods.

Multinomial Sampling. We draw a sample from the population and, for each person, record their exposure and disease status. In this case, X = (X₀₀, X₀₁, X₁₀, X₁₁) ∼ Multinomial(n, p). We then estimate the probabilities in the table by p̂ᵢⱼ = Xᵢⱼ/n and

ψ̂ = p̂₁₁p̂₀₀ / (p̂₀₁p̂₁₀) = X₁₁X₀₀ / (X₀₁X₁₀).

Prospective Sampling. (Cohort Sampling.) We get some exposed and unexposed people and count the number with disease in each group. Thus,

X₀₁ ∼ Binomial(X₀·, P(D|Eᶜ)),
X₁₁ ∼ Binomial(X₁·, P(D|E)).

We should really write x₀· and x₁· instead of X₀· and X₁· since in this case, these are fixed not random but for notational simplicity I'll keep using capital letters. We can estimate P(D|E) and P(D|Eᶜ) but we cannot estimate all the probabilities in the table. Still, we can estimate ψ since ψ is a function of P(D|E) and P(D|Eᶜ). Now

P̂(D|E) = X₁₁/X₁· and P̂(D|Eᶜ) = X₀₁/X₀·.

Thus,

ψ̂ = X₁₁X₀₀ / (X₀₁X₁₀)

just as before.

Case-Control (Retrospective) Sampling. Here we get some diseased and non-diseased people and we observe how many are exposed. This is much more efficient if the disease is rare. Hence,

X₁₀ ∼ Binomial(X·₀, P(E|Dᶜ)),
X₁₁ ∼ Binomial(X·₁, P(E|D)).

From these data we can estimate P(E|D) and P(E|Dᶜ). Surprisingly, we can also still estimate ψ. To understand why, note that

P(E|D) = p₁₁/(p₀₁ + p₁₁),  1 − P(E|D) = p₀₁/(p₀₁ + p₁₁),  odds(E|D) = p₁₁/p₀₁.

By a similar argument,

odds(E|Dᶜ) = p₁₀/p₀₀.

Hence,

odds(E|D) / odds(E|Dᶜ) = p₁₁p₀₀ / (p₀₁p₁₀) = ψ.

From the data, we form the following estimates:

P̂(E|D) = X₁₁/X·₁,  1 − P̂(E|D) = X₀₁/X·₁,  the estimated odds(E|D) = X₁₁/X₀₁ and the estimated odds(E|Dᶜ) = X₁₀/X₀₀.

Therefore,

ψ̂ = X₀₀X₁₁ / (X₀₁X₁₀).

So in all three data collection methods, the estimate of ψ turns out to be the same.

It is tempting to try to estimate P(D|E) − P(D|Eᶜ). In a case-control design, this quantity is not estimable. To see this, we apply Bayes' theorem to get

P(D|E) − P(D|Eᶜ) = P(E|D)P(D)/P(E) − P(Eᶜ|D)P(D)/P(Eᶜ).

Because of the way we obtained the data, P(D) is not estimable from the data. However, we can estimate ξ = P(D|E)/P(D|Eᶜ), which is called the relative risk, under the rare disease assumption.

Theorem 16.9 Let ξ = P(D|E)/P(D|Eᶜ). Then

ψ/ξ → 1

as P(D) → 0.

Thus, under the rare disease assumption, the relative risk is approximately the same as the odds ratio and, as we have seen, we can estimate the odds ratio.

In a randomized experiment, we can interpret a strong association, that is ψ ≠ 1, as a causal relationship. In an observational (non-randomized) study, the association can be due to other unobserved confounding variables. We'll discuss causation in more detail later.

16.3 Two Discrete Variables

Now suppose that Y ∈ {1, …, I} and Z ∈ {1, …, J} are two discrete variables. The data can be represented as an I-by-J table of counts:

        Y = 1   Y = 2   …   Y = j   …   Y = J
Z = 1   X₁₁     X₁₂     …   X₁ⱼ     …   X₁J     X₁·
⋮
Z = i   Xᵢ₁     Xᵢ₂     …   Xᵢⱼ     …   XᵢJ     Xᵢ·
⋮
Z = I   X_I1    X_I2    …   X_Ij    …   X_IJ    X_I·
        X·₁     X·₂     …   X·ⱼ     …   X·J     n

Consider testing H₀: Y ⫫ Z versus H₁: Y is not independent of Z.

Theorem 16.10 Let

T = 2 Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ Xᵢⱼ log( XᵢⱼX·· / (Xᵢ·X·ⱼ) ).    (16.11)

The limiting distribution of T under the null hypothesis of independence is χ²_ν where ν = (I−1)(J−1). Pearson's χ² test statistic is

U = Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ (Xᵢⱼ − Eᵢⱼ)² / Eᵢⱼ.    (16.12)

Asymptotically, under H₀, U has a χ²_ν distribution where ν = (I−1)(J−1).

Example 16.11 These data are from a study by Hancock et al. (1979). Patients with Hodgkins disease are classified by their response to treatment and by histological type.

Type  Positive Response  Partial Response  No Response
LP    74                 18                12           104
NS    68                 16                12           96
MC    154                54                58           266
LD    18                 10                44           72

The χ² test statistic is 75.89 with 2 × 3 = 6 degrees of freedom. The p-value is P(χ²₆ > 75.89) ≈ 0. The likelihood ratio test statistic is 68.30 with 2 × 3 = 6 degrees of freedom. The p-value is P(χ²₆ > 68.30) ≈ 0. Thus there is strong evidence that response to treatment and histological type are associated.

There are a variety of ways to quantify the strength of dependence between two discrete variables Y and Z. Most of them are not very intuitive. The one we shall use is not standard but is more interpretable.

We define

δ(Y, Z) = max_{A,B} | P_{Y,Z}(Y ∈ A, Z ∈ B) − P_Y(Y ∈ A) P_Z(Z ∈ B) |    (16.13)

where the maximum is over all pairs of events A and B.

Theorem 16.12 Properties of δ:

1. 0 ≤ δ(Y, Z) ≤ 1.

2. δ(Y, Z) = 0 if and only if Y ⫫ Z.

3. The following identity holds:

δ(Y, Z) = (1/2) Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ | pᵢⱼ − pᵢ· p·ⱼ |.    (16.14)

4. The mle of δ is

δ̂(Y, Z) = (1/2) Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ | p̂ᵢⱼ − p̂ᵢ· p̂·ⱼ |    (16.15)

where

p̂ᵢⱼ = Xᵢⱼ/n,  p̂ᵢ· = Xᵢ·/n,  p̂·ⱼ = X·ⱼ/n.

The interpretation of δ is this: if one person makes probability statements assuming independence and another person makes probability statements without assuming independence, their probability statements may differ by as much as δ. Here is a suggested scale for interpreting δ:

0 < δ ≤ 0.01      negligible association
0.01 < δ ≤ 0.05   non-negligible association
0.05 < δ ≤ 0.10   substantial association
δ > 0.10          very strong association

A confidence interval for δ can be obtained by bootstrapping. The steps for bootstrapping are:

1. Draw X* ∼ Multinomial(n, p̂).

2. Compute p̂*ᵢⱼ, p̂*ᵢ·, p̂*·ⱼ.

3. Compute δ*.

4. Repeat.

Now we use any of the methods we learned earlier for constructing bootstrap confidence intervals. However, we should not use a Wald interval in this case. The reason is that if Y ⫫ Z then δ = 0 and we are on the boundary of the parameter space. In this case, the Wald method is not valid.
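A Python sketch (not from the text) of the bootstrap just described, reporting a pivotal interval; the function names and the number of bootstrap replications are choices of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def delta(counts):
    """delta(Y, Z) = 0.5 * sum_ij | p_ij - p_i. p_.j |, equation (16.14)."""
    p = counts / counts.sum()
    return 0.5 * np.abs(p - np.outer(p.sum(axis=1), p.sum(axis=0))).sum()

def delta_ci(X, B=10000, alpha=0.05):
    """Pivotal bootstrap interval for delta from a table of counts X."""
    n = int(X.sum())
    d_hat = delta(X.astype(float))
    p_hat = (X / n).ravel()
    boot = np.empty(B)
    for b in range(B):
        Xb = rng.multinomial(n, p_hat).reshape(X.shape).astype(float)
        boot[b] = delta(Xb)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return d_hat, (2 * d_hat - hi, 2 * d_hat - lo)    # pivotal interval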

Example 16.13 Returning to Example 16.11 we find that δ̂ = 0.11. Using a pivotal bootstrap with 10,000 bootstrap samples, a 95 per cent confidence interval for δ is (0.09, 0.22). Our conclusion is that the association between histological type and response is substantial.

16.4 Two Continuous Variables

Now suppose that Y and Z are both continuous. If we assume that the joint distribution of Y and Z is bivariate Normal, then we measure the dependence between Y and Z by means of the correlation coefficient ρ. Tests, estimates and confidence intervals for ρ in the Normal case are given in the previous chapter. If we do not assume Normality then we need a nonparametric method for assessing dependence.

Recall that the correlation is

ρ = E( (X₁ − μ₁)(X₂ − μ₂) ) / (σ₁σ₂).

A nonparametric estimator of ρ is the plug-in estimator which is

ρ̂ = Σᵢ₌₁ⁿ (X₁ᵢ − X̄₁)(X₂ᵢ − X̄₂) / √( Σᵢ₌₁ⁿ (X₁ᵢ − X̄₁)² Σᵢ₌₁ⁿ (X₂ᵢ − X̄₂)² )


which is just the sample correlation. A con�dence interval can be

constructed using the bootstrap. A test for � = 0 can be based

on the Wald test using the bootstrap to estimate the standard

error.

The plug-in approach is useful for large samples. For small samples, we measure the correlation using the Spearman rank correlation coefficient ρ̂_S. We simply replace the data by their ranks, ranking each variable separately, then we compute the correlation coefficient of the ranks. To test the null hypothesis that ρ_S = 0 we need the distribution of ρ̂_S under the null hypothesis. This can be obtained easily by simulation. We fix the ranks of the first variable as 1, 2, …, n. The ranks of the second variable are chosen at random from the set of n! possible orderings. Then we compute the correlation. This is repeated many times and the resulting distribution P₀ is the null distribution of ρ̂_S. The p-value for the test is P₀(|R| > |ρ̂_S|) where R is drawn from the null distribution P₀.

Example 16.14 The following data (Snedecor and Cochran, 1980, p. 191) are systolic blood pressure X1 and diastolic blood pressure X2 in millimeters:

X1 100 105 110 110 120 120 125 130 130 150 170 195

X2 65 65 75 70 78 80 75 82 80 90 95 90

The sample correlation is ρ̂ = 0.88. The bootstrap standard error is 0.046 and the Wald test statistic is 0.88/0.046 = 19.23. The p-value is near 0 giving strong evidence that the correlation is not 0. The 95 per cent pivotal bootstrap confidence interval is (0.78, 0.94). Because the sample size is small, consider Spearman's rank correlation. In ranking the data, we will use average ranks if there are ties. So if the third and fourth lowest numbers are the same, they each get rank 3.5. The ranks of the data are:


X1 1 2 3.5 3.5 5.5 5.5 7 8.5 8.5 10 11 12

X2 1.5 1.5 4.5 3 6 7.5 4.5 9 7.5 10.5 12 10.5

The rank correlation is ρ̂_S = 0.94 and the p-value for testing the null hypothesis that there is no correlation is P₀(|R| > 0.94) ≈ 0, which was obtained by simulation.
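The simulation described above is easy to carry out. A Python sketch (not from the text), applied to the blood pressure data of Example 16.14; the tie-handling helper and the number of simulated orderings are choices of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def rank_avg(x):
    """Ranks with ties replaced by average ranks, as in the example."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        ranks[x == v] = ranks[x == v].mean()
    return ranks

def spearman_test(x1, x2, nsim=20000):
    r1, r2 = rank_avg(x1), rank_avg(x2)
    rho_s = np.corrcoef(r1, r2)[0, 1]
    # null distribution: random orderings of the second variable's ranks
    null = np.array([np.corrcoef(r1, rng.permutation(r2))[0, 1] for _ in range(nsim)])
    return rho_s, np.mean(np.abs(null) >= abs(rho_s))

x1 = np.array([100, 105, 110, 110, 120, 120, 125, 130, 130, 150, 170, 195.])
x2 = np.array([65, 65, 75, 70, 78, 80, 75, 82, 80, 90, 95, 90.])
print(spearman_test(x1, x2))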

16.5 One Continuous Variable and One Discrete

Suppose that Y ∈ {1, …, I} is discrete and Z is continuous. Let Fᵢ(z) = P(Z ≤ z | Y = i) denote the cdf of Z conditional on Y = i.

Theorem 16.15 When Y ∈ {1, …, I} is discrete and Z is continuous, then Y ⫫ Z if and only if F₁ = ⋯ = F_I.

It follows from the previous theorem that to test for independence, we need to test

H₀: F₁ = ⋯ = F_I versus H₁: not H₀.

For simplicity, we consider the case where I = 2. To test the null hypothesis that F₁ = F₂ we will use the two sample Kolmogorov-Smirnov test. Let n₁ denote the number of observations for which Yᵢ = 1 and let n₂ denote the number of observations for which Yᵢ = 2. Let

F̂₁(z) = (1/n₁) Σᵢ₌₁ⁿ I(Zᵢ ≤ z) I(Yᵢ = 1)

and

F̂₂(z) = (1/n₂) Σᵢ₌₁ⁿ I(Zᵢ ≤ z) I(Yᵢ = 2)

denote the empirical distribution function of Z given Y = 1 and Y = 2 respectively. Define the test statistic

D = sup_x | F̂₁(x) − F̂₂(x) |.


Theorem 16.16 Let
$$H(t) = 1 - 2\sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 t^2}.$$
Under the null hypothesis that $F_1 = F_2$,
$$\lim_{n\to\infty} \mathbb{P}\left(\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, D \le t\right) = H(t).$$

It follows from the theorem that an approximate level $\alpha$ test is obtained by rejecting $H_0$ when
$$\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, D > H^{-1}(1-\alpha).$$
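A minimal sketch of this test in Python, using the limiting distribution $H$ of Theorem 16.16 to approximate the p-value, might look as follows. The function name is an illustrative choice; a library version is available as scipy.stats.ks_2samp.

import numpy as np

def ks_two_sample(z, y, alpha=0.05, n_terms=100):
    """Two-sample Kolmogorov-Smirnov test of H0: F1 = F2 for data z
    labelled by y in {1, 2}, using the limiting distribution H(t)."""
    z, y = np.asarray(z, dtype=float), np.asarray(y)
    z1, z2 = z[y == 1], z[y == 2]
    n1, n2 = len(z1), len(z2)
    grid = np.sort(z)                      # the sup is attained at data points
    F1 = np.searchsorted(np.sort(z1), grid, side="right") / n1
    F2 = np.searchsorted(np.sort(z2), grid, side="right") / n2
    D = np.max(np.abs(F1 - F2))
    t = np.sqrt(n1 * n2 / (n1 + n2)) * D
    j = np.arange(1, n_terms + 1)
    # p-value = 1 - H(t); the series is only reliable for t not too close to 0
    p_value = float(min(1.0, max(0.0, 2 * np.sum((-1) ** (j - 1) * np.exp(-2 * j**2 * t**2)))))
    return D, t, p_value, p_value < alpha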

16.6 Bibliographic Remarks

Johnson, S.K. and Johnson, R.E. (1972). New England Journal of Medicine, 287, 1122-1125.

Hancock, B.W. (1979). Clinical Oncology, 5, 283-297.

16.7 Exercises

1. Prove Theorem 16.2.

2. Prove Theorem 16.3.

3. Prove Theorem 16.9.

4. Prove equation (16.14).

5. The New York Times (January 8, 2003, page A12) reported the following data on death sentencing and race from a study in Maryland. (The data here are an approximate re-creation using the information in the article.)

Death Sentence No Death Sentence

Black Victim 14 641

White Victim 62 594


Analyze the data using the tools from this Chapter. In-

terpret the results. Explain why, based only on this infor-

mation, you can't make causal conclusions. (The authors

of the study did use much more information in their full

report.)

6. Analyze the data on the variables Age and Financial Status from:

http://lib.stat.cmu.edu/DASL/Datafiles/montanadat.html

7. Estimate the correlation between temperature and latitude using the data from

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

Use the correlation coefficient and the Spearman rank correlation. Provide estimates, tests and confidence intervals.

8. Test whether calcium intake and drop in blood pressure are associated. Use the data in

http://lib.stat.cmu.edu/DASL/Datafiles/Calcium.html


17

Undirected Graphs and Conditional Independence

Graphical models are a class of multivariate statistical models that are useful for representing independence relations. They are also useful for developing parsimonious models for multivariate data.

To see why parsimonious models are useful in the multivariate

setting, consider the problem of estimating the joint distribu-

tion of several discrete random variables. Two binary variables

Y1 and Y2 can be represented as a two-by-two table which corre-

sponds to a multinomial with four categories. Similarly, k binary

variables Y1; : : : ; Yk correspond to a multinomial with N = 2k

categories. When k is even moderately large, N = 2k will be

huge. It can be shown in that case that the mle is a poor es-

timator. The reason is that the data are sparse: there are not

enough data to estimate so many parameters. Graphical mod-

els often require fewer parameters and may lead to estimators

with smaller risk. There are two main types of graphical models:

undirected and directed. Here, we introduce undirected graphs.

We'll discuss directed graphs later.


17.1 Conditional Independence

Underlying graphical models is the concept of conditional inde-

pendence.

Definition 17.1 Let $X$, $Y$ and $Z$ be discrete random variables. We say that $X$ and $Y$ are conditionally independent given $Z$, written $X \amalg Y \mid Z$, if
$$\mathbb{P}(X = x, Y = y \mid Z = z) = \mathbb{P}(X = x \mid Z = z)\,\mathbb{P}(Y = y \mid Z = z) \qquad (17.1)$$
for all $x, y, z$. If $X$, $Y$ and $Z$ are continuous random variables, we say that $X$ and $Y$ are conditionally independent given $Z$ if
$$f_{X,Y \mid Z}(x, y \mid z) = f_{X \mid Z}(x \mid z)\, f_{Y \mid Z}(y \mid z)$$
for all $x$, $y$ and $z$.

Intuitively, this means that, once you know Z, Y provides no

extra information about X.

The conditional independence relation satis�es some basic

properties.

Theorem 17.2 The following implications hold:
$$X \amalg Y \mid Z \;\Longrightarrow\; Y \amalg X \mid Z$$
$$X \amalg Y \mid Z \text{ and } U = h(X) \;\Longrightarrow\; U \amalg Y \mid Z$$
$$X \amalg Y \mid Z \text{ and } U = h(X) \;\Longrightarrow\; X \amalg Y \mid (Z, U)$$
$$X \amalg Y \mid Z \text{ and } X \amalg W \mid (Y, Z) \;\Longrightarrow\; X \amalg (W, Y) \mid Z$$
$$X \amalg Y \mid Z \text{ and } X \amalg Z \mid Y \;\Longrightarrow\; X \amalg (Y, Z).$$

The last property requires the assumption that all events have positive probability; the first four do not.


FIGURE 17.1. A graph with vertices $V = \{X, Y, Z\}$. The edge set is $E = \{(X, Y), (Y, Z)\}$.

17.2 Undirected Graphs

An undirected graph $G = (V, E)$ has a finite set $V$ of vertices (or nodes) and a set $E$ of edges (or arcs) consisting of pairs of vertices. The vertices correspond to random variables $X, Y, Z, \ldots$ and edges are written as unordered pairs. For example, $(X, Y) \in E$ means that $X$ and $Y$ are joined by an edge. An example of a graph is in Figure 17.1.

Two vertices are adjacent, written $X \sim Y$, if there is an edge between them. In Figure 17.1, $X$ and $Y$ are adjacent but $X$ and $Z$ are not adjacent. A sequence $X_0, \ldots, X_n$ is called a path if $X_{i-1} \sim X_i$ for each $i$. In Figure 17.1, $X, Y, Z$ is a path. A graph is complete if there is an edge between every pair of vertices. A subset $U \subset V$ of vertices together with their edges is called a subgraph.

If $A$, $B$ and $C$ are three distinct subsets of $V$, we say that $C$ separates $A$ and $B$ if every path from a variable in $A$ to a variable in $B$ intersects a variable in $C$. In Figure 17.2, $\{Y, W\}$ and $\{Z\}$ are separated by $\{X\}$. Also, $W$ and $Z$ are separated by $\{X, Y\}$.
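Separation is a purely graph-theoretic notion and can be checked mechanically. The following short Python sketch (with an illustrative function name, not anything from the text) removes the separating set and looks for a remaining path; the demonstration uses the edges of Figure 17.1, which are stated explicitly in its caption.

from collections import deque

def separates(edges, A, B, C):
    """Check whether C separates A and B in the undirected graph with
    the given edge list: every path from A to B must pass through C."""
    A, B, C = set(A), set(B), set(C)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # breadth-first search from A, never entering vertices in C
    seen = set(A) - C
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen and v not in C:
                seen.add(v)
                queue.append(v)
    return not (seen & B)

# Figure 17.1: edge set {(X, Y), (Y, Z)}
edges = [("X", "Y"), ("Y", "Z")]
print(separates(edges, {"X"}, {"Z"}, {"Y"}))   # True: Y separates X and Z
print(separates(edges, {"X"}, {"Z"}, set()))   # False: X and Z are connected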


FIGURE 17.2. $\{Y, W\}$ and $\{Z\}$ are separated by $\{X\}$. Also, $W$ and $Z$ are separated by $\{X, Y\}$.

FIGURE 17.3. $X \amalg Z \mid Y$.

17.3 Probability and Graphs

Let $V$ be a set of random variables with distribution $\mathbb{P}$. Construct a graph with one vertex for each random variable in $V$. Suppose we omit the edge between a pair of variables if they are independent given the rest of the variables:
$$\text{no edge between } X \text{ and } Y \iff X \amalg Y \mid \text{rest}$$
where "rest" refers to all the other variables besides $X$ and $Y$. This type of graph is called a pairwise Markov graph. Some examples are shown in Figures 17.3, 17.4, 17.5 and 17.6.


FIGURE 17.4. No implied independence relations.

FIGURE 17.5. $X \amalg Z \mid \{Y, W\}$ and $Y \amalg W \mid \{X, Z\}$.

FIGURE 17.6. Pairwise independence implies that $X \amalg Z \mid \{Y, W\}$. But is $X \amalg Z \mid Y$?


The graph encodes a set of pairwise conditional independence relations. These relations imply other conditional independence relations. How can we figure out what they are? Fortunately, we can read these other conditional independence relations directly from the graph as well, as explained in the next theorem.

Theorem 17.3 Let $G = (V, E)$ be a pairwise Markov graph for a distribution $\mathbb{P}$. Let $A$, $B$ and $C$ be distinct subsets of $V$ such that $C$ separates $A$ and $B$. Then $A \amalg B \mid C$.

Remark 17.4 If $A$ and $B$ are not connected (i.e. there is no path from $A$ to $B$) then we may regard $A$ and $B$ as being separated by the empty set. Then Theorem 17.3 implies that $A \amalg B$.

The independence condition in Theorem 17.3 is called the global Markov property. We thus see that the pairwise and global Markov properties are equivalent. Let us state this more precisely. Given a graph $G$, let $M_{\rm pair}(G)$ be the set of distributions which satisfy the pairwise Markov property: thus $\mathbb{P} \in M_{\rm pair}(G)$ if, under $\mathbb{P}$, $X \amalg Y \mid \text{rest}$ if and only if there is no edge between $X$ and $Y$. Let $M_{\rm global}(G)$ be the set of distributions which satisfy the global Markov property: thus $\mathbb{P} \in M_{\rm global}(G)$ if, under $\mathbb{P}$, $A \amalg B \mid C$ if and only if $C$ separates $A$ and $B$.

Theorem 17.5 Let $G$ be a graph. Then $M_{\rm pair}(G) = M_{\rm global}(G)$.

This theorem allows us to construct graphs using the simpler pairwise property and then deduce other independence relations using the global Markov property. Think how hard this would be to do algebraically. Returning to Figure 17.6, we now see that $X \amalg Z \mid Y$ and $Y \amalg W \mid Z$.

Example 17.6 Figure 17.7 implies that $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$.

Example 17.7 Figure 17.8 implies that $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$.


FIGURE 17.7. $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$.

FIGURE 17.8. $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$.


17.4 Fitting Graphs to Data

Given a data set, how do we find a graphical model that fits the data? Some authors have devoted whole books to this subject. We will only treat the discrete case, and we will consider a method based on log-linear models, which are the subject of the next chapter.

17.5 Bibliographic Remarks

Thorough treatments of undirected graphs can be found in Whittaker (1990) and Lauritzen (1996). Some of the exercises below are adapted from Whittaker (1990).

17.6 Exercises

1. Consider random variables $(X_1, X_2, X_3)$. In each of the following cases, draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2$.

(b) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$.

(c) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$ and $X_2 \amalg X_3 \mid X_1$.

2. Consider random variables $(X_1, X_2, X_3, X_4)$. In each of the following cases, draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_1 \amalg X_4 \mid X_2, X_3$ and $X_2 \amalg X_4 \mid X_1, X_3$.


FIGURE 17.9. (A graph on $X_1, X_2, X_3, X_4$.)

FIGURE 17.10. (A graph on $X_1, X_2, X_3, X_4$.)

(b) $X_1 \amalg X_2 \mid X_3, X_4$ and $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_3 \mid X_1, X_4$.

(c) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_4 \mid X_1, X_3$.

3. A conditional independence between a pair of variables is minimal if it is not possible to use the Separation Theorem to eliminate any variable from the conditioning set, i.e. from the right hand side of the bar (Whittaker 1990). Write down the minimal conditional independencies from: (a) Figure 17.9; (b) Figure 17.10; (c) Figure 17.11; (d) Figure 17.12.

4. Here are breast cancer data on diagnostic center (X1),

nuclear grade (X2), and survival (X3):


FIGURE 17.11. (A graph on $X_1, X_2, X_3, X_4$.)

FIGURE 17.12. (A graph on $X_1, \ldots, X_6$.)


                      X2:  malignant  malignant  benign  benign
                      X3:  died       survived   died    survived
X1:  Boston                35         59         47      112
     Glamorgan             42         77         26      76

(a) Treat this as a multinomial and find the maximum likelihood estimator.

(b) If someone has a tumour classified as benign at the Glamorgan clinic, what is the estimated probability that they will die? Find the standard error for this estimate.

(c) Test the following hypotheses:

$X_1 \amalg X_2 \mid X_3$ versus $X_1 \not\amalg X_2 \mid X_3$

$X_1 \amalg X_3 \mid X_2$ versus $X_1 \not\amalg X_3 \mid X_2$

$X_2 \amalg X_3 \mid X_1$ versus $X_2 \not\amalg X_3 \mid X_1$

Based on the results of your tests, draw and interpret the

resulting graph.


18

Loglinear Models

In this chapter we study loglinear models which are useful for

modelling multivariate discrete data. There is a strong connec-

tion between loglinear models and undirected graphs. Parts of

this Chapter draw on the material in Whittaker (1990).

18.1 The Loglinear Model

Let $X = (X_1, \ldots, X_m)$ be a random vector with probability function
$$f(x) = \mathbb{P}(X = x) = \mathbb{P}(X_1 = x_1, \ldots, X_m = x_m)$$
where $x = (x_1, \ldots, x_m)$. Let $r_j$ be the number of values that $X_j$ takes. Without loss of generality, we can assume that $X_j \in \{0, 1, \ldots, r_j - 1\}$. Suppose now that we have $n$ such random vectors. We can think of the data as a sample from a Multinomial with $N = r_1 \times r_2 \times \cdots \times r_m$ categories. The data can be represented as counts in an $r_1 \times r_2 \times \cdots \times r_m$ table. Let $p = (p_1, \ldots, p_N)$ denote the multinomial parameter.


Let $S = \{1, \ldots, m\}$. Given a vector $x = (x_1, \ldots, x_m)$ and a subset $A \subset S$, let $x_A = (x_j : j \in A)$. For example, if $A = \{1, 3\}$ then $x_A = (x_1, x_3)$.

Theorem 18.1 The joint probability function $f(x)$ of a single random vector $X = (X_1, \ldots, X_m)$ can be written as
$$\log f(x) = \sum_{A \subset S} \psi_A(x) \qquad (18.1)$$
where the sum is over all subsets $A$ of $S = \{1, \ldots, m\}$ and the $\psi$'s satisfy the following conditions:

1. $\psi_\emptyset(x)$ is a constant;

2. For every $A \subset S$, $\psi_A(x)$ is only a function of $x_A$ and not the rest of the $x_j$'s.

3. If $i \in A$ and $x_i = 0$, then $\psi_A(x) = 0$.

The formula in equation (18.1) is called the log-linear expansion of $f$. Note that this is the probability function for a single draw. Each $\psi_A(x)$ will depend on some unknown parameters $\beta_A$. Let $\beta = (\beta_A : A \subset S)$ be the set of all these parameters. We will write $f(x) = f(x; \beta)$ when we want to emphasize the dependence on the unknown parameters $\beta$.

In terms of the multinomial, the parameter space is
$$\mathcal{P} = \Big\{ p = (p_1, \ldots, p_N) : p_j \ge 0, \; \sum_{j=1}^N p_j = 1 \Big\}.$$
This is an $N-1$ dimensional space. In the log-linear representation, the parameter space is
$$\Theta = \Big\{ \beta = (\beta_1, \ldots, \beta_N) : \beta = \beta(p), \; p \in \mathcal{P} \Big\}$$
where $\beta(p)$ is the set of $\beta$ values associated with $p$. The set $\Theta$ is an $N-1$ dimensional surface in $\mathbb{R}^N$. We can always go back and forth between the two parameterizations: we can write $\beta = \beta(p)$ and $p = p(\beta)$.


Example 18.2 Let $X \sim \text{Bernoulli}(p)$ where $0 < p < 1$. We can write the probability mass function for $X$ as
$$f(x) = p^x (1-p)^{1-x} = p_1^x \, p_2^{1-x}$$
for $x = 0, 1$, where $p_1 = p$ and $p_2 = 1 - p$. Hence,
$$\log f(x) = \psi_\emptyset(x) + \psi_1(x)$$
where
$$\psi_\emptyset(x) = \log(p_2), \qquad \psi_1(x) = x \log\left(\frac{p_1}{p_2}\right).$$
Notice that $\psi_\emptyset(x)$ is a constant (as a function of $x$) and $\psi_1(x) = 0$ when $x = 0$. Thus the three conditions of the Theorem hold. The loglinear parameters are
$$\beta_0 = \log(p_2), \qquad \beta_1 = \log\left(\frac{p_1}{p_2}\right).$$
The original, multinomial parameter space is $\mathcal{P} = \{(p_1, p_2) : p_j \ge 0, \; p_1 + p_2 = 1\}$. The log-linear parameter space is
$$\Theta = \{(\beta_0, \beta_1) \in \mathbb{R}^2 : e^{\beta_0 + \beta_1} + e^{\beta_0} = 1\}.$$
Given $(p_1, p_2)$ we can solve for $(\beta_0, \beta_1)$. Conversely, given $(\beta_0, \beta_1)$ we can solve for $(p_1, p_2)$.
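Going back and forth between the two parameterizations is a one-line computation. Here is a small numerical check in Python; the particular value $p = 0.3$ is an arbitrary illustrative choice.

import numpy as np

p = 0.3                                   # hypothetical Bernoulli parameter
p1, p2 = p, 1 - p
beta0, beta1 = np.log(p2), np.log(p1 / p2)

# Inverting: p2 = exp(beta0) and p1 = exp(beta0 + beta1).
p2_back, p1_back = np.exp(beta0), np.exp(beta0 + beta1)
print(np.allclose([p1, p2], [p1_back, p2_back]))                 # True
print(np.isclose(np.exp(beta0 + beta1) + np.exp(beta0), 1.0))    # the constraint defining Theta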

Example 18.3 Let $X = (X_1, X_2)$ where $X_1 \in \{0, 1\}$ and $X_2 \in \{0, 1, 2\}$. The joint distribution of $n$ such random vectors is a multinomial with 6 categories. The multinomial parameters can be written as a 2-by-3 table as follows:

              x2 = 0   x2 = 1   x2 = 2
    x1 = 0    p00      p01      p02
    x1 = 1    p10      p11      p12


The $n$ data vectors can be summarized as counts:

              x2 = 0   x2 = 1   x2 = 2
    x1 = 0    C00      C01      C02
    x1 = 1    C10      C11      C12

For $x = (x_1, x_2)$, the log-linear expansion takes the form
$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_{12}(x)$$
where
$$\psi_\emptyset(x) = \log p_{00}$$
$$\psi_1(x) = x_1 \log\left(\frac{p_{10}}{p_{00}}\right)$$
$$\psi_2(x) = I(x_2 = 1) \log\left(\frac{p_{01}}{p_{00}}\right) + I(x_2 = 2) \log\left(\frac{p_{02}}{p_{00}}\right)$$
$$\psi_{12}(x) = I(x_1 = 1, x_2 = 1) \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right) + I(x_1 = 1, x_2 = 2) \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).$$
Convince yourself that the three conditions on the $\psi$'s of the theorem are satisfied. The six parameters of this model are:
$$\beta_1 = \log p_{00}, \quad \beta_2 = \log\left(\frac{p_{10}}{p_{00}}\right), \quad \beta_3 = \log\left(\frac{p_{01}}{p_{00}}\right), \quad \beta_4 = \log\left(\frac{p_{02}}{p_{00}}\right), \quad \beta_5 = \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right), \quad \beta_6 = \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).$$

The next Theorem gives an easy way to check for conditional independence in a loglinear model.

Theorem 18.4 Let $(X_a, X_b, X_c)$ be a partition of the vector $(X_1, \ldots, X_m)$. Then $X_b \amalg X_c \mid X_a$ if and only if all the $\psi$-terms in the log-linear expansion that have at least one coordinate in $b$ and one coordinate in $c$ are 0.

To prove this Theorem, we will use the following Lemma, whose proof follows easily from the definition of conditional independence.


Lemma 18.5 A partition $(X_a, X_b, X_c)$ satisfies $X_b \amalg X_c \mid X_a$ if and only if $f(x_a, x_b, x_c) = g(x_a, x_b)\, h(x_a, x_c)$ for some functions $g$ and $h$.

Proof. (Theorem 18.4.) Suppose that $\psi_t$ is 0 whenever $t$ has coordinates in both $b$ and $c$. Hence, $\psi_t$ is 0 unless $t \subset a \cup b$ or $t \subset a \cup c$. Therefore
$$\log f(x) = \sum_{t \subset a \cup b} \psi_t(x) + \sum_{t \subset a \cup c} \psi_t(x) - \sum_{t \subset a} \psi_t(x).$$
Exponentiating, we see that the joint density is of the form $g(x_a, x_b)\, h(x_a, x_c)$. By Lemma 18.5, $X_b \amalg X_c \mid X_a$. The converse follows by reversing the argument.
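Theorem 18.4 reduces a conditional independence question to a purely combinatorial check on the nonzero $\psi$-terms. Here is a minimal Python sketch of that check; the function name and the way terms are encoded (as tuples of coordinate indices) are illustrative choices.

def cond_indep(present_terms, b, c):
    """Theorem 18.4 as a set computation: given the index sets A with
    nonzero psi_A, X_b and X_c are conditionally independent given the
    remaining coordinates iff no nonzero term touches both b and c."""
    b, c = set(b), set(c)
    return all(not (set(A) & b and set(A) & c) for A in present_terms)

# Example 18.3: psi_0, psi_1, psi_2, psi_12 are all present,
# so X1 and X2 are not independent.
terms = [(), (1,), (2,), (1, 2)]
print(cond_indep(terms, {1}, {2}))                 # False
# Dropping psi_12 gives independence of X1 and X2.
print(cond_indep([(), (1,), (2,)], {1}, {2}))      # True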

18.2 Graphical Log-Linear Models

A log-linear model is graphical if missing terms correspond only to conditional independence constraints.

Definition 18.6 Let $\log f(x) = \sum_{A \subset S} \psi_A(x)$ be a log-linear model. Then $f$ is graphical if all $\psi$-terms are nonzero except for any pair of coordinates not in the edge set for some graph $G$. In other words, $\psi_A(x) = 0$ if and only if $\{i, j\} \subset A$ and $(i, j)$ is not an edge.

Here is a way to think about the definition above:

If you can add a term to the model and the graph does not change, then the model is not graphical.

Example 18.7 Consider the graph in Figure 18.1. The graphical log-linear model that corresponds to this graph is
$$\log f(x) = \psi_\emptyset + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x) + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x) + \psi_{235}(x) + \psi_{345}(x).$$


FIGURE 18.1. Graph for Example 18.7.

Let's see why this model is graphical. The edge $(1, 5)$ is missing in the graph. Hence any term containing that pair of indices is omitted from the model. For example,
$$\psi_{15}, \; \psi_{125}, \; \psi_{135}, \; \psi_{145}, \; \psi_{1235}, \; \psi_{1245}, \; \psi_{1345}, \; \psi_{12345}$$
are all omitted. Similarly, the edge $(2, 4)$ is missing and hence
$$\psi_{24}, \; \psi_{124}, \; \psi_{234}, \; \psi_{245}, \; \psi_{1234}, \; \psi_{1245}, \; \psi_{2345}, \; \psi_{12345}$$
are all omitted. There are other missing edges as well. You can check that the model omits all the corresponding terms. Now consider the model
$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x) + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x).$$
This is the same model except that the three-way interactions were removed. If we draw a graph for this model, we will get the same graph. For example, no terms contain $(1, 5)$ so we omit the edge between $X_1$ and $X_5$. But this model is not graphical since it has extra terms omitted. The independencies and graphs for the two models are the same, but the latter model has other constraints besides conditional independence constraints. This is not a bad


thing. It just means that if we are only concerned about presence or absence of conditional independences, then we need not consider such a model. The presence of the three-way interaction $\psi_{235}$ means that the strength of association between $X_2$ and $X_3$ varies as a function of $X_5$. Its absence indicates that this is not so.

18.3 Hierarchical Log-Linear Models

There is a set of log-linear models that is larger than the set of graphical models and that is used quite a bit. These are the hierarchical log-linear models.

Definition 18.8 A log-linear model is hierarchical if $\psi_a = 0$ and $a \subset t$ implies that $\psi_t = 0$.

Lemma 18.9 A graphical model is hierarchical, but the reverse need not be true.

Example 18.10 Let
$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x).$$
Its graph is given in Figure 18.2. The model is graphical because every term containing the missing pair of coordinates is omitted. It is also hierarchical.

Example 18.11 Let
$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{23}(x).$$
The model is hierarchical. It is not graphical. The graph corresponding to this model is complete; see Figure 18.3. It is not graphical because $\psi_{123}(x) = 0$, which does not correspond to any pairwise conditional independence.


FIGURE 18.2. Graph for Example 18.10.

FIGURE 18.3. The graph is complete. The model is hierarchical but not graphical.


FIGURE 18.4. The model for this graph is not hierarchical.

Example 18.12 Let
$$\log f(x) = \psi_\emptyset(x) + \psi_3(x) + \psi_{12}(x).$$
The corresponding graph is in Figure 18.4. This model is not hierarchical since $\psi_2 = 0$ but $\psi_{12}$ is not. Since it is not hierarchical, it is not graphical either.
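Both properties can be checked mechanically from the list of nonzero $\psi$-terms. Here is a minimal Python sketch, assuming terms are encoded as tuples of coordinate indices; the function names are illustrative choices, and the graph used for the graphical check is the one drawn from the model itself (pairs of coordinates occurring together in some nonzero term).

from itertools import combinations

def is_hierarchical(present):
    """Hierarchical iff the set of nonzero psi-terms is closed under subsets."""
    present = {frozenset(t) for t in present}
    return all(frozenset(s) in present
               for t in present for r in range(len(t))
               for s in combinations(t, r))

def is_graphical(present, variables):
    """Graphical iff the nonzero terms are exactly the subsets all of whose
    pairs are edges of the graph drawn from the model."""
    present = {frozenset(t) for t in present}
    edges = {frozenset(p) for t in present for p in combinations(t, 2)}
    def complete(A):
        return all(frozenset(p) in edges for p in combinations(A, 2))
    all_subsets = [frozenset(s) for r in range(len(variables) + 1)
                   for s in combinations(variables, r)]
    return present == {A for A in all_subsets if complete(A)}

# Example 18.11: all pairwise terms but no three-way term.
m = [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3)]
print(is_hierarchical(m), is_graphical(m, [1, 2, 3]))   # True False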

18.4 Model Generators

Hierarchical models can be written succinctly using generators. This is most easily explained by example. Suppose that $X = (X_1, X_2, X_3)$. Then $M = 1.2 + 1.3$ stands for
$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13}.$$
The formula $M = 1.2 + 1.3$ says: "include $\psi_{12}$ and $\psi_{13}$." We have to also include the lower order terms or it won't be hierarchical. The generator $M = 1.2.3$ is the saturated model
$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13} + \psi_{23} + \psi_{123}.$$
The saturated model corresponds to fitting an unconstrained multinomial. Consider $M = 1 + 2 + 3$, which means
$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3.$$


FIGURE 18.5. The lattice of models with two variables: $\emptyset$; 1 and 2; 1 + 2; 1.2.

This is the mutual independence model. Finally, consider $M = 1.2$, which has log-linear expansion
$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_{12}.$$
This model makes $X_3 \mid X_2 = x_2, X_1 = x_1$ a uniform distribution.

18.5 Lattices

Hierarchical models can be organized into something called a lattice. This is the set of all hierarchical models partially ordered by inclusion. The set of all hierarchical models for two variables can be illustrated as in Figure 18.5. $M = 1.2$ is the saturated model, $M = 1 + 2$ is the independence model, $M = 1$ is independence plus $X_2 \mid X_1$ uniform, $M = 2$ is independence plus $X_1 \mid X_2$ uniform, and $M = 0$ is the uniform distribution. The lattice of trivariate models is shown in Figure 18.6.

18.6 Fitting Log-Linear Models to Data

Let $\beta$ denote all the parameters in a log-linear model $M$. The loglikelihood for $\beta$ is
$$\ell(\beta) = \sum_j x_j \log p_j(\beta)$$
where the sum is over the cells and $p(\beta)$ denotes the cell probabilities corresponding to $\beta$.


FIGURE 18.6. The lattice of models for three variables, from the uniform model $\emptyset$ at the bottom, through the mutual independence model 1 + 2 + 3 and the graphical models such as 1.2 + 3, up to the saturated (graphical) model 1.2.3 at the top.


The mle $\hat\beta$ generally has to be found numerically. The model with all possible $\psi$-terms is called the saturated model. We can also fit any sub-model, which corresponds to setting some subset of the $\psi$-terms to 0.

Definition 18.13 For any submodel $M$, define the deviance $\mathrm{dev}(M)$ by
$$\mathrm{dev}(M) = 2\big(\hat\ell_{\rm sat} - \hat\ell_M\big)$$
where $\hat\ell_{\rm sat}$ is the log-likelihood of the saturated model evaluated at the mle and $\hat\ell_M$ is the log-likelihood of the model $M$ evaluated at its mle.

Theorem 18.14 The deviance is the likelihood ratio test statistic for
$$H_0: \text{the model is } M \quad \text{versus} \quad H_1: \text{the model is not } M.$$
Under $H_0$, $\mathrm{dev}(M) \stackrel{d}{\longrightarrow} \chi^2_\nu$ with degrees of freedom $\nu$ equal to the difference in the number of parameters between the saturated model and $M$.
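Given the fitted cell probabilities under a sub-model $M$ (obtained, for example, by iterative proportional fitting or a Poisson regression, which is not shown here), the deviance test is a short computation. The following Python sketch is an illustrative helper, not part of the text.

import numpy as np
from scipy.stats import chi2

def deviance_test(counts, fitted_probs, n_params_sat, n_params_M):
    """Deviance of a fitted sub-model M against the saturated model and
    the chi-squared p-value of Theorem 18.14. `counts` are the cell
    counts x_j, `fitted_probs` the cell probabilities p_j under M."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_sat = counts / n                 # saturated mle: observed proportions
    keep = counts > 0                  # use the 0 * log(0) = 0 convention
    ell_sat = np.sum(counts[keep] * np.log(p_sat[keep]))
    ell_M = np.sum(counts[keep] * np.log(np.asarray(fitted_probs)[keep]))
    dev = 2 * (ell_sat - ell_M)
    df = n_params_sat - n_params_M
    return dev, chi2.sf(dev, df)

In Example 18.16 below, a deviance of 0.6 on $8 - 6 = 2$ degrees of freedom gives chi2.sf(0.6, 2), about .74, matching the p-value reported there.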

One way to find a good model is to use the deviance to test every sub-model. Every model that is not rejected by this test is then considered a plausible model. However, this is not a good strategy, for two reasons. First, we will end up doing many tests, which means that there is ample opportunity for making Type I and Type II errors. Second, we will end up using models where we failed to reject $H_0$. But we might fail to reject $H_0$ due to low power. The result is that we may end up with a bad model just due to low power.

There are many model searching strategies. A common approach is to use some form of penalized likelihood. One version of penalized likelihood is the AIC that we used in regression. For any model


$M$ define
$$\mathrm{AIC}(M) = -2\Big(\hat\ell(M) - |M|\Big) \qquad (18.2)$$
where $|M|$ is the number of parameters. Here is the explanation of where AIC comes from.

Consider a set of models $\{M_1, M_2, \ldots\}$. Let $\hat f_j(x)$ denote the estimated probability function obtained by using the maximum likelihood estimator of model $M_j$. Thus $\hat f_j(x) = \hat f(x; \hat\beta_j)$ where $\hat\beta_j$ is the mle of the set of parameters $\beta_j$ for model $M_j$. We will use the loss function $D(f, \hat f)$, where
$$D(f, g) = \sum_x f(x) \log\left(\frac{f(x)}{g(x)}\right)$$
is the Kullback-Leibler distance between two probability functions. The corresponding risk function is $R(f, \hat f) = \mathbb{E}\big(D(f, \hat f)\big)$. Notice that $D(f, \hat f) = c - A(f, \hat f)$ where $c = \sum_x f(x) \log f(x)$ does not depend on $\hat f$ and
$$A(f, \hat f) = \sum_x f(x) \log \hat f(x).$$
Thus, minimizing the risk is equivalent to maximizing $a(f, \hat f) \equiv \mathbb{E}\big(A(f, \hat f)\big)$.

It is tempting to estimate $a(f, \hat f)$ by $\sum_x \hat f(x) \log \hat f(x)$ but, just as the training error in regression is a highly biased estimate of prediction risk, it is also the case that $\sum_x \hat f(x) \log \hat f(x)$ is a highly biased estimate of $a(f, \hat f)$. In fact, the bias is approximately equal to $|M_j|$. Thus:

Theorem 18.15 $\mathrm{AIC}(M_j)$ is an approximately unbiased estimate of $a(f, \hat f)$.

After finding a "best model" this way we can draw the corresponding graph. We can also check the overall fit of the selected model using the deviance as described above.


Example 18.16 Data on breast cancer from Morrison et al. (1973) were presented in homework question 4 of Chapter 17. The data are on diagnostic center ($X_1$), nuclear grade ($X_2$), and survival ($X_3$):

                      X2:  malignant  malignant  benign  benign
                      X3:  died       survived   died    survived
X1:  Boston                35         59         47      112
     Glamorgan             42         77         26      76

The saturated log-linear model is:

                          Coefficient  Standard Error  Wald Statistic  p-value
(Intercept)                  3.56          0.17            21.03        0.00  ***
center                       0.18          0.22             0.79        0.42
grade                        0.29          0.22             1.32        0.18
survival                     0.52          0.21             2.44        0.01  *
center*grade                -0.77          0.33            -2.31        0.02  *
center*survival              0.08          0.28             0.29        0.76
grade*survival               0.34          0.27             1.25        0.20
center*grade*survival        0.12          0.40             0.29        0.76

The best sub-model, selected using AIC and backward searching, is

                          Coefficient  Standard Error  Wald Statistic  p-value
(Intercept)                  3.52          0.13            25.62       < 0.00  ***
center                       0.23          0.13             1.70        0.08
grade                        0.26          0.18             1.43        0.15
survival                     0.56          0.14             3.98        6.65e-05  ***
center*grade                -0.67          0.18            -3.62        0.00  ***
grade*survival               0.37          0.19             1.90        0.05

The graph for this model $M$ is shown in Figure 18.7. To test the fit of this model, we compute the deviance of $M$, which is 0.6. The appropriate $\chi^2$ has $8 - 6 = 2$ degrees of freedom. The p-value is $\mathbb{P}(\chi^2_2 > .6) = .74$. So we have no evidence to suggest that the model is a poor fit.


FIGURE 18.7. The graph for Example 18.16: Center - Grade - Survival.

The example above is a toy example. With so few variables, there is little to be gained by searching for a best model. The overall multinomial mle would be sufficient. More importantly, we are probably interested in predicting survival from grade and center, in which case this should really be treated as a regression problem with outcome Y = survival and covariates center and grade.

Example 18.17 Here is a synthetic example. We generate $n = 100$ random vectors $X = (X_1, \ldots, X_5)$ of length 5. We generated the data as follows:
$$X_1 \sim \text{Bernoulli}(1/2)$$
and
$$X_j \mid X_1, \ldots, X_{j-1} \sim \begin{cases} \text{Bernoulli}(1/4) & \text{if } X_{j-1} = 0 \\ \text{Bernoulli}(3/4) & \text{if } X_{j-1} = 1. \end{cases}$$
It follows that
$$f(x_1, \ldots, x_5) = \frac{1}{2} \prod_{j=2}^{5} \left(\frac{3}{4}\right)^{I(x_j = x_{j-1})} \left(\frac{1}{4}\right)^{1 - I(x_j = x_{j-1})} = \frac{1}{2} \left(\frac{3}{4}\right)^{k} \left(\frac{1}{4}\right)^{4 - k}$$
where $k = \#\{j \ge 2 : x_j = x_{j-1}\}$ is the number of consecutive agreements.
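Generating data from this mechanism takes only a few lines. Here is a minimal Python sketch; the function name and the use of NumPy are illustrative choices, and the model selection steps compared in the table below (AIC and BIC searches over loglinear models) are not reproduced here.

import numpy as np

def generate(n=100, m=5, seed=0):
    """Generate n vectors X = (X1, ..., Xm) from the mechanism of
    Example 18.17: X1 ~ Bernoulli(1/2) and, given the past,
    Xj ~ Bernoulli(1/4) if X_{j-1} = 0 and Bernoulli(3/4) if X_{j-1} = 1."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, m), dtype=int)
    X[:, 0] = rng.binomial(1, 0.5, size=n)
    for j in range(1, m):
        p = np.where(X[:, j - 1] == 1, 0.75, 0.25)
        X[:, j] = rng.binomial(1, p)
    return X

# The unrestricted multinomial mle puts mass count/n on each of the
# 2^m = 32 cells; with n = 100 most cells are nearly empty.
X = generate()
cells, counts = np.unique(X, axis=0, return_counts=True)
p_mle = counts / len(X)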

We estimated f using three methods: (i) maximum likelihood

treating this as a multinomial with 32 categories, (ii) maximum

likelihood from the best loglinear model using AIC and forward

selection and (iii) maximum likelihood from the best loglinear


model using BIC and forward selection. We estimated the risk

by simulating the example 100 times. The average risks were:

Method Risk

MLE 0.63

AIC 0.54

BIC 0.53

In this example, there is little difference between AIC and BIC. Both are better than maximum likelihood.

18.7 Bibliographic Remarks

A classic reference on loglinear models is Bishop, Fienberg and Holland (1975). See also Whittaker (1990), from which some of the exercises are borrowed.

18.8 Exercises

1. Solve for the $p_{ij}$'s in terms of the $\beta$'s in Example 18.3.

2. Repeat example 18.17 using 7 covariates and n = 1000.

To avoid numerical problems, replace any zero count with

a one.

3. Prove Lemma 18.5.

4. Prove Lemma 18.9.

5. Consider random variables $(X_1, X_2, X_3, X_4)$. Suppose the log-density is
$$\log f(x) = \psi_\emptyset(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{24}(x) + \psi_{34}(x).$$

(a) Draw the graph G for these variables.

(b) Write down all independence and conditional indepen-

dence relations implied by the graph.


(c) Is this model graphical? Is it hierarchical?

6. Suppose that the parameters $p(x_1, x_2, x_3)$ are proportional to the following values:

              x2 = 0    x2 = 0    x2 = 1    x2 = 1
              x3 = 0    x3 = 1    x3 = 0    x3 = 1
    x1 = 0       2         8         4        16
    x1 = 1      16       128        32       256

Find the $\psi$-terms for the log-linear expansion. Comment on the model.

7. Let $X_1, \ldots, X_4$ be binary. Draw the independence graphs corresponding to the following log-linear models. Also, identify whether each is graphical and/or hierarchical (or neither).

(a) $\log f = 7 + 11 x_1 + 2 x_2 + 1.5 x_3 + 17 x_4$

(b) $\log f = 7 + 11 x_1 + 2 x_2 + 1.5 x_3 + 17 x_4 + 12 x_2 x_3 + 78 x_2 x_4 + 3 x_3 x_4 + 32 x_2 x_3 x_4$

(c) $\log f = 7 + 11 x_1 + 2 x_2 + 1.5 x_3 + 17 x_4 + 12 x_2 x_3 + 3 x_3 x_4 + x_1 x_4 + 2 x_1 x_2$

(d) $\log f = 7 + 5055\, x_1 x_2 x_3 x_4$


19

Causal Inference

In this Chapter we discuss causation. Roughly speaking, the statement "X causes Y" means that changing the value of $X$ will change the distribution of $Y$. When $X$ causes $Y$, $X$ and $Y$ will be associated, but the reverse is not, in general, true. Association does not necessarily imply causation. We will consider two frameworks for discussing causation. The first uses the notation of counterfactual random variables. The second, presented in the next Chapter, uses directed acyclic graphs.

19.1 The Counterfactual Model

Suppose that $X$ is a binary treatment variable where $X = 1$ means "treated" and $X = 0$ means "not treated." We are using the word "treatment" in a very broad sense: treatment might refer to a medication or something like smoking. An alternative to "treated/not treated" is "exposed/not exposed," but we shall use the former.


Let $Y$ be some outcome variable such as presence or absence of disease. To distinguish the statement "$X$ is associated with $Y$" from the statement "$X$ causes $Y$" we need to enrich our probabilistic vocabulary. We will decompose the response $Y$ into a more fine-grained object.

We introduce two new random variables $(C_0, C_1)$, called potential outcomes, with the following interpretation: $C_0$ is the outcome if the subject is not treated ($X = 0$) and $C_1$ is the outcome if the subject is treated ($X = 1$). Hence,
$$Y = \begin{cases} C_0 & \text{if } X = 0 \\ C_1 & \text{if } X = 1. \end{cases}$$
We can express the relationship between $Y$ and $(C_0, C_1)$ more succinctly by
$$Y = C_X. \qquad (19.1)$$
This equation is called the consistency relationship.

Here is a toy data set to make the idea clear:

    X   Y   C0   C1
    0   4    4    *
    0   7    7    *
    0   2    2    *
    0   8    8    *
    1   3    *    3
    1   5    *    5
    1   8    *    8
    1   9    *    9

The asterisks denote unobserved values. When X = 0 we don't

observe C1 in which case we say that C1 is a counterfactual

since it is the outcome you would have had if, counter to the

fact, you had been treated (X = 1). Similarly, when X = 1 we

don't observe C0 and we say that C0 is counterfactual.

Notice that there are four types of subjects:


    Type              C0   C1
    Survivors          1    1
    Responders         0    1
    Anti-responders    1    0
    Doomed             0    0

Think of the potential outcomes $(C_0, C_1)$ as hidden variables that contain all the relevant information about the subject.

Define the average causal effect or average treatment effect to be
$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0). \qquad (19.2)$$
The parameter $\theta$ has the following interpretation: $\theta$ is the mean if everyone were treated ($X = 1$) minus the mean if everyone were not treated ($X = 0$). There are other ways of measuring the causal effect. For example, if $C_0$ and $C_1$ are binary, we define the causal odds ratio
$$\frac{\mathbb{P}(C_1 = 1)}{\mathbb{P}(C_1 = 0)} \div \frac{\mathbb{P}(C_0 = 1)}{\mathbb{P}(C_0 = 0)}$$
and the causal relative risk
$$\frac{\mathbb{P}(C_1 = 1)}{\mathbb{P}(C_0 = 1)}.$$

The main ideas will be the same whatever causal effect we use. For simplicity, we shall work with the average causal effect $\theta$. Define the association to be
$$\alpha = \mathbb{E}(Y \mid X = 1) - \mathbb{E}(Y \mid X = 0). \qquad (19.3)$$
Again, we could use odds ratios or other summaries if we wish.

Theorem 19.1 (Association is not equal to causation) In general, $\theta \neq \alpha$.

Example 19.2 Suppose the whole population is as follows:


    X   Y   C0   C1
    0   0    0    0*
    0   0    0    0*
    0   0    0    0*
    0   0    0    0*
    1   1    1*   1
    1   1    1*   1
    1   1    1*   1
    1   1    1*   1

Again, the asterisks denote unobserved values. Notice that $C_0 = C_1$ for every subject; thus, this treatment has no effect. Indeed,
$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0) = \frac{1}{8}\sum_{i=1}^{8} C_{1i} - \frac{1}{8}\sum_{i=1}^{8} C_{0i} = \frac{1+1+1+1+0+0+0+0}{8} - \frac{1+1+1+1+0+0+0+0}{8} = 0.$$
Thus the average causal effect is 0. The observed data are the $X$'s and $Y$'s only, from which we can estimate the association:
$$\alpha = \mathbb{E}(Y \mid X = 1) - \mathbb{E}(Y \mid X = 0) = \frac{1+1+1+1}{4} - \frac{0+0+0+0}{4} = 1.$$
Hence, $\theta \neq \alpha$.

To add some intuition to this example, imagine that the outcome variable is 1 if "healthy" and 0 if "sick." Suppose that $X = 0$ means that the subject does not take vitamin C and that $X = 1$ means that the subject does take vitamin C. Vitamin C has no causal effect since $C_0 = C_1$. In this example there are two types of people: healthy people $(C_0, C_1) = (1, 1)$ and unhealthy people $(C_0, C_1) = (0, 0)$. Healthy people tend to take vitamin C while unhealthy people don't. It is this association between $(C_0, C_1)$ and $X$ that creates an association between $X$ and $Y$. If we only had data on $X$ and $Y$ we would conclude that $X$


and $Y$ are associated. Suppose we wrongly interpret this causally and conclude that vitamin C prevents illness. Next we might encourage everyone to take vitamin C. If most people comply, the population will look something like this:

    X   Y   C0   C1
    0   0    0    0*
    1   0    0    0*
    1   0    0    0*
    1   0    0    0*
    1   1    1*   1
    1   1    1*   1
    1   1    1*   1
    1   1    1*   1

Now $\alpha = (4/7) - (0/1) = 4/7$. We see that $\alpha$ went down from 1 to 4/7. Of course, the causal effect never changed, but the naive observer who does not distinguish association and causation will be confused because his advice seems to have made things worse instead of better.

In the last example, $\theta = 0$ and $\alpha = 1$. It is not hard to create examples in which $\alpha > 0$ and yet $\theta < 0$. The fact that the association and causal effects can have different signs is very confusing to many people, including well educated statisticians. The example makes it clear that, in general, we cannot use the association $\alpha$ to estimate the causal effect $\theta$. The reason that $\theta \neq \alpha$ is that $(C_0, C_1)$ was not independent of $X$. That is, treatment assignment was not independent of person type.

Can we ever estimate the causal effect? The answer is: sometimes. In particular, random assignment to treatment makes it possible to estimate $\theta$.

Theorem 19.3 Suppose we randomly assign subjects to treatment and that $\mathbb{P}(X = 0) > 0$ and $\mathbb{P}(X = 1) > 0$. Then $\alpha = \theta$. Hence, any consistent estimator of $\alpha$ is a consistent estimator


of $\theta$. In particular,
$$\hat\theta = \hat{\mathbb{E}}(Y \mid X = 1) - \hat{\mathbb{E}}(Y \mid X = 0) = \overline{Y}_1 - \overline{Y}_0$$
is a consistent estimator of $\theta$, where
$$\overline{Y}_1 = \frac{1}{n_1}\sum_{i : X_i = 1} Y_i, \qquad \overline{Y}_0 = \frac{1}{n_0}\sum_{i : X_i = 0} Y_i,$$
$n_1 = \sum_{i=1}^n X_i$ and $n_0 = \sum_{i=1}^n (1 - X_i)$.

Proof. Since $X$ is randomly assigned, $X$ is independent of $(C_0, C_1)$. Hence,
$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0) = \mathbb{E}(C_1 \mid X = 1) - \mathbb{E}(C_0 \mid X = 0) \quad \text{since } X \amalg (C_0, C_1)$$
$$= \mathbb{E}(Y \mid X = 1) - \mathbb{E}(Y \mid X = 0) \quad \text{since } Y = C_X$$
$$= \alpha.$$
The consistency follows from the law of large numbers.
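The estimator in Theorem 19.3 is just a difference of group means. A minimal Python sketch is below; note that this quantity is always the association, and is a consistent estimate of the causal effect only under random assignment. The toy data of Section 19.1 are used purely to show the arithmetic.

import numpy as np

def difference_of_means(x, y):
    """Difference of sample means of Y between treated (X=1) and
    untreated (X=0) subjects: the estimator of Theorem 19.3."""
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    return y[x == 1].mean() - y[x == 0].mean()

# Toy data set from Section 19.1 (the asterisked counterfactuals are unobserved):
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([4, 7, 2, 8, 3, 5, 8, 9])
print(difference_of_means(x, y))   # 6.25 - 5.25 = 1.0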

If $Z$ is a covariate, we define the conditional causal effect by
$$\theta_z = \mathbb{E}(C_1 \mid Z = z) - \mathbb{E}(C_0 \mid Z = z).$$
For example, if $Z$ denotes gender with values $Z = 0$ (women) and $Z = 1$ (men), then $\theta_0$ is the causal effect among women and $\theta_1$ is the causal effect among men. In a randomized experiment, $\theta_z = \mathbb{E}(Y \mid X = 1, Z = z) - \mathbb{E}(Y \mid X = 0, Z = z)$ and we can estimate the conditional causal effect using appropriate sample averages.


Summary

Random variables: $(C_0, C_1, X, Y)$.
Consistency relationship: $Y = C_X$.
Causal effect: $\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0)$.
Association: $\alpha = \mathbb{E}(Y \mid X = 1) - \mathbb{E}(Y \mid X = 0)$.
Random assignment $\Longrightarrow$ $(C_0, C_1) \amalg X$ $\Longrightarrow$ $\theta = \alpha$.

19.2 Beyond Binary Treatments

Let us now generalize beyond the binary case. Suppose that $X \in \mathcal{X}$. For example, $X$ could be the dose of a drug, in which case $X \in \mathbb{R}$. The counterfactual vector $(C_0, C_1)$ now becomes the counterfactual process
$$\{C(x) : x \in \mathcal{X}\} \qquad (19.4)$$
where $C(x)$ is the outcome a subject would have if he received dose $x$. The observed response is given by the consistency relation
$$Y \equiv C(X). \qquad (19.5)$$
See Figure 19.1. The causal regression function is
$$\theta(x) = \mathbb{E}(C(x)). \qquad (19.6)$$
The regression function, which measures association, is $r(x) = \mathbb{E}(Y \mid X = x)$.

Theorem 19.4 In general, $\theta(x) \neq r(x)$. However, when $X$ is randomly assigned, $\theta(x) = r(x)$.

Example 19.5 An example in which $\theta(x)$ is constant but $r(x)$ is not constant is shown in Figure 19.2. The figure shows the counterfactual processes for four subjects; the dots represent their $X$ values.


FIGURE 19.1. A counterfactual process $C(x)$. The outcome $Y$ is the value of the curve $C(x)$ evaluated at the observed dose $X$.


Since $C_i(x)$ is constant over $x$ for all $i$, there is no causal effect and hence
$$\theta(x) = \frac{C_1(x) + C_2(x) + C_3(x) + C_4(x)}{4}$$
is constant. Changing the dose $x$ will not change anyone's outcome. The four dots in the lower plot represent the observed data points $Y_1 = C_1(X_1), Y_2 = C_2(X_2), Y_3 = C_3(X_3), Y_4 = C_4(X_4)$. The dotted line represents the regression $r(x) = \mathbb{E}(Y \mid X = x)$. Although there is no causal effect, there is an association since the regression curve $r(x)$ is not constant.

19.3 Observational Studies and Confounding

A study in which treatment (or exposure) is not randomly assigned is called an observational study. In these studies, subjects select their own value of the exposure $X$. Many of the health studies you read about in the newspaper are like this. As we saw, association and causation could in general be quite different. This discrepancy occurs in non-randomized studies because the potential outcome $C$ is not independent of treatment $X$. However, suppose we could find groupings of subjects such that, within groups, $X$ and $\{C(x) : x \in \mathcal{X}\}$ are independent. This would happen if the subjects are very similar within groups. For example, suppose we find people who are very similar in age, gender, educational background and ethnic background. Among these people we might feel it is reasonable to assume that the choice of $X$ is essentially random. These other variables are called confounding variables. (A more precise definition of confounding is given in the next Chapter.) If we denote these other variables collectively as $Z$, then we can express this idea by saying that
$$\{C(x) : x \in \mathcal{X}\} \amalg X \mid Z. \qquad (19.7)$$
Equation (19.7) means that, within groups of $Z$, the choice of treatment $X$ does not depend on type, as represented by $\{C(x) : x \in \mathcal{X}\}$.


FIGURE 19.2. The top plot shows the counterfactual process $C_i(x)$ for four subjects. The dots represent their $X$ values. Since $C_i(x)$ is constant over $x$ for all $i$, there is no causal effect. Changing the dose will not change anyone's outcome. The lower plot shows the causal regression function $\theta(x) = (C_1(x) + C_2(x) + C_3(x) + C_4(x))/4$. The four dots represent the observed data points $Y_1 = C_1(X_1), Y_2 = C_2(X_2), Y_3 = C_3(X_3), Y_4 = C_4(X_4)$. The dotted line represents the regression $r(x) = \mathbb{E}(Y \mid X = x)$. There is no causal effect since $C_i(x)$ is constant for all $i$. But there is an association since the regression curve $r(x)$ is not constant.


If (19.7) holds and we observe $Z$ then we say that there is no unmeasured confounding.

Theorem 19.6 Suppose that (19.7) holds. Then,
$$\theta(x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_Z(z). \qquad (19.8)$$
If $\hat r(x, z)$ is a consistent estimate of the regression function $\mathbb{E}(Y \mid X = x, Z = z)$, then a consistent estimate of $\theta(x)$ is
$$\hat\theta(x) = \frac{1}{n}\sum_{i=1}^n \hat r(x, Z_i).$$
In particular, if $r(x, z) = \beta_0 + \beta_1 x + \beta_2 z$ is linear, then a consistent estimate of $\theta(x)$ is
$$\hat\theta(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 \overline{Z}_n \qquad (19.9)$$
where $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ are the least squares estimators.

Remark 19.7 It is useful to compare equation (19.8) to $\mathbb{E}(Y \mid X = x)$, which can be written as $\mathbb{E}(Y \mid X = x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_{Z \mid X}(z \mid x)$.
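In the linear special case of Theorem 19.6, the adjusted estimate (19.9) is a small least squares computation. Here is a minimal Python sketch; the function name is an illustrative choice, and X, Z, Y stand for hypothetical arrays of observed dose, confounder and outcome.

import numpy as np

def adjusted_effect(x_grid, X, Z, Y):
    """Estimate theta(x) by adjusting for a confounder Z as in (19.9),
    using the linear regression r(x, z) = b0 + b1*x + b2*z."""
    X, Z, Y = (np.asarray(a, dtype=float) for a in (X, Z, Y))
    design = np.column_stack([np.ones_like(X), X, Z])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    b0, b1, b2 = beta
    # theta_hat(x) = b0 + b1 * x + b2 * mean(Z)
    return b0 + b1 * np.asarray(x_grid, dtype=float) + b2 * Z.mean()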

Epidemiologists call (19.8) the adjusted treatment effect. The process of computing adjusted treatment effects is called adjusting (or controlling) for confounding. The selection of which confounders $Z$ to measure and control for requires scientific insight. Even after adjusting for confounders, we cannot be sure that there are not other confounding variables that we missed. This is why observational studies must be treated with healthy skepticism. Results from observational studies start to become believable when: (i) the results are replicated in many studies, (ii) each of the studies controlled for plausible confounding variables, and (iii) there is a plausible scientific explanation for the existence of a causal relationship.


A good example is smoking and cancer. Numerous studies have shown a relationship between smoking and cancer even after adjusting for many confounding variables. Moreover, in laboratory studies, smoking has been shown to damage lung cells. Finally, a causal link between smoking and cancer has been found in randomized animal studies. It is this collection of evidence over many years that makes this a convincing case. One single observational study is not, by itself, strong evidence. Remember that when you read the newspaper.

19.4 Simpson's Paradox

Simpson's paradox is a puzzling phenomenon that is discussed in most statistics texts. Unfortunately, it is explained incorrectly in most statistics texts. The reason is that it is nearly impossible to explain the paradox without using counterfactuals (or directed acyclic graphs).

Let $X$ be a binary treatment variable, $Y$ a binary outcome and $Z$ a third binary variable such as gender. Suppose the joint distribution of $X, Y, Z$ is

                 Z = 1 (men)          Z = 0 (women)
              Y = 1     Y = 0       Y = 1     Y = 0
    X = 1     .1500     .2250       .1000     .0250
    X = 0     .0375     .0875       .2625     .1125

The marginal distribution for $(X, Y)$ is


              Y = 1     Y = 0
    X = 1      .25       .25       .50
    X = 0      .30       .20       .50
               .55       .45        1

Now,
$$\mathbb{P}(Y = 1 \mid X = 1) - \mathbb{P}(Y = 1 \mid X = 0) = \frac{\mathbb{P}(Y = 1, X = 1)}{\mathbb{P}(X = 1)} - \frac{\mathbb{P}(Y = 1, X = 0)}{\mathbb{P}(X = 0)} = \frac{.25}{.50} - \frac{.30}{.50} = -0.1.$$
We might naively interpret this to mean that the treatment is bad for you since $\mathbb{P}(Y = 1 \mid X = 1) < \mathbb{P}(Y = 1 \mid X = 0)$. Furthermore, among men,
$$\mathbb{P}(Y = 1 \mid X = 1, Z = 1) - \mathbb{P}(Y = 1 \mid X = 0, Z = 1) = \frac{\mathbb{P}(Y = 1, X = 1, Z = 1)}{\mathbb{P}(X = 1, Z = 1)} - \frac{\mathbb{P}(Y = 1, X = 0, Z = 1)}{\mathbb{P}(X = 0, Z = 1)} = \frac{.15}{.3750} - \frac{.0375}{.1250} = 0.1.$$
Among women,
$$\mathbb{P}(Y = 1 \mid X = 1, Z = 0) - \mathbb{P}(Y = 1 \mid X = 0, Z = 0) = \frac{\mathbb{P}(Y = 1, X = 1, Z = 0)}{\mathbb{P}(X = 1, Z = 0)} - \frac{\mathbb{P}(Y = 1, X = 0, Z = 0)}{\mathbb{P}(X = 0, Z = 0)} = \frac{.1}{.1250} - \frac{.2625}{.3750} = 0.1.$$
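These conditional probabilities can be verified directly from the joint table. Here is a short Python check; the array layout and helper name are illustrative choices.

import numpy as np

# Joint distribution P(X, Y, Z) from the table above; p[x, y, z] with
# z = 1 for men and z = 0 for women.
p = np.zeros((2, 2, 2))
p[1, 1, 1], p[1, 0, 1] = .1500, .2250
p[0, 1, 1], p[0, 0, 1] = .0375, .0875
p[1, 1, 0], p[1, 0, 0] = .1000, .0250
p[0, 1, 0], p[0, 0, 0] = .2625, .1125

def pr_y1_given(x, z=None):
    """P(Y = 1 | X = x), or P(Y = 1 | X = x, Z = z) when z is given."""
    if z is None:
        return p[x, 1, :].sum() / p[x, :, :].sum()
    return p[x, 1, z] / p[x, :, z].sum()

print(pr_y1_given(1) - pr_y1_given(0))          # -0.1  (marginal)
print(pr_y1_given(1, 1) - pr_y1_given(0, 1))    # +0.1  (men)
print(pr_y1_given(1, 0) - pr_y1_given(0, 0))    # +0.1  (women)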

To summarize, we seem to have the following information:

    Mathematical Statement                                  English Statement?
    P(Y = 1 | X = 1) < P(Y = 1 | X = 0)                     treatment is harmful
    P(Y = 1 | X = 1, Z = 1) > P(Y = 1 | X = 0, Z = 1)       treatment is beneficial to men
    P(Y = 1 | X = 1, Z = 0) > P(Y = 1 | X = 0, Z = 0)       treatment is beneficial to women

Clearly, something is amiss. There can't be a treatment which is good for men, good for women, but bad overall. This is nonsense. The problem is with the set of English statements in the table. Our translation from math into English is specious.

The inequality $\mathbb{P}(Y = 1 \mid X = 1) < \mathbb{P}(Y = 1 \mid X = 0)$ does not mean that treatment is harmful.

The phrase "treatment is harmful" should be written mathematically as $\mathbb{P}(C_1 = 1) < \mathbb{P}(C_0 = 1)$. The phrase "treatment is harmful for men" should be written $\mathbb{P}(C_1 = 1 \mid Z = 1) < \mathbb{P}(C_0 = 1 \mid Z = 1)$. The three mathematical statements in the table are not at all contradictory. It is only the translation into English that is wrong.

Let us now show that a real Simpson's paradox cannot happen; that is, there cannot be a treatment that is beneficial for men and women but harmful overall. Suppose that treatment is beneficial for both sexes. Then
$$\mathbb{P}(C_1 = 1 \mid Z = z) > \mathbb{P}(C_0 = 1 \mid Z = z)$$
for all $z$. It then follows that
$$\mathbb{P}(C_1 = 1) = \sum_z \mathbb{P}(C_1 = 1 \mid Z = z)\,\mathbb{P}(Z = z) > \sum_z \mathbb{P}(C_0 = 1 \mid Z = z)\,\mathbb{P}(Z = z) = \mathbb{P}(C_0 = 1).$$
Hence, $\mathbb{P}(C_1 = 1) > \mathbb{P}(C_0 = 1)$, so treatment is beneficial overall. No paradox.


19.5 Bibliographic Remarks

The use of potential outcomes to clarify causation is due mainly

to Jerzy Neyman and Don Rubin. Later developments are due

to Jamie Robins, Paul Rosenbaum and others. A parallel devel-

opment took place in econometrics by various people including

Jim Heckman and Charles Manski. Currently, there are no text-

book treatments of causal inference from the potential outcomes

viewpoint.

19.6 Exercises

1. Create an example like Example 19.2 in which $\alpha > 0$ and $\theta < 0$.

2. Prove Theorem 19.4.

3. Suppose you are given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from an observational study, where $X_i \in \{0, 1\}$ and $Y_i \in \{0, 1\}$. Although it is not possible to estimate the causal effect $\theta$, it is possible to put bounds on $\theta$. Find upper and lower bounds on $\theta$ that can be consistently estimated from the data. Show that the bounds have width 1. Hint: Note that $\mathbb{E}(C_1) = \mathbb{E}(C_1 \mid X = 1)\mathbb{P}(X = 1) + \mathbb{E}(C_1 \mid X = 0)\mathbb{P}(X = 0)$.

4. Suppose that $X \in \mathbb{R}$ and that, for each subject $i$, $C_i(x) = \beta_{1i} x$. Each subject has their own slope $\beta_{1i}$. Construct a joint distribution on $(\beta_1, X)$ such that $\mathbb{P}(\beta_1 > 0) = 1$ but $\mathbb{E}(Y \mid X = x)$ is a decreasing function of $x$, where $Y = C(X)$. Interpret. Hint: Write $f(\beta_1, x) = f(\beta_1) f(x \mid \beta_1)$. Choose $f(x \mid \beta_1)$ so that when $\beta_1$ is large, $x$ is small and when $\beta_1$ is small, $x$ is large.


20

Directed Graphs

20.1 Introduction

Directed graphs are similar to undirected graphs except that

there are arrows between vertices instead of edges. Like undi-

rected graphs, directed graphs can be used to represent inde-

pendence relations. They can also be used as an alternative to

counterfactuals to represent causal relationships. Some people

use the phrase Bayesian network to refer to a directed graph

endowed with a probability distribution. This is a poor choice of

terminology. Statistical inference for directed graphs can be per-

formed using frequentist or Bayesian methods so it is misleading

to call them Bayesian networks.

20.2 DAG's

A directed graph $G$ consists of a set of vertices $V$ and an edge set $E$ of ordered pairs of variables. If $(X, Y) \in E$ then there is an arrow pointing from $X$ to $Y$. See Figure 20.1.


FIGURE 20.1. A directed graph with $V = \{X, Y, Z\}$ and $E = \{(Y, X), (Y, Z)\}$.

If an arrow connects two variables $X$ and $Y$ (in either direction) we say that $X$ and $Y$ are adjacent. If there is an arrow from $X$ to $Y$ then $X$ is a parent of $Y$ and $Y$ is a child of $X$. The set of all parents of $X$ is denoted by $\pi_X$ or $\pi(X)$. A directed path from $X$ to $Y$ is a set of vertices beginning with $X$ and ending with $Y$ such that each pair is connected by an arrow and all the arrows point in the same direction, as follows:
$$X \longrightarrow \cdots \longrightarrow Y \quad \text{or} \quad X \longleftarrow \cdots \longleftarrow Y.$$
A sequence of adjacent vertices starting with $X$ and ending with $Y$, but ignoring the direction of the arrows, is called an undirected path. The sequence $\{X, Y, Z\}$ in Figure 20.1 is an undirected path. $X$ is an ancestor of $Y$ if there is a directed path from $X$ to $Y$. We also say that $Y$ is a descendant of $X$.

A configuration of the form
$$X \longrightarrow Y \longleftarrow Z$$
is called a collider. A configuration not of that form is called a non-collider, for example
$$X \longrightarrow Y \longrightarrow Z$$
or
$$X \longleftarrow Y \longrightarrow Z.$$


A directed path that starts and ends at the same variable is

called a cycle. A directed graph is acyclic if it has no cycles. In

this case we say that the graph is a directed acyclic graph or

DAG. From now on, we only deal with graphs that are DAG's.

20.3 Probability and DAG's

Let $G$ be a DAG with vertices $V = (X_1, \ldots, X_k)$.

Definition 20.1 If $\mathbb{P}$ is a distribution for $V$ with probability function $p$, we say that $\mathbb{P}$ is Markov to $G$, or that $G$ represents $\mathbb{P}$, if
$$p(v) = \prod_{i=1}^k p(x_i \mid \pi_i) \qquad (20.1)$$
where $\pi_i$ are the parents of $X_i$. The set of distributions represented by $G$ is denoted by $M(G)$.

Example 20.2 For the DAG in Figure 20.2, $\mathbb{P} \in M(G)$ if and only if its probability function $p$ has the form
$$p(x, y, z, w) = p(x)\, p(y)\, p(z \mid x, y)\, p(w \mid z).$$

The following theorem says that $\mathbb{P} \in M(G)$ if and only if the Markov Condition holds. Roughly speaking, the Markov Condition says that every variable $W$ is independent of the "past" given its parents.

Theorem 20.3 A distribution $\mathbb{P} \in M(G)$ if and only if the following Markov Condition holds: for every variable $W$,
$$W \amalg \widetilde{W} \mid \pi_W \qquad (20.2)$$
where $\widetilde{W}$ denotes all the other variables except the parents and descendants of $W$.


FIGURE 20.2. Another DAG.

FIGURE 20.3. Yet another DAG.

Example 20.4 In Figure 20.2, the Markov Condition implies that
$$X \amalg Y \quad \text{and} \quad W \amalg \{X, Y\} \mid Z.$$

Example 20.5 Consider the DAG in Figure 20.3. In this case the probability function must factor like
$$p(a, b, c, d, e) = p(a)\, p(b \mid a)\, p(c \mid a)\, p(d \mid b, c)\, p(e \mid d).$$
The Markov Condition implies the following independence relations:
$$D \amalg A \mid \{B, C\}, \quad E \amalg \{A, B, C\} \mid D \quad \text{and} \quad B \amalg C \mid A.$$


FIGURE 20.4. And yet another DAG.

20.4 More Independence Relations

With undirected graphs, we saw that pairwise independence relations implied other independence relations. Luckily, we were able to deduce these other relations from the graph. The situation is similar for DAG's. The Markov Condition allows us to list some independence relations. These relations may logically imply other independence relations. Consider the DAG in Figure 20.4. The Markov Condition implies:
$$X_1 \amalg X_2, \quad X_2 \amalg \{X_1, X_4\}, \quad X_3 \amalg X_4 \mid \{X_1, X_2\},$$
$$X_4 \amalg \{X_2, X_3\} \mid X_1, \quad X_5 \amalg \{X_1, X_2\} \mid \{X_3, X_4\}.$$
It turns out (but it is not obvious) that these conditions imply that
$$\{X_4, X_5\} \amalg X_2 \mid \{X_1, X_3\}.$$

How do we find these extra independence relations? The answer is "d-separation," which can be summarized by three rules. (In what follows, we implicitly assume that $\mathbb{P}$ is faithful to $G$, which means that $\mathbb{P}$ has no extra independence relations other than those logically implied by the Markov Condition.) Consider the four DAG's in Figure 20.5 and the DAG in Figure 20.6. The first three DAG's in Figure 20.5 are non-colliders. The DAG in the lower right of Figure 20.5 is a collider. The DAG in Figure 20.6 is a collider with a descendant.


FIGURE 20.5. The first three DAG's have no colliders. The fourth DAG in the lower right corner has a collider at $Y$.

FIGURE 20.6. A collider with a descendant.

The rules of d-separation. Consider the DAG's in Figures 20.5 and 20.6.

Rule (1). In a non-collider, $X$ and $Z$ are d-connected, but they are d-separated given $Y$.

Rule (2). If $X$ and $Z$ collide at $Y$ then $X$ and $Z$ are d-separated, but they are d-connected given $Y$.

Rule (3). Conditioning on the descendant of a collider has the same effect as conditioning on the collider. Thus in Figure 20.6, $X$ and $Z$ are d-separated but they are d-connected given $W$.

Here is a more formal definition of d-separation. Let $X$ and $Y$ be distinct vertices and let $W$ be a set of vertices not containing $X$ or $Y$.


FIGURE 20.7. d-separation explained.

Then $X$ and $Y$ are d-separated given $W$ if there exists no undirected path $U$ between $X$ and $Y$ such that (i) every collider on $U$ has a descendant in $W$, and (ii) no other vertex on $U$ is in $W$. If $U$, $V$ and $W$ are distinct sets of vertices and $U$ and $V$ are not empty, then $U$ and $V$ are d-separated given $W$ if for every $X \in U$ and $Y \in V$, $X$ and $Y$ are d-separated given $W$. Vertices that are not d-separated are said to be d-connected.

Example 20.6 Consider the DAG in Figure 20.7. From the d-separation rules we conclude that:

$X$ and $Y$ are d-separated (given the empty set);

$X$ and $Y$ are d-connected given $\{S_1, S_2\}$;

$X$ and $Y$ are d-separated given $\{S_1, S_2, V\}$.

Theorem 20.7 (Spirtes, Glymour and Scheines) Let $A$, $B$ and $C$ be disjoint sets of vertices. Then $A \amalg B \mid C$ if and only if $A$ and $B$ are d-separated by $C$.

Example 20.8 The fact that conditioning on a collider creates dependence might not seem intuitive. Here is a whimsical example from Jordan (2003) that makes this idea more palatable. Your friend appears to be late for a meeting with you. There are two explanations: she was abducted by aliens or you forgot to set your watch ahead one hour for daylight savings time. See Figure 20.8. Aliens and Watch are blocked by a collider, which implies they are marginally independent. This seems reasonable since, before we know anything about your friend being late, we would expect these variables to be independent. We would also expect that $\mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}) > \mathbb{P}(\text{Aliens} = \text{yes})$: learning that your friend is late certainly increases the probability that she was abducted.


FIGURE 20.8. Jordan's alien example. Was your friend kidnapped by aliens or did you forget to set your watch?

But when we learn that you forgot to set your watch properly, we would lower the chance that your friend was abducted. Hence, $\mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}) \neq \mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}, \text{Watch} = \text{no})$. Thus, Aliens and Watch are dependent given Late.

Graphs that look different may actually imply the same independence relations. If G is a DAG, we let I(G) denote all the independence statements implied by G. Two DAG's $G_1$ and $G_2$ for the same variables V are Markov equivalent if $I(G_1) = I(G_2)$. Given a DAG G, let skeleton(G) denote the undirected graph obtained by replacing the arrows with undirected edges.

Theorem 20.9 Two DAG's $G_1$ and $G_2$ are Markov equivalent if and only if (i) skeleton($G_1$) = skeleton($G_2$) and (ii) $G_1$ and $G_2$ have the same colliders.

Example 20.10 The first three DAG's in Figure 20.5 are Markov equivalent. The DAG in the lower right of the figure is not Markov equivalent to the others.

20.5 Estimation for DAG's

Let G be a DAG. Assume that all the variables $V = (X_1, \ldots, X_m)$ are discrete. The probability function can be written

$$p(v) = \prod_{i=1}^m p(x_i \mid \pi_i). \qquad (20.3)$$


To estimate p(v) we need to estimate $p(x_i \mid \pi_i)$ for each i. Think of the parents of $X_i$ as one discrete variable $\widetilde X_i$ with many levels. For example, suppose that $X_3$ has parents $X_1$ and $X_2$ and that $X_1 \in \{0, 1, 2\}$ while $X_2 \in \{0, 1\}$. We can regard the parents $X_1$ and $X_2$ as a single variable $\widetilde X_3$ defined by

$$\widetilde X_3 = \begin{cases}
1 & \text{if } X_1 = 0,\ X_2 = 0\\
2 & \text{if } X_1 = 0,\ X_2 = 1\\
3 & \text{if } X_1 = 1,\ X_2 = 0\\
4 & \text{if } X_1 = 1,\ X_2 = 1\\
5 & \text{if } X_1 = 2,\ X_2 = 0\\
6 & \text{if } X_1 = 2,\ X_2 = 1.
\end{cases}$$

Hence we can write

$$p(v) = \prod_{i=1}^m p(x_i \mid \widetilde x_i). \qquad (20.4)$$

Theorem 20.11 Let $V_1, \ldots, V_n$ be iid random vectors from the distribution p given in (20.4). The maximum likelihood estimator of p is

$$\hat p(v) = \prod_{i=1}^m \hat p(x_i \mid \widetilde x_i) \qquad (20.5)$$

where

$$\hat p(x_i \mid \widetilde x_i) = \frac{\#\{i : X_i = x_i \text{ and } \widetilde X_i = \widetilde x_i\}}{\#\{i : \widetilde X_i = \widetilde x_i\}},$$

the counts being taken over the n observations.

Example 20.12 Let $V = (X, Y, Z)$ have the DAG in the top left of Figure 20.5. Given n observations $(X_1, Y_1, Z_1), \ldots, (X_n, Y_n, Z_n)$, the mle of $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$ is

$$\hat p(x) = \frac{\#\{i : X_i = x\}}{n}, \qquad
\hat p(y \mid x) = \frac{\#\{i : X_i = x \text{ and } Y_i = y\}}{\#\{i : X_i = x\}},$$

and

$$\hat p(z \mid y) = \frac{\#\{i : Y_i = y \text{ and } Z_i = z\}}{\#\{i : Y_i = y\}}.$$
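These counting formulas are easy to implement directly. Below is a minimal R sketch (not from the text; the data frame dat and its column names are assumptions for illustration) that computes the plug-in estimates for the DAG of Example 20.12.

dag.mle <- function(dat) {
  # Plug-in (maximum likelihood) estimates for the DAG X -> Y -> Z.
  # 'dat' is assumed to be a data frame with discrete columns X, Y, Z.
  n <- nrow(dat)
  p.x  <- table(dat$X) / n                    # p(x) = #{i : X_i = x} / n
  p.yx <- prop.table(table(dat$X, dat$Y), 1)  # p(y|x): rows indexed by x
  p.zy <- prop.table(table(dat$Y, dat$Z), 1)  # p(z|y): rows indexed by y
  list(p.x = p.x, p.y.given.x = p.yx, p.z.given.y = p.zy)
}
# The estimated joint probability of (x, y, z) is then
# p.x[x] * p.y.given.x[x, y] * p.z.given.y[y, z].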


It is possible to extend these ideas to continuous random variables as well. For example, we might use some parametric model $p(x \mid \pi_x; \theta_x)$ for each conditional density. The likelihood function is then

$$\mathcal{L}(\theta) = \prod_{i=1}^n p(V_i; \theta) = \prod_{i=1}^n \prod_{j=1}^m p(X_{ij} \mid \pi_j; \theta_j)$$

where $X_{ij}$ is the value of $X_j$ for the $i$th data point and $\theta_j$ are the parameters for the $j$th conditional density. We can then proceed using maximum likelihood.

So far, we have assumed that the structure of the DAG is given. One can also try to estimate the structure of the DAG itself from the data. For example, we could fit every possible DAG using maximum likelihood and use AIC (or some other method) to choose a DAG. However, there are a great many possible DAG's, so you would need a large amount of data for such a method to be reliable. Producing a valid, accurate confidence set for the DAG structure would require astronomical sample sizes. DAG's are thus most useful for encoding conditional independence information rather than discovering it.

20.6 Causation Revisited

We discussed causation in Chapter 19 using the idea of counterfactual random variables. A different approach to causation uses DAG's. The two approaches are mathematically equivalent though they appear to be quite different. In Chapter 19, the extra element we added to clarify causation was the idea of a counterfactual. In the DAG approach, the extra element is the idea of intervention. Consider the DAG in Figure 20.9.

The probability function for a distribution consistent with this DAG has the form $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid x, y)$. Here is pseudo-code for generating from this distribution:


FIGURE 20.9. Conditioning versus intervening.

for i = 1, ..., n:
    x_i <- p_X(x_i)
    y_i <- p_{Y|X}(y_i | x_i)
    z_i <- p_{Z|X,Y}(z_i | x_i, y_i)

Suppose we repeat this code many times, yielding data $(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)$. Among all the times that we observe $Y = y$, how often is $Z = z$? The answer to this question is given by the conditional distribution of $Z \mid Y$. Specifically,

$$\begin{aligned}
P(Z = z \mid Y = y) &= \frac{P(Y = y, Z = z)}{P(Y = y)} = \frac{p(y, z)}{p(y)}
= \frac{\sum_x p(x, y, z)}{p(y)} = \frac{\sum_x p(x)\, p(y \mid x)\, p(z \mid x, y)}{p(y)}\\
&= \sum_x p(z \mid x, y)\, \frac{p(y \mid x)\, p(x)}{p(y)}
= \sum_x p(z \mid x, y)\, \frac{p(x, y)}{p(y)}
= \sum_x p(z \mid x, y)\, p(x \mid y).
\end{aligned}$$

Now suppose we intervene by changing the computer code. Specifically, suppose we fix Y at the value y. The code now looks like this:


set Y = y
for i = 1, ..., n:
    x_i <- p_X(x_i)
    z_i <- p_{Z|X,Y}(z_i | x_i, y)

Having set Y = y, how often was Z = z? To answer, note that the intervention has changed the joint probability to be

$$p^*(x, z) = p(x)\, p(z \mid x, y).$$

The answer to our question is given by the marginal distribution

$$p^*(z) = \sum_x p^*(x, z) = \sum_x p(x)\, p(z \mid x, y).$$

We shall denote this as $P(Z = z \mid Y := y)$ or $p(z \mid Y := y)$. We call $P(Z = z \mid Y = y)$ conditioning by observation or passive conditioning. We call $P(Z = z \mid Y := y)$ conditioning by intervention or active conditioning.

Passive conditioning is used to answer a predictive question like: "Given that Joe smokes, what is the probability he will get lung cancer?" Active conditioning is used to answer a causal question like: "If Joe quits smoking, what is the probability he will get lung cancer?"
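To see the difference numerically, here is a small R simulation (not from the text) using an arbitrary discrete distribution with the factorization $p(x)\,p(y \mid x)\,p(z \mid x, y)$ used above; the particular success probabilities are assumptions chosen only for illustration.

set.seed(1)
n <- 100000
# Passive conditioning: generate from the DAG and condition on Y = 1.
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, ifelse(x == 1, 0.8, 0.2))
z <- rbinom(n, 1, 0.2 + 0.3 * x + 0.3 * y)
p.obs <- mean(z[y == 1])          # estimates P(Z = 1 | Y = 1)
# Active conditioning: intervene by setting Y = 1 in the generating code.
x2 <- rbinom(n, 1, 0.5)
z2 <- rbinom(n, 1, 0.2 + 0.3 * x2 + 0.3 * 1)
p.int <- mean(z2)                 # estimates P(Z = 1 | Y := 1)
c(conditioning = p.obs, intervention = p.int)

The two estimates differ because conditioning on Y = 1 changes the distribution of X, while intervening on Y does not.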

Consider a pair (G, P) where G is a DAG and P is a distribution for the variables V of the DAG. Let p denote the probability function for P. Consider intervening and fixing a variable X to be equal to x. We represent the intervention by doing two things:

(1) Create a new DAG $G_*$ by removing all arrows pointing into X;

(2) Create a new distribution $p^*(v) = P(V = v \mid X := x)$ by removing the term $p(x \mid \pi_X)$ from p(v).


The new pair $(G_*, P^*)$ represents the intervention "set X = x."

Example 20.13 You may have noticed a correlation between rain and having a wet lawn; that is, the variable "Rain" is not independent of the variable "Wet Lawn" and hence $p_{R,W}(r, w) \neq p_R(r)\, p_W(w)$, where R denotes Rain and W denotes Wet Lawn. Consider the following two DAG's:

Rain $\longrightarrow$ Wet Lawn $\qquad\qquad$ Rain $\longleftarrow$ Wet Lawn.

The first DAG implies that $p(w, r) = p(r)\, p(w \mid r)$ while the second implies that $p(w, r) = p(w)\, p(r \mid w)$. No matter what the joint distribution p(w, r) is, both graphs are correct. Both imply that R and W are not independent. But, intuitively, if we want a graph to indicate causation, the first graph is right and the second is wrong. Throwing water on your lawn doesn't cause rain. The reason we feel the first is correct while the second is wrong is that the interventions implied by the first graph are correct.

Look at the first graph and form the intervention W = 1, where 1 denotes "wet lawn." Following the rules of intervention, we break the arrows into W to get the modified graph:

Rain $\qquad\qquad$ set Wet Lawn = 1

with distribution $p^*(r) = p(r)$. Thus $P(R = r \mid W := w) = P(R = r)$, which tells us that "wet lawn" does not cause rain.

Suppose we (wrongly) assume that the second graph is the correct causal graph and form the intervention W = 1 on the second graph. There are no arrows into W that need to be broken, so the intervention graph is the same as the original graph. Thus $p^*(r) = p(r \mid w)$, which would imply that changing "wet" changes "rain." Clearly, this is nonsense. Both are correct probability graphs but only the first is correct causally. We know the correct causal graph by using background knowledge.


Remark 20.14 We could try to learn the correct causal graph from data but this is dangerous. In fact it is impossible with two variables. With more than two variables there are methods that can find the causal graph under certain assumptions, but they are large sample methods and, furthermore, there is no way to ever know if the sample size you have is large enough to make the methods reliable.

We can use DAG's to represent confounding variables. If X is a treatment and Y is an outcome, a confounding variable Z is a variable with arrows into both X and Y; see Figure 20.10. It is easy to check, using the formalism of interventions, that the following facts are true.

In a randomized study, the arrow between Z and X is broken. In this case, even with Z unobserved (represented by enclosing Z in a circle), the causal relationship between X and Y is estimable because it can be shown that $E(Y \mid X := x) = E(Y \mid X = x)$, which does not involve the unobserved Z. In an observational study with all confounders observed, we get $E(Y \mid X := x) = \int E(Y \mid X = x, Z = z)\, dF_Z(z)$ as in formula (19.8). If Z is unobserved then we cannot estimate the causal effect because $E(Y \mid X := x) = \int E(Y \mid X = x, Z = z)\, dF_Z(z)$ involves the unobserved Z. We can't just use X and Y since, in this case, $P(Y = y \mid X = x) \neq P(Y = y \mid X := x)$, which is just another way of saying that causation is not association.

In fact, we can make a precise connection between DAG's and counterfactuals as follows. Suppose that X and Y are binary. Define the confounding variable Z by

$$Z = \begin{cases}
1 & \text{if } (C_0, C_1) = (0, 0)\\
2 & \text{if } (C_0, C_1) = (0, 1)\\
3 & \text{if } (C_0, C_1) = (1, 0)\\
4 & \text{if } (C_0, C_1) = (1, 1).
\end{cases}$$


FIGURE 20.10. Randomized study; observational study with measured confounders; observational study with unmeasured confounders. The circled variables are unobserved.

From this, you can make the correspondence between the DAG

approach and the counterfactual approach explicit. I leave this

for the interested reader.

20.7 Bibliographic Remarks

There are a number of texts on DAG's including Edwards (1996) and Jordan (2003). The first use of DAG's for representing causal relationships was by Wright (1934). Modern treatments are contained in Spirtes, Glymour, and Scheines (1990) and Pearl (2000). Robins, Scheines, Spirtes and Wasserman (2003) discuss the problems with estimating causal structure from data.

20.8 Exercises

1. Consider the three DAG's in Figure 20.5 without a collider. Prove that $X \perp\!\!\!\perp Z \mid Y$.

2. Consider the DAG in Figure 20.5 with a collider. Prove that $X \perp\!\!\!\perp Z$ and that X and Z are dependent given Y.

3. Let $X \in \{0, 1\}$, $Y \in \{0, 1\}$, $Z \in \{0, 1, 2\}$. Suppose the distribution of (X, Y, Z) is Markov to:

$$X \longrightarrow Y \longrightarrow Z$$


Create a joint distribution p(x, y, z) that is Markov to this DAG. Generate 1000 random vectors from this distribution. Estimate the distribution from the data using maximum likelihood. Compare the estimated distribution to the true distribution. Let $\theta = (\theta_{000}, \theta_{001}, \ldots, \theta_{112})$ where $\theta_{rst} = P(X = r, Y = s, Z = t)$. Use the bootstrap to get standard errors and 95 per cent confidence intervals for these 12 parameters.

4. Let V = (X, Y, Z) have the following joint distribution:

$$\begin{aligned}
X &\sim \text{Bernoulli}\left(\tfrac{1}{2}\right)\\
Y \mid X = x &\sim \text{Bernoulli}\left(\frac{e^{4x-2}}{1 + e^{4x-2}}\right)\\
Z \mid X = x, Y = y &\sim \text{Bernoulli}\left(\frac{e^{2(x+y)-2}}{1 + e^{2(x+y)-2}}\right).
\end{aligned}$$

(a) Find an expression for $P(Z = z \mid Y = y)$. In particular, find $P(Z = 1 \mid Y = 1)$.

(b) Write a program to simulate the model. Conduct a simulation and compute $P(Z = 1 \mid Y = 1)$ empirically. Plot this as a function of the simulation size N. It should converge to the theoretical value you computed in (a).

(c) Write down an expression for $P(Z = 1 \mid Y := y)$. In particular, find $P(Z = 1 \mid Y := 1)$.


(d) Modify your program to simulate the intervention "set Y = 1." Conduct a simulation and compute $P(Z = 1 \mid Y := 1)$ empirically. Plot this as a function of the simulation size N. It should converge to the theoretical value you computed in (c).

5. This is a continuous, Gaussian version of the last question. Let V = (X, Y, Z) have the following joint distribution:

$$\begin{aligned}
X &\sim \text{Normal}(0, 1)\\
Y \mid X = x &\sim \text{Normal}(\alpha x, 1)\\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}$$

Here, $\alpha$, $\beta$ and $\gamma$ are fixed parameters. Economists refer to models like this as structural equation models.

(a) Find an explicit expression for $f(z \mid y)$ and $E(Z \mid Y = y) = \int z f(z \mid y)\, dz$.

(b) Find an explicit expression for $f(z \mid Y := y)$ and then find $E(Z \mid Y := y) = \int z f(z \mid Y := y)\, dz$. Compare to (a).

(c) Find the joint distribution of (Y, Z). Find the correlation $\rho$ between Y and Z.

(d) Suppose that X is not observed and we try to make causal conclusions from the marginal distribution of (Y, Z). (Think of X as unobserved confounding variables.) In particular, suppose we declare that Y causes Z if $\rho \neq 0$ and we declare that Y does not cause Z if $\rho = 0$. Show that this will lead to erroneous conclusions.

(e) Suppose we conduct a randomized experiment in which Y is randomly assigned. To be concrete, suppose that

$$\begin{aligned}
X &\sim \text{Normal}(0, 1)\\
Y &\sim \text{Normal}(\alpha, 1)\\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}$$

Show that the method in (d) now yields correct conclusions, i.e., $\rho = 0$ if and only if $f(z \mid Y := y)$ does not depend on y.


21

Nonparametric Curve Estimation

In this Chapter we discuss nonparametric estimation of probability density functions and regression functions, which we refer to as curve estimation.

In Chapter 8 we saw that it is possible to consistently estimate a cumulative distribution function F without making any assumptions about F. If we want to estimate a probability density function f(x) or a regression function $r(x) = E(Y \mid X = x)$, the situation is different. We cannot estimate these functions consistently without making some smoothness assumptions. Correspondingly, we will perform some sort of smoothing operation on the data.

A simple example of a density estimator is a histogram, which we discuss in detail in Section 21.2. To form a histogram estimator of a density f, we divide the real line into disjoint sets called bins. The histogram estimator is a piecewise constant function where the height of the function is proportional to the number of observations in each bin; see Figure 21.3. The number of bins is an example of a smoothing parameter. If we smooth too much (large bins) we get a highly biased estimator, while if we smooth too little (small bins) we get a highly variable estimator. Much of curve estimation is concerned with trying to optimally balance variance and bias.

FIGURE 21.1. A curve estimate $\hat g$ is random because it is a function of the data. The point x at which we evaluate $\hat g$ is not a random variable.

21.1 The Bias-Variance Tradeoff

Let g denote an unknown function and let $\hat g_n$ denote an estimator of g. Bear in mind that $\hat g_n(x)$ is a random function evaluated at a point x: $\hat g_n$ is random because it depends on the data. Indeed, we could be more explicit and write $\hat g_n(x) = h_x(X_1, \ldots, X_n)$ to show that $\hat g_n(x)$ is a function of the data $X_1, \ldots, X_n$ and that the function could be different for each x. See Figure 21.1.

As a loss function, we will use the integrated squared error (ISE):

$$L(g, \hat g_n) = \int \left(g(u) - \hat g_n(u)\right)^2 du. \qquad (21.1)$$

The risk or mean integrated squared error (MISE) is

$$R(g, \hat g_n) = E\left[L(g, \hat g_n)\right]. \qquad (21.2)$$


Lemma 21.1 The risk can be written as

$$R(g, \hat g_n) = \int b^2(x)\, dx + \int v(x)\, dx \qquad (21.3)$$

where

$$b(x) = E\left(\hat g_n(x)\right) - g(x) \qquad (21.4)$$

is the bias of $\hat g_n(x)$ at a fixed x and

$$v(x) = V\left(\hat g_n(x)\right) = E\left[\left(\hat g_n(x) - E(\hat g_n(x))\right)^2\right] \qquad (21.5)$$

is the variance of $\hat g_n(x)$ at a fixed x.

In summary,

$$\text{RISK} = \text{BIAS}^2 + \text{VARIANCE}. \qquad (21.6)$$

When the data are over-smoothed, the bias term is large and the variance is small. When the data are under-smoothed the opposite is true; see Figure 21.2. This is called the bias-variance trade-off. Minimizing risk corresponds to balancing bias and variance.

21.2 Histograms

Let $X_1, \ldots, X_n$ be iid on [0, 1] with density f. The restriction to [0, 1] is not crucial; we can always rescale the data to be on this interval. Let m be an integer and define bins

$$B_1 = \left[0, \frac{1}{m}\right),\ B_2 = \left[\frac{1}{m}, \frac{2}{m}\right),\ \ldots,\ B_m = \left[\frac{m-1}{m}, 1\right]. \qquad (21.7)$$

Define the binwidth $h = 1/m$, let $\nu_j$ be the number of observations in $B_j$, let $\hat p_j = \nu_j/n$ and let $p_j = \int_{B_j} f(u)\, du$.


FIGURE 21.2. The bias-variance trade-off. The bias increases and the variance decreases with the amount of smoothing. The optimal amount of smoothing, indicated by the vertical line, minimizes the risk = bias² + variance.


The histogram estimator is defined by

$$\hat f_n(x) = \begin{cases}
\hat p_1/h & x \in B_1\\
\hat p_2/h & x \in B_2\\
\vdots & \\
\hat p_m/h & x \in B_m
\end{cases}$$

which we can write more succinctly as

$$\hat f_n(x) = \sum_{j=1}^{m} \frac{\hat p_j}{h}\, I(x \in B_j). \qquad (21.8)$$

To understand the motivation for this estimator, let $p_j = \int_{B_j} f(u)\, du$ and note that, for $x \in B_j$ and h small,

$$\hat f_n(x) = \frac{\hat p_j}{h} \approx \frac{p_j}{h} = \frac{\int_{B_j} f(u)\, du}{h} \approx \frac{f(x)\, h}{h} = f(x).$$

Example 21.2 Figure 21.3 shows three different histograms based on n = 1,266 data points from an astronomical sky survey. Each data point represents the distance from us to a galaxy. The galaxies lie on a "pencil beam" pointing directly from the Earth out into space. Because of the finite speed of light, looking at galaxies farther and farther away corresponds to looking back in time. Choosing the right number of bins involves finding a good tradeoff between bias and variance. We shall see later that the top left histogram has too few bins, resulting in oversmoothing and too much bias. The bottom left histogram has too many bins, resulting in undersmoothing and too much variance. The top right histogram is just right. The histogram reveals the presence of clusters of galaxies. Seeing how the size and number of galaxy clusters varies with time helps cosmologists understand the evolution of the universe.

The mean and variance of $\hat f_n(x)$ are given in the following theorem.


FIGURE 21.3. Three versions of a histogram for the astronomy data (oversmoothed, just right, undersmoothed). The top left histogram has too few bins. The bottom left histogram has too many bins. The top right histogram is just right. The lower right plot shows the estimated risk (cross-validation score) versus the number of bins.


Theorem 21.3 Consider fixed x and fixed m, and let $B_j$ be the bin containing x. Then,

$$E(\hat f_n(x)) = \frac{p_j}{h} \quad\text{and}\quad V(\hat f_n(x)) = \frac{p_j(1 - p_j)}{n h^2}. \qquad (21.9)$$

Let's take a closer look at the bias-variance tradeoff using equation (21.9). Consider some $x \in B_j$. For any other $u \in B_j$,

$$f(u) \approx f(x) + (u - x)\, f'(x)$$

and so

$$p_j = \int_{B_j} f(u)\, du \approx \int_{B_j} \left[f(x) + (u - x) f'(x)\right] du
= f(x)\, h + h f'(x)\left(h\left(j - \tfrac{1}{2}\right) - x\right).$$

Therefore, the bias b(x) is

$$b(x) = E(\hat f_n(x)) - f(x) = \frac{p_j}{h} - f(x)
\approx \frac{f(x)\, h + h f'(x)\left(h\left(j - \tfrac{1}{2}\right) - x\right)}{h} - f(x)
= f'(x)\left(h\left(j - \tfrac{1}{2}\right) - x\right).$$

If $\widetilde x_j$ is the center of the bin, then

$$\int_{B_j} b^2(x)\, dx \approx \int_{B_j} (f'(x))^2 \left(h\left(j - \tfrac{1}{2}\right) - x\right)^2 dx
\approx (f'(\widetilde x_j))^2 \int_{B_j} \left(h\left(j - \tfrac{1}{2}\right) - x\right)^2 dx
= (f'(\widetilde x_j))^2\, \frac{h^3}{12}.$$

Therefore,

$$\int_0^1 b^2(x)\, dx = \sum_{j=1}^m \int_{B_j} b^2(x)\, dx
\approx \sum_{j=1}^m (f'(\widetilde x_j))^2\, \frac{h^3}{12}
= \frac{h^2}{12} \sum_{j=1}^m h\, (f'(\widetilde x_j))^2
\approx \frac{h^2}{12} \int_0^1 (f'(x))^2\, dx.$$


Note that this increases as a function of h. Now consider the variance. For h small, $1 - p_j \approx 1$, so

$$v(x) \approx \frac{p_j}{n h^2}
= \frac{f(x)\, h + h f'(x)\left(h\left(j - \tfrac{1}{2}\right) - x\right)}{n h^2}
\approx \frac{f(x)}{n h}$$

where we have kept only the dominant term. So,

$$\int_0^1 v(x)\, dx \approx \frac{1}{n h}.$$

Note that this decreases with h. Putting all this together, we get:

Theorem 21.4 Suppose that $\int (f'(u))^2\, du < \infty$. Then

$$R(\hat f_n, f) \approx \frac{h^2}{12} \int (f'(u))^2\, du + \frac{1}{n h}. \qquad (21.10)$$

The value $h^*$ that minimizes (21.10) is

$$h^* = \frac{1}{n^{1/3}} \left(\frac{6}{\int (f'(u))^2\, du}\right)^{1/3}. \qquad (21.11)$$

With this choice of binwidth,

$$R(\hat f_n, f) \approx \frac{C}{n^{2/3}} \qquad (21.12)$$

where $C = (3/4)^{2/3} \left(\int (f'(u))^2\, du\right)^{1/3}$.

Theorem 21.4 is quite revealing. We see that with an optimally chosen binwidth, the MISE decreases to 0 at rate $n^{-2/3}$. By comparison, most parametric estimators converge at rate $n^{-1}$. The slower rate of convergence is the price we pay for being nonparametric. The formula for the optimal binwidth $h^*$ is of theoretical interest, but it is not useful in practice since it depends on the unknown function f.

A practical way to choose the binwidth is to estimate the risk function and minimize over h. Recall that the loss function, which we now write as a function of h, is

$$L(h) = \int \left(\hat f_n(x) - f(x)\right)^2 dx
= \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx + \int f^2(x)\, dx.$$

The last term does not depend on the binwidth h, so minimizing the risk is equivalent to minimizing the expected value of

$$J(h) = \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx.$$

We shall refer to E(J(h)) as the risk, although it differs from the true risk by the constant term $\int f^2(x)\, dx$.

Definition 21.5 The cross-validation estimator of risk is

$$\hat J(h) = \int \left(\hat f_n(x)\right)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i) \qquad (21.13)$$

where $\hat f_{(-i)}$ is the histogram estimator obtained after removing the ith observation. We refer to $\hat J(h)$ as the cross-validation score or estimated risk.

Theorem 21.6 The cross-validation estimator is nearly unbiased:

$$E(\hat J(h)) \approx E(J(h)).$$


In principle, we need to recompute the histogram n times to compute $\hat J(h)$. Moreover, this has to be done for all values of h. Fortunately, there is a shortcut formula.

Theorem 21.7 The following identity holds:

$$\hat J(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat p_j^2. \qquad (21.14)$$

Example 21.8 We used cross-validation in the astronomy example. The cross-validation function is quite flat near its minimum. Any m in the range of 73 to 310 is an approximate minimizer, but the resulting histogram does not change much over this range. The histogram in the top right plot in Figure 21.3 was constructed using m = 73 bins. The bottom right plot shows the estimated risk, or more precisely $\hat J(h)$, plotted versus the number of bins.
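The shortcut formula (21.14) makes the search over the number of bins cheap. Here is a short R sketch (illustrative only, assuming the data have been rescaled to [0, 1]):

hist.cv <- function(x, m) {
  # Cross-validation score (21.14) for a histogram with m bins on [0, 1].
  n <- length(x)
  h <- 1 / m
  counts <- tabulate(pmin(floor(x / h) + 1, m), nbins = m)   # bin counts
  p.hat <- counts / n                                        # estimated bin probabilities
  2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum(p.hat^2)
}
x <- runif(500)                                    # placeholder data on [0, 1]
cv <- sapply(1:100, function(m) hist.cv(x, m))     # score for m = 1, ..., 100 bins
m.star <- which.min(cv)                            # cross-validation choice of m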

Next we want some sort of confidence set for f. Suppose $\hat f_n$ is a histogram with m bins and binwidth h = 1/m. We cannot realistically make confidence statements about the fine details of the true density f. Instead, we shall make confidence statements about f at the resolution of the histogram. To this end, define

$$\widetilde f(x) = \frac{p_j}{h} \quad \text{for } x \in B_j \qquad (21.15)$$

where $p_j = \int_{B_j} f(u)\, du$; this is a "histogramized" version of f.

Theorem 21.9 Let m = m(n) be the number of bins in the histogram $\hat f_n$. Assume that $m(n) \to \infty$ and $m(n) \log n / n \to 0$ as $n \to \infty$. Define

$$\ell(x) = \left(\max\left\{\sqrt{\hat f_n(x)} - c,\ 0\right\}\right)^2, \qquad
u(x) = \left(\sqrt{\hat f_n(x)} + c\right)^2 \qquad (21.16)$$

where

$$c = \frac{z_{\alpha/(2m)}}{2} \sqrt{\frac{m}{n}}. \qquad (21.17)$$

Then,

$$\lim_{n \to \infty} P\left(\ell(x) \le \widetilde f(x) \le u(x) \ \text{for all } x\right) \ge 1 - \alpha. \qquad (21.18)$$

Proof. Here is an outline of the proof; a rigorous proof requires fairly sophisticated tools. From the central limit theorem, $\hat p_j \approx N(p_j, p_j(1 - p_j)/n)$. By the delta method, $\sqrt{\hat p_j} \approx N(\sqrt{p_j}, 1/(4n))$. Moreover, it can be shown that the $\sqrt{\hat p_j}$'s are approximately independent. Therefore,

$$2\sqrt{n}\left(\sqrt{\hat p_j} - \sqrt{p_j}\right) \approx Z_j \qquad (21.19)$$

where $Z_1, \ldots, Z_m \sim N(0, 1)$. Let A be the event that $\ell(x) \le \widetilde f(x) \le u(x)$ for all x. Then

$$A = \left\{\ell(x) \le \widetilde f(x) \le u(x) \ \text{for all } x\right\}
= \left\{\sqrt{\ell(x)} \le \sqrt{\widetilde f(x)} \le \sqrt{u(x)} \ \text{for all } x\right\}
= \left\{\sqrt{\hat f_n(x)} - c \le \sqrt{\widetilde f(x)} \le \sqrt{\hat f_n(x)} + c \ \text{for all } x\right\}
= \left\{\max_x \left|\sqrt{\hat f_n(x)} - \sqrt{\widetilde f(x)}\right| \le c\right\}.$$

Therefore,

$$\begin{aligned}
P(A^c) &= P\left(\max_x \left|\sqrt{\hat f_n(x)} - \sqrt{\widetilde f(x)}\right| > c\right)
= P\left(\max_j \left|\sqrt{\frac{\hat p_j}{h}} - \sqrt{\frac{p_j}{h}}\right| > c\right)\\
&= P\left(\max_j \left|\sqrt{\hat p_j} - \sqrt{p_j}\right| > c\sqrt{h}\right)
= P\left(\max_j 2\sqrt{n}\left|\sqrt{\hat p_j} - \sqrt{p_j}\right| > 2c\sqrt{hn}\right)\\
&= P\left(\max_j 2\sqrt{n}\left|\sqrt{\hat p_j} - \sqrt{p_j}\right| > z_{\alpha/(2m)}\right)
\approx P\left(\max_j |Z_j| > z_{\alpha/(2m)}\right)\\
&\le \sum_{j=1}^m P\left(|Z_j| > z_{\alpha/(2m)}\right) = \sum_{j=1}^m \frac{\alpha}{m} = \alpha. \quad \blacksquare
\end{aligned}$$

Example 21.10 Figure 21.4 shows a 95 per cent confidence envelope for the astronomy data. We see that even with over 1,000 data points, there is still substantial uncertainty.

21.3 Kernel Density Estimation

Histograms are discontinuous. Kernel density estimators are smoother, and they converge faster to the true density than histograms.

Let $X_1, \ldots, X_n$ denote the observed data, a sample from f. In this chapter, a kernel is defined to be any smooth function K such that $K(x) \ge 0$, $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\, dx > 0$. Two examples of kernels are the Epanechnikov kernel

$$K(x) = \begin{cases}
\frac{3}{4}\left(1 - x^2/5\right)/\sqrt{5} & |x| < \sqrt{5}\\
0 & \text{otherwise}
\end{cases} \qquad (21.20)$$

and the Gaussian (Normal) kernel $K(x) = (2\pi)^{-1/2} e^{-x^2/2}$.

Definition 21.11 Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left(\frac{x - X_i}{h}\right). \qquad (21.21)$$

An example of a kernel density estimator is shown in Figure 21.5. The kernel estimator effectively puts a smoothed-out lump of mass of size 1/n over each data point $X_i$. The bandwidth h controls the amount of smoothing. When h is close to 0, $\hat f_n$ consists of a set of spikes, one at each data point, and the height of the spikes tends to infinity as $h \to 0$. When $h \to \infty$, $\hat f_n$ tends to a uniform density.

FIGURE 21.4. 95 per cent confidence envelope for the astronomy data using m = 73 bins.

FIGURE 21.5. A kernel density estimator $\hat f$. At each point x, $\hat f(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars.
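Equation (21.21) is only a few lines of code. The following R sketch (not from the text) uses a Gaussian kernel; the built-in density() function produces essentially the same estimate.

kde <- function(x, data, h) {
  # Kernel density estimator (21.21) with a Gaussian kernel,
  # evaluated at each point of the vector x.
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}
data <- rnorm(200)                         # placeholder data
grid <- seq(-4, 4, length.out = 200)
f.hat <- kde(grid, data, h = 0.3)          # compare with density(data, bw = 0.3)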

Example 21.12 Figure 21.6 shows kernel density estimators for the astronomy data using three different bandwidths. In each case we used a Gaussian kernel. The properly smoothed kernel density estimator in the top right panel shows structure similar to that of the histogram. However, it is easier to see the clusters in the kernel estimator.

To construct a kernel density estimator, we need to choose

a kernel K and a bandwidth h. It can be shown theoretically

and empirically that the choice of K is not crucial. However, the

choice of bandwidth h is very important. As with the histogram,

we can make a theoretical statement about how the risk of the

estimator depends on the bandwidth.

Theorem 21.13 Under weak assumptions on f and K,

$$R(f, \hat f_n) \approx \frac{1}{4} \sigma_K^4 h^4 \int (f''(x))^2\, dx + \frac{\int K^2(x)\, dx}{n h} \qquad (21.22)$$

where $\sigma_K^2 = \int x^2 K(x)\, dx$. The optimal bandwidth is

$$h^* = \frac{c_1^{-2/5}\, c_2^{1/5}\, c_3^{-1/5}}{n^{1/5}} \qquad (21.23)$$

where $c_1 = \int x^2 K(x)\, dx$, $c_2 = \int K(x)^2\, dx$ and $c_3 = \int (f''(x))^2\, dx$. With this choice of bandwidth,

$$R(f, \hat f_n) \approx \frac{c_4}{n^{4/5}}$$

for some constant $c_4 > 0$.

FIGURE 21.6. Kernel density estimators (oversmoothed, just right, undersmoothed) and estimated risk versus bandwidth for the astronomy data.

Proof. Write $K_h(x, X) = h^{-1} K\left((x - X)/h\right)$ and $\hat f_n(x) = n^{-1} \sum_i K_h(x, X_i)$. Thus, $E[\hat f_n(x)] = E[K_h(x, X)]$ and $V[\hat f_n(x)] = n^{-1} V[K_h(x, X)]$. Now,

$$\begin{aligned}
E[K_h(x, X)] &= \int \frac{1}{h} K\left(\frac{x - t}{h}\right) f(t)\, dt
= \int K(u)\, f(x - h u)\, du\\
&= \int K(u) \left[f(x) - h u f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots\right] du\\
&= f(x) + \frac{1}{2} h^2 f''(x) \int u^2 K(u)\, du + \cdots
\end{aligned}$$

since $\int K(x)\, dx = 1$ and $\int x K(x)\, dx = 0$. The bias is

$$E[K_h(x, X)] - f(x) \approx \frac{1}{2} \sigma_K^2 h^2 f''(x).$$

By a similar calculation,

$$V[\hat f_n(x)] \approx \frac{f(x) \int K^2(x)\, dx}{n h}.$$

The second result follows from integrating the squared bias plus the variance. $\blacksquare$

We see that kernel estimators converge at rate $n^{-4/5}$ while histograms converge at the slower rate $n^{-2/3}$. It can be shown that, under weak assumptions, there does not exist a nonparametric estimator that converges faster than $n^{-4/5}$.

The expression for $h^*$ depends on the unknown density f, which makes the result of little practical use. As with the histograms, we shall use cross-validation to find a bandwidth. Thus, we estimate the risk (up to a constant) by

$$\hat J(h) = \int \hat f^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{-i}(X_i) \qquad (21.24)$$

where $\hat f_{-i}$ is the kernel density estimator obtained after omitting the ith observation.


Theorem 21.14 For any h > 0,

$$E[\hat J(h)] = E[J(h)].$$

Also,

$$\hat J(h) \approx \frac{1}{h n^2} \sum_i \sum_j K^*\left(\frac{X_i - X_j}{h}\right) + \frac{2}{n h} K(0) \qquad (21.25)$$

where $K^*(x) = K^{(2)}(x) - 2 K(x)$ and $K^{(2)}(z) = \int K(z - y) K(y)\, dy$. In particular, if K is a N(0,1) Gaussian kernel, then $K^{(2)}(z)$ is the N(0, 2) density.

We then choose the bandwidth $h_n$ that minimizes $\hat J(h)$. (For large data sets, $\hat f$ and (21.25) can be computed quickly using the fast Fourier transform.) A justification for this method is given by the following remarkable theorem due to Stone.

Theorem 21.15 (Stone's Theorem) Suppose that f is bounded. Let $\hat f_h$ denote the kernel estimator with bandwidth h and let $h_n$ denote the bandwidth chosen by cross-validation. Then,

$$\frac{\int \left(f(x) - \hat f_{h_n}(x)\right)^2 dx}{\inf_h \int \left(f(x) - \hat f_h(x)\right)^2 dx} \xrightarrow{P} 1. \qquad (21.26)$$

Example 21.16 The top right panel of Figure 21.6 is based on cross-validation. These data are rounded, which causes problems for cross-validation. Specifically, it causes the minimizer to be h = 0. To overcome this problem, we added a small amount of random Normal noise to the data. The result is that $\hat J(h)$ is very smooth with a well defined minimum.
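For a Gaussian kernel, the approximation (21.25) is straightforward to evaluate on a grid of bandwidths. Here is an illustrative R sketch (not the book's code):

kde.cv <- function(data, h) {
  # Cross-validation score (21.25) for a N(0,1) kernel:
  # K* = K2 - 2K, where K2 is the N(0, 2) density.
  n <- length(data)
  d <- outer(data, data, "-") / h
  k.star <- dnorm(d, sd = sqrt(2)) - 2 * dnorm(d)
  sum(k.star) / (h * n^2) + 2 * dnorm(0) / (n * h)
}
data <- rnorm(500)                                   # placeholder data
h.grid <- seq(0.05, 1, by = 0.01)
scores <- sapply(h.grid, function(h) kde.cv(data, h))
h.star <- h.grid[which.min(scores)]                  # cross-validation bandwidth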

Remark 21.17 Do not assume that, if the estimator $\hat f$ is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.


To construct confidence bands, we use something similar to what we did for histograms, although the details are more complicated. The version described here is from Chaudhuri and Marron (1999).


Confidence Band for Kernel Density Estimator

1. Choose an evenly spaced grid of points $V = \{v_1, \ldots, v_N\}$. For every $v \in V$, define

$$Y_i(v) = \frac{1}{h} K\left(\frac{v - X_i}{h}\right). \qquad (21.27)$$

Note that $\hat f_n(v) = \overline Y_n(v)$, the average of the $Y_i(v)$'s.

2. Define

$$\text{se}(v) = \frac{s(v)}{\sqrt{n}} \qquad (21.28)$$

where $s^2(v) = (n - 1)^{-1} \sum_{i=1}^n \left(Y_i(v) - \overline Y_n(v)\right)^2$.

3. Compute the effective sample size

$$\text{ESS}(v) = \frac{\sum_{i=1}^n K\left(\frac{v - X_i}{h}\right)}{K(0)}. \qquad (21.29)$$

Roughly, this is the number of data points being averaged together to form $\hat f_n(v)$.

4. Let $V^* = \{v \in V : \text{ESS}(v) \ge 5\}$. Now define the number of independent blocks m by

$$\frac{1}{m} = \frac{\overline{\text{ESS}}}{n} \qquad (21.30)$$

where $\overline{\text{ESS}}$ is the average of ESS(v) over $V^*$.

5. Let

$$q = \Phi^{-1}\left(\frac{1 + (1 - \alpha)^{1/m}}{2}\right) \qquad (21.31)$$

and define

$$\ell(v) = \hat f_n(v) - q\, \text{se}(v) \quad \text{and} \quad u(v) = \hat f_n(v) + q\, \text{se}(v) \qquad (21.32)$$

where we replace $\ell(v)$ with 0 if it becomes negative. Then,

$$P\left(\ell(v) \le f(v) \le u(v) \ \text{for all } v\right) \approx 1 - \alpha.$$
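The five steps above translate almost line for line into code. Here is a minimal R sketch (an illustration assuming a Gaussian kernel, not the authors' implementation):

kde.band <- function(data, h, grid, alpha = 0.05) {
  n <- length(data)
  # Step 1: Y_i(v) = K((v - X_i)/h) / h for each grid point v (rows = grid points).
  Y <- outer(grid, data, function(v, x) dnorm((v - x) / h) / h)
  f.hat <- rowMeans(Y)
  # Step 2: standard error at each v.
  se <- apply(Y, 1, sd) / sqrt(n)
  # Step 3: effective sample size at each v.
  ess <- rowSums(dnorm(outer(grid, data, "-") / h)) / dnorm(0)
  # Step 4: number of independent blocks.
  m <- n / mean(ess[ess >= 5])
  # Step 5: simultaneous critical value and band.
  q <- qnorm((1 + (1 - alpha)^(1 / m)) / 2)
  data.frame(v = grid, lower = pmax(f.hat - q * se, 0),
             f.hat = f.hat, upper = f.hat + q * se)
}
band <- kde.band(rnorm(500), h = 0.3, grid = seq(-4, 4, length.out = 100))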


Example 21.18 Figure 21.7 shows approximate 95 per cent confidence bands for the astronomy data.

Suppose now that the data $X_i = (X_{i1}, \ldots, X_{id})$ are d-dimensional. The kernel estimator can easily be generalized to d dimensions. Let $h = (h_1, \ldots, h_d)$ be a vector of bandwidths and define

$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \qquad (21.33)$$

where

$$K_h(x - X_i) = \frac{1}{h_1 \cdots h_d} \prod_{j=1}^d K\left(\frac{x_j - X_{ij}}{h_j}\right) \qquad (21.34)$$

and $h_1, \ldots, h_d$ are bandwidths. For simplicity, we might take $h_j = s_j h$ where $s_j$ is the standard deviation of the jth variable. There is then only a single bandwidth h to choose. Using calculations like those in the one-dimensional case, the risk is given by

$$R(f, \hat f_n) \approx \frac{1}{4} \sigma_K^4 \left[\sum_{j=1}^d h_j^4 \int f_{jj}^2(x)\, dx + \sum_{j \ne k} h_j^2 h_k^2 \int f_{jj}\, f_{kk}\, dx\right] + \frac{\left(\int K^2(x)\, dx\right)^d}{n\, h_1 \cdots h_d}$$

where $f_{jj}$ is the second partial derivative of f. The optimal bandwidth satisfies $h_i \approx c_1 n^{-1/(4+d)}$, leading to a risk of order $n^{-4/(4+d)}$. From this fact, we see that the risk increases quickly with dimension, a problem usually called the curse of dimensionality. To get a sense of how serious this problem is, consider the following table from Silverman (1986), which shows the sample size required to ensure a relative mean squared error less than 0.1 at 0 when the density is multivariate normal and the optimal bandwidth is selected.


FIGURE 21.7. 95 per cent confidence bands for the kernel density estimate for the astronomy data.


Dimension    Sample Size
    1                  4
    2                 19
    3                 67
    4                223
    5                768
    6              2,790
    7             10,700
    8             43,700
    9            187,000
   10            842,000

This is bad news indeed. It says that having 842,000 observations in a ten-dimensional problem is really like having 4 observations.

21.4 Nonparametric Regression

Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by

$$Y_i = r(x_i) + \epsilon_i \qquad (21.35)$$

where $E(\epsilon_i) = 0$ and we are treating the $x_i$'s as fixed. In nonparametric regression, we want to estimate the regression function $r(x) = E(Y \mid X = x)$.

There are many nonparametric regression estimators. Most involve estimating r(x) by taking some sort of weighted average of the $Y_i$'s, giving higher weight to those points near x. In particular, the Nadaraya-Watson kernel estimator is defined by


$$\hat r(x) = \sum_{i=1}^n w_i(x)\, Y_i \qquad (21.36)$$

where the weights $w_i(x)$ are given by

$$w_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x - x_j}{h}\right)}. \qquad (21.37)$$

The form of this estimator comes from first estimating the joint density f(x, y) using kernel density estimation and then inserting this into

$$r(x) = E(Y \mid X = x) = \int y\, f(y \mid x)\, dy = \frac{\int y\, f(x, y)\, dy}{\int f(x, y)\, dy}.$$

Theorem 21.19 Suppose that $V(\epsilon_i) = \sigma^2$. The risk of the Nadaraya-Watson kernel estimator is

$$R(\hat r_n, r) \approx \frac{h^4}{4}\left(\int x^2 K(x)\, dx\right)^2 \int \left(r''(x) + 2 r'(x)\, \frac{f'(x)}{f(x)}\right)^2 dx
+ \frac{\sigma^2 \int K^2(x)\, dx}{n h} \int \frac{dx}{f(x)}. \qquad (21.38)$$

The optimal bandwidth decreases at rate $n^{-1/5}$ and with this choice the risk decreases at rate $n^{-4/5}$.

In practice, to choose the bandwidth h we minimize the cross-validation score

$$\hat J(h) = \sum_{i=1}^n \left(Y_i - \hat r_{-i}(x_i)\right)^2 \qquad (21.39)$$

where $\hat r_{-i}$ is the estimator we get by omitting the ith observation. An approximation to $\hat J(h)$ is given by

$$\hat J(h) = \sum_{i=1}^n \left(Y_i - \hat r(x_i)\right)^2 \left(1 - \frac{K(0)}{\sum_{j=1}^n K\left(\frac{x_i - x_j}{h}\right)}\right)^{-2}. \qquad (21.40)$$
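A direct R sketch of the Nadaraya-Watson estimator and the approximate cross-validation score (21.40), assuming a Gaussian kernel (illustrative only):

nw <- function(x0, x, y, h) {
  # Nadaraya-Watson estimate (21.36)-(21.37) at the points x0.
  w <- dnorm(outer(x0, x, "-") / h)      # unnormalized weights; the 1/h factors cancel
  as.vector((w %*% y) / rowSums(w))
}
nw.cv <- function(x, y, h) {
  # Approximate cross-validation score (21.40).
  w <- dnorm(outer(x, x, "-") / h)
  r.hat <- as.vector((w %*% y) / rowSums(w))
  lev <- dnorm(0) / rowSums(w)           # K(0) / sum_j K((x_i - x_j)/h)
  sum(((y - r.hat) / (1 - lev))^2)
}
# Choose h by minimizing nw.cv over a grid, for example:
# h.grid <- seq(0.01, 1, length.out = 50)
# h.star <- h.grid[which.min(sapply(h.grid, function(h) nw.cv(x, y, h)))]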


Example 21.20 Figure 21.8 shows cosmic microwave background (CMB) data from BOOMERaNG (Netterfield et al. 2001), Maxima (Lee et al. 2001) and DASI (Halverson 2001). The data consist of n pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$ where $x_i$ is called the multipole moment and $Y_i$ is the estimated power spectrum of the temperature fluctuations. What you are seeing are sound waves in the cosmic microwave background radiation, which is the heat left over from the big bang. If r(x) denotes the true power spectrum, then

$$Y_i = r(x_i) + \epsilon_i$$

where $\epsilon_i$ is a random error with mean 0. The location and size of peaks in r(x) provide valuable clues about the behavior of the early universe. Figure 21.8 shows the fit based on cross-validation as well as an undersmoothed and an oversmoothed fit. The cross-validation fit shows the presence of three well-defined peaks, as predicted by the physics of the big bang.

The procedure for finding confidence bands is similar to that for density estimation. However, we first need to estimate $\sigma^2$. Suppose that the $x_i$'s are ordered. Assuming r(x) is smooth, we have $r(x_{i+1}) - r(x_i) \approx 0$ and hence

$$Y_{i+1} - Y_i = \left[r(x_{i+1}) + \epsilon_{i+1}\right] - \left[r(x_i) + \epsilon_i\right] \approx \epsilon_{i+1} - \epsilon_i$$

and hence

$$V(Y_{i+1} - Y_i) \approx V(\epsilon_{i+1} - \epsilon_i) = V(\epsilon_{i+1}) + V(\epsilon_i) = 2\sigma^2.$$

We can thus use the average of the n − 1 squared differences $(Y_{i+1} - Y_i)^2$ to estimate $2\sigma^2$. Hence, define

$$\hat\sigma^2 = \frac{1}{2(n-1)} \sum_{i=1}^{n-1} (Y_{i+1} - Y_i)^2. \qquad (21.41)$$


FIGURE 21.8. Regression analysis of the CMB data. The first fit is undersmoothed, the second is oversmoothed and the third is based on cross-validation. The last panel shows the estimated risk versus the bandwidth of the smoother. The data are from BOOMERaNG, Maxima and DASI.


Confidence Bands for Kernel Regression

Follow the same procedure as for kernel density estimators except change the definition of se(v) to

$$\text{se}(v) = \hat\sigma \sqrt{\sum_{i=1}^n w_i^2(v)} \qquad (21.42)$$

where $\hat\sigma$ is defined in (21.41).

Example 21.21 Figure 21.9 shows a 95 per cent confidence envelope for the CMB data. We see that we are highly confident of the existence and position of the first peak. We are more uncertain about the second and third peaks. At the time of this writing, more accurate data are becoming available that apparently provide sharper estimates of the second and third peaks.

The extension to multiple regressors $X = (X_1, \ldots, X_p)$ is straightforward. As with kernel density estimation, we just replace the kernel with a multivariate kernel. However, the same caveats about the curse of dimensionality apply. In some cases, we might consider putting some restrictions on the regression function, which will then reduce the curse of dimensionality. For example, additive regression is based on the model

$$Y = \sum_{j=1}^p r_j(X_j) + \epsilon. \qquad (21.43)$$

Now we only need to fit p one-dimensional functions. The model can be enriched by adding various interactions, for example,

$$Y = \sum_{j=1}^p r_j(X_j) + \sum_{j<k} r_{jk}(X_j, X_k) + \epsilon. \qquad (21.44)$$

Additive models are usually fit by an algorithm called backfitting.


FIGURE 21.9. 95 per cent confidence envelope for the CMB data.


Backfitting

1. Initialize $\hat r_1(x_1), \ldots, \hat r_p(x_p)$.

2. For j = 1, ..., p:

(a) Let $\epsilon_i = Y_i - \sum_{s \ne j} \hat r_s(x_{is})$.

(b) Let $\hat r_j$ be the function estimate obtained by regressing the $\epsilon_i$'s on the jth covariate.

3. If converged, STOP. Else, go back to step 2.
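As a concrete illustration (not the book's code), here is a bare-bones R version of backfitting that uses smooth.spline() for each one-dimensional regression; the choice of smoother and the fixed iteration count are assumptions made for the sketch.

backfit <- function(X, y, iters = 20) {
  # X: n x p matrix of covariates; y: response vector.
  p <- ncol(X)
  fits <- matrix(0, nrow(X), p)          # current estimates r_j(x_{ij})
  for (it in 1:iters) {
    for (j in 1:p) {
      resid <- y - rowSums(fits[, -j, drop = FALSE])   # step (a): partial residuals
      sp <- smooth.spline(X[, j], resid)               # step (b): one-dimensional smooth
      fits[, j] <- predict(sp, X[, j])$y
    }
  }
  fits                                   # fitted values of each additive component
}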

Additive models have the advantage that they avoid the curse of dimensionality and they can be fit quickly, but they have one disadvantage: the model is not fully nonparametric. In other words, the true regression function r(x) may not be of the form (21.43).

21.5 Appendix: Confidence Sets and Bias

The confidence bands we computed are not for the density function or regression function but rather for the smoothed function. For example, the confidence band for a kernel density estimate with bandwidth h is a band for the function one gets by smoothing the true function with a kernel with the same bandwidth. Getting a confidence set for the true function is complicated for reasons we now explain.

First, let's review the parametric case. When estimating a scalar quantity $\theta$ with an estimator $\hat\theta$, the usual confidence interval is of the form $\hat\theta \pm z_{\alpha/2}\, s_n$, where $\hat\theta$ is the maximum likelihood estimator and $s_n = \sqrt{\hat V(\hat\theta)}$ is the estimated standard error of the estimator. Under standard regularity conditions, $\hat\theta \approx N(\theta, s_n^2)$ and

$$\lim_{n \to \infty} P\left(\hat\theta - z_{\alpha/2}\, s_n \le \theta \le \hat\theta + z_{\alpha/2}\, s_n\right) = 1 - \alpha.$$

But let us take a closer look. Let $b_n = E(\hat\theta_n) - \theta$. We know that $\hat\theta \approx N(\theta, s_n^2)$, but a more accurate statement is $\hat\theta \approx N(\theta + b_n, s_n^2)$. We usually ignore the bias term $b_n$ in large sample calculations, but let's keep track of it. The coverage of the usual confidence interval is

$$P\left(\hat\theta - z_{\alpha/2}\, s_n \le \theta \le \hat\theta + z_{\alpha/2}\, s_n\right)
= P\left(-z_{\alpha/2} \le \frac{\hat\theta - \theta}{s_n} \le z_{\alpha/2}\right)
= P\left(-z_{\alpha/2} \le N\!\left(\frac{b_n}{s_n},\, 1\right) \le z_{\alpha/2}\right).$$

In a well-behaved parametric model, $s_n$ is of size $n^{-1/2}$ and $b_n$ is of size $n^{-1}$. Hence, $b_n/s_n \to 0$, so the last displayed probability statement becomes $P(-z_{\alpha/2} \le N(0, 1) \le z_{\alpha/2}) = 1 - \alpha$. What makes parametric confidence intervals have the right coverage is the fact that $b_n/s_n \to 0$.

The situation is more complicated for kernel methods. Consider estimating a density f(x) at a single point x with a kernel density estimator. Since $\hat f(x)$ is a sum of iid random variables, the central limit theorem implies that

$$\hat f(x) \approx N\left(f(x) + b_n(x),\ \frac{c_2 f(x)}{n h}\right) \qquad (21.45)$$

where

$$b_n(x) = \frac{1}{2} h^2 f''(x)\, c_1 \qquad (21.46)$$

is the bias, $c_1 = \int x^2 K(x)\, dx$ and $c_2 = \int K^2(x)\, dx$. The estimated standard error is

$$s_n(x) = \left(\frac{c_2 \hat f(x)}{n h}\right)^{1/2}. \qquad (21.47)$$

Suppose we use the usual interval $\hat f(x) \pm z_{\alpha/2}\, s_n(x)$. Arguing as above, the coverage is approximately

$$P\left(-z_{\alpha/2} \le N\!\left(\frac{b_n(x)}{s_n(x)},\, 1\right) \le z_{\alpha/2}\right).$$

The optimal bandwidth is of the form $h = c\, n^{-1/5}$ for some constant c. If we plug $h = c\, n^{-1/5}$ into (21.46) and (21.47), we see that $b_n(x)/s_n(x)$ does not tend to 0. Thus, the confidence interval will have coverage less than $1 - \alpha$.

21.6 Bibliographic Remarks

Two very good books on curve estimation are Scott (1992) and

Silverman (1986). I drew heavily on those books for this Chapter.

21.7 Exercises

1. Let $X_1, \ldots, X_n \sim f$ and let $\hat f_n$ be the kernel density estimator using the boxcar kernel:

$$K(x) = \begin{cases}
1 & -\frac{1}{2} < x < \frac{1}{2}\\
0 & \text{otherwise.}
\end{cases}$$

(a) Show that

$$E(\hat f(x)) = \frac{1}{h} \int_{x - (h/2)}^{x + (h/2)} f(y)\, dy$$

and

$$V(\hat f(x)) = \frac{1}{n h^2}\left[\int_{x - (h/2)}^{x + (h/2)} f(y)\, dy - \left(\int_{x - (h/2)}^{x + (h/2)} f(y)\, dy\right)^2\right].$$


(b) Show that if $h \to 0$ and $n h \to \infty$ as $n \to \infty$, then $\hat f_n(x) \xrightarrow{P} f(x)$.

2. Get the data on fragments of glass collected in forensic work. You need to load the MASS library for R and then:

library(MASS)      # the MASS package ships with R
data(fgl)          # forensic glass data
x <- fgl[[1]]      # first variable: refractive index
help(fgl)          # description of the data set

The data are also on my website. Estimate the density of the first variable (refractive index) using a histogram and using a kernel density estimator. Use cross-validation to choose the amount of smoothing. Experiment with different binwidths and bandwidths. Comment on the similarities and differences. Construct 95 per cent confidence bands for your estimators.

3. Consider the data from question 2. Let Y be refractive index and let x be aluminium content (the fourth variable). Do a nonparametric regression to fit the model $Y = f(x) + \epsilon$. Use cross-validation to estimate the bandwidth. Construct 95 per cent confidence bands for your estimate.

4. Prove Lemma 21.1.

5. Prove Theorem 21.3.

6. Prove Theorem 21.7.

7. Prove Theorem 21.14.


8. Consider regression data $(x_1, Y_1), \ldots, (x_n, Y_n)$. Suppose that $0 \le x_i \le 1$ for all i. Define bins $B_j$ as in equation (21.7). For $x \in B_j$ define

$$\hat r_n(x) = \overline Y_j$$

where $\overline Y_j$ is the mean of all the $Y_i$'s corresponding to those $x_i$'s in $B_j$. Find the approximate risk of this estimator. From this expression for the risk, find the optimal bandwidth. At what rate does the risk go to zero?

9. Show that, with suitable smoothness assumptions on r(x), $\hat\sigma^2$ in equation (21.41) is a consistent estimator of $\sigma^2$.


22

Smoothing Using Orthogonal Functions

In this Chapter we study a different approach to nonparametric curve estimation based on orthogonal functions. We begin with a brief introduction to the theory of orthogonal functions. Then we turn to density estimation and regression.

22.1 Orthogonal Functions and L2 Spaces

Let $v = (v_1, v_2, v_3)$ denote a three-dimensional vector, that is, a list of three real numbers. Let $\mathcal{V}$ denote the set of all such vectors. If a is a scalar (a number) and v is a vector, we define $a v = (a v_1, a v_2, a v_3)$. The sum of vectors v and w is defined by $v + w = (v_1 + w_1, v_2 + w_2, v_3 + w_3)$. The inner product between two vectors v and w is defined by $\langle v, w \rangle = \sum_{i=1}^3 v_i w_i$. The norm (or length) of a vector v is defined by

$$\|v\| = \sqrt{\langle v, v \rangle} = \sqrt{\sum_{i=1}^3 v_i^2}. \qquad (22.1)$$


Two vectors are orthogonal (or perpendicular) if $\langle v, w \rangle = 0$. A set of vectors is orthogonal if each pair in the set is orthogonal. A vector is normal if $\|v\| = 1$.

Let $\phi_1 = (1, 0, 0)$, $\phi_2 = (0, 1, 0)$, $\phi_3 = (0, 0, 1)$. These vectors are said to be an orthonormal basis for $\mathcal{V}$ since they have the following properties: (i) they are orthogonal; (ii) they are normal; (iii) they form a basis for $\mathcal{V}$, which means that if $v \in \mathcal{V}$ then v can be written as a linear combination of $\phi_1$, $\phi_2$, $\phi_3$:

$$v = \sum_{j=1}^3 \beta_j \phi_j \quad \text{where } \beta_j = \langle \phi_j, v \rangle. \qquad (22.2)$$

For example, if $v = (12, 3, 4)$ then $v = 12\phi_1 + 3\phi_2 + 4\phi_3$. There are other orthonormal bases for $\mathcal{V}$, for example,

$$\psi_1 = \left(\frac{1}{\sqrt 3}, \frac{1}{\sqrt 3}, \frac{1}{\sqrt 3}\right), \quad
\psi_2 = \left(\frac{1}{\sqrt 2}, -\frac{1}{\sqrt 2}, 0\right), \quad
\psi_3 = \left(\frac{1}{\sqrt 6}, \frac{1}{\sqrt 6}, -\frac{2}{\sqrt 6}\right).$$

You can check that these three vectors also form an orthonormal basis for $\mathcal{V}$. Again, if v is any vector then we can write

$$v = \sum_{j=1}^3 \beta_j \psi_j \quad \text{where } \beta_j = \langle \psi_j, v \rangle.$$

For example, if $v = (12, 3, 4)$ then

$$v = 10.97\, \psi_1 + 6.36\, \psi_2 + 2.86\, \psi_3.$$

Now we make the leap from vectors to functions. Basically, we just replace vectors with functions and sums with integrals. Let $L_2(a, b)$ denote all functions defined on the interval [a, b] such that $\int_a^b f(x)^2\, dx < \infty$:

$$L_2(a, b) = \left\{f : [a, b] \to \mathbb{R},\ \int_a^b f(x)^2\, dx < \infty\right\}. \qquad (22.3)$$


We sometimes write $L_2$ instead of $L_2(a, b)$. The inner product between two functions $f, g \in L_2$ is defined by $\int f(x) g(x)\, dx$. The norm of f is

$$\|f\| = \sqrt{\int f(x)^2\, dx}. \qquad (22.4)$$

Two functions are orthogonal if $\int f(x) g(x)\, dx = 0$. A function is normal if $\|f\| = 1$.

A sequence of functions $\phi_1, \phi_2, \phi_3, \ldots$ is orthonormal if $\int \phi_j^2(x)\, dx = 1$ for each j and $\int \phi_i(x) \phi_j(x)\, dx = 0$ for $i \ne j$. An orthonormal sequence is complete if the only function that is orthogonal to each $\phi_j$ is the zero function. In this case, the functions $\phi_1, \phi_2, \phi_3, \ldots$ form a basis, meaning that if $f \in L_2$ then f can be written as

$$f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x), \quad \text{where } \beta_j = \int_a^b f(x) \phi_j(x)\, dx. \qquad (22.5)$$

(The equality in the displayed equation means that $\int (f(x) - f_n(x))^2\, dx \to 0$, where $f_n(x) = \sum_{j=1}^n \beta_j \phi_j(x)$.) Parseval's relation says that

$$\|f\|^2 \equiv \int f^2(x)\, dx = \sum_{j=1}^\infty \beta_j^2 \equiv \|\beta\|^2 \qquad (22.6)$$

where $\beta = (\beta_1, \beta_2, \ldots)$.

Example 22.1 An example of an orthonormal basis for $L_2(0, 1)$ is the cosine basis defined as follows. Let $\phi_1(x) = 1$ and for $j \ge 2$ define

$$\phi_j(x) = \sqrt{2}\, \cos((j - 1)\pi x). \qquad (22.7)$$

The first six functions are plotted in Figure 22.1.

FIGURE 22.1. The first six functions in the cosine basis.

Example 22.2 Let

$$f(x) = \sqrt{x(1 - x)}\, \sin\left(\frac{2.1\pi}{x + 0.05}\right),$$

which is called the "doppler function." Figure 22.2 shows f (top left) and its approximation

$$f_J(x) = \sum_{j=1}^J \beta_j \phi_j(x)$$

with J equal to 5 (top right), 20 (bottom left) and 200 (bottom right). As J increases, we see that $f_J(x)$ gets closer to f(x). The coefficients $\beta_j = \int_0^1 f(x) \phi_j(x)\, dx$ were computed numerically.
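The cosine-basis coefficients of the doppler function are easy to compute numerically. The following R sketch (an illustration, not the book's code) approximates each $\beta_j$ by a Riemann sum on a fine grid and rebuilds $f_J$:

phi <- function(j, x) if (j == 1) rep(1, length(x)) else sqrt(2) * cos((j - 1) * pi * x)
doppler <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
xx <- seq(0, 1, length.out = 2000)                   # fine grid for numerical integration
J <- 20
beta <- sapply(1:J, function(j) mean(doppler(xx) * phi(j, xx)))   # beta_j ~ int f phi_j
grid <- seq(0, 1, length.out = 500)
fJ <- as.vector(sapply(1:J, function(j) phi(j, grid)) %*% beta)   # f_J = sum_j beta_j phi_j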

Example 22.3 The Legendre polynomials on [−1, 1] are defined by

$$P_j(x) = \frac{1}{2^j j!} \frac{d^j}{dx^j} (x^2 - 1)^j, \quad j = 0, 1, 2, \ldots \qquad (22.8)$$

It can be shown that these functions are complete and orthogonal and that

$$\int_{-1}^1 P_j^2(x)\, dx = \frac{2}{2j + 1}. \qquad (22.9)$$

It follows that the functions $\phi_j(x) = \sqrt{(2j + 1)/2}\, P_j(x)$, $j = 0, 1, \ldots$ form an orthonormal basis for $L_2(-1, 1)$. The first few Legendre polynomials are

$$P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \frac{1}{2}\left(3x^2 - 1\right), \quad P_3(x) = \frac{1}{2}\left(5x^3 - 3x\right).$$

These polynomials may be constructed explicitly using the following recursive relation:

$$P_{j+1}(x) = \frac{(2j + 1)\, x\, P_j(x) - j\, P_{j-1}(x)}{j + 1}. \qquad (22.10)$$
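The recursion (22.10) is convenient for computing the polynomials numerically; here is a minimal R sketch (not from the text):

legendre <- function(j, x) {
  # Evaluate the Legendre polynomial P_j at the points x using (22.10).
  if (j == 0) return(rep(1, length(x)))
  p.prev <- rep(1, length(x)); p.cur <- x
  if (j == 1) return(p.cur)
  for (k in 1:(j - 1)) {
    p.next <- ((2 * k + 1) * x * p.cur - k * p.prev) / (k + 1)
    p.prev <- p.cur; p.cur <- p.next
  }
  p.cur
}
# e.g. legendre(2, 0.5) returns (3 * 0.5^2 - 1) / 2 = -0.125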


FIGURE 22.2. Approximating the doppler function $f(x) = \sqrt{x(1-x)}\,\sin\left(\frac{2.1\pi}{x + 0.05}\right)$ with its expansion in the cosine basis. The function f (top left) and its approximation $f_J(x) = \sum_{j=1}^J \beta_j \phi_j(x)$ with J equal to 5 (top right), 20 (bottom left) and 200 (bottom right). The coefficients $\beta_j = \int_0^1 f(x)\phi_j(x)\, dx$ were computed numerically.


The coefficients $\beta_1, \beta_2, \ldots$ are related to the smoothness of the function f. To see why, note that if f is smooth, then its derivatives will be finite. Thus we expect that, for some k, $\int_0^1 (f^{(k)}(x))^2\, dx < \infty$, where $f^{(k)}$ is the kth derivative of f. Now consider the cosine basis (22.7) and let $f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x)$. Then,

$$\int_0^1 (f^{(k)}(x))^2\, dx = 2 \sum_{j=1}^\infty \beta_j^2 \left(\pi (j - 1)\right)^{2k}.$$

The only way that $\sum_{j=1}^\infty \beta_j^2 (\pi(j-1))^{2k}$ can be finite is if the $\beta_j$'s get small when j gets large. To summarize:

If the function f is smooth, then the coefficients $\beta_j$ will be small when j is large.

For the rest of this chapter, assume we are using the cosine basis unless otherwise specified.

22.2 Density Estimation

Let $X_1, \ldots, X_n$ be iid observations from a distribution on [0, 1] with density f. Assuming $f \in L_2$, we can write

$$f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x)$$

where $\phi_1, \phi_2, \ldots$ is an orthonormal basis. Define

$$\hat\beta_j = \frac{1}{n} \sum_{i=1}^n \phi_j(X_i). \qquad (22.11)$$

Theorem 22.4 The mean and variance of $\hat\beta_j$ are

$$E(\hat\beta_j) = \beta_j, \qquad (22.12)$$
$$V(\hat\beta_j) = \frac{\sigma_j^2}{n} \qquad (22.13)$$

where

$$\sigma_j^2 = V(\phi_j(X_i)) = \int \left(\phi_j(x) - \beta_j\right)^2 f(x)\, dx. \qquad (22.14)$$

Proof. We have

$$E(\hat\beta_j) = \frac{1}{n} \sum_{i=1}^n E(\phi_j(X_i)) = E(\phi_j(X_1)) = \int \phi_j(x) f(x)\, dx = \beta_j.$$

The calculation for the variance is similar. $\blacksquare$
The calculation for the variance is similar. �

Hence, b�j is an unbiased estimate of �j. It is tempting to

estimate f byP1

j=1b�j�j(x) but this turns out to have a very

high variance. Instead, consider the estimator

bf(x) = JXj=1

b�j�j(x): (22.15)

The number of terms J is a smoothing parameter. Increasing J

will decrease bias while increasing variance. For technical rea-

sons, we restrict J to lie in the range

1 � J � p

where p = p(n) =pn. To emphasize the dependence of the risk

function on J , we write the risk function as R(J).

Theorem 22.5 The risk of bf is given by

R(J) =

JXj=1

�2j

n+

1Xj=J+1

�2j : (22.16)

In kernel estimation, we used cross-validation to estimate the

risk. In the orthogonal function approach, we instead use the

risk estimator

bR(J) = JXj=1

b�2jn

+

pXj=J+1

�b�2j �

b�2jn

�+

(22.17)


where $a_+ = \max\{a, 0\}$ and

$$\hat\sigma_j^2 = \frac{1}{n - 1} \sum_{i=1}^n \left(\phi_j(X_i) - \hat\beta_j\right)^2. \qquad (22.18)$$

To motivate this estimator, note that $\hat\sigma_j^2$ is an unbiased estimate of $\sigma_j^2$ and $\hat\beta_j^2 - \hat\sigma_j^2/n$ is an unbiased estimator of $\beta_j^2$. We take the positive part of the latter term since we know that $\beta_j^2$ cannot be negative. We now choose $1 \le \hat J \le p$ to minimize $\hat R(J)$. Here is a summary.

Summary of Orthogonal Function Density Estimation

1. Let

$$\hat\beta_j = \frac{1}{n} \sum_{i=1}^n \phi_j(X_i).$$

2. Choose $\hat J$ to minimize $\hat R(J)$ over $1 \le J \le p = \sqrt{n}$, where

$$\hat R(J) = \sum_{j=1}^J \frac{\hat\sigma_j^2}{n} + \sum_{j=J+1}^p \left(\hat\beta_j^2 - \frac{\hat\sigma_j^2}{n}\right)_+$$

and

$$\hat\sigma_j^2 = \frac{1}{n - 1} \sum_{i=1}^n \left(\phi_j(X_i) - \hat\beta_j\right)^2.$$

3. Let

$$\hat f(x) = \sum_{j=1}^{\hat J} \hat\beta_j \phi_j(x).$$
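As a concrete illustration of steps 1-3 (a sketch, not the book's code), here is an R implementation for data on [0, 1] using the cosine basis:

ortho.density <- function(x, grid = seq(0, 1, length.out = 200)) {
  phi <- function(j, u) if (j == 1) rep(1, length(u)) else sqrt(2) * cos((j - 1) * pi * u)
  n <- length(x)
  p <- floor(sqrt(n))
  B <- sapply(1:p, function(j) phi(j, x))            # B[i, j] = phi_j(X_i)
  beta.hat <- colMeans(B)                            # step 1
  sigma2.hat <- apply(B, 2, var)
  R.hat <- sapply(1:p, function(J)                   # step 2: estimated risk
    sum(sigma2.hat[1:J]) / n +
      (if (J < p) sum(pmax(beta.hat[(J + 1):p]^2 - sigma2.hat[(J + 1):p] / n, 0)) else 0))
  J.hat <- which.min(R.hat)
  Bg <- matrix(sapply(1:J.hat, function(j) phi(j, grid)), nrow = length(grid))
  f.hat <- as.vector(Bg %*% beta.hat[1:J.hat])       # step 3
  list(J = J.hat, grid = grid, f.hat = f.hat)
}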

The estimator $\hat f_n$ can be negative. If we are interested in exploring the shape of f, this is not a problem. However, if we need our estimate to be a probability density function, we can truncate the estimate and then normalize it. That is, we take $\hat f^*(x) = \max\{\hat f_n(x), 0\} \big/ \int_0^1 \max\{\hat f_n(u), 0\}\, du$.

Now let us construct a confidence band for f. Suppose we estimate f using J orthogonal functions. We are essentially estimating $\widetilde f(x) = \sum_{j=1}^J \beta_j \phi_j(x)$, not the true density $f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x)$. Thus, the confidence band should be regarded as a band for $\widetilde f(x)$.

Theorem 22.6 An approximate $1 - \alpha$ confidence band for $\widetilde f$ is $(\ell(x), u(x))$, where

$$\ell(x) = \hat f_n(x) - c, \qquad u(x) = \hat f_n(x) + c \qquad (22.19)$$

where

$$c = \frac{J K^2}{\sqrt{n}} \sqrt{1 + \frac{\sqrt{2}\, z_\alpha}{\sqrt{J}}} \qquad (22.20)$$

and

$$K = \max_{1 \le j \le J}\, \max_x |\phi_j(x)|.$$

For the cosine basis, $K = \sqrt{2}$.

Example 22.7 Let

$$f(x) = \frac{1}{2}\,\phi(x; 0, 1) + \frac{1}{10} \sum_{j=1}^5 \phi(x; \mu_j, 0.1)$$

where $\phi(x; \mu, \sigma)$ denotes a Normal density with mean $\mu$ and standard deviation $\sigma$, and $(\mu_1, \ldots, \mu_5) = (-1, -1/2, 0, 1/2, 1)$. Marron and Wand (1990) call this "the claw" although I prefer "the Bart Simpson." Figure 22.3 shows the true density as well as the estimated density based on n = 5,000 observations and a 95 per cent confidence band. The density has been rescaled to have most of its mass between 0 and 1 using the transformation y = (x + 3)/6.


FIGURE 22.3. The top plot is the true density for the Bart Simpson distribution (rescaled to have most of its mass between 0 and 1). The bottom plot is the orthogonal function density estimate and 95 per cent confidence band.


22.3 Regression

Consider the regression model

$$Y_i = r(x_i) + \epsilon_i, \quad i = 1, \ldots, n \qquad (22.21)$$

where the $\epsilon_i$ are independent with mean 0 and variance $\sigma^2$. We will initially focus on the special case where $x_i = i/n$. We assume that $r \in L_2(0, 1)$ and hence we can write

$$r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x) \quad \text{where } \beta_j = \int_0^1 r(x) \phi_j(x)\, dx \qquad (22.22)$$

and $\phi_1, \phi_2, \ldots$ is an orthonormal basis for [0, 1]. Define

$$\hat\beta_j = \frac{1}{n} \sum_{i=1}^n Y_i\, \phi_j(x_i), \quad j = 1, 2, \ldots \qquad (22.23)$$

Since b�j is an average, the central limit theorem tells us that b�jwill be approximately normally distributed.

Theorem 22.8

\hat\beta_j \approx N\left( \beta_j, \frac{\sigma^2}{n} \right).   (22.24)

Proof. The mean of \hat\beta_j is

E(\hat\beta_j) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) \phi_j(x_i) = \frac{1}{n} \sum_{i=1}^{n} r(x_i) \phi_j(x_i) \approx \int r(x) \phi_j(x)\, dx = \beta_j

where the approximate equality follows from the definition of a Riemann integral: \sum_i \Delta_n h(x_i) \to \int_0^1 h(x)\, dx where \Delta_n = 1/n. The variance is

V(\hat\beta_j) = \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i) \phi_j^2(x_i) = \frac{\sigma^2}{n^2} \sum_{i=1}^{n} \phi_j^2(x_i) = \frac{\sigma^2}{n} \cdot \frac{1}{n} \sum_{i=1}^{n} \phi_j^2(x_i) \approx \frac{\sigma^2}{n} \int \phi_j^2(x)\, dx = \frac{\sigma^2}{n}

since \int \phi_j^2(x)\, dx = 1. \blacksquare

As we did for density estimation, we will estimate r by

\hat r(x) = \sum_{j=1}^{J} \hat\beta_j \phi_j(x).

Let

R(J) = E \int \left( r(x) - \hat r(x) \right)^2 dx

be the risk of the estimator.

Theorem 22.9 The risk R(J) of the estimator \hat r_n(x) = \sum_{j=1}^{J} \hat\beta_j \phi_j(x) is

R(J) = \frac{J \sigma^2}{n} + \sum_{j=J+1}^{\infty} \beta_j^2.   (22.25)

To estimate \sigma^2 = V(\epsilon_i) we use

\hat\sigma^2 = \frac{n}{k} \sum_{j=n-k+1}^{n} \hat\beta_j^2   (22.26)

where k = n/4. To motivate this estimator, recall that if r is smooth then \beta_j \approx 0 for large j. So, for large j, \hat\beta_j \approx N(0, \sigma^2/n). Thus, \hat\beta_j \approx \sigma \epsilon_j/\sqrt{n} for large j, where \epsilon_j \sim N(0, 1). Therefore,

\hat\sigma^2 = \frac{n}{k} \sum_{j=n-k+1}^{n} \hat\beta_j^2 \approx \frac{n}{k} \sum_{j=n-k+1}^{n} \left( \frac{\sigma}{\sqrt{n}}\, \epsilon_j \right)^2 = \frac{\sigma^2}{k} \sum_{j=n-k+1}^{n} \epsilon_j^2 = \frac{\sigma^2}{k}\, \chi^2_k

since a sum of squares of k standard Normals has a \chi^2_k distribution. Now E(\chi^2_k) = k and hence E(\hat\sigma^2) \approx \sigma^2. Also, V(\chi^2_k) = 2k and hence V(\hat\sigma^2) \approx (\sigma^4/k^2)(2k) = 2\sigma^4/k \to 0 as n \to \infty. Thus we expect \hat\sigma^2 to be a consistent estimator of \sigma^2. There is nothing special about the choice k = n/4. Any k that increases with n at an appropriate rate will suffice.

We estimate the risk with

\hat R(J) = \frac{J \hat\sigma^2}{n} + \sum_{j=J+1}^{n} \left( \hat\beta_j^2 - \frac{\hat\sigma^2}{n} \right)_+.   (22.27)

We are now ready to give a complete description of the method, which Beran (2000) calls REACT (Risk Estimation and Adaptation by Coordinate Transformation).

Orthogonal Series Regression Estimator

1. Let

   \hat\beta_j = \frac{1}{n} \sum_{i=1}^{n} Y_i\, \phi_j(x_i), \quad j = 1, \ldots, n.

2. Let

   \hat\sigma^2 = \frac{n}{k} \sum_{j=n-k+1}^{n} \hat\beta_j^2   (22.28)

   where k \approx n/4.

3. For 1 \le J \le n, compute the risk estimate

   \hat R(J) = \frac{J \hat\sigma^2}{n} + \sum_{j=J+1}^{n} \left( \hat\beta_j^2 - \frac{\hat\sigma^2}{n} \right)_+.

4. Choose \hat J \in \{1, \ldots, n\} to minimize \hat R(J).

5. Let

   \hat r(x) = \sum_{j=1}^{\hat J} \hat\beta_j \phi_j(x).
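The following Python sketch (not from the text) mirrors these REACT steps for equally spaced x_i = i/n with the cosine basis; the function name and the noisy test signal at the end are illustrative choices, not part of the method as stated.

    import numpy as np

    def react_fit(y):
        # Orthogonal series regression with the cosine basis phi_1 = 1,
        # phi_j(x) = sqrt(2) cos((j-1) pi x), design points x_i = i/n.
        n = len(y)
        x = np.arange(1, n + 1) / n
        Phi = np.column_stack([np.ones(n)] +
                              [np.sqrt(2) * np.cos(j * np.pi * x) for j in range(1, n)])
        # Step 1: beta_hat_j = (1/n) sum_i Y_i phi_j(x_i).
        beta = Phi.T @ y / n
        # Step 2: sigma^2 estimated from the last k ~ n/4 coefficients.
        k = n // 4
        sigma2 = (n / k) * np.sum(beta[-k:] ** 2)
        # Steps 3-4: estimated risk for each J; pick the minimizer.
        risks = [J * sigma2 / n +
                 np.sum(np.maximum(beta[J:] ** 2 - sigma2 / n, 0.0))
                 for J in range(1, n + 1)]
        J_hat = int(np.argmin(risks)) + 1
        # Step 5: the fitted function evaluated at the design points.
        r_hat = Phi[:, :J_hat] @ beta[:J_hat]
        return J_hat, r_hat

    # Illustration on a noisy doppler-like signal (placeholder data):
    n = 512
    x = np.arange(1, n + 1) / n
    y = np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05)) + 0.1 * np.random.randn(n)
    J_hat, r_hat = react_fit(y)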

Example 22.10 Figure 22.4 shows the doppler function r and n = 2,048 observations generated from the model

Y_i = r(x_i) + \epsilon_i

where x_i = i/n and \epsilon_i \sim N(0, (0.1)^2). The figure shows the true function, the data, the estimated function and the estimated risk. The estimate was based on \hat J = 234 terms.

Finally, we turn to confidence bands. As before, these bands are not really for the true function r(x) but rather for the smoothed version of the function \tilde r(x) = \sum_{j=1}^{\hat J} \beta_j \phi_j(x).

FIGURE 22.4. The doppler test function. (The panels show the true function, the data, the estimate, and the estimated risk plotted against J.)

Theorem 22.11 Suppose the estimate \hat r is based on J terms and \hat\sigma is defined as in equation (22.28). Assume that J < n - k + 1. An approximate 1 - \alpha confidence band for \tilde r is (\ell, u) where

\ell(x) = \hat r_n(x) - K \sqrt{J} \sqrt{ \frac{z_\alpha \hat\tau}{\sqrt{n}} + \frac{J \hat\sigma^2}{n} }   (22.29)

u(x) = \hat r_n(x) + K \sqrt{J} \sqrt{ \frac{z_\alpha \hat\tau}{\sqrt{n}} + \frac{J \hat\sigma^2}{n} }   (22.30)

where

K = \max_{1 \le j \le J} \max_x |\phi_j(x)|

and

\hat\tau^2 = \frac{2 J \hat\sigma^4}{n} \left( 1 + \frac{J}{k} \right)   (22.31)

and k = n/4 is from equation (22.28). In the cosine basis, we may use K = \sqrt{2}.

Example 22.12 Figure 22.5 shows the confidence envelope for the doppler signal. The first plot is based on J = 234 (the value of J that minimizes the estimated risk). The second is based on J = 45 \approx \sqrt{n}. Larger J yields a higher resolution estimator at the cost of wider confidence bands. Smaller J yields a lower resolution estimator but has tighter confidence bands. \blacksquare

So far, we have assumed that the x_i's are of the form \{1/n, 2/n, \ldots, 1\}. If the x_i's lie in an interval [a, b], then we can rescale them so that they are in the interval [0, 1]. If the x_i's are not equally spaced, the methods we have discussed still apply so long as the x_i's "fill out" the interval [0, 1] in such a way as to not be too clumped together. If we want to treat the x_i's as random instead of fixed, then the method needs significant modifications which we shall not deal with here.

FIGURE 22.5. The confidence envelope for the doppler test function using n = 2,048 observations. The top plot shows the estimate and envelope using J = 234 terms. The bottom plot shows the estimate and envelope using J = 45 terms.

22.4 Wavelets

Suppose there is a sharp jump in a regression function f at some point x but that f is otherwise very smooth. Such a function f is said to be spatially inhomogeneous. See Figure 22.6 for an example.

It is hard to estimate f using the methods we have discussed so far. If we use a cosine basis and only keep low order terms, we will miss the peak; if we allow higher order terms, we will find the peak but we will make the rest of the curve very wiggly. Similar comments apply to kernel regression. If we use a large bandwidth, then we will smooth out the peak; if we use a small bandwidth, then we will find the peak but we will make the rest of the curve very wiggly.

One way to estimate inhomogeneous functions is to use a more carefully chosen basis that allows us to place a "blip" in some small region without adding wiggles elsewhere. In this section, we describe a special class of bases called wavelets that are aimed at fixing this problem. Statistical inference using wavelets is a large and active area. We will just discuss a few of the main ideas to get a flavor of this approach.

We start with a particular wavelet called the Haar wavelet. The Haar father wavelet or Haar scaling function is defined by

\phi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1 \\ 0 & \text{otherwise.} \end{cases}   (22.32)

The mother Haar wavelet is defined by

\psi(x) = \begin{cases} -1 & \text{if } 0 \le x \le \frac{1}{2} \\ 1 & \text{if } \frac{1}{2} < x \le 1. \end{cases}   (22.33)

For any integers j and k define

\phi_{j,k}(x) = 2^{j/2} \phi(2^j x - k) \quad \text{and} \quad \psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k).   (22.34)

The function \psi_{j,k} has the same shape as \psi but it has been rescaled by a factor of 2^{j/2} and shifted by a factor of k.
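As a quick concreteness check, here is a small Python sketch (not from the text) of the Haar functions just defined; it simply evaluates \phi, \psi and \psi_{j,k} on a grid.

    import numpy as np

    def father(x):
        # Haar father wavelet (scaling function): 1 on [0, 1), 0 elsewhere.
        x = np.asarray(x, dtype=float)
        return ((0.0 <= x) & (x < 1.0)).astype(float)

    def mother(x):
        # Haar mother wavelet: -1 on [0, 1/2], +1 on (1/2, 1].
        x = np.asarray(x, dtype=float)
        return -1.0 * ((0.0 <= x) & (x <= 0.5)) + 1.0 * ((0.5 < x) & (x <= 1.0))

    def psi_jk(x, j, k):
        # Rescaled and shifted mother wavelet: psi_{j,k}(x) = 2^{j/2} psi(2^j x - k).
        return 2.0 ** (j / 2.0) * mother(2.0 ** j * np.asarray(x) - k)

    # Example: evaluate psi_{2,2} on a grid, as in Figure 22.7.
    grid = np.linspace(0, 1, 1000)
    values = psi_jk(grid, j=2, k=2)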

FIGURE 22.6. An inhomogeneous function. The function is smooth except for a sharp peak in one place.

See Figure 22.7 for some examples of Haar wavelets. Notice that for large j, \psi_{j,k} is a very localized function. This makes it possible to add a blip to a function in one place without adding wiggles elsewhere. Increasing j is like looking in a microscope at increasing degrees of resolution. In technical terms, we say that wavelets provide a multiresolution analysis of L_2(0, 1).

Let

W_j = \{ \psi_{j,k},\ k = 0, 1, \ldots, 2^j - 1 \}

be the set of rescaled and shifted mother wavelets at resolution j.

Theorem 22.13 The set of functions

\{ \phi, W_0, W_1, W_2, \ldots \}

is an orthonormal basis for L_2(0, 1).

It follows from this theorem that we can expand any function f \in L_2(0, 1) in this basis. Because each W_j is itself a set of functions, we write the expansion as a double sum:

f(x) = \alpha \phi(x) + \sum_{j=0}^{\infty} \sum_{k=0}^{2^j - 1} \beta_{j,k} \psi_{j,k}(x)   (22.35)

where

\alpha = \int_0^1 f(x) \phi(x)\, dx, \qquad \beta_{j,k} = \int_0^1 f(x) \psi_{j,k}(x)\, dx.

We call \alpha the scaling coefficient and the \beta_{j,k}'s are called the detail coefficients. We call the finite sum

\tilde f(x) = \alpha \phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \beta_{j,k} \psi_{j,k}(x)   (22.36)

the resolution J approximation to f. The total number of terms in this sum is

1 + \sum_{j=0}^{J-1} 2^j = 1 + 2^J - 1 = 2^J.

FIGURE 22.7. Some Haar wavelets. Top left: the father wavelet \phi(x); top right: the mother wavelet \psi(x); bottom left: \psi_{2,2}(x); bottom right: \psi_{4,10}(x).

Example 22.14 Figure 22.8 shows the doppler signal and its resolution J approximation for J = 3, 5 and J = 8. \blacksquare

Haar wavelets are localized, meaning that they are zero outside an interval. But they are not smooth. This raises the question of whether there exist smooth, localized wavelets that form an orthonormal basis. In 1988, Ingrid Daubechies showed that such wavelets do exist. These smooth wavelets are difficult to describe. They can be constructed numerically but there is no closed-form formula for the smoother wavelets. To keep things simple, we will continue to use Haar wavelets. For your interest, Figure 22.9 shows an example of a smooth wavelet called the "symmlet 8."

We can now use wavelets to do density estimation and regression. We shall only discuss the regression problem Y_i = r(x_i) + \sigma \epsilon_i where \epsilon_i \sim N(0, 1) and x_i = i/n. To simplify the discussion we assume that n = 2^J for some J.

There is one major difference between estimation using wavelets instead of a cosine (or polynomial) basis. With the cosine basis, we used all the terms 1 \le j \le J for some J. With wavelets, we use a method called thresholding: we keep a term in the function approximation if its coefficient is large; otherwise, we don't use that term. There are many versions of thresholding. The simplest is called hard, universal thresholding.

FIGURE 22.8. The doppler signal and its reconstruction \tilde f(x) = \alpha\phi(x) + \sum_{j=0}^{J-1}\sum_k \beta_{j,k}\psi_{j,k}(x) based on J = 3, J = 5 and J = 8.

FIGURE 22.9. The symmlet 8 wavelet. The top plot is the mother and the bottom plot is the father.

Haar Wavelet Regression

1. Let J = \log_2(n) and define

   \hat\alpha = \frac{1}{n} \sum_i \phi(x_i) Y_i \quad \text{and} \quad D_{j,k} = \frac{1}{n} \sum_i \psi_{j,k}(x_i) Y_i   (22.37)

   for 0 \le j \le J - 1.

2. Estimate \sigma by

   \hat\sigma = \sqrt{n} \times \frac{\text{median}\left( |D_{J-1,k}| :\ k = 0, \ldots, 2^{J-1} - 1 \right)}{0.6745}.   (22.38)

3. Apply universal thresholding:

   \hat\beta_{j,k} = \begin{cases} D_{j,k} & \text{if } |D_{j,k}| > \hat\sigma \sqrt{\frac{2 \log n}{n}} \\ 0 & \text{otherwise.} \end{cases}   (22.39)

4. Set

   \hat r(x) = \hat\alpha \phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \hat\beta_{j,k} \psi_{j,k}(x).

In practice, we do not compute \hat\alpha and D_{j,k} using (22.37). Instead, we use the discrete wavelet transform (DWT), which is very fast. For Haar wavelets, the DWT works as follows.

DWT for Haar Wavelets

Let y be the vector of Y_i's (length n) and let J = \log_2(n). Create a list D with elements

D[[0]], \ldots, D[[J-1]].

Set:

    temp <- y/sqrt(n)

Then do:

    for(j in (J-1):0){
      m <- 2^j
      I <- (1:m)
      D[[j]] <- (temp[2*I] - temp[(2*I)-1])/sqrt(2)
      temp   <- (temp[2*I] + temp[(2*I)-1])/sqrt(2)
    }

Remark 22.15 If you use a programming language that does not allow a 0 index for a list, then index the list as

D[[1]], \ldots, D[[J]]

and replace the line that assigns D[[j]] with

    D[[j+1]] <- (temp[2*I] - temp[(2*I)-1])/sqrt(2)

The estimate for \sigma probably looks strange. It is similar to the estimate we used for the cosine basis but it is designed to be insensitive to sharp peaks in the function.
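For concreteness, here is a minimal Python sketch (not from the text) of the Haar DWT recursion above together with the universal threshold; the 0-based indexing and the function names are implementation choices.

    import numpy as np

    def haar_dwt(y):
        # Haar discrete wavelet transform of y (length n = 2^J).
        # Returns the scaling coefficient and detail coefficients D[j], j = 0,...,J-1.
        n = len(y)
        J = int(np.log2(n))
        temp = y / np.sqrt(n)
        D = [None] * J
        for j in range(J - 1, -1, -1):
            evens, odds = temp[1::2], temp[0::2]   # temp[2*I], temp[(2*I)-1] in 1-based notation
            D[j] = (evens - odds) / np.sqrt(2)
            temp = (evens + odds) / np.sqrt(2)
        alpha_hat = temp[0]
        return alpha_hat, D

    def universal_threshold(D, n):
        # Hard universal thresholding of the detail coefficients (equation 22.39),
        # with sigma estimated from the finest-scale coefficients (equation 22.38).
        sigma_hat = np.sqrt(n) * np.median(np.abs(D[-1])) / 0.6745
        lam = sigma_hat * np.sqrt(2 * np.log(n) / n)
        return [np.where(np.abs(d) > lam, d, 0.0) for d in D]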

To understand the intuition behind universal thresholding, consider what happens when there is no signal, that is, when \beta_{j,k} = 0 for all j and k.

Theorem 22.16 Suppose that \beta_{j,k} = 0 for all j and k and let \hat\beta_{j,k} be the universal threshold estimator. Then

P(\hat\beta_{j,k} = 0 \text{ for all } j, k) \to 1

as n \to \infty.

Proof. To simplify the proof, assume that \sigma is known. Now D_{j,k} \approx N(0, \sigma^2/n). We will need Mill's inequality: if Z \sim N(0, 1) then P(|Z| > t) \le (c/t) e^{-t^2/2} where c = \sqrt{2/\pi} is a constant. Let \lambda = \sigma \sqrt{2 \log n / n} denote the threshold. Thus,

P(\max_{j,k} |D_{j,k}| > \lambda) \le \sum_{j,k} P(|D_{j,k}| > \lambda) = \sum_{j,k} P\left( \frac{\sqrt{n} |D_{j,k}|}{\sigma} > \frac{\sqrt{n} \lambda}{\sigma} \right) \le \sum_{j,k} \frac{c \sigma}{\lambda \sqrt{n}} \exp\left( -\frac{1}{2} \frac{n \lambda^2}{\sigma^2} \right) \le \frac{c}{\sqrt{2 \log n}} \to 0,

since there are at most n pairs (j, k). \blacksquare

Example 22.17 Consider Y_i = r(x_i) + \sigma \epsilon_i where r is the doppler signal, \sigma = 0.1 and n = 2,048. Figure 22.10 shows the data and the estimated function using universal thresholding. Of course, the estimate is not smooth since Haar wavelets are not smooth. Nonetheless, the estimate is quite accurate. \blacksquare

FIGURE 22.10. Estimate of the Doppler function using Haar wavelets and universal thresholding.

22.5 Bibliographic Remarks

A reference for orthogonal function methods is Efromovich (1999). See also Beran (2000) and Beran and Dümbgen (1998). An introduction to wavelets is given in Ogden (1997). A more advanced treatment can be found in Härdle, Kerkyacharian, Picard and Tsybakov (1998). The theory of statistical estimation using wavelets has been developed by many authors, especially David Donoho and Iain Johnstone. There is a series of papers especially worth reading: Donoho and Johnstone (1994), Donoho and Johnstone (1995a), Donoho and Johnstone (1995b), and Donoho and Johnstone (1998).

22.6 Exercises

1. Prove Theorem 22.5.

2. Prove Theorem 22.9.

3. Let

   v_1 = \left( \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}} \right), \quad v_2 = \left( \frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0 \right), \quad v_3 = \left( \frac{1}{\sqrt{6}}, \frac{1}{\sqrt{6}}, -\frac{2}{\sqrt{6}} \right).

   Show that these vectors have norm 1 and are orthogonal.

4. Prove Parseval's relation, equation (22.6).

5. Plot the first five Legendre polynomials. Verify, numerically, that they are orthonormal.

6. Expand the following functions in the cosine basis on [0, 1]. For (a) and (b), find the coefficients \beta_j analytically. For (c) and (d), find the coefficients \beta_j numerically, i.e.

   \beta_j = \int_0^1 f(x) \phi_j(x)\, dx \approx \frac{1}{N} \sum_{r=1}^{N} f\left( \frac{r}{N} \right) \phi_j\left( \frac{r}{N} \right)

   for some large integer N. Then plot the partial sum \sum_{j=1}^{n} \beta_j \phi_j(x) for increasing values of n.

   (a) f(x) = \sqrt{2} \cos(3\pi x).

   (b) f(x) = \sin(\pi x).

   (c) f(x) = \sum_{j=1}^{11} h_j K(x - t_j) where K(t) = (1 + \text{sign}(t))/2,
       (t_j) = (.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81),
       (h_j) = (4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 2.1, -4.2).

   (d) f(x) = \sqrt{x(1 - x)} \sin\left( \frac{2.1\pi}{x + 0.05} \right).

7. Consider the glass fragments data. Let Y be refractive index and let X be aluminium content (the fourth variable).

   (a) Do a nonparametric regression to fit the model Y = f(X) + \epsilon using the cosine basis method. The data are not on a regular grid; ignore this when estimating the function. (But do sort the data first.) Provide a function estimate, an estimate of the risk and a confidence band.

   (b) Use the wavelet method to estimate f.

8. Show that the Haar wavelets are orthonormal.

9. Consider again the doppler signal:

   f(x) = \sqrt{x(1 - x)} \sin\left( \frac{2.1\pi}{x + 0.05} \right).

   Let n = 1,024 and let (x_1, \ldots, x_n) = (1/n, \ldots, 1). Generate data

   Y_i = f(x_i) + \sigma \epsilon_i

   where \epsilon_i \sim N(0, 1).

   (a) Fit the curve using the cosine basis method. Plot the function estimate and confidence band for J = 10, 20, \ldots, 100.

   (b) Use Haar wavelets to fit the curve.

10. (Haar density estimation.) Let X_1, \ldots, X_n \sim f for some density f on [0, 1]. Let's consider constructing a wavelet histogram. Let \phi and \psi be the Haar father and mother wavelets. Write

    f(x) \approx \phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \beta_{j,k} \psi_{j,k}(x)

    where J \approx \log_2(n). Let

    \hat\beta_{j,k} = \frac{1}{n} \sum_{i=1}^{n} \psi_{j,k}(X_i).

    (a) Show that \hat\beta_{j,k} is an unbiased estimate of \beta_{j,k}.

    (b) Define the Haar histogram

    \hat f(x) = \phi(x) + \sum_{j=0}^{B} \sum_{k=0}^{2^j - 1} \hat\beta_{j,k} \psi_{j,k}(x)

    for 0 \le B \le J - 1.

    (c) Find an approximate expression for the MSE as a function of B.

    (d) Generate n = 1,000 observations from a Beta(15, 4) density. Estimate the density using the Haar histogram. Use leave-one-out cross-validation to choose B.

11. In this question, we will explore the motivation for equation (22.38). Let X_1, \ldots, X_n \sim N(0, \sigma^2). Let

    \hat\sigma = \sqrt{n} \times \frac{\text{median}(|X_1|, \ldots, |X_n|)}{0.6745}.

    (a) Show that E(\hat\sigma) = \sigma.

    (b) Simulate n = 100 observations from a N(0, 1) distribution. Compute \hat\sigma as well as the usual estimate of \sigma. Repeat 1,000 times and compare the MSE.

    (c) Repeat (b) but add some outliers to the data. To do this, simulate each observation from a N(0, 1) with probability .95 and from a N(0, 10) with probability .05.

12. Repeat question 6 using the Haar basis.

23 Classification

23.1 Introduction

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination or pattern recognition.

In more detail, consider iid data (X_1, Y_1), \ldots, (X_n, Y_n) where

X_i = (X_{i1}, \ldots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d

is a d-dimensional vector and Y_i takes values in some finite set \mathcal{Y}. A classification rule is a function h : \mathcal{X} \to \mathcal{Y}. When we observe a new X, we predict Y to be h(X).

Example 23.1 Here is an example with fake data. Figure 23.1 shows 100 data points. The covariate X = (X_1, X_2) is 2-dimensional and the outcome Y \in \mathcal{Y} = \{0, 1\}. The Y values are indicated on the plot with the triangles representing Y = 1 and the squares representing Y = 0. Also shown is a linear classification rule represented by the solid line. This is a rule of the form

h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}

Everything above the line is classified as a 0 and everything below the line is classified as a 1. \blacksquare

Example 23.2 The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa (Rousseauw et al 1983, Hastie and Tibshirani, 1987). The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. Figure 23.2 shows the outcomes and two of the covariates, systolic blood pressure and tobacco consumption. People with/without heart disease are coded with triangles and squares. Also shown is a linear decision boundary which we will explain shortly. In this example, the groups are much harder to tell apart. In fact, 141 people are misclassified using this classification rule.

At this point, it is worth revisiting the Statistics/Data Mining dictionary:

Statistics       Computer Science       Meaning
classification   supervised learning    predicting a discrete Y from X
data             training sample        (X_1, Y_1), \ldots, (X_n, Y_n)
covariates       features               the X_i's
classifier       hypothesis             map h : \mathcal{X} \to \mathcal{Y}
estimation       learning               finding a good classifier

In most cases in this chapter, we deal with the case \mathcal{Y} = \{0, 1\}.

FIGURE 23.1. Two covariates and a linear decision boundary. \triangle means Y = 1. \square means Y = 0. You won't see real data like this.

FIGURE 23.2. Two covariates (systolic blood pressure and tobacco) and a linear decision boundary with data from the Coronary Risk-Factor Study (CORIS). \triangle means Y = 1. \square means Y = 0. This is real life.

23.2 Error Rates and the Bayes Classifier

Our goal is to find a classification rule h that makes accurate predictions. We start with the following definitions. (One can use other measures of error as well.)

Definition 23.3 The true error rate of a classifier h is

L(h) = P(\{h(X) \ne Y\})   (23.1)

and the empirical error rate or training error rate is

\hat L_n(h) = \frac{1}{n} \sum_{i=1}^{n} I(h(X_i) \ne Y_i).   (23.2)

First we consider the special case where \mathcal{Y} = \{0, 1\}. Let

r(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)

denote the regression function. From Bayes' theorem we have that

r(x) = \frac{\pi f_1(x)}{\pi f_1(x) + (1 - \pi) f_0(x)}   (23.3)

where

f_0(x) = f(x \mid Y = 0), \qquad f_1(x) = f(x \mid Y = 1)

and \pi = P(Y = 1).

Definition 23.4 The Bayes classification rule h^* is defined to be

h^*(x) = \begin{cases} 1 & \text{if } r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}   (23.4)

The set D(h) = \{x : P(Y = 1 \mid X = x) = P(Y = 0 \mid X = x)\} is called the decision boundary.

Warning! The Bayes rule has nothing to do with Bayesian inference. We could estimate the Bayes rule using either frequentist or Bayesian methods.

The Bayes rule may be written in several equivalent forms:

h^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x) \\ 0 & \text{otherwise} \end{cases}   (23.5)

and

h^*(x) = \begin{cases} 1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\ 0 & \text{otherwise.} \end{cases}   (23.6)

Theorem 23.5 The Bayes rule is optimal, that is, if h is any other classification rule then L(h^*) \le L(h).

The Bayes rule depends on unknown quantities so we need to use the data to find some approximation to the Bayes rule. At the risk of oversimplifying, there are three main approaches:

1. Empirical Risk Minimization. Choose a set of classifiers \mathcal{H} and find \hat h \in \mathcal{H} that minimizes some estimate of L(h).

2. Regression. Find an estimate \hat r of the regression function r and define

   \hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}

3. Density Estimation. Estimate f_0 from the X_i's for which Y_i = 0, estimate f_1 from the X_i's for which Y_i = 1 and let \hat\pi = n^{-1} \sum_{i=1}^{n} Y_i. Define

   \hat r(x) = \hat P(Y = 1 \mid X = x) = \frac{\hat\pi \hat f_1(x)}{\hat\pi \hat f_1(x) + (1 - \hat\pi) \hat f_0(x)}

   and

   \hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}

Now let us generalize to the case where Y takes on more than two values as follows.

Theorem 23.6 Suppose that Y \in \mathcal{Y} = \{1, \ldots, K\}. The optimal rule is

h(x) = \text{argmax}_k\ P(Y = k \mid X = x)   (23.7)
     = \text{argmax}_k\ \pi_k f_k(x)   (23.8)

where

P(Y = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_r f_r(x) \pi_r},   (23.9)

\pi_r = P(Y = r), f_r(x) = f(x \mid Y = r) and \text{argmax}_k means "the value of k that maximizes that expression."

23.3 Gaussian and Linear Classifiers

Perhaps the simplest approach to classification is to use the density estimation strategy and assume a parametric model for the densities. Suppose that \mathcal{Y} = \{0, 1\} and that f_0(x) = f(x \mid Y = 0) and f_1(x) = f(x \mid Y = 1) are both multivariate Gaussians:

f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}, \quad k = 0, 1.

Thus, X \mid Y = 0 \sim N(\mu_0, \Sigma_0) and X \mid Y = 1 \sim N(\mu_1, \Sigma_1).

Theorem 23.7 If X \mid Y = 0 \sim N(\mu_0, \Sigma_0) and X \mid Y = 1 \sim N(\mu_1, \Sigma_1), then the Bayes rule is

h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2 \log\left( \frac{\pi_1}{\pi_0} \right) + \log\left( \frac{|\Sigma_0|}{|\Sigma_1|} \right) \\ 0 & \text{otherwise} \end{cases}   (23.10)

where

r_i^2 = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i), \quad i = 0, 1   (23.11)

is the Mahalanobis distance. An equivalent way of expressing the Bayes rule is

h^*(x) = \text{argmax}_k\ \delta_k(x)

where

\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k   (23.12)

and |A| denotes the determinant of a matrix A.

The decision boundary of the above classifier is quadratic so this procedure is often called quadratic discriminant analysis (QDA). In practice, we use sample estimates of \pi_0, \pi_1, \mu_0, \mu_1, \Sigma_0, \Sigma_1 in place of the true values, namely:

\hat\pi_0 = \frac{1}{n} \sum_{i=1}^{n} (1 - Y_i), \qquad \hat\pi_1 = \frac{1}{n} \sum_{i=1}^{n} Y_i

\hat\mu_0 = \frac{1}{n_0} \sum_{i:\, Y_i = 0} X_i, \qquad \hat\mu_1 = \frac{1}{n_1} \sum_{i:\, Y_i = 1} X_i

S_0 = \frac{1}{n_0} \sum_{i:\, Y_i = 0} (X_i - \hat\mu_0)(X_i - \hat\mu_0)^T, \qquad S_1 = \frac{1}{n_1} \sum_{i:\, Y_i = 1} (X_i - \hat\mu_1)(X_i - \hat\mu_1)^T

where n_0 = \sum_i (1 - Y_i) and n_1 = \sum_i Y_i.

A simplification occurs if we assume that \Sigma_0 = \Sigma_1 = \Sigma. In that case, the Bayes rule is

h^*(x) = \text{argmax}_k\ \delta_k(x)   (23.13)

where now

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k.   (23.14)

The parameters are estimated as before except that the mle of \Sigma is

S = \frac{n_0 S_0 + n_1 S_1}{n_0 + n_1}.

The classification rule is

h^*(x) = \begin{cases} 1 & \text{if } \delta_1(x) > \delta_0(x) \\ 0 & \text{otherwise} \end{cases}   (23.15)

where

\delta_j(x) = x^T S^{-1} \hat\mu_j - \frac{1}{2} \hat\mu_j^T S^{-1} \hat\mu_j + \log \hat\pi_j

is called the discriminant function. The decision boundary \{x : \delta_0(x) = \delta_1(x)\} is linear, so this method is called linear discriminant analysis (LDA).
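Here is a compact Python sketch (not from the text) of these LDA estimates and the discriminant functions in (23.14)-(23.15); it assumes the data are stored in a NumPy array X of shape (n, d) and a 0/1 label vector Y.

    import numpy as np

    def lda_fit(X, Y):
        # Sample estimates of pi_k, mu_k and the pooled covariance S (mle).
        n = len(Y)
        X0, X1 = X[Y == 0], X[Y == 1]
        n0, n1 = len(X0), len(X1)
        pi0, pi1 = n0 / n, n1 / n
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        S0 = (X0 - mu0).T @ (X0 - mu0) / n0
        S1 = (X1 - mu1).T @ (X1 - mu1) / n1
        S = (n0 * S0 + n1 * S1) / (n0 + n1)
        return (pi0, mu0), (pi1, mu1), np.linalg.inv(S)

    def lda_predict(x, fit):
        # Classify as 1 if delta_1(x) > delta_0(x), using the linear discriminant functions.
        (pi0, mu0), (pi1, mu1), Sinv = fit
        def delta(mu, pi):
            return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)
        return int(delta(mu1, pi1) > delta(mu0, pi0))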

Example 23.8 Let us return to the South African heart disease data. The decision rule in Figure 23.2 was obtained by linear discrimination. The outcome was

            classified as 0    classified as 1
  y = 0          277                 25
  y = 1          116                 44

The observed misclassification rate is 141/462 = .31. Including all the covariates reduces the error rate to .27. The results from quadratic discrimination are

            classified as 0    classified as 1
  y = 0          272                 30
  y = 1          113                 47

which has about the same error rate, 143/462 = .31. Including all the covariates reduces the error rate to .26. In this example, there is little advantage to QDA over LDA.

Now we generalize to the case where Y takes on more than two values.

Theorem 23.9 Suppose that Y \in \{1, \ldots, K\}. If f_k(x) = f(x \mid Y = k) is Gaussian, the Bayes rule is

h(x) = \text{argmax}_k\ \delta_k(x)

where

\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k.   (23.16)

If the variances of the Gaussians are equal, then

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k.   (23.17)

We estimate \delta_k(x) by inserting estimates of \mu_k, \Sigma_k and \pi_k.

There is another version of linear discriminant analysis due to Fisher. The idea is to first reduce the dimension of the covariates to one dimension by projecting the data onto a line. Algebraically, this means replacing the covariate X = (X_1, \ldots, X_d) with a linear combination U = w^T X = \sum_{j=1}^{d} w_j X_j. The goal is to choose the vector w = (w_1, \ldots, w_d) that "best separates the data." Then we perform classification with the new covariate U instead of X.

We need to define what we mean by separation of the groups. We would like the two groups to have means that are far apart relative to their spread. Let \mu_j denote the mean of X for Y = j and let \Sigma be the variance matrix of X. Then E(U \mid Y = j) = E(w^T X \mid Y = j) = w^T \mu_j and V(U) = w^T \Sigma w. (The quantity J below also arises in physics, where it is called the Rayleigh coefficient.) Define the separation by

J(w) = \frac{(E(U \mid Y = 0) - E(U \mid Y = 1))^2}{w^T \Sigma w}
     = \frac{(w^T \mu_0 - w^T \mu_1)^2}{w^T \Sigma w}
     = \frac{w^T (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T \Sigma w}.

We estimate J as follows. Let n_j = \sum_{i=1}^{n} I(Y_i = j) be the number of observations in group j, let \bar X_j be the sample mean vector of the X's for group j, and let S_j be the sample covariance matrix in group j. Define

\hat J(w) = \frac{w^T S_B w}{w^T S_W w}   (23.18)

where

S_B = (\bar X_0 - \bar X_1)(\bar X_0 - \bar X_1)^T

S_W = \frac{(n_0 - 1) S_0 + (n_1 - 1) S_1}{(n_0 - 1) + (n_1 - 1)}.

Theorem 23.10 The vector

w = S_W^{-1} (\bar X_0 - \bar X_1)   (23.19)

maximizes \hat J(w). We call

U = w^T X = (\bar X_0 - \bar X_1)^T S_W^{-1} X   (23.20)

the Fisher linear discriminant function. The midpoint m between \bar X_0 and \bar X_1 along this direction is

m = \frac{1}{2} w^T (\bar X_0 + \bar X_1) = \frac{1}{2} (\bar X_0 - \bar X_1)^T S_W^{-1} (\bar X_0 + \bar X_1).   (23.21)

Fisher's classification rule is

h(x) = \begin{cases} 0 & \text{if } w^T x \ge m \\ 1 & \text{if } w^T x < m \end{cases}
     = \begin{cases} 0 & \text{if } (\bar X_0 - \bar X_1)^T S_W^{-1} x \ge m \\ 1 & \text{if } (\bar X_0 - \bar X_1)^T S_W^{-1} x < m. \end{cases}

Fisher's rule is the same as the Bayes linear classifier in equation (23.14) when \hat\pi = 1/2.

23.4 Linear Regression and Logistic Regression

A more direct approach to classification is to estimate the regression function r(x) = E(Y \mid X = x) without bothering to estimate the densities f_k. For the rest of this section, we will only consider the case where \mathcal{Y} = \{0, 1\}. Thus, r(x) = P(Y = 1 \mid X = x) and once we have an estimate \hat r, we will use the classification rule

h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}   (23.22)

The simplest regression model is the linear regression model

Y = r(x) + \epsilon = \beta_0 + \sum_{j=1}^{d} \beta_j X_j + \epsilon   (23.23)

where E(\epsilon) = 0. This model can't be correct since it does not force Y = 0 or 1. Nonetheless, it can sometimes lead to a decent classifier.

Recall that the least squares estimate of \beta = (\beta_0, \beta_1, \ldots, \beta_d)^T minimizes the residual sum of squares

RSS(\beta) = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{d} X_{ij} \beta_j \right)^2.

Let us briefly review what that estimator is. Let X denote the n \times (d+1) matrix of the form

X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1d} \\ 1 & X_{21} & \cdots & X_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nd} \end{bmatrix}.

Also let Y = (Y_1, \ldots, Y_n)^T. Then,

RSS(\beta) = (Y - X\beta)^T (Y - X\beta)

and the model can be written as

Y = X\beta + \epsilon

where \epsilon = (\epsilon_1, \ldots, \epsilon_n)^T. The least squares solution \hat\beta that minimizes RSS can be found by elementary calculus and is given by

\hat\beta = (X^T X)^{-1} X^T Y.

The predicted values are

\hat Y = X \hat\beta.

Now we use (23.22) to classify, where \hat r(x) = \hat\beta_0 + \sum_j \hat\beta_j x_j.

It seems sensible to use a regression method that takes into account that Y \in \{0, 1\}. The most common method for doing so is logistic regression. Recall that the model is

r(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}}.   (23.24)

We may write this as

\text{logit}\, P(Y = 1 \mid X = x) = \beta_0 + \sum_j \beta_j x_j

where \text{logit}(a) = \log(a/(1 - a)). Under this model, each Y_i is a Bernoulli with success probability

p_i(\beta) = \frac{e^{\beta_0 + \sum_j \beta_j X_{ij}}}{1 + e^{\beta_0 + \sum_j \beta_j X_{ij}}}.

The likelihood function for the data set is

\mathcal{L}(\beta) = \prod_{i=1}^{n} p_i(\beta)^{Y_i} (1 - p_i(\beta))^{1 - Y_i}.

We obtain the mle numerically.
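Since the mle has no closed form, here is one standard way to compute it, sketched in Python (not from the text): Newton-Raphson, which for this model is the same as iteratively reweighted least squares. The function and variable names are illustrative.

    import numpy as np

    def logistic_mle(X, y, iters=25):
        # Newton-Raphson for the logistic regression mle.
        # X: n x d covariate matrix (an intercept column is added here); y: 0/1 labels.
        n, d = X.shape
        Xd = np.column_stack([np.ones(n), X])
        beta = np.zeros(d + 1)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-Xd @ beta))      # p_i(beta)
            W = p * (1.0 - p)                         # Bernoulli variances as weights
            grad = Xd.T @ (y - p)
            hess = Xd.T @ (Xd * W[:, None])
            beta = beta + np.linalg.solve(hess, grad) # Newton step
        return beta

    def classify(X, beta):
        # Classification rule (23.22): predict 1 when the estimated probability exceeds 1/2.
        Xd = np.column_stack([np.ones(len(X)), X])
        return (1.0 / (1.0 + np.exp(-Xd @ beta)) > 0.5).astype(int)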

Example 23.11 Let us return to the heart disease data. A logistic regression yields the following estimates and Wald tests for the coefficients:

Covariate    Estimate   Std. Error   Wald statistic   p-value
Intercept     -6.145       1.300        -4.738          0.000
sbp            0.007       0.006         1.138          0.255
tobacco        0.079       0.027         2.991          0.003
ldl            0.174       0.059         2.925          0.003
adiposity      0.019       0.029         0.637          0.524
famhist        0.925       0.227         4.078          0.000
typea          0.040       0.012         3.233          0.001
obesity       -0.063       0.044        -1.427          0.153
alcohol        0.000       0.004         0.027          0.979
age            0.045       0.012         3.754          0.000

Are you surprised by the fact that systolic blood pressure is not significant or by the minus sign for the obesity coefficient? If yes, then you are confusing correlation and causation. There are lots of unobserved confounding variables, so the coefficients above cannot be interpreted causally. The fact that blood pressure is not significant does not mean that blood pressure is not an important cause of heart disease. It means that it is not an important predictor of heart disease relative to the other variables in the model. The error rate, using this model for classification, is .27. The error rate from a linear regression is .26.

We can get a better classifier by fitting a richer model. For example, we could fit

\text{logit}\, P(Y = 1 \mid X = x) = \beta_0 + \sum_j \beta_j x_j + \sum_{j,k} \beta_{jk} x_j x_k.   (23.25)

More generally, we could add terms of up to order r for some integer r. Large values of r give a more complicated model which should fit the data better. But there is a bias-variance tradeoff which we'll discuss later.

Example 23.12 If we use model (23.25) for the heart disease data with r = 2, the error rate is reduced to .22.

Page 437: All of Statistics: A Concise Course in Statistical Inference Brief

23.5 Relationship Between Logistic Regression and LDA 439

The logistic regression model can easily be extended to k

groups but we shall not give the details here.

23.5 Relationship Between Logistic Regression and

LDA

LDA and logistic regression are almost the same thing. If we

assume that each group is Gaussian with the same covariance

matrix then we saw that

log

�P(Y = 1jX = x)

P(Y = 0jX = x)

�= log

��0

�1

�� 1

2(�0 + �1)

T��1(�1 � �0) + xT��1(�1 � �0)

� �0 + �Tx:

On the other hand, the logistic model is, by assumption,

log

�P(Y = 1jX = x)

P(Y = 0jX = x)

�= �0 + �Tx:

These are the same model since they both lead to classi�cation

rules that are linear in x. The di�erence is in how we estimate

the parameters.

The joint density of a single observation is f(x; y) = f(xjy)f(y) =f(yjx)f(x). In LDA we estimated the whole joint distribution by

estimating f(xjy) and f(y); speci�cally, we estimated fk(x) =

f(xjY = k), �1 = fY (1) and �0 = fY (0). We maximized the

likelihood Yi

f(xi; yi) =Yi

f(xijyi)| {z }Gaussian

Yi

f(yi)| {z }Bernoulli

: (23.26)

In logistic regression we maximized the conditional likelihoodQif(yijxi) but we ignored the second term f(xi):Y

i

f(xi; yi) =Yi

f(yijxi)| {z }logistic

Yi

f(xi)| {z }ignored

: (23.27)

Page 438: All of Statistics: A Concise Course in Statistical Inference Brief

440 23. Classi�cation

Since classi�cation only requires knowing f(yjx), we don't reallyneed to estimate the whole joint distribution. Here is an analogy:

we can estimate the mean � of a distribution with the sample

mean X without bothering to estimate the whole density func-

tion. Logistic regression leaves the marginal distribution f(x)

un-speci�ed so it is more nonparametric than LDA.

To summarize: LDA and logistic regression both lead to a

linear classi�cation rule. In LDA we estimate the entire joint

distribution f(x; y) = f(xjy)f(y). In logistic regression we only

estimate f(yjx) and we don't bother estimating f(x).

23.6 Density Estimation and Naive Bayes

Recall that the Bayes rule is h(x) = \text{argmax}_k\ \pi_k f_k(x). If we can estimate \pi_k and f_k then we can estimate the Bayes classification rule. Estimating \pi_k is easy but what about f_k? We did this previously by assuming f_k was Gaussian. Another strategy is to estimate f_k with some nonparametric density estimator \hat f_k such as a kernel estimator. But if x = (x_1, \ldots, x_d) is high dimensional, nonparametric density estimation is not very reliable. This problem is ameliorated if we assume that X_1, \ldots, X_d are independent, for then f_k(x_1, \ldots, x_d) = \prod_{j=1}^{d} f_{kj}(x_j). This reduces the problem to d one-dimensional density estimation problems, within each of the k groups. The resulting classifier is called the naive Bayes classifier. The assumption that the components of X are independent is usually wrong, yet the resulting classifier might still be accurate. Here is a summary of the steps in the naive Bayes classifier:

The Naive Bayes Classifier

1. For each group k, compute an estimate \hat f_{kj} of the density f_{kj} for X_j, using the data for which Y_i = k.

2. Let

   \hat f_k(x) = \hat f_k(x_1, \ldots, x_d) = \prod_{j=1}^{d} \hat f_{kj}(x_j).

3. Let

   \hat\pi_k = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = k)

   where I(Y_i = k) = 1 if Y_i = k and I(Y_i = k) = 0 if Y_i \ne k.

4. Let

   h(x) = \text{argmax}_k\ \hat\pi_k \hat f_k(x).

The naive Bayes classifier is especially popular when x is high dimensional and discrete. In that case, \hat f_{kj}(x_j) is especially simple.
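The following Python sketch (not from the text) implements the four steps above with one simple choice of univariate density estimate, a Normal fit per feature; any one-dimensional density estimator could be substituted in step 1.

    import numpy as np
    from scipy.stats import norm

    def naive_bayes_fit(X, Y):
        # Step 1: per-class, per-feature univariate density estimates (here: Normal fits).
        # Step 3: class proportions pi_hat_k.
        fit = {}
        for k in np.unique(Y):
            Xk = X[Y == k]
            fit[k] = (len(Xk) / len(Y),            # pi_hat_k
                      Xk.mean(axis=0),             # per-feature means
                      Xk.std(axis=0, ddof=1))      # per-feature standard deviations
        return fit

    def naive_bayes_predict(x, fit):
        # Steps 2 and 4: multiply the univariate densities and pick the class
        # maximizing pi_hat_k * f_hat_k(x) (done on the log scale for stability).
        scores = {k: np.log(pi) + np.sum(norm.logpdf(x, mu, sd))
                  for k, (pi, mu, sd) in fit.items()}
        return max(scores, key=scores.get)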

23.7 Trees

Trees are classification methods that partition the covariate space \mathcal{X} into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree.

For illustration, suppose there are two covariates, X_1 = age and X_2 = blood pressure. Figure 23.3 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age \ge 50 then we classify him as Y = 1. If a subject has Age < 50 then we check his blood pressure. If blood pressure is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 23.4 shows the same classifier as a partition of the covariate space.

FIGURE 23.3. A simple classification tree. (The root splits on Age at 50; the left branch splits on Blood Pressure at 100.)

Here is how a tree is constructed. For simplicity, we focus on the case in which \mathcal{Y} = \{0, 1\}. First, suppose there is only a single covariate X. We choose a split point t that divides the real line into two sets A_1 = (-\infty, t] and A_2 = (t, \infty). Let \hat p_s(j) be the proportion of observations in A_s such that Y_i = j:

\hat p_s(j) = \frac{\sum_{i=1}^{n} I(Y_i = j, X_i \in A_s)}{\sum_{i=1}^{n} I(X_i \in A_s)}   (23.28)

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

I(t) = \sum_{s=1}^{2} \gamma_s   (23.29)

where

\gamma_s = 1 - \sum_{j=0}^{1} \hat p_s(j)^2.   (23.30)

FIGURE 23.4. Partition representation of the classification tree.

This measure of impurity is known as the Gini index. If a partition element A_s contains all 0's or all 1's, then \gamma_s = 0. Otherwise, \gamma_s > 0. We choose the split point t to minimize the impurity. (Other indices of impurity can be used besides the Gini index.)

When there are several covariates, we choose whichever covariate and split leads to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than n_0 data points, where n_0 is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element.

This procedure is easily generalized to the case where Y \in \{1, \ldots, K\}: we simply define the impurity by

\gamma_s = 1 - \sum_{j=1}^{K} \hat p_s(j)^2   (23.31)

where \hat p_s(j) is the proportion of observations in the partition element for which Y = j.
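As an illustration of the splitting criterion (not from the text), the Python sketch below scans candidate split points for a single covariate and returns the one minimizing the impurity I(t); extending it to several covariates just means repeating the scan for each covariate and taking the best overall split.

    import numpy as np

    def gini(y):
        # gamma_s = 1 - sum_j p_hat_s(j)^2 for the labels falling in one partition element.
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def best_split(x, y):
        # Try each midpoint between consecutive distinct x values as a split point t
        # and pick the one minimizing I(t) = gamma_1 + gamma_2.
        xs = np.unique(x)
        candidates = (xs[:-1] + xs[1:]) / 2.0
        impurities = [gini(y[x <= t]) + gini(y[x > t]) for t in candidates]
        best = int(np.argmin(impurities))
        return candidates[best], impurities[best]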

Example 23.13 Figure 23.5 shows a classification tree for the heart disease data. The misclassification rate is .21. Suppose we do the same example but only using two covariates: tobacco and age. The misclassification rate is then .29. The tree is shown in Figure 23.6. Because there are only two covariates, we can also display the tree as a partition of \mathcal{X} = \mathbb{R}^2, as shown in Figure 23.7. \blacksquare

Our description of how to build trees is incomplete. If we keep splitting until there are few cases in each leaf of the tree, we are likely to overfit the data. We should choose the complexity of the tree in such a way that the estimated true error rate is low. In the next section, we discuss estimation of the error rate.

FIGURE 23.5. A classification tree for the heart disease data. (The splits involve age, tobacco, alcohol, typea, famhist, adiposity and ldl.)

FIGURE 23.6. A classification tree for the heart disease data using two covariates (age and tobacco).

FIGURE 23.7. The tree pictured as a partition of the covariate space (age versus tobacco).

23.8 Assessing Error Rates and Choosing a Good Classifier

How do we choose a good classifier? We would like to have a classifier h with a low true error rate L(h). Usually, we can't use the training error rate \hat L_n(h) as an estimate of the true error rate because it is biased downward.

Example 23.14 Consider the heart disease data again. Suppose we fit a sequence of logistic regression models. In the first model we include one covariate. In the second model we include two covariates, and so on. The ninth model includes all the covariates. We can go even further. Let's also fit a tenth model that includes all nine covariates plus the first covariate squared. Then we fit an eleventh model that includes all nine covariates plus the first covariate squared and the second covariate squared. Continuing this way we will get a sequence of 18 classifiers of increasing complexity. The solid line in Figure 23.8 shows the observed classification error, which steadily decreases as we make the model more complex. If we keep going, we can make a model with zero observed classification error. The dotted line shows the cross-validation estimate of the error rate (to be explained shortly), which is a better estimate of the true error rate than the observed classification error. The estimated error decreases for a while then increases. This is the bias-variance tradeoff phenomenon we have seen before. \blacksquare

There are many ways to estimate the error rate. We'll consider two: cross-validation and probability inequalities.

Cross-Validation. The basic idea of cross-validation, which we have already encountered in curve estimation, is to leave out some of the data when fitting a model. The simplest version of cross-validation involves randomly splitting the data into two

pieces: the training set \mathcal{T} and the validation set \mathcal{V}. Often, about 10 per cent of the data might be set aside as the validation set. The classifier h is constructed from the training set. We then estimate the error by

\hat L(h) = \frac{1}{m} \sum_{X_i \in \mathcal{V}} I(h(X_i) \ne Y_i)   (23.32)

where m is the size of the validation set. See Figure 23.9.

FIGURE 23.8. The solid line is the observed error rate and the dashed line is the cross-validation estimate of the true error rate, plotted against the number of terms in the model.

FIGURE 23.9. Cross-validation: the data are split into a training set (used to construct \hat h) and a validation set (used to compute \hat L).

Another approach to cross-validation is K-fold cross-validation, which is obtained from the following algorithm.

K-fold cross-validation.

1. Randomly divide the data into K chunks of approximately equal size. A common choice is K = 10.

2. For k = 1 to K, do the following:

   (a) Delete chunk k from the data.
   (b) Compute the classifier \hat h_{(k)} from the rest of the data.
   (c) Use \hat h_{(k)} to predict the data in chunk k. Let \hat L_{(k)} denote the observed error rate.

3. Let

   \hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_{(k)}.   (23.33)

This is how the dotted line in Figure 23.8 was computed. An alternative approach to K-fold cross-validation is to split the data into two pieces called the training set and the test set. All the models are fit on the training set. The models are then used to do predictions on the test set and the errors are estimated from these predictions. When the classification procedure is time consuming and there are lots of data, this approach is preferred to K-fold cross-validation.
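Here is a minimal Python sketch of the K-fold procedure above (not from the text); `fit` and `predict` stand for whatever classification method is being assessed and are placeholders.

    import numpy as np

    def kfold_error(X, Y, fit, predict, K=10, seed=0):
        # Estimate the true error rate by K-fold cross-validation (equation 23.33).
        n = len(Y)
        idx = np.random.default_rng(seed).permutation(n)
        chunks = np.array_split(idx, K)
        errors = []
        for k in range(K):
            test = chunks[k]
            train = np.concatenate([chunks[j] for j in range(K) if j != k])
            model = fit(X[train], Y[train])            # classifier built without chunk k
            Y_hat = predict(X[test], model)            # predictions for chunk k
            errors.append(np.mean(Y_hat != Y[test]))   # observed error rate L_hat_(k)
        return np.mean(errors)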

Cross-validation can be applied to any classification method. To apply it to trees, one begins by fitting an initial tree. Smaller trees are obtained by pruning the tree, that is, removing splits from the tree. We can do this for trees of various sizes, where size refers to the number of terminal nodes on the tree. Cross-validation is then used to estimate the error rate as a function of tree size.

Example 23.15 Figure 23.10 shows the cross-validation plot for a classification tree fitted to the heart disease data. The error is shown as a function of tree size. Figure 23.11 shows the tree with the size chosen by minimizing the cross-validation error. \blacksquare

Probability Inequalities. Another approach to estimating the error rate is to find a confidence interval for L(\hat h) using probability inequalities. This method is useful in the context of empirical risk minimization.

Let \mathcal{H} be a set of classifiers, for example, all linear classifiers. Empirical risk minimization means choosing the classifier \hat h \in \mathcal{H} to minimize the training error \hat L_n(h), also called the empirical risk. Thus,

\hat h = \text{argmin}_{h \in \mathcal{H}}\ \hat L_n(h) = \text{argmin}_{h \in \mathcal{H}} \left( \frac{1}{n} \sum_i I(h(X_i) \ne Y_i) \right).   (23.34)

Typically, \hat L_n(\hat h) underestimates the true error rate L(\hat h) because \hat h was chosen to make \hat L_n(\hat h) small. Our goal is to assess how much underestimation is taking place. Our main tool for this analysis is Hoeffding's inequality. Recall that if X_1, \ldots, X_n \sim \text{Bernoulli}(p), then, for any \epsilon > 0,

P(|\hat p - p| > \epsilon) \le 2 e^{-2 n \epsilon^2}   (23.35)

where \hat p = n^{-1} \sum_{i=1}^{n} X_i.

First, suppose that \mathcal{H} = \{h_1, \ldots, h_m\} consists of finitely many classifiers. For any fixed h, \hat L_n(h) converges almost surely to L(h) by the law of large numbers. We will now establish a stronger result.

Theorem 23.16 (Uniform Convergence.) Assume \mathcal{H} is finite and has m elements. Then,

P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le 2 m e^{-2 n \epsilon^2}.

Proof. We will use Hoeffding's inequality and we will also use the fact that if A_1, \ldots, A_m is a set of events then P(\bigcup_{i=1}^{m} A_i) \le \sum_{i=1}^{m} P(A_i). Now,

P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) = P\left( \bigcup_{h \in \mathcal{H}} \left\{ |\hat L_n(h) - L(h)| > \epsilon \right\} \right)
 \le \sum_{h \in \mathcal{H}} P\left( |\hat L_n(h) - L(h)| > \epsilon \right)
 \le \sum_{h \in \mathcal{H}} 2 e^{-2 n \epsilon^2} = 2 m e^{-2 n \epsilon^2}. \ \blacksquare

FIGURE 23.10. Cross-validation plot: the cross-validation score as a function of tree size.

FIGURE 23.11. Smaller classification tree with size chosen by cross-validation. (The remaining splits involve age, typea, famhist and tobacco.)

Theorem 23.17 Let

\epsilon = \sqrt{ \frac{2}{n} \log\left( \frac{2m}{\alpha} \right) }.

Then \hat L_n(\hat h) \pm \epsilon is a 1 - \alpha confidence interval for L(\hat h).

Proof. This follows from the fact that

P(|\hat L_n(\hat h) - L(\hat h)| > \epsilon) \le P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le 2 m e^{-2 n \epsilon^2} = \alpha. \ \blacksquare

When \mathcal{H} is large, the confidence interval for L(\hat h) is large. The more functions there are in \mathcal{H}, the more likely it is we have "overfit," which we compensate for by having a larger confidence interval.
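As a small numerical illustration (not from the text), the half-width \epsilon in Theorem 23.17 is easy to compute; the values of n, m, \alpha and the training error below are made up.

    import numpy as np

    def erm_interval(L_train, n, m, alpha=0.05):
        # 1 - alpha confidence interval for L(h_hat) from Theorem 23.17:
        # epsilon = sqrt((2/n) * log(2m / alpha)).
        eps = np.sqrt((2.0 / n) * np.log(2.0 * m / alpha))
        return L_train - eps, L_train + eps

    # Example: n = 1000 observations, m = 50 candidate classifiers, training error 0.21.
    print(erm_interval(0.21, n=1000, m=50, alpha=0.05))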

In practice we usually use sets \mathcal{H} that are infinite, such as the set of linear classifiers. To extend our analysis to these cases we want to be able to say something like

P\left( \sup_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le \text{something not too big.}

All the other results followed from this inequality. One way to develop such a generalization is by way of the Vapnik-Chervonenkis or VC dimension. We now consider the main ideas in VC theory.

Page 454: All of Statistics: A Concise Course in Statistical Inference Brief

456 23. Classi�cation

Let A be a class of sets. Give a �nite set F = fx1; : : : ; xng let

NA(F ) = #nF\

A : A 2 Ao

(23.36)

be the number of subsets of F \picked out" by A. Here #(B)denotes the number of elements of a set B. The shatter coef-

�cient is de�ned by

s(A; n) = maxF2Fn

NA(F ) (23.37)

where Fn consists of all �nite sets of size n. Now letX1; : : : ; Xn �P and let

Pn(A) =1

n

Xi

I(Xi 2 A)

denote the empirical probability measure. The following remark-

able theorem bounds the distance between P and Pn.

Theorem 23.18 (Vapnik and Chervonenkis (1971).) For any

P, n and � > 0,

P

�supA2A

jPn(A)� P(A)j > �

�� 8s(A; n)e�n�2=32: (23.38)

The proof, though very elegant, is long and we omit it. If His a set of classi�ers, de�ne A to be the class of sets of the form

fx : h(x) = 1g. When then de�ne s(H; n) = s(A; n).

Theorem 23.19

P

�suph2H

jbLn(h)� L(h)j > �

�� 8s(H; n)e�n�2=32:

A 1� � con�dence interval for L(bh) is bLn(bh)� �n where

�2n=

32

nlog

�8s(H; n)

�:

Page 455: All of Statistics: A Concise Course in Statistical Inference Brief

23.8 Assessing Error Rates and Choosing a Good Classi�er 457

These theorems are only useful if the shatter coeÆcients do

not grow too quickly with n. This is where VC dimension enters.

De�nition 23.20 The VC (Vapnik-Chervonenkis) dimen-

sion of a class of sets A is de�ned as follows. If s(A; n) =2n for all n set V C(A) = 1. Otherwise, de�ne V C(A)to be the largest k for which s(A; n) = 2k.

Thus, the VC-dimension is the size of the largest �nite set

F than can be shattered by A meaning that A picks out each

subset of F . IfH is a set of classi�ers we de�ne V C(H) = V C(A)where A is the class of sets of the form fx : h(x) = 1g as hvaries in H. The following theorem shows that if A has �nite

VC-dimension then the shatter coeÆcients grow as a polynomial

in n.

Theorem 23.21 If A has �nite VC-dimension v, then

s(A; n) � nv + 1:

Example 23.22 Let A = f(�1; a]; a 2 Rg. The A shatters

ever one point set fxg but it shatters no set of the form fx; yg.Therefore, V C(A) = 1.

Example 23.23 Let A be the set of closed intervals on the real

line. Then A shatters S = fx; yg but it cannot shatter sets with3 points. Consider S = fx; y; zg where x < y < z. One cannot

�nd an interval A such that ATS = fx; zg. So, V C(A) = 2.

Example 23.24 Let A be all linear half-spaces on the plane. Any

three point set (not all on a line) can be shattered. No 4 point

set can be shattered. Consider, for example, 4 points forming a

diamond. Let T be the left and rightmost points. This can't be

picked out. Other con�gurations can also be seen to be unshat-

terable. So V C(A) = 3. In general, halfspaces in Rd have VC

dimension d+ 1.

Page 456: All of Statistics: A Concise Course in Statistical Inference Brief

458 23. Classi�cation

Example 23.25 Let A be all rectangles on the plane with sides

parallel to the axes. Any 4 point set can be shattered. Let S be

a 5 point set. There is one point that is not leftmost, rightmost,

uppermost, or lowermost. Let T be all points in S except this

point. Then T can't be picked out. So V C(A) = 4.

Theorem 23.26 Let x have dimension d and let H be th set of

linear classi�ers. The VC dimension of H is d+1. Hence, a 1��con�dence interval for the true error rate is bL(bh)� � where

�2n=

32

nlog

�8(nd+1 + 1)

�:

23.9 Support Vector Machines

In this section we consider a class of linear classifiers called support vector machines. Throughout this section, we assume that Y is binary. It will be convenient to label the outcomes as -1 and +1 instead of 0 and 1. A linear classifier can then be written as

h(x) = \text{sign}(H(x))

where x = (x_1, \ldots, x_d),

H(x) = a_0 + \sum_{i=1}^{d} a_i x_i

and

\text{sign}(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0. \end{cases}

First, suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two classes.

Page 457: All of Statistics: A Concise Course in Statistical Inference Brief

23.9 Support Vector Machines 459

Lemma 23.27 The data can be separated by some hyperplane if

and only if there exists a hyperplane H(x) = a0+P

d

i=1 aixi such

that

YiH(xi) � 1; i = 1; : : : ; n: (23.39)

Proof. Suppose the data can be separated by a hyperplane

W (x) = b0+P

d

i=1 bixi. It follows that there exists some constant

c such that Yi = 1 implies W (Xi) � c and Yi = �1 implies

W (Xi) � �c. Therefore, YiW (Xi) � c for all i. Let H(x) =

a0+P

d

i=1 aixi where aj = bj=c. Then YiH(Xi) � 1 for all i. The

reverse direction is straightforward. �

In the separable case, there will be many separating hyper-

planes. How should we choose one? Intuitively, it seems reason-

able to choose the hyperplane \furthest" from the data in the

sense that it separates the +1'a and -1's and maximizes the dis-

tance to the closest point. This hyperplane is called the maxi-

mum margin hyperplane. The margin is the distance to from

the hyperplane to the nearest point. Points on the boundary of

the margin are called support vectors. See Figure 23.12.

Theorem 23.28 The hyperplane bH(x) = ba0+Pd

i=1 baixi that sep-arates the data and maximizes the margin is given by minimizing

(1=2)P

d

j=1 b2jsubject to (23.39).

It turns out that this problem can be recast as a quadratic programming problem. Recall that \langle X_i, X_k \rangle = X_i^T X_k is the inner product of X_i and X_k.

FIGURE 23.12. The hyperplane H(x) = a_0 + a^T x = 0 has the largest margin of all hyperplanes that separate the two classes.

Theorem 23.29 Let \hat H(x) = \hat a_0 + \sum_{i=1}^{d} \hat a_i x_i denote the optimal (largest margin) hyperplane. Then, for j = 1, \ldots, d,

\hat a_j = \sum_{i=1}^{n} \hat\alpha_i Y_i X_j(i)

where X_j(i) is the value of the covariate X_j for the ith data point, and \hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_n) is the vector that maximizes

\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k Y_i Y_k \langle X_i, X_k \rangle   (23.40)

subject to

\alpha_i \ge 0

and

0 = \sum_i \alpha_i Y_i.

The points X_i for which \hat\alpha_i \ne 0 are called support vectors. \hat a_0 can be found by solving

\hat\alpha_i \left[ Y_i (X_i^T \hat a + \hat a_0) - 1 \right] = 0

for any support point X_i. \hat H may be written as

\hat H(x) = \hat a_0 + \sum_{i=1}^{n} \hat\alpha_i Y_i \langle x, X_i \rangle.

Page 459: All of Statistics: A Concise Course in Statistical Inference Brief

23.10 Kernelization 461

There are many software packages that will solve this problem

quickly. If there is no perfect linear classi�er, then one allows

overlap between the groups by replacing the condition (23.39)

with

YiH(xi) � 1� �i; �i � 0; i = 1; : : : ; n: (23.41)

The variables �1; : : : ; �n are called slack variables.

We now maximize (23.40) subject to

0 � �i � c; i = 1; : : : ; n

andnXi=1

�iYi = 0:

The constant c is a tuning parameter that controls the amount

of overlap.

23.10 Kernelization

There is a trick for improving a computationally simple classifier h called kernelization. The idea is to map the covariate X (which takes values in \mathcal{X}) into a higher dimensional space \mathcal{Z} and apply the classifier in the bigger space \mathcal{Z}. This can yield a more flexible classifier while retaining computational simplicity.

The standard example of this idea is illustrated in Figure 23.13. The covariate is x = (x_1, x_2). The Y_i's can be separated into two groups using an ellipse. Define a mapping \phi by

z = (z_1, z_2, z_3) = \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2).

This \phi maps \mathcal{X} = \mathbb{R}^2 into \mathcal{Z} = \mathbb{R}^3. In the higher dimensional space \mathcal{Z}, the Y_i's are separable by a linear decision boundary. In other words, a linear classifier in a higher dimensional space corresponds to a non-linear classifier in the original space.

Page 460: All of Statistics: A Concise Course in Statistical Inference Brief

462 23. Classi�cation

x1

x2

+

+

+

++

+

+ +

+

+

+++

+ +

z1

z2

z3

+

+

+

+

+

+

+

+

FIGURE 23.13. Kernelization. Mapping the covariates into a higher

dimensional space can make a complicated decision boundary into a

simpler decision boundary.

Page 461: All of Statistics: A Concise Course in Statistical Inference Brief

23.10 Kernelization 463

There is a potential drawback. If we significantly expand the dimension of the problem, we might increase the computational burden. For example, if $x$ has dimension $d = 256$ and we want to use all fourth order terms, then $z = \phi(x)$ has dimension 183,181,376. Two observations save us. First, many classifiers do not require that we know the values of the individual points but, rather, just the inner product between pairs of points. Second, notice in our example that the inner product in $\mathcal{Z}$ can be written
$$\langle z, \tilde{z} \rangle = \langle \phi(x), \phi(\tilde{x}) \rangle = x_1^2 \tilde{x}_1^2 + 2 x_1 \tilde{x}_1 x_2 \tilde{x}_2 + x_2^2 \tilde{x}_2^2 = (\langle x, \tilde{x} \rangle)^2 \equiv K(x, \tilde{x}).$$
Thus, we can compute $\langle z, \tilde{z} \rangle$ without ever computing $Z_i = \phi(X_i)$.

To summarize, kernelization involves finding a mapping $\phi: \mathcal{X} \to \mathcal{Z}$ and a classifier such that:

1. $\mathcal{Z}$ has higher dimension than $\mathcal{X}$ and so leads to a richer set of classifiers.

2. The classifier only requires computing inner products.

3. There is a function $K$, called a kernel, such that $\langle \phi(x), \phi(\tilde{x}) \rangle = K(x, \tilde{x})$.

4. Everywhere the term $\langle x, \tilde{x} \rangle$ appears in the algorithm, we replace it with $K(x, \tilde{x})$.

In fact, we never need to construct the mapping $\phi$ at all. We only need to specify a kernel $K(x, \tilde{x})$ that corresponds to $\langle \phi(x), \phi(\tilde{x}) \rangle$ for some $\phi$. This raises an interesting question: given a function of two variables $K(x, y)$, does there exist a function $\phi(x)$ such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$? The answer is provided by Mercer's theorem, which says, roughly, that if $K$ is positive definite (meaning that
$$\int\!\!\int K(x, y) f(x) f(y)\,dx\,dy \geq 0$$
for square integrable functions $f$), then such a $\phi$ exists. Examples of commonly used kernels are:

polynomial: $K(x, \tilde{x}) = (\langle x, \tilde{x}\rangle + a)^r$
sigmoid: $K(x, \tilde{x}) = \tanh(a\langle x, \tilde{x}\rangle + b)$
Gaussian: $K(x, \tilde{x}) = \exp\bigl(-\|x - \tilde{x}\|^2/(2\sigma^2)\bigr)$

Let us now see how we can use this trick in LDA and in support vector machines.
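Before doing so, here is a small R sketch (added for illustration, not from the original text) that computes the Gaussian and polynomial kernel matrices for a data matrix; the bandwidth sigma and the constants a and r are arbitrary illustrative choices.

## kernel functions; x and xt are numeric vectors
gauss.kernel <- function(x, xt, sigma = 1) exp(-sum((x - xt)^2) / (2 * sigma^2))
poly.kernel  <- function(x, xt, a = 1, r = 2) (sum(x * xt) + a)^r

## n x n kernel (Gram) matrix for a data matrix X (rows = observations)
kernel.matrix <- function(X, kernel, ...) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n)
    for (j in 1:n)
      K[i, j] <- kernel(X[i, ], X[j, ], ...)
  K
}

X <- matrix(rnorm(20), ncol = 2)                 # 10 observations in R^2
K <- kernel.matrix(X, gauss.kernel, sigma = 1)   # Gaussian Gram matrix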

Recall that the Fisher linear discriminant method replaces $X$ with $U = w^T X$ where $w$ is chosen to maximize the Rayleigh coefficient
$$J(w) = \frac{w^T S_B w}{w^T S_W w},$$
where
$$S_B = (\overline{X}_0 - \overline{X}_1)(\overline{X}_0 - \overline{X}_1)^T$$
and
$$S_W = \frac{(n_0 - 1)S_0}{(n_0 - 1) + (n_1 - 1)} + \frac{(n_1 - 1)S_1}{(n_0 - 1) + (n_1 - 1)}.$$

In the kernelized version, we replace $X_i$ with $Z_i = \phi(X_i)$ and we find $w$ to maximize
$$J(w) = \frac{w^T \widetilde{S}_B w}{w^T \widetilde{S}_W w} \qquad (23.42)$$
where
$$\widetilde{S}_B = (\overline{Z}_0 - \overline{Z}_1)(\overline{Z}_0 - \overline{Z}_1)^T$$
and
$$\widetilde{S}_W = \frac{(n_0 - 1)\widetilde{S}_0}{(n_0 - 1) + (n_1 - 1)} + \frac{(n_1 - 1)\widetilde{S}_1}{(n_0 - 1) + (n_1 - 1)}.$$

Here, $\widetilde{S}_j$ is the sample covariance of the $Z_i$'s for which $Y = j$. However, to take advantage of kernelization, we need to re-express this in terms of inner products and then replace the inner products with kernels.

It can be shown that the maximizing vector $w$ is a linear combination of the $Z_i$'s. Hence we can write
$$w = \sum_{i=1}^n \alpha_i Z_i.$$
Also,
$$\overline{Z}_j = \frac{1}{n_j} \sum_{i=1}^n \phi(X_i) I(Y_i = j).$$

Therefore,
$$\begin{aligned}
w^T \overline{Z}_j &= \Bigl(\sum_{i=1}^n \alpha_i Z_i\Bigr)^T \Bigl(\frac{1}{n_j} \sum_{i=1}^n \phi(X_i) I(Y_i = j)\Bigr) \\
&= \frac{1}{n_j} \sum_{i=1}^n \sum_{s=1}^n \alpha_i I(Y_s = j)\, Z_i^T \phi(X_s) \\
&= \frac{1}{n_j} \sum_{i=1}^n \alpha_i \sum_{s=1}^n I(Y_s = j)\, \phi(X_i)^T \phi(X_s) \\
&= \frac{1}{n_j} \sum_{i=1}^n \alpha_i \sum_{s=1}^n I(Y_s = j)\, K(X_i, X_s) \\
&= \alpha^T M_j
\end{aligned}$$

where $M_j$ is a vector whose $i$th component is
$$M_j(i) = \frac{1}{n_j} \sum_{s=1}^n K(X_i, X_s) I(Y_s = j).$$

It follows that
$$w^T \widetilde{S}_B w = \alpha^T M \alpha$$
where $M = (M_0 - M_1)(M_0 - M_1)^T$. By similar calculations, we can write
$$w^T \widetilde{S}_W w = \alpha^T N \alpha$$

where
$$N = K_0 \Bigl(I - \frac{1}{n_0}\mathbf{1}\Bigr) K_0^T + K_1 \Bigl(I - \frac{1}{n_1}\mathbf{1}\Bigr) K_1^T,$$
$I$ is the identity matrix, $\mathbf{1}$ is a matrix of all ones, and $K_j$ is the $n \times n_j$ matrix with entries $(K_j)_{rs} = K(x_r, x_s)$ with $x_s$ varying over the observations in group $j$. Hence, we now find $\alpha$ to maximize
$$J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}.$$

Notice that all these quantities are expressed in terms of the kernel. Formally, the solution is $\alpha = N^{-1}(M_0 - M_1)$. However, $N$ might be non-invertible. In this case one replaces $N$ by $N + bI$ for some constant $b$. Finally, the projection onto the new subspace can be written as
$$U = w^T \phi(x) = \sum_{i=1}^n \alpha_i K(x_i, x).$$

The support vector machine can similarly be kernelized. We simply replace $\langle X_i, X_j \rangle$ with $K(X_i, X_j)$. For example, instead of maximizing (23.40), we now maximize
$$\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k K(X_i, X_k). \qquad (23.43)$$
The hyperplane can be written as $\hat{H}(x) = \hat{a}_0 + \sum_{i=1}^n \hat{\alpha}_i Y_i K(x, X_i)$.

23.11 Other Classifiers

There are many other classifiers and space precludes a full discussion of all of them. Let us briefly mention a few.

The k-nearest-neighbors classifier is very simple. Given a point $x$, find the $k$ data points closest to $x$. Classify $x$ using the majority vote of these $k$ neighbors. Ties can be broken randomly. The parameter $k$ can be chosen by cross-validation.
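For illustration (added, not from the original text), here is a brief R sketch of the k-nearest-neighbors classifier; it assumes the class package, whose knn() function takes a training set, a test set and the training labels, and whose knn.cv() does leave-one-out cross-validation.

library(class)   # provides knn() and knn.cv(); assumed available

set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, 1, 0))

## classify each point using its 5 nearest neighbors
pred <- knn(train = x, test = x, cl = y, k = 5)
table(y, pred)                          # training confusion matrix

## choose k by leave-one-out cross-validation over a grid
err <- sapply(1:15, function(k) mean(knn.cv(train = x, cl = y, k = k) != y))
which.min(err)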

Bagging is a method for reducing the variability of a classifier. It is most helpful for highly non-linear classifiers such as a tree. We draw $B$ bootstrap samples from the data. The $b$th bootstrap sample yields a classifier $h_b$. The final classifier is
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \frac{1}{B}\sum_{b=1}^B h_b(x) \geq \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
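The following R sketch (an added illustration, not from the original) bags classification trees. It uses the tree package already referenced in the exercises; the choice of B = 50 bootstrap samples and 0/1 coding of y are illustrative assumptions.

library(tree)    # tree() and predict(..., type = "class"); assumed installed

bag.trees <- function(x, y, B = 50) {
  d    <- data.frame(x, y = as.factor(y))   # y assumed coded 0/1
  n    <- nrow(d)
  pred <- matrix(0, n, B)                   # one column of predictions per bootstrap tree
  for (b in 1:B) {
    idx       <- sample(n, n, replace = TRUE)     # bootstrap sample
    fit       <- tree(y ~ ., data = d[idx, ])
    pred[, b] <- as.numeric(as.character(predict(fit, d, type = "class")))
  }
  ## majority vote: classify as 1 when the average prediction is at least 1/2
  as.numeric(rowMeans(pred) >= 0.5)
}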

Boosting is a method for starting with a simple classifier and gradually improving it by refitting the data, giving higher weight to misclassified samples. Suppose that $\mathcal{H}$ is a collection of classifiers, for example, trees with only one split. Assume that $Y_i \in \{-1, 1\}$ and that each $h$ is such that $h(x) \in \{-1, 1\}$. We usually give equal weight to all data points in the methods we have discussed. But one can incorporate unequal weights quite easily in most algorithms. For example, in constructing a tree, we could replace the impurity measure with a weighted impurity measure. The original version of boosting, called AdaBoost, is as follows.

1. Set the weights $w_i = 1/n$, $i = 1, \ldots, n$.

2. For $j = 1, \ldots, J$, do the following steps:

   (a) Construct a classifier $h_j$ from the data using the weights $w_1, \ldots, w_n$.

   (b) Compute the weighted error estimate:
   $$\hat{L}_j = \frac{\sum_{i=1}^n w_i I(Y_i \neq h_j(X_i))}{\sum_{i=1}^n w_i}.$$

   (c) Let $\alpha_j = \log\bigl((1 - \hat{L}_j)/\hat{L}_j\bigr)$.

   (d) Update the weights:
   $$w_i \leftarrow w_i\, e^{\alpha_j I(Y_i \neq h_j(X_i))}.$$

3. The final classifier is
$$\hat{h}(x) = \mathrm{sign}\Bigl(\sum_{j=1}^J \alpha_j h_j(x)\Bigr).$$
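Here is a compact R sketch of AdaBoost with one-split trees (stumps) as weak learners; this is an added illustration and assumes the rpart package, whose weights argument and maxdepth = 1 control give a weighted stump. It is a sketch of the steps above, not a robust implementation.

library(rpart)   # recursive partitioning trees; assumed installed

adaboost <- function(x, y, J = 50) {
  ## y is assumed coded -1 / +1
  d      <- data.frame(x, y = as.factor(y))
  n      <- nrow(d)
  w      <- rep(1 / n, n)                  # step 1: equal weights
  alpha  <- numeric(J)
  stumps <- vector("list", J)
  for (j in 1:J) {
    ## (a) weighted one-split tree (a stump)
    fit  <- rpart(y ~ ., data = d, weights = w, method = "class",
                  control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(fit, d, type = "class")))
    ## (b) weighted error and (c) classifier weight
    L        <- sum(w * (y != pred)) / sum(w)
    alpha[j] <- log((1 - L) / L)
    ## (d) reweight the misclassified points
    w <- w * exp(alpha[j] * (y != pred))
    stumps[[j]] <- fit
  }
  list(alpha = alpha, stumps = stumps)
}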

There is now an enormous literature trying to explain and improve on boosting. Whereas bagging is a variance reduction technique, boosting can be thought of as a bias reduction technique. We start with a simple, and hence highly biased, classifier, and we gradually reduce the bias. The disadvantage of boosting is that the final classifier is quite complicated.

23.12 Bibliographic Remarks

An excellent reference on classification is Hastie, Tibshirani and Friedman (2001). For more on the theory, see Devroye, Györfi and Lugosi (1996) and Vapnik (2001).

23.13 Exercises

1. Prove Theorem 23.5.

2. Prove Theorem 23.7.

3. Download the spam data from:

http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html

The data file can also be found on the course web page. The data contain 57 covariates relating to email messages. Each email message was classified as spam (Y=1) or not spam (Y=0). The outcome Y is the last column in the file. The goal is to predict whether an email is spam or not.

(a) Construct classification rules using (i) LDA, (ii) QDA, (iii) logistic regression, and (iv) a classification tree. For each, report the observed misclassification error rate and construct a 2-by-2 table of the form

            $\hat{h}(x) = 0$    $\hat{h}(x) = 1$
   Y = 0
   Y = 1

Hint: Here are some things to remember when using R.

(i) To fit a logistic regression and get the classified values:

d <- data.frame(x,y)

n <- nrow(x)

out <- glm(y ~ .,family=binomial,data=d)

p <- out$fitted.values

pred <- rep(0,n)

pred[p > .5] <- 1

table(y,pred)

(ii) To build a tree you must turn y into a factor and you

must do it before you make the data frame:

y <- as.factor(y)

d <- data.frame(x,y)

library(tree)

out <- tree(y ~ .,data=d)

print(summary(out))

plot(out,type="u",lwd=3)

text(out)

pred <- predict(out,d,type="class")

table(y,pred)

(b) Use 5-fold cross-validation to estimate the prediction accuracy of LDA and logistic regression.

(c) Sometimes it helps to reduce the number of covariates. One strategy is to compare $X_i$ for the spam and non-spam groups. For each of the 57 covariates, test whether the mean of the covariate is the same or different between the two groups. Keep the 10 covariates with the smallest p-values. Try LDA and logistic regression using only these 10 variables.

4. Let $\mathcal{A}$ be the set of two-dimensional spheres. That is, $A \in \mathcal{A}$ if $A = \{(x, y) : (x - a)^2 + (y - b)^2 \leq c^2\}$ for some $a, b, c$. Find the VC dimension of $\mathcal{A}$.

5. Classify the spam data using support vector machines.

Free software for the support vector machine is at

http://svmlight.joachims.org/

It is very easy to use. After installing, do the following.

First, you must recode the Yi's as +1 and -1. Then type

svm_learn data output

where \data" is the �le with the data. It expects the data

in the following form:

1 1:3.4 2:.17 3:129 etc

where the first number is Y (+1 or -1) and the rest of the

line means: X1 is 3.4, X2 is .17, etc.

If the �le \test" contains test data, then type


svm_classify test output predictions

to get the predictions on the test data. There are many

options available, as explained on the web site.

6. Use VC theory to get a confidence interval on the true error rate of the LDA classifier in question 3. Suppose that the covariate $X = (X_1, \ldots, X_d)$ has $d$ dimensions. Assume that each variable has a continuous distribution.

7. Suppose that $X_i \in \mathbb{R}$ and that $Y_i = 1$ whenever $|X_i| \leq 1$ and $Y_i = 0$ whenever $|X_i| > 1$. Show that no linear classifier can perfectly classify these data. Show that the kernelized data $Z_i = (X_i, X_i^2)$ can be linearly separated.

8. Repeat question 5 using the kernel $K(x, \tilde{x}) = (1 + x^T \tilde{x})^p$. Choose $p$ by cross-validation.

9. Apply the k-nearest-neighbors classifier to the "iris data." Get the data in R from the command data(iris). Choose $k$ by cross-validation.

10. (Curse of Dimensionality.) Suppose that $X$ has a uniform distribution on the $d$-dimensional cube $[-1/2, 1/2]^d$. Let $R$ be the distance from the origin to the closest neighbor. Show that the median of $R$ is
$$\left( \frac{1 - \left(\tfrac{1}{2}\right)^{1/n}}{v_d(1)} \right)^{1/d}$$
where
$$v_d(r) = \frac{r^d \pi^{d/2}}{\Gamma((d/2) + 1)}$$
is the volume of a sphere of radius $r$. When does $R$ exceed the edge of the cube when $n = 100, 1000, 10000$?


11. Fit a tree to the data in question 3. Now apply bagging

and report your results.

12. Fit a tree that uses only one split on one variable to the

data in question 3. Now apply boosting.

13. Let $r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and let $\hat{r}(x)$ be an estimate of $r(x)$. Consider the classifier
$$h(x) = \begin{cases} 1 & \text{if } \hat{r}(x) \geq 1/2 \\ 0 & \text{otherwise.} \end{cases}$$
Assume that $\hat{r}(x) \approx N(\bar{r}(x), \sigma^2(x))$ for some functions $\bar{r}(x)$ and $\sigma^2(x)$. Show that
$$\mathbb{P}(Y \neq h(X)) \geq 1 - \Phi\left(\frac{\operatorname{sign}\bigl((\bar{r}(x) - (1/2))(r(x) - (1/2))\bigr)}{\sigma(x)}\right)$$
where $\Phi$ is the standard Normal cdf. Regard $\operatorname{sign}\bigl((\bar{r}(x) - (1/2))(r(x) - (1/2))\bigr)$ as a type of bias term. Explain the implications for the bias-variance tradeoff in classification (Friedman, 1997).


24

Stochastic Processes

24.1 Introduction

Most of this book has focused on iid sequences of random vari-

ables. Now we consider sequences of dependent random vari-

ables. For example, daily temperatures will form a sequence of

time-ordered random variables and clearly the temperature on

one day is not independent of the temperature on the previous

day.

A stochastic process $\{X_t : t \in T\}$ is a collection of random variables. We shall sometimes write $X(t)$ instead of $X_t$. The variables $X_t$ take values in some set $\mathcal{X}$ called the state space. The set $T$ is called the index set and for our purposes can be thought of as time. The index set can be discrete, $T = \{0, 1, 2, \ldots\}$, or continuous, $T = [0, \infty)$, depending on the application.

Example 24.1 (iid observations.) A sequence of iid random variables can be written as $\{X_t : t \in T\}$ where $T = \{1, 2, 3, \ldots\}$. Thus, a sequence of iid random variables is an example of a stochastic process. ∎

Example 24.2 (The Weather.) Let $\mathcal{X} = \{\text{sunny}, \text{cloudy}\}$. A typical sequence (depending on where you live) might be

sunny, sunny, cloudy, sunny, cloudy, cloudy, ...

This process has a discrete state space and a discrete index set.

Example 24.3 (Stock Prices.) Figure 24.1 shows the price of a stock over time. The price is monitored continuously, so the index set $T$ is not discrete. Price is discrete but, for practical purposes, we can imagine treating it as a continuous variable. ∎

Example 24.4 (Empirical Distribution Function.) Let $X_1, \ldots, X_n \sim F$ where $F$ is some cdf on $[0, 1]$. Let
$$\widehat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \leq t)$$
be the empirical cdf. For any fixed value $t$, $\widehat{F}_n(t)$ is a random variable. But the whole empirical cdf
$$\bigl\{\widehat{F}_n(t) : t \in [0, 1]\bigr\}$$
is a stochastic process with a continuous state space and a continuous index set. ∎

We end this section by recalling a basic fact. If $X_1, \ldots, X_n$ are random variables, then we can write the joint density as
$$f(x_1, \ldots, x_n) = f(x_1) f(x_2 \mid x_1) \cdots f(x_n \mid x_1, \ldots, x_{n-1}) = \prod_{i=1}^n f(x_i \mid \text{past}_i) \qquad (24.1)$$
where $\text{past}_i$ refers to all the variables before $X_i$.

FIGURE 24.1. Stock price over ten week period.

24.2 Markov Chains

The simplest stochastic process is a Markov chain, in which the distribution of $X_t$ depends only on $X_{t-1}$. In this section we assume that the state space is discrete, either $\mathcal{X} = \{1, \ldots, N\}$ or $\mathcal{X} = \{1, 2, \ldots\}$, and that the index set is $T = \{0, 1, 2, \ldots\}$. Typically, most authors write $X_n$ instead of $X_t$ when discussing Markov chains and I will do so as well.

Definition 24.5 The process $\{X_n : n \in T\}$ is a Markov chain if
$$\mathbb{P}(X_n = x \mid X_0, \ldots, X_{n-1}) = \mathbb{P}(X_n = x \mid X_{n-1}) \qquad (24.2)$$
for all $n$ and for all $x \in \mathcal{X}$.

For a Markov chain, equation (24.1) simplifies to
$$f(x_1, \ldots, x_n) = f(x_1) f(x_2 \mid x_1) f(x_3 \mid x_2) \cdots f(x_n \mid x_{n-1}).$$
A Markov chain can be represented by the following DAG:
$$X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow \cdots \rightarrow X_n \rightarrow \cdots$$
Each variable has a single parent, namely, the previous observation.

The theory of Markov chains is very rich and complex. We have to get through many definitions before we can do anything interesting. Our goal is to answer the following questions:

1. When does a Markov chain "settle down" into some sort of equilibrium?

2. How do we estimate the parameters of a Markov chain?

3. How can we construct Markov chains that converge to a given equilibrium distribution, and why would we want to do that?

We will answer questions 1 and 2 in this chapter. We will answer question 3 in the next chapter. To understand question 1, look at the two chains in Figure 24.2. The first chain oscillates all over the place and will continue to do so forever. The second chain eventually settles into an equilibrium. If we constructed a histogram of the first process, it would keep changing as we got more and more observations. But a histogram from the second chain would eventually converge to some fixed distribution.

Transition Probabilities. The key quantities of a Markov chain are the probabilities of jumping from one state into another state.

Definition 24.6 We call
$$\mathbb{P}(X_{n+1} = j \mid X_n = i) \qquad (24.3)$$
the transition probabilities. If the transition probabilities do not change with time, we say the chain is homogeneous. In this case we define $p_{ij} = \mathbb{P}(X_{n+1} = j \mid X_n = i)$. The matrix $\mathbf{P}$ whose $(i, j)$ element is $p_{ij}$ is called the transition matrix.

We will only consider homogeneous chains. Notice that $\mathbf{P}$ has two properties: (i) $p_{ij} \geq 0$ and (ii) $\sum_j p_{ij} = 1$. Each row is a probability mass function. A matrix with these properties is called a stochastic matrix.

Example 24.7 (Random Walk With Absorbing Barriers.) Let $\mathcal{X} = \{1, \ldots, N\}$. Suppose you are standing at one of these points.

FIGURE 24.2. The first chain does not settle down into an equilibrium. The second does.

Flip a coin with $\mathbb{P}(\text{Heads}) = p$ and $\mathbb{P}(\text{Tails}) = q = 1 - p$. If it is heads, take one step to the right. If it is tails, take one step to the left. If you hit one of the endpoints, stay there. The transition matrix is
$$\mathbf{P} = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
0 & 0 & 0 & 0 & q & 0 & p \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}. \qquad ∎$$

Example 24.8 Suppose the state space is $\mathcal{X} = \{\text{sunny}, \text{cloudy}\}$. Then $X_1, X_2, \ldots$ represents the weather for a sequence of days. The weather today clearly depends on yesterday's weather. It might also depend on the weather two days ago but, as a first approximation, we might assume that the dependence is only one day back. In that case the weather is a Markov chain and a typical transition matrix might be

           Sunny   Cloudy
  Sunny     0.4      0.6
  Cloudy    0.8      0.2

For example, if it is sunny today, there is a 60 per cent chance it will be cloudy tomorrow. ∎

Let
$$p_{ij}(n) = \mathbb{P}(X_{m+n} = j \mid X_m = i) \qquad (24.4)$$
be the probability of going from state $i$ to state $j$ in $n$ steps. Let $\mathbf{P}_n$ be the matrix whose $(i, j)$ element is $p_{ij}(n)$. These are called the n-step transition probabilities.

Theorem 24.9 (The Chapman-Kolmogorov equations.) The n-step probabilities satisfy
$$p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n). \qquad (24.5)$$

Proof. Recall that, in general,
$$\mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\, \mathbb{P}(Y = y \mid X = x).$$
This fact is true in the more general form
$$\mathbb{P}(X = x, Y = y \mid Z = z) = \mathbb{P}(X = x \mid Z = z)\, \mathbb{P}(Y = y \mid X = x, Z = z).$$
Also, recall the law of total probability:
$$\mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y).$$
Using these facts and the Markov property we have
$$\begin{aligned}
p_{ij}(m + n) &= \mathbb{P}(X_{m+n} = j \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j, X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k, X_0 = i)\, \mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k)\, \mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k p_{ik}(m)\, p_{kj}(n). \qquad ∎
\end{aligned}$$

Look closely at equation (24.5). This is nothing more than the equation for matrix multiplication. Hence we have shown that
$$\mathbf{P}_{m+n} = \mathbf{P}_m \mathbf{P}_n. \qquad (24.6)$$
By definition, $\mathbf{P}_1 = \mathbf{P}$. Using the above theorem, $\mathbf{P}_2 = \mathbf{P}_{1+1} = \mathbf{P}_1 \mathbf{P}_1 = \mathbf{P}\mathbf{P} = \mathbf{P}^2$. Continuing this way, we see that
$$\mathbf{P}_n = \mathbf{P}^n = \underbrace{\mathbf{P} \times \mathbf{P} \times \cdots \times \mathbf{P}}_{\text{multiply the matrix } n \text{ times}}. \qquad (24.7)$$

Let $\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ be a row vector where
$$\mu_n(i) = \mathbb{P}(X_n = i) \qquad (24.8)$$
is the marginal probability that the chain is in state $i$ at time $n$. In particular, $\mu_0$ is called the initial distribution. To simulate a Markov chain, all you need to know is $\mu_0$ and $\mathbf{P}$. The simulation would look like this:

Step 1: Draw $X_0 \sim \mu_0$. Thus, $\mathbb{P}(X_0 = i) = \mu_0(i)$.

Step 2: Suppose the outcome of step 1 is $i$. Draw $X_1 \sim \mathbf{P}$. In other words, $\mathbb{P}(X_1 = j \mid X_0 = i) = p_{ij}$.

Step 3: Suppose the outcome of step 2 is $j$. Draw $X_2 \sim \mathbf{P}$. In other words, $\mathbb{P}(X_2 = k \mid X_1 = j) = p_{jk}$.

And so on.
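Here is a short R sketch of this simulation scheme (added for illustration); mu0 and P below are the weather-chain quantities used as examples in this section.

## simulate n steps of a finite-state Markov chain from mu0 and P
mc.sim <- function(n, mu0, P) {
  k <- length(mu0)
  x <- numeric(n)
  x[1] <- sample(1:k, 1, prob = mu0)              # draw X_0 ~ mu0
  for (t in 2:n)
    x[t] <- sample(1:k, 1, prob = P[x[t - 1], ])  # draw next state from the current row of P
  x
}

P   <- matrix(c(0.4, 0.6,
                0.8, 0.2), nrow = 2, byrow = TRUE)   # sunny/cloudy chain
mu0 <- c(0.5, 0.5)
x   <- mc.sim(1000, mu0, P)
table(x) / length(x)      # empirical proportion of time spent in each state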

It might be difficult to understand the meaning of $\mu_n$. Imagine simulating the chain many times. Collect all the outcomes at time $n$ from all the chains. A histogram of these outcomes would look approximately like $\mu_n$. A consequence of Theorem 24.9 is the following.

Lemma 24.10 The marginal probabilities are given by
$$\mu_n = \mu_0 \mathbf{P}^n.$$

Proof.
$$\mu_n(j) = \mathbb{P}(X_n = j) = \sum_i \mathbb{P}(X_n = j \mid X_0 = i)\, \mathbb{P}(X_0 = i) = \sum_i \mu_0(i)\, p_{ij}(n) = (\mu_0 \mathbf{P}^n)(j). \qquad ∎$$

Summary

1. Transition matrix: $\mathbf{P}(i, j) = \mathbb{P}(X_{n+1} = j \mid X_n = i)$.

2. $n$-step matrix: $\mathbf{P}_n(i, j) = \mathbb{P}(X_{n+m} = j \mid X_m = i)$.

3. $\mathbf{P}_n = \mathbf{P}^n$.

4. Marginal: $\mu_n(i) = \mathbb{P}(X_n = i)$.

5. $\mu_n = \mu_0 \mathbf{P}^n$.
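The following R sketch (an added illustration) computes the n-step transition matrix and the marginal distribution by repeated matrix multiplication, using the two-state weather chain from Example 24.8.

## n-step transition matrix P_n = P^n by repeated multiplication
mat.power <- function(P, n) {
  out <- diag(nrow(P))
  for (i in seq_len(n)) out <- out %*% P
  out
}

P   <- matrix(c(0.4, 0.6,
                0.8, 0.2), nrow = 2, byrow = TRUE)
mu0 <- c(0.5, 0.5)

mat.power(P, 10)           # 10-step transition probabilities
mu0 %*% mat.power(P, 10)   # marginal distribution mu_10 = mu_0 P^10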

States. The states of a Markov chain can be classified according to various properties.

Definition 24.11 We say that $i$ reaches $j$ (or $j$ is accessible from $i$) if $p_{ij}(n) > 0$ for some $n$, and we write $i \rightarrow j$. If $i \rightarrow j$ and $j \rightarrow i$, then we write $i \leftrightarrow j$ and we say that $i$ and $j$ communicate.

Theorem 24.12 The communication relation satisfies the following properties:

1. $i \leftrightarrow i$.

2. If $i \leftrightarrow j$ then $j \leftrightarrow i$.

3. If $i \leftrightarrow j$ and $j \leftrightarrow k$ then $i \leftrightarrow k$.

4. The set of states $\mathcal{X}$ can be written as a disjoint union of classes $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots$ where two states $i$ and $j$ communicate with each other if and only if they are in the same class.

If all states communicate with each other, then the chain is called irreducible. A set of states is closed if, once you enter that set of states, you never leave. A closed set consisting of a single state is called an absorbing state.

Example 24.13 Let $\mathcal{X} = \{1, 2, 3, 4\}$ and
$$\mathbf{P} = \begin{pmatrix}
\frac{1}{3} & \frac{2}{3} & 0 & 0 \\
\frac{2}{3} & \frac{1}{3} & 0 & 0 \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\
0 & 0 & 0 & 1
\end{pmatrix}$$
The classes are $\{1, 2\}$, $\{3\}$ and $\{4\}$. State 4 is an absorbing state.

Suppose we start a chain in state $i$. Will the chain ever return to state $i$? If so, that state is called persistent or recurrent.

Definition 24.14 State $i$ is recurrent or persistent if
$$\mathbb{P}(X_n = i \text{ for some } n \geq 1 \mid X_0 = i) = 1.$$
Otherwise, state $i$ is transient.

Theorem 24.15 A state $i$ is recurrent if and only if
$$\sum_n p_{ii}(n) = \infty. \qquad (24.9)$$
A state $i$ is transient if and only if
$$\sum_n p_{ii}(n) < \infty. \qquad (24.10)$$

Proof. Define
$$I_n = \begin{cases} 1 & \text{if } X_n = i \\ 0 & \text{if } X_n \neq i. \end{cases}$$

The number of times that the chain is in state $i$ is $Y = \sum_{n=0}^{\infty} I_n$. The mean of $Y$, given that the chain starts in state $i$, is
$$\mathbb{E}(Y \mid X_0 = i) = \sum_{n=0}^{\infty} \mathbb{E}(I_n \mid X_0 = i) = \sum_{n=0}^{\infty} \mathbb{P}(X_n = i \mid X_0 = i) = \sum_{n=0}^{\infty} p_{ii}(n).$$
Define $a_i = \mathbb{P}(X_n = i \text{ for some } n \geq 1 \mid X_0 = i)$. If $i$ is recurrent, $a_i = 1$. Thus, the chain will eventually return to $i$. Once it does return to $i$, we argue again that since $a_i = 1$, the chain will return to state $i$ again. By repeating this argument, we conclude that $\mathbb{E}(Y \mid X_0 = i) = \infty$. If $i$ is transient, then $a_i < 1$. When the chain is in state $i$, there is a probability $1 - a_i > 0$ that it will never return to state $i$. Thus, the probability that the chain is in state $i$ exactly $n$ times is $a_i^{n-1}(1 - a_i)$. This is a geometric distribution, which has finite mean. ∎

Theorem 24.16 Facts about recurrence.

1. If state $i$ is recurrent and $i \leftrightarrow j$, then $j$ is recurrent.

2. If state $i$ is transient and $i \leftrightarrow j$, then $j$ is transient.

3. A finite Markov chain must have at least one recurrent state.

4. The states of a finite, irreducible Markov chain are all recurrent.

Theorem 24.17 (Decomposition Theorem.) The state space $\mathcal{X}$ can be written as the disjoint union
$$\mathcal{X} = \mathcal{X}_T \cup \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots$$
where $\mathcal{X}_T$ are the transient states and each $\mathcal{X}_i$ is a closed, irreducible set of recurrent states.

Convergence of Markov Chains. To discuss the convergence of chains, we need a few more definitions. Suppose that $X_0 = i$. Define the recurrence time
$$T_{ij} = \min\{n > 0 : X_n = j\} \qquad (24.11)$$
assuming the chain ever visits state $j$; otherwise define $T_{ij} = \infty$. The mean recurrence time of a recurrent state $i$ is
$$m_i = \mathbb{E}(T_{ii}) = \sum_n n f_{ii}(n) \qquad (24.12)$$
where
$$f_{ij}(n) = \mathbb{P}(X_1 \neq j, X_2 \neq j, \ldots, X_{n-1} \neq j, X_n = j \mid X_0 = i).$$
A recurrent state is null if $m_i = \infty$; otherwise it is called non-null or positive.

Lemma 24.18 If a state is null and recurrent, then $p_{ii}(n) \rightarrow 0$.

Lemma 24.19 In a finite state Markov chain, all recurrent states are positive.

Consider a three state chain with transition matrix
$$\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$
Suppose we start the chain in state 1. Then we will return to state 1 at times 3, 6, 9, .... This is an example of a periodic chain. Formally, the period of state $i$ is $d$ if $p_{ii}(n) = 0$ whenever $n$ is not divisible by $d$ and $d$ is the largest integer with this property. Thus, $d = \gcd\{n : p_{ii}(n) > 0\}$ where gcd means "greatest common divisor." State $i$ is periodic if $d(i) > 1$ and aperiodic if $d(i) = 1$.

Lemma 24.20 If state $i$ has period $d$ and $i \leftrightarrow j$, then $j$ has period $d$.

Definition 24.21 A state is ergodic if it is recurrent, non-null and aperiodic. A chain is ergodic if all its states are ergodic.

Let $\pi = (\pi_i : i \in \mathcal{X})$ be a vector of non-negative numbers that sum to one. Thus $\pi$ can be thought of as a probability mass function.

Definition 24.22 We say that $\pi$ is a stationary (or invariant) distribution if $\pi = \pi \mathbf{P}$.

Here is the intuition. Draw $X_0$ from distribution $\pi$ and suppose that $\pi$ is a stationary distribution. Now draw $X_1$ according to the transition probability of the chain. The distribution of $X_1$ is then $\mu_1 = \mu_0 \mathbf{P} = \pi \mathbf{P} = \pi$. The distribution of $X_2$ is $\pi \mathbf{P}^2 = (\pi \mathbf{P})\mathbf{P} = \pi \mathbf{P} = \pi$. Continuing this way, we see that the distribution of $X_n$ is $\pi \mathbf{P}^n = \pi$. In other words:

If at any time the chain has distribution $\pi$, then it will continue to have distribution $\pi$ forever.

Definition 24.23 We say that a chain has a limiting distribution if
$$\mathbf{P}^n \rightarrow \begin{pmatrix} \pi \\ \pi \\ \vdots \end{pmatrix}$$
for some $\pi$. In other words, $\pi_j = \lim_{n \to \infty} \mathbf{P}^n_{ij}$ exists and is independent of $i$.

Here is the main theorem.

Theorem 24.24 An irreducible, ergodic Markov chain has a unique stationary distribution $\pi$. The limiting distribution exists and is equal to $\pi$. If $g$ is any bounded function, then, with probability 1,
$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N g(X_n) \rightarrow \mathbb{E}_\pi(g) \equiv \sum_j g(j)\, \pi_j. \qquad (24.13)$$

The last statement of the theorem is the law of large numbers for Markov chains. It says that sample averages converge to their expectations. Finally, there is a special condition which will be useful later. We say that $\pi$ satisfies detailed balance if
$$\pi_i p_{ij} = p_{ji} \pi_j. \qquad (24.14)$$
Detailed balance guarantees that $\pi$ is a stationary distribution.

Theorem 24.25 If $\pi$ satisfies detailed balance, then $\pi$ is a stationary distribution.

Proof. We need to show that $\pi \mathbf{P} = \pi$. The $j$th element of $\pi \mathbf{P}$ is
$$\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j. \qquad ∎$$

The importance of detailed balance will become clear when we discuss Markov chain Monte Carlo methods.

Warning! Just because a chain has a stationary distribution does not mean it converges.

Example 24.26 Let
$$\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$
Let $\pi = (1/3, 1/3, 1/3)$. Then $\pi \mathbf{P} = \pi$, so $\pi$ is a stationary distribution. If the chain is started with the distribution $\pi$, it will stay in that distribution. Imagine simulating many chains and checking the marginal distribution at each time $n$. It will always be the uniform distribution $\pi$. But this chain does not have a limit. It continues to cycle around forever. ∎
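As an added illustration, here is an R sketch that finds a stationary distribution numerically as the left eigenvector of P with eigenvalue 1 (equivalently, the leading eigenvector of the transpose), applied to the cyclic chain above.

stationary <- function(P) {
  ## left eigenvector of P for eigenvalue 1: solve t(P) v = v, then normalize
  e <- eigen(t(P))
  v <- Re(e$vectors[, which.min(abs(e$values - 1))])
  v / sum(v)
}

P <- matrix(c(0, 1, 0,
              0, 0, 1,
              1, 0, 0), nrow = 3, byrow = TRUE)
stationary(P)   # (1/3, 1/3, 1/3), even though this chain has no limiting distribution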

Examples of Markov Chains

Example 24.27 (Random Walk.) Let $\mathcal{X} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and suppose that $p_{i,i+1} = p$, $p_{i,i-1} = q = 1 - p$. All states communicate, hence either all the states are recurrent or all are transient. To see which, suppose we start at $X_0 = 0$. Note that
$$p_{00}(2n) = \binom{2n}{n} p^n q^n \qquad (24.15)$$
since the only way to get back to 0 is to have $n$ heads (steps to the right) and $n$ tails (steps to the left). We can approximate this expression using Stirling's formula, which says that
$$n! \sim n^n \sqrt{n}\, e^{-n} \sqrt{2\pi}.$$
Inserting this approximation into (24.15) shows that
$$p_{00}(2n) \sim \frac{(4pq)^n}{\sqrt{n\pi}}.$$
It is easy to check that $\sum_n p_{00}(n) < \infty$ if and only if $\sum_n p_{00}(2n) < \infty$. Moreover, $\sum_n p_{00}(2n) = \infty$ if and only if $p = q = 1/2$. By Theorem 24.15, the chain is recurrent if $p = 1/2$; otherwise it is transient. ∎

Example 24.28 Let $\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$. Let
$$\mathbf{P} = \begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\
\frac{1}{4} & \frac{3}{4} & 0 & 0 & 0 & 0 \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & 0 & 0 \\
\frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{4} \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2}
\end{pmatrix}$$
Then $C_1 = \{1, 2\}$ and $C_2 = \{5, 6\}$ are irreducible closed sets. States 3 and 4 are transient because of the path $3 \rightarrow 4 \rightarrow 6$, and once you hit state 6 you cannot return to 3 or 4. Since $p_{ii}(1) > 0$ for all $i$, all the states are aperiodic. In summary, 3 and 4 are transient while 1, 2, 5 and 6 are ergodic. ∎

Example 24.29 (Hardy-Weinberg.) Here is a famous example from genetics. Suppose a gene can be type A or type a. There are three types of people (called genotypes): AA, Aa and aa. Let $(p, q, r)$ denote the fraction of people of each genotype. We assume that everyone contributes one of their two copies of the gene at random to their children. We also assume that mates are selected at random. The latter is not realistic; however, it is often reasonable to assume that you do not choose your mate based on whether they are AA, Aa or aa. (This would be false if the gene were for eye color and people chose mates based on eye color.) Imagine that we pooled everyone's genes together. The proportion of A genes is $P = p + (q/2)$ and the proportion of a genes is $Q = r + (q/2)$. A child is AA with probability $P^2$, aA with probability $2PQ$, and aa with probability $Q^2$. Thus, the fraction of A genes in this generation is
$$P^2 + PQ = \Bigl(p + \frac{q}{2}\Bigr)^2 + \Bigl(p + \frac{q}{2}\Bigr)\Bigl(r + \frac{q}{2}\Bigr).$$
However, $r = 1 - p - q$. Substitute this in the above equation and you get $P^2 + PQ = P$. A similar calculation shows that the fraction of a genes is $Q$. We have shown that the proportions of type A and type a are $P$ and $Q$, and this remains stable after the first generation. The proportion of people of type AA, Aa, aa is thus $(P^2, 2PQ, Q^2)$ from the second generation on. This is called the Hardy-Weinberg law.

called the Hardy-Weinberg law.

Assume everyone has exactly one child. Now consider a �xed

person and let Xn be the genotype of their nth descendent. This

is a Markov chain with state space X = fAA;Aa; aag. Some

basic calculations will show you that the transition matrix is24 P Q 0

P

2

P+Q

2

Q

2

0 P Q

35 :

The stationary distribution is � = (P 2; 2PQ;Q2). �

Inference for Markov Chains. Consider a chain with finite state space $\mathcal{X} = \{1, 2, \ldots, N\}$. Suppose we observe $n$ observations $X_1, \ldots, X_n$ from this chain. The unknown parameters of a Markov chain are the initial probabilities $\mu_0 = (\mu_0(1), \mu_0(2), \ldots)$ and the elements of the transition matrix $\mathbf{P}$. Each row of $\mathbf{P}$ is a multinomial distribution, so we are essentially estimating $N$ distributions (plus the initial probabilities). Let $n_{ij}$ be the observed number of transitions from state $i$ to state $j$. The likelihood function is
$$\mathcal{L}(\mu_0, \mathbf{P}) = \mu_0(x_0) \prod_{r=1}^n p_{X_{r-1}, X_r} = \mu_0(x_0) \prod_{i=1}^N \prod_{j=1}^N p_{ij}^{n_{ij}}.$$

There is only one observation on $\mu_0$, so we can't estimate it. Rather, we focus on estimating $\mathbf{P}$. The mle is obtained by maximizing $\mathcal{L}(\mu_0, \mathbf{P})$ subject to the constraint that the elements are non-negative and the rows sum to 1. The solution is
$$\hat{p}_{ij} = \frac{n_{ij}}{n_i}$$
where $n_i = \sum_{j=1}^N n_{ij}$. Here we are assuming that $n_i > 0$. If not, then we set $\hat{p}_{ij} = 0$ by convention.
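Here is a small R sketch of this mle (added for illustration): it counts the observed transitions with table() and divides each row by its row total.

## maximum likelihood estimate of the transition matrix from an observed chain x
estimate.P <- function(x, N = max(x)) {
  ## n_ij = number of observed transitions from state i to state j
  nij <- table(factor(x[-length(x)], levels = 1:N),
               factor(x[-1],         levels = 1:N))
  nij <- matrix(as.numeric(nij), N, N)
  ni  <- rowSums(nij)
  nij / ifelse(ni > 0, ni, 1)    # hat p_ij = n_ij / n_i; rows never visited stay 0
}

## example: a two-state chain simulated from a known P, then re-estimated
P <- matrix(c(0.4, 0.6, 0.8, 0.2), 2, 2, byrow = TRUE)
x <- numeric(5000); x[1] <- 1
for (t in 2:5000) x[t] <- sample(1:2, 1, prob = P[x[t - 1], ])
estimate.P(x, N = 2)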

Theorem 24.30 (Consistency and Asymptotic Normality of the mle.) Assume that the chain is ergodic. Let $\hat{p}_{ij}(n)$ denote the mle after $n$ observations. Then $\hat{p}_{ij}(n) \xrightarrow{P} p_{ij}$. Also,
$$\Bigl[\sqrt{N_i(n)}\,\bigl(\hat{p}_{ij} - p_{ij}\bigr)\Bigr] \rightsquigarrow N(0, \Sigma)$$
where the left hand side is a matrix, $N_i(n) = \sum_{r=1}^n I(X_r = i)$ and
$$\Sigma_{ij,k\ell} = \begin{cases}
p_{ij}(1 - p_{ij}) & (i, j) = (k, \ell) \\
-p_{ij}\, p_{i\ell} & i = k,\ j \neq \ell \\
0 & \text{otherwise.}
\end{cases}$$

24.3 Poisson Processes

One of the most studied and useful stochastic processes is the Poisson process. It arises when we count occurrences of events over time, for example, traffic accidents, radioactive decay, arrival of email messages, etc. As the name suggests, the Poisson process is intimately related to the Poisson distribution. Let's first review the Poisson distribution.

Recall that $X$ has a Poisson distribution with parameter $\lambda$, written $X \sim \text{Poisson}(\lambda)$, if
$$\mathbb{P}(X = x) \equiv p(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots$$

Also recall that $\mathbb{E}(X) = \lambda$ and $\mathbb{V}(X) = \lambda$. If $X \sim \text{Poisson}(\lambda)$, $Y \sim \text{Poisson}(\nu)$ and $X$ and $Y$ are independent, then $X + Y \sim \text{Poisson}(\lambda + \nu)$. Finally, if $N \sim \text{Poisson}(\lambda)$ and $Y \mid N = n \sim \text{Binomial}(n, p)$, then the marginal distribution of $Y$ is $Y \sim \text{Poisson}(\lambda p)$.

Now we describe the Poisson process. Imagine you are at your computer. Each time a new email message arrives you record the time. Let $X_t$ be the number of messages you have received up to and including time $t$. Then $\{X_t : t \in [0, \infty)\}$ is a stochastic process with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$. A process of this form is called a counting process. A Poisson process is a counting process that satisfies certain conditions. In what follows, we will sometimes write $X(t)$ instead of $X_t$. Also, we need the following notation. Write $f(h) = o(h)$ if $f(h)/h \rightarrow 0$ as $h \rightarrow 0$. This means that $f(h)$ is smaller than $h$ when $h$ is close to 0. For example, $h^2 = o(h)$.

Definition 24.31 A Poisson process is a stochastic process $\{X_t : t \in [0, \infty)\}$ with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$ such that

1. $X(0) = 0$.

2. For any $0 = t_0 < t_1 < t_2 < \cdots < t_n$, the increments
$$X(t_1) - X(t_0),\ X(t_2) - X(t_1),\ \ldots,\ X(t_n) - X(t_{n-1})$$
are independent.

3. There is a function $\lambda(t)$ such that
$$\mathbb{P}(X(t + h) - X(t) = 1) = \lambda(t)h + o(h) \qquad (24.16)$$
$$\mathbb{P}(X(t + h) - X(t) \geq 2) = o(h). \qquad (24.17)$$

We call $\lambda(t)$ the intensity function.

The last condition means that the probability of an event in $[t, t + h]$ is approximately $h\lambda(t)$, while the probability of more than one event is small.

Theorem 24.32 If $X_t$ is a Poisson process with intensity function $\lambda(t)$, then
$$X(s + t) - X(s) \sim \text{Poisson}\bigl(m(s + t) - m(s)\bigr)$$
where
$$m(t) = \int_0^t \lambda(s)\, ds.$$
In particular, $X(t) \sim \text{Poisson}(m(t))$. Hence, $\mathbb{E}(X(t)) = m(t)$ and $\mathbb{V}(X(t)) = m(t)$.

Definition 24.33 A Poisson process with intensity function $\lambda(t) \equiv \lambda$ for some $\lambda > 0$ is called a homogeneous Poisson process with rate $\lambda$. In this case,
$$X(t) \sim \text{Poisson}(\lambda t).$$

Let $X(t)$ be a homogeneous Poisson process with rate $\lambda$. Let $W_n$ be the time at which the $n$th event occurs and set $W_0 = 0$. The random variables $W_0, W_1, \ldots$ are called waiting times. Let $S_n = W_{n+1} - W_n$. Then $S_0, S_1, \ldots$ are called sojourn times or interarrival times.

Theorem 24.34 The sojourn times $S_0, S_1, \ldots$ are iid random variables. Their distribution is exponential with mean $1/\lambda$, that is, they have density
$$f(s) = \lambda e^{-\lambda s}, \qquad s \geq 0.$$
The waiting time $W_n \sim \text{Gamma}(n, 1/\lambda)$, i.e. it has density
$$f(w) = \frac{1}{\Gamma(n)} \lambda^n w^{n-1} e^{-\lambda w}.$$
Hence, $\mathbb{E}(W_n) = n/\lambda$ and $\mathbb{V}(W_n) = n/\lambda^2$.

Proof. First, we have
$$\mathbb{P}(S_1 > t) = \mathbb{P}(X(t) = 0) = e^{-\lambda t},$$
which shows that the cdf of $S_1$ is $1 - e^{-\lambda t}$. This proves the result for $S_1$. Now,
$$\begin{aligned}
\mathbb{P}(S_2 > t \mid S_1 = s) &= \mathbb{P}(\text{no events in } (s, s + t] \mid S_1 = s) \\
&= \mathbb{P}(\text{no events in } (s, s + t]) \quad \text{(increments are independent)} \\
&= e^{-\lambda t}.
\end{aligned}$$
Hence, $S_2$ has an exponential distribution and is independent of $S_1$. The result follows by repeating the argument. The result for $W_n$ follows since a sum of independent exponentials has a Gamma distribution. ∎
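For illustration (added, not from the original), here is an R sketch that simulates a homogeneous Poisson process on [0, T] by cumulating iid exponential sojourn times; the number of exponentials generated is a generous guess, which suffices for a sketch.

## simulate event (waiting) times of a rate-lambda Poisson process on [0, T]
rpois.process <- function(lambda, T) {
  w <- cumsum(rexp(ceiling(3 * lambda * T) + 20, rate = lambda))  # sojourn times, cumulated
  w[w <= T]
}

set.seed(1)
lambda <- 2; T <- 10
w <- rpois.process(lambda, T)
length(w)    # roughly Poisson with mean lambda * T = 20
plot(w, seq_along(w), type = "s", xlab = "time", ylab = "X(t)")   # sample path of the counting process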

Example 24.35 Figures 24.3 and 24.4 show requests to a WWW server in Calgary.¹ Assuming that this is a homogeneous Poisson process, $N \equiv X(T) \sim \text{Poisson}(\lambda T)$. The likelihood is
$$\mathcal{L}(\lambda) \propto e^{-\lambda T}(\lambda T)^N$$
which is maximized at
$$\hat{\lambda} = \frac{N}{T} = 48.0077$$
in units per minute.

In the preceding example we might want to check whether the assumption that the data follow a Poisson process is reasonable. To proceed, let us first note that if a Poisson process is observed on $[0, T]$ then, given $N = X(T)$, the event times $W_1, \ldots, W_N$ are a sample from a Uniform$(0, T)$ distribution. To see intuitively why this is true, note that the probability of finding an event in a small interval $[t, t + h]$ is proportional to $h$. But this doesn't depend on $t$. Since the probability is constant over $t$, it must be uniform.

Figure 24.5 shows a histogram and some density estimates. The density does not appear to be uniform. To test this formally, let us divide $[0, T]$ into 4 equal length intervals $I_1, I_2, I_3, I_4$. Let $p_i$ be the probability of a point being in $I_i$. The null hypothesis is that $p_1 = p_2 = p_3 = p_4$. We can test this hypothesis using either a likelihood ratio test or a $\chi^2$ test. The latter is
$$\sum_{i=1}^4 \frac{(O_i - E_i)^2}{E_i}$$

¹ See http://ita.ee.lbl.gov/html/contrib/Calgary-HTTP.html for more information.

FIGURE 24.3. Hits on a web server. First plot is first 20 events. Second plot is several hours worth of data.

FIGURE 24.4. X(t) for hits on a web server. First plot is first 20 events. Second plot is several hours worth of data.

where $O_i$ is the number of observations in $I_i$ and $E_i = n/4$ is the expected number under the null. This yields $\chi^2 = 252$, which has a p-value of 0. This is strong evidence against the null.
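Here is a brief R sketch of this check (added for illustration) on simulated event times; chisq.test() applied to a vector of counts tests the null hypothesis that all cells are equally likely.

set.seed(1)
T <- 1200                                   # observation window (minutes)
w <- sort(runif(300, 0, T))                 # event times; replace with the real data

## count events in 4 equal-length intervals and test uniformity
O <- table(cut(w, breaks = seq(0, T, length.out = 5)))
chisq.test(as.numeric(O))                   # default null: equal cell probabilities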

24.4 Hidden Markov Models

24.5 Bibliographic Remarks

This is standard material and there are many good references. In this chapter I followed these references quite closely: Probability and Random Processes by Grimmett and Stirzaker (1982), An Introduction to Stochastic Modeling by Taylor and Karlin (1998), Stochastic Modeling of Scientific Data by Guttorp (1995), and Probability Models for Computer Science by Ross (2002). Some of the exercises are from these texts.

24.6 Exercises

1. Let $X_0, X_1, \ldots$ be a Markov chain with states $\{0, 1, 2\}$ and transition matrix
$$\mathbf{P} = \begin{pmatrix} 0.1 & 0.2 & 0.7 \\ 0.9 & 0.1 & 0.0 \\ 0.1 & 0.8 & 0.1 \end{pmatrix}$$
Assume that $\mu_0 = (0.3, 0.4, 0.3)$. Find $\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 2)$ and $\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 1)$.

2. Let $Y_1, Y_2, \ldots$ be a sequence of iid observations such that $\mathbb{P}(Y = 0) = 0.1$, $\mathbb{P}(Y = 1) = 0.3$, $\mathbb{P}(Y = 2) = 0.2$, $\mathbb{P}(Y = 3) = 0.4$. Let $X_0 = 0$ and let
$$X_n = \max\{Y_1, \ldots, Y_n\}.$$
Show that $X_0, X_1, \ldots$ is a Markov chain and find the transition matrix.

FIGURE 24.5. Density estimate of event times.

3. Consider a two state Markov chain with states $\mathcal{X} = \{1, 2\}$ and transition matrix
$$\mathbf{P} = \begin{pmatrix} 1 - a & a \\ b & 1 - b \end{pmatrix}$$
where $0 < a < 1$ and $0 < b < 1$. Prove that
$$\lim_{n \to \infty} \mathbf{P}^n = \begin{pmatrix} \frac{b}{a+b} & \frac{a}{a+b} \\ \frac{b}{a+b} & \frac{a}{a+b} \end{pmatrix}.$$

4. Consider the chain from question 3 and set $a = 0.1$ and $b = 0.3$. Simulate the chain. Let
$$\hat{p}_n(1) = \frac{1}{n} \sum_{i=1}^n I(X_i = 1) \quad\text{and}\quad \hat{p}_n(2) = \frac{1}{n} \sum_{i=1}^n I(X_i = 2)$$
be the proportions of time the chain is in state 1 and state 2. Plot $\hat{p}_n(1)$ and $\hat{p}_n(2)$ versus $n$ and verify that they converge to the values predicted from the answer in the previous question.

5. An important Markov chain is the branching process, which is used in biology, genetics, nuclear physics and many other fields. Suppose that an animal has $Y$ children. Let $p_k = \mathbb{P}(Y = k)$. Hence, $p_k \geq 0$ for all $k$ and $\sum_{k=0}^{\infty} p_k = 1$. Assume each animal has the same lifespan and that they produce offspring according to the distribution $p_k$. Let $X_n$ be the number of animals in the $n$th generation. Let $Y_1^{(n)}, \ldots, Y_{X_n}^{(n)}$ be the offspring produced in the $n$th generation. Note that
$$X_{n+1} = Y_1^{(n)} + \cdots + Y_{X_n}^{(n)}.$$
Let $\mu = \mathbb{E}(Y)$ and $\sigma^2 = \mathbb{V}(Y)$. Assume throughout this question that $X_0 = 1$. Let $M(n) = \mathbb{E}(X_n)$ and $V(n) = \mathbb{V}(X_n)$.

(a) Show that $M(n + 1) = \mu M(n)$ and $V(n + 1) = \sigma^2 M(n) + \mu^2 V(n)$.

(b) Show that $M(n) = \mu^n$ and that $V(n) = \sigma^2 \mu^{n-1}(1 + \mu + \cdots + \mu^{n-1})$.

(c) What happens to the variance if $\mu > 1$? What happens to the variance if $\mu = 1$? What happens to the variance if $\mu < 1$?

(d) The population goes extinct if $X_n = 0$ for some $n$. Let us thus define the extinction time $N$ by
$$N = \min\{n : X_n = 0\}.$$
Let $F(n) = \mathbb{P}(N \leq n)$ be the cdf of the random variable $N$. Show that
$$F(n) = \sum_{k=0}^{\infty} p_k \bigl(F(n - 1)\bigr)^k, \qquad n = 1, 2, \ldots$$
Hint: Note that the event $\{N \leq n\}$ is the same as the event $\{X_n = 0\}$. Thus, $\mathbb{P}(\{N \leq n\}) = \mathbb{P}(\{X_n = 0\})$. Let $k$ be the number of offspring of the original parent. The population becomes extinct at time $n$ if and only if each of the $k$ sub-populations generated from the $k$ offspring goes extinct in $n - 1$ generations.

(e) Suppose that $p_0 = 1/4$, $p_1 = 1/2$, $p_2 = 1/4$. Use the formula from (5d) to compute the cdf $F(n)$.

6. Let
$$\mathbf{P} = \begin{pmatrix} 0.40 & 0.50 & 0.10 \\ 0.05 & 0.70 & 0.25 \\ 0.05 & 0.50 & 0.45 \end{pmatrix}$$
Find the stationary distribution $\pi$.

7. Show that if $i$ is a recurrent state and $i \leftrightarrow j$, then $j$ is a recurrent state.

8. Let
$$\mathbf{P} = \begin{pmatrix}
\frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} \\
\frac{1}{2} & \frac{1}{4} & \frac{1}{4} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & 0 & 0 & \frac{1}{4} \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$
Which states are transient? Which states are recurrent?

9. Let
$$\mathbf{P} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
Show that $\pi = (1/2, 1/2)$ is a stationary distribution. Does this chain converge? Why/why not?

10. Let $0 < p < 1$ and $q = 1 - p$. Let
$$\mathbf{P} = \begin{pmatrix}
q & p & 0 & 0 & 0 \\
q & 0 & p & 0 & 0 \\
q & 0 & 0 & p & 0 \\
q & 0 & 0 & 0 & p \\
1 & 0 & 0 & 0 & 0
\end{pmatrix}$$
Find the limiting distribution of the chain.

11. Let $X(t)$ be an inhomogeneous Poisson process with intensity function $\lambda(t) > 0$. Let $\Lambda(t) = \int_0^t \lambda(u)\,du$. Define $Y(s) = X(t)$ where $s = \Lambda(t)$. Show that $Y(s)$ is a homogeneous Poisson process with intensity $\lambda = 1$.

12. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the conditional distribution of $X(t)$ given that $X(t + s) = n$.

13. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the probability that $X(t)$ is odd, i.e. $\mathbb{P}(X(t) \in \{1, 3, 5, \ldots\})$.

14. Suppose that logins to the university computer system are described by a Poisson process $X(t)$ with intensity $\lambda$. Assume that a person stays logged in for some random time with cdf $G$. Assume these times are all independent. Let $Y(t)$ be the number of people on the system at time $t$. Find the distribution of $Y(t)$.

15. Let $X(t)$ be a Poisson process with intensity $\lambda$. Let $W_1, W_2, \ldots$ be the waiting times. Let $f$ be an arbitrary function. Show that
$$\mathbb{E}\left(\sum_{i=1}^{X(t)} f(W_i)\right) = \lambda \int_0^t f(w)\,dw.$$

16. A two dimensional Poisson point process is a process of random points on the plane such that (i) for any set $A$, the number of points falling in $A$ is Poisson with mean $\lambda \mu(A)$, where $\mu(A)$ is the area of $A$, and (ii) the numbers of events in nonoverlapping regions are independent. Consider an arbitrary point $x_0$ in the plane. Let $X$ denote the distance from $x_0$ to the nearest random point. Show that
$$\mathbb{P}(X > t) = e^{-\lambda \pi t^2} \quad\text{and}\quad \mathbb{E}(X) = \frac{1}{2\sqrt{\lambda}}.$$


25

Simulation Methods

In this chapter we will see that by generating data in a clever way, we can solve a number of problems such as integrating or maximizing a complicated function.¹ For integration, we will study three methods: (i) basic Monte Carlo integration, (ii) importance sampling, and (iii) Markov chain Monte Carlo (MCMC).

25.1 Bayesian Inference Revisited

Simulation methods are especially useful in Bayesian inference, so let us briefly review the main ideas. Given a prior $f(\theta)$ and data $X^n = (X_1, \ldots, X_n)$, the posterior density is
$$f(\theta \mid X^n) = \frac{\mathcal{L}(\theta) f(\theta)}{\int \mathcal{L}(u) f(u)\,du}$$

¹ My main source for this chapter is Monte Carlo Statistical Methods by C. Robert and G. Casella.

where $\mathcal{L}(\theta)$ is the likelihood function. The posterior mean is
$$\overline{\theta} = \int \theta f(\theta \mid X^n)\,d\theta = \frac{\int \theta \mathcal{L}(\theta) f(\theta)\,d\theta}{\int \mathcal{L}(\theta) f(\theta)\,d\theta}.$$
If $\theta = (\theta_1, \ldots, \theta_k)$ is multidimensional, then we might be interested in the posterior for one of the components, $\theta_1$, say. This marginal posterior density is
$$f(\theta_1 \mid X^n) = \int\!\!\int \cdots \int f(\theta_1, \ldots, \theta_k \mid X^n)\,d\theta_2 \cdots d\theta_k,$$
which involves high dimensional integration.

You can see that integrals play a big role in Bayesian inference. When $\theta$ is high dimensional, it may not be feasible to calculate these integrals analytically. Simulation methods will often be very helpful.

25.2 Basic Monte Carlo Integration

Suppose you want to evaluate the integral $I = \int_a^b h(x)\,dx$ for some function $h$. If $h$ is an "easy" function like a polynomial or trigonometric function, then we can do the integral in closed form. In practice, $h$ can be very complicated and there may be no known closed form expression for $I$. There are many numerical techniques for evaluating $I$ such as Simpson's rule, the trapezoidal rule, Gaussian quadrature and so on. In some cases these techniques work very well, but other times they may not work so well. In particular, it is hard to extend them to higher dimensions. Monte Carlo integration is another approach to evaluating $I$ which is notable for its simplicity, generality and scalability.

Let us begin by writing
$$I = \int_a^b h(x)\,dx = \int_a^b h(x)(b - a)\frac{1}{b - a}\,dx = \int_a^b w(x) f(x)\,dx$$

where $w(x) = h(x)(b - a)$ and $f(x) = 1/(b - a)$. Notice that $f$ is the density for a uniform random variable over $(a, b)$. Hence,
$$I = \mathbb{E}_f(w(X))$$
where $X \sim \text{Unif}(a, b)$.

Suppose we generate $X_1, \ldots, X_N \sim \text{Unif}(a, b)$ where $N$ is large. By the law of large numbers
$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^N w(X_i) \xrightarrow{p} \mathbb{E}(w(X)) = I.$$
This is the basic Monte Carlo integration method. We can also compute the standard error of the estimate
$$\widehat{\mathrm{se}} = \frac{s}{\sqrt{N}}$$
where
$$s^2 = \frac{\sum_i (Y_i - \hat{I})^2}{N - 1}$$
and $Y_i = w(X_i)$. A $1 - \alpha$ confidence interval for $I$ is $\hat{I} \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$. We can take $N$ as large as we want and hence make the length of the confidence interval very small.

Example 25.1 Let's try this on an example where we know the true answer. Let $h(x) = x^3$. Hence, $I = \int_0^1 x^3\,dx = 1/4$. Based on $N = 10{,}000$ draws I got $\hat{I} = 0.2481101$ with a standard error of 0.0028. Not bad at all.
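Here is the corresponding R sketch (added for illustration); the numbers will vary slightly from run to run.

set.seed(1)
N <- 10000
x <- runif(N)              # X_1, ..., X_N ~ Unif(0, 1)
w <- x^3 * (1 - 0)         # w(x) = h(x) (b - a)
I.hat <- mean(w)
se    <- sd(w) / sqrt(N)
c(I.hat, se)               # estimate of 1/4 and its standard error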

A simple generalization of the basic method is to consider integrals of the form
$$I = \int h(x) f(x)\,dx$$
where $f(x)$ is a probability density function. Taking $f$ to be a Unif$(a, b)$ density gives us the special case above. The only difference is that now we draw $X_1, \ldots, X_N \sim f$ and take
$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^N h(X_i)$$
as before.

Example 25.2 Let
$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
be the standard Normal pdf. Suppose we want to compute the cdf at some point $x$:
$$I = \int_{-\infty}^x f(s)\,ds = \Phi(x).$$
Of course, we could look this up in a table or use the pnorm command in R. But suppose you don't have access to either. Write
$$I = \int h(s) f(s)\,ds$$
where
$$h(s) = \begin{cases} 1 & s < x \\ 0 & s \geq x. \end{cases}$$
Now we generate $X_1, \ldots, X_N \sim N(0, 1)$ and set
$$\hat{I} = \frac{1}{N} \sum_i h(X_i) = \frac{\text{number of observations less than } x}{N}.$$
For example, with $x = 2$, the true answer is $\Phi(2) = 0.9772$ and the Monte Carlo estimate with $N = 10{,}000$ yields 0.9751. Using $N = 100{,}000$ I get 0.9771.

Example 25.3 (Two Binomials) Let $X \sim \text{Binomial}(n, p_1)$ and $Y \sim \text{Binomial}(m, p_2)$. We would like to estimate $\delta = p_2 - p_1$. The mle is $\hat{\delta} = \hat{p}_2 - \hat{p}_1 = (Y/m) - (X/n)$. We can get the standard error $\widehat{\mathrm{se}}$ using the delta method (remember?), which yields
$$\widehat{\mathrm{se}} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n} + \frac{\hat{p}_2(1 - \hat{p}_2)}{m}}$$

and then construct a 95 per cent confidence interval $\hat{\delta} \pm 2\,\widehat{\mathrm{se}}$. Now consider a Bayesian analysis. Suppose we use the prior $f(p_1, p_2) = f(p_1) f(p_2) = 1$, that is, a flat prior on $(p_1, p_2)$. The posterior is
$$f(p_1, p_2 \mid X, Y) \propto p_1^X (1 - p_1)^{n - X}\, p_2^Y (1 - p_2)^{m - Y}.$$
The posterior mean of $\delta$ is
$$\overline{\delta} = \int_0^1\!\!\int_0^1 \delta(p_1, p_2) f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2 = \int_0^1\!\!\int_0^1 (p_2 - p_1) f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2.$$
If we want the posterior density of $\delta$, we can first get the posterior cdf
$$F(c \mid X, Y) = \mathbb{P}(\delta \leq c \mid X, Y) = \int_A f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2$$
where $A = \{(p_1, p_2) : p_2 - p_1 \leq c\}$. The density can then be obtained by differentiating $F$.

To avoid all these integrals, let's use simulation. Note that $f(p_1, p_2 \mid X, Y) = f(p_1 \mid X) f(p_2 \mid Y)$, which implies that $p_1$ and $p_2$ are independent under the posterior distribution. Also, we see that $p_1 \mid X \sim \text{Beta}(X + 1, n - X + 1)$ and $p_2 \mid Y \sim \text{Beta}(Y + 1, m - Y + 1)$. Hence, we can simulate $(P_1^{(1)}, P_2^{(1)}), \ldots, (P_1^{(N)}, P_2^{(N)})$ from the posterior by drawing

$$P_1^{(i)} \sim \text{Beta}(X + 1, n - X + 1), \qquad P_2^{(i)} \sim \text{Beta}(Y + 1, m - Y + 1)$$
for $i = 1, \ldots, N$. Now let $\delta^{(i)} = P_2^{(i)} - P_1^{(i)}$. Then,
$$\overline{\delta} \approx \frac{1}{N} \sum_i \delta^{(i)}.$$
We can also get a 95 per cent posterior interval for $\delta$ by sorting the simulated values and finding the 0.025 and 0.975 quantiles. The posterior density $f(\delta \mid X, Y)$ can be obtained by applying density estimation techniques to $\delta^{(1)}, \ldots, \delta^{(N)}$ or simply by plotting a histogram.

For example, suppose that n = m = 10, X = 8 and Y = 6.

The R code is:

x <- 8; y <- 6

n <- 10; m <- 10

B <- 10000

p1 <- rbeta(B,x+1,n-x+1)

p2 <- rbeta(B,y+1,m-y+1)

delta <- p2 - p1

print(mean(delta))

left <- quantile(delta,.025)

right <- quantile(delta,.975)

print(c(left,right))

Running this code gives the posterior mean of $\delta$ and a 95 per cent posterior interval of $(-0.52, 0.20)$. The posterior density can be estimated from a histogram of the simulated values, as shown in Figure 25.1.

Example 25.4 (Dose Response) Suppose we conduct an experiment by giving rats one of ten possible doses of a drug, denoted by $x_1 < x_2 < \cdots < x_{10}$. For each dose level $x_i$ we use $n$ rats and we observe $Y_i$, the number that die. Thus we have ten independent binomials $Y_i \sim \text{Binomial}(n, p_i)$. Suppose we know from biological considerations that higher doses should have higher probability of death. Thus, $p_1 \leq p_2 \leq \cdots \leq p_{10}$. Suppose we want to estimate the dose at which the animals have a 50 per cent chance of dying. This is called the LD50. Formally, $\delta = x_j$ where
$$j = \min\{i : p_i \geq 0.50\}.$$
Notice that $\delta$ is implicitly a (complicated) function of $p_1, \ldots, p_{10}$, so we can write $\delta = g(p_1, \ldots, p_{10})$ for some $g$. This just means that if we know $(p_1, \ldots, p_{10})$ then we can find $\delta$. The posterior

FIGURE 25.1. Posterior of $\delta$ from simulation.

mean of $\delta$ is
$$\int\!\!\int \cdots \int_A g(p_1, \ldots, p_{10})\, f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\,dp_1\,dp_2 \cdots dp_{10}.$$
The integral is over the region
$$A = \{(p_1, \ldots, p_{10}) : p_1 \leq \cdots \leq p_{10}\}.$$
Similarly, the posterior cdf of $\delta$ is
$$F(c \mid Y_1, \ldots, Y_{10}) = \mathbb{P}(\delta \leq c \mid Y_1, \ldots, Y_{10}) = \int\!\!\int \cdots \int_B f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\,dp_1\,dp_2 \cdots dp_{10}$$
where
$$B = A \cap \{(p_1, \ldots, p_{10}) : g(p_1, \ldots, p_{10}) \leq c\}.$$
We would need to do a 10 dimensional integral over a restricted region $A$. Instead, let's use simulation.

Let us take a flat prior truncated over $A$. Except for the truncation, each $P_i$ has once again a Beta distribution. To draw from the posterior we do the following steps:

(1) Draw $P_i \sim \text{Beta}(Y_i + 1, n - Y_i + 1)$, $i = 1, \ldots, 10$.

(2) If $P_1 \leq P_2 \leq \cdots \leq P_{10}$, keep this draw. Otherwise, throw it away and draw again until you get one you can keep.

(3) Let $\delta = x_j$ where
$$j = \min\{i : P_i > 0.50\}.$$

We repeat this $N$ times to get $\delta^{(1)}, \ldots, \delta^{(N)}$ and take
$$\mathbb{E}(\delta \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_i \delta^{(i)}.$$

$\delta$ is a discrete variable. We can estimate its probability mass function by
$$\mathbb{P}(\delta = x_j \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^N I(\delta^{(i)} = x_j).$$
For example, suppose that $x = (1, 2, \ldots, 10)$. Figure 25.2 shows some data with $n = 15$ animals at each dose. The R code is a bit tricky.

## assume x and y are given

n <- rep(15,10)

B <- 10000

p <- matrix(0,B,10)

check <- rep(0,B)

for(i in 1:10){

p[,i] <- rbeta(B, y[i]+1, n[i]-y[i]+1)

}

for(i in 1:B){

check[i] <- min(diff(p[i,]))

}

good <- (1:B)[check >= 0]

p <- p[good,]

B <- nrow(p)

delta <- rep(0,B)

for(i in 1:B){
     ## delta = x_j where j is the smallest dose index with p_j > .5
     j <- min((1:10)[p[i,] > .5])
     delta[i] <- x[j]

}

print(mean(delta))

left <- quantile(delta,.025)

right <- quantile(delta,.975)

print(c(left,right))

FIGURE 25.2. Dose response data.

The posterior draws for $p_1, \ldots, p_{10}$ are shown in the second panel of the figure. We find that $\overline{\delta} = 4.04$ with a 95 per cent interval of (3, 5). The third panel shows the posterior for $\delta$, which is necessarily a discrete distribution.

25.3 Importance Sampling

Consider the integral $I = \int h(x) f(x)\,dx$ where $f$ is a probability density. The basic Monte Carlo method involves sampling from $f$. However, there are cases where we may not know how to sample from $f$. For example, in Bayesian inference, the posterior density is obtained by multiplying the likelihood $\mathcal{L}(\theta)$ times the prior $f(\theta)$. There is no guarantee that $f(\theta \mid x)$ will be a known distribution like a Normal or Gamma or whatever.

Importance sampling is a generalization of basic Monte Carlo which overcomes this problem. Let $g$ be a probability density that we know how to simulate from. Then
$$I = \int h(x) f(x)\,dx = \int \frac{h(x) f(x)}{g(x)}\, g(x)\,dx = \mathbb{E}_g(Y)$$
where $Y = h(X) f(X)/g(X)$ and the expectation $\mathbb{E}_g(Y)$ is with respect to $g$. We can simulate $X_1, \ldots, X_N \sim g$ and estimate $I$ by
$$\hat{I} = \frac{1}{N} \sum_i Y_i = \frac{1}{N} \sum_i \frac{h(X_i) f(X_i)}{g(X_i)}.$$

This is called importance sampling. By the law of large num-

bers, bI p! I. However, there is a catch. It's possible that bI might

have an in�nite standard error. To see why, recall that I is the

mean of w(x) = h(x)f(x)=g(x). The second moment of this

quantity is

Eg(w2(X)) =

Z �h(x)f(x)

g(x)

�2

g(x)dx =

Zh2(x)f 2(x)

g(x)dx:


If $g$ has thinner tails than $f$, then this integral might be infinite. To avoid this, a basic rule in importance sampling is to sample from a density $g$ with thicker tails than $f$. Also, suppose that $g(x)$ is small over some set $A$ where $f(x)$ is large. Again, the ratio $f/g$ could be large, leading to a large variance. This implies that we should choose $g$ to be similar in shape to $f$. In summary, a good choice for an importance sampling density $g$ should be similar to $f$ but with thicker tails. In fact, we can say what the optimal choice of $g$ is.

Theorem 25.5. The choice of $g$ that minimizes the variance of $\widehat{I}$ is
$$g^*(x) = \frac{|h(x)|\, f(x)}{\int |h(s)|\, f(s)\,ds}.$$

PROOF. The variance of $w = fh/g$ is
$$
E_g(w^2) - (E_g(w))^2 = \int w^2(x) g(x)\,dx - \left( \int w(x) g(x)\,dx \right)^2
= \int \frac{h^2(x) f^2(x)}{g^2(x)}\, g(x)\,dx - \left( \int \frac{h(x) f(x)}{g(x)}\, g(x)\,dx \right)^2
= \int \frac{h^2(x) f^2(x)}{g(x)}\,dx - \left( \int h(x) f(x)\,dx \right)^2.
$$
The second integral does not depend on $g$, so we only need to minimize the first integral. Now, from Jensen's inequality we have
$$E_g(W^2) \geq (E_g(|W|))^2 = \left( \int |h(x)|\, f(x)\,dx \right)^2.$$
This establishes a lower bound on $E_g(W^2)$. However, $E_{g^*}(W^2)$ equals this lower bound, which proves the claim. $\blacksquare$

This theorem is interesting but it is only of theoretical interest. If we did not know how to sample from $f$, then it is unlikely that we could sample from $|h(x)|\, f(x) / \int |h(s)|\, f(s)\,ds$. In practice, we simply try to find a thick-tailed distribution $g$ which is similar to $f\,|h|$.

Example 25.6 (Tail Probability). Let's estimate $I = P(Z > 3) = 0.0013$ where $Z \sim N(0,1)$. Write $I = \int h(x) f(x)\,dx$ where $f(x)$ is

:0013 where Z � N(0; 1). Write I =Rh(x)f(x)dx where f(x) is

the standard Normal density and h(x) = 1 if x > 3 and 0 other-

wise. The basic Monte Carlo estimator is bI = N�1P

ih(Xi).

Using N = 100 we �nd (from simulating many times) that

E(bI) = :0015 and V ar(bI) = :0039. Notice that most obser-

vations are wasted in the sense that most are not near the right

tail. We will estimate this with importance sampling taking g

to be a Normal(4,1) density. We draw values from g and the

estimate is now bI = N�1P

if(Xi)h(Xi)=g(Xi). In this case we

�nd that E(bI) = :0011 and V ar(bI) = :0002: We have reduced

the standard deviation by a factor of 20.
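Here is a minimal R sketch of this comparison (not from the text; the number of repetitions and the function names are illustrative choices):

## Basic Monte Carlo versus importance sampling for I = P(Z > 3).
N <- 100
basic <- function(){
  x <- rnorm(N)                                  # X_i ~ f = N(0,1)
  mean(x > 3)                                    # average of h(X_i) = 1{X_i > 3}
}
impsamp <- function(){
  x <- rnorm(N, mean = 4)                        # X_i ~ g = N(4,1)
  mean((x > 3) * dnorm(x)/dnorm(x, mean = 4))    # average of h(X_i) f(X_i) / g(X_i)
}
## repeat many times to compare the means and standard deviations
out.basic <- replicate(10000, basic())
out.is <- replicate(10000, impsamp())
c(mean(out.basic), sd(out.basic), mean(out.is), sd(out.is))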

Example 25.7 (Measurement Model With Outliers). Suppose we have measurements $X_1, \ldots, X_n$ of some physical quantity $\theta$. We might model this as
$$X_i = \theta + \epsilon_i.$$
If we assume that $\epsilon_i \sim N(0,1)$ then $X_i \sim N(\theta, 1)$. However, when taking measurements, it is often the case that we get the occasional wild observation, or outlier. This suggests that a Normal might be a poor model since Normals have thin tails, which implies that extreme observations are rare. One way to improve the model is to use a density for $\epsilon_i$ with thicker tails, for example, a t-distribution with $\nu$ degrees of freedom, which has the form
$$t(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu+1)/2}.$$
Smaller values of $\nu$ correspond to thicker tails. For the sake of illustration we will take $\nu = 3$. Suppose we observe $X_i = \theta + \epsilon_i$, $i = 1, \ldots, n$, where $\epsilon_i$ has a $t$ distribution with $\nu = 3$. We will


take a flat prior on $\theta$. The likelihood is $\mathcal{L}(\theta) = \prod_{i=1}^{n} t(X_i - \theta)$ and the posterior mean of $\theta$ is
$$\bar{\theta} = \frac{\int \theta\, \mathcal{L}(\theta)\,d\theta}{\int \mathcal{L}(\theta)\,d\theta}.$$
We can estimate the top and bottom integrals using importance sampling. We draw $\theta_1, \ldots, \theta_N \sim g$ and then
$$\bar{\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} \frac{\theta_j \mathcal{L}(\theta_j)}{g(\theta_j)}}{\frac{1}{N} \sum_{j=1}^{N} \frac{\mathcal{L}(\theta_j)}{g(\theta_j)}}.$$

To illustrate the idea, I drew $n = 2$ observations. The posterior mean (computed numerically) is $-0.54$. Using a Normal importance sampler $g$ yields an estimate of $-0.74$. Using a Cauchy (t-distribution with 1 degree of freedom) importance sampler yields an estimate of $-0.53$.
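A short R sketch of the Cauchy-proposal computation (not from the text; the two observations and the number of draws are hypothetical, chosen only for illustration):

## importance sampling for the posterior mean of theta with a flat prior
## and t_3 errors, using a Cauchy importance sampler g
x <- c(-1.5, 0.5)                      # hypothetical n = 2 observations
lik <- function(theta) sapply(theta, function(t) prod(dt(x - t, df = 3)))
N <- 100000
theta <- rcauchy(N)                    # theta_j ~ g = Cauchy
w <- lik(theta)/dcauchy(theta)         # L(theta_j)/g(theta_j)
sum(theta * w)/sum(w)                  # ratio of the two importance-sampling sums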

25.4 MCMC Part I: The Metropolis-Hastings Algorithm

Consider again the problem of estimating the integral $I = \int h(x) f(x)\,dx$. In this chapter we introduce Markov chain Monte Carlo (MCMC) methods. The idea is to construct a Markov chain $X_1, X_2, \ldots$ whose stationary distribution is $f$. Under certain conditions it will then follow that

$$\frac{1}{N} \sum_{i=1}^{N} h(X_i) \xrightarrow{P} E_f(h(X)) = I.$$
This works because there is a law of large numbers for Markov chains called the ergodic theorem.

The Metropolis-Hastings algorithm is a specific MCMC method that works as follows. Let $q(y \mid x)$ be an arbitrary, friendly distribution (i.e., we know how to sample from $q(y \mid x)$). The conditional density $q(y \mid x)$ is called the proposal distribution.


The Metropolis-Hastings algorithm creates a sequence of observations $X_0, X_1, \ldots$ as follows.

Metropolis-Hastings Algorithm. Choose $X_0$ arbitrarily. Suppose we have generated $X_0, X_1, \ldots, X_i$. To generate $X_{i+1}$ do the following:

(1) Generate a proposal or candidate value $Y \sim q(y \mid X_i)$.

(2) Evaluate $r \equiv r(X_i, Y)$ where
$$r(x, y) = \min\left\{ \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)},\ 1 \right\}.$$

(3) Set
$$X_{i+1} = \begin{cases} Y & \text{with probability } r \\ X_i & \text{with probability } 1 - r. \end{cases}$$

REMARK 1: A simple way to execute step (3) is to generate $U \sim \mathrm{Uniform}(0,1)$. If $U < r$ set $X_{i+1} = Y$; otherwise set $X_{i+1} = X_i$.

REMARK 2: A common choice for $q(y \mid x)$ is $N(x, b^2)$ for some $b > 0$. This means that the proposal is drawn from a Normal centered at the current value.

REMARK 3: If the proposal density $q$ is symmetric, $q(y \mid x) = q(x \mid y)$, then $r$ simplifies to
$$r = \min\left\{ \frac{f(Y)}{f(X_i)},\ 1 \right\}.$$

The Normal proposal distribution mentioned in Remark 2 is an

example of a symmetric proposal density.


By construction, $X_0, X_1, \ldots$ is a Markov chain. But why does this Markov chain have $f$ as its stationary distribution? Before we explain why, let us first do an example.

Example 25.8. The Cauchy distribution has density
$$f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}.$$
Our goal is to simulate a Markov chain whose stationary distribution is $f$. As suggested in the remark above, we take $q(y \mid x)$ to be $N(x, b^2)$. So in this case,
$$r(x, y) = \min\left\{ \frac{f(y)}{f(x)},\ 1 \right\} = \min\left\{ \frac{1 + x^2}{1 + y^2},\ 1 \right\}.$$
So the algorithm is to draw $Y \sim N(X_i, b^2)$ and set
$$X_{i+1} = \begin{cases} Y & \text{with probability } r(X_i, Y) \\ X_i & \text{with probability } 1 - r(X_i, Y). \end{cases}$$

The R code is very simple:

metrop <- function(N, b){
  ## random-walk Metropolis chain for the Cauchy density f(x) = 1/(pi(1+x^2))
  x <- rep(0, N)
  for(i in 2:N){
    y <- rnorm(1, x[i-1], b)                   # proposal Y ~ N(x[i-1], b^2)
    r <- min((1 + x[i-1]^2)/(1 + y^2), 1)      # acceptance probability r(x[i-1], y)
    u <- runif(1)
    if(u < r) x[i] <- y else x[i] <- x[i-1]    # accept the proposal or stay put
  }
  return(x)
}
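For example, the three chains compared below can be generated with calls like the following (the seed and the plotting layout are only illustrative):

set.seed(1)
x1 <- metrop(1000, .1)
x2 <- metrop(1000, 1)
x3 <- metrop(1000, 10)
par(mfrow = c(3, 1))                           # one trace plot per choice of b
plot(x1, type = "l"); plot(x2, type = "l"); plot(x3, type = "l")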

The simulator requires a choice of $b$. Figure 25.3 shows three chains of length $N = 1000$ using $b = .1$, $b = 1$ and $b = 10$. The histograms of the simulated values are shown in Figure 25.4. Setting $b = .1$ forces the chain to take small steps. As a result, the chain doesn't "explore" much of the sample space. The histogram from the sample does not approximate the true density very well. Setting $b = 10$ causes the proposals to often be far in the tails, making $r$ small, and hence we often reject the proposal and keep the chain at its current position. The result is that the chain "gets stuck" at the same place quite often. Again, this means that the histogram from the sample does not approximate the true density very well. The middle choice avoids these extremes and results in a Markov chain sample that better represents the density sooner. In summary, there are tuning parameters and the efficiency of the chain depends on these parameters. We'll discuss this in more detail later.

If the sample from the Markov chain starts to "look like" the target distribution $f$ quickly, then we say that the chain is "mixing well." Constructing a chain that mixes well is somewhat of an art.

Why It Works. Recall from the previous chapter that a distribution $\pi$ satisfies detailed balance for a Markov chain if
$$p_{ij} \pi_i = p_{ji} \pi_j.$$
We then showed that if $\pi$ satisfies detailed balance, then it is a stationary distribution for the chain.

Because we are now dealing with continuous state Markov chains, we will change notation a little and write $p(x, y)$ for the probability of making a transition from $x$ to $y$. Also, let's use $f(x)$ instead of $\pi$ for a distribution. In this new notation, $f$ is a stationary distribution if $f(x) = \int f(y) p(y, x)\,dy$ and detailed balance holds for $f$ if
$$f(x) p(x, y) = f(y) p(y, x).$$
Detailed balance implies that $f$ is a stationary distribution since, if detailed balance holds, then
$$\int f(y) p(y, x)\,dy = \int f(x) p(x, y)\,dy = f(x) \int p(x, y)\,dy = f(x)$$


FIGURE 25.3. Three Metropolis chains corresponding to $b = .1$, $b = 1$, $b = 10$.


FIGURE 25.4. Histograms from the three chains. The true density is also plotted.


which shows that $f(x) = \int f(y) p(y, x)\,dy$ as required. Our goal is to show that $f$ satisfies detailed balance, which will imply that $f$ is a stationary distribution for the chain.

Consider two points $x$ and $y$. Either
$$f(x)\, q(y \mid x) < f(y)\, q(x \mid y) \quad \text{or} \quad f(x)\, q(y \mid x) > f(y)\, q(x \mid y).$$
We will ignore ties, which we can do in the continuous setting. Without loss of generality, assume that $f(x)\, q(y \mid x) > f(y)\, q(x \mid y)$. This implies that
$$r(x, y) = \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)}$$
and that $r(y, x) = 1$. Now $p(x, y)$ is the probability of jumping from $x$ to $y$. This requires two things: (i) the proposal distribution must generate $y$, and (ii) you must accept $y$. Thus,
$$p(x, y) = q(y \mid x)\, r(x, y) = q(y \mid x) \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)} = \frac{f(y)}{f(x)}\, q(x \mid y).$$
Therefore,
$$f(x)\, p(x, y) = f(y)\, q(x \mid y). \tag{25.1}$$
On the other hand, $p(y, x)$ is the probability of jumping from $y$ to $x$. This requires two things: (i) the proposal distribution must generate $x$, and (ii) you must accept $x$. This occurs with probability $p(y, x) = q(x \mid y)\, r(y, x) = q(x \mid y)$. Hence,
$$f(y)\, p(y, x) = f(y)\, q(x \mid y). \tag{25.2}$$
Comparing (25.1) and (25.2), we see that detailed balance holds.

25.5 MCMC Part II: Different Flavors

There are different types of MCMC algorithms. Here we will consider a few of the most popular versions.


Random-walk-Metropolis-Hastings. In the previous section we considered drawing a proposal $Y$ of the form
$$Y = X_i + \epsilon_i$$
where $\epsilon_i$ comes from some distribution with density $g$. In other words, $q(y \mid x) = g(y - x)$. We saw that in this case,
$$r(x, y) = \min\left\{ 1,\ \frac{f(y)}{f(x)} \right\}.$$
This is called a random-walk-Metropolis-Hastings method. The reason for the name is that, if we did not do the accept-reject step, we would be simulating a random walk. The most common choice for $g$ is a $N(0, b^2)$. The hard part is choosing $b$ so that the chain mixes well. A good rule of thumb is: choose $b$ so that you accept the proposals about 50 per cent of the time.

Warning: This method doesn't make sense unless $X$ takes values on the whole real line. If $X$ is restricted to some interval then it is best to transform $X$, say $Y = m(X)$, where $Y$ takes values on the whole real line. For example, if $X \in (0, \infty)$ then you might take $Y = \log X$ and then simulate the distribution for $Y$ instead of $X$.
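To make the warning concrete, here is a minimal R sketch (not from the text) that targets an Exponential(1) random variable, which lives on $(0, \infty)$, by running a random-walk chain on $Y = \log X$; the Exponential target, the function name rw.log, and the run length are illustrative choices.

## Y = log X has density f_Y(y) = exp(y - exp(y)) when X ~ Exp(1)
rw.log <- function(N, b){
  y <- rep(0, N)
  logf <- function(y) y - exp(y)                   # log density of Y
  for(i in 2:N){
    cand <- rnorm(1, y[i-1], b)                    # random-walk proposal on the log scale
    r <- min(exp(logf(cand) - logf(y[i-1])), 1)
    if(runif(1) < r) y[i] <- cand else y[i] <- y[i-1]
  }
  exp(y)                                           # transform back to X
}
x <- rw.log(5000, 1)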

Independence-Metropolis-Hastings. This is like an importance-sampling version of MCMC. In this case we draw the proposal from a fixed distribution $g$. Generally, $g$ is chosen to be an approximation to $f$. The acceptance probability becomes
$$r(x, y) = \min\left\{ 1,\ \frac{f(y)}{f(x)} \frac{g(x)}{g(y)} \right\}.$$
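A minimal R sketch of an independence sampler (not from the text): here the target $f$ is a $t_3$ density and the fixed proposal $g$ is a Cauchy ($t_1$), a thicker-tailed approximation to $f$; the function name indep.mh and the run length are illustrative.

indep.mh <- function(N){
  x <- rep(0, N)
  for(i in 2:N){
    y <- rt(1, df = 1)                                   # proposal Y ~ g = t_1 (Cauchy)
    r <- min(dt(y, 3) * dt(x[i-1], 1) /
             (dt(x[i-1], 3) * dt(y, 1)), 1)              # f(y)g(x) / (f(x)g(y))
    if(runif(1) < r) x[i] <- y else x[i] <- x[i-1]
  }
  x
}
x <- indep.mh(5000)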

Gibbs Sampling. The two previous methods can be easily

adapted, in principle, to work in higher dimensions. In practice,


tuning the chains to make them mix well is hard. Gibbs sampling is a way to turn a high-dimensional problem into several one-dimensional problems.

Here's how it works for a bivariate problem. Suppose that $(X, Y)$ has density $f_{X,Y}(x, y)$. First, suppose that it is possible to simulate from the conditional distributions $f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. Let $(X_0, Y_0)$ be starting values. Assume we have drawn $(X_0, Y_0), \ldots, (X_n, Y_n)$. Then the Gibbs sampling algorithm for getting $(X_{n+1}, Y_{n+1})$ is:

The $(n+1)$st step in Gibbs sampling:
$$X_{n+1} \sim f_{X \mid Y}(x \mid Y_n)$$
$$Y_{n+1} \sim f_{Y \mid X}(y \mid X_{n+1})$$

This generalizes in the obvious way to higher dimensions.
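As a minimal sketch (not from the text), here is the bivariate scheme in R for a standard bivariate Normal target with correlation $\rho$, whose conditionals are $X \mid Y = y \sim N(\rho y, 1 - \rho^2)$ and $Y \mid X = x \sim N(\rho x, 1 - \rho^2)$; the function name gibbs.norm and the choice $\rho = 0.8$ are illustrative.

gibbs.norm <- function(N, rho){
  x <- y <- rep(0, N)
  s <- sqrt(1 - rho^2)                     # conditional standard deviation
  for(n in 2:N){
    x[n] <- rnorm(1, rho * y[n-1], s)      # X_{n+1} ~ f(x | Y_n)
    y[n] <- rnorm(1, rho * x[n], s)        # Y_{n+1} ~ f(y | X_{n+1})
  }
  cbind(x, y)
}
out <- gibbs.norm(5000, rho = 0.8)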

Example 25.9 (Normal Hierarchical Model). Gibbs sampling is very useful for a class of models called hierarchical models. Here is a simple case. Suppose we draw a sample of $k$ cities. From each city we draw $n_i$ people and observe how many people $Y_i$ have a disease. Thus, $Y_i \sim \mathrm{Binomial}(n_i, p_i)$. We are allowing for different disease rates in different cities. We can also think of the $p_i$'s as random draws from some distribution $F$. We can write this model in the following way:
$$P_i \sim F$$
$$Y_i \mid P_i = p_i \sim \mathrm{Binomial}(n_i, p_i).$$
We are interested in estimating the $p_i$'s and the overall disease rate $\int p\,dF(p)$.

To proceed, it will simplify matters if we make some transformations that allow us to use some Normal approximations. Let


$\widehat{p}_i = Y_i/n_i$. Recall that $\widehat{p}_i \approx N(p_i, s_i^2)$ where $s_i = \sqrt{\widehat{p}_i (1 - \widehat{p}_i)/n_i}$. Let $\psi_i = \log(p_i/(1 - p_i))$ and define $Z_i \equiv \widehat{\psi}_i = \log(\widehat{p}_i/(1 - \widehat{p}_i))$. By the delta method,
$$\widehat{\psi}_i \approx N(\psi_i, \sigma_i^2)$$
where $\sigma_i^2 = 1/(n_i \widehat{p}_i (1 - \widehat{p}_i))$. Experience shows that the Normal approximation for $\psi$ is more accurate than the Normal approximation for $p$, so we shall work with $\psi$. We shall treat $\sigma_i$ as known. Furthermore, we shall take the distribution of the $\psi_i$'s to be Normal. The hierarchical model is now
$$\psi_i \sim N(\mu, \tau^2)$$
$$Z_i \mid \psi_i \sim N(\psi_i, \sigma_i^2).$$
As yet another simplification we take $\tau = 1$. The unknown parameters are $\theta = (\mu, \psi_1, \ldots, \psi_k)$. The likelihood function is
$$\mathcal{L}(\theta) \propto \prod_i f(\psi_i \mid \mu) \prod_i f(Z_i \mid \psi_i) \propto \prod_i \exp\left\{ -\frac{1}{2} (\psi_i - \mu)^2 \right\} \exp\left\{ -\frac{1}{2\sigma_i^2} (Z_i - \psi_i)^2 \right\}.$$

If we use the prior $f(\mu) \propto 1$, then the posterior is proportional to the likelihood. To use Gibbs sampling, we need to find the conditional distribution of each parameter conditional on all the others. Let us begin by finding $f(\mu \mid \text{rest})$ where "rest" refers to all the other variables. We can throw away any terms that don't involve $\mu$. Thus,
$$f(\mu \mid \text{rest}) \propto \prod_i \exp\left\{ -\frac{1}{2} (\psi_i - \mu)^2 \right\} \propto \exp\left\{ -\frac{k}{2} (\mu - b)^2 \right\}$$
where
$$b = \frac{1}{k} \sum_i \psi_i.$$


Hence we see that $\mu \mid \text{rest} \sim N(b, 1/k)$. Next we will find $f(\psi_i \mid \text{rest})$. Again, we can throw away any terms not involving $\psi_i$, leaving us with
$$f(\psi_i \mid \text{rest}) \propto \exp\left\{ -\frac{1}{2} (\psi_i - \mu)^2 \right\} \exp\left\{ -\frac{1}{2\sigma_i^2} (Z_i - \psi_i)^2 \right\} \propto \exp\left\{ -\frac{1}{2 d_i^2} (\psi_i - e_i)^2 \right\}$$
where
$$e_i = \frac{\frac{Z_i}{\sigma_i^2} + \mu}{1 + \frac{1}{\sigma_i^2}} \quad \text{and} \quad d_i^2 = \frac{1}{1 + \frac{1}{\sigma_i^2}},$$
and so $\psi_i \mid \text{rest} \sim N(e_i, d_i^2)$. The Gibbs sampling algorithm then involves iterating the following steps $N$ times:

draw $\mu \sim N(b, v^2)$, where $v^2 = 1/k$
draw $\psi_1 \sim N(e_1, d_1^2)$
$\vdots$
draw $\psi_k \sim N(e_k, d_k^2)$.

It is understood that at each step, the most recently drawn version of each variable is used.

Let's now consider a numerical example. Suppose that there

are k = 20 cities and we sample n = 20 people from each city.

The R function for this problem is:

gibbs.fun <- function(y, n, N){
  k <- length(y)
  p.hat <- y/n
  Z <- log(p.hat/(1 - p.hat))                   # Z_i = logit of the raw proportion
  sigma <- sqrt(1/(n*p.hat*(1 - p.hat)))        # sigma_i, treated as known
  v <- sqrt(1/k)                                # sd of mu given rest: v^2 = 1/k
  mu <- rep(0, N)
  psi <- matrix(0, N, k)
  for(i in 2:N){
    ### draw mu given rest: mu | rest ~ N(b, 1/k) with b the mean of the psi's
    b <- mean(psi[i-1,])
    mu[i] <- rnorm(1, b, v)
    ### draw psi given rest: psi_i | rest ~ N(e_i, d_i^2)
    e <- (Z/sigma^2 + mu[i])/(1 + 1/sigma^2)
    d <- sqrt(1/(1 + 1/sigma^2))
    psi[i,] <- rnorm(k, e, d)
  }
  list(mu = mu, psi = psi)
}
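The text does not list the data, so for illustration one might simulate counts for $k = 20$ cities with $n = 20$ people each and run the sampler; the seed and the true $p_i$'s below are hypothetical.

set.seed(1)
k <- 20; n <- 20
p.true <- runif(k, .2, .8)                       # hypothetical city-level disease rates
y <- pmin(pmax(rbinom(k, n, p.true), 1), n - 1)  # keep counts off 0 and n so the logit is finite
out <- gibbs.fun(y, n, N = 1000)
p.post <- exp(out$psi)/(1 + exp(out$psi))        # convert each psi back to p
plot(out$mu, type = "l")                         # trace plot of mu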

After running the chain, we can convert each $\psi_i$ back into $p_i$ by way of $p_i = e^{\psi_i}/(1 + e^{\psi_i})$. The raw proportions are shown in Figure 25.6. Figure 25.5 shows "trace plots" of the Markov chain for $p_1$ and $\mu$. (We should also look at the plots for $p_2, \ldots, p_{20}$.) Figure 25.6 shows the posterior for $\mu$ based on the simulated values. The second panel of Figure 25.6 shows the raw proportions and the Bayes estimates. Note that the Bayes estimates are "shrunk" together. The parameter $\tau$ controls the amount of shrinkage. We set $\tau = 1$ but in practice, we should treat $\tau$ as another unknown parameter and let the data determine how much shrinkage is needed.

So far we assumed that we know how to draw samples from the conditionals $f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. If we don't know how, we can still use the Gibbs sampling algorithm by drawing each observation using a Metropolis-Hastings step. Let $q$ be a proposal distribution for $x$ and let $\widetilde{q}$ be a proposal distribution for $y$. When we do a Metropolis step for $X$ we treat $Y$ as fixed. Similarly, when we do a Metropolis step for $Y$ we treat $X$ as fixed. Here are the steps:


FIGURE 25.5. Simulated values of $p_1$ and $\mu$.


FIGURE 25.6. Top panel: posterior histogram of $\mu$. Lower panel: raw proportions and the Bayes posterior estimates.


Metropolis within Gibbs:

(1a) Draw a proposal $Z \sim q(z \mid X_n)$.

(1b) Evaluate
$$r = \min\left\{ \frac{f(Z, Y_n)}{f(X_n, Y_n)} \frac{q(X_n \mid Z)}{q(Z \mid X_n)},\ 1 \right\}.$$

(1c) Set
$$X_{n+1} = \begin{cases} Z & \text{with probability } r \\ X_n & \text{with probability } 1 - r. \end{cases}$$

(2a) Draw a proposal $Z \sim \widetilde{q}(z \mid Y_n)$.

(2b) Evaluate
$$r = \min\left\{ \frac{f(X_{n+1}, Z)}{f(X_{n+1}, Y_n)} \frac{\widetilde{q}(Y_n \mid Z)}{\widetilde{q}(Z \mid Y_n)},\ 1 \right\}.$$

(2c) Set
$$Y_{n+1} = \begin{cases} Z & \text{with probability } r \\ Y_n & \text{with probability } 1 - r. \end{cases}$$

Again, this generalizes to more than two dimensions.
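A minimal R sketch of these steps (not from the text): for illustration only, the target below is a standard bivariate Normal with correlation 0.8 (its conditionals could in fact be sampled exactly), and both coordinates use symmetric random-walk proposals, so the $q$ ratios in (1b) and (2b) cancel; the run length and proposal scale are arbitrary.

## log of the unnormalized target density f(x, y)
logf <- function(x, y) -(x^2 - 2*0.8*x*y + y^2)/(2*(1 - 0.8^2))
N <- 5000
x <- y <- rep(0, N)
for(n in 2:N){
  ## (1) Metropolis step for X with Y fixed at y[n-1]
  z <- rnorm(1, x[n-1], 1)
  r <- min(exp(logf(z, y[n-1]) - logf(x[n-1], y[n-1])), 1)
  x[n] <- if(runif(1) < r) z else x[n-1]
  ## (2) Metropolis step for Y with X fixed at x[n]
  z <- rnorm(1, y[n-1], 1)
  r <- min(exp(logf(x[n], z) - logf(x[n], y[n-1])), 1)
  y[n] <- if(runif(1) < r) z else y[n-1]
}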
