R – Research Study on the
Powerful Statistical Programming Language
Author: Ed Orlando, Analytics and Business Intelligence Manager, White Lodging
Course: IDS742: Tools for Decision Making
Professor: John Ward
Honor Code: I have neither given nor received unauthorized aid
Table of Contents
Executive Summary
High Level Description of R and Its History
R Can Be Passed Through Tableau to Create Visualizations
Instructions on how to pass R through Tableau
Basic Sample Code Examples
Statistical Coding Examples in R along with Tableau Translations
Case Study - Multiple Regression Results for Multiple Data Sets
Strengths and Weaknesses of R
Summary

Tables
Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure
Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando
Table 3: One Hotel Multiple Regression Model Example
Table 4: Microsoft Excel Data Analysis Multiple Regression Output
Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)
Table 6: Multiple Hotel / Multiple Regression Model Example
Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Graphs
Graph 1: Screenshot of R Interface
Graph 2: Screenshot of RStudio Interface

References
Appendix: Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data
Executive Summary
This paper provides a high-level description of R and its history. One section describes how R can connect to various software platforms, in particular Tableau. As I have been an R programmer for about 2 years, some basic and advanced coding examples are provided. In addition, one business case shows how R can assist with creating multiple regression models on multiple data sets (an example of the code is listed in the Appendix). Lastly, some of the strengths and weaknesses of R are highlighted.
High Level Description of R and Its History
R is a statistical programming language that was developed by Ross Ihaka and Robert Gentleman, who were professors at the University of Auckland in New Zealand. They originally developed it in the early 1990s for their students to perform statistical calculations, and it was eventually released for public use in February 2000 (de Vries & Meys, p. 10). In 2012, 18 individuals had access to make changes to the R program. These people form what is called the Core Development Team (de Vries & Meys, p. 10) and are responsible for creating, modifying, and documenting the R packages and functionality.

R can be downloaded free of charge from www.r-project.org. One can work either in the original R interface or in a more GUI (graphical user interface) style of interface referred to as RStudio, which is installed separately and runs on top of R. RStudio has gained popularity over the last few years since it is considered a little more user-friendly. My personal preference is the original R platform.

Graph 1: Screenshot of R Interface

Graph 2: Screenshot of RStudio Interface
R is open-source software that can be downloaded for many different types of computers and is available throughout most of the world. The fact that the software is free is one of its most attractive features. In addition, since the language is open source, the underlying code can be accessed by anyone (de Vries & Meys, p. 10-14). In other words, other developers can audit the packages and code to ensure that they are working properly. Lastly, it is estimated that more than 2 million people used R as of 2012, with online communities available for help as well as many books written on the language (de Vries & Meys, p. 10-14). One of the most decorated statisticians in the world, David Mease, who formerly worked for Google and Apple, is a huge advocate of the software. In addition, some of the top programming schools in the country, including the University of Michigan, include R in their curriculums (Mease, 2007).
R Can Be Passed Through Tableau to Create Visualizations
R can be connected to many different software packages. One example is an R plug-in for Tableau. Tableau is data visualization software that can be used to create graphs, visualizations, data stories, and quick filtering capabilities that are much harder to accomplish in R alone. Many R packages and functions can be "passed through" when working in Tableau. For example, a multiple correlation model can be coded in Tableau so that it links up to R. R then performs the statistical calculation with its engine and passes the result back into Tableau. The instructions provided by Tableau that are listed below walk through how to connect to R through Tableau. As a side note, R can also be connected to Microsoft's Power BI software and SPSS, as well as many other software packages.
Instructions on how to pass R through Tableau
Listed below are the exact instructions provided by Tableau (Using R & Tableau, 2014). The site describes how to establish an R connection through Tableau.

How do I start using Tableau with R?
For users who are already familiar with R and its capabilities, it is fairly simple to establish the connection between R and Tableau. The instructions below are for new installations using the open-source version of R. Other options may be available using other packages, such as those from Revolution Analytics.

1. Download and Install R. Click here to find the file and instructions on downloading R.

2. Download and Install Rserve. You will need to install an Rserve for Tableau to connect to in order to utilize the new script functions. In the R console, enter the following commands:

    install.packages("Rserve")
    library(Rserve)
    Rserve()

3. Connect Tableau to the R Server. Once Rserve is installed, open Tableau Desktop and follow the steps below:

    a. Go to the Help menu and select "Manage R Connection".
    b. Enter a server name of "Localhost" (or "127.0.0.1") and a port of "6311".
    c. Click on the "Test Connection" button to make sure everything runs smoothly.

    You should see a successful message. Click OK to close.

4. Start using the R scripts in Tableau. Now you will be able to create new calculated fields in Tableau Desktop that utilize the SCRIPT_* functions to make R functional calls.

Please note that the above instructions were pulled directly from the Tableau document cited above and none of the text was modified (Using R & Tableau, 2014).
Basic Sample Code Examples
Some of the most commonly used code and explanations are listed below. These examples were derived from the site RStatistics.net (R Practice Exercises: Level 1 - Beginners, 2015).

Exercise 1: Calculate the square root of 729.
Answer:
    sqrt(729)

Exercise 2: Create a new variable 'b' with value 1947.0
Answer:
    b <- 1947.0

Exercise 3: Convert 'b' from the previous exercise to character
Answer:
    b <- as.character(b)
    print(b)

Exercise 4: Set up your working directory to a new 'work' folder on your desktop
Answer:
    setwd("path/to/my/desktop/work")
    getwd()

Exercise 5: Create a vector of numbers from 1 to 6
Answer:
    one_to_six <- c(1, 2, 3, 4, 5, 6)

Exercise 6: Random Sampling
Answer:
    mySample <- sample(1:100, 5, replace = TRUE)
Statistical Coding Examples in R along with Tableau Translations
R requires many of its data sets to be columnized. For example, if you had two groups of data, Group A and Group B, and wanted to run an ANOVA analysis or a correlation model on them, the data should be laid out like the example on the left. In other words, the attribute names need to be listed in the same column or vector in order for the statistical tests to run properly.

Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure

R Required Layout             Typical Data Layout
WILL WORK IN R                WILL NOT WORK IN R
Record  Group  Value          Record  Group A  Group B
1       A      5              1       5        4
2       A      7              2       7        2
3       A      8              3       8        8
4       A      9              4       9        4
5       A      10             5       10       7
6       A      12             6       12       8
7       A      5              7       5        9
8       A      8              8       8        10
9       A      7              9       7        15
10      B      4
11      B      2
12      B      8
13      B      4
14      B      7
15      B      8
16      B      9
17      B      10
18      B      15

Table 2 includes some commonly used advanced statistical codes. For each one, the table lists the description of the calculation or model, the R code, and the Tableau translation.
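As a minimal sketch of the layout requirement described above, the wide "typical" layout of Table 1 can be converted to the long layout with base R's `stack()`. The object and column names (`wide`, `long`, `Value`, `Group`) are illustrative assumptions and do not come from the original paper.

```r
# Wide layout from Table 1: one column per group (names are illustrative).
wide <- data.frame(
  GroupA = c(5, 7, 8, 9, 10, 12, 5, 8, 7),
  GroupB = c(4, 2, 8, 4, 7, 8, 9, 10, 15)
)

# stack() places every value in one column and records the source
# column name in a second (factor) column.
long <- stack(wide)
names(long) <- c("Value", "Group")

# Formula-based tests such as a one-way ANOVA expect this long layout.
anova_p <- oneway.test(Value ~ Group, data = long, var.equal = TRUE)$p.value
print(head(long))
```

Keeping the group label in its own column means the same formula, `Value ~ Group`, works unchanged when a third or fourth group is added.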
Statistical Description | Model | R Code | Tableau Code (translation was made by Ed Orlando through trial & error)
mean | Descriptive Analytics | mean(Column.A) | SCRIPT_REAL('mean(.arg1)',SUM([Column A]))
median | Descriptive Analytics | median(Column.A) | SCRIPT_REAL('median(.arg1)',SUM([Column A]))
max | Descriptive Analytics | max(Column.A) | SCRIPT_REAL('max(.arg1)',SUM([Column A]))
min | Descriptive Analytics | min(Column.A) | SCRIPT_REAL('min(.arg1)',SUM([Column A]))
sd | Descriptive Analytics | sd(Column.A) | SCRIPT_REAL('sd(.arg1)',SUM([Column A]))
length or count | Descriptive Analytics | length(Column.A) | SCRIPT_REAL('length(.arg1)',SUM([Column A]))
variance | Descriptive Analytics | var(Column.A) | SCRIPT_REAL('var(.arg1)',SUM([Column A]))
p-value | Simple Regression | cor.test(Column.A,Column.B)$p.value | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column A]),SUM([Column B]))
r squared | Simple Regression | summary(lm(Column.A~Column.B))$r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$r.squared',SUM([Column A]),SUM([Column B]))
Adjusted r squared | Simple Regression | summary(lm(Column.A~Column.B))$adj.r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$adj.r.squared',SUM([Column A]),SUM([Column B]))
df (Regression) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[1]',SUM([Column A]),SUM([Column B]))
df (Residual) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[2]',SUM([Column A]),SUM([Column B]))
df (Total) | Simple Regression | sum(anova(lm(Column.A~Column.B))$"Df") | ([df (Regression)])+([df (Residual)])
SSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[1]',SUM([Column A]),SUM([Column B]))
SSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[2]',SUM([Column A]),SUM([Column B]))
MSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[1]',SUM([Column A]),SUM([Column B]))
MSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[2]',SUM([Column A]),SUM([Column B]))
F value (Top Row) | Simple Regression | anova(lm(Column.A~Column.B))$"F value"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"F value"[1]',SUM([Column A]),SUM([Column B]))
p-value | Simple Regression | anova(lm(Column.A~Column.B))$"Pr(>F)"[1] | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column A]),SUM([Column B]))
Intercept Variable Coefficient | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,1] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,1]',SUM([Column A]),SUM([Column B]))
Intercept Standard Error | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,2] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,2]',SUM([Column A]),SUM([Column B]))
Intercept t Stat | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,3] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,3]',SUM([Column A]),SUM([Column B]))
Intercept P-value | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,4] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,4]',SUM([Column A]),SUM([Column B]))
X1 Variable Coefficient | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,1] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,1]',SUM([Column A]),SUM([Column B]))
X1 Variable Standard Error | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,2] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,2]',SUM([Column A]),SUM([Column B]))
X1 Variable t Stat | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,3] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,3]',SUM([Column A]),SUM([Column B]))
X1 Variable P-value | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,4] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,4]',SUM([Column A]),SUM([Column B]))
p-value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$p.value | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$p.value",SUM([Values]),ATTR([Data2]))
df (1st Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[1] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[1]",SUM([Values]),ATTR([Data2]))
df (2nd Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[2] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[2]",SUM([Values]),ATTR([Data2]))
F Value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$statistic | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$statistic",SUM([Values]),ATTR([Data2]))
p-value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$p.value | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$p.value",SUM([Values]),ATTR([Data2]))
H value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$statistic | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$statistic",SUM([Values]),ATTR([Data2]))
df | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$parameter | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$parameter",SUM([Values]),ATTR([Data2]))

Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando (Logos, 2009) (Numerical Measures, 2016) (One Way Anova, 2016) (Simple Linear Regression with R, 2016)
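To make the simple regression rows of Table 2 concrete, the sketch below runs a few of them on two short vectors. The values are borrowed from the Group A and Group B columns of Table 1 purely for illustration, and the object names (`fit`, `aov_tab`, `coefs`) are assumptions rather than part of the original paper.

```r
# Illustrative data: the Group A and Group B values from Table 1.
Column.A <- c(5, 7, 8, 9, 10, 12, 5, 8, 7)
Column.B <- c(4, 2, 8, 4, 7, 8, 9, 10, 15)

fit <- lm(Column.A ~ Column.B)

# anova() on a simple regression has two rows: Column.B and Residuals.
aov_tab <- anova(fit)
df_regression <- aov_tab$"Df"[1]
df_residual   <- aov_tab$"Df"[2]
df_total      <- sum(aov_tab$"Df")

# Coefficient matrix: row 1 = intercept, row 2 = slope;
# columns = estimate, standard error, t value, p-value.
coefs <- summary(fit)$coefficients

# With a single predictor, the F-test p-value and the Pearson
# correlation test p-value coincide.
p_from_anova <- aov_tab$"Pr(>F)"[1]
p_from_cor   <- cor.test(Column.A, Column.B)$p.value
print(c(anova = p_from_anova, cor.test = p_from_cor))
```

The last comparison suggests why the Tableau column can use `cor.test` for the regression p-value: with a single predictor the two tests give the same result.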
Case Study - Multiple Regression Results for Multiple Data Sets
There are situations where analysts or statisticians want to run a multiple regression model for a certain set of data. This can easily be accomplished in multiple software packages (Excel, SPSS, MegaStat, etc.) within a few seconds. However, most of the time this is done for one data set at a time. For example, if an analyst wanted to run a regression model on one hotel with the following data set, it could be accomplished very quickly and easily using Excel's Data Analysis ToolPak add-in. See the data and the multiple regression results below:

Table 3: One Hotel Multiple Regression Model Example

Hotel | Future Date | Y = Dependent Variable Occupancy % | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 32% | 58% | 52%
Courtyard A | 7/2/2016 | 33% | 62% | 68%
Courtyard A | 7/3/2016 | 41% | 52% | 48%
Courtyard A | 7/4/2016 | 74% | 78% | 72%
Courtyard A | 7/5/2016 | 74% | 76% | 62%
Courtyard A | 7/6/2017 | 67% | 69% | 58%
Courtyard A | 7/7/2017 | 31% | 32% | 38%

After running the Excel Data Analysis ToolPak multiple regression model, the following data table results are displayed.
Table 4: Microsoft Excel Data Analysis Multiple Regression Output (data set provided above)

The same data set was loaded into R and, after running some fairly extensive code, the following output was written to a .csv file. As one can see, R produces many of the same data points and the values match the Excel output. The entire code that was used is listed in the Appendix.
Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)

Obs | 1
mult.r | 0.86
r.square | 0.74
adj.r.square | 0.61
Std.Err.Model | 0.13
obs | 7
df.regress | 2
df.resid | 4
df.total | 6
SSR | 0.18
SSE | 0.06
SST | 0.25
MSR | 0.09
MSE | 0.02
fvalue | 5.71
Y1coef | 0.05
X1coef | 1.75
X2coef | (1.07)
Y1StdErr | 0.26
X1StdErr | 0.70
X2StdErr | 0.94

After reviewing the extensive code listed in the Appendix, one might ask why someone should use R to produce the same results. The code is much longer and more difficult to learn than a GUI software package such as Excel. However, the real benefit of R comes into play when you want to run this same type of regression model for 100 different hotels, all with different sets of data. In Excel, this exercise could be accomplished by running the same model over and over, but the process becomes very cumbersome and is prone to errors due to the many manual clicks and steps involved. The R code, using the new data set below, can produce results for all 4 hotels very quickly with no changes to the original code produced in the first example. Basically, it runs the regression model on Hotel A's data, then Hotel B's, and repeats until the end of the data set. The results are provided within seconds.
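For orientation, the single-hotel model can be sketched in a few lines of base R. This is a condensed illustration using the Courtyard A values from Table 3, not the author's full script from the Appendix; the object names (`hotel_a`, `fit`, `coef_table`) are assumptions made for the example.

```r
# Courtyard A occupancy values from Table 3, entered as proportions.
hotel_a <- data.frame(
  Y  = c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31),
  X1 = c(0.58, 0.62, 0.52, 0.78, 0.76, 0.69, 0.32),
  X2 = c(0.52, 0.68, 0.48, 0.72, 0.62, 0.58, 0.38)
)

# Multiple regression of current occupancy on the two prior-year values.
fit <- lm(Y ~ X1 + X2, data = hotel_a)
fit_summary <- summary(fit)

r_square     <- fit_summary$r.squared
adj_r_square <- fit_summary$adj.r.squared
coef_table   <- fit_summary$coefficients  # estimate, std. error, t value, p-value

print(round(c(r.square = r_square, adj.r.square = adj_r_square), 2))
print(round(coef_table, 2))
```

The rounded R-squared and coefficients should line up with the corresponding rows of Table 5, assuming the data were entered as shown in Table 3.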
Table 6: Multiple Hotel / Multiple Regression Model Example

Hotel | Date | Y = Dependent Variable Occupancy % | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 0.32 | 0.58 | 0.52
Courtyard A | 7/2/2016 | 0.33 | 0.62 | 0.68
Courtyard A | 7/3/2016 | 0.41 | 0.52 | 0.48
Courtyard A | 7/4/2016 | 0.74 | 0.78 | 0.72
Courtyard A | 7/5/2016 | 0.74 | 0.76 | 0.62
Courtyard A | 7/6/2017 | 0.67 | 0.69 | 0.58
Courtyard A | 7/7/2017 | 0.31 | 0.32 | 0.38
Courtyard B | 7/1/2016 | 0.50 | 0.72 | 0.55
Courtyard B | 7/2/2016 | 0.44 | 0.77 | 0.71
Courtyard B | 7/3/2016 | 0.57 | 0.70 | 0.66
Courtyard B | 7/4/2016 | 0.92 | 0.84 | 0.86
Courtyard B | 7/5/2016 | 0.87 | 0.92 | 0.79
Courtyard B | 7/6/2017 | 0.77 | 0.84 | 0.74
Courtyard B | 7/7/2017 | 0.33 | 0.33 | 0.51
Courtyard C | 7/1/2016 | 0.51 | 0.87 | 0.58
Courtyard C | 7/2/2016 | 0.38 | 0.57 | 0.63
Courtyard C | 7/3/2016 | 0.45 | 0.53 | 0.64
Courtyard C | 7/4/2016 | 0.88 | 0.81 | 0.76
Courtyard C | 7/5/2016 | 0.78 | 0.75 | 0.70
Courtyard C | 7/6/2017 | 0.66 | 0.72 | 0.68
Courtyard C | 7/7/2017 | 0.27 | 0.15 | 0.48
Courtyard D | 7/1/2016 | 0.38 | 0.77 | 0.43
Courtyard D | 7/2/2016 | 0.31 | 0.45 | 0.60
Courtyard D | 7/3/2016 | 0.36 | 0.43 | 0.45
Courtyard D | 7/4/2016 | 0.69 | 0.74 | 0.72
Courtyard D | 7/5/2016 | 0.71 | 0.65 | 0.52
Courtyard D | 7/6/2017 | 0.48 | 0.61 | 0.59
Courtyard D | 7/7/2017 | 0.13 | 0.03 | 0.39
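The "one model per hotel" idea can be sketched without any add-on package by splitting the data frame on the hotel name and fitting the same formula to each piece. This is a base R alternative offered as an illustration; the Appendix code uses the plyr package instead. Only the Courtyard A and Courtyard B rows of Table 6 are included to keep the example short, and the object names (`occupancy`, `models`, `per_hotel`) are assumptions.

```r
# Courtyard A and B rows from Table 6 (a subset, to keep the sketch short).
occupancy <- data.frame(
  Hotel = rep(c("Courtyard A", "Courtyard B"), each = 7),
  Y  = c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31,
         0.50, 0.44, 0.57, 0.92, 0.87, 0.77, 0.33),
  X1 = c(0.58, 0.62, 0.52, 0.78, 0.76, 0.69, 0.32,
         0.72, 0.77, 0.70, 0.84, 0.92, 0.84, 0.33),
  X2 = c(0.52, 0.68, 0.48, 0.72, 0.62, 0.58, 0.38,
         0.55, 0.71, 0.66, 0.86, 0.79, 0.74, 0.51)
)

# split() yields one data frame per hotel; lapply() fits the same model to each.
models <- lapply(split(occupancy, occupancy$Hotel),
                 function(df) lm(Y ~ X1 + X2, data = df))

# Pull the same statistics out of every fitted model.
r_square <- sapply(models, function(mod) summary(mod)$r.squared)
x1_coef  <- sapply(models, function(mod) coef(mod)[["X1"]])

per_hotel <- data.frame(r.square = r_square, X1coef = x1_coef)
print(round(per_hotel, 3))
```

Adding more hotels only means adding rows to the data frame; the modeling code stays the same, which is the point the case study makes.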
Listed below are four different sets of regression model statistics based on the different data sets listed above. As mentioned above, the code runs the regression model for one set of data, records the results, and then runs it again for the next hotel's set of data. This is very useful if you have a model that is consistent, but there are dozens or even hundreds of data sets that each need the model built from their own unique data.

Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Statistic | Courtyard A | Courtyard B | Courtyard C | Courtyard D
mult.r | 0.861 | 0.892 | 0.907 | 0.828
r.square | 0.741 | 0.795 | 0.822 | 0.686
adj.r.square | 0.611 | 0.692 | 0.733 | 0.529
Std.Err.Model | 0.127 | 0.126 | 0.114 | 0.143
obs | 7 | 7 | 7 | 7
df.regress | 2 | 2 | 2 | 2
df.resid | 4 | 4 | 4 | 4
df.total | 6 | 6 | 6 | 6
SSR | 0.185 | 0.245 | 0.240 | 0.178
SSE | 0.065 | 0.063 | 0.052 | 0.082
SST | 0.250 | 0.308 | 0.292 | 0.260
MSR | 0.092 | 0.122 | 0.120 | 0.089
MSE | 0.016 | 0.016 | 0.013 | 0.020
fvalue | 5.712 | 7.746 | 9.254 | 4.370
Y1coef | 0.046 | (0.431) | (0.712) | (0.140)
X1coef | 1.750 | 0.272 | 0.197 | 0.493
X2coef | (1.074) | 1.250 | 1.800 | 0.602
Y1StdErr | 0.258 | 0.291 | 0.373 | 0.272
X1StdErr | 0.703 | 0.459 | 0.276 | 0.268
X2StdErr | 0.943 | 0.704 | 0.745 | 0.588
X1tvalue | 2.489 | 0.592 | 0.715 | 1.840
X2tvalue | (1.138) | 1.776 | 2.416 | 1.024
Y1pvalue | 0.868 | 0.213 | 0.129 | 0.633
X1pvalue | 0.068 | 0.586 | 0.514 | 0.140
X2pvalue | 0.319 | 0.150 | 0.073 | 0.364

Values in parentheses are negative.

Strengths and Weaknesses of R
Some of the key strengths of R are the following:

- Software is free / open source
- Adopted and utilized by more than 2 million people
- Several books and blogs can provide assistance with getting started
- Many of the latest and greatest statistical models are adopted and created first in R
- Easily customizable
- Handles big data well (see weaknesses below for looping performance)

Some of the key weaknesses of R are the following:

- R is a programming language with very few "drop-down" options (not a true GUI interface)
- R typically has a longer learning curve in the beginning compared to other software packages
- Packages can be updated or changed in ways that impact your code
- The software is not supported in the sense that you can't hold any of the developers accountable. However, there are many ways to communicate suspected bugs in the code
- Filtering, sorting, data visualizations, and other functions are definitely available and customizable in the software, but are harder to perform compared to other software packages such as Tableau
- Loops, although possible, take longer compared to other programming languages (Mease, 2007)
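On the looping point, one common way R users limit explicit loops is vectorized arithmetic, where an operation is applied to a whole vector at once. The sketch below computes the same squared deviations both ways, using the Courtyard A occupancy values from Table 6; the variable names are assumptions made for the example, and whether the vectorized form is noticeably faster depends on the size of the data.

```r
# Courtyard A occupancy values (Y column of Table 6).
values <- c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31)

# Explicit loop: one element at a time.
squared_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squared_loop[i] <- (values[i] - mean(values))^2
}

# Vectorized form: the subtraction and squaring apply to every element at once.
squared_vec <- (values - mean(values))^2

# Total sum of squares of Y around its mean.
sst <- sum(squared_vec)
print(round(sst, 3))
```

Both forms give the same numbers; the total corresponds to the SST row for Courtyard A in Table 7.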
Summary
R is an open source statistical programming language that can be downloaded for free. It is extremely powerful and customizable and typically contains the latest statistical packages. Similar to other programming languages, R can be intimidating at first since one starts with only a blank screen. Although RStudio provides some GUI features, most of the language's strengths rely on the capabilities of the programmer. Thankfully, there are many communities, books, videos, and other sources that can assist. Once the data is loaded, normalized, and structured properly, the statistical summaries and tests can be run with little effort. R can also be connected with dozens of platforms, such as Tableau and Microsoft's Power BI, which are both data visualization software packages. Since it is a programming language, there are millions of ways to customize visuals within R. Lastly, as shown above, R can help drive business decisions, perform text analytics on big data, and assist with predictive analytics in powerful ways.
References
1. Correlation (Pearson, Kendall, Spearman). (2014). Retrieved November 1, 2014, from http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
2. Davenport, T. (2014). Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Harvard Business Review Press.
3. Davenport, T. (2007). Competing on Analytics: The New Science of Winning. Harvard Business Review Press.
4. Doane, S. (2013). Nonparametric Tests: Chapter 15 from Applied Statistics in Business and Economics. McGraw-Hill Companies.
5. Evans, J. (n.d.). Lesson 2: The Standard Deviation and the Normal Curve. Retrieved November 11, 2014, from http://www.fgse.nova.edu/edl/secure/stats/lesson2.htm
6. Foreman, J. (2013). Data Smart: Using Data Science to Transform Information into Insight. Wiley.
7. Interquartile Range. (n.d.). Retrieved December 18 from http://en.wikipedia.org/wiki/Interquartile_range
8. Lane, David M. (n.d.). Introduction: Analysis of Variance. Retrieved December 2, 2014, from http://onlinestatbook.com/2/analysis_of_variance/intro.html
9. Logos, T. (2009). Kruskal-Wallis One-Way Analysis of Variance. Retrieved June 1 from http://www.r-bloggers.com/kruskal-wallis-one-way-analysis-of-variance/
10. Mease, D. (2007). Google Tech Talks: Statistical Aspects of Data Mining. Retrieved November 1, 2014, from https://www.youtube.com/playlist?list=PLDA74C8620B138B61
11. Median vs. Average to Describe Normal. (n.d.). Retrieved December 15, 2014, from http://www.wcc.nrcs.usda.gov/normals/median_average.htm
12. Miller, T. (2013). Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (FT Press Analytics). Pearson FT Press.
13. Numerical Measures. (2016). Retrieved June 4 from http://www.r-tutor.com/elementary-statistics/numerical-measures
14. One Way Anova. (2016). Retrieved June 1, 2016, from http://www.stat.columbia.edu/~martin/W2024/R3.pdf
15. Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.
16. R Practice Exercises: Level 1 (beginners). (2015). Retrieved June 20, 2016, from http://rstatistics.net/r-lang-practice-exercises-level-1-beginners/
17. Simple Linear Regression with R. (2016). Retrieved July 17 from http://courses.statistics.com/software/R/R_Ch02.htm
18. Using R & Tableau. (n.d.). Retrieved November 12, 2014, from http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf
19. de Vries, A., & Meys, J. (2015). R For Dummies. John Wiley & Sons.
Appendix:
Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data
The entire code below was prepared by Ed Orlando with the assistance of the plyr documentation located at https://cran.r-project.org/web/packages/plyr/plyr.pdf

    ## sets working directory to desktop
    setwd("C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R")

    ## gets working directory
    getwd()

    ## reads the csv file into a data frame
    data1 <- read.csv("Regression Data Sets.csv")

    ## shows the first 6 lines of the data frame
    head(data1)

    ## shows the last 6 lines of the data frame
    tail(data1)

    ## loads the plyr library (install it once with install.packages("plyr"))
    ## so that multiple hotel stats can be run at one time
    library(plyr)

    ## more info at https://cran.r-project.org/web/packages/plyr/plyr.pdf
    ## For each subset of a data frame, apply function then combine results into a list.
    ## dlply is similar to by except that the results are returned in a different format.
    ## .data = data frame to be processed
    ## df = data frame
    ## .variables = variables to split data frame by, as as.quoted variables,
    ##              a formula or character vector
    ## lm = linear model
    models <- dlply(data1, "Hotel", function(df) lm(Y ~ X1 + X2, data = df))

    ## retrieves r square value
    r.square <- laply(models, function(mod) summary(mod)$r.squared)
    r.square

    ## retrieves adjusted r square value
    adj.r.square <- laply(models, function(mod) summary(mod)$adj.r.squared)
    adj.r.square

    ## retrieves multiple r (does not mean much in multiple regression)
    mult.r <- sqrt(r.square)
    mult.r

    ## retrieves predictive coefficients for each of the variables
    ## references the coefficient table by row, column (row 1 = intercept)
    Y1coef <- laply(models, function(mod) summary(mod)$coefficients[1,1])
    X1coef <- laply(models, function(mod) summary(mod)$coefficients[2,1])
    X2coef <- laply(models, function(mod) summary(mod)$coefficients[3,1])
    Y1coef
    X1coef
    X2coef

    ## retrieves standard errors for each of the coefficient variables
    ## references the coefficient table by row, column
    Y1StdErr <- laply(models, function(mod) summary(mod)$coefficients[1,2])
    X1StdErr <- laply(models, function(mod) summary(mod)$coefficients[2,2])
    X2StdErr <- laply(models, function(mod) summary(mod)$coefficients[3,2])
    Y1StdErr
    X1StdErr
    X2StdErr

    ## retrieves t stat for each of the coefficient variables
    ## references the coefficient table by row, column
    Y1tvalue <- laply(models, function(mod) summary(mod)$coefficients[1,3])
    X1tvalue <- laply(models, function(mod) summary(mod)$coefficients[2,3])
    X2tvalue <- laply(models, function(mod) summary(mod)$coefficients[3,3])
    Y1tvalue
    X1tvalue
    X2tvalue

    ## retrieves p value for each of the coefficient variables
    ## references the coefficient table by row, column
    Y1pvalue <- laply(models, function(mod) summary(mod)$coefficients[1,4])
    X1pvalue <- laply(models, function(mod) summary(mod)$coefficients[2,4])
    X2pvalue <- laply(models, function(mod) summary(mod)$coefficients[3,4])
    Y1pvalue
    X1pvalue
    X2pvalue

    ## retrieves degrees of freedom for SSR - variation explained by the regression
    ## (anova rows 1 and 2 are X1 and X2, so their Df values are added together)
    df.regress <- laply(models, function(mod) anova(mod)$"Df"[1]) +
      laply(models, function(mod) anova(mod)$"Df"[2])
    df.regress

    ## retrieves degrees of freedom for SSE - unexplained or error variation
    ## (anova row 3 is the Residuals row)
    df.resid <- laply(models, function(mod) anova(mod)$"Df"[3])
    df.resid

    ## total degrees of freedom
    df.total <- df.regress + df.resid
    df.total

    ## total observations
    obs <- df.total + 1
    obs

    ## SSR = sum of squared variation explained by the regression (anova rows 1 and 2)
    SSR <- laply(models, function(mod) anova(mod)$"Sum Sq"[1]) +
      laply(models, function(mod) anova(mod)$"Sum Sq"[2])
    SSR

    ## MSR = mean of sum of squared variation explained by the regression
    MSR <- SSR / df.regress
    MSR

    ## MSE = mean of sum of squared variation explained by the error (anova row 3)
    MSE <- laply(models, function(mod) anova(mod)$"Mean Sq"[3])
    MSE

    ## SSE - Sum of Squared error
    SSE <- MSE * df.resid
    SSE

    ## Total Sum of Squares
    SST <- SSE + SSR
    SST

    ## Standard Error of Model
    Std.Err.Model <- sqrt(SSE / (df.resid))
    Std.Err.Model

    ## F value of model
    fvalue <- MSR / MSE
    fvalue

    ## combines the statistics into one data frame (one row per hotel)
    output <- data.frame(
      mult.r, r.square, adj.r.square, Std.Err.Model, obs,
      df.regress, df.resid, df.total,
      SSR, SSE, SST, MSR, MSE, fvalue,
      Y1coef, X1coef, X2coef,
      Y1StdErr, X1StdErr, X2StdErr,
      X1tvalue, X2tvalue,
      Y1pvalue, X1pvalue, X2pvalue)
    print(output)
    str(output)

    ## writes the results to a csv file
    write.csv(output, "C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R/MultRegressOutput.csv", row.names = TRUE)