R - Research Study on the Powerful Statistical Programming Language

Author: Ed Orlando, Analytics and Business Intelligence Manager, White Lodging

Course: IDS 742: Tools for Decision Making

Professor: John Ward

Honor Code: I have neither given nor received unauthorized aid

Table of Contents

Executive Summary
High Level Description of R and Its History
R Can Be Passed Through Tableau to Create Visualizations
Instructions on how to pass R through Tableau
Basic Sample Code Examples
Statistical Coding Examples in R along with Tableau Translations
Case Study - Multiple Regression Results for Multiple Data Sets
Strengths and Weaknesses of R
Summary

Tables

Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure
Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando
Table 3: One Hotel Multiple Regression Model Example
Table 4: Microsoft Excel Data Analysis Multiple Regression Output
Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)
Table 6: Multiple Hotel / Multiple Regression Model Example
Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Graphs

Graph 1: Screenshot of R Interface
Graph 2: Screenshot of RStudio Interface

References

Appendix: Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data

Executive Summary

In this paper, a high-level description of R and its history is provided. One section describes how R can connect to various software platforms, in particular Tableau. As I have been an R programmer for about two years, some basic and advanced coding examples are provided. In addition, one business case shows how R can assist with creating multiple regression models on multiple data sets (the code is listed in the Appendix). Lastly, some of the strengths and weaknesses of R are highlighted.

High Level Description of R and Its History

R is a statistical programming language developed by Ross Ihaka and Robert Gentleman, who were professors at the University of Auckland in New Zealand. They originally developed it in the early 1990s for their students to perform statistical calculations, and it was eventually released for public use in February 2000 (Vries & Meys, p. 10). In 2012, 18 individuals had access to make changes to the R program. These people form what is called the Core Development Team (Vries & Meys, p. 10) and are responsible for creating, modifying, and documenting R's packages and functionality. Two different R interfaces are available free of charge: one can either download the original R interface from www.r-project.org or download a more GUI (graphical user interface) oriented interface referred to as RStudio. RStudio has gained popularity over the last few years since it is considered a little more user-friendly. My personal preference is the original R platform.

Graph 1: Screenshot of R Interface

Graph 2: Screenshot of RStudio Interface

R is open-source software that can be downloaded for many different types of computers and is available throughout most of the world. The fact that the software is free is one of its most attractive features. In addition, since the language is open source, the underlying code can be accessed by anyone (Vries & Meys, pp. 10-14). In other words, other developers can audit the packages and code to ensure that they are working properly. Lastly, it is estimated that more than 2 million people used R as of 2012, with online communities available for help as well as many books written on the language (Vries & Meys, pp. 10-14). One of the most decorated statisticians in the world, David Mease, who formerly worked for Google and Apple, is a huge advocate of the software. In addition, some of the top programming schools in the country include R in their curricula, including the University of Michigan (Mease, 2007).

R Can Be Passed Through Tableau to Create Visualizations

R can be connected to many different software packages. One example is the R plug-in for Tableau. Tableau is data visualization software that can be used to create graphs, visualizations, data stories, and quick filtering capabilities that are much harder to accomplish in R alone. However, many R packages can be "passed through" when working in Tableau. For example, a multiple correlation model can be coded in Tableau so that it links up to R. R then performs the statistical calculation with its engine and passes the result back into Tableau. The instructions provided by Tableau, listed below, walk through how to connect to R through Tableau. As a side note, R can also be connected to Microsoft's Power BI software and SPSS, as well as many other software packages.


Instructions on how to pass R through Tableau

Listed below are the exact instructions provided by Tableau (Using R & Tableau, 2014). The site describes how to establish an R connection through Tableau.

How do I start using Tableau with R?

For users who are already familiar with R and its capabilities, it is fairly simple to establish the connection between R and Tableau. The instructions below are for new installations using the open-source version of R. Other options may be available using other packages, such as those from Revolution Analytics.

1. Download and Install R. Click here to find the file and instructions on downloading R.

2. Download and Install Rserve. You will need to install an Rserve for Tableau to connect to in order to utilize the new script functions. In the R console, enter the following commands:

    install.packages("Rserve")
    library(Rserve)
    Rserve()

3. Connect Tableau to the R Server. Once Rserve is installed, open Tableau Desktop and follow the steps below:

    a. Go to the Help menu and select "Manage R Connection".
    b. Enter a server name of "Localhost" (or "127.0.0.1") and a port of "6311".
    c. Click on the "Test Connection" button to make sure everything runs smoothly. You should see a successful message. Click OK to close.

4. Start using the R scripts in Tableau. Now you will be able to create new calculated fields in Tableau Desktop that utilize the SCRIPT_* functions to make R functional calls.

Please note that the above instructions were pulled directly from the Tableau document cited above, and none of the text was modified (Using R & Tableau, 2014).


Basic Sample Code Examples

Some of the most commonly used functions, with explanations, are listed below. These examples were derived from RStatistics.net (R Practice Exercises: Level 1 (beginners), 2015).

Exercise 1: Calculate the square root of 729.
Answer:
    sqrt(729)

Exercise 2: Create a new variable 'b' with value 1947.0
Answer:
    b <- 1947.0

Exercise 3: Convert 'b' from the previous exercise to character
Answer:
    b <- as.character(b)
    print(b)

Exercise 4: Set up your working directory to a new 'work' folder on your desktop
Answer:
    setwd("path/to/my/desktop/work")
    getwd()

Exercise 5: Create a vector of numbers from 1 to 6
Answer:
    one_to_six <- c(1, 2, 3, 4, 5, 6)

Exercise 6: Random sampling
Answer:
    mySample <- sample(1:100, 5, replace = TRUE)
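A random sample like the one in Exercise 6 changes on every run. The short sketch below shows one common way to make such a draw repeatable with `set.seed()`; the seed value 42 and the variable names are arbitrary choices for illustration and do not come from the exercise source.

```r
# Fix the random seed so the same "random" draw can be reproduced later.
set.seed(42)  # 42 is an arbitrary, illustrative seed
first_draw <- sample(1:100, 5, replace = TRUE)

# Resetting the same seed and sampling again repeats the draw exactly.
set.seed(42)
second_draw <- sample(1:100, 5, replace = TRUE)

print(identical(first_draw, second_draw))
```

Without the calls to `set.seed()`, the two draws would generally differ.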


Statistical Coding Examples in R along with Tableau Translations

R requires many of its data sets to be "columnized". For example, suppose you have two groups of data, Group A and Group B. In order to run an ANOVA analysis or a correlation model on the data sets, the data should be laid out like the example on the left. In other words, the attribute names (here, the group labels) need to be listed in the same column or vector in order for the statistical tests to run properly.

Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure

R Required Layout            Typical Data Layout
WILL WORK IN R               WILL NOT WORK IN R

Record  Group  Value         Record  Group A  Group B
1       A      5             1       5        4
2       A      7             2       7        2
3       A      8             3       8        8
4       A      9             4       9        4
5       A      10            5       10       7
6       A      12            6       12       8
7       A      5             7       5        9
8       A      8             8       8        10
9       A      7             9       7        15
10      B      4
11      B      2
12      B      8
13      B      4
14      B      7
15      B      8
16      B      9
17      B      10
18      B      15

Table 2 below lists some commonly used advanced statistical codes. For each entry, the table gives a description of the calculation or model, the R code, and the Tableau translation.
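As a minimal sketch of the layout requirement in Table 1, the code below converts the "typical" side-by-side layout into the stacked layout and runs a one-way ANOVA on it. It uses base R's `stack()` function; the object names `wide`, `long`, and `anova_result` are illustrative and are not taken from the paper.

```r
# Wide layout from Table 1: one column per group (the "typical" layout).
wide <- data.frame(
  GroupA = c(5, 7, 8, 9, 10, 12, 5, 8, 7),
  GroupB = c(4, 2, 8, 4, 7, 8, 9, 10, 15)
)

# stack() is base R: it returns a data frame with columns `values` and `ind`,
# i.e. one row per observation with the group label in its own column.
long <- stack(wide)
names(long) <- c("Value", "Group")

# One-way ANOVA on the stacked layout (equal variances assumed).
anova_result <- oneway.test(Value ~ Group, data = long, var.equal = TRUE)
print(anova_result$p.value)
```

The stacked data frame has 18 rows, one per record, which matches the left-hand layout of Table 1.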

Statistical Description | Model | R Code | Tableau Code (translation was made by Ed Orlando through trial and error)
mean | Descriptive Analytics | mean(Column.A) | SCRIPT_REAL('mean(.arg1)',SUM([Column A]))
median | Descriptive Analytics | median(Column.A) | SCRIPT_REAL('median(.arg1)',SUM([Column A]))
max | Descriptive Analytics | max(Column.A) | SCRIPT_REAL('max(.arg1)',SUM([Column A]))
min | Descriptive Analytics | min(Column.A) | SCRIPT_REAL('min(.arg1)',SUM([Column A]))
sd | Descriptive Analytics | sd(Column.A) | SCRIPT_REAL('sd(.arg1)',SUM([Column A]))
length or count | Descriptive Analytics | length(Column.A) | SCRIPT_REAL('length(.arg1)',SUM([Column A]))
variance | Descriptive Analytics | var(Column.A) | SCRIPT_REAL('var(.arg1)',SUM([Column A]))
p-value | Simple Regression | cor.test(Column.A,Column.B)$p.value | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column A]),SUM([Column B]))
r squared | Simple Regression | summary(lm(Column.A~Column.B))$r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$r.squared',SUM([Column A]),SUM([Column B]))
Adjusted r squared | Simple Regression | summary(lm(Column.A~Column.B))$adj.r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$adj.r.squared',SUM([Column A]),SUM([Column B]))
df (Regression) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[1]',SUM([Column A]),SUM([Column B]))
df (Residual) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[2]',SUM([Column A]),SUM([Column B]))
df (Total) | Simple Regression | sum(anova(lm(Column.A~Column.B))$"Df") | ([df (Regression)])+([df (Residual)])
SSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[1]',SUM([Column A]),SUM([Column B]))
SSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[2]',SUM([Column A]),SUM([Column B]))
MSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[1]',SUM([Column A]),SUM([Column B]))
MSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[2]',SUM([Column A]),SUM([Column B]))
F value (Top Row) | Simple Regression | anova(lm(Column.A~Column.B))$"F value"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"F value"[1]',SUM([Column A]),SUM([Column B]))
p-value | Simple Regression | anova(lm(Column.A~Column.B))$"Pr(>F)"[1] | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column A]),SUM([Column B]))
Intercept Variable Coefficient | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,1] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,1]',SUM([Column A]),SUM([Column B]))
Intercept Standard Error | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,2] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,2]',SUM([Column A]),SUM([Column B]))
Intercept Stat | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,3] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,3]',SUM([Column A]),SUM([Column B]))
Intercept P-value | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,4] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,4]',SUM([Column A]),SUM([Column B]))
X1 Variable Coefficient | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,1] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,1]',SUM([Column A]),SUM([Column B]))
X1 Variable Standard Error | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,2] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,2]',SUM([Column A]),SUM([Column B]))
X1 Variable Stat | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,3] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,3]',SUM([Column A]),SUM([Column B]))
X1 Variable P-value | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[2,4] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[2,4]',SUM([Column A]),SUM([Column B]))
p-value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$p.value | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$p.value",SUM([Values]),ATTR([Data2]))
df (1st Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[1] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[1]",SUM([Values]),ATTR([Data2]))
df (2nd Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[2] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[2]",SUM([Values]),ATTR([Data2]))
F Value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$statistic | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$statistic",SUM([Values]),ATTR([Data2]))
p-value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$p.value | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$p.value",SUM([Values]),ATTR([Data2]))
H value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$statistic | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$statistic",SUM([Values]),ATTR([Data2]))
df | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$parameter | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$parameter",SUM([Values]),ATTR([Data2]))

Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando (Logos, 2009) (Numerical Measures, 2016) (One Way Anova, 2016) (Simple Linear Regression with R, 2016)
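The Tableau column of Table 2 wraps ordinary R expressions in a string, with `.arg1` and `.arg2` standing in for the fields passed after it. The sketch below imitates that hand-off in plain R so the expressions can be tried without Tableau. The data values are made up for illustration, and using `eval(parse(...))` is only a stand-in for what Rserve does on Tableau's behalf, not a description of its internals.

```r
# Illustrative data (not from the paper): two numeric columns of equal length.
.arg1 <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # stands in for SUM([Column A])
.arg2 <- c(1, 2, 3, 4, 5)             # stands in for SUM([Column B])

# The string inside SCRIPT_REAL('...') is ordinary R code, so it can be
# evaluated directly once .arg1 and .arg2 exist as vectors.
r_squared <- eval(parse(text = 'summary(lm(.arg1~.arg2))$r.squared'))
p_value   <- eval(parse(text = 'cor.test(.arg1,.arg2)$p.value'))

print(r_squared)
print(p_value)
```

Testing an expression this way first can make the trial and error in Tableau shorter, since syntax mistakes surface in the R console with a clearer error message.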


Case Study - Multiple Regression Results for Multiple Data Sets

There are situations where analysts or statisticians want to run a multiple regression model for a certain set of data. This can easily be accomplished in multiple software packages (Excel, SPSS, MegaStat, etc.) within a few seconds. However, most of the time this is done on one data set for one subject at a time. For example, if an analyst wanted to run a regression model on one hotel with the following data set, it could be accomplished very quickly and easily using Excel's Data Analysis ToolPak add-in. See the data and the multiple regression results below:

Table 3: One Hotel Multiple Regression Model Example

Hotel | Future Date | Y = Dependent Variable Occupancy % | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 32% | 58% | 52%
Courtyard A | 7/2/2016 | 33% | 62% | 68%
Courtyard A | 7/3/2016 | 41% | 52% | 48%
Courtyard A | 7/4/2016 | 74% | 78% | 72%
Courtyard A | 7/5/2016 | 74% | 76% | 62%
Courtyard A | 7/6/2017 | 67% | 69% | 58%
Courtyard A | 7/7/2017 | 31% | 32% | 38%

After running the Excel Data Analysis ToolPak multiple regression model, the following results table is displayed.

Table 4: Microsoft Excel Data Analysis Multiple Regression Output (data set provided above)

The same data set was loaded into R and, after running some extensive code, the following output was written to a .csv output file. As one can see, R produces many of the same statistics, and the values match. The entire code that was used is listed in the Appendix.
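A condensed sketch of that single-hotel fit is shown here, using the Courtyard A values from Table 3 typed in directly rather than read from a .csv file. The object names `hotel_a`, `fit`, `fit_summary`, and `coefs` are illustrative; the full code the paper actually used is in the Appendix.

```r
# Courtyard A values from Table 3, expressed as proportions.
hotel_a <- data.frame(
  Y  = c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31),
  X1 = c(0.58, 0.62, 0.52, 0.78, 0.76, 0.69, 0.32),
  X2 = c(0.52, 0.68, 0.48, 0.72, 0.62, 0.58, 0.38)
)

# Multiple regression of occupancy on the two prior-year occupancies.
fit <- lm(Y ~ X1 + X2, data = hotel_a)
fit_summary <- summary(fit)

r.square     <- fit_summary$r.squared
adj.r.square <- fit_summary$adj.r.squared
df.resid     <- fit$df.residual        # observations minus estimated coefficients
df.regress   <- length(coef(fit)) - 1  # number of predictors
coefs        <- fit_summary$coefficients  # estimate, std. error, t value, p value

print(round(c(r.square = r.square, adj.r.square = adj.r.square), 3))
print(round(coefs, 3))
```

The rows of `coefs` are the intercept, X1, and X2, in that order, which is why the Appendix code indexes the coefficient matrix by row and column.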

Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)

Statistic | Obs 1
mult.r | 0.86
r.square | 0.74
adj.r.square | 0.61
Std.Err.Model | 0.13
obs | 7
df.regress | 2
df.resid | 4
df.total | 6
SSR | 0.18
SSE | 0.06
SST | 0.25
MSR | 0.09
MSE | 0.02
fvalue | 5.71
Y1coef | 0.05
X1coef | 1.75
X2coef | (1.07)
Y1StdErr | 0.26
X1StdErr | 0.70
X2StdErr | 0.94

After reviewing the extensive code listed in the Appendix, one might ask why someone should use R to produce the same results. The code is much more extensive and more difficult to learn compared to a GUI software package such as Excel. However, the real benefit of R comes into play when you want to run this same type of regression model for 100 different hotels, each with a different set of data. In Excel, this exercise could be accomplished by running the same model over and over, but the process becomes very cumbersome and is prone to errors due to the many manual clicks and steps involved. The R code, using the new data set below, can produce results for all four hotels very quickly with no changes to the original code from the first example. Basically, it runs the regression model on Hotel A's data, then Hotel B's, and repeats until the end of the data set. The results are provided within seconds.

Table 6: Multiple Hotel / Multiple Regression Model Example

Hotel | Date | Y = Dependent Variable Occupancy % | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 0.32 | 0.58 | 0.52
Courtyard A | 7/2/2016 | 0.33 | 0.62 | 0.68
Courtyard A | 7/3/2016 | 0.41 | 0.52 | 0.48
Courtyard A | 7/4/2016 | 0.74 | 0.78 | 0.72
Courtyard A | 7/5/2016 | 0.74 | 0.76 | 0.62
Courtyard A | 7/6/2017 | 0.67 | 0.69 | 0.58
Courtyard A | 7/7/2017 | 0.31 | 0.32 | 0.38
Courtyard B | 7/1/2016 | 0.50 | 0.72 | 0.55
Courtyard B | 7/2/2016 | 0.44 | 0.77 | 0.71
Courtyard B | 7/3/2016 | 0.57 | 0.70 | 0.66
Courtyard B | 7/4/2016 | 0.92 | 0.84 | 0.86
Courtyard B | 7/5/2016 | 0.87 | 0.92 | 0.79
Courtyard B | 7/6/2017 | 0.77 | 0.84 | 0.74
Courtyard B | 7/7/2017 | 0.33 | 0.33 | 0.51
Courtyard C | 7/1/2016 | 0.51 | 0.87 | 0.58
Courtyard C | 7/2/2016 | 0.38 | 0.57 | 0.63
Courtyard C | 7/3/2016 | 0.45 | 0.53 | 0.64
Courtyard C | 7/4/2016 | 0.88 | 0.81 | 0.76
Courtyard C | 7/5/2016 | 0.78 | 0.75 | 0.70
Courtyard C | 7/6/2017 | 0.66 | 0.72 | 0.68
Courtyard C | 7/7/2017 | 0.27 | 0.15 | 0.48
Courtyard D | 7/1/2016 | 0.38 | 0.77 | 0.43
Courtyard D | 7/2/2016 | 0.31 | 0.45 | 0.60
Courtyard D | 7/3/2016 | 0.36 | 0.43 | 0.45
Courtyard D | 7/4/2016 | 0.69 | 0.74 | 0.72
Courtyard D | 7/5/2016 | 0.71 | 0.65 | 0.52
Courtyard D | 7/6/2017 | 0.48 | 0.61 | 0.59
Courtyard D | 7/7/2017 | 0.13 | 0.03 | 0.39
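The "one model per hotel" idea can be sketched without any add-on package. The Appendix code uses `dlply` and `laply` from the plyr package; the version below swaps in base R's `split()`, `lapply()`, and `sapply()` as an alternative, and includes only the first two hotels from Table 6 to keep it short. The choice of which statistics to collect is illustrative.

```r
# First two hotels from Table 6 (Courtyard A and Courtyard B).
data1 <- data.frame(
  Hotel = rep(c("Courtyard A", "Courtyard B"), each = 7),
  Y  = c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31,
         0.50, 0.44, 0.57, 0.92, 0.87, 0.77, 0.33),
  X1 = c(0.58, 0.62, 0.52, 0.78, 0.76, 0.69, 0.32,
         0.72, 0.77, 0.70, 0.84, 0.92, 0.84, 0.33),
  X2 = c(0.52, 0.68, 0.48, 0.72, 0.62, 0.58, 0.38,
         0.55, 0.71, 0.66, 0.86, 0.79, 0.74, 0.51)
)

# split() makes a list of data frames keyed by hotel; lapply() fits one
# regression model to each of them.
models <- lapply(split(data1, data1$Hotel),
                 function(df) lm(Y ~ X1 + X2, data = df))

# Collect a few statistics into one row per hotel.
output <- data.frame(
  r.square     = sapply(models, function(mod) summary(mod)$r.squared),
  adj.r.square = sapply(models, function(mod) summary(mod)$adj.r.squared),
  df.resid     = sapply(models, function(mod) mod$df.residual)
)
print(round(output, 3))
```

Adding more hotels only requires more rows in `data1`; the model-fitting and summary lines stay the same, which is the point the case study makes.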

Listed below are the regression model statistics for the four data sets listed above. As mentioned above, the code runs the regression model for one set of data, records the results, and then runs it again for the next hotel's set of data. This is very useful if you have a model that is consistent, but there are dozens or even hundreds of data sets that each need the model built on their own data.

Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Statistic | Courtyard A | Courtyard B | Courtyard C | Courtyard D
mult.r | 0.861 | 0.892 | 0.907 | 0.828
r.square | 0.741 | 0.795 | 0.822 | 0.686
adj.r.square | 0.611 | 0.692 | 0.733 | 0.529
Std.Err.Model | 0.127 | 0.126 | 0.114 | 0.143
obs | 7 | 7 | 7 | 7
df.regress | 2 | 2 | 2 | 2
df.resid | 4 | 4 | 4 | 4
df.total | 6 | 6 | 6 | 6
SSR | 0.185 | 0.245 | 0.240 | 0.178
SSE | 0.065 | 0.063 | 0.052 | 0.082
SST | 0.250 | 0.308 | 0.292 | 0.260
MSR | 0.092 | 0.122 | 0.120 | 0.089
MSE | 0.016 | 0.016 | 0.013 | 0.020
fvalue | 5.712 | 7.746 | 9.254 | 4.370
Y1coef | 0.046 | (0.431) | (0.712) | (0.140)
X1coef | 1.750 | 0.272 | 0.197 | 0.493
X2coef | (1.074) | 1.250 | 1.800 | 0.602
Y1StdErr | 0.258 | 0.291 | 0.373 | 0.272
X1StdErr | 0.703 | 0.459 | 0.276 | 0.268
X2StdErr | 0.943 | 0.704 | 0.745 | 0.588
X1tvalue | 2.489 | 0.592 | 0.715 | 1.840
X2tvalue | (1.138) | 1.776 | 2.416 | 1.024
Y1pvalue | 0.868 | 0.213 | 0.129 | 0.633
X1pvalue | 0.068 | 0.586 | 0.514 | 0.140
X2pvalue | 0.319 | 0.150 | 0.073 | 0.364

Strengths and Weaknesses of R

Some of the key strengths of R are the following:

- Software is free / open source
- Adopted and utilized by more than 2 million people
- Several books and blogs can provide assistance with getting started
- Many of the latest and greatest statistical models are adopted and created first in R
- Easily customizable
- Handles big data well (see weaknesses below for looping performance)

Some of the key weaknesses of R are the following:

- R is a programming language with very few "drop-down" options (not a true GUI interface)
- R typically has a longer learning curve in the beginning compared to other software packages
- Packages can be updated or changed in ways that can impact your code
- The software is not supported in the sense that you can't hold any of the developers accountable. However, there are many ways to communicate suspected bugs in the code
- Filtering, sorting, data visualizations, and other functions are definitely available and customizable in the software, but are harder to perform compared to other software packages such as Tableau
- Loops, although possible, take longer compared to other programming languages (Mease, 2007)
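The looping weakness is often sidestepped by using R's vectorized functions, which operate on a whole vector in one call. The sketch below computes the same sum of squares both ways; the vector `x` and the variable names are illustrative and do not come from the cited talk.

```r
x <- as.numeric(1:100000)

# Explicit loop: accumulate the sum of squares one element at a time.
loop_total <- 0
for (value in x) {
  loop_total <- loop_total + value^2
}

# Vectorized form: the same calculation in a single call.
vector_total <- sum(x^2)

print(all.equal(loop_total, vector_total))
```

Wrapping each version in `system.time()` is a simple way to compare how long the two approaches take on a given machine.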

Summary

R is an open-source statistical programming language that can be downloaded for free. It is extremely powerful and customizable and typically contains the latest statistical packages. Similar to other programming languages, R can be intimidating at first since one starts with only a blank screen. Although RStudio provides some GUI features, most of the language's strengths rely on the capabilities of the programmer. Thankfully, there are many communities, books, videos, and other sources that can assist. Once the data is loaded, normalized, and structured properly, the statistical summaries and tests can be run with little effort. R can also be connected with dozens of platforms, such as Tableau and Microsoft's Power BI, which are both data visualization software packages. Since it is a programming language, there are millions of ways to customize visuals within R. Lastly, as shown above, R can help drive business decisions, perform text analytics on big data, and assist with predictive analytics in powerful ways.

References

1. Correlation (Pearson, Kendall, Spearman) (2014). Retrieved November 1, 2014, from http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/

2. Davenport, T. (2014). Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Harvard Business Review Press.

3. Davenport, T. (2007). Competing on Analytics: The New Science of Winning. Harvard Business Review Press.

4. Doane, S. (2013). Nonparametric Tests: Chapter 15 from Applied Statistics in Business and Economics. McGraw-Hill Companies.

5. Evans, J. (n.d.). Lesson 2: The Standard Deviation and the Normal Curve. Retrieved November 11, 2014, from http://www.fgse.nova.edu/edl/secure/stats/lesson2.htm

6. Foreman, J. (2013). Data Smart: Using Data Science to Transform Information into Insight. Wiley.

7. Interquartile Range (n.d.). Retrieved December 18th from http://en.wikipedia.org/wiki/Interquartile_range

8. Lane, David M. (n.d.). Introduction: Analysis of Variance. Retrieved December 2, 2014, from http://onlinestatbook.com/2/analysis_of_variance/intro.html

9. Logos, T. (2009). Kruskal-Wallis One-Way Analysis of Variance. Retrieved June 1 from http://www.r-bloggers.com/kruskal-wallis-one-way-analysis-of-variance/

10. Mease, David (2007). Google Tech Talks: Statistical Aspects of Data Mining. Retrieved November 1, 2014, from https://www.youtube.com/playlist?list=PLDA74C8620B138B61

11. Median vs. Average to Describe Normal (n.d.). Retrieved December 15, 2014, from http://www.wcc.nrcs.usda.gov/normals/median_average.htm

12. Miller, T. (2013). Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (FT Press Analytics). Pearson FT Press.

13. Numerical Measures (2016). Retrieved June 4 from http://www.r-tutor.com/elementary-statistics/numerical-measures

14. One Way Anova (2016). Retrieved June 1, 2016, from http://www.stat.columbia.edu/~martin/W2024/R3.pdf

15. Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.

16. R Practice Exercises: Level 1 (beginners) (2015). Retrieved June 20, 2016, from http://rstatistics.net/r-lang-practice-exercises-level-1-beginners/

17. Simple Linear Regression with R (2016). Retrieved July 17 from http://courses.statistics.com/software/R/R_Ch02.htm

18. Using R & Tableau (n.d.). Retrieved November 12, 2014, from http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf

19. Vries, A. & Meys, J. (2015). R for Dummies. John Wiley & Sons.

Appendix:

Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data

The entire code developed below was prepared by Ed Orlando with the assistance of the site located at https://cran.r-project.org/web/packages/plyr/plyr.pdf

    ## sets working directory to desktop
    setwd("C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R")

    ## gets working directory
    getwd()

    ## reads in csv file into a data frame
    data1 <- read.csv("Regression Data Sets.csv")

    ## shows the first 6 lines of the data frame
    head(data1)

    ## shows the last 6 lines of the data frame
    tail(data1)

    ## loads the plyr library so that multiple hotel stats
    ## can be run at one time
    library(plyr)

    ## more info at https://cran.r-project.org/web/packages/plyr/plyr.pdf
    ## For each subset of a data frame, apply function then combine results into a list.
    ## dlply is similar to by except that the results are returned in a different format.
    ## .data = data frame to be processed
    ## df = data frame
    ## .variables = variables to split data frame by, as as.quoted variables,
    ##              a formula or character vector
    ## lm = linear model
    models <- dlply(data1, "Hotel", function(df) lm(Y ~ X1 + X2, data = df))

    ## retrieves r square value
    r.square <- laply(models, function(mod) summary(mod)$r.squared)
    r.square

    ## retrieves adjusted r square value
    adj.r.square <- laply(models, function(mod) summary(mod)$adj.r.squared)
    adj.r.square

    ## retrieves multiple r (does not mean much in multiple regression)
    mult.r <- sqrt(r.square)
    mult.r

    ## retrieves predictive coefficients for each of the variables
    ## references the data table by row, column
    Y1coef <- laply(models, function(mod) summary(mod)$coefficients[1,1])
    X1coef <- laply(models, function(mod) summary(mod)$coefficients[2,1])
    X2coef <- laply(models, function(mod) summary(mod)$coefficients[3,1])
    Y1coef
    X1coef
    X2coef

    ## retrieves standard errors for each of the coefficient variables
    ## references the data table by row, column
    Y1StdErr <- laply(models, function(mod) summary(mod)$coefficients[1,2])
    X1StdErr <- laply(models, function(mod) summary(mod)$coefficients[2,2])
    X2StdErr <- laply(models, function(mod) summary(mod)$coefficients[3,2])
    Y1StdErr
    X1StdErr
    X2StdErr

    ## retrieves t stat for each of the coefficient variables
    ## references the data table by row, column
    Y1tvalue <- laply(models, function(mod) summary(mod)$coefficients[1,3])
    X1tvalue <- laply(models, function(mod) summary(mod)$coefficients[2,3])
    X2tvalue <- laply(models, function(mod) summary(mod)$coefficients[3,3])
    Y1tvalue
    X1tvalue
    X2tvalue

    ## retrieves p value for each of the coefficient variables
    ## references the data table by row, column
    Y1pvalue <- laply(models, function(mod) summary(mod)$coefficients[1,4])
    X1pvalue <- laply(models, function(mod) summary(mod)$coefficients[2,4])
    X2pvalue <- laply(models, function(mod) summary(mod)$coefficients[3,4])
    Y1pvalue
    X1pvalue
    X2pvalue

    ## retrieves degrees of freedom for SSR - variation explained by the regression
    ## (anova rows 1 and 2, one row per predictor)
    df.regress <- laply(models, function(mod) anova(mod)$"Df"[1]) +
      laply(models, function(mod) anova(mod)$"Df"[2])
    df.regress

    ## retrieves degrees of freedom for SSE - unexplained or error variation
    ## (anova row 3, the residuals)
    df.resid <- laply(models, function(mod) anova(mod)$"Df"[3])
    df.resid

    ## total degrees of freedom
    df.total <- df.regress + df.resid
    df.total

    ## total observations
    obs <- df.total + 1
    obs

    ## SSR = sum of squared variation explained by the regression (anova rows 1 and 2)
    SSR <- laply(models, function(mod) anova(mod)$"Sum Sq"[1]) +
      laply(models, function(mod) anova(mod)$"Sum Sq"[2])
    SSR

    ## MSR = mean of sum of squared variation explained by the regression
    MSR <- SSR / df.regress
    MSR

    ## MSE = mean of sum of squared variation explained by the error (anova row 3)
    MSE <- laply(models, function(mod) anova(mod)$"Mean Sq"[3])
    MSE

    ## SSE - Sum of Squared error
    SSE <- MSE * df.resid
    SSE

    ## Total Sum of Squares
    SST <- SSE + SSR
    SST

    ## Standard Error of Model
    Std.Err.Model <- sqrt(SSE / (df.resid))
    Std.Err.Model

    ## F value of model
    fvalue <- MSR / MSE
    fvalue

    output <- data.frame(
      mult.r, r.square, adj.r.square, Std.Err.Model,
      obs, df.regress, df.resid, df.total,
      SSR, SSE, SST, MSR, MSE, fvalue,
      Y1coef, X1coef, X2coef,
      Y1StdErr, X1StdErr, X2StdErr,
      X1tvalue, X2tvalue,
      Y1pvalue, X1pvalue, X2pvalue)
    print(output)
    str(output)

    write.csv(output, "C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R/MultRegressOutput.csv", row.names = TRUE)