R - Research Study on the Powerful Statistical Programming Language


Author: Ed Orlando, Analytics and Business Intelligence Manager, White Lodging

Course: IDS 742: Tools for Decision Making

Professor: John Ward

Honor Code: I have neither given nor received unauthorized aid

Table of Contents

Executive Summary
High Level Description of R and Its History
R Can Be Passed Through Tableau to Create Visualizations
Instructions on how to pass R through Tableau
Basic Sample Code Examples
Statistical Coding Examples in R along with Tableau Translations
Case Study - Multiple Regression Results for Multiple Data Sets
Strengths and Weaknesses of R
Summary

Tables
Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure
Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando
Table 3: One Hotel Multiple Regression Model Example
Table 4: Microsoft Excel Data Analysis Multiple Regression Output
Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)
Table 6: Multiple Hotel / Multiple Regression Model Example
Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Graphs
Graph 1: Screenshot of R Interface
Graph 2: Screenshot of RStudio Interface

References

Appendix: Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data

Executive Summary

This paper provides a high-level description of R and its history. One section describes how R can connect to other software platforms, in particular Tableau. As I have been an R programmer for about two years, some basic and advanced coding examples are provided. In addition, one business case shows how R can assist with creating multiple regression models on multiple data sets (the code is listed in the Appendix). Lastly, some of the strengths and weaknesses of R are highlighted.

High Level Description of R and Its History

R is a statistical programming language developed by Ross Ihaka and Robert Gentleman, who were professors at the University of Auckland in New Zealand. They originally developed it in the early 1990s for their students to perform statistical calculations, and it was eventually released for public use in February 2000 (de Vries & Meys, p. 10). In 2012, 18 individuals had access to make changes to the R program. These people make up what is called the Core Development Team (de Vries & Meys, p. 10), and they are responsible for creating, modifying, and documenting the R packages and functionality.

R can be downloaded free of charge from www.r-project.org. One can work in the original R interface or in a more GUI (graphical user interface) oriented interface called RStudio, which is installed separately and runs on top of R. RStudio has gained popularity over the last few years since it is considered a little more user-friendly. My personal preference is the original R platform.

Graph 1: Screenshot of R Interface


Graph 2: Screenshot of RStudio Interface

R is open-source software that can be downloaded for many different types of computers and is available throughout most of the world. The fact that the software is free is one of its most attractive features. In addition, since the language is open source, the underlying code can be accessed by anyone (de Vries & Meys, p. 10-14). In other words, other developers can audit the packages and code to ensure that they are working properly. Lastly, it is estimated that more than 2 million people used R as of 2012, with online communities available for help as well as many books written on the language (de Vries & Meys, p. 10-14). David Mease, a prominent statistician who formerly worked for Google and Apple, is a strong advocate of the software. In addition, some of the top programming schools in the country include R in their curricula, including the University of Michigan (Mease, 2007).

R Can Be Passed Through Tableau to Create Visualizations

R can be connected to many different software packages. One example is the R plug-in for Tableau. Tableau is data visualization software that can be used to create graphs, visualizations, data stories, and quick filters that are much harder to build in R alone. In turn, many R packages and functions can be "passed through" to R from within Tableau. For example, a multiple correlation model can be coded in a Tableau calculated field that links to R. R then performs the statistical calculation with its own engine and passes the result back into Tableau. The instructions provided by Tableau, listed below, walk through how to connect to R through Tableau. As a side note, R can also be connected to Microsoft's Power BI, SPSS, and many other software packages.


Instructions on how to pass R through Tableau

Listed below are the exact instructions provided by Tableau (Using R & Tableau, 2014). They describe how to establish an R connection through Tableau.

How do I start using Tableau with R?

For users who are already familiar with R and its capabilities, it is fairly simple to establish the connection between R and Tableau. The instructions below are for new installations using the open-source version of R. Other options may be available using other packages, such as those from Revolution Analytics.

1. Download and Install R. Click here to find the file and instructions on downloading R.

2. Download and Install Rserve. You will need to install an Rserve for Tableau to connect to in order to utilize the new script functions. In the R console, enter the following commands:

   install.packages("Rserve")
   library(Rserve)
   Rserve()

3. Connect Tableau to the R Server. Once Rserve is installed, open Tableau Desktop and follow the steps below:

   a. Go to the Help menu and select "Manage R Connection".
   b. Enter a server name of "Localhost" (or "127.0.0.1") and a port of "6311".
   c. Click on the "Test Connection" button to make sure everything runs smoothly.

   You should see a successful message. Click OK to close.

4. Start using the R scripts in Tableau. Now you will be able to create new calculated fields in Tableau Desktop that utilize the SCRIPT_* functions to make R functional calls.

Please note that the above instructions were pulled directly from the Tableau document listed in the References, and none of the text was modified (Using R & Tableau, 2014).


Basic Sample Code Examples

Some of the most commonly used commands, with explanations, are listed below. These examples were derived from RStatistics.net (R Practice Exercises: Level 1 - Beginners, 2015).

Exercise 1: Calculate the square root of 729.
Answer:
   sqrt(729)

Exercise 2: Create a new variable 'b' with value 1947.0
Answer:
   b <- 1947.0

Exercise 3: Convert 'b' from the previous exercise to character
Answer:
   b <- as.character(b)
   print(b)

Exercise 4: Set up your working directory to a new 'work' folder on your desktop
Answer:
   setwd("path/to/my/desktop/work")
   getwd()

Exercise 5: Create a vector of numbers from 1 to 6
Answer:
   one_to_six <- c(1, 2, 3, 4, 5, 6)

Exercise 6: Random Sampling
Answer:
   mySample <- sample(1:100, 5, replace = TRUE)
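The answers to the exercises can be checked in a fresh R session. The sketch below is illustrative only: the variable names root, b_chr, same_values and my_sample, and the set.seed(42) call (added so that the random sample is repeatable), are assumptions and not part of the cited exercises.

```r
# Exercise 1: the square root of 729
root <- sqrt(729)

# Exercises 2 and 3: create a numeric value, then convert it to character
b <- 1947.0
b_chr <- as.character(b)

# Exercise 5: c(1, 2, ...) creates doubles while 1:6 creates integers,
# so the two are compared with all.equal() rather than identical()
one_to_six <- c(1, 2, 3, 4, 5, 6)
same_values <- isTRUE(all.equal(one_to_six, 1:6))

# Exercise 6: fix the seed (an assumption, for repeatability), then sample
set.seed(42)
my_sample <- sample(1:100, 5, replace = TRUE)

print(root)
print(b_chr)
print(class(b_chr))
print(same_values)
print(my_sample)
```

Exercise 4 is left out on purpose, since setwd() depends on a folder that exists on the reader's own machine.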


Statistical Coding Examples in R along with Tableau Translations

R requires many of its data sets to be columnized. For example, suppose you had two groups of data, Group A and Group B. In order to run an ANOVA or a similar grouped model on the data, it should be laid out like the first layout in Table 1. In other words, the group names need to be listed in the same column (vector), with the measured values in a second column, in order for the statistical tests to run properly.

Table 1: Example of a Proper Data Structure and an Example of an Improper Data Structure

R Required Layout (WILL WORK IN R)

Record | Group | Value
1 | A | 5
2 | A | 7
3 | A | 8
4 | A | 9
5 | A | 10
6 | A | 12
7 | A | 5
8 | A | 8
9 | A | 7
10 | B | 4
11 | B | 2
12 | B | 8
13 | B | 4
14 | B | 7
15 | B | 8
16 | B | 9
17 | B | 10
18 | B | 15

Typical Data Layout (WILL NOT WORK IN R)

Record | Group A | Group B
1 | 5 | 4
2 | 7 | 2
3 | 8 | 8
4 | 9 | 4
5 | 10 | 7
6 | 12 | 8
7 | 5 | 9
8 | 8 | 10
9 | 7 | 15

Table 2 lists some commonly used advanced statistical codes. Each row gives a description of the calculation or model, the R code, and the Tableau translation.
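For the two layouts in Table 1, the conversion from the typical (wide) layout to the layout R requires can be sketched with base R's stack() function. This is a minimal illustration under stated assumptions, not the author's method: the data frame names wide and long and the column names Value and Group are hypothetical, and the numbers are the Group A and Group B values from Table 1.

```r
# Typical (wide) layout from Table 1: one column per group
wide <- data.frame(
  GroupA = c(5, 7, 8, 9, 10, 12, 5, 8, 7),
  GroupB = c(4, 2, 8, 4, 7, 8, 9, 10, 15)
)

# stack() returns a long data frame with the columns `values` and `ind`;
# they are renamed here to match the layout R requires
long <- stack(wide)
names(long) <- c("Value", "Group")

# A one-way ANOVA can now be run on the long layout
anova_result <- oneway.test(Value ~ Group, data = long, var.equal = TRUE)

print(head(long))
print(anova_result$p.value)
```

The group labels come out as "GroupA" and "GroupB" (the wide column names) rather than "A" and "B", which does not affect the test.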


Statistical Description | Model | R Code | Tableau Code (translation was made by Ed Orlando through trial & error)
mean | Descriptive Analytics | mean(Column.A) | SCRIPT_REAL('mean(.arg1)',SUM([Column A]))
median | Descriptive Analytics | median(Column.A) | SCRIPT_REAL('median(.arg1)',SUM([Column A]))
max | Descriptive Analytics | max(Column.A) | SCRIPT_REAL('max(.arg1)',SUM([Column A]))
min | Descriptive Analytics | min(Column.A) | SCRIPT_REAL('min(.arg1)',SUM([Column A]))
sd | Descriptive Analytics | sd(Column.A) | SCRIPT_REAL('sd(.arg1)',SUM([Column A]))
length or count | Descriptive Analytics | length(Column.A) | SCRIPT_REAL('length(.arg1)',SUM([Column A]))
variance | Descriptive Analytics | var(Column.A) | SCRIPT_REAL('var(.arg1)',SUM([Column A]))
p-value | Simple Regression | cor.test(Column.A,Column.B)$p.value | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column A]),SUM([Column B]))
r squared | Simple Regression | summary(lm(Column.A~Column.B))$r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$r.squared',SUM([Column
Adjusted r squared | Simple Regression | summary(lm(Column.A~Column.B))$adj.r.squared | SCRIPT_REAL('summary(lm(.arg1~.arg2))$adj.r.squared',SUM([Column
df (Regression) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[1]',SUM([Column
df (Residual) | Simple Regression | anova(lm(Column.A~Column.B))$"Df"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Df"[2]',SUM([Column
df (Total) | Simple Regression | sum(anova(lm(Column.A~Column.B))$"Df") | ([df (Regression)])+([df (Residual)])
SSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[1]',SUM([Column
SSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Sum Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Sum Sq"[2]',SUM([Column
MSR (1st Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[1]',SUM([Column
MSE (2nd Row) | Simple Regression | anova(lm(Column.A~Column.B))$"Mean Sq"[2] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"Mean Sq"[2]',SUM([Column
F value (Top Row) | Simple Regression | anova(lm(Column.A~Column.B))$"F value"[1] | SCRIPT_REAL('anova(lm(.arg1~.arg2))$"F value"[1]',SUM([Column
p-value | Simple Regression | anova(lm(Column.A~Column.B))$"Pr(>F)"[1] | SCRIPT_REAL("cor.test(.arg1,.arg2)$p.value",SUM([Column
Intercept Variable Coefficient | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,1] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,1]',SUM([Column
Intercept Standard Error | | |
Intercept Stat | Simple Regression | summary(lm(Column.A~Column.B))$coefficients[1,3] | SCRIPT_REAL('summary(lm(.arg1~.arg2))$"coefficients"[1,3]',SUM([Column
Intercept P-value | | |
X1 Variable Coefficient | | |
X1 Variable Standard Error | | |
X1 Variable Stat | | |
X1 Variable P-value | | |
p-value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$p.value | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$p.value",SUM([Values]),ATTR([Data2]))
df (1st Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[1] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[1]",SUM([Values]),ATTR([Data2]))
df (2nd Row) | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$parameter[2] | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$parameter[2]",SUM([Values]),ATTR([Data2]))
F Value | ANOVA - One Way | oneway.test(Values~Data2,var.equal = TRUE)$statistic | SCRIPT_REAL("oneway.test(.arg1~.arg2, var.equal = TRUE)$statistic",SUM([Values]),ATTR([Data2]))
p-value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$p.value | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$p.value",SUM([Values]),ATTR([Data2]))
H value | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$statistic | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$statistic",SUM([Values]),ATTR([Data2]))
df | Kruskal Wallis Rank Sum ANOVA test | kruskal.test(Values, as.factor(Data2))$parameter | SCRIPT_REAL("kruskal.test(.arg1, as.factor(.arg2))$parameter",SUM([Values]),ATTR([Data2]))

Table 2: Popular Statistical Codes in R and Tableau Translations Provided by Ed Orlando (Logos, 2009) (Numerical Measures, 2016) (One Way Anova, 2016) (Simple Linear Regression with R, 2016)
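In a SCRIPT_REAL call, Tableau passes each aggregated field to R as a vector named .arg1, .arg2, and so on. The R expressions in Table 2 can therefore be tried outside Tableau by creating those vectors by hand. The sketch below does this under an assumption: the Group A and Group B values from Table 1 stand in for [Column A] and [Column B], purely for illustration.

```r
# Stand-ins for the vectors Tableau would pass in (values taken from Table 1)
.arg1 <- c(5, 7, 8, 9, 10, 12, 5, 8, 7)
.arg2 <- c(4, 2, 8, 4, 7, 8, 9, 10, 15)

# The same expressions that appear inside the SCRIPT_REAL strings
r_squared   <- summary(lm(.arg1 ~ .arg2))$r.squared
cor_p_value <- cor.test(.arg1, .arg2)$p.value
f_p_value   <- anova(lm(.arg1 ~ .arg2))$"Pr(>F)"[1]

print(r_squared)
print(cor_p_value)
print(f_p_value)
```

For a simple regression with one predictor, the correlation test and the regression F test give the same p-value, which is why the Tableau translation of the regression p-value in Table 2 can use cor.test().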


Case Study - Multiple Regression Results for Multiple Data Sets

There are situations where analysts or statisticians want to run a multiple regression model for a certain set of data. This can easily be accomplished in multiple software packages (Excel, SPSS, MegaStat, etc.) within a few seconds. However, most of the time this is done for one data set at a time. For example, if an analyst wanted to run a regression model on one hotel with the following data set, it could be accomplished very quickly and easily using Excel's Data Analysis ToolPak add-in. See the data and the multiple regression results below:

Table 3: One Hotel Multiple Regression Model Example

Hotel | Future Date | Y = Dependent Variable (Occupancy %) | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 32% | 58% | 52%
Courtyard A | 7/2/2016 | 33% | 62% | 68%
Courtyard A | 7/3/2016 | 41% | 52% | 48%
Courtyard A | 7/4/2016 | 74% | 78% | 72%
Courtyard A | 7/5/2016 | 74% | 76% | 62%
Courtyard A | 7/6/2017 | 67% | 69% | 58%
Courtyard A | 7/7/2017 | 31% | 32% | 38%

After running the Excel Data Analysis ToolPak multiple regression model, the following results are displayed.

Table 4: Microsoft Excel Data Analysis Multiple Regression Output (data set provided above)

The same data set was loaded into R and, after running some extensive code, the following output was written to a .csv file. As one can see, R produces many of the same statistics, and the values match the Excel output. The entire code that was used is listed in the Appendix.


Table 5: R Multiple Regression Output for One Hotel (entire code is listed in Appendix)

Obs | 1
mult.r | 0.86
r.square | 0.74
adj.r.square | 0.61
Std.Err.Model | 0.13
obs | 7
df.regress | 2
df.resid | 4
df.total | 6
SSR | 0.18
SSE | 0.06
SST | 0.25
MSR | 0.09
MSE | 0.02
fvalue | 5.71
Y1coef | 0.05
X1coef | 1.75
X2coef | (1.07)
Y1StdErr | 0.26
X1StdErr | 0.70
X2StdErr | 0.94

After reviewing the extensive code listed in the Appendix, one might ask why someone should use R to produce the same results. The code is much longer and more difficult to learn than a GUI software package such as Excel. However, the real benefit of R comes into play when you want to run this same type of regression model for 100 different hotels, all with different sets of data. In Excel, this exercise could be accomplished by running the same model over and over, but the process becomes very cumbersome and is prone to errors because of the many manual clicks and steps involved. The R code, applied to the data set in Table 6, can produce results for four hotels very quickly with no changes to the original code from the first example. Basically, it runs the regression model for Hotel A's data first, then Hotel B's, and repeats until the end of the data set. The results are provided within seconds.
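The "fit one model per hotel" idea can be sketched in a few lines of base R. This is a simplified illustration and not the Appendix code, which uses the plyr package: the helper name fit_one_hotel and the data frame name hotel_data are hypothetical, and only the Courtyard A rows from Table 3 are included (as fractions) because they are the only complete rows shown in the paper.

```r
# Courtyard A rows from Table 3, with occupancy expressed as fractions
hotel_data <- data.frame(
  Hotel = "Courtyard A",
  Y  = c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31),
  X1 = c(0.58, 0.62, 0.52, 0.78, 0.76, 0.69, 0.32),
  X2 = c(0.52, 0.68, 0.48, 0.72, 0.62, 0.58, 0.38)
)

# Hypothetical helper: fit one model and return a one-row summary
fit_one_hotel <- function(df) {
  fit <- summary(lm(Y ~ X1 + X2, data = df))
  data.frame(
    r.square     = fit$r.squared,
    adj.r.square = fit$adj.r.squared,
    X1coef       = fit$coefficients[2, 1],
    X2coef       = fit$coefficients[3, 1]
  )
}

# split() makes one data frame per hotel, lapply() fits each one,
# and do.call(rbind, ...) stacks the summaries into one row per hotel
per_hotel <- lapply(split(hotel_data, hotel_data$Hotel), fit_one_hotel)
results <- do.call(rbind, per_hotel)

print(round(results, 2))
```

Rounded to two decimals, the Courtyard A row should agree with the corresponding values in Table 5. If rows for more hotels are appended to hotel_data, the same code returns one row per hotel without modification.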


Table 6: Multiple Hotel / Multiple Regression Model Example

Hotel | Date | Y = Dependent Variable (Occupancy %) | X1 = Independent Variable (1 year prior Occupancy) | X2 = Independent Variable (2 years' prior Occupancy)
Courtyard A | 7/1/2016 | 0.32 | 0.58 | 0.52
 | | | 0.62 | 0.68
 | | | 0.52 | 0.48
 | | | 0.78 | 0.72
 | | | 0.76 | 0.62
 | | | 0.69 | 0.58
 | | | 0.32 | 0.38
Courtyard B | 7/1/2016 | 0.50 | 0.72 | 0.55
 | | | 0.77 | 0.71
 | | | 0.70 | 0.66
 | | | 0.84 | 0.86
 | | | 0.92 | 0.79
 | | | 0.84 | 0.74
 | | | 0.33 | 0.51
Courtyard C | 7/1/2016 | 0.51 | 0.87 | 0.58
 | | | 0.57 | 0.63
 | | | 0.53 | 0.64
 | | | 0.81 | 0.76
 | | | 0.75 | 0.70
 | | | 0.72 | 0.68
 | | | 0.15 | 0.48
Courtyard D | 7/1/2016 | 0.38 | 0.77 | 0.43
 | | | 0.45 | 0.60
 | | | 0.43 | 0.45
 | | | 0.74 | 0.72
 | | | 0.65 | 0.52
 | | | 0.61 | 0.59
 | | | 0.03 | 0.39


Listed below are the regression model statistics for the four data sets listed above. As mentioned above, the code runs the regression model for one hotel's set of data, records the results, and then runs it again for the next hotel's set of data. This is very useful if you have a model that is consistent, but there are dozens or even hundreds of data sets that each need the model built on their own unique data.

Table 7: R Multiple Regression Output for Multiple Hotels' Data Sets (entire code is listed in Appendix)

Statistic | Courtyard A | Courtyard B | Courtyard C | Courtyard D
mult.r | 0.861 | 0.892 | 0.907 | 0.828
r.square | 0.741 | 0.795 | 0.822 | 0.686
adj.r.square | 0.611 | 0.692 | 0.733 | 0.529
Std.Err.Model | 0.127 | 0.126 | 0.114 | 0.143
obs | 7 | 7 | 7 | 7
df.regress | 2 | 2 | 2 | 2
df.resid | 4 | 4 | 4 | 4
df.total | 6 | 6 | 6 | 6
SSR | 0.185 | 0.245 | 0.240 | 0.178
SSE | 0.065 | 0.063 | 0.052 | 0.082
SST | 0.250 | 0.308 | 0.292 | 0.260
MSR | 0.092 | 0.122 | 0.120 | 0.089
MSE | 0.016 | 0.016 | 0.013 | 0.020
fvalue | 5.712 | 7.746 | 9.254 | 4.370
Y1coef | 0.046 | (0.431) | (0.712) | (0.140)
X1coef | 1.750 | 0.272 | 0.197 | 0.493
X2coef | (1.074) | 1.250 | 1.800 | 0.602
Y1StdErr | 0.258 | 0.291 | 0.373 | 0.272
X1StdErr | 0.703 | 0.459 | 0.276 | 0.268
X2StdErr | 0.943 | 0.704 | 0.745 | 0.588
X1tvalue | 2.489 | 0.592 | 0.715 | 1.840
X2tvalue | (1.138) | 1.776 | 2.416 | 1.024
Y1pvalue | 0.868 | 0.213 | 0.129 | 0.633
X1pvalue | 0.068 | 0.586 | 0.514 | 0.140
X2pvalue | 0.319 | 0.150 | 0.073 | 0.364

Strengths and Weaknesses of R

Some of the key strengths of R are the following:

- Software is free / open source
- Adopted and utilized by more than 2 million people
- Several books and blogs can provide assistance with getting started
- Many of the latest statistical models are created and adopted first in R
- Easily customizable
- Handles big data well (see weaknesses below for looping performance)

Some of the key weaknesses of R are the following:

- R is a programming language with very few "drop-down" options (not a true GUI)
- R typically has a longer learning curve in the beginning compared to other software packages
- Packages can be updated or changed in ways that can impact your code
- The software is not supported in the sense that you can't hold any of the developers accountable. However, there are many ways to communicate suspected bugs in the code
- Filtering, sorting, data visualizations and other functions are definitely available and customizable in the software, but are harder to perform compared to other software packages such as Tableau
- Loops, although possible, take longer compared to other programming languages (Mease, 2007)
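The point about loops is commonly worked around in R by using vectorized functions, which operate on a whole vector in one call. The sketch below computes the same quantity both ways so the two styles can be compared. It makes no claim about actual run times; the variable names are illustrative, and the values are the Courtyard A occupancy figures from Table 3 expressed as fractions.

```r
# Courtyard A occupancy (Y) from Table 3, as fractions
values <- c(0.32, 0.33, 0.41, 0.74, 0.74, 0.67, 0.31)

# Explicit loop: accumulate the squared deviations one element at a time
sst_loop <- 0
for (v in values) {
  sst_loop <- sst_loop + (v - mean(values))^2
}

# Vectorized form: the same calculation applied to the whole vector at once
sst_vectorized <- sum((values - mean(values))^2)

print(sst_loop)
print(sst_vectorized)
```

Both results are the total sum of squares for Courtyard A, which is reported as SST in Table 7.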

Summary

R is an open-source statistical programming language that can be downloaded for free. It is extremely powerful and customizable, and it typically contains the latest statistical packages. Similar to other programming languages, R can be intimidating at first since one starts with only a blank screen. Although RStudio provides some GUI features, most of the language's strengths rely on the capabilities of the programmer. Thankfully, there are many communities, books, videos, and other sources that can assist. Once the data is loaded, normalized, and structured properly, the statistical summaries and tests can be run with little effort. R can also be connected with dozens of platforms, such as Tableau and Microsoft's Power BI, which are both data visualization software packages. Since it is a programming language, there are countless ways to customize visuals within R. Lastly, as shown above, R can help drive business decisions, perform text analytics on big data, and assist with predictive analytics in powerful ways.


References

1. Correlation (Pearson, Kendall, Spearman) (2014). Retrieved November 1, 2014, from http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/

2. Davenport, T. (2014). Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Harvard Business Review Press.

3. Davenport, T. (2007). Competing on Analytics: The New Science of Winning. Harvard Business Review Press.

4. Doane, S. (2013). Nonparametric Tests: Chapter 15 from Applied Statistics in Business and Economics. McGraw-Hill Companies.

5. Evans, J. (n.d.). Lesson 2: The Standard Deviation and the Normal Curve. Retrieved November 11, 2014, from http://www.fgse.nova.edu/edl/secure/stats/lesson2.htm

6. Foreman, J. (2013). Data Smart: Using Data Science to Transform Information into Insight. Wiley.

7. Interquartile Range (n.d.). Retrieved December 18th from http://en.wikipedia.org/wiki/Interquartile_range

8. Lane, David M. (n.d.). Introduction: Analysis of Variance. Retrieved December 2, 2014, from http://onlinestatbook.com/2/analysis_of_variance/intro.html

9. Logos, T. (2009). Kruskal-Wallis One-Way Analysis of Variance. Retrieved June 1 from http://www.r-bloggers.com/kruskal-wallis-one-way-analysis-of-variance/.

10. Mease, David. Google Tech Talks: Statistical Aspects of Data Mining (2007). Retrieved November 1, 2014 from https://www.youtube.com/playlist?list=PLDA74C8620B138B61

11. Median vs. Average to Describe Normal (n.d.). Retrieved December 15, 2014 from http://www.wcc.nrcs.usda.gov/normals/median_average.htm

12. Miller, T. (2013). Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (FT Press Analytics). Pearson FT Press.

13. Numerical Measures. (2016). Retrieved June 4 from http://www.r-tutor.com/elementary-statistics/numerical-measures

14. One Way Anova. (2016). Retrieved June 1, 2016 from http://www.stat.columbia.edu/~martin/W2024/R3.pdf

15. Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.

16. R Practice Exercises: Level 1 (beginners). (2015). Retrieved June 20, 2016 from http://rstatistics.net/r-lang-practice-exercises-level-1-beginners/.

17. Simple Linear Regression with R. (2016). Retrieved July 17 from http://courses.statistics.com/software/R/R_Ch02.htm

18. Using R & Tableau (n.d.). Retrieved November 12, 2014, from http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf

19. de Vries, A., & Meys, J. (2015). R for Dummies. John Wiley & Sons.


Appendix:

Sample of R Code for Multiple Regression Model Results for Multiple Classes of Data

The entire code below was prepared by Ed Orlando with the assistance of the documentation located at https://cran.r-project.org/web/packages/plyr/plyr.pdf

## sets working directory to desktop
setwd("C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R")

## gets working directory
getwd()

## reads the csv file into a data frame
data1 <- read.csv("Regression Data Sets.csv")

## shows the first 6 lines of the data frame
head(data1)

## shows the last 6 lines of the data frame
tail(data1)

## loads the plyr library so that multiple hotel stats
## can be run at one time
library(plyr)

## more info at https://cran.r-project.org/web/packages/plyr/plyr.pdf
## For each subset of a data frame, apply function then combine results into a list.
## dlply is similar to by except that the results are returned in a different format.
## .data = data frame to be processed
## df = data frame
## .variables = variables to split data frame by, as as.quoted variables, a formula or character
## vector
## lm = linear model
models <- dlply(data1, "Hotel", function(df) lm(Y ~ X1 + X2, data = df))

## retrieves r square value
r.square <- laply(models, function(mod) summary(mod)$r.squared)
r.square

## retrieves adjusted r square value
adj.r.square <- laply(models, function(mod) summary(mod)$adj.r.squared)
adj.r.square

## retrieves multiple r (does not mean much in multiple regression)
mult.r <- sqrt(r.square)
mult.r

## retrieves predictive coefficients for each of the variables
## references the data table by row, column
Y1coef <- laply(models, function(mod) summary(mod)$coefficients[1,1])
X1coef <- laply(models, function(mod) summary(mod)$coefficients[2,1])
X2coef <- laply(models, function(mod) summary(mod)$coefficients[3,1])
Y1coef
X1coef
X2coef

## retrieves standard errors for each of the coefficient variables
## references the data table by row, column
Y1StdErr <- laply(models, function(mod) summary(mod)$coefficients[1,2])
X1StdErr <- laply(models, function(mod) summary(mod)$coefficients[2,2])
X2StdErr <- laply(models, function(mod) summary(mod)$coefficients[3,2])
Y1StdErr
X1StdErr
X2StdErr

## retrieves t stat for each of the coefficient variables
## references the data table by row, column
Y1tvalue <- laply(models, function(mod) summary(mod)$coefficients[1,3])
X1tvalue <- laply(models, function(mod) summary(mod)$coefficients[2,3])
X2tvalue <- laply(models, function(mod) summary(mod)$coefficients[3,3])
Y1tvalue
X1tvalue
X2tvalue

## retrieves p value for each of the coefficient variables
## references the data table by row, column
Y1pvalue <- laply(models, function(mod) summary(mod)$coefficients[1,4])
X1pvalue <- laply(models, function(mod) summary(mod)$coefficients[2,4])
X2pvalue <- laply(models, function(mod) summary(mod)$coefficients[3,4])
Y1pvalue
X1pvalue
X2pvalue

## retrieves degrees of freedom for SSR - variation explained by the regression
## (anova rows 1 and 2, one row per predictor)
df.regress <- laply(models, function(mod) anova(mod)$"Df"[1]) +
  laply(models, function(mod) anova(mod)$"Df"[2])
df.regress

## retrieves degrees of freedom for SSE - unexplained or error variation
## (anova row 3, the residuals)
df.resid <- laply(models, function(mod) anova(mod)$"Df"[3])
df.resid

## total degrees of freedom
df.total <- df.regress + df.resid
df.total

## total observations
obs <- df.total + 1
obs

## SSR = sum of squared variation explained by the regression (anova rows 1 and 2)
SSR <- laply(models, function(mod) anova(mod)$"Sum Sq"[1]) +
  laply(models, function(mod) anova(mod)$"Sum Sq"[2])
SSR

## MSR = mean of sum of squared variation explained by the regression
MSR <- SSR / df.regress
MSR

## MSE = mean of sum of squared variation explained by the error (anova row 3)
MSE <- laply(models, function(mod) anova(mod)$"Mean Sq"[3])
MSE

## SSE - Sum of Squared error
SSE <- MSE * df.resid
SSE

## Total Sum of Squares
SST <- SSE + SSR
SST

## Standard Error of Model
Std.Err.Model <- sqrt(SSE / (df.resid))
Std.Err.Model

## F value of model
fvalue <- MSR / MSE
fvalue

output <- data.frame(
  mult.r, r.square, adj.r.square, Std.Err.Model,
  obs, df.regress, df.resid, df.total,
  SSR, SSE, SST, MSR, MSE, fvalue,
  Y1coef, X1coef, X2coef,
  Y1StdErr, X1StdErr, X2StdErr,
  X1tvalue, X2tvalue,
  Y1pvalue, X1pvalue, X2pvalue)

print(output)
str(output)

write.csv(output, "C:/Users/ed.orlando07/Desktop/Data Science/IDS742/Final Paper - R/MultRegressOutput.csv", row.names = TRUE)
