Big Data – To Explain or To Predict?
Big Data Experts Speaker Series Rotman School of Management, U Toronto, March 2016
Galit Shmueli (徐茉莉), www.galitshmueli.com
❶ 1994-2000 Israel Institute of Technology: MSc + PhD, Statistics
❷ 2000-2002 Carnegie Mellon Univ.: Visiting Assistant Prof., Dept. of Statistics
❸ 2002-2012 Univ. of Maryland College Park: Assistant then Associate Prof. of Statistics & Management Science, R H Smith School of Business
2008-2014 Rigsum Institute (Bhutan): Co-Director, Rigsum Research Lab
❹ 2011-2014 Indian School of Business: SRITNE Chaired Prof. of Data Analytics, Associate Prof. of Statistics & Info Systems
2014-… NTHU, Institute of Service Science: Director, Center for Service Innovation & Analytics
Research in Data Analytics
‘Entrepreneurial’ statistical & data mining modeling (for today’s problems)
Interdisciplinary modeling
Statistical Strategy
To Explain or To Predict?
Information Quality
Regression with Big Data
Road Map
Definitions
Explanatory-dominated social sciences
Explanatory modeling ≠ predictive modeling
Why?
Different modeling paths
Explanatory power vs. predictive power
Implications
Definitions
Explanatory modeling: Theory-based, statistical testing of causal hypotheses
Explanatory power: Strength of relationship in statistical model
Predictive modeling: Empirical method for predicting new observations
Predictive power: Ability to accurately predict new observations
Explain / Predict / Describe
Matching Game
Social Sciences
Machine learning
Statistics
Statistical modeling in social sciences & management research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
Start with a causal theory
Generate causal hypotheses on constructs
Operationalize constructs → Measurable variables
Fit statistical model
Statistical inference → Causal conclusions
Classic journal paper
In the social sciences,
data analysis is mainly used for testing causal theory.
“If it explains, it predicts”
“Empirical prediction alone is un-scientific”
Some statisticians share this view:
The two goals in analyzing data... I prefer to describe as “management” and “science”. Management seeks profit... Science seeks truth.
- Parzen, Statistical Science 2001
Prediction in top research journals in Information Systems
Predictive goal?
Predictive modeling?
Predictive assessment?
1990-2006
52 “predictive” articles among 1,072 in Information Systems top journals
“A good explanatory model will also predict well”
“You must understand the underlying causes in order to predict”
Meanwhile… in industry
Philosophy of Science
“Explanation and prediction have the same logical structure”
Hempel & Oppenheim, 1948
“It becomes pertinent to investigate the possibilities of predictive procedures autonomous of those used for explanation”
Helmer & Rescher, 1959
“Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding”
Dubin, Theory Building, 1969
Why statistical
explanatory modeling differs from
predictive modeling
Explanatory Model: Test/quantify causal effect for “average” record in population
Predictive Model: Predict new individual observations
Different Scientific Goals
Different generalization
Theory vs. its manifestation
?
Four aspects
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias - Variance
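The bias-variance aspect can be made concrete with the standard decomposition of expected prediction error. As an assumption for this sketch (the notation is not from the slides), suppose $Y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$, and let $\hat{f}(x)$ be the model estimated from data:

```latex
\begin{aligned}
\mathrm{EPE}(x) &= E\big[(Y - \hat{f}(x))^2\big] \\
  &= \underbrace{\mathrm{Var}(Y)}_{\text{irreducible noise}}
   + \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
\end{aligned}
```

Roughly, explanatory modeling concentrates on keeping the bias term small, while predictive modeling minimizes the sum of bias² and variance, so it may accept some bias in exchange for lower variance.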
“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”
Best explanatory model ≠ Best predictive model
Point #1
Predict ≠ Explain
+ ?
“we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However… they could not help at all for improving the [predictive] accuracy.”
Bell et al., 2008
Explain ≠ Predict
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability and bioequivalence”
Lester M. Crawford, 2005
Acting Commissioner of Food & Drugs
“For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients.
But now we know much more: we know that it’s 100% effective in 70%-80% of the patients, and ineffective in the rest.”
Goal Definition
Design & Collection
Data Preparation
EDA
Variables?
Methods?
Evaluation, Validation & Model Selection
Model Use & Reporting
Study design & data collection
Hierarchical data
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
How much data?
How to sample?
Data Preprocessing
reduced-feature models
missing
partitioning
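Partitioning, the predictive-modeling step listed above, could look roughly like the following. This is a minimal sketch using only the Python standard library; the function name `partition`, the 30% holdout share, and the seed are illustrative assumptions, not values from the talk.

```python
import random

def partition(records, holdout_fraction=0.3, seed=0):
    """Randomly split records into (training, holdout) lists."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(records)                 # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical usage: ten record ids, 30% held out for predictive assessment.
train, holdout = partition(range(10), holdout_fraction=0.3, seed=42)
print(len(train), len(holdout))
```

The holdout records are set aside before any model is fit and are used only to assess predictive power.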
Data exploration, viz, reduction
PCA
Factor Analysis (interpretable)
Dimension Reduction (fast, small)
Which Variables?
Multicollinearity?
causation / associations
endogeneity / ex-post availability
A, B, A*B?
Methods / Models
Blackbox / interpretable
Mapping to theory
ensembles
Shrinkage models
variance / bias
Evaluation, Validation & Model Selection
Training data
Empirical model
Holdout data
Predictive power
Over-fitting analysis
Theoretical model
Empirical model
Data
Validation
Model fit ≠ Explanatory power
Inference
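An over-fitting analysis compares error on the training data with error on the holdout data. The sketch below illustrates the idea with a k-nearest-neighbour predictor on simulated data. The data-generating process (y = sin(x) + noise), the sample sizes, and the two values of k are assumptions chosen for illustration, not material from the talk.

```python
import math
import random

rng = random.Random(1)

def simulate(n):
    """Hypothetical process: y = sin(x) + Gaussian noise (sd 0.5)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        points.append((x, math.sin(x) + rng.gauss(0, 0.5)))
    return points

def knn_predict(train, x, k):
    """Average the y-values of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, data, k):
    """Root mean squared error of the k-NN predictor on `data`."""
    squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in data]
    return math.sqrt(sum(squared) / len(squared))

train_data = simulate(200)
holdout_data = simulate(200)

# (training RMSE, holdout RMSE) for a very flexible and a smoother model.
results = {k: (rmse(train_data, train_data, k), rmse(train_data, holdout_data, k))
           for k in (1, 15)}
for k, (train_err, holdout_err) in results.items():
    print(f"k={k:2d}  training RMSE={train_err:.3f}  holdout RMSE={holdout_err:.3f}")
```

With k=1 every training point is its own nearest neighbour, so the training error is zero while the holdout error is not. That gap is the over-fitting signal, and it is why predictive power is measured on holdout data rather than by model fit.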
Model Use: Industry
Identify causal factors
generate predictions for new data
Predictive performance
Over-fitting analysis
Null hypothesis
Naïve/baseline
Inference
Model Use (Science)
test causal theory
generate new theory
develop measures
compare theories
improve theory
assess relevance
Evaluate predictability
Predictive performance
Over-fitting analysis
Null hypothesis
Naïve/baseline
Point #2
Explanatory Power ≠ Predictive Power
Cannot infer one from the other
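One way to see why neither kind of power can be inferred from the other: with a large sample, a weak effect can be highly statistically significant and still barely improve predictions of new observations. The sketch below simulates that situation with a simple one-predictor regression. The slope of 0.15, the sample sizes, and the seed are illustrative assumptions, not results from the talk.

```python
import math
import random

rng = random.Random(7)

def simulate(n, slope=0.15):
    """Hypothetical data: y = slope * x + standard normal noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_simple_ols(xs, ys):
    """Return intercept, slope, and the slope's t-statistic."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / std_err

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

train_x, train_y = simulate(5000)
hold_x, hold_y = simulate(5000)

# Explanatory view: is the effect statistically significant in the training sample?
intercept, slope, t_stat = fit_simple_ols(train_x, train_y)

# Predictive view: holdout error of the model vs. a naive baseline (training mean).
model_rmse = rmse(hold_y, [intercept + slope * x for x in hold_x])
naive_rmse = rmse(hold_y, [sum(train_y) / len(train_y)] * len(hold_y))
improvement = 1 - model_rmse / naive_rmse

print(f"slope={slope:.3f}  t={t_stat:.1f}")
print(f"holdout RMSE: model={model_rmse:.3f}  naive={naive_rmse:.3f}  "
      f"improvement={improvement:.1%}")
```

Here the t-statistic speaks to explanatory power in the sample, while the comparison with the naive baseline on holdout data speaks to predictive power. The two are computed differently and answer different questions.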
out-of-sample
Performance Metrics
type I,II errors
goodness-of-fit
p-values
over-fitting
costs
prediction accuracy
interpretation
Training vs. holdout
R²
[Figure: Predictive Power plotted against Explanatory Power]
The predictive power of an explanatory model has important scientific value
Relevance, reality check, predictability
Current state in academia (social sciences and management)
“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”
Helmer & Rescher, 1959
Distinction blurred
Unfamiliarity with predictive modeling/assessment
Prediction underappreciated
State-of-the-art in industry
Distinction blurred
Prediction over-appreciated
“Big Data” synonymous with prediction
How does this impact scientific research?
How does this impact organizations’ actions?
…and our lives?
Will the customer pay?
What causes non-payment?
Explain / Predict
Predict / Potential explanations
Shmueli (2010) “To Explain or To Predict?”, Statistical Science
Shmueli & Koppius (2011) “Predictive Analytics in IS Research”, MISQ