1

Big Earth Sciences Data – From Descriptive to Prescriptive Analytics

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

April 11, 2014 DRAFT for April 17 and May 6


2

Overview

• Must Read Articles:
  – Big Data – From Descriptive to Prescriptive:
    • Follow the progression from: What Happened (descriptive analytics), Why Did It Happen (correlation analytics), What Will Happen Next (predictive analytics), and What Should I Do About It (prescriptive analytics). We agree and will follow this framework.
  – The Sexiest Job of the 21st Century is Tedious, and that Needs to Change:
    • Data preparation (easy) and coding (minimal). We agree and use NodeXL & Spotfire.
  – Practical illustration of Map-Reduce (Hadoop-style), on real data
    • Excellent. I have been looking for something like this.
• New Book: Developing Analytic Talent - Becoming a Data Scientist:
  – Eight Chapters (buy book) and Six Addendums (free)
    • Excellent. Each Chapter has a Summary.

3

Overview (continued)

• My Story: From Data Science Central to Data Science Results:
  – Data Science Central is:
    • Online Resource for Big Data Practitioners: Robust Editorial Platform, Social Interaction, Forum-Based Technical Support, Latest in Technology Tools and Trends, and Industry Job Opportunities. Very comprehensive.
  – Data Science Central Data Science Results are:
    • What Happened (descriptive analytics): Registered meteorites that has impacted on Earth visualized. I did this.
• My Story: Big Earth Sciences Data – From Descriptive to Prescriptive Analytics:
  – April 17th ESIP Earth Sciences Analytics Meeting and May 6th Federal Big Data Working Group Meetup
    • Find environmental/climate change data sets for Why Did It Happen (correlation analytics), What Will Happen Next (predictive analytics), and What Should I Do About It (prescriptive analytics). We are working on this.

4

Data Science Central: 9 “must read” articles

http://www.datasciencecentral.com/profiles/blogs/9-must-read-articles

My Selection and Other Internal Links: See next slides.

5

My Selection (Vincent Granville): List

• Big Data – From Descriptive to Prescriptive
• Can big data be racist?
• NodeXL Graph Gallery: Graph Details
• Best Metrics For Digital Marketing: Rock Your Own And Rent Strategies
• Big Data: from mining to meaning
• Beautiful versus useful visualizations (in French, but interesting)
• Learning and Teaching Machine Learning: A Personal Journey
• Big data techniques and technologies
• The Sexiest Job of the 21st Century is Tedious, and that Needs to Change (*) My Note: See next slides.
• From the trenches: 360-degree data science

6

Big Data – From Descriptive to Prescriptive

Visually communicating the value of Big Data is challenging because of the need to convey different concepts simultaneously.

These charts plot analytical complexity against some sort of business value measurement in a positive correlation that looks entertainingly similar to human evolution charts we’ve all seen, with man becoming more upright and intelligent with time.

Regardless of graphic representation, they all follow the progression from (1) What Happened (descriptive analytics), (2) Why Did It Happen (correlation analytics), (3) What Will Happen Next (predictive analytics), and (4) What Should I Do About It (prescriptive analytics).

This chart is unique in that it goes all the way back to the beginning when data is first created and gathered in raw form. So much of the resources needed to develop prescriptive analytics takes place in the very early stages of the process.

Source: SAP

7

My Selection (Vincent Granville): Footnote and Comment

• * I (Vincent Granville) disagree with this Harvard Business Review author. Senior data scientists work on high level data from various sources, use automated processes for EDA (exploratory analysis) and spend little to no time in tedious, routine, mundane tasks (less than 5% of my time, in my case). I also use robust techniques that work well on relatively dirty data, and ... I create and design the data myself in many cases.

• My (Brand Niemann) experience as a senior data scientist is similar in that I find the data preparation to be a very interesting and worthwhile activity that informs my data science results and story and I actually delight in "creating and designing the data myself in many cases." It is the art in the data science work along with the resulting visualizations. I also avoid coding if at all possible by using Spotfire.

8

My Selection (Vincent Granville): Other Internal Links

• 17 short tutorials all data scientists should read (and practice)
• 10 types of data scientists My Note: Actually 9. See next slide.
• 66 job interview questions for data scientists
• Data Science Certification
• Update about our Data Science Apprenticeship
• Our Wiley Book on Data Science
• Data Science Top Articles
• Our Data Science Weekly Newsletter
• Practical illustration of Map-Reduce (Hadoop-style), on real data
• What makes up data science?
• Data science webinars
• Data science competition

9

Six Categories of Data Scientists

• Those strong in statistics:
  – They sometimes develop new statistical theories for big data that even traditional statisticians are not aware of.
• Those strong in mathematics:
  – NSA (national security agency) or defense/military people working on big data.
• Those strong in data engineering:
  – Hadoop, database/memory/file systems optimization and architecture.
• Those strong in machine learning
• Those strong in business
• Those strong in production code development, software engineering
• Those strong in visualization
• Those strong in GIS, spatial data, data modeled by graphs, graph databases
• Those strong in a few of the above. (Vincent Granville)

My Note: I suggest you read his interesting commentary on this at
http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists

10

Update About our Data Science Apprenticeship - March 10, 2014

• At the request of many prospective participants, here's an update about our DSA (Data Science Apprenticeship):
  – Stage 1 (Available now): DIY (do-it-yourself) for self-learners: material is available for free throughout DSC, including data sets and projects to work on. No registration required. Get started by checking our most recent announcements. My Note: See next slide
  – Stage 2 (April 2014): Participants will purchase our Wiley book as well as our data science cheat sheet to get jump-started.
  – Stage 3: Projects will be evaluated for a fee, and a certification delivered.
• Also, I have added a few large data sets, new projects and more material.
• If you have already earned a data science certificate or diploma, but were not requested to develop and use your own API in batch mode, and harvest/work on a data set with at least 50 million observations in a distributed environment, then it's time to learn the real stuff that will land you a real job! My Note: I added this bolding.

http://www.datasciencecentral.com/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship

11

Update About Our Data Science Apprenticeship - March 29, 2014

• Here are six important updates:
  – Our book will be on the market by April 7. Check the updated table of contents (PDF document) and download the additional material not published in the book. My Note: See next slides
  – We have added new tutorials, projects and data sets: check the starred items. My Note: See previous slide for URL
  – Successful candidates will automatically become certified data scientists. My Note: See next slides
  – There is now one project (involving creating and working on simulated data) that you can work on to complete our program: click here for details. More projects will be considered later, but right now, we only have one reviewer (Dr. Granville) to grade submitted contributions. The good thing is that the apprenticeship is still free for now - even better, you can earn $1,000 by completing this project. My Note: See next slides
  – We will soon add a test that applicants will have to complete, as part of the apprenticeship. Many of the questions have answers in our book. Different questions will be sent to each candidate, via e-mail.
  – Also, we are making progress on writing our data science cheat sheet. A preliminary version can be found here, but it will be much more comprehensive and useful when completed, within the next 30 days. My Note: Now it is 17 short tutorials all data scientists should read (and practice)

http://www.datasciencecentral.com/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship-march-29-2014

12

Developing Analytic Talent

• Acknowledgments and Introduction
• Chapter 1: What is Data Science?
• Chapter 2: Big Data is Different
• Chapter 3: Becoming a Data Scientist
• Chapter 4: Data Science Craftsmanship - Part I
• Chapter 5: Data Science Craftsmanship - Part II
• Chapter 6: Data Science Applications - Case Studies
• Chapter 7: Launching Your New Data Science Career
• Chapter 8: Data Science Resources
• Addendum (released free)
  – 1. Nine Categories of Data Scientists
  – 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data
  – 3. Answers to Job Interview Questions
  – 4. Additional Topics
  – 5. Improving Visuals
  – 6. Essential Features for any Database, SQL or NoSQL

http://semanticommunity.info/Data_Science/Data_Science_Central#Developing_Analytic_Talent

13

Data Science Certification

• This group is for Certified Data Scientists only. There are three ways to become a Certified Data Scientist:
  – Join this group, there is no cost, but only Data Science Central members are allowed. Your profile will be reviewed in 2-3 days. Based on your experience (two years of practice minimum, in an analytic, data-intensive occupation, with success stories) you may be accepted, regardless of your actual job title (data scientist, statistician, analytics manager, operations research analyst etc.).
  – Or you successfully complete our Data Science Apprenticeship (DSA). Join this group, mention the DSA on your profile, you will automatically be approved.
  – Or (coming soon) you are certified or graduated from a program managed by one of our partner universities and organizations.
• Once approved, you can add our certification in your profile (LinkedIn, resume, etc.) and be found by companies and organizations looking for serious data scientists and related professionals. See also our data science handbook (aka new book: Developing Analytic Talent) to learn core data science principles, featuring salary surveys, job interview questions, reference books, skills to acquire, sample resumes, difference between data scientist and other analytic professions, big data, case studies, becoming a freelance data scientist, Map-Reduce, Hadoop, and data science tricks, recipes, rules of thumb and tutorials. My Note: “difference between data scientist and other analytic professions”

http://www.datasciencecentral.com/group/data-science-certification

14

Write a data science research paper and win fame and award

• In connection with our proposed methodology to create a black-box, automated, easy-to-interpret, sample-based, robust technique called jackknife regression, to be used in small and big data environments by non-statisticians, we offer an award and massive promotion to the successful candidate who:
  – 1. Provides the exact formulas for the solution of the 2x2, 3x3 and 4x4 linear systems of equations described in section 3.2 in my recent article (this is straightforward)
  – 2. Performs more tests on simulated data (say 10 data sets, each with 10,000 observations) to compare my methodology (with one and two M's computed on the first 100 observations) with full classical regression. The test must include data with strong correlation structure, and data with up to n=20 independent variables. Comparison should be about (i) accuracy and (ii) sensitivity to little changes in the data set (measured e.g. via confidence intervals for regression coefficients, both for classical regression and my methodology)
• This project must be completed by August 31, 2014. You will be authorized to publish a paper featuring your research results (with your name as main or only author), and your results will also be published on Data Science Central, and seen by tens of thousands of practitioners. Your article must meet professional quality standards similar to those required by leading peer-reviewed statistical journals. Payment will be sent after completion of the project. Depending on the success of this initiative, and the quality of participants, we might offer more than one award.
• Read details here (see section 5). My Note: See URLs below

http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression
http://www.datasciencecentral.com/profiles/blogs/write-a-data-science-research-paper-and-win-fame-and-money

15

Dr. Vincent Granville: A Visionary Data Scientist

http://www.datasciencecentral.com/profile/VincentGranville

After 20 years of experience across many industries, big and small companies (and lots of training), I'm strong both in stats, machine learning, business, mathematics and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separated (the silo mentality). Indeed, that's the very reason why data science was created.

16

Big Data – From Descriptive to Prescriptive Examples

• What Happened (descriptive analytics)
  – Data Science Central: Registered meteorites that has impacted on Earth visualized
• Why Did It Happen (correlation analytics),
  – In process
• What Will Happen Next (predictive analytics), and
  – In process
• What Should I Do About It (prescriptive analytics)
  – In process

My Note: See Forecasting Meteorite Hits, pages 248-252.

17

How was the data collected?

http://semanticommunity.info/Data_Science/Data_Science_Central#Registered_meteorites_that_has_impacted_on_Earth_visualized

18

Where is the data stored?

http://semanticommunity.info/@api/deki/files/27220/meteors.xlsx

20

What is the data story?

http://semanticommunity.info/Data_Science/Data_Science_Central#From_Data_Science_Central_to_Data_Science_Results

Vincent Granville is interested to see this info visually summarized in 5 dimensions, as follows:
• 2 dimensions for the location: Mouse over to see Latitude and Longitude
• 1 dimension for the size (represented by radius): Mouse over to see mass in 5 bins
• 1 dimension for the type (represented by color): Mouse over to see type of meteorite
  Click on point to see Details-on-Demand. Then Unmark Marked Rows
• 1 dimension for time: turning this static image into a video, where each second represents (say) one year: Use Filter to Right to select Year. Then Reset All Filters

QED
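The five dimensions listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the Spotfire dashboard itself: the sample records, the field order, and the mass bin edges are all assumptions (the slide says "5 bins" but does not give the edges).

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical records: (name, latitude, longitude, mass in grams, type, year).
METEORITES = [
    ("A", 50.8, 6.1, 21.0, "L5", 1880),
    ("B", 56.2, 10.2, 720.0, "H6", 1951),
    ("C", 54.2, -113.0, 107000.0, "EH4", 1952),
    ("D", -33.2, -65.0, 1914.0, "L6", 1902),
    ("E", 32.1, 71.8, 4239.0, "H5", 1919),
]

# Assumed bin edges in grams; four edges give five bins (0..4).
MASS_EDGES = [100.0, 1_000.0, 10_000.0, 100_000.0]


def mass_bin(mass_g):
    """Return the bin index 0..4 that drives the marker radius."""
    return bisect_right(MASS_EDGES, mass_g)


def to_marker(record):
    """Map one record onto the five visual dimensions."""
    name, lat, lon, mass_g, kind, year = record
    return {
        "name": name,
        "x": lon,                          # dimension 1: longitude
        "y": lat,                          # dimension 2: latitude
        "radius_class": mass_bin(mass_g),  # dimension 3: size
        "color_key": kind,                 # dimension 4: type
        "year": year,                      # dimension 5: time
    }


def frames_by_year(records):
    """Group markers by year so each year can become one video frame."""
    frames = defaultdict(list)
    for record in records:
        marker = to_marker(record)
        frames[marker["year"]].append(marker)
    return dict(sorted(frames.items()))


frames = frames_by_year(METEORITES)
for year, markers in frames.items():
    print(year, [(m["name"], m["radius_class"], m["color_key"]) for m in markers])
```

Each frame could then be handed to any plotting tool; the year filter in the dashboard plays the same role as stepping through the keys of `frames`.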

21

Developing Analytic Talent

• Acknowledgments and Introduction:
  – Book publishing is like data scientist turning unstructured into structured data
  – Data Science Central is the leading data science community and a modern, lean start-up focused on value
  – How this book is structured: What data science & big data is, Career training resources, & Technical Material
• Chapter 1: What is Data Science?
• Chapter 2: Big Data is Different
• Chapter 3: Becoming a Data Scientist
• Chapter 4: Data Science Craftsmanship - Part I
• Chapter 5: Data Science Craftsmanship - Part II
• Chapter 6: Data Science Applications - Case Studies
• Chapter 7: Launching Your New Data Science Career
• Chapter 8: Data Science Resources
• Addendum (released free):
  – 1. Nine Categories of Data Scientists
  – 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data
  – 3. Answers to Job Interview Questions
  – 4. Additional Topics
  – 5. Improving Visuals
  – 6. Essential Features for any Database, SQL or NoSQL

http://semanticommunity.info/Data_Science/Data_Science_Central#Developing_Analytic_Talent

22

Chapter 1: What is Data Science?

• Real Versus Fake Data Science: 2
  – Repackaging old material like statistics and R programming with the new label “data science.”
  – See Chapter 2 for what MapReduce can’t do.
• The Data Scientist: 3
  – ETL (extract/transform/load) is for data engineers and DAD (discover/access/distill) is for data scientists
• Data Science Applications in 13 Real-World Scenarios: 13
  – Chapters 4 and 5 discuss solutions to such problems.
• Data Science History, Pioneers, and Modern Trends: 4
  – Data scientist is broader than data miner, and encompasses data integration, data gathering, data visualization (including dashboards), and data architecture. Data scientist also measures ROI on data science activities.
  – I have a few examples of “light analytics” doing better than sophisticated architectures in Chapter 6.
  – The big data ecosystem is discussed in Chapter 2.
• Summary:
  – What data science is not, including how traditional degrees will have to adapt as business and government evolve.

23

Chapter 2: Big Data is Different

• Two Big Data Issues: 2
  – The “curse” and the rapid data flow.
• Examples of Big Data Techniques: 3
  – Excel with 100 Million Rows: Use the PowerPivot add-in from Microsoft to work with large datasets.
• What MapReduce Can’t Do: 3
  – Problems requiring massive computations.
• Communication Issues: 1
  – It’s definitely a people/organization issue.
• Data Science: The End of Statistics?: 3
  – See how modern statistics can help make data science better.
• The Big Data Ecosystem: 1
  – It consists of products and services (hardware, cloud providers, data integration and database vendors, dashboards, visualization tools, and data science and analytic tools). My Note: Why I like TIBCO Spotfire.
• Summary:
  – Why standard statistical techniques fail when blindly applied to big data.
  – In general solutions include sampling and/or compression in cases where it makes sense.
  – Data science is more than data analysis, computer science, or statistics.

24

Chapter 3: Becoming a Data Scientist

• Key Features of Data Scientists: 2
  – Horizontal knowledge is important. D.J. Patil, previously a chief data scientist at LinkedIn, is now Data Scientist in Residence at Greylock Partners that advises In-Q-Tel (CIA) on investments.
• Types of Data Scientists: 4
  – Fake, Self-Made, Amateur, and Extreme (developing powerful, robust predictive solutions without any statistical models)
• Data Science Demographics: 1
  – Data science websites attract highly educated, wealthy males, predominantly of Asian origin, living mostly in the U.S.
• Training for Data Science: 3
  – University Programs (8), Corporate and Association Training Programs (7), and Free Training Programs (Coursera.com and Data Science Central)
• Data Scientist Career Paths: 2
  – The Independent Consultant and The Entrepreneur (see 13 Startup Ideas for Data Scientists)
• Summary: See the above!

25

Chapter 4: Data Science Craftsmanship - Part I

• New Types of Metrics: 2
• Choosing Proper Analytic Tools: 4
• Visualization: 2
• Statistical Modeling Without Models: 3
• Three Classes of Metrics: Centrality, Volatility, and Bumpiness: 4
• Statistical Clustering for Big Data: 1
• Correlation and R-Squared for Big Data: 2
• Computational Complexity: 2
• Structured Coefficient: 1
• Identifying the Number of Clusters: 2
• Internet Topology Mapping: 1
• Securing Communications Data Encoding: 1
• Summary: This is the most technical chapter in the book based on articles first published at Data Science Central to cover many different techniques, recipes, and topics so you can reproduce them when needed.

26

Chapter 5: Data Science Craftsmanship – Part II

• Data Dictionary: 2
  – One of the most valuable tools when performing exploratory data analyses. My Note: I agree!
• Hidden Decision Trees: 3
• Model-Free Confidence Intervals: 4
  – The first Analyticbridge Theorem, which provides a simple, model-free, nonparametric way to compute confidence intervals without statistical theory or knowledge.
• Random Numbers: 1
• Four Ways to Solve a Problem: 4
• Causation Versus Correlation: 1
  – In all contexts, using predictors that are directly causal typically helps reduce the variance in the model and yields more robust solutions.
• How Do You Detect Causes?: 1
• Life Cycle of Data Science Projects: 1
• Predictive Modeling Projects: 1
• Predictive Modeling Mistakes: 1
• Logistic Related Regressions: 4
• Experimental Design: 3
• Analytics as a Service and APIs: 3
• Miscellaneous Topics: 4
• New Synthetic Variance for Hadoop and Big Data: 8
• Summary: The topics discussed in this chapter are typically classified as data analyses rather than statistical or computer analyses. Most of the material has not been published before. Traditional statisticians typically don’t learn or use these techniques, but data scientists do.

27

Chapter 6: Data Science Applications - Case Studies

• Stock Market: 7
• Encryption: 3
• Fraud Detection: 11
• Digital Analytics: 9
• Miscellaneous: 6
  – Forecasting Meteorite Hits:
    • Define the scope of the analysis: This is a small project to be completed in 10 hours of work or less, billed at $100/hour. Provide the risk of meteorite hit per year per meteorite size.
• Summary: 36 Case Studies, Real-Life Applications, and Success Stories

28

Chapter 7: Launching Your New Data Science Career

• Job Interview Questions: http://bit.ly/1cGlFA5
• Testing Your Own Visual and Analytical Thinking
• From Statistician to Data Scientist: http://bit.ly/197Jsfa (160 comments on LinkedIn)
• Taxonomy of a Data Scientist:
  – Top Data Scientists on LinkedIn: Kirk Borne-Analytics (0.00 with Vincent Granville=0.38), Big Data (0.15 with Milind Bhandarkar=0.54), Data Mining (0.45 with Dean Abbott=0.46), Machine Learning (0.39 with Monica Rogati=0.43), and Purity (0.70 with Dean Abbott=0.93)
• 400 Data Scientist Job Titles: http://bit.ly/11WhOcu (from 10,000 data scientists in LinkedIn network)
• Salary Surveys: http://bit.ly/1dmCouo
• Summary: See the above!

29

Chapter 8: Data Science Resources

• Professional Resources:
  – Data Sets: http://bit.ly/W2HTJU
  – Books: 100+ (some free)
  – Conferences and Organizations: Vendors (e.g. SAS), Professional Societies (e.g. INFORMS), and Conference Organizers (O’Reilly Strata)
  – Websites: http://bit.ly/lghDR7K (add your own)
  – Definitions: http://bit.ly/l8UcD7c
• Career Building Resources
  – Companies Employing Data Scientists: 21 leading and 6,000+ at http://bit.ly/19vRlNV
  – Sample Data Science Job Ads: http://bit.ly/1hVAmr7
  – Sample Resumes: http://bit.ly/1j4PNuP
• Summary: See the above!

30

Addendum: 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data

• Goal: Build a system to score Internet Clicks (50M) (“click data”):
  – Extract relevant fields (e.g. 6 of 60)
  – Build a summary table: the Map step (text file like in Hadoop)
    • Split the big data in smaller data sets (say 20) (called subsets), and perform this operation separately on each subset
  – Build a summary table: the Reduce step
    • Simple merging will not work at this scale, so sort each subset by a key field and merge the sorted subsets to produce a big summary table S, which is much more manageable and compact, although still far too large to fit in Excel.
    • Create a rule set by building less granular summary tables, on top of S, and testing.
• Improvements:
  – New technology to “split / sort subsets / merge and aggregate” faster and better
• Conclusions:
  – The granular table S (and the way it is built) is similar to the Hadoop architecture.

My Note: I do not find this to be very interesting data science, but it is the way to make money!

Start: Extract/summarize data from say large log files
Map: Create a hierarchical database
Reduce: High-level summaries corresponding to rules
Finish: Find result (e.g. credit card fraud)
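The split / sort / merge-and-aggregate flow described on this slide can be illustrated with a small single-machine Python sketch. It is not Granville's implementation and it does not use Hadoop: the click records, the field names (`ip`, `keyword`, `agent`), the choice of key, and the number of subsets are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical click records; real click data would carry many more fields.
CLICKS = [
    {"ip": "10.0.0.1", "keyword": "loans", "agent": "a"},
    {"ip": "10.0.0.2", "keyword": "cars", "agent": "b"},
    {"ip": "10.0.0.1", "keyword": "loans", "agent": "c"},
    {"ip": "10.0.0.3", "keyword": "loans", "agent": "a"},
    {"ip": "10.0.0.1", "keyword": "loans", "agent": "b"},
    {"ip": "10.0.0.2", "keyword": "cars", "agent": "a"},
]


def extract(row):
    """Keep only the fields used as the summary key (an assumed choice)."""
    return (row["ip"], row["keyword"])


def split(rows, n_subsets):
    """Split the data into n_subsets smaller subsets."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def map_step(subset):
    """Summarize one subset: clicks per key, sorted by key."""
    counts = Counter(extract(row) for row in subset)
    return sorted(counts.items())


def reduce_step(sorted_summaries):
    """Merge the sorted subset summaries and add up counts for equal keys."""
    merged = heapq.merge(*sorted_summaries, key=itemgetter(0))
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Each subset is summarized independently, then the sorted summaries are merged.
summary_table = reduce_step([map_step(s) for s in split(CLICKS, 3)])
for key, clicks in summary_table:
    print(key, clicks)
```

Because every subset summary is already sorted by key, the merge only needs to look at the head of each summary, which is why sorting before merging scales better than a naive merge.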

31

Big Data – From Descriptive to Prescriptive Example: Forecasting Meteorite Hits, pages 248-252.

• What Happened (descriptive analytics)
  – Data Science Central: Registered meteorites that has impacted on Earth visualized (original data set)
• Why Did It Happen (correlation analytics),
  – See next slides
• What Will Happen Next (predictive analytics), and
  – See next slides
• What Should I Do About It (prescriptive analytics)
  – See next slides

32

Forecasting Meteorite Hits

• Statistical analyses in 8 steps:
  – Define the scope of analysis:
    • 10 hours of work or less, billed at $100/hour, to provide the risk of meteorite hit per year per meteorite size.
  – Identify data and caveats (URL did not work)*:
    • http://osm2.cartodb.com/tables/2320/public#/map
  – Data cleaning:
    • The data seem comprehensive, but are messy. Discard data prior to 1900.
  – Exploratory analysis:
    • Strong patterns emerge despite the messy data; for example, smaller meteorites are now detected because of the growing surface of inhabited land and better instruments.
  – The actual analysis in an Excel spreadsheet with data and formulas (Vincent Granville):
    • http://bit.ly/1gaiIMm
  – Model selection:
    • Two decades show relatively good pattern stability and recency: 2000-2010 and 1990-2000.
  – Prepare forecasts:
    • Yearly_Occurrences (weight) = 1/(A + B* log (weight)). The “every 40 year” claim for the 2013 Russian bang is plausible.
  – Followup:
    • A more detailed analysis would involve predictions broken down by meteor type (iron and water), angle, and velocity. Also the impact of population growth could be assessed in this risk analysis.

* Source Web Site: Download entire data set (see Excel): it's a 7 MB spreadsheet consisting of 34,513 meteorites, last updated in 2012.
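The forecast formula Yearly_Occurrences(weight) = 1/(A + B*log(weight)) becomes linear once inverted: 1/Yearly_Occurrences = A + B*log(weight). A minimal Python sketch of fitting A and B by ordinary least squares follows. It is not necessarily how the Excel spreadsheet does it, and several things are assumptions: the natural logarithm, the weight classes, and the synthetic occurrence rates (generated from A = 0.05 and B = 0.1 so that the fit can be checked).

```python
import math


def fit_inverse_log_model(weights, yearly_occurrences):
    """Fit 1/y = A + B*log(w) by ordinary least squares and return (A, B)."""
    xs = [math.log(w) for w in weights]
    ys = [1.0 / y for y in yearly_occurrences]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_yearly_occurrences(weight, a, b):
    """Yearly_Occurrences(weight) = 1 / (A + B*log(weight))."""
    return 1.0 / (a + b * math.log(weight))


# Synthetic check data (not the meteorite data set): rates generated from
# known coefficients, so the fit should recover them.
TRUE_A, TRUE_B = 0.05, 0.1
weights = [10.0, 100.0, 1_000.0, 10_000.0]
rates = [1.0 / (TRUE_A + TRUE_B * math.log(w)) for w in weights]

a, b = fit_inverse_log_model(weights, rates)
rate = predict_yearly_occurrences(5_000.0, a, b)
print(f"A={a:.3f}, B={b:.3f}, hits/year at 5,000={rate:.3f}, one every {1.0 / rate:.2f} years")
```

With real data, the inputs would be per-weight-class yearly counts taken from the stable decades named above (after discarding pre-1900 records), and the return period for a given size is simply the reciprocal of the predicted yearly rate.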

33

Where is the data stored? And What are the results?

http://bit.ly/1gaiIMm

35

What Should I Do About It (prescriptive analytics)

LSST = Large Synoptic Survey Telescope: http://www.lsst.org/

36

What Should I Do About It (prescriptive analytics)

• Professor Kirk Borne - My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data. These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks. See more at: http://www.eeriedigest.com/wordpress/2013/01/taem-interview-with-dr-kirk-borne-of-george-mason-university/

Dr. Kirk Borne of George Mason University Slides