What is Data Science
Big Data Dive, 20.09.2012
Data is everywhere
Apps with data
Google Page Rank
Amazon Recommendations
Meteorology
Healthcare
Big Data processing
Definition of Data Science
Data Science is…
• Data Engineering
• Scientific Method
• Math
• Statistics
• Advanced Computing
• Visualization
• Hacker mindset
• Domain Expertise
Data Science is…
• A/B testing
• Association rule learning
• Classification
• Cluster analysis
• Crowdsourcing
• Data fusion and integration
• Data mining
• Ensemble learning
• Genetic algorithms
• Machine learning
• Massive parallel-processing
• Natural language processing
• Neural networks
• Pattern recognition
• Predictive modelling
• Regression
• Sentiment analysis
• Signal processing
• Simulation
• Time series analysis
• Visualization
Data Science is…
• Explore data
• Build model
• Apply model
The most important goal of data science is prediction
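The explore, build, apply sequence can be sketched in a few lines. This is only an illustration under stated assumptions: the data points are made up, and `fit_line` and `predict` are hypothetical helper names that implement ordinary least squares for a single feature.

```python
from statistics import mean

# Made-up (x, y) observations, used purely for illustration.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(points):
    """Build model: least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply model: evaluate the fitted line at a new x."""
    slope, intercept = model
    return slope * x + intercept


# Explore data: a quick look at the range of the input variable.
x_values = [x for x, _ in data]
print("x range:", min(x_values), "to", max(x_values))

# Build model, then apply it to an unseen input.
model = fit_line(data)
prediction = predict(model, 5.0)
print("prediction for x = 5.0:", prediction)
```

The three steps stay separate on purpose: the fitted model is a plain value that can be stored and applied later to inputs that were not part of the exploration.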
Process
Explore data
• Preprocessing
  • Data cleaning
  • Transformations
  • Subsets selection
  • Feature selection
  • Discretization
  • Binarization
  • Normalization
  • Generalization
• Investigation
  • Plots
  • Histograms
  • Smoothing
  • Plot matrices
  • Distributions
  • Multidimensional scaling
  • Classification trees
  • Correlation matrices
Example: Binarization
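One common reading of binarization is turning a categorical attribute into a set of 0/1 indicator attributes, one per category. The sketch below assumes that reading; the `binarize` helper and the colour values are hypothetical and made up for illustration.

```python
def binarize(values):
    """Map each categorical value to a row of 0/1 indicators, one per category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Made-up categorical attribute.
colors = ["red", "green", "blue", "green"]
categories, encoded = binarize(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so models that expect numeric input can consume the attribute without implying an ordering between categories.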
Example: Plot Matrices
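A plot matrix draws one scatter plot for every pair of variables. As a text-only counterpart, and not the plot itself, the sketch below computes the Pearson correlation for every pair, which is the correlation matrix listed under Investigation. The column names and values are made up, and `pearson` and `correlation_matrix` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up columns, used purely for illustration.
columns = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
    "age": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Pairs with a coefficient near 1 or -1 are the ones whose panel in a plot matrix would look close to a straight line.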
Build model
• Artificial neural networks
• Association rules
• Bayesian networks
• Clustering
• Decision trees
• Generalized linear models
• Genetic programming
• Inductive logic programming
• Sparse dictionaries
• Support vector machines
• Reinforcement learning
• Representation learning
Example: Decision Trees
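A decision tree can be grown by repeatedly choosing the split that makes the resulting groups as pure as possible. The sketch below is a minimal version under stated assumptions: it uses Gini impurity as the purity measure, numeric features with "less than or equal" thresholds, and made-up data. All function names are hypothetical, and this is not necessarily the variant shown in the talk.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def majority(labels):
    """Most frequent label; ties are broken alphabetically for determinism."""
    return max(sorted(set(labels)), key=labels.count)


def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Build model: a leaf is a label, a node is (feature, threshold, left, right)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    split = best_split(rows, labels)
    if split is None:
        return majority(labels)
    feature, threshold = split
    left = [(row, lab) for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [(row, lab) for row, lab in zip(rows, labels) if row[feature] > threshold]
    return (
        feature,
        threshold,
        build_tree([r for r, _ in left], [lab for _, lab in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [lab for _, lab in right], depth + 1, max_depth),
    )


def classify(tree, row):
    """Apply model: walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


# Made-up training data: two numeric features per row.
rows = [(2.0, 1.0), (3.0, 1.5), (6.0, 1.2), (7.0, 0.8)]
labels = ["small", "small", "large", "large"]

tree = build_tree(rows, labels)
print(tree)
print(classify(tree, (2.5, 1.0)), classify(tree, (6.5, 1.0)))
```

The tree is kept as nested tuples so it can be printed and inspected directly; the `max_depth` limit is one simple way to stop the tree from growing until it memorises the training data.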
Apply model
Tools
R
• Open source programming language and software environment
• Designed for statistical computing and graphics
• CRAN (The Comprehensive R Archive Network) – 5300 packages and counting
• In 2010 it became the data mining tool used by more data miners (43%) than any other
Mathematical packages
They make presentation better
• Google Prediction API
• Microsoft Analysis Services
• Oracle Data Mining
Python
• Well recognized for scientific engineering
• General purpose scientific libraries:
NumPy, SciPy, Matplotlib, python-multiprocessing
• Statistical, data mining, machine learning packages:
scikit-learn, pandas, PyBrain
Thank you!
Andrei Paleyes
Skype: andrei.paleyes