DESCRIPTION
Big Data involves several issues: data collection, storage, computing, analysis and visualization. Python is a popular scripting language with good code readability, which makes it suitable for fast development. In these slides, the author shares how to address Big Data issues using open source Python tools.
2012
When Big Data Meet Python
Jimmy Lai (賴弘哲)
jimmy.lai@oi-sys.com
2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
About Me
• 賴弘哲 (Jimmy Lai)
• Interests: Data mining, Machine Learning, Natural Language Processing, Distributed Computing, Python
• LinkedIn profile: http://goo.gl/XTEM5
• Currently at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司), working on semantic analysis of big data
Outline
1. Big Data
a. Concept
b. Technical issues
2. Big Data + Python
a. Related open source tools
b. Example
Benefits of Big Data
1. Creating transparency
2. Enabling experimentation to discover needs, expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making with automated algorithms
5. Innovating new business models, products and services
(May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
e.g. http://www.data.gov/
Shortage of talent for deep data analysis
Initiative from the White House
• (Mar 2012) Big Data Research and Development Initiative, the White House.
• National Science Foundation encourages education on Big Data.
• The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
Big Data Issues
• Collecting (sources: User Generated Content, Machine Generated Data)
• Storage
• Computing
• Analysis
• Visualization
Big Data Techniques: Collecting
• Crawler
– Collect raw data
– E.g. Heritrix, Nutch
• Scraping
– Parse information from raw data
– E.g. Yahoo! Pipes, Scrapy
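The scraping step above (parsing information out of raw data) can be sketched with nothing but the Python standard library. This is a minimal illustration, not Scrapy's API: the `LinkTitleParser` class, the `scrape` helper and the `RAW_PAGE` sample are hypothetical names made up for this example, and the page stands in for whatever a crawler would have downloaded.

```python
from html.parser import HTMLParser


class LinkTitleParser(HTMLParser):
    """Collect the page title and all hyperlink targets from raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def scrape(raw_html):
    """Turn raw HTML into a structured item (a plain dict)."""
    parser = LinkTitleParser()
    parser.feed(raw_html)
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical raw page, standing in for crawler output.
RAW_PAGE = """
<html><head><title>Food blog</title></head>
<body><a href="/post/1">Ramen</a> <a href="/post/2">Dumplings</a></body></html>
"""

item = scrape(RAW_PAGE)
print(item)
```

The crawler's job ends at producing `RAW_PAGE`; everything after that is scraping.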
Big Data Techniques: Storage
• Big Table
– Distributed key-value storage
– E.g. HBase, Cassandra
• NoSQL
– Does not use SQL for manipulation
– Does not use the relational database model
– E.g. MongoDB, Redis, CouchDB
Big Data Techniques: Computing
• Batch
– MapReduce
– E.g. Hadoop
• Real-time
– Stream processing
– E.g. S4, Storm
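To make the batch MapReduce model above concrete, here is a single-process word count. It is only a sketch of the programming model: Hadoop distributes the same three phases across machines, and the function names and the `DOCS` sample below are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records.
DOCS = ["big data meet python", "python meets big data", "big big data"]
counts = word_count(DOCS)
print(counts)
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both phases can be run in parallel on separate workers.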
Big Data Techniques: Analysis
• Data mining – Weka
• Machine learning – scikit-learn
• Natural language processing – NLTK, Stanford NLP
• Statistics – R
Big Data Techniques: Visualization
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js
Why Python?
• Good code readability for fast development.
• Scripting language: less code, more productivity.
• Fast growing among open source communities.
– Commit statistics from ohloh.net
When Big Data meet Python
• Collecting (sources: User Generated Content, Machine Generated Data)
– Scrapy: scraping framework
• Storage
– PyMongo: Python client for MongoDB
• Computing
– Hadoop streaming: Linux pipe interface
– Disco: lightweight MapReduce in Python
• Analysis
– Pandas: data analysis/manipulation
– Statsmodels: statistics
– NLTK: natural language processing
– Scikit-learn: machine learning
• Visualization
– Matplotlib: plotting
– NetworkX: graph visualization
[Diagram label: Infrastructure]
When Big Data meet Python: Collecting
Scrapy: web scraping framework
• Simple and extensible
• Components:
– Scheduler
– Downloader
– Spider (Scraper)
– Item pipeline
http://scrapy.org/
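How the four components named on this slide (scheduler, downloader, spider, item pipeline) hand work to each other can be sketched as a toy loop. This is not Scrapy code and does not use Scrapy's API: `FAKE_WEB`, `download`, `spider`, `item_pipeline` and `crawl` are hypothetical names, and the "web" is an in-memory dict so that no network access is made.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> raw HTML.
FAKE_WEB = {
    "/": '<a href="/post/1">one</a> <a href="/post/2">two</a>',
    "/post/1": "<h1> Beef noodles </h1>",
    "/post/2": '<h1>Bubble tea</h1> <a href="/">home</a>',
}


def download(url):
    """Downloader: fetch the raw page for a URL."""
    return FAKE_WEB.get(url, "")


def spider(url, html):
    """Spider: yield scraped items (dicts) and newly discovered URLs (str)."""
    for title in re.findall(r"<h1>(.*?)</h1>", html):
        yield {"url": url, "title": title}
    for link in re.findall(r'href="(.*?)"', html):
        yield link


def item_pipeline(item):
    """Item pipeline: clean an item before it is stored."""
    item["title"] = item["title"].strip()
    return item


def crawl(start_url):
    scheduler = deque([start_url])  # Scheduler: queue of URLs still to visit
    seen = {start_url}
    items = []
    while scheduler:
        url = scheduler.popleft()
        for result in spider(url, download(url)):
            if isinstance(result, dict):
                items.append(item_pipeline(result))
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return items


items = crawl("/")
print(items)
```

In the real framework each of these roles is a separate, replaceable component, which is what makes it extensible.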
When Big Data meet Python: Storage
MongoDB: NoSQL database
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
– Auto-sharding
– Replica set
• File storage
• MapReduce aggregation
http://www.mongodb.org/
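What "document (JSON)-oriented" and "no schema" mean in practice can be shown without a database server. The sketch below keeps documents in a plain Python list; the `reviews` data and the `find` helper are hypothetical and only mimic the shape of an equality query, they are not PyMongo itself.

```python
import json

# Two documents in the same hypothetical "reviews" collection.
# They do not share the same fields: there is no schema to satisfy.
reviews = [
    {"_id": 1, "shop": "Noodle House", "rating": 5, "tags": ["ramen", "late-night"]},
    {"_id": 2, "shop": "Tea Corner", "comment": "Quiet place"},
]


def find(collection, query):
    """Return the documents whose fields equal every key/value in `query`."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in query.items())
    ]


matches = find(reviews, {"rating": 5})
print(json.dumps(matches, ensure_ascii=False))
```

With a running MongoDB server, the comparable PyMongo call would be `collection.find({"rating": 5})` on a collection object.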
When Big Data meet Python: Computing
Disco
• Distributed computing:
– MapReduce
– Disco distributed file system
• Write code in Python
– Easy/fast to profile
– Easy/fast to debug
http://discoproject.org/
When Big Data meet Python: Analysis
Pandas
• Data analysis library
• Data structures for fast data manipulation
– Slicing
– Indexing
– Subsetting
• Handling missing data
• Aggregation
• Time series
http://pandas.pydata.org/
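Each feature listed on this slide maps to a short pandas operation. The sketch below assumes a current pandas release (the API has changed since 2012) and uses made-up daily review counts; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily review counts for two shops, with one missing value.
dates = pd.date_range("2012-08-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "shop": ["A", "B", "A", "B", "A", "B"],
        "reviews": [10, 4, np.nan, 6, 14, 8],
    },
    index=dates,
)

first_three = df.iloc[:3]                        # slicing by position
aug_3_to_5 = df.loc["2012-08-03":"2012-08-05"]   # indexing by date label (inclusive)
shop_a = df[df["shop"] == "A"]                   # subsetting with a boolean mask

filled = df.fillna({"reviews": 0})                  # handling missing data
totals = filled.groupby("shop")["reviews"].sum()    # aggregation per shop
two_day = filled["reviews"].resample("2D").sum()    # time series: 2-day bins

print(totals)
print(two_day)
```

Treating the missing day as zero is a choice made for this example; dropping or interpolating the value are equally valid alternatives depending on the data.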
When Big Data meet Python: Analysis
Statsmodels
• Statistical analysis
• Statistical models
• Fit data with model
• Statistical tests
• Data exploration
• Time series analysis
http://statsmodels.sourceforge.net/
When Big Data meet Python: Analysis
scikit-learn
• Machine learning algorithms
– Supervised learning
– Unsupervised learning
• Dataset
– Preprocessing
– Feature extraction
• Model
– Selection
– Pipeline
http://scikit-learn.org/
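The pieces named on this slide (dataset, preprocessing, model, pipeline) fit together as in the sketch below. It assumes a current scikit-learn release, whose module layout differs from the 2012 version; the choice of the bundled iris dataset and of logistic regression is arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: a small labelled dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pipeline: preprocessing (scaling) followed by a supervised model.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline ensures the scaling parameters are learned from the training split only.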
When Big Data meet Python: Analysis
NLTK: Natural Language Toolkit
• Natural language processing
• Annotated corpora and resources
• Information extraction workflow: Sentence Segmentation → Tokenization → POS Tagging → Named Entity Recognition → Relation Recognition
http://nltk.org/
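The first two stages of the information extraction workflow, sentence segmentation and tokenization, can be sketched with regular expressions. This is a deliberately naive stand-in written with the standard library, not NLTK's implementation; NLTK ships trained models for these steps and for the later ones (POS tagging, named entity recognition), which need downloaded resources. The function names and the `TEXT` sample are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: words (with inner apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


TEXT = "Jimmy works in Taipei. He likes Python!"
sentences = split_sentences(TEXT)
tokens = [tokenize(s) for s in sentences]
print(tokens)
```

Rules this simple break on abbreviations such as "Dr." or "e.g.", which is one reason trained segmenters are used in practice.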
When Big Data meet Python: Visualization
Matplotlib
• Plotting
– Histograms
– Power spectra
– Bar charts
– Error charts
– Scatter plots
• Full control over plotting details
http://matplotlib.sourceforge.net/
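Two of the plot types listed here, a histogram and a scatter plot, are drawn below with a few explicit styling choices to hint at the level of control available. The sketch assumes a current Matplotlib release and random data; it renders off-screen with the Agg backend and writes the PNG into an in-memory buffer instead of a file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; no display is required
import matplotlib.pyplot as plt
import numpy as np

# Random data for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(size=500)
x = rng.uniform(size=50)
y = 2 * x + rng.normal(scale=0.1, size=50)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

ax_hist.hist(values, bins=20, color="steelblue")
ax_hist.set_title("Histogram")

ax_scatter.scatter(x, y, s=15, marker="o")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

fig.tight_layout()
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)  # PNG bytes end up in `buf`
```

Replacing `buf` with a file name such as `"plots.png"` would write the image to disk instead.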
When Big Data meet Python: Visualization
NetworkX
• Graph algorithms and visualization
• Draw graph with layout:
– Circular
– Random
– Spectral
– Spring
– Shell
– Graphviz
http://networkx.lanl.gov/
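A small graph is enough to show both halves of this slide: a graph algorithm and the layout functions that position nodes for drawing. The sketch assumes a current NetworkX release with NumPy installed; the node names are made up for illustration.

```python
import networkx as nx

# Hypothetical graph linking a blog, a review, a dish and a shop.
G = nx.Graph()
G.add_edges_from(
    [("blog", "review"), ("review", "dish"), ("review", "shop"), ("shop", "dish")]
)

# Graph algorithms
path = nx.shortest_path(G, "blog", "dish")
degree = dict(G.degree())

# Layouts: each returns a dict mapping node -> (x, y) coordinates
pos_circular = nx.circular_layout(G)
pos_spring = nx.spring_layout(G, seed=0)

print(path)
print(degree)
```

With Matplotlib installed, `nx.draw(G, pos_spring, with_labels=True)` would render the graph using the computed positions.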
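Example: 聚寶評 (www.ezpao.com)
Food search engine
Searches food reviews across major blogs
Semantic analysis search engine
Analysis of dishes shared by users
Positive/negative review analysis
Review topic analysis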
Thank you for your attention. Q & A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer
Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com