Pandas DataFrame in Notebook is Wonderful
From File
From Redshift
From Google Spreadsheet
concat, drop_duplicates, dropna, groupby, …
pandas.read_csv(DICT, header=None, sep=" ", names=['word', 'weight', 'type'])
pandas.read_json(TOP_ARTICLE)

sql = "select keyword, sum(clicks) AS cc from search_console WHERE … GROUP BY …"
df = pandas.read_sql(sql, con=con)

sheet = gc.open_by_url(link)
spreadata = pandas.DataFrame(sheet.get_all_records())
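The operations listed above (concat, drop_duplicates, dropna, groupby) combine naturally once the sources are loaded. A minimal runnable sketch, with made-up keyword data standing in for the file/Redshift/Spreadsheet frames:

```python
import pandas as pd

# Hypothetical keyword frames, standing in for data loaded from a file,
# Redshift, or a Google Spreadsheet as shown above.
a = pd.DataFrame({"keyword": ["jupyter", "pandas", "spark"], "clicks": [10, 5, 7]})
b = pd.DataFrame({"keyword": ["pandas", "spark", None], "clicks": [5, 3, 1]})

merged = pd.concat([a, b], ignore_index=True)  # stack the two sources
merged = merged.drop_duplicates()              # remove exact duplicate rows
merged = merged.dropna(subset=["keyword"])     # drop rows with no keyword
totals = merged.groupby("keyword")["clicks"].sum()

print(totals.to_dict())  # → {'jupyter': 10, 'pandas': 5, 'spark': 10}
```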
ipywidgets • sliders, progress bars, checkboxes, buttons, …
qgrid • Uses SlickGrid to render pandas DataFrames within a Jupyter notebook
IPython.display • SVG, Math, Javascript, IFrame, HTML
nbviewer • A simple way to share Jupyter Notebooks
plotly • Make charts and dashboards online
word-library • Jieba
word2vec
data-utility
BigQueryApi
RedshiftApi
url2content
url2keyword REST API
Scheduling
SlackBot Api
Dashboard
Build Model
...
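The slides don't document these internal tools; purely as illustration, a url2content-style helper might strip HTML down to text. A stdlib sketch (the function name and behavior are assumptions; a real version would fetch the URL first, e.g. with urllib.request):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    # Hypothetical core of a url2content-style helper; in production the
    # HTML would come from fetching the URL.
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<html><head><script>x=1</script></head>"
                   "<body><h1>Hi</h1><p>there</p></body></html>"))  # → Hi there
```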
Notebooks control the ML data pipeline
Core-Algorithm
• About 4 billion records
• Run 1 worker with 4 executor instances (2 cores, 4 GB RAM each)
• Bottlenecks
• Query ordered data while doing mapPartitions
• Merge 20 million cookies from 4 billion rows
• reduceByKey does a lot of shuffling
• Feature selection (sklearn.feature_selection.chi2)
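reduceByKey does shuffle, but it pre-aggregates within each partition first (map-side combine), so only one record per key per partition crosses the network. A pure-Python sketch of that idea with made-up cookie records (partition contents are illustrative):

```python
from collections import Counter

# Hypothetical partitions of (cookie_id, click) records.
partitions = [
    [("c1", 1), ("c1", 1), ("c1", 1), ("c2", 1)],
    [("c2", 1), ("c2", 1), ("c3", 1), ("c1", 1)],
]

# Map-side combine (what reduceByKey does): aggregate inside each
# partition first, so far fewer records cross the shuffle boundary.
local = [Counter() for _ in partitions]
for part, acc in zip(partitions, local):
    for key, value in part:
        acc[key] += value

shuffled = sum(len(acc) for acc in local)    # records actually shuffled
raw = sum(len(part) for part in partitions)  # records a naive groupByKey ships

final = Counter()
for acc in local:
    final.update(acc)

print(dict(final), shuffled, raw)  # → {'c1': 4, 'c2': 3, 'c3': 1} 5 8
```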
Doing Spark with PHP?
32 cores: the driver runs 32 executors, and each executor drives a PHP worker process (diagram).

spark.master local[*]
spark.executor.instances 32
sc.textFile("url.csv").repartition(128)
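One way to "do Spark with PHP" is RDD.pipe, which streams each partition's records through an external process's stdin/stdout. The sketch below imitates that mechanism with a Python subprocess standing in for the PHP worker (the upper-casing worker and the URL records are illustrative, not from the slides):

```python
import subprocess
import sys

# A stand-in "worker" script: reads lines on stdin, upper-cases them,
# writes them to stdout -- the role the PHP processes play per executor.
worker_src = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
)

def pipe_partition(records, cmd):
    # Minimal sketch of Spark's RDD.pipe(): stream one record per line
    # through the external process and collect its output lines.
    proc = subprocess.run(
        cmd,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

out = pipe_partition(["http://a", "http://b"], [sys.executable, "-c", worker_src])
print(out)  # → ['HTTP://A', 'HTTP://B']
```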
Build word2vec Model
• 120
• We choose cppjieba [github]
• thread_number=16
• spark.executor.instances 32
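The slides don't show the training step itself; as an illustration of what a word2vec model consumes after (cpp)jieba segmentation, here is skip-gram (center, context) pair generation over a window (the tokens and window size are made up):

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every word within `window`
    # positions -- the (center, context) pairs word2vec trains on.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Tokens as they might come out of jieba segmentation; this sentence
# is illustrative, not from the original corpus.
print(skipgram_pairs(["i", "love", "data"], window=1))
# → [('i', 'love'), ('love', 'i'), ('love', 'data'), ('data', 'love')]
```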
• Jupyter Notebook (data pipeline)
• When a Jupyter Notebook is reopened it is hard to track job status, so status updates are sent to a Slack channel
• Jupyter Notebook
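Sending status to a Slack channel is typically done with an incoming webhook; a minimal sketch that only builds the JSON payload (the channel name, message format, and WEBHOOK_URL are assumptions, not from the slides):

```python
import json

def build_slack_payload(notebook, status, channel="#ml-pipeline"):
    # Hypothetical status message for a Slack incoming webhook; the
    # channel name and text format are illustrative.
    return json.dumps({
        "channel": channel,
        "text": f"[{notebook}] pipeline step finished with status: {status}",
    })

payload = build_slack_payload("build_word2vec.ipynb", "success")
print(payload)

# To actually send it:
# import urllib.request
# req = urllib.request.Request(WEBHOOK_URL, data=payload.encode(),
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```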
Use notebooks to define the machine learning workflow
Jupyter Lab • The next generation of the Jupyter Notebook (Jupyter team + Bloomberg + Continuum Analytics)
Google Datalab • Cloud Datalab is built on Jupyter and enables analysis of data on BigQuery, GCE, and Cloud Storage using Python, SQL, and JavaScript.
Domino • A platform to accelerate data science; makes data scientists more productive and facilitates collaborative, reproducible, reusable analysis.
Zeppelin • Inspired by the IPython notebook, focusing on providing an analytical environment on top of the Hadoop ecosystem.
Databricks Cloud Notebook • Notebook Workflows as APIs that let users chain notebooks together using the standard control structures of the source programming language.