To Have Own Data Analytics Platform, Or NOT To
Aoyama Engineer Study Meetup (青山エンジニア勉強交流会), April 24, 2017
Satoshi Tagomori (@tagomoris)

Page 1: To Have Own Data Analytics Platform, Or NOT To

To Have Own Data Analytics Platform, Or NOT To

Aoyama Engineer Study Meetup (青山エンジニア勉強交流会), April 24, 2017

Satoshi Tagomori (@tagomoris)

Page 2: To Have Own Data Analytics Platform, Or NOT To

Satoshi "Moris" Tagomori (@tagomoris)

Fluentd, MessagePack-Ruby, Norikra, ...

Treasure Data, Inc.

Page 3: To Have Own Data Analytics Platform, Or NOT To
Page 4: To Have Own Data Analytics Platform, Or NOT To

http://tsuchinoko.dmmlabs.com/?p=1770

Page 5: To Have Own Data Analytics Platform, Or NOT To

On Feb 23, 2015

• To Have Own Data Analytics Platform, Or NOT To, In Startup Companies:

• "NOT To, in general"

• Data analytics services:
  • AWS EMR, Redshift
  • Google BigQuery
  • Treasure Data

Page 6: To Have Own Data Analytics Platform, Or NOT To

Options In 2017

• On Premise
  • Cloudera CDH, Hortonworks HDP, ...

• Services
  • AWS EMR, Redshift, Athena, Kinesis Analytics, ...
  • Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
  • MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
  • Treasure Data

Page 7: To Have Own Data Analytics Platform, Or NOT To

TO HAVE OR

NOT TO HAVE ?

Page 8: To Have Own Data Analytics Platform, Or NOT To

DO NOT

Page 9: To Have Own Data Analytics Platform, Or NOT To

😝

Page 10: To Have Own Data Analytics Platform, Or NOT To

Anyway,

Page 11: To Have Own Data Analytics Platform, Or NOT To

NO FINE CONCLUSION IN THIS PRESENTATION

Page 12: To Have Own Data Analytics Platform, Or NOT To

On Premise Platform In Past

• 2011-2014: On-premise Hadoop & Presto cluster
  • w/ Fluentd stream processing cluster
  • w/ Norikra stream processing
  • w/ Web UI (Shib)

https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan

Page 13: To Have Own Data Analytics Platform, Or NOT To

To Be Considered

• Distributed Processing Platform

• Data Management

• Process Management

• Platform Management

• Visualization and BI

• Connecting Data

Page 14: To Have Own Data Analytics Platform, Or NOT To

Distributed Processing Platform

• Hadoop, Presto, Spark, Flink, Storm, ...
  • + Servers

• EMR, Redshift, Dataproc, ...
  • Cost per instance

• BigQuery, Athena, Treasure Data, ...
  • Cost per data/queries/...
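
The two pricing models above (cost per instance versus cost per data/queries) can be compared with a rough break-even calculation. The following is only a sketch: every price, node count, and function name in it is a hypothetical placeholder, not a real quote from any of the services listed.

```python
# All prices and sizes here are hypothetical placeholders, not real service pricing.

def instance_based_cost(nodes: int, price_per_node_hour: float, hours: float) -> float:
    """Cost when you pay for running instances, whether or not queries run."""
    return nodes * price_per_node_hour * hours


def usage_based_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Cost when you pay for the amount of data scanned by queries."""
    return tb_scanned * price_per_tb


def break_even_tb(nodes: int, price_per_node_hour: float, hours: float,
                  price_per_tb: float) -> float:
    """Scanned volume at which both pricing models cost the same."""
    return instance_based_cost(nodes, price_per_node_hour, hours) / price_per_tb


if __name__ == "__main__":
    monthly_hours = 24 * 30
    cluster = instance_based_cost(nodes=10, price_per_node_hour=0.5, hours=monthly_hours)
    per_query = usage_based_cost(tb_scanned=200, price_per_tb=5.0)
    print(f"instance-based: {cluster:.2f}")
    print(f"usage-based:    {per_query:.2f}")
    print(f"break-even:     {break_even_tb(10, 0.5, monthly_hours, 5.0):.1f} TB scanned")
```

A calculation like this covers only the service bill. The "+ Servers" option also carries hardware and operations costs that this sketch does not model.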

Page 15: To Have Own Data Analytics Platform, Or NOT To

Data Management

• How to collect data?

• How to ingest data?

• How to manage schema?

• How to move data from here to there?

Page 16: To Have Own Data Analytics Platform, Or NOT To

Process Management

• How to run queries on schedule?

• How to build workflows between queries?

• How to run queries after data ingestion?

• How to move data from the platform to elsewhere after queries?
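
One way to picture the workflow questions above is as a dependency graph: a query runs only after the ingestion or queries it depends on have finished. The sketch below uses Python's standard-library `graphlib` to compute such an order. The task names and the `run_task` stand-in are hypothetical, and a real platform would submit queries or export jobs instead of just recording names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "ingest_logs": set(),
    "daily_sessions": {"ingest_logs"},
    "daily_revenue": {"ingest_logs"},
    "kpi_report": {"daily_sessions", "daily_revenue"},
    "export_to_crm": {"kpi_report"},
}


def run_task(name: str) -> str:
    """Stand-in for submitting a query or export job; it only returns the name."""
    return name


def run_workflow(graph: dict[str, set[str]]) -> list[str]:
    """Run every task after all of its dependencies and return the execution order."""
    order = []
    for task in TopologicalSorter(graph).static_order():
        order.append(run_task(task))
    return order


if __name__ == "__main__":
    print(run_workflow(workflow))
```

`TopologicalSorter` raises `CycleError` if the dependencies form a loop, which is usually the first check a workflow definition needs. Scheduling, retries, and failure handling are left out of this sketch.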

Page 17: To Have Own Data Analytics Platform, Or NOT To

Platform Management

• How to upgrade software?

• How to add nodes?

• How to manage failures / downtime?

• How to replace hardware?

• How to switch platforms?

• How to provide compatibility for queries?

Page 18: To Have Own Data Analytics Platform, Or NOT To

Visualization and BI

• How to show query results graphically?

• How to show relations between data graphically?

• How to query data interactively?

Page 19: To Have Own Data Analytics Platform, Or NOT To

Connecting Data

• How to join logs and master data?

• How to join logs and user list?

• How to join logs and CRM data?

• How to push query results to marketing tools/services?

• How to send notifications using query results?
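
As a small illustration of joining logs with master data, the sketch below loads both into an in-memory SQLite database and counts log rows per user plan. The table names, columns, and rows are hypothetical. On a real platform the same join would run on an engine such as Hive or Presto, once the master data has been imported next to the logs.

```python
import sqlite3

# Hypothetical tables, columns, and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE access_logs (user_id INTEGER, path TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, plan TEXT);
    INSERT INTO access_logs VALUES (1, '/home'), (1, '/cart'), (2, '/home'), (3, '/home');
    INSERT INTO users VALUES (1, 'paid'), (2, 'free');
""")

# LEFT JOIN keeps log rows whose user is missing from the master data,
# so gaps in the master data show up as 'unknown' instead of vanishing.
rows = conn.execute("""
    SELECT COALESCE(u.plan, 'unknown') AS user_plan, COUNT(*) AS hits
    FROM access_logs AS l
    LEFT JOIN users AS u ON u.user_id = l.user_id
    GROUP BY user_plan
    ORDER BY user_plan
""").fetchall()

print(rows)
```

The join itself is the easy part. The questions on this slide are mostly about getting the master, user, and CRM data into the same place as the logs, and getting the results back out.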

Page 20: To Have Own Data Analytics Platform, Or NOT To

Additional Topics

• Stream Processing Platform

• Machine Learning Platform

• AI(?) Services

Page 21: To Have Own Data Analytics Platform, Or NOT To

In My Past Case:

• Distributed Processing Platform
  • Hadoop & Presto (& Norikra)

• Data Management
  • Hive schema & custom-made UI (Shib)
  • Managed by the engineers of each service

• Process Management
  • Custom-made query scheduler (ShibUI)

• Platform Management
  • By tagomoris

• Visualization, BI: N/A

• Connecting Data: N/A

Page 22: To Have Own Data Analytics Platform, Or NOT To

About Treasure Data

• Distributed Processing Platform: Hive, Presto

• Data Management: Fluentd & Schema-less DB

• Process Management: Digdag / Treasure Workflow

• Platform Management: Automatic

• Visualization and BI: Treasure BI

• Connecting Data: Embulk / Data Connector

😝

Page 23: To Have Own Data Analytics Platform, Or NOT To

Recent Improvements around Data Analytics

• Improvements of CDH/HDP to manage clusters
  • Online Upgrade
  • Support many processing frameworks

• Many new data processing software/frameworks
  • Apache Flink, Apache Arrow, Apache Beam, ...

• Many new services available
  • Stream processing, Machine learning, ...

Page 24: To Have Own Data Analytics Platform, Or NOT To

MONEY

• Saving money is important - it's true.

Page 25: To Have Own Data Analytics Platform, Or NOT To

MONEY

• Saving money introduces many issues - it's true!

Page 26: To Have Own Data Analytics Platform, Or NOT To

MONEY

• Money solves many problems - is it true?

Page 27: To Have Own Data Analytics Platform, Or NOT To

Complexity

• Connecting data / processing with applications

• Connecting data / processing with services

• Connecting data / processing with people

Page 28: To Have Own Data Analytics Platform, Or NOT To

Chasing the World

• Many new software / services / platforms / paradigms, day by day

• Data sizes are growing day by day

• Complexity is growing day by day

• A data platform CANNOT live as-is for 5 years!

Page 29: To Have Own Data Analytics Platform, Or NOT To

Finding Treasure From Data

• "Data Processing" is:
  • NOT the purpose
  • just a tool to get something great

• Use developers and their time to find treasures!

Page 30: To Have Own Data Analytics Platform, Or NOT To

TBD

Thank you! @tagomoris