Embulk - An Evolving Bulk Data Loader
Sadayuki Furuhashi Founder & Software Architect
Embulk Meetup Tokyo #2
A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based parallel ETL
What’s Embulk?
> An open-source parallel bulk data loader
  > loads records from “A” to “B”
> using plugins
  > for various kinds of “A” and “B” (Storage, RDBMS, NoSQL, Cloud Service, etc.)
> to make data integration easy.
  > which was very painful… (broken records, transactions (idempotency), performance, …)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many more cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequence to U+FFFD
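The cleaning steps listed in this example can be sketched in a few lines of Python. This is only an illustration of the kind of hand-written script the slides describe, not Embulk code; all function names are hypothetical.

```python
from datetime import datetime


def clean_timestamp(value: str) -> str:
    """Rewrite an ISO 8601 UTC timestamp as 'YYYY-MM-DD HH:MM:SS UTC'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + " UTC"


def clean_null(value: str) -> str:
    """Turn the NULL marker \\N into an empty string."""
    return "" if value == r"\N" else value


def clean_float(value: str) -> str:
    """Spell out 'Inf' as 'Infinity'."""
    return "Infinity" if value == "Inf" else value


def decode_line(raw: bytes) -> str:
    """Decode UTF-8, replacing invalid byte sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


print(clean_timestamp("2015-01-27T19:05:00Z"))  # 2015-01-27 19:05:00 UTC
```

Each new kind of broken record needs another function like these, which is the maintenance burden the slides point at.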
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
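The "1 month" figure is simple arithmetic: 720 files at 1 hour each is 720 hours, or 30 days. The sketch below reproduces it and shows what parallel workers would do under the optimistic assumption that they do not slow each other down; the worker count of 8 is an arbitrary example, not a number from the talk.

```python
import math


def total_hours(file_count: int, hours_per_file: float, workers: int = 1) -> float:
    """Wall-clock hours if files are processed in rounds of `workers` at a time."""
    rounds = math.ceil(file_count / workers)
    return rounds * hours_per_file


print(total_hours(720, 1) / 24)             # 30.0 days, sequential
print(total_hours(720, 1, workers=8) / 24)  # 3.75 days with 8 independent workers
```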
A lot of integration effort for each storage and format:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization) > How to normalize broken records?
> Error handling > How to remove broken records?
> Idempotent retrying > How to retry without duplicated loading?
> Performance optimization > How to optimize the code or parallelize?
[Diagram: Embulk bulk-loads data between systems through Input plugins and Output plugins. Systems shown: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis.]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types Executor Plugin, Filter, Guess, Output.]
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types Executor Plugin, Filter, Guess, FileInput, Parser, Decoder.]
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types FileInput, Parser, Decoder, Executor Plugin, Filter, FileOutput, Formatter, Encoder.]
Execution overview
[Diagram: a Transaction decides taskCount and fans out into Tasks, each started with its own index: { taskIndex: 0, task: {…} } … { taskIndex: 2, task: {…} }.]
The Transaction runs on a single thread; the Tasks run on multiple threads (or machines).
Parallel execution
[Diagram: Tasks wait in a task queue; threads run the tasks in parallel (embulk-executor-local-thread).]
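The task-queue model above can be sketched with a thread pool. This is a minimal Python analogy, not the embulk-executor-local-thread implementation; `run_task`, `run_all`, and the `files` key are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index: int, task: dict) -> dict:
    """Hypothetical task body: pretend to load the file for this index."""
    return {"taskIndex": task_index, "loaded": task["files"][task_index]}


def run_all(task: dict, task_count: int, max_threads: int = 4) -> list:
    """Queue one job per task index and let the pool's threads drain the queue."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, i, task) for i in range(task_count)]
        # Collect results in task-index order, regardless of completion order.
        return [future.result() for future in futures]


task = {"files": ["a.csv", "b.csv", "c.csv"]}
reports = run_all(task, task_count=len(task["files"]))
```

Every task receives the same `task` description and differs only in its index, which mirrors the `{ taskIndex, task }` pairs in the execution overview.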
Distributed execution
[Diagram: Tasks wait in a task queue; Map tasks run them on Hadoop (embulk-executor-mapreduce).]
Distributed execution (w/ partitioning)
[Diagram: Tasks wait in a task queue; a Map - Shuffle - Reduce job runs them on Hadoop (embulk-executor-mapreduce).]
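The slides do not say how records are partitioned, so the following is only a generic sketch of the shuffle step: records with the same key are routed to the same reducer. The hash-based rule and all names here are assumptions for illustration, not the behavior of embulk-executor-mapreduce.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, reducer_count: int) -> int:
    """Pick a reducer deterministically from the key (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % reducer_count


def shuffle(records: list, key_field: str, reducer_count: int) -> dict:
    """Group records by the reducer their key maps to."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record[key_field], reducer_count)].append(record)
    return dict(partitions)


records = [
    {"day": "2015-01-27", "value": 1},
    {"day": "2015-01-28", "value": 2},
    {"day": "2015-01-27", "value": 3},
]
partitions = shuffle(records, "day", reducer_count=4)
```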
Transaction control
fileInput.transaction {            ← file input plugin
  parser.transaction {             ← parser plugin
    filters.transaction {          ← filter plugins
      formatter.transaction {      ← formatter plugin
        fileOutput.transaction {   ← file output plugin
          executor.transaction {   ← executor plugin
            …
          }
        }
      }
    }
  }
}
Task configuration
fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}
> taskCount is decided by input
> schema is decided by input, and may be modified by filters
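The nesting can be imitated with plain callbacks. The Python sketch below is a simplified analogy with only three stages (input, filter, output); none of these function names are Embulk APIs. It shows the two rules stated here: the input decides the task count and schema, and a filter may change the schema before the output sees it.

```python
def input_transaction(config, control):
    # The input decides the schema and the number of tasks.
    schema = ["id", "name", "junk"]
    task_count = len(config["files"])
    return control({"files": config["files"]}, schema, task_count)


def filter_transaction(config, input_schema, control):
    # A filter may modify the schema, here by dropping columns.
    output_schema = [c for c in input_schema if c not in config["drop"]]
    return control({"drop": config["drop"]}, output_schema)


def output_transaction(config, schema, control):
    return control({"table": config["table"], "columns": schema})


def run_transaction(config, run):
    def with_input(input_task, schema, task_count):
        def with_filter(filter_task, filtered_schema):
            def with_output(output_task):
                # All plugin tasks are bundled into one task description.
                task = {"input": input_task, "filter": filter_task, "output": output_task}
                return [run(index, task) for index in range(task_count)]
            return output_transaction(config, filtered_schema, with_output)
        return filter_transaction(config, schema, with_filter)
    return input_transaction(config, with_input)


config = {"files": ["a.csv", "b.csv"], "drop": ["junk"], "table": "t"}
results = run_transaction(config, lambda i, task: (i, task["output"]["columns"]))
```

Because every stage runs inside the callback of the stage before it, an outer stage can commit or clean up only after all inner stages and tasks have returned.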
Task execution
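[Diagram: within each Task, fileInput.open() feeds parser.run(fileInput, pageOutput); the pages pass through the filter plugins; formatter.open(fileOutput) writes to fileOutput.open().]
Plugins involved: file input plugin, parser plugin, filter plugins, formatter plugin, file output plugin.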
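One task's data flow can be sketched as a chain of Python generators: parse, filter, then format. This is an analogy for the pipeline shape only; the function names and the CSV handling are assumptions for the example, not Embulk's plugin interfaces.

```python
import csv
import io


def parse(lines, schema):
    """Parser stage: turn raw CSV lines into records keyed by column name."""
    for row in csv.reader(lines):
        yield dict(zip(schema, row))


def drop_empty_names(records):
    """Filter stage: drop records whose 'name' column is empty."""
    for record in records:
        if record["name"]:
            yield record


def format_csv(records, schema):
    """Formatter stage: serialize records back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in records:
        writer.writerow([record[column] for column in schema])
    return out.getvalue()


def run_task(file_input, schema):
    return format_csv(drop_empty_names(parse(file_input, schema)), schema)


output = run_task(["1,alice", "2,", "3,bob"], ["id", "name"])
print(output, end="")  # prints "1,alice" and "3,bob" on separate lines
```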
Type conversion
Input type system → Embulk type system → Output type system
Embulk type system: boolean, long, double, string, timestamp
Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
Input → Embulk conversion: Input plugin (parser plugin if input is file-based)
Embulk → Output conversion: Output plugin (formatter plugin if output is file-based)
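As an illustration of what an output plugin has to decide, the sketch below maps the five Embulk types to PostgreSQL column types. The particular mapping is an assumption made for this example and is not taken from any real plugin.

```python
EMBULK_TYPES = ("boolean", "long", "double", "string", "timestamp")

# Illustrative mapping only; a real output plugin may choose differently.
TO_POSTGRESQL = {
    "boolean": "boolean",
    "long": "bigint",
    "double": "double precision",
    "string": "text",
    "timestamp": "timestamp",
}


def create_table_sql(table: str, schema: list) -> str:
    """Build a CREATE TABLE statement from (column name, Embulk type) pairs."""
    columns = ", ".join(f"{name} {TO_POSTGRESQL[kind]}" for name, kind in schema)
    return f"CREATE TABLE {table} ({columns})"


print(create_table_sql("events", [("id", "long"), ("at", "timestamp")]))
# CREATE TABLE events (id bigint, at timestamp)
```

Keeping the intermediate type system small means each plugin only converts to or from five types instead of handling every pair of source and destination systems.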
What’s added since the first release?
• v0.3
  • Resuming
  • Filter plugin type
• v0.4
  • Plugin template generator
  • Incremental execution (ConfigDiff)
  • Isolated ClassLoaders for Java plugins
  • Polyglot command launcher
What’s added since the first release?
• v0.6
  • Executor plugin type
  • Liquid template engine
• v0.7
  • EmbulkEmbed & Embulk::Runner
  • Plugin bundle (embulk-mkbundle)
  • JRuby 9000
  • Gradle v2.6
Resuming
• Retries a failed transaction without retrying everything.
• Skips successful tasks by using information stored in a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
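The idea behind resuming can be sketched as follows: record which task indexes succeeded, and on the next run skip them. This is a simplified Python illustration; the JSON layout and the function name are hypothetical and do not reflect the real contents of resume-state.yml.

```python
import json
import os


def run_with_resume(task_count, run_task, state_path):
    """Run tasks, skipping those recorded as done by a previous failed run."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])

    failed = []
    for index in range(task_count):
        if index in done:
            continue  # finished in an earlier attempt
        try:
            run_task(index)
            done.add(index)
        except Exception:
            failed.append(index)

    if failed:
        # Persist progress so the next attempt retries only the failures.
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
    elif os.path.exists(state_path):
        os.remove(state_path)  # everything succeeded; no state to keep
    return failed
```

Skipping is only safe when each task is deterministic, so that a retried task covers exactly the same slice of input as before.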
Filter plugin type
• Filters rows out, filters columns out, or enriches the data.
• 18 plugins released.
Plugin template generator
• Generates a template for a plugin.
• Generated code is ready to compile.
  > You modify & compile it to do your work.
• embulk new <category> <name>
Incremental execution
• Stores the last file name or row in a file; the next execution starts from there.
• Use case: sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
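A minimal sketch of the incremental idea, assuming files are processed in sorted order and the last processed path is carried into the next configuration. The `last_path` key and both function names are chosen for this example only.

```python
def select_new_files(all_files, last_path=None):
    """Return files that sort after the last one already processed."""
    ordered = sorted(all_files)
    if last_path is not None:
        ordered = [path for path in ordered if path > last_path]
    return ordered


def run_incremental(all_files, config):
    """Process new files and return them with the config for the next run."""
    new_files = select_new_files(all_files, config.get("last_path"))
    # ... load new_files into the destination here ...
    next_config = dict(config)
    if new_files:
        next_config["last_path"] = new_files[-1]
    return new_files, next_config


day1, config = run_incremental(["logs/01.csv", "logs/02.csv"], {})
day2, config = run_incremental(["logs/01.csv", "logs/02.csv", "logs/03.csv"], config)
# day1 holds both initial files; day2 holds only "logs/03.csv"
```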
Isolated ClassLoaders for Java plugins
• Embulk can load Java plugins that depend on different versions of the same library.
Plugin Version Conflicts
[Diagram: on one Java Runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 and embulk-output-redshift.jar with aws-sdk.jar v1.10. Version conflicts!]
Multiple Classloaders in JVM
[Diagram: on one Java Runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 in Class Loader 1, and embulk-output-redshift.jar with aws-sdk.jar v1.10 in Class Loader 2. Isolated environments.]
Polyglot launcher script
• embulk.jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
Executor plugin type
• embulk-executor-mapreduce executes tasks in a distributed environment.
Liquid template engine
• A config file can include variables.
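To show what "variables in a config file" means in practice, the sketch below substitutes `{{ name }}` placeholders into a config string. It is a tiny stand-in written with a regular expression, not the Liquid engine, and it supports none of Liquid's tags or filters; the variable names are made up for the example.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with the matching variable's value."""
    def lookup(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lookup, template)


template = "out:\n  host: {{ db_host }}\n  port: {{ db_port }}\n"
config_text = render(template, {"db_host": "localhost", "db_port": 5432})
print(config_text, end="")
```

The same config file can then be reused across environments by changing only the variable values.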
EmbulkEmbed & Embulk::Runner
• Embed embulk in an application.
Plugin bundle
• Uses fixed versions of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
Gradle v2.6
• Continuous compiling.
• “embulk migrate .” upgrades the Gradle version of your plugin project.
• ./gradlew -t build
Future plan
• v0.8
  • JSON type (issue #306)
  • Error plugin type (#27, #124)
  • More (or less) concurrency for output (#231)
• v0.9
  • More Guess (#242, #235)
  • Multiple jobs using a single config file (#167)