bolke-de-bruin
https://wepayinc.app.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k
Airflow @ING
3
ING
Multinational banking and financial services corporation headquartered in Amsterdam.
Its primary businesses are retail banking, direct banking, wholesale banking, investment banking, asset management, and insurance services.
4
Why Apache Airflow (incubating)?
• Cron replacement
• Fault tolerant
• No XML (looking at you, Oozie!)
• Testable
• Python code
• Extendable
• Now Apache (incubating)
• Scale out
• Complex dependency rules
• Pools
• CLI & Web UI
5
Growing community
6
Airflow Operational Design
Airflow Webserver
Database
Airflow Scheduler
Airflow Executor (local/celery/mesos worker)
Airflow Tasks
Talks to
Auth Backend
7
Choose an executor that fits your environment

              SequentialExecutor   LocalExecutor                         CeleryExecutor
Use case      Mainly testing       Production (~50% of installed base)   Production (~50% of installed base)
Scalability   n/a                  Vertical                              Horizontal and vertical
Complexity    Low                  Medium                                Medium/High
DAG           Local                Local                                 Needs sync / pickle

Configuration
SequentialExecutor:
[core]
executor = SequentialExecutor

LocalExecutor:
[core]
executor = LocalExecutor
parallelism = 32

CeleryExecutor:
[core]
executor = CeleryExecutor
[celery]
celeryd_concurrency = 32
broker_url = rabbitmq
celery_result_backend
default_queue =

Remark: Don’t use num_runs
8
UTC everywhere
“Engineers here respond in UTC if you ask them what time it is” (Max)
• Airflow assumes every server / worker runs in UTC
• Airflow does not manage time zones (correctly) (to be fixed)
• UTC does not observe daylight saving time
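The effect of daylight saving time on a schedule can be illustrated with a minimal sketch. It assumes a team in Amsterdam (UTC+1 in winter, UTC+2 in summer) and uses a hypothetical helper, `local_to_utc`, that is not part of Airflow. The same 09:00 wall-clock time lands on two different UTC hours, which is why times are best expressed in UTC throughout.

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(naive_local, utc_offset_hours):
    """Attach a fixed UTC offset to a naive wall-clock time and convert it to UTC.

    Hypothetical helper for illustration only; the offset is supplied by the caller.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=local_tz).astimezone(timezone.utc)


# Assumed offsets for Amsterdam: UTC+1 in winter, UTC+2 in summer.
winter_utc = local_to_utc(datetime(2016, 1, 15, 9, 0), 1)
summer_utc = local_to_utc(datetime(2016, 7, 15, 9, 0), 2)

print(winter_utc.isoformat())
print(summer_utc.isoformat())
```

A job meant to run at 09:00 local time therefore needs a different UTC schedule in summer than in winter, or a schedule defined in UTC from the start.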
9
Tasks run at the end of the period, not at the start
• First run will be at 2016-06-01 22:00 UTC
• Execution date will be 2016-06-01 21:00 UTC
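The relation between the two timestamps can be sketched in a few lines. This assumes an hourly schedule, which is implied by the one-hour gap between the dates above but not stated explicitly; `first_run_time` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def first_run_time(execution_date, schedule_interval):
    """A run is triggered once the period starting at execution_date has ended."""
    return execution_date + schedule_interval


execution_date = datetime(2016, 6, 1, 21, 0)   # start of the period (UTC)
schedule_interval = timedelta(hours=1)          # assumed hourly schedule

first_run = first_run_time(execution_date, schedule_interval)
print(first_run)
```

The execution date labels the start of the period being processed, while the run itself starts only after that period has closed.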
10
How to stop/kill a task?
11
How to force running a task?
Celery only (for now)
12
Make your tasks and DAGs idempotent
“An idempotent operation is one that has no additional effect if it is called more than once with the same input parameters.”
• DAGs and tasks receive an execution date
• on_retry_callback can be used to do a cleanup before a retry
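One way to get idempotent behaviour is to derive the output location from the execution date and overwrite it on every run. The sketch below is plain Python rather than an Airflow task; `output_path`, `run_task`, and `cleanup_before_retry` are hypothetical names, and how the cleanup would be wired into on_retry_callback is not shown.

```python
import os
import tempfile
from datetime import datetime


def output_path(base_dir, execution_date):
    """One output file per execution date (hypothetical naming scheme)."""
    return os.path.join(base_dir, execution_date.strftime("%Y-%m-%dT%H%M") + ".csv")


def run_task(base_dir, execution_date, rows):
    """Write to a temp file, then swap it in, so a re-run replaces rather than appends."""
    path = output_path(base_dir, execution_date)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        for row in rows:
            handle.write(",".join(row) + "\n")
    os.replace(tmp_path, path)  # replaces any earlier output for this date
    return path


def cleanup_before_retry(base_dir, execution_date):
    """Remove a half-written temp file left behind by a failed attempt."""
    tmp_path = output_path(base_dir, execution_date) + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)


base_dir = tempfile.mkdtemp()
when = datetime(2016, 6, 1, 21, 0)
rows = [("a", "1"), ("b", "2")]

first = run_task(base_dir, when, rows)
second = run_task(base_dir, when, rows)  # same input, no additional effect
```

Running the task twice for the same execution date leaves exactly one file with the same content, which is the property the quote describes.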
13
Generate your tasks programmatically
• List file names on HDFS
• Loop over the file names
• Create a task for each file
• Assign upstream / downstream
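The steps above can be sketched as follows. To keep the example runnable without an Airflow installation, `Task` is a hypothetical stand-in for an operator that only records dependencies, and the file names are hard-coded where a real DAG file would list them from HDFS. In an actual DAG the loop body would create operators and use their `set_upstream` method in the same way.

```python
import posixpath


class Task:
    """Hypothetical stand-in for an Airflow operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def set_upstream(self, other):
        self.upstream.append(other)


def task_id_for(file_name):
    """Derive a task id from a file name, replacing unusual characters."""
    stem = posixpath.splitext(posixpath.basename(file_name))[0]
    return "process_" + "".join(c if c.isalnum() or c in "-_." else "_" for c in stem)


def build_tasks(file_names):
    """Create one task per file and hang each of them under a common start task."""
    start = Task("start")
    tasks = [start]
    for file_name in file_names:
        task = Task(task_id_for(file_name))
        task.set_upstream(start)
        tasks.append(task)
    return tasks


# Assumed example paths; a real DAG would obtain this list from HDFS.
file_names = ["/data/in/transactions.csv", "/data/in/risk.csv"]
tasks = build_tasks(file_names)
print([t.task_id for t in tasks])
```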
14
When using ExternalTaskSensor, make sure to manually raise the priority of the tasks it is waiting for
• Otherwise scheduling can get deadlocked as the sensors take up all the slots in the scheduler
• Another way to circumvent this issue is to have a separate pool for sensors
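Why the priority matters can be shown with a deliberately simplified model. This is not Airflow's scheduler: `fill_slots` is a hypothetical function that hands out a fixed number of slots to queued tasks, highest priority first, with ties going to whichever task was queued earlier.

```python
def fill_slots(queued, slots):
    """Pick tasks for the free slots, highest priority first (ties keep queue order)."""
    ordered = sorted(queued, key=lambda task: -task["priority"])
    return [task["name"] for task in ordered[:slots]]


# Three sensors are queued ahead of the task they are all waiting for.
sensors = [{"name": "sensor_%d" % i, "priority": 1} for i in range(3)]
awaited_same_priority = [{"name": "awaited", "priority": 1}]
awaited_raised_priority = [{"name": "awaited", "priority": 10}]

deadlocked = fill_slots(sensors + awaited_same_priority, slots=3)
healthy = fill_slots(sensors + awaited_raised_priority, slots=3)

print(deadlocked)
print(healthy)
```

In the first case every slot is held by a sensor and the awaited task never starts; raising its priority lets it take a slot first. In Airflow the corresponding task argument is `priority_weight`.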
15
Some last bits
• Do you have longer-running tasks? Increase the heartbeat of the scheduler to decrease load
• Smaller tasks make for easier debugging and retrying
• Properly choose your start date: the scheduler will fill gaps
• Changing the schedule requires changing the dag_id
• Backfills are used to add runs where the scheduler already went by
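The remark about the start date can be made concrete with a small sketch. It assumes a daily schedule and uses a hypothetical helper, `pending_execution_dates`, to list the periods that have fully ended between the start date and the current time; these are the gaps a scheduler would fill.

```python
from datetime import datetime, timedelta


def pending_execution_dates(start_date, now, schedule_interval):
    """Return the execution dates whose period has fully ended by `now`."""
    dates = []
    current = start_date
    while current + schedule_interval <= now:
        dates.append(current)
        current += schedule_interval
    return dates


pending = pending_execution_dates(
    start_date=datetime(2016, 6, 1),
    now=datetime(2016, 6, 4, 12, 0),
    schedule_interval=timedelta(days=1),  # assumed daily schedule
)
print(pending)
```

A start date far in the past multiplies the number of such runs, so it is worth setting it deliberately.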
16
Use case
[Diagram: Transactions, Risk, Products, External; HDFS, Spark, Tez, Postgres, Flume, XFB, Sqoop]
17
Wait for files to arrive (Sensor)
18
Copy & clean up
19
Model creation
• Run Spark
• Tez
Sharding
20
Sqooping to DB
21
Draft Roadmap
• Apache release
• Allow auto-aligned start_date
• Backfills to use Dag Runs
• Improve pooling
• DAG parsing isolation
• REST API
• Further Kerberos integration
• Schedule backfill Dag Runs
• Isolation
• DAG syncing across workers
• No direct imports for operators from __init__
• Event-driven scheduler
• Make tasks not need the database
• Roles / principals
Status shown for four of these items: In progress
22
Aspiring committer? Contributor? User?
http://gitter.im/apache/incubator-airflow/
https://github.com/apache/incubator-airflow/
http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/
23