Snowplow: scalable open source web and event analytics platform, built on AWS. Using EMR, Redshift, Cloudfront and Elastic Beanstalk to build a scalable, log-everything, query-everything data infrastructure


DESCRIPTION

A quick overview of what Snowplow is, followed by a more in-depth dive into how Snowplow is architected and which AWS services are used where in the Snowplow data pipeline. Some tips on using EMR with Redshift are included at the end. This presentation was given to the Hadoop Users Group (HUG) in London on July 19th 2013, as part of an event focused on AWS and Redshift in particular.


Page 1: Snowplow presentation to hug uk

Snowplow: scalable open source web and event analytics platform, built on AWS. Using EMR, Redshift, Cloudfront and Elastic Beanstalk to build a scalable, log-everything, query-everything data infrastructure

Page 2: Snowplow presentation to hug uk

What is Snowplow?

• Web analytics platform
  • Javascript tags -> event-level data delivered in your own Amazon Redshift or PostgreSQL database, for analysis in R, Excel, Tableau

• Open source -> run on your own AWS account
  • Own your own data
  • Join with 3rd party data sets (PPC, Facebook, CRM)
  • Analyse with any tool you want

• Architected to scale
  • Ad networks track 100Ms of events (impressions) per day

• General purpose event analytics platform -> Universal Event Analytics
  • Log-everything infrastructure works for web data and other event data sets

Page 3: Snowplow presentation to hug uk

Why we built Snowplow

• Traditional web analytics tools are very limited
  • Siloed -> hard to integrate
  • Reports built for publishers and retailers in the 1990s

• Impressed by how easy AWS makes it to collect, manage and process massive data sets
  • More on this in a second…

• Impressed by the new generation of agile BI tools
  • Tableau, Excel, R…

• Commoditise and standardise event data capture (esp. data structure) -> enable innovation in the use of that data
  • Lots of tech companies have built a similar stack to handle data internally
  • Makes sense for everyone to standardise around an open source product

Page 4: Snowplow presentation to hug uk

Snowplow’s (loosely coupled) technical architecture

[Architecture diagram: five loosely coupled stages, connected by standardised data protocols (A-D)]

1. Trackers – generate event data (e.g. Javascript tracker)
2. Collectors – receive data from trackers and log it to S3
3. Enrich – clean and enrich raw data (e.g. geo-IP lookup, sessionization, referrer parsing)
4. Storage – store data in a format suitable to enable analysis
5. Analytics

Page 5: Snowplow presentation to hug uk

The Snowplow technology stack: trackers

1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics

Javascript tracker

Pixel (No-JS) tracker

Arduino tracker

Lua tracker

Trackers on the roadmap:
• Java
• Python
• Ruby
• Android
• iOS
• …

Page 6: Snowplow presentation to hug uk

The Snowplow technology stack: collectors

1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics

Cloudfront collector

• Tracker: GET request to a pixel hosted on Cloudfront
• Event data is appended to the GET request as a query string (decoded in the sketch below)
• Cloudfront logging -> data automatically logged to S3
• Scalable – the Cloudfront CDN is built to handle an enormous volume and velocity of requests

Clojure collector on Elastic Beanstalk

• Enables tracking users across domains, by setting a 3rd-party cookie server side
• The Clojure collector runs on Tomcat: the Tomcat log format is customized to match the Cloudfront log file format
• Elastic Beanstalk supports rotation of Tomcat logs into S3
• Scalable: Elastic Beanstalk makes it easy to handle spikes in request volumes
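As a rough illustration of the query-string mechanism (the parameter names here are hypothetical, not the actual Snowplow tracker protocol), a downstream job can decode the name/value pairs like this:

    import java.net.URLDecoder

    // A minimal sketch: event fields ride on the pixel GET request as a query
    // string, and a downstream job decodes them into name/value pairs.
    // Parameter names ("e", "page", "uid") are illustrative only.
    object QuerystringSketch {

      def parse(qs: String): Map[String, String] =
        qs.split("&").toList.flatMap { pair =>
          pair.split("=", 2) match {
            case Array(k, v) => Some(URLDecoder.decode(k, "UTF-8") -> URLDecoder.decode(v, "UTF-8"))
            case _           => None
          }
        }.toMap

      def main(args: Array[String]): Unit = {
        val example = "e=pv&page=Home%20page&uid=user123"
        println(parse(example)) // Map(e -> pv, page -> Home page, uid -> user123)
      }
    }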

Page 7: Snowplow presentation to hug uk

The Snowplow technology stack: data enrichment

1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics

Scalding Enrichment on EMR

• Enrichment process runs 1-4x per day
  • Consolidates log files from the collectors, cleans them up, enriches them, and writes the results back to storage (S3)
• Enrichments include referrer parsing, geo-IP lookups, server-side sessionization
• Process written in Scalding: a Scala API for Cascading
  • Cascading: a high-level library for Hadoop, especially well suited to building robust data pipelines (ETL) that e.g. route bad data into sinks separate from validated data
• Powered by EMR: a cluster is fired up to perform the enrichment step, then shut down (a minimal job skeleton is sketched below)
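The actual Enrichment code is not reproduced in this transcript; as a rough sketch of the shape of a Scalding job run on EMR (the enrich function is a hypothetical placeholder):

    import com.twitter.scalding._

    // A minimal sketch of a Scalding job of the kind run on EMR: read raw
    // collector logs from S3, transform each line, write the results back out.
    // The enrich function is a placeholder, not the real Snowplow enrichment.
    class EnrichmentSketchJob(args: Args) extends Job(args) {

      // e.g. --input s3://collector-logs/ --output s3://enriched-events/
      TextLine(args("input"))
        .map('line -> 'enriched) { line: String => enrich(line) }
        .project('enriched)
        .write(Tsv(args("output")))

      def enrich(rawLine: String): String =
        rawLine.trim // placeholder: the real process parses, validates and augments each row
    }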

Page 8: Snowplow presentation to hug uk

Hadoop and EMR are excellent for data enrichment

• For many, the volume of data processed with each run is not large enough to necessitate a big data solution…

• … but building the process on Hadoop / EMR means it is easy to rerun the entire historical Snowplow data set through Enrichment, e.g.:
  • When a new enrichment becomes available
  • When the company wants to apply a new definition of a key variable in their Snowplow data set (e.g. a new definition of sessionization, or a new definition of user cohorts), i.e. a change in business logic

• Reprocessing entire data set isn’t just possible -> it’s easy (as easy as just processing new data) and fast (just fire up a larger cluster)

• This is game changing in web analytics, where reprocessing data has never been possible

Page 9: Snowplow presentation to hug uk

Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR

• Scalaz is a functional programming library for Scala – it has a Validation data type which lets us accumulate errors as we process our raw Snowplow rows

• Scalding + Scalaz lets us write ETL in a very expressive way:
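The code screenshot from the original slide is not reproduced in this transcript; a heavily cut-down sketch of the idea (the CanonicalOutput fields and validation rules here are hypothetical) is:

    import scalaz._
    import Scalaz._

    // A cut-down sketch of accumulating validation errors with Scalaz.
    // CanonicalOutput here is a two-field stand-in for the real Snowplow class,
    // and the validation rules are hypothetical.
    case class CanonicalOutput(collectorTstamp: String, event: String)

    object ValidationSketch {

      // Either Some(event), None (e.g. a row we deliberately skip),
      // or a non-empty list of validation error messages
      type ValidatedMaybeCanonicalOutput = ValidationNel[String, Option[CanonicalOutput]]

      def validateTstamp(raw: String): ValidationNel[String, String] =
        if (raw.nonEmpty) Success(raw)
        else Failure(NonEmptyList("Field [collector_tstamp]: empty"))

      def validateEvent(raw: String): ValidationNel[String, String] =
        if (Set("page_view", "struct_event").contains(raw)) Success(raw)
        else Failure(NonEmptyList("Field [event]: [" + raw + "] not recognised"))

      // |@| runs both validations and accumulates *all* failures,
      // rather than stopping at the first one
      def parse(rawTstamp: String, rawEvent: String): ValidatedMaybeCanonicalOutput =
        (validateTstamp(rawTstamp) |@| validateEvent(rawEvent)) { (t, e) =>
          Option(CanonicalOutput(t, e))
        }
    }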

• In the above, ValidatedMaybeCanonicalOutput contains either a valid Snowplow event, or a list of validation failures (Strings) which were encountered trying to parse the raw Snowplow log row

Page 10: Snowplow presentation to hug uk

Scalding + Scalaz make it easy for us to build rich, validated ETL pipelines to run on EMR (continued)

• Scalding + Scalaz lets us route our bad raw rows into a “bad bucket” in S3, along with all of the validation errors which were encountered for that row (see the sketch below)
• (The slide shows a pretty-printed example – in the flatfile itself it is one JSON object per line)
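The code and example output from the original slide are not reproduced in this transcript; a rough sketch of the routing idea (the errorsFor validation and the field layout are hypothetical) is:

    import com.twitter.scalding._

    // A rough sketch of splitting one Scalding pipe into a "good" sink and a
    // "bad bucket" sink, writing each bad row as one JSON object per line with
    // the errors that were found for it. errorsFor is a hypothetical stand-in
    // for the real validation.
    class BadBucketSketchJob(args: Args) extends Job(args) {

      def errorsFor(line: String): List[String] =
        if (line.split("\t", -1).length >= 3) Nil
        else List("Expected at least 3 tab-separated fields")

      def jsonString(s: String): String =
        "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

      val rows = TextLine(args("input"))
        .map('line -> 'errors) { line: String => errorsFor(line) }

      // Valid rows continue to the enriched-events output
      rows
        .filter('errors) { errs: List[String] => errs.isEmpty }
        .project('line)
        .write(Tsv(args("output")))

      // Invalid rows go to the "bad bucket", one JSON object per line
      rows
        .filter('errors) { errs: List[String] => errs.nonEmpty }
        .mapTo(('line, 'errors) -> 'json) { in: (String, List[String]) =>
          val (line, errs) = in
          "{\"line\":" + jsonString(line) + ",\"errors\":[" + errs.map(jsonString).mkString(",") + "]}"
        }
        .write(TextLine(args("errors")))
    }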

• In the future we could add an aggregation job to process these “bad bucket” files and report on the number of errors encountered and most common validation failures

Page 11: Snowplow presentation to hug uk

The Snowplow technology stack: storage and analytics

1. Trackers 2. Collectors 3. Enrich 4. Storage 5. Analytics

S3

Redshift

Postgres (coming soon)

Page 12: Snowplow presentation to hug uk

Loading Redshift from an EMR job is relatively straightforward, with some gotchas to be aware of

• Load Redshift from S3, not DynamoDB – the costs of loading from DynamoDB only make sense if you need the data in DynamoDB anyway

• Your EMR job can either write directly to S3 (slow), or write to local HDFS and then S3DistCp to S3 (faster)

• For Scalding, our Redshift table target is a POJO assembled using scala.reflect.BeanProperty – with its fields declared in the same order as in Redshift:
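The slide's code is not reproduced in this transcript; a cut-down sketch of the idea (the field names are illustrative, not the real Snowplow events table) looks something like:

    import scala.reflect.BeanProperty

    // A sketch of a bean used as the table target: one var per Redshift column,
    // declared in exactly the same order as the columns in the Redshift table,
    // each annotated with @BeanProperty so getters/setters are generated.
    // Field names here are illustrative, not the actual Snowplow events table.
    class SnowplowEventBean {
      @BeanProperty var appId: String = _
      @BeanProperty var collectorTstamp: String = _
      @BeanProperty var event: String = _
      @BeanProperty var userId: String = _
      @BeanProperty var pageUrl: String = _
      @BeanProperty var refrUrl: String = _
    }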

Page 13: Snowplow presentation to hug uk

Make sure to escape tabs, newlines etc in your strings

• Once we have Snowplow events in CanonicalOutput form, we simply unpack them into tuple fields for writing:

• Remember you are loading tab-separated, newline-terminated values into Redshift, so make sure to escape all tabs, newlines and other special characters in your strings:
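The unpacking/escaping code from the slide is not reproduced here; one simple approach (a sketch, not the actual Snowplow implementation) is to scrub the delimiter characters out of each string field before it is written:

    // A sketch of defensively cleaning a string field before it lands in a
    // tab-separated, newline-terminated flatfile for Redshift COPY. This simply
    // replaces the problem characters with spaces; an alternative is to
    // backslash-escape them and load with COPY's ESCAPE option.
    def makeTsvSafe(raw: String): String =
      Option(raw)
        .map(_.replaceAll("[\\t\\n\\r]", " "))  // strip tabs, newlines, carriage returns
        .getOrElse("")                          // null-safe: render missing values as empty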

Page 14: Snowplow presentation to hug uk

You need to handle field length too

• You can either handle string length proactively in your code, or add TRUNCATECOLUMNS to your Redshift COPY command

• Currently we proactively truncate:

• BUT this code is not unicode-aware (Redshift varchar field lengths are in terms of bytes, not characters) and rather fragile – we will likely switch to using TRUNCATECOLUMNS
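The truncation code from the slide is not reproduced here; as an illustration of what a byte-aware truncation would involve (before simply opting for TRUNCATECOLUMNS), a sketch:

    // A sketch of truncating a string to fit a Redshift varchar(n), where n is a
    // limit in bytes (UTF-8), not characters. Appends characters until adding the
    // next one would exceed maxBytes. (Illustrative only; TRUNCATECOLUMNS on the
    // COPY command achieves the same end without any client-side code.)
    def truncateToBytes(s: String, maxBytes: Int): String = {
      val out = new StringBuilder
      var bytes = 0
      var i = 0
      var done = false
      while (i < s.length && !done) {
        val charBytes = s.charAt(i).toString.getBytes("UTF-8").length
        if (bytes + charBytes > maxBytes) done = true
        else {
          bytes += charBytes
          out.append(s.charAt(i))
          i += 1
        }
      }
      out.toString
    }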

Page 15: Snowplow presentation to hug uk

Then use STL_LOAD_ERRORS, Excel and MAXERROR to help debug load errors

• If you do get load errors, then check STL_LOAD_ERRORS in Redshift – it gives you all the information you need to fix the load error

• If the error is non-obvious, pull your POJO, Redshift table definition and bad row (from STL_LOAD_ERRORS) into Excel to compare:

• COPY … MAXERROR X is your friend – lets you see more than just the first load error
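Not from the deck, but as a sketch of pulling the most recent load errors out of STL_LOAD_ERRORS over JDBC (Redshift speaks the PostgreSQL wire protocol, so the standard PostgreSQL driver works; the host, database and credentials below are placeholders):

    import java.sql.DriverManager

    // A sketch of listing recent load errors; connection details are placeholders.
    object LoadErrorsSketch {
      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/snowplow",
          "username", "password")
        try {
          val rs = conn.createStatement().executeQuery(
            """SELECT starttime, filename, line_number, colname, err_reason
              |FROM stl_load_errors
              |ORDER BY starttime DESC
              |LIMIT 10""".stripMargin)
          while (rs.next()) {
            println(rs.getString("filename").trim + " line " + rs.getLong("line_number") +
              " [" + rs.getString("colname").trim + "]: " + rs.getString("err_reason").trim)
          }
        } finally {
          conn.close()
        }
      }
    }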

Page 16: Snowplow presentation to hug uk

TSV text files are great for feeding Redshift, but be careful of using them as your “master data store”

• Some limitations to using tab-separated flat files to store your data:
  • Inefficient for storage/querying – versus e.g. binary files
  • Schemaless – no way of knowing the structure without visually eyeballing
  • Fragile – problems with field length, tabs, newlines, control characters etc.
  • Inexpressive – no support for things like Union data types; rows can only be 65kb wide (you can insert fatter rows into Redshift, but cannot query them)
  • Brittle – adding a new field to Redshift means the old files don’t load; you need to re-run the EMR job over all of your archived input data to re-generate them

• All of this means we will be moving to a more robust Snowplow event storage format on disk (Avro), and simply generating TSV files from those Avro events as needed to feed Redshift (or Postgres or Amazon RDS or …)

• Recommendation: write a new Hadoop job step to take your existing outputs from EMR and convert into Redshift-friendly TSVs; don’t start hacking on your existing data flow
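A sketch of what such a separate conversion step might look like (the cleaning rule and field handling are illustrative, not Snowplow's actual job):

    import com.twitter.scalding._

    // A sketch of the "separate job step" idea: read the existing enriched
    // output untouched, apply Redshift-friendly cleaning to each field, and
    // write a TSV ready for COPY. The cleaning rule is illustrative only.
    class RedshiftTsvSketchJob(args: Args) extends Job(args) {

      def makeTsvSafe(raw: String): String =
        Option(raw).map(_.replaceAll("[\\t\\n\\r]", " ")).getOrElse("")

      TextLine(args("input"))
        .mapTo('line -> 'cleaned) { line: String =>
          line.split("\t", -1).map(makeTsvSafe).mkString("\t")
        }
        .write(TextLine(args("output")))
    }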


Page 17: Snowplow presentation to hug uk

Any questions?

Learn more
• https://github.com/snowplow/snowplow
• http://snowplowanalytics.com/
• @snowplowdata