Analyzing Twitter Data with Hadoop

1. Analyzing Twitter Data with Hadoop
  • Data Science Maryland, May 2013
  • Joey Echeverria | Director Federal FTS | @fwiffo
  • 2013 Cloudera, Inc.

2. About Joey
  • Director Federal FTS
  • Technical guy playing manager
  • 2 years @ Cloudera
  • 5 years of Hadoop
  • Local

3. BUILDING A BIG DATA SOLUTION

4. Big Data
  • Big
      • Larger volume than you've handled before
      • No litmus test
      • High value, under-utilized
  • Data
      • Structured
      • Unstructured
      • Semi-structured
  • Hadoop
      • Distributed file system
      • Distributed, batch computation

5. Data Management Systems
  [Diagram: Data Source, Data Storage, Data Ingestion, Data Processing]

6. Relational Data Management Systems
  [Diagram: Data Source, RDBMS, ETL, Reporting]

7. A Canonical Hadoop Architecture
  [Diagram: Data Source, HDFS, Flume, Hive (Impala)]

8. AN EXAMPLE USE CASE

9. Analyzing Twitter
  • Social media popular with marketing teams
  • Twitter is an effective tool for promotion
  • Who is influential?
  • Tweets
  • Followers
  • Retweets
      • Similar to e-mail forwarding
  • Which Twitter user gets the most retweets?
  • Who is influential in our industry?

10. HOW DO WE ANSWER THESE QUESTIONS?

11. Techniques
  • SQL
      • Filtering
      • Aggregation
      • Sorting
  • Complex data
      • Deeply nested
      • Variable schema

12. Architecture
  [Diagram: Twitter, HDFS, Flume, Hive, Oozie, Impala. Labels: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Add Partitions Hourly; Queries; Queries and ETL]

13. TWITTER SOURCE

14. Flume
  • Streaming data flow
  • Sources
      • Push or pull
  • Sinks
  • Event based

15. Pulling Data From Twitter
  • Custom source, using twitter4j
  • Sources process data as discrete events

16. Loading Data Into HDFS
  • HDFS Sink comes stock with Flume
  • Easily separate files by creation time
  • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
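
The escape sequences in the sink path on slide 16 are filled in from each event's timestamp. The following is a minimal sketch of that expansion in plain Java, not Flume's own code. It assumes the timestamp is in epoch milliseconds and that UTC is used; the sink's real time zone handling is configurable and is not modeled here. `TweetPathSketch` and `bucketFor` are hypothetical names introduced for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class TweetPathSketch {
    // Equivalent of the %Y/%m/%d/%H escapes: year, month, day, hour (24h).
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    /** Returns the hourly directory for an event (assumption: UTC, epoch millis). */
    static String bucketFor(String base, long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return base + HOURLY.format(t) + "/";
    }

    public static void main(String[] args) {
        // 1367413200000 ms is 2013-05-01 13:00:00 UTC.
        System.out.println(
            bucketFor("hdfs://hadoop1:8020/user/flume/tweets/", 1367413200000L));
        // prints hdfs://hadoop1:8020/user/flume/tweets/2013/05/01/13/
    }
}
```

Every tweet created within the same hour maps to the same directory, which is what later allows one Hive partition per hour.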

17. Flume Source

    public class TwitterSource extends AbstractSource
        implements EventDrivenSource, Configurable {
      ...
      // The initialization method for the Source. The context contains all
      // the Flume configuration info
      @Override
      public void configure(Context context) {
        ...
      }
      ...
      // Start processing events. Uses the Twitter Streaming API to sample
      // Twitter, and process tweets.
      @Override
      public void start() {
        ...
      }
      ...
      // Stops the Source's event processing and shuts down the Twitter stream.
      @Override
      public void stop() {
        ...
      }
    }

18. Twitter API
  • Callback mechanism for catching new tweets

    /** The actual Twitter stream. It's set up to collect raw JSON data */
    private final TwitterStream twitterStream = new TwitterStreamFactory(
        new ConfigurationBuilder().setJSONStoreEnabled(true).build())
        .getInstance();
    ...
    // The StatusListener is a twitter4j API that can be added to a stream,
    // and will call a method every time a message is sent to the stream.
    StatusListener listener = new StatusListener() {
      // The onStatus method is executed every time a new tweet comes in.
      public void onStatus(Status status) {
        ...
      }
    };
    ...
    // Set up the stream's listener (defined above), and set any necessary
    // security information.
    twitterStream.addListener(listener);
    twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
    AccessToken token = new AccessToken(accessToken, accessTokenSecret);
    twitterStream.setOAuthAccessToken(token);

19. JSON Data
  • JSON data is processed as an event and written to HDFS

    public void onStatus(Status status) {
      // The EventBuilder is used to build an event using the headers and
      // the raw JSON of a tweet
      headers.put("timestamp", String.valueOf(
          status.getCreatedAt().getTime()));
      Event event = EventBuilder.withBody(
          DataObjectFactory.getRawJSON(status).getBytes(), headers);

      channel.processEvent(event);
    }

20. FLUME DEMO
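
The callback-to-event flow in slides 17 to 19 can be tried without Flume or twitter4j on the classpath. The sketch below imitates the pattern with standard-library types only: a listener is called once per tweet, wraps the raw JSON as an event body with a `timestamp` header, and hands the event to a channel. `SimpleEvent`, `TweetListener`, and `listenerFor` are hypothetical stand-ins, and the queue merely plays the role of the channel. None of them are Flume or twitter4j APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class CallbackToEventSketch {
    /** Hypothetical stand-in for a Flume event: string headers plus an opaque body. */
    static final class SimpleEvent {
        final Map<String, String> headers;
        final byte[] body;

        SimpleEvent(Map<String, String> headers, byte[] body) {
            this.headers = headers;
            this.body = body;
        }
    }

    /** Hypothetical stand-in for the streaming callback (StatusListener in the deck). */
    interface TweetListener {
        void onTweet(long createdAtMillis, String rawJson);
    }

    /** Builds a listener that wraps each tweet as an event and hands it to the channel. */
    static TweetListener listenerFor(BlockingQueue<SimpleEvent> channel) {
        return (createdAtMillis, rawJson) -> {
            Map<String, String> headers = new HashMap<>();
            // Same header name as in the slide; downstream bucketing relies on it.
            headers.put("timestamp", String.valueOf(createdAtMillis));
            channel.add(new SimpleEvent(headers, rawJson.getBytes(StandardCharsets.UTF_8)));
        };
    }

    public static void main(String[] args) {
        BlockingQueue<SimpleEvent> channel = new LinkedBlockingQueue<>();
        listenerFor(channel).onTweet(1367413200000L, "{\"text\":\"hello\"}");
        SimpleEvent event = channel.remove();
        System.out.println(event.headers.get("timestamp")); // prints 1367413200000
    }
}
```

Keeping the body as opaque bytes mirrors the deck's approach: the source does not parse the tweet, and interpretation is deferred until query time.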

21. HIVE

22. What is Hive?
  • Created at Facebook
  • HiveQL
      • SQL-like interface
  • Hive interpreter converts HiveQL to MapReduce code
  • Returns results to the client

23. Hive Details
  • Schema on read
  • Scalar types (int, float, double, boolean, string)
  • Complex types (struct, map, array)
  • Metastore contains table definitions
      • Stored in a relational database
      • Similar to catalog tables in other DBs

24. Complex Data

    SELECT
      t.retweet_screen_name,
      sum(retweets) AS total_retweets,
      count(*) AS tweet_count
    FROM (SELECT
            retweeted_status.user.screen_name AS retweet_screen_name,
            retweeted_status.text,
            max(retweeted_status.retweet_count) AS retweets
          FROM tweets
          GROUP BY
            retweeted_status.user.screen_name,
            retweeted_status.text) t
    GROUP BY t.retweet_screen_name
    ORDER BY total_retweets DESC
    LIMIT 10;

25. JSON INTERLUDE

26. What is JSON?
  • Complex, semi-structured data
  • Based on JavaScript's data syntax
  • Rich, nested data types:
      • number
      • string
      • array
      • object
      • true, false
      • null

27. What is JSON?

    {
      "retweeted_status": {
        "contributors": null,
        "text": "#Crowdsourcing drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
        "retweeted": false,
        "entities": {
          "hashtags": [
            {
              "text": "Crowdsourcing",
              "indices": [0, 14]
            },
            {
              "text": "bigdata",
              "indices": [129, 137]
            }
          ],
          "user_mentions": []
        }
      }
    }

28. Hive Serializers and Deserializers
  • Instructs Hive on how to interpret data
  • JSONSerDe

29. HIVE DEMO

30. IT'S A TRAP!
  • Photo from http://… Some rights reserved
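
The two-level aggregation in the retweet query on slide 24 is easier to follow when run on a handful of rows. The sketch below reproduces its logic in memory with plain Java collections. It illustrates what the query computes, not how Hive executes it. `TopRetweetedSketch`, `Retweet`, and `topRetweeted` are hypothetical names, and the `tweet_count` column is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopRetweetedSketch {
    /** One observed retweet: original author, original text, and the count seen at that time. */
    static final class Retweet {
        final String screenName;
        final String text;
        final long retweetCount;

        Retweet(String screenName, String text, long retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    /** Returns screen name -> total retweets, highest first, at most `limit` entries. */
    static Map<String, Long> topRetweeted(List<Retweet> rows, int limit) {
        // Inner query: GROUP BY (screen_name, text), keeping max(retweet_count),
        // because the same original tweet is seen many times with a growing count.
        Map<List<String>, Long> maxPerTweet = new HashMap<>();
        for (Retweet r : rows) {
            maxPerTweet.merge(Arrays.asList(r.screenName, r.text), r.retweetCount, Long::max);
        }
        // Outer query: GROUP BY screen_name, summing the per-tweet maxima.
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<List<String>, Long> e : maxPerTweet.entrySet()) {
            totals.merge(e.getKey().get(0), e.getValue(), Long::sum);
        }
        // ORDER BY total_retweets DESC LIMIT n.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(limit, sorted.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

Taking the maximum before summing matters: summing the raw rows would count the same original tweet once per observed retweet and inflate the totals.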

31. Not a Database

                          RDBMS                     Hive
    Language              Generally >= SQL-92       Subset of SQL-92 plus Hive-specific extensions
    Update capabilities   INSERT, UPDATE, DELETE    INSERT OVERWRITE; no UPDATE, DELETE
    Transactions          Yes                       No
    Latency               Sub-second                Minutes
    Indexes               Yes                       Yes
    Data size             Terabytes                 Petabytes

32. ETL
  • Hive works great for SQL-based ETL
  • JSON -> SequenceFiles

33. IMPALA

34. Cloudera Impala
  Real-Time Query for Data Stored in Hadoop.
  • Supports Hive SQL
  • 4-30X faster than Hive over MapReduce
  • Uses existing drivers, integrates with existing metastore, works with leading BI tools
  • Flexible, cost-effective, no lock-in
  • Deploy & operate with Cloudera Enterprise RTQ
  • Supports multiple storage engines & file formats

35. Benefits of Cloudera Impala
  Real-Time Query for Data Stored in Hadoop
  • Real-time queries run directly on source data
  • No ETL delays
  • No jumping between data silos
  • No double storage with EDW/RDBMS
  • Unlock analysis on more data
  • No need to create and maintain complex ETL between systems
  • No need to preplan schemas
  • All data available for interactive queries
  • No loss of fidelity from fixed data schemas
  • Single metadata store from origination through analysis
  • No need to hunt through multiple data silos

36. Cloudera Impala Details
  [Diagram: ODBC and SQL App clients; three nodes, each with Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, and HBase; shared State Store, HDFS NN, Hive Metastore, and YARN. Labels: Fully MPP Distributed; Local Direct Reads; Common Hive SQL and interface; Unified metadata and scheduler; Low-latency scheduler and cache (low-impact failures)]

37. IMPALA DEMO

38. OOZIE AUTOMATION

39. Oozie: Everything in its Right Place

40. Oozie for Partition Management
  • Once an hour, add a partition
  • Takes advantage of advanced Hive functionality

41. OOZIE DEMO

42. PUTTING IT ALL TOGETHER

43. Complete Architecture
  [Diagram: Twitter, HDFS, Flume, Hive, Oozie, Impala. Labels: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Add Partitions Hourly; Queries; Queries and ETL]

44. MORE DEMOS

45. What next?
  • Cloudera University
  • Download Hadoop!
  • CDH available at …
  • Cloudera provides pre-loaded VMs
      • https://…
  • Clone the source repo
      • https://…tter-example

46. My personal preference
  • Cloudera Manager
      • https://…
  • Free up to 50 unlimited nodes!

47. Shout Out
  • Jon Natkins
  • @nattyice
  • Blog posts
      • http://…tter-data-with-hadoop/
      • http://…tter-data-with-hadoop-part-2-gathering-data-with-flume/
      • http://…tter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

48. Questions?
  • Contact me!
  • Joey Echeverria
  • @fwiffo
  • We're hiring!
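
The hourly step from slide 40 comes down to running one Hive statement per new directory. The sketch below only builds that statement as a string; it does not talk to Hive or Oozie. It uses the standard HiveQL form `ALTER TABLE ... ADD IF NOT EXISTS PARTITION ... LOCATION ...`. The table name `tweets` comes from the query on slide 24. The partition column `datehour`, its `yyyyMMddHH` format, and the use of UTC are assumptions made for illustration, as are the names `AddPartitionSketch` and `addPartitionStatement`.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class AddPartitionSketch {
    // Directory layout written by the sink: .../yyyy/MM/dd/HH (assumption: UTC).
    private static final DateTimeFormatter PATH =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);
    // Hypothetical partition key format, e.g. 2013050113.
    private static final DateTimeFormatter KEY =
        DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    /** Builds the HiveQL that registers one hour of tweets as a partition. */
    static String addPartitionStatement(String baseDir, Instant hour) {
        return "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = "
            + KEY.format(hour) + ") LOCATION '" + baseDir + PATH.format(hour) + "'";
    }

    public static void main(String[] args) {
        System.out.println(addPartitionStatement(
            "hdfs://hadoop1:8020/user/flume/tweets/",
            Instant.ofEpochMilli(1367413200000L)));
    }
}
```

`IF NOT EXISTS` makes the statement safe to rerun, which suits a scheduled job that may be retried.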