
Building Hadoop Data Applications with Kite



By Tom White, Software Engineer at Cloudera. Video at: https://www.youtube.com/watch?v=ibgoMdca5mQ&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR&index=1


Page 1: Building Hadoop Data Applications with Kite



Building Hadoop Data Applications with Kite

Tom White (@tom_e_white), Hadoop Users Group UK, London, 17 June 2014

Page 2: Building Hadoop Data Applications with Kite

About me

• Engineer at Cloudera working on Core Hadoop and Kite

• Apache Hadoop Committer, PMC Member, Apache Member

• Author of “Hadoop: The Definitive Guide”


Page 3: Building Hadoop Data Applications with Kite

Hadoop 0.1

% cat bigdata.txt | hadoop fs -put - in
% hadoop MyJob in out
% hadoop fs -get out


Page 4: Building Hadoop Data Applications with Kite

Characteristics

• Batch applications only

• Low-level coding

• File format

• Serialization

• Partitioning scheme


Page 5: Building Hadoop Data Applications with Kite

A Hadoop stack


Page 6: Building Hadoop Data Applications with Kite

Common Data, Many Tools

# tools >> # file formats >> # file systems


Page 7: Building Hadoop Data Applications with Kite

Glossary

• Apache Avro – cross-language data serialization library

• Apache Parquet (incubating) – column-oriented storage format for nested data

• Apache Hive – data warehouse (SQL and metastore)

• Apache Flume – streaming log capture and delivery system

• Apache Oozie – workflow scheduler system

• Apache Crunch – Java API for writing data pipelines

• Impala – interactive SQL on Hadoop


Page 8: Building Hadoop Data Applications with Kite

Outline

• A Typical Application

• Kite SDK

• An Example

• Advanced Kite


Page 9: Building Hadoop Data Applications with Kite

A typical application (zoom 100:1)


Page 10: Building Hadoop Data Applications with Kite

A typical application (zoom 10:1)


Page 11: Building Hadoop Data Applications with Kite

A typical pipeline (zoom 5:1)


Page 12: Building Hadoop Data Applications with Kite

Kite SDK


Page 13: Building Hadoop Data Applications with Kite

Kite Codifies Best Practice as APIs, Tools, Docs and Examples


Page 14: Building Hadoop Data Applications with Kite

Kite

• A client-side library for writing Hadoop Data Applications

• First release was in April 2013 as CDK

• 0.14.1 last month

• Open source, Apache 2 license, kitesdk.org

• Modular

• Data module (HDFS, Flume, Crunch, Hive, HBase)

• Morphlines transformation module

• Maven plugin


Page 15: Building Hadoop Data Applications with Kite

An Example


Page 16: Building Hadoop Data Applications with Kite

Kite Data Module

• Dataset – a collection of entities

• DatasetRepository – physical storage location for datasets

• DatasetDescriptor – holds dataset metadata (schema, format)

• DatasetWriter – write entities to a dataset in a stream

• DatasetReader – read entities from a dataset

• http://kitesdk.org/docs/current/apidocs/index.html
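A minimal sketch (not from the slides) of how these pieces fit together, using the 0.x-era stream-style API; the Event class and "events" dataset are the ones defined in the steps that follow:

// Load the dataset and write/read entities as a stream.
// Assumes org.kitesdk.data.* imports and an Event instance called event.
DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

DatasetWriter<Event> writer = events.newWriter();
try {
  writer.open();            // 0.x-era API: readers and writers are opened explicitly
  writer.write(event);      // append one entity
} finally {
  writer.close();
}

DatasetReader<Event> reader = events.newReader();
try {
  reader.open();
  for (Event e : reader) {  // a DatasetReader is iterable over entities
    System.out.println(e);
  }
} finally {
  reader.close();
}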


Page 17: Building Hadoop Data Applications with Kite

1. Define the Event Entity

public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}


Page 18: Building Hadoop Data Applications with Kite

2. Create the Events Dataset

DatasetRepository repo = DatasetRepositories.open("repo:hive");

DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Event.class).build();

repo.create("events", descriptor);


Page 19: Building Hadoop Data Applications with Kite

(2. or with the Maven plugin)

$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event


Page 20: Building Hadoop Data Applications with Kite

A peek at the Avro schema

$ hive -e "DESCRIBE EXTENDED events"
...
{
  "type" : "record",
  "name" : "Event",
  "namespace" : "com.example",
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "timestamp", "type" : "long" },
    { "name" : "source", "type" : "string" }
  ]
}

Page 21: Building Hadoop Data Applications with Kite

3. Write Events

Logger logger = Logger.getLogger(...);

Event event = new Event();
event.setId(id);
event.setTimestamp(System.currentTimeMillis());
event.setSource(source);

logger.info(event);


Page 22: Building Hadoop Data Applications with Kite

Log4j configuration

log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41415
log4j.appender.flume.DatasetRepositoryUri = repo:hive
log4j.appender.flume.DatasetName = events


Page 23: Building Hadoop Data Applications with Kite

The resulting file layout

/user
  /hive
    /warehouse
      /events
        /FlumeData.1375659013795
        /FlumeData.1375659013796

Avro files

Page 24: Building Hadoop Data Applications with Kite

4. Generate Summaries with Crunch

PCollection<Event> events =
    read(asSource(repo.load("events"), Event.class));

PCollection<Summary> summaries = events
    .by(new GetTimeBucket(), // minute of day, source
        Avros.pairs(Avros.longs(), Avros.strings()))
    .groupByKey()
    .parallelDo(new MakeSummary(),
        Avros.reflects(Summary.class));

write(summaries, asTarget(repo.load("summaries")));
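GetTimeBucket and MakeSummary are user-defined Crunch functions that the slide doesn't show. A hypothetical sketch of the key function, bucketing each event by (minute of day, source) as the comment above suggests:

// Hypothetical: maps each Event to a (minute-of-day, source) pair so
// events can be grouped into per-source, per-minute summaries.
// Uses org.apache.crunch.MapFn and org.apache.crunch.Pair.
public class GetTimeBucket extends MapFn<Event, Pair<Long, String>> {
  @Override
  public Pair<Long, String> map(Event event) {
    long minuteOfDay = (event.getTimestamp() / 60000L) % (24 * 60);
    return Pair.of(minuteOfDay, event.getSource());
  }
}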

Page 25: Building Hadoop Data Applications with Kite

… and run using Maven

$ mvn kite:create-dataset -Dkite.datasetName=summaries ...

<plugin>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-maven-plugin</artifactId>
  <configuration>
    <toolClass>com.example.GenerateSummaries</toolClass>
  </configuration>
</plugin>

$ mvn kite:run-tool

Page 26: Building Hadoop Data Applications with Kite

5. Query with Impala

$ impala-shell -q 'DESCRIBE events'
+-----------+--------+-------------------+
| name      | type   | comment           |
+-----------+--------+-------------------+
| id        | bigint | from deserializer |
| timestamp | bigint | from deserializer |
| source    | string | from deserializer |
+-----------+--------+-------------------+

Page 27: Building Hadoop Data Applications with Kite

… Ad Hoc Queries

$ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'
+--------------------------------------+-----+
| source                               | cnt |
+--------------------------------------+-----+
| 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
| bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
+--------------------------------------+-----+
Returned 2 row(s) in 0.56s

Page 28: Building Hadoop Data Applications with Kite

Advanced Kite


Page 29: Building Hadoop Data Applications with Kite

Unified Storage Interface

• Dataset – streaming access, HDFS storage

• RandomAccessDataset – random access, HBase storage

• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase


Page 30: Building Hadoop Data Applications with Kite

Filesystem Partitions

PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
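The strategy takes effect when it is attached to the dataset's descriptor at creation time. A minimal sketch (not on the slide), assuming the partitionStrategy() method on the descriptor builder:

// Partitions are derived from each entity's timestamp field on write.
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)   // the PartitionStrategy built above
    .build();
repo.create("events", descriptor);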


Page 31: Building Hadoop Data Applications with Kite

HBase Keys: Defined in Avro

{
  "name": "username",
  "type": "string",
  "mapping": { "type": "key", "value": "0" }
},
{
  "name": "favoriteColor",
  "type": "string",
  "mapping": { "type": "column", "value": "meta:fc" }
}

Page 32: Building Hadoop Data Applications with Kite

Random Access Dataset: Creation

RandomAccessDatasetRepository repo =
    DatasetRepositories.openRandomAccess("repo:hbase:localhost");

RandomAccessDataset<User> users = repo.load("users");

users.put(new User("bill", "green"));
users.put(new User("alice", "blue"));

Page 33: Building Hadoop Data Applications with Kite

Random Access Dataset: Retrieval

Key key = new Key.Builder(users)
    .add("username", "bill").build();

User bill = users.get(key);


Page 34: Building Hadoop Data Applications with Kite

Views  

View<User> view = users.from("username", "bill");

DatasetReader<User> reader = view.newReader();
reader.open();
for (User user : reader) {
  System.out.println(user);
}
reader.close();


Page 35: Building Hadoop Data Applications with Kite

Parallel Processing

• Goal is for Hadoop processing frameworks to “just work”

• Support Formats, Partitions, Views

• Native Kite components, e.g. DatasetOutputFormat for MR


            HDFS Dataset   HBase Dataset
Crunch      Yes            Yes
MapReduce   Yes            Yes
Hive        Yes            Planned
Impala      Yes            Planned

Page 36: Building Hadoop Data Applications with Kite

Schema Evolution

public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

The new field is @Nullable, so records written with the old schema can still be read under the updated one.

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event

Page 37: Building Hadoop Data Applications with Kite

Searchable Datasets

• Use Flume Solr Sink (in addition to HDFS Sink)

• Morphlines library to define fields to index

• SolrCloud runs on the cluster, serving indexes from HDFS

• Future support in Kite to index selected fields automatically


Page 38: Building Hadoop Data Applications with Kite

Conclusion  


Page 39: Building Hadoop Data Applications with Kite

Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage-agnostic, in a format that can be processed with a wide range of Hadoop tools.

Page 40: Building Hadoop Data Applications with Kite

Getting Started With Kite

• Examples at github.com/kite-sdk/kite-examples

• Working with streaming and random-access datasets

• Logging events to datasets from a webapp

• Running a periodic job

• Migrating data from CSV to a Kite dataset

• Converting an Avro dataset to a Parquet dataset

• Writing and configuring Morphlines

• Using Morphlines to write JSON records to a dataset


Page 41: Building Hadoop Data Applications with Kite

Questions?

kitesdk.org  

@tom_e_white  

[email protected]  


Page 42: Building Hadoop Data Applications with Kite


Page 43: Building Hadoop Data Applications with Kite

Applications

• [Batch] Analyze an archive of songs [1]

• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]

• [Search] Searching email traffic in near-real-time [3]

• [ML] Detecting fraudulent transactions using clustering [4]


[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

Page 44: Building Hadoop Data Applications with Kite

… or use JDBC

Class.forName("org.apache.hive.jdbc.HiveDriver");

Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");

Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");


Page 45: Building Hadoop Data Applications with Kite

Apps  

• App – a packaged Java program that runs on a Hadoop cluster

• cdk:package-app – create a package on the local filesystem

• like an exploded WAR

• Oozie format

• cdk:deploy-app – copy the packaged app to HDFS

• cdk:run-app – execute the app

• Workflow app – runs once

• Coordinator app – runs other apps (like cron)


Page 46: Building Hadoop Data Applications with Kite

Morphlines Example


morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example input:

<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output record:

syslog_pri: 164
syslog_timestamp: Feb  4 10:46:14
syslog_hostname: syslog
syslog_program: sshd
syslog_pid: 607
syslog_message: listening on 0.0.0.0 port 22