A session focused on ramping you up on what Hadoop is, how it works and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table and some future projects in the Hadoop space to keep an eye on.
Text of Hadoop - Past, Present and Future - v1.2
Hadoop Past, Present and Future. 7/12/14. Prepared for: Orange County Java Users Group. Presented by: Big Data Joe Rossi (@bigdatajoerossi)
Roadmap (~1 hour): 1- What Makes Up Hadoop 1.x? 2- What's New In Hadoop 2.x? 3- The Future Of Hadoop
MapReduce v1 Limitations. Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000. Availability: JobTracker failure kills all queued and running jobs. Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization. No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
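The slot-partitioning point on this slide can be illustrated with a small calculation. This is a toy sketch, not Hadoop code: the class name, the method names and the slot counts (6 map slots, 4 reduce slots, 10 pending map tasks) are all made up for illustration. It compares how many tasks can run when slots are hard-split by type versus when the same capacity is treated as one shared pool.

```java
// Toy model, not Hadoop code: fixed map/reduce slots vs. one shared pool.
public class SlotUtilizationSketch {

    // MRv1-style: map tasks may only use map slots, reduce tasks only reduce slots.
    public static int runningWithFixedSlots(int mapSlots, int reduceSlots,
                                            int pendingMaps, int pendingReduces) {
        return Math.min(mapSlots, pendingMaps) + Math.min(reduceSlots, pendingReduces);
    }

    // Simplified shared pool: any free unit of capacity can host any task.
    public static int runningWithSharedPool(int totalCapacity,
                                            int pendingMaps, int pendingReduces) {
        return Math.min(totalCapacity, pendingMaps + pendingReduces);
    }

    public static void main(String[] args) {
        int mapSlots = 6, reduceSlots = 4;        // hypothetical node configuration
        int pendingMaps = 10, pendingReduces = 0; // map-heavy phase of a job
        int total = mapSlots + reduceSlots;

        int fixed = runningWithFixedSlots(mapSlots, reduceSlots, pendingMaps, pendingReduces);
        int shared = runningWithSharedPool(total, pendingMaps, pendingReduces);

        System.out.println("fixed slots busy: " + fixed + " of " + total);
        System.out.println("shared pool busy: " + shared + " of " + total);
    }
}
```

With these made-up numbers the four reduce slots sit idle during the map-heavy phase, which is the low-utilization effect the slide describes.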
Apache Hadoop 1.0: Single Use System (Batch Apps). HDFS (redundant, reliable storage); MapReduce (cluster resource management and data processing); Pig; Hive
What's New In Hadoop 2.x?
YARN Replaces MapReduce. YARN: Yet Another Resource Negotiator. YARN will be the de facto distributed operating system for Big Data
YARN: Taking Hadoop Beyond Batch. Store DATA in one place. Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service. Applications Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent), GRAPH (Giraph), on top of YARN (cluster resource management) and HDFS2 (redundant, reliable storage)
YARN: Applications. MapReduce v2, Stream Processing, Master-Worker, Online, In-Memory, Apache Storm. All running on the same Hadoop cluster to give applications access to all the same source data!
YARN: Moving Quickly. Timeline 2010 to 2014 (today): Conceived at Yahoo!; Alpha Releases 2.0; Beta Releases 2.1; GA Released 2.2; Version 2.3; Version 2.4. 100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily
YARN: Dr. Evil Approved
YARN: How It Works. Diagram: a Client talks to the ResourceManager (which contains the Scheduler); NodeManagers host the ApplicationMaster and Containers
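The flow in this diagram can be walked through with a toy model. The sketch below is only an illustration: the classes reuse the component names from the slide but are not the real Hadoop API, and the "first node with free capacity" scheduling rule, the host names and the container counts are assumptions made up for the example. It shows a client submission causing the ResourceManager to place an ApplicationMaster, which then asks the ResourceManager for task containers on the NodeManagers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the YARN request flow; none of these classes are the real Hadoop API.
public class YarnFlowSketch {

    // A NodeManager offers a fixed number of containers on one machine.
    public static class NodeManager {
        final String host;
        int freeContainers;
        public NodeManager(String host, int freeContainers) {
            this.host = host;
            this.freeContainers = freeContainers;
        }
    }

    // The ResourceManager's scheduler decides which node hosts each container.
    public static class ResourceManager {
        public final List<NodeManager> nodes = new ArrayList<>();

        // Assumed rule: first node with spare capacity wins; null if the cluster is full.
        public String allocate() {
            for (NodeManager nm : nodes) {
                if (nm.freeContainers > 0) {
                    nm.freeContainers--;
                    return nm.host;
                }
            }
            return null;
        }
    }

    // One ApplicationMaster per application; it asks the RM for task containers.
    public static class ApplicationMaster {
        public List<String> requestContainers(ResourceManager rm, int count) {
            List<String> hosts = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                String host = rm.allocate();
                if (host != null) hosts.add(host);
            }
            return hosts;
        }
    }

    // Client entry point: the RM first places the AM, then the AM requests task containers.
    public static List<String> submit(ResourceManager rm, int tasks) {
        List<String> placements = new ArrayList<>();
        String amHost = rm.allocate();
        if (amHost == null) return placements; // no room even for the AM
        placements.add("AM@" + amHost);
        for (String host : new ApplicationMaster().requestContainers(rm, tasks)) {
            placements.add("task@" + host);
        }
        return placements;
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.nodes.add(new NodeManager("node1", 2));
        rm.nodes.add(new NodeManager("node2", 2));
        System.out.println(submit(rm, 3)); // [AM@node1, task@node1, task@node2, task@node2]
    }
}
```

Note that the ApplicationMaster itself occupies a container, so per-application coordination runs on the worker nodes rather than inside the ResourceManager.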
YARN: What Has Changed? YARN vs. MRv1: ResourceManager (RM) and ApplicationMaster (AM) in place of the JobTracker (JT); Scheduler in place of Scheduler; NodeManager (NM) in place of the TaskTracker (TT); Container in place of Map and Reduce slots
6 Benefits of YARN: Scale; New programming models and services; Improved cluster utilization; Agility; Backwards compatible with MapReduce v1; Mixed workloads on the same source of data
The Future of Hadoop Projects and Roadmap
SQL on Hadoop. Speed: Deliver interactive query performance. SQL: Support array of SQL semantics for analytic applications running against Hadoop. Scale: SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes
Next Gen SQL on Hadoop: Hive on Apache Tez (Hortonworks); Hive on Apache Spark (Cloudera); Cloudera Impala (Cloudera); Apache Drill (MapR)
HOYA: HBase (NoSQL) on YARN. Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load. Easier Deployment: APIs to create, start, stop and delete HBase clusters. Availability: Recover from Region Server loss with a new container.
Microsoft REEF (Retainable Evaluator Execution Framework). Machine Learning: Framework well suited for building machine learning jobs. Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. Maintain State: Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
Heterogeneous Storages in HDFS. Diagram: NameNode; Storage: SATA, SSD, Fusion IO