
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

A Global Ecosystem for Datasets on Hadoop

JOHAN SVEDLUND NORDSTRÖM

KTH, SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


A Global Ecosystem for Datasets on Hadoop

TRITA-ICT-EX-2016:131

Johan Peter Svedlund Nordström

Master of Science Thesis

Software Engineering of Distributed Systems

School of Information and Communication Technology

KTH Royal Institute of Technology

Stockholm, Sweden

11 September 2016

Examiner: Jim Dowling


© Johan Peter Svedlund Nordström, 11 September 2016


Abstract

The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. In order to gain value from this data, it needs to be effectively stored and processed. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either local to a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend HopsWorks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. HopsWorks users are given the choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. By writing data into Kafka as it is being downloaded, it can be consumed by entities like Spark Streaming or Flink.



Acknowledgements

I would like to acknowledge my examiner Jim Dowling and my supervisor Alex Ormenisan. Both of them have contributed with advice and smart ideas which have helped me throughout the project. I would also like to thank all the coworkers at SICS, who were always glad to offer help.



Contents

1 Introduction
1.1 Problem description
1.2 Problem statement and Purpose
1.3 Goals, Ethics and Sustainability
1.4 Structure of this thesis

2 Background
2.1 Hadoop
2.2 Hops
2.3 HopsWorks
2.4 Hops-site
2.5 MySQL cluster
2.6 ElasticSearch
2.7 Epipe
2.8 GVoD
2.9 HDFS
2.10 Kafka

3 Method
3.1 Methodology


3.2 Experiments and evaluation

4 Implementation
4.1 Rules
4.2 Hops-Site interactions with a HopsWorks instance
4.3 Public Search
4.3.1 What is needed?
4.3.2 Producing a public search and handling responses
4.4 GVoD peer-to-peer upload and download
4.4.1 What is needed?
4.4.2 Upload
4.4.3 Download
4.5 Real-time processing

5 Analysis
5.1 Evaluation of implementation and current technologies
5.2 Results of Experiments
5.3 P2P test
5.4 Real-time processing tests

6 Conclusions
6.1 Goals
6.2 Future work
6.3 Conclusion

Bibliography


List of Figures

2.1 HopsWorks and Hops
2.2 HDFS
2.3 Hops HDFS
4.1 Register to Hops-Site
4.2 Ping Hops-Site
4.3 Public Search
5.1 Download with one uploader
5.2 Download with two uploaders
5.3 Download with three uploaders
5.4 Download with four uploaders



List of Acronyms and Abbreviations

HDFS Hadoop Distributed File System

YARN Yet Another Resource Negotiator

Hops Hadoop Open Platform As-a-Service

JSON JavaScript Object Notation

REST Representational State Transfer

NAT Network Address Translation

FTP File Transfer Protocol

NDB Network Database

API Application Programming Interface



Chapter 1

Introduction

Modern computing is generating massive volumes of data at ever-growing speeds. The data generated has different characteristics. Not only are the volumes large, but data is structured, semi-structured and unstructured [1]. Data originates from different kinds of sources like web pages, logs, social media, e-mail, documents, sensor devices and many more [2]. The different characteristics, size, complexity and origins of this data make it difficult for traditional storage and processing systems to handle. A commonly recognized term for this kind of data is "Big Data".

Storing, processing and extracting value from Big Data is no trivial matter, and a problem that companies like Google and Yahoo spend lots of resources on. The most notable framework for Big Data storage and processing is called Hadoop [3]. The base of Hadoop was developed at Yahoo together with the creator of the famous search-engine library Apache Lucene [4]. Hadoop is a framework consisting of several different projects such as the "Hadoop Distributed File System" (HDFS) and "Yet Another Resource Negotiator" (YARN). Hadoop is capable of storing and processing large amounts of data in a scalable, efficient and effective way. Although Hadoop excels at handling Big Data, it is a fairly young framework and lacks some important capabilities that would be beneficial for its progress.

1.1 Problem description

The capabilities that Hadoop offers make it possible for developers and scientists to create new interesting applications as well as to extract interesting information out of large datasets. However, as in any technical area, in order to make progress there needs to be a way to share knowledge. Inside a Hadoop cluster, stored in HDFS, there may be hundreds of petabytes of data which users of that cluster can perform processing on. This data is bound to the cluster where it is stored. There are limited options for users to obtain information about data that isn't local to their cluster. Neither is there support for scalable transfer of data between clusters. Right now, in order to import data that isn't local to a cluster, a user must first find it via some third-party information source and after that do some kind of copying, typically by using Hadoop's DistCp [5] or, worse, something like the File Transfer Protocol (FTP) [6]. These solutions aren't always going to work either, since the "open internet" might include NAT (Network Address Translation) endpoints and other technology which would further complicate transfers. Also, just because something is advertised to exist on the internet doesn't mean that it actually is available. These limitations make sharing datasets tedious and in many cases very slow, which could discourage users from sharing. Even if transfers were fast and knowledge of remote datasets was easily accessible, datasets might be very big and downloads will take time to complete. Some users aren't concerned with the whole blob of data, but might simply want to run some experiments on some of it. If there is no way to process data as it is being downloaded, then these users will be forced to wait for everything to finish downloading. This will likely make them less inclined to share datasets.

1.2 Problem statement and Purpose

Hadoop has no solutions for either searching or scalable sharing of datasets across cluster boundaries. Also, sharing datasets between datacenters could be problematic in the face of NAT endpoints. This means that data will likely be bound to one cluster and sharing will not happen. The lack of shared datasets is a hindrance for further progress within application areas that use the utilities of Hadoop. Even if Hadoop had scalable capabilities to share data, downloading large datasets might take too long and therefore be avoided. There needs to be a way to do processing on data as it is being downloaded.

The contribution of this thesis is the implementation and evaluation of a global ecosystem for Hadoop datasets where datasets are searchable, shareable and processable as they are downloading. The implementation uses SICS's own Hadoop distribution Hops and its frontend HopsWorks as the base system. For global cluster registrations, a central server is deployed where different Hops clusters can register their peer-to-peer and search endpoints. The peer-to-peer service is an altered version of GVoD [7], a peer-to-peer video streaming application with NAT-traversal capabilities developed at SICS. The public search is accomplished using the extremely popular ElasticSearch [8] search engine, which is a distributed document store built on top of Apache Lucene [4]. The real-time processing of data is accomplished through the choice of writing GVoD-downloaded data into either HDFS or both HDFS and Kafka [9]. By writing data into Kafka, users in HopsWorks can read from Kafka during downloads using entities like Spark Streaming [10], Flink [11] or similar technologies.

1.3 Goals, Ethics and Sustainability

The goal of this thesis and work is a working, scalable and efficient implementation of a global ecosystem for Hops datasets. HopsWorks users of different Hops clusters will be able to share and search for datasets in this global ecosystem. Also, in the case of downloading datasets, real-time processing will be supported. The explicit goals are listed below.

• Implement search for public datasets. HopsWorks users should be able to find data in their own Hops cluster and in remote Hops clusters.

• Implement peer-to-peer sharing of public datasets. HopsWorks users should be able to upload and download data to and from other Hops clusters.

• Implement support for real-time processing of downloading data. HopsWorks users should not be forced to wait for complete downloads in order to investigate interesting data.

• Demonstrate that peer-to-peer sharing of data is a scalable solution and a better solution than Hadoop's built-in copying mechanism DistCp and similar technologies.

If these goals are met, then people creating, storing and processing interesting data can share it with others from related or unrelated application domains in order to further progress their products and goals. It can directly benefit entities that work in the Big Data society, but could also indirectly benefit those who are only affected by it, for example visitors of enterprise web applications.

Introducing the concept of peer-to-peer sharing can be controversial from both an ethical and a sustainability standpoint. Peer-to-peer downloads of large datasets could potentially mean large usage of bandwidth, which could strain network infrastructure. Also, from an ethical perspective, sharing data can be problematic if there is no sufficient access control to the data being shared. Fortunately, GVoD, which is the process that conducts the peer-to-peer downloads, uses a special network protocol known as LEDBAT [12]. This protocol is different from protocols such as TCP and UDP, as it adapts its usage to the current network characteristics. Concerning ethical problems, access control is managed by the HopsWorks web application, where users can choose to make their own data publicly available.

1.4 Structure of this thesis

Chapter 1 describes the problem and its context. Chapter 2 provides the specific knowledge that the reader will need to understand the rest of the thesis. Chapter 3 describes what method was used to implement and evaluate the solution. The solution implementation is presented and explained in Chapter 4. The solution is analyzed and evaluated in Chapter 5. Finally, Chapter 6 offers some conclusions and suggests future work.


Chapter 2

Background

This chapter presents the background of the thesis. It introduces different entities that the reader needs to understand in order to comprehend the remainder of the thesis. First, Hadoop, Hops and HopsWorks are introduced, as they outline the base upon which the solution is built. After that, the different parts of the solution architecture are introduced, starting with the central server (Hops-Site), and then moving on to relational persistence (MySQL cluster), search (ElasticSearch and Epipe) and peer-to-peer sharing (GVoD). Lastly, the different dataset-store components (HDFS and Kafka) are presented, along with what they offer for this particular solution. This chapter doesn't discuss how these different techniques accomplish the overall solution; that is done in Chapter 4.

2.1 Hadoop

Apache Hadoop is a framework that provides distributed storage and processing of large datasets [3]. Hadoop was designed to scale from single servers up to massive clusters of nodes, where each node offers both storage and processing power. Today, companies like Yahoo, Facebook and Spotify deploy Hadoop stacks in their datacenters in order to manage the large amounts of data they generate [13]. Rather than relying on central and expensive solutions, Hadoop utilizes the power of parallel computing and inexpensive hardware. The main modules of Hadoop are Hadoop Common, HDFS, YARN and MapReduce. Hadoop Common consists of the utilities that the other modules need in order to function properly. HDFS is a distributed filesystem, the default filesystem for Hadoop. YARN is a resource negotiator, with responsibilities similar to a traditional operating system. MapReduce is a programming model that is widely supported inside Hadoop and allows for things like distributed processing of data.

2.2 Hops

Hadoop Open Platform As-a-Service, or "Hops", is a Hadoop distribution developed at SICS [14]. Hops has several improvements over Hadoop, some of which are listed below.

• Hadoop-as-a-Service

• Project-Based Multi-Tenancy

• Secure sharing of DataSets across HopsWorks projects

• Extensible metadata that supports free-text search using ElasticSearch

• YARN quotas for projects

The key functionality that enables the improvements listed above is the storage of HDFS and YARN metadata inside a MySQL cluster. This functionality enables things like search of datasets through ElasticSearch, described in sections 2.6 and 2.7. Also, with HDFS metadata stored inside the MySQL cluster instead of on the heap of a NameNode, Hops becomes more scalable than other Hadoop architectures [15]. The secure sharing, multi-tenancy and service improvements are enabled by the HopsWorks web application, which is described further below.

2.3 HopsWorks

HopsWorks is the frontend for Hops. It introduces concepts like Users, Projects and Datasets, which help organize the different services that Hops offers. For example, a user of HopsWorks can create a Project, run a Spark job and store the results as a dataset inside the project. Technically speaking, HopsWorks is an AngularJS [16] front-end and a Java Jersey [17] REST (Representational State Transfer) [18] back-end. The back-end talks with Hops services (like ElasticSearch and GVoD) via REST calls. Most data exchanged over REST in the Hops and HopsWorks architecture is serialized into JSON (JavaScript Object Notation) [19] format. Figure 2.1 exemplifies the architecture of Hops and HopsWorks.

Figure 2.1: HopsWorks and Hops

The top layer represents the HopsWorks REST API (Application Programming Interface); it offers many different API calls that the AngularJS frontend uses to present Hops to its users. Almost all of these API calls are protected by user authentication. However, the REST call to search for public datasets is not protected by security, so anyone can call it. This is a necessary evil, as it allows clusters to make REST calls for public datasets without needing some kind of web-application session information. We will come back to the problematic aspects of this choice in the last chapter of the thesis. The next layer represents the different services that Hops provides. The relevant services for this project are ElasticSearch and GVoD (unfortunately not in the picture), both of which are introduced below in sections 2.6 and 2.8. The bottom layer represents the main modules of Hops, the filesystem and the resource negotiator. The filesystem is a modified version of HDFS, which is further described below in section 2.9. The resource negotiator is an altered version of Hadoop's YARN. HDFS handles the storage needs inside the Hops clusters, while YARN manages important choices like scheduling and allocation of resources. Together, HDFS and YARN help enable the layered services of Hops described above.
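To make the REST layer more concrete, the sketch below shows what an unauthenticated public-dataset search endpoint could look like in Jersey (JAX-RS). It is only a minimal illustration; the resource path, method and parameter names are assumptions, not the actual HopsWorks API.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Hypothetical Jersey resource: an unauthenticated endpoint that other
// clusters could call to search this cluster's public datasets.
@Path("/public/datasets")
public class PublicDatasetSearchResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response search(@QueryParam("query") String query) {
    // In HopsWorks the query would be forwarded to the local ElasticSearch
    // instance; here an empty JSON array is returned as a placeholder.
    return Response.ok("[]").build();
  }
}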


2.4 Hops-site

Searching publicly for datasets means that those datasets need to be globally unique, i.e. unique across different clusters as well as within clusters. A Hops cluster also needs to know the search endpoints of other clusters in order to direct public-search queries to them. Hops-Site serves as a solution for both of these problems. Hops-Site is a Jersey [17] RESTful web service deployed on a Glassfish web server [20] local to a certain Hops cluster. Hops-Site offers a REST API to Hops clusters where they can advertise their search endpoints, obtain a unique cluster id and find information about other registered Hops clusters. Hops-Site also maintains other types of information about registered clusters, such as how active they are (how often they ping) and what GVoD endpoint they have. Hops-Site uses a MySQL cluster for persistence, the same as HopsWorks.

2.5 MySQL cluster

HopsWorks, Hops and Hops-Site all utilize a relational database to persist important information. HopsWorks and Hops (YARN and HDFS) need to store information about users, projects, datasets and more, while Hops-Site persists information about registered Hops clusters. Some of the reasons for choosing MySQL cluster as a persistent store for Hops-Site are listed below.

• Integration

• High Availability and Scalability

• No Single Point of Failure

The first reason for the choice of MySQL cluster is integration. Because Hops and HopsWorks already utilize MySQL cluster for persistence, it became a natural choice for Hops-Site, which also runs inside a Hops cluster. The second and perhaps most important reason is the performance that MySQL cluster offers. MySQL cluster has both high availability and scalability [21], which are both critical for up-time and for being able to handle a high load of requests. Also, because MySQL cluster is a distributed relational database, there is no single point of failure, which further improves potential up-time.

2.6 ElasticSearch

Search inside Hops clusters is powered by ElasticSearch. ElasticSearch is a distributed search engine built on top of Apache Lucene [8, 4]. The distributed nature of ElasticSearch gives it important properties like high availability, scalability and responsiveness. ElasticSearch is a document store, which means that instead of storing data in the traditional rows-and-columns format, data is stored in object format, in so-called documents. The documents themselves are stored inside indexes, which are replicated throughout an ElasticSearch cluster using shards. An index is similar to what a database is in a relational database system. By default, each field inside a document is indexed with a Lucene inverted index, making it available for fast search and retrieval [22]. Other than the performance reasons stated above, ElasticSearch is a good choice for a search engine since it speaks REST and JSON; this enables simple integration with the rest of Hops and HopsWorks. The data that ElasticSearch makes available for search comes from the metadata stored in the MySQL cluster and is delivered by Epipe, which is described below.
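As an illustration of this REST-and-JSON interface, the small example below issues a match query against a hypothetical dataset index over plain HTTP. The index name, field name and host are assumptions made for the example, not the index layout used by Hops.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ElasticSearchQueryExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical index and field names; ElasticSearch listens on port 9200 by default.
    URL url = new URL("http://localhost:9200/datasets/_search");
    String query = "{\"query\": {\"match\": {\"name\": \"genomics\"}}}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(query.getBytes(StandardCharsets.UTF_8));
    }

    // The response is a JSON document containing the matching "hits".
    try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
      System.out.println(scanner.useDelimiter("\\A").next());
    }
  }
}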


2.7 Epipe

Inside a Hops cluster, an application called Epipe is responsible for writing data from the MySQL cluster into ElasticSearch. This application employs the NDB (Network Database) event API [23] and listens for events that are generated when something changes inside the MySQL cluster, for example an update to a table. When an event is generated, Epipe looks at the event and writes the changes into ElasticSearch. In this manner, parts of the MySQL cluster are replicated in ElasticSearch, which means that users of HopsWorks can direct search queries to ElasticSearch and search for data that is stored in the MySQL cluster. An example of this would be when a user makes a dataset public; this changes a column inside the dataset table in the MySQL cluster. Epipe would receive an event and write the column change into ElasticSearch. Now users would be able to search for that public dataset by querying the local ElasticSearch instance.

2.8 GVoD

GVoD is a video-on-demand streaming application with NAT-traversal capabilities, developed at SICS [7]. GVoD has been altered to fit into the Hops ecosystem and serve as the process which handles downloads and uploads of public datasets. GVoD is a single process that runs on a single node inside a Hops cluster. GVoD incorporates peer-to-peer technology in order to upload and download datasets. GVoD downloads and uploads files in pieces, which are smaller units of data. These pieces are assembled into blocks, which are bigger units of data that are later verified for correctness and then written into HDFS. The size of a block is configurable, but in the implementation for this project it was set to 10 megabytes. Downloading datasets with GVoD means transferring a lot of these pieces and building blocks from them. In order for a GVoD instance to know that it has obtained correct blocks, it needs to verify block hash-values. GVoD incorporates "on demand hashing" of blocks in order to support this. GVoD downloads data in an orderly fashion, which differs from other common peer-to-peer applications that usually download data out of order. The fact that GVoD downloads data in order is necessary since HDFS is an append-only filesystem [24]. GVoD has also been altered to write to HDFS and Kafka, where HDFS represents the persistent store and Kafka the temporary store that enables real-time processing. Writing data into HDFS is done by transferring pieces, building blocks, verifying the block hash-values and then writing them to the DataNodes of HDFS. Writing data into Kafka is a bit different and is described further in section 2.10.
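The piece-and-block idea can be sketched as follows: pieces are concatenated into a block, and the block is only accepted if its digest matches the expected hash. This is a conceptual sketch, not GVoD's actual implementation; the digest algorithm and method names are assumptions.

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class BlockAssembler {
  // Illustrative block size; the thesis uses 10 megabyte blocks.
  static final int BLOCK_SIZE = 10 * 1024 * 1024;

  // Concatenate downloaded pieces into one block.
  static byte[] assembleBlock(List<byte[]> pieces) throws Exception {
    ByteArrayOutputStream block = new ByteArrayOutputStream(BLOCK_SIZE);
    for (byte[] piece : pieces) {
      block.write(piece);
    }
    return block.toByteArray();
  }

  // Accept the block only if its digest matches the hash advertised by the
  // uploader (SHA-256 is assumed here purely for illustration).
  static boolean verifyBlock(byte[] block, byte[] expectedHash) throws Exception {
    byte[] actual = MessageDigest.getInstance("SHA-256").digest(block);
    return Arrays.equals(actual, expectedHash);
  }
}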

2.9 HDFS

HDFS is the default filesystem for both Hadoop and Hops. It is a distributed filesystem designed to run on inexpensive hardware. HDFS was built according to some particular design goals, namely fault-tolerance, streaming data access, large datasets, a simple read-write model and moving the application to the data [24]. Figure 2.2 exemplifies the architecture of HDFS.

Figure 2.2: HDFS

HDFS is a master/slave architecture where the so-called NameNode acts as the master and the DataNodes act as slaves. The idea of HDFS is to let one node handle client requests and metadata storage while having a massive number of other nodes that offer simple block storage. The NameNode has the most central role in the system. It stores the HDFS metadata such as the directory structure, permissions etc. [24]. The actual data is stored in blocks on the DataNodes and replicated onto other DataNodes at the command of the NameNode.

The Hops filesystem is different from HDFS; it migrates the filesystem metadata from the NameNode heap to a MySQL cluster [15]. See figure 2.3 for an example of the Hops-HDFS architecture. The migration of filesystem metadata to the MySQL cluster makes the Hops filesystem more scalable and also enables a multiple-writer model for mutating HDFS metadata [15].


Figure 2.3: Hops HDFS
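Since GVoD persists verified blocks by writing them into HDFS, the sketch below shows what writing and then appending to an HDFS file could look like through the Hadoop FileSystem API. The NameNode address and file path are assumptions for the example; because HDFS only supports appending at the end of a file, data has to arrive in order before it can be persisted, which is why GVoD downloads blocks in order.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
  public static void main(String[] args) throws Exception {
    // Point the client at a (hypothetical) NameNode endpoint.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/Projects/demo/dataset/part-0");

    // Create the file and write the first (in-order) chunk of data.
    try (FSDataOutputStream out = fs.create(file)) {
      out.write(new byte[]{1, 2, 3});
    }

    // Later chunks can only be appended at the end of the file.
    try (FSDataOutputStream out = fs.append(file)) {
      out.write(new byte[]{4, 5, 6});
    }
  }
}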


2.10 Kafka

Apache Kafka is a distributed publish/subscribe messaging system with high throughput [9]. Kafka acts as the temporary store for public datasets being downloaded into a Hops cluster. Kafka manages a set of storage entities called Topics; these entities are similar to queues and let producers and consumers write and read messages to and from them. The messages must follow a certain structure, which is usually expressed in CSV [25] or Avro [26] format. As with datasets, topics are part of projects in HopsWorks and follow the same type of access control. Messages written to Kafka topics are kept there for a configurable amount of time. When topics become full or the time for keeping messages runs out, the oldest messages are discarded. For each topic, the Kafka cluster maintains a partitioned log, where each partition must reside on a node in the cluster; a topic may have several partitions, making topic storage scalable. Partitions are also replicated throughout the cluster, making Kafka fault-tolerant. Kafka's distributed nature, high throughput and good integration capabilities with technologies such as Spark Streaming [10] make it a very capable temporary store for real-time processing of data and compatible with the Hops ecosystem.
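For illustration, the sketch below shows how a process such as GVoD could publish downloaded rows into a topic using the standard Kafka producer API. The broker address, topic name and record contents are assumptions for the example, not the actual Hops configuration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DatasetTopicProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical broker address inside the Hops cluster.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Each record could correspond to one row of the dataset being downloaded.
      producer.send(new ProducerRecord<>("demo-dataset-topic", "row-1,valueA,valueB"));
      producer.send(new ProducerRecord<>("demo-dataset-topic", "row-2,valueC,valueD"));
    }
  }
}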


Chapter 3

Method

This chapter presents the type of methodology used to produce the thesis, its work and results. It discusses the analysis and tests. The actual results and implementation are presented in the upcoming chapters.

3.1 Methodology

For this thesis, a quantitative research method was chosen [27]. First, a system was created that sought to meet the proposed goals of the project. These were, as mentioned before, to implement search and scalable sharing of public datasets as well as support for real-time processing of downloading datasets. Along with the quantitative research method, a deductive research approach [27] was chosen, and experimental tests and evaluations were made to verify the goals.


3.2 Experiments and evaluation

Due to the nature of the project as well as the limited time frame, only a couple of experiments were conducted. The main experiment for testing the performance of the implementation was a test that transferred datasets between clusters with an increasing number of participating peers. By downloading datasets and increasing the number of participating peers, the scalability and performance of the peer-to-peer sharing could be verified.

No tests were conducted to establish the performance of public search. The main reason for this was that GVoD, at the time of writing, did not have the ability to build its own overlay. Therefore, in order to make the peer-to-peer sharing optimal, public search had to take a hit in performance; more on this later in chapter 4, section 4.3. Also, because no Hadoop implementation had the ability to do public search, there was not really anything to benchmark against.

In order to test the ability of real-time processing, a simple test was conducted. This test evaluated how much time it took before downloading data started to appear in a Kafka topic.

As a final evaluation, the peer-to-peer sharing of datasets was compared and evaluated against existing technologies with similar abilities.


Chapter 4

Implementation

This chapter presents the implementation of the search and peer-to-peer downloading as well as the real-time processing support. First, a couple of rules/assumptions are presented; some of these are temporary limitations and others are the results of logical conclusions. After that, the interactions between Hops-Site and HopsWorks are discussed, as the results from those interactions are essential for public search. Next, both public search and peer-to-peer sharing of datasets are presented in depth. Lastly, the real-time processing support is explained.

4.1 Rules

The first rule says that public datasets are immutable. This means that once a dataset is made public, it cannot be changed, i.e. you cannot add or remove files from a public dataset. It turns out that this rule is a logical choice, as HDFS was originally designed for immutable data [28] and, even though it is now possible to append to files, it is common that large datasets remain static.

The second rule says that public datasets are identified by the cluster id, the project name, the dataset name and a Unix timestamp. The cluster id is obtained through registration with Hops-Site, which is described further in section 4.2. The project name and dataset name are provided by HopsWorks and its structure of users having projects and projects having datasets. The cluster id makes the public dataset unique across different clusters, the project name makes the dataset unique within a cluster, and the dataset name is there for convenience. The Unix timestamp is there because a HopsWorks user might want to remove the public property of the dataset but then make it public again later.

The third rule is really a result of a temporary version of GVoD. When this thesis was written, GVoD did not have the ability to build its own overlay for peer-to-peer sharing of data. Instead it needed to know all the peers that it was going to download data from. This limited the ability to optimize public search, and we will discuss improvements to this later in chapter 6, section 6.2.

The fourth rule is another one of those temporary assumptions that can be improved upon. At this moment, public datasets are considered to be one-level in the sense that they don't have any directories, only files.

The last rule was already mentioned in chapter 2, section 2.3. It says that the REST call for public datasets is available for anyone to call. This is necessary since a HopsWorks instance needs to be able to direct search queries for public datasets to other HopsWorks instances without requiring any session type information. This is also a security issue, which will be further discussed in chapter 6, section 6.2.


4.2 Hops-Site interactions with a HopsWorks instance

Hops-Site is a centralized server that is crucial to the functionality of the global Hops ecosystem. Below are two figures that present the interactions between HopsWorks and Hops-Site. The first, figure 4.1, shows the Register REST call to Hops-Site and the second, figure 4.2, shows the Ping REST call.

Figure 4.1: Register to Hops-Site

HopsWorks needs a cluster id that uniquely identifies it in the Hops ecosystem so that it can later create public datasets that are unique across clusters. This cluster id is obtained when HopsWorks successfully registers with Hops-Site. HopsWorks registers with Hops-Site as it is being deployed to a Glassfish server. At the time of deployment, HopsWorks checks the MySQL cluster to see if it has a cluster id. If it doesn't, it makes a REST call to Hops-Site with information about itself (email, certificate, public-search endpoint, GVoD endpoint). If Hops-Site accepts the REST call, it will generate a unique cluster id and send it back as a response. HopsWorks will then persist this cluster id in the MySQL cluster, which means that no further registrations will be needed and it now has the ability to create public datasets.
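A minimal sketch of such a registration call, using the JAX-RS client API that Jersey provides, is shown below. The Hops-Site URL, resource path and payload fields are assumptions based on the description above, not the actual Hops-Site API.

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

public class HopsSiteRegistration {
  public static void main(String[] args) {
    Client client = ClientBuilder.newClient();

    // Hypothetical registration payload; the thesis mentions email,
    // certificate, public-search endpoint and GVoD endpoint.
    String payload = "{"
        + "\"email\": \"admin@cluster-a.example.com\","
        + "\"cert\": \"...\","
        + "\"searchEndpoint\": \"https://cluster-a.example.com/hopsworks/api/public\","
        + "\"gvodEndpoint\": \"cluster-a.example.com:30000\""
        + "}";

    // Hypothetical Hops-Site URL and resource path.
    Response response = client.target("https://hops.site/hops-site/api/cluster/register")
        .request(MediaType.APPLICATION_JSON)
        .post(Entity.json(payload));

    // On success, Hops-Site replies with the newly generated cluster id,
    // which HopsWorks would then persist in the MySQL cluster.
    String clusterId = response.readEntity(String.class);
    System.out.println("Registered with cluster id: " + clusterId);
  }
}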

Figure 4.2: Ping Hops-Site

In order to perform public search, HopsWorks also needs to know where to direct REST calls; simply sending them to the local ElasticSearch instance will only produce local results. HopsWorks needs to be able to direct queries to non-local ElasticSearch instances. This is done by querying the REST API for public datasets of other HopsWorks instances, which will then forward the queries to their local ElasticSearch instances. HopsWorks obtains these endpoints by continuously pinging Hops-Site. Hops-Site will investigate the Ping call and send back a list where each entry contains information about another registered cluster. This information includes two very important entries: the endpoint for search and a counter which indicates the activity of the cluster. If the counter is high, it means that the cluster is inactive, and sending search queries to it or trying to share datasets with it probably isn't a good idea.


4.3 Public Search

This section describes public search in detail. First, the information needed to perform public search is summarized. After that, the different steps of public search are described, as well as what the results are.

4.3.1 What is needed?

As mentioned in Chapter 2, sections 2.6 and 2.7, search inside a Hops cluster involves ElasticSearch, Epipe and the MySQL cluster. A change in a MySQL-cluster table will generate an event, which Epipe will look at and write the corresponding changes into ElasticSearch, making them searchable. For public search, a little more work must be done. First of all, a dataset that is searchable in more than one cluster needs to have a unique id so that the cluster searching for it can see which cluster has which data (it may be that different clusters have the same dataset). This id can be created with the cluster id, the project name, the dataset name and a Unix timestamp, as we discussed above. The other thing that public search needs is the public-search endpoints of the different clusters that shall receive the search query. This is, as mentioned above, acquired by HopsWorks pinging Hops-Site and obtaining endpoint information about other clusters. That is all the information that HopsWorks needs in order to perform public search. The production of the public search and the handling of the response are described below.


4.3.2 Producing a public search and handling responses

When HopsWorks receives a REST call for a public-search query, it loops through a list of clusters that it obtained from the Ping with Hops-Site. This list contains information about each cluster and, most importantly, a public-search endpoint. At each iteration of the loop, a check is made to see if the particular cluster is considered active. If a cluster hasn't pinged for some time, a counter for that cluster will have a high value and the corresponding cluster will not be considered for search. However, if a cluster is considered active, the public-search endpoint is extracted and a non-blocking REST call to that endpoint is made. When each iteration has concluded, the main thread (the one iterating through the loop) blocks and starts to wait for all responses to come back. As each response comes back, a handler (another assigned thread) checks the hits of each response (each cluster might have several datasets that match the query). If a hit is unique, it is saved together with its GVoD endpoint in an overall result list. If a hit isn't unique, then only the GVoD endpoint is extracted and appended to the corresponding hit in the overall result list. When a thread has handled all hits, it sets a flag indicating that this cluster has responded; it then checks to see if all clusters have responded. If all have responded, the thread also wakes up the main thread that blocked in the beginning. The overall result list produced by all responses will contain unique public-dataset matches, each with a list of GVoD endpoints that can be used to download the dataset. This is important: as we mentioned in section 4.1, GVoD needs to know all of the peers it should do the download with; it cannot (at this moment) build an overlay on its own. Before sending back the result list, the list is sorted according to the score of each entry. This score is a value that ElasticSearch associates with every search hit, which basically reflects its relevance [29]. Figure 4.3 exemplifies the steps of the public-search implementation described above.

Figure 4.3: Public Search

The steps are as follows. First, a user inputs a query into the frontend. The frontend will then forward this query to the public-search REST endpoint of a HopsWorks instance. The HopsWorks instance then loops through all registered clusters (only two are shown in figure 4.3) and sends asynchronous search queries to these clusters' HopsWorks REST endpoints for public datasets. These HopsWorks instances will forward the queries to their local ElasticSearch instances, which will respond with some set of matches. Lastly, as all responses come back, these matches are combined and filtered into the overall result list, which is then sorted and returned to the original frontend.
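The scatter-gather pattern described above can be sketched as follows with the JAX-RS asynchronous client. This is only a conceptual sketch: the cluster representation, inactivity threshold, endpoint format and response handling are assumptions, and the real implementation de-duplicates hits by public-dataset id, collects their GVoD endpoints and sorts by ElasticSearch score.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.MediaType;

public class PublicSearchFanOut {

  // Hypothetical view of a registered cluster, as returned by the Hops-Site ping.
  static class RegisteredCluster {
    String searchEndpoint;   // e.g. "https://cluster-b.example.com/hopsworks/api/public/datasets"
    int inactivityCounter;   // a high value means the cluster has not pinged recently
  }

  static List<String> search(List<RegisteredCluster> clusters, String query) throws Exception {
    Client client = ClientBuilder.newClient();
    List<Future<String>> pending = new ArrayList<>();

    // Fan out: one asynchronous (non-blocking) REST call per active cluster.
    for (RegisteredCluster cluster : clusters) {
      if (cluster.inactivityCounter > 10) {
        continue; // skip clusters that appear inactive
      }
      pending.add(client.target(cluster.searchEndpoint)
          .queryParam("query", query)
          .request(MediaType.APPLICATION_JSON)
          .async()
          .get(String.class));
    }

    // Gather: block until every contacted cluster has responded, then merge.
    List<String> rawResponses = new ArrayList<>();
    for (Future<String> future : pending) {
      rawResponses.add(future.get());
    }
    return rawResponses;
  }
}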

4.4 GVoD peer-to-peer upload and download

This section describes the peer-to-peer sharing in detail. First, the information needed before a download can be made is presented. After that, the actual upload and download are described.

4.4.1 What is needed?

In Chapter 2, section 2.8, we mentioned that GVoD is the application that takes care of the download and upload of datasets. In this chapter, in section 4.1, we also mentioned that GVoD has no ability to build an overlay on demand. This means that in order to produce an optimal download, i.e. a download with the maximum number of participating peers, GVoD needs to get the peers from somewhere. It turns out that public search does just that. Public search returns a list of unique public datasets corresponding to the query it received as input; each of these datasets also comes with a list of GVoD endpoints, which happens to be all of the peers that GVoD can utilize to download the dataset. This means that after a public search is performed, a HopsWorks user has all the information needed to perform an optimal peer-to-peer download of a dataset.

4.4.2 Upload

In order to share a dataset, someone must first make the dataset public so that it can be searched for and after that downloaded. Making a dataset public inside HopsWorks and Hops involves several steps. The first thing that happens is that a user of HopsWorks right-clicks on a dataset icon and selects "make public". The next step involves the creation of the so-called Manifest. A Manifest is a JSON file that contains information about the contents of a public dataset. It describes the files and whether they support writing into Kafka. The Manifest also contains other metadata such as creator, creation date and so on. When the Manifest is created, it is written to the dataset folder in HDFS. After that, HopsWorks makes a REST call to GVoD, informing it about the path to the HDFS folder and other information such as the public-dataset id and HDFS endpoint information. GVoD then looks at the path provided and tries to read the Manifest and parse it as JSON. If successful, GVoD knows the structure of the dataset it should upload and also the torrent id it should use (the public-dataset id). GVoD then replies to HopsWorks with a REST call indicating that everything went fine. HopsWorks then persists the fact that this dataset is now public, together with its public-dataset id, into the MySQL cluster. Epipe will then receive an event and write the changes into ElasticSearch, making it available for public search.
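The thesis does not give the exact Manifest schema, so the sketch below is only a hypothetical illustration of the kind of information described above (file listing, Kafka support, creator and creation date); every field name is an assumption.

import java.util.List;

// Illustrative, hypothetical shape of the Manifest described above.
// The real schema used by HopsWorks and GVoD may differ.
public class DatasetManifest {

  public static class FileEntry {
    public String fileName;        // name of the file inside the dataset
    public long length;            // size in bytes
    public boolean kafkaSupported; // whether this file can be streamed into Kafka
    public String schema;          // e.g. an Avro schema or CSV column description
  }

  public String datasetName;
  public String creator;
  public String creationDate;
  public List<FileEntry> files;
}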

4.4.3 Download

When an upload has been conducted, the public dataset will be publicly searchable and at least one GVoD instance will have it ready for upload. When a HopsWorks user searches for a public dataset, it will receive a list of matches, where each match has a certain public-dataset id and a list of GVoD endpoints that are willing to upload this dataset. In order to download a public dataset, certain steps need to happen. First, a user will want to understand what kind of dataset it is downloading and also whether it can be written into Kafka. This information is present in the Manifest file of each public dataset, and the first step is to ask the local GVoD instance to download the Manifest and present it to the HopsWorks user. In order to do this, there must first be a location where the local GVoD can write the Manifest. The HopsWorks user must therefore create a destination dataset folder so that the local GVoD instance can write the Manifest into it. After a destination dataset is created, HopsWorks sends the path of this dataset to GVoD in a REST call, together with the other important information such as the GVoD endpoints to download from and the torrent id (the public-dataset id). GVoD then downloads the Manifest from the peers it was presented with and writes the Manifest into the path that HopsWorks gave it. Then it sends a REST call back indicating that the Manifest is now present in the path that it was given. HopsWorks can now read the Manifest from the destination dataset and present the information to the user. Depending on what kind of files and schemas are present in the dataset, the user can choose to either write the rest of the dataset into only HDFS or into both HDFS and Kafka. After that choice is made, HopsWorks sends a REST call to GVoD informing it about what kind of download should be made. When GVoD receives this REST call, it proceeds to download the rest of the data into the desired storage components.

4.5 Real-time processing

If a HopsWorks user chooses to download data into both Kafka and HDFS, then it is possible to process the data that is being written to Kafka while the download is progressing. There are many different ways of doing this, as there are plenty of technologies that have the ability to read from Kafka. The simplest and most boring way to go about it is to create a simple Kafka consumer that reads from a certain offset inside the Kafka topic in question. However, there are more interesting things you can do. For example, both Apache Spark [10] and Apache Flink [11] provide APIs that have the ability to read from Kafka topics. With these technologies, advanced processing can be done as the dataset is being downloaded into the cluster.
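A minimal sketch of the "simple Kafka consumer" case is shown below, reading a topic partition from a chosen offset while the download is still in progress. The broker address, topic name and starting offset are assumptions for the example.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DownloadTopicConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
    props.put("group.id", "download-readers");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      // Read partition 0 of the (hypothetical) topic from offset 0,
      // i.e. from data that has already arrived during the download.
      TopicPartition partition = new TopicPartition("demo-dataset-topic", 0);
      consumer.assign(Collections.singletonList(partition));
      consumer.seek(partition, 0);

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
        }
      }
    }
  }
}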


Chapter 5

Analysis

This chapter presents the results and analysis of the implementation and tests. First, the evaluation of the implementation and existing technology is presented. After that, the different test results are shown.

5.1 Evaluation of implementation and current technologies

This thesis and project has introduced a peer-to-peer sharing service that enables scalable and efficient sharing of public datasets. Copying data between filesystems, servers and datacenters is no novel idea. There exist countless solutions for this type of problem, but almost none of them fit particularly well in a Hadoop ecosystem. First of all, datasets in a Hadoop cluster like Hops are often very large, hence simple transfer protocols like FTP won't scale well. Technologies like DistCp do perform well when copying large datasets, but not from one datacenter to another. Also, because neither of these solutions uses peer-to-peer technology, they are unlikely to achieve maximum performance. Another major obstacle for technologies such as DistCp is that they cannot traverse NAT endpoints. This is a major problem, as most of today's internet uses NATs to extend network infrastructure.

The implementation developed throughout this thesis suffers from none of the above-mentioned problems. It is peer-to-peer by nature and has built-in NAT-traversal capabilities.

5.2 Results of Experiments

This section presents the results of the tests that were conducted to validate the implementation. First, the scalability of the peer-to-peer sharing service is presented. Then the performance of the real-time processing is presented and discussed.

5.3 P2P test

To test the peer-to-peer performance and scale-ability a test of sharing a public

dataset with an increasing amount of participating peers was conducted. The

clusters in the test where emulated using Vagrant [30] Virtual Machines. The

machines had 2 CPUs and enough memory to deploy a Hops cluster, see

details http://www.hops.io/?q=content/hopsworks-vagrant. Five clusters where

deployed, the first one being an initial uploader, the others downloaders that

became uploaders after finishing a download. In order to clearly observe the

performance of the setup, the upload speed of each clusters GVoD instance was

limited to 300 000 bytes per second. The size of the dataset to share was set to 40

megabytes. The results of these tests are depicted in the figures below, were the

Page 48: A Global Ecosystem for Datasets on Hadoop1088359/FULLTEXT01.pdf · STOCKHOLM, SVERIGE 2016 A Global Ecosystem for Datasets on Hadoop JOHAN SVEDLUND NORDSTRÖM KTH SKOLAN FÖR INFORMATIONS-

5.3. P2P TEST 35

first figure is a download with one uploader, the next with two uploaders, and so on.

Figure 5.1: Download with one uploader

Figure 5.2: Download with two uploaders

We can observe that each added peer makes for faster download speeds. The last figure (5.4) shows a fairly stable speed of around 1 200 000 bytes per second, which is almost four times the speed of a perfect one-to-one download and matches the expected aggregate of four uploaders each capped at 300 000 bytes per second. These tests confirm the scalability of the implementation.


Figure 5.3: Download with three uploaders

Figure 5.4: Download with four uploaders


5.4 Real-time processing tests

To test the performance of real-time processing, a simple test was made that checked how long it took before data appeared in a Kafka topic after a download was started. This test is highly dependent on the speed of the transfer, which itself depends on the number of participating peers. In order to test a worst-case scenario, only one uploader participated in the transfer. The size of the dataset was the same as in the tests above, and the clusters were emulated using the same type of Vagrant virtual machines. The time for data to appear varied between 3 and 6 seconds, with an average of 4 seconds; only 10 tests were made, which was not enough to produce meaningful graphs. A couple of observations can still be made from this test. First, it shows that even with only one uploader, it can take as little as 3 seconds before a Kafka topic has data from the dataset and real-time processing can begin. With more peers, the speed of the transfer grows as shown in section 5.3, which means that a highly popular public dataset would be very easy to process.
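For reference, a minimal sketch of how such a time-to-first-data measurement could be implemented is shown below. It is not the exact test harness used in this thesis; the topic name, broker address and consumer settings are assumptions.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class FirstRecordLatency {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("group.id", "latency-test");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("public_dataset"));
                // The download is assumed to be started just before this point.
                long start = System.currentTimeMillis();

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    if (!records.isEmpty()) {
                        long elapsed = System.currentTimeMillis() - start;
                        System.out.println("First data appeared after " + elapsed + " ms");
                        break;
                    }
                }
            }
        }
    }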


Chapter 6

Conclusions

This chapter concludes the thesis by presenting the author's reflections on the project. First, an evaluation of the goals is made. Then some reflections about the work and future work are presented. The thesis is summed up with a final conclusion in the last section.

6.1 Goals

The explicit goals of this project can be found in chapter 1, section 1.3. Overall, the goal was to create a scalable and effective solution for sharing datasets between Hadoop clusters, and also to support the ability to do real-time processing on datasets as they are being downloaded. The tests and evaluation in chapter 5 confirm that the peer-to-peer sharing is scalable and that the real-time processing is effective and very useful. The public-search part was not tested, and I can therefore not claim anything about its performance. However, as described in chapter 4, section 4.3, the main thread that performs the search blocks while waiting for all clusters to respond.


This is obviously a performance flaw, and improvements to it are discussed below.

6.2 Future work

Even though the implementation fulfilled the goals of the project, there are plenty of improvements that would make the system more complete and performant.

The most obvious problem with the system is the sub-optimal implementation of search. Right now, when HopsWorks performs a public search it queries a list of clusters and then blocks and waits for all of them to respond. This means that the search will be as slow as the cluster that takes the longest time to respond. This can be quite problematic, as people will not expect a simple search for datasets to take a long time. The reason for this implementation is that GVoD does not build an overlay on demand: the search needs to wait for all queries to come back in order to collect all possible GVoD endpoints for the different matching datasets. In order to improve this, GVoD must first incorporate the ability to build an overlay on demand. Once GVoD acquires this ability, the implementation of search can be changed into something more sophisticated. For example, instead of waiting for every cluster to respond, a pre-determined wait time could be decided upon. After all queries have been sent to the clusters, HopsWorks would not block and wait for all of them to return; it would only wait for the agreed amount of time. When that time has elapsed, all the responses that HopsWorks has received so far could be returned to the frontend, as sketched below. The rest of the responses could either be discarded as too old or handled in such a way that a HopsWorks user could request them later.
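A minimal sketch of this bounded wait is given below. It assumes a hypothetical queryCluster call that performs the REST request to a single cluster and a hypothetical SearchHit result type; neither name exists in the current code base.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class BoundedSearch {

        private final ExecutorService pool = Executors.newCachedThreadPool();

        // Placeholder for one result from a remote cluster's public-search endpoint.
        public static class SearchHit {
            public final String endpoint;
            public final String datasetName;

            public SearchHit(String endpoint, String datasetName) {
                this.endpoint = endpoint;
                this.datasetName = datasetName;
            }
        }

        public List<SearchHit> search(String term, List<String> clusterEndpoints,
                                      long waitMillis) throws InterruptedException {
            // Fan the query out to all known clusters without blocking.
            List<Future<SearchHit>> pending = new ArrayList<>();
            for (String endpoint : clusterEndpoints) {
                pending.add(pool.submit(() -> queryCluster(endpoint, term)));
            }

            // Collect only the responses that arrive before the deadline.
            long deadline = System.currentTimeMillis() + waitMillis;
            List<SearchHit> hits = new ArrayList<>();
            for (Future<SearchHit> response : pending) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    break; // deadline passed, return what we have so far
                }
                try {
                    hits.add(response.get(remaining, TimeUnit.MILLISECONDS));
                } catch (TimeoutException | ExecutionException slowOrFailed) {
                    // Slow or failed clusters are left out of this result set;
                    // their late responses could be cached and offered later.
                }
            }
            return hits;
        }

        // Hypothetical call to one cluster's public-search REST endpoint.
        private SearchHit queryCluster(String endpoint, String term) {
            return new SearchHit(endpoint, term);
        }
    }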

Other obvious things to improve upon are the rules/assumptions introduced in


chapter 4, section 4.1. The fact that public datasets are immutable could be changed by incorporating some kind of versioning system. Instead of forcing public datasets to be static, a public dataset could have different versions, where added data means a new version of the dataset. The peer-to-peer system could then recognize that two datasets share the same base version and perhaps use that to optimize a download or upload. Another obvious limitation in those rules is the fact that public datasets are one-level. This should be changed so that public datasets can have directories to further structure their data.

In both chapter 2 and chapter 4 it was mentioned that the HopsWorks web application has a REST call that is available for anyone to call. This is an obvious weak point for DDoS attackers to exploit, but it is not trivial to fix. The fix needs to allow legitimate HopsWorks web applications to differentiate themselves from random DDoS attackers. Another solution could be to incorporate some kind of DDoS detection, where spam behaviour is detected and dealt with appropriately.

Another problem that was not really visible in the tests is the way that GVoD writes data to Kafka. At the time of writing, this is done with synchronous producers, which basically means that GVoD writes data to a topic, awaits a confirmation that it was written and then writes again. This is of course not optimal; it would be better if data could be written in an asynchronous way, similar to how search queries to other clusters are handled. The difference between the two styles is sketched below.
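In the sketch, the topic name and broker address are assumptions and error handling is reduced to a log statement; it illustrates the general producer styles, not GVoD's actual writer.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AsyncChunkWriter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                byte[] chunk = new byte[64 * 1024]; // a downloaded block of the dataset

                // Synchronous style (current behaviour): block until the broker acks.
                // producer.send(new ProducerRecord<>("public_dataset", chunk)).get();

                // Asynchronous style (suggested): hand the record off and keep
                // downloading; the callback fires once the broker has acknowledged it.
                producer.send(new ProducerRecord<>("public_dataset", chunk),
                        (metadata, exception) -> {
                            if (exception != null) {
                                System.err.println("Write failed: " + exception.getMessage());
                            }
                        });
            }
        }
    }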

A final issue with the implementation is the lack of functionality that would increase the incentive for someone to download or upload their data. On torrent sites, there is usually some kind of ranking of different torrents, where the most popular torrents have many uploaders and downloaders. There is also often some kind of indication of where the torrent is from, who its creator was, and so on. Right now, there is basically nothing of this in HopsWorks, which is problematic


as presenting data in this way motivates people to share their data. An example of a solution would be for Hops-Site to store information about popular datasets. When a user uploads or downloads a dataset, a REST call to Hops-Site could be made that informs it about the action and the dataset. The ping REST call that HopsWorks continuously makes to get endpoint information could be extended to also return information about popular datasets, which could then be displayed somewhere in the HopsWorks frontend. A hypothetical sketch of such a Hops-Site resource is given below.
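The sketch uses JAX-RS, as used elsewhere in the project. The path, class name and in-memory counter are illustrative assumptions and do not exist in the current implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("popular")
    public class PopularDatasetsResource {

        // Dataset identifier -> number of reported uploads and downloads.
        private static final Map<String, Long> COUNTS = new ConcurrentHashMap<>();

        // Called by a HopsWorks instance when one of its users uploads or
        // downloads a public dataset.
        @POST
        @Path("{datasetId}")
        public void reportAction(@PathParam("datasetId") String datasetId) {
            COUNTS.merge(datasetId, 1L, Long::sum);
        }

        // Could be piggybacked on the existing ping call so that HopsWorks
        // receives the popularity information together with endpoint information.
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Map<String, Long> mostPopular() {
            return COUNTS;
        }
    }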


6.3 Conclusion

This report and project have presented a solution for sharing datasets between Hadoop clusters in a scalable and efficient manner. The implementation also introduces a solution to the problem of downloading very large amounts of data. The major limitations of the solution have also been presented, and suggested work for removing those limitations is explained above.


Bibliography

[1] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, "Big data: Issues and challenges moving forward," in System Sciences (HICSS), 2013 46th Hawaii International Conference on, Jan 2013, pp. 995–1004.

[2] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Contemporary Computing (IC3), 2013 Sixth International Conference on, Aug 2013, pp. 404–409.

[3] "Hadoop homepage," http://hadoop.apache.org/, accessed: 2016-09-09.

[4] "Apache Lucene," https://lucene.apache.org/core/, accessed: 2016-09-09.

[5] "Apache DistCp homepage," https://hadoop.apache.org/docs/r1.2.1/distcp2.html, accessed: 2016-09-09.

[6] "FTP RFC," https://www.ietf.org/rfc/rfc959.txt, accessed: 2016-09-09.

[7] "GVoD homepage," http://www.decentrify.io/?q=content/video, accessed: 2016-09-09.

[8] "Elasticsearch guide," https://www.elastic.co/guide/en/elasticsearch/guide/current/getting-started.html, accessed: 2016-09-09.

[9] "Kafka homepage," http://kafka.apache.org/.


[10] "Spark Streaming and Kafka," http://spark.apache.org/docs/latest/streaming-kafka-integration.html.

[11] "Flink and Kafka," https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/connectors/kafka.html, accessed: 2016-09-09.

[12] D. Rossi, C. Testa, S. Valenti, and L. Muscariello, "LEDBAT: The new BitTorrent congestion control protocol," in Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on, Aug 2010, pp. 1–6.

[13] "Hadoop usages," http://wiki.apache.org/hadoop/PoweredBy, accessed: 2016-09-09.

[14] "Hadoop Open Platform-as-a-Service," http://www.hops.io/?q=content/docs.

[15] K. Hakimzadeh, H. Peiro Sajjad, and J. Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 38–51. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-43352-2_4

[16] "AngularJS documentation," https://angularjs.org/, accessed: 2016-09-09.

[17] "Jersey web services," https://jersey.java.net/.

[18] L. Richardson and S. Ruby, RESTful Web Services. O'Reilly Media, Inc., 2008.

[19] "JSON RFC," https://tools.ietf.org/html/rfc7159, accessed: 2016-09-09.

[20] "GlassFish server," https://glassfish.java.net/.

[21] "MySQL Cluster," http://dev.mysql.com/doc/refman/5.7/en/ha-overview.html.


[22] "Lucene inverted index," https://lucene.apache.org/core/3_0_3/fileformats.html, accessed: 2016-09-09.

[23] "NDB Cluster API," https://dev.mysql.com/doc/ndbapi/en/mysql-cluster-api-overview.html.

[24] "HDFS architecture," http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.

[25] "CSV RFC," https://www.ietf.org/rfc/rfc4180.txt, accessed: 2016-09-09.

[26] "Avro documentation," http://avro.apache.org/docs/1.7.5/spec.html, accessed: 2016-09-09.

[27] A. Hakansson, "Portal of research methods and methodologies for research projects and degree projects," in Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering FECS'13. CSREA Press U.S.A, 2013, pp. 67–73, QC 20131210.

[28] "Older HDFS version," https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, accessed: 2016-09-09.

[29] "Elasticsearch score," , accessed: 2016-09-09.

[30] "Vagrant documentation," https://www.vagrantup.com/docs/, accessed: 2016-09-09.

