Big Data Tech Stack

Big Data 2015, by Abdullah Cetin CAVDAR

Me :)

Graduated from @HU

PhD Student @METU

Ex-Entrepreneur: I had 3 start-ups

Senior Software Engineer @Udemy

Founder and Organizer of

meetup.com/ankara-big-data-meetup

What's Big Data?

Big data is data that exceeds the processing capacity of conventional database systems.

What's Big Data?

Big data is when the data itself becomes part of the problem.

4V's of Big Data

Multitude of Data Types

Structured, Semi-structured, Unstructured

Data Data Data

What Do We Need?

Store, Join, Index, Analytics, Aggregate, Visualize

Challenge

The challenge in big data analytics is to dig deeply, quickly (real time?), and widely.

"ilities" or NFR?

Availability, Scalability, Security, Performance, ...

Solution?

Big Data Tech Stack

What are the essential components?

Data Sources

Multiple internal & external data sources

Creates a data lake

Different Volume, Variety, Velocity

The aim is to create a funnel after proper validation and cleaning

Ingestion Layer

Signal-to-noise ratio: 10:90

Separate the noise from relevant info

It has the capability to: Validate, Cleanse, Transform, Reduce, Integrate
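The five capabilities above can be pictured as a chain of small steps. A minimal Python sketch, assuming hypothetical records with `user` and `amount` fields; the field names, the rules, and the sample data are illustrative assumptions, not taken from the talk:

```python
from collections import defaultdict

def validate(records):
    """Validate: keep only records that carry the fields we need (drop the noise)."""
    return [r for r in records if isinstance(r, dict) and "user" in r and "amount" in r]

def cleanse(records):
    """Cleanse: normalise user names (trim whitespace, lowercase)."""
    return [{**r, "user": str(r["user"]).strip().lower()} for r in records]

def transform(records):
    """Transform: convert every amount to a float."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def reduce_records(records):
    """Reduce: shrink the data to one total per user."""
    totals = defaultdict(float)
    for r in records:
        totals[r["user"]] += r["amount"]
    return dict(totals)

def integrate(totals, existing):
    """Integrate: merge the new totals into an already stored view."""
    merged = dict(existing)
    for user, amount in totals.items():
        merged[user] = merged.get(user, 0.0) + amount
    return merged

def ingest(raw, existing):
    """Run the whole ingestion chain on a batch of raw input."""
    return integrate(reduce_records(transform(cleanse(validate(raw)))), existing)

# Hypothetical raw input: two of the five items are noise.
raw = [
    {"user": " Alice ", "amount": "10"},
    {"user": "bob", "amount": 5},
    {"noise": True},
    "garbage",
    {"user": "ALICE", "amount": "2.5"},
]
result = ingest(raw, {"bob": 1.0})
```

In a real stack each step would typically be a distributed job rather than a list comprehension; the point here is only the order of the stages.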

Distributed Storage Layer

Fault tolerance, Parallelization

HDFS: massively scalable distributed file system

HDFS

HDFS Architecture

Non-relational, distributed data?

NoSQL

CAP theorem: Consistency, Availability, Partition Tolerance

Ingestion to DFS: Sqoop, Flume, MapReduce, ETL

Infrastructure & Platform Layer

Computing & Scalability

Hadoop?

Vertical Scaling

Horizontal Scaling

MapReduce is the main computation paradigm

MapReduce
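To make the paradigm concrete, here is a single-process Python sketch of the classic word-count job. It only imitates the map, shuffle, and reduce phases in memory; it is not Hadoop's API, and the function names are illustrative assumptions:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, shuffle the pairs, then reduce each group."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data big stack", "data lake"])
```

On a cluster, each call to `map_phase` and `reduce_phase` would run on a different node, which is what makes the model scale horizontally.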

Hadoop 2

What's new?

Hadoop 1 vs. Hadoop 2

One cluster, distributed storage, distributed scheduler, many types of applications.

Blueprints

NoSQL with HBase
Stream Processing with Storm/Spark
Graph Processing with Giraph
SQL on Hadoop with Impala
Columnar Data Formats

Security Layer

Data needs to be protected
Meet compliance requirements
Individual's privacy

Proper authorization and authentication needed

What can we do?

Authentication protocol like Kerberos
Enable file layer encryption
Use SSL, certificates and trusted keys
Provision with tools like Chef, Puppet or Ansible
Log all the communication for detecting anomalies
Monitor the whole system
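As one illustration of the "log and detect anomalies" item, the sketch below flags minutes whose request count deviates strongly from the average. The z-score rule, the threshold of 2.0, and the sample counts are assumptions chosen for the example, not something the talk prescribes:

```python
from statistics import mean, pstdev

def find_anomalies(counts, threshold=2.0):
    """Return the indexes whose value lies more than `threshold`
    population standard deviations away from the mean."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# Hypothetical requests-per-minute figures extracted from access logs.
requests_per_minute = [100, 102, 98, 101, 99, 100, 500, 97]
suspicious = find_anomalies(requests_per_minute)
```

A production setup would more likely rely on the monitoring tool's own alerting; this only shows the idea behind it.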

Monitoring Layer

Get a complete picture of our Big Data tech stack

Satisfy SLAs with minimum downtime
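An SLA translates directly into a downtime budget. A small sketch of the arithmetic, assuming a hypothetical 99.9% availability target measured over a 30-day period (both figures are examples, not from the talk):

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime that still satisfy an availability SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# 30 days = 43,200 minutes; 0.1% of that is roughly 43.2 minutes.
budget = allowed_downtime_minutes(99.9)
```

Monitoring is what tells us how much of that budget has already been spent.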

DataDog

New Relic (Overview)

New Relic (Databases)

Analytics Engine

Co-Existence with Traditional BI

Data warehouse in the traditional way
Distributed MR processing on big data stores

Mediate data in either direction, e.g. use Hive/HBase with Sqoop

Real-time analysis can leverage low-latency NoSQL and columnar stores, e.g. Cassandra, Vertica, ...

R may be used for complex statistical algorithms

Search Engines

Huge volume and variety of data

“needle in a haystack”

Need a blazing-fast search mechanism to index and search for big data analytics

Elasticsearch, Solr, ...
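Search engines of this kind are built around an inverted index, which maps each term to the documents containing it. A toy Python sketch of that idea, assuming whitespace tokenisation and AND semantics; it does not use or imitate the Elasticsearch or Solr APIs, and the sample documents are made up:

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents keyed by id.
docs = {
    1: "Hadoop stores data in HDFS",
    2: "Spark processes data in memory",
    3: "Solr indexes documents",
}
index = build_index(docs)
hits = search(index, "data in")
```

A lookup touches only the postings of the query terms instead of scanning every document, which is why the needle can be found quickly.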

Real-time Processing

In memory?

Apache Spark

Storm, Kinesis, Flink, ...

Visualization Layer

Gain insight faster
Look at different aspects of data visually

Tableau

ChartIO

Lambda Architecture
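The Lambda Architecture is commonly described as three cooperating layers: a batch layer that recomputes views from the full dataset, a speed layer that covers recent events, and a serving layer that merges both at query time. A minimal in-memory sketch of that pattern, with page-view counting as an assumed use case and all names hypothetical:

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute the full view from the immutable master dataset."""
    return Counter(event["page"] for event in master_dataset)

class SpeedLayer:
    """Speed layer: incrementally count events the last batch run has not seen yet."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1

def query(page, batch, realtime):
    """Serving layer: answer by merging the batch view with the real-time view."""
    return batch[page] + realtime[page]

# Events already processed by the batch layer.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch = batch_view(master)

# One event arrives after the batch run and is handled by the speed layer only.
speed = SpeedLayer()
speed.on_event({"page": "/home"})

total_home = query("/home", batch, speed.realtime_view)
```

In the stack described in this talk, the batch layer would map to Hadoop/MapReduce and the speed layer to a stream processor such as Storm or Spark.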

Don't forget

There is no "One Size Fits All" solution

We need

Continuous Development

Thank You :)
