abdullah-cetin-cavdar
Me :)
Graduated from @HU
PhD Student @METU
Ex-Entrepreneur: I had 3 start-ups
Senior Software Engineer @Udemy
Founder and Organizer of
meetup.com/ankara-big-data-meetup
What's Big Data
Big data is data that exceeds the processing capacity of conventional database systems.
What's Big Data
Big data is when the data itself becomes part of the problem.
4V's of Big Data
Multitude of Data Types
Structured Data, Semi-structured Data, Unstructured Data
What Do We Need?
Store, Join, Index, Analytics, Aggregate, Visualize
Challenge
The challenge in big data analytics is to dig deeply, quickly (real time?) and widely.
"ilities" or NFRs?
Availability, Scalability, Security, Performance, ...
Solution?
Big Data Tech Stack
What are the essential components?
Data Sources
Multiple internal & external data sources
Creates a data lake
Different Volume, Variety, Velocity
Aim is to create a funnel after proper validation and cleaning
Ingestion Layer
Signal-to-noise ratio: 10:90
Separate the noise from relevant info
It has the capability to: Validate, Cleanse, Transform, Reduce, Integrate
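The five ingestion capabilities can be chained into one small pipeline. The sketch below is illustrative only: the event fields (`user`, `amount`, `source`), the sample records and every function name are assumptions made for this example, not part of the original slides or of any specific ingestion tool.

```python
from collections import defaultdict

# Hypothetical raw events from two sources; field names are illustrative only.
RAW_EVENTS = [
    {"user": " Alice ", "amount": "10.5", "source": "web"},
    {"user": "bob", "amount": "oops", "source": "web"},     # noise: bad amount
    {"user": "", "amount": "3", "source": "mobile"},        # noise: no user
    {"user": "ALICE", "amount": "4.5", "source": "mobile"},
    {"user": "Bob", "amount": "2", "source": "mobile"},
]


def validate(event):
    """Validate: keep only events with a non-empty user and a numeric amount."""
    if not event.get("user", "").strip():
        return False
    try:
        float(event["amount"])
    except (KeyError, ValueError):
        return False
    return True


def cleanse(event):
    """Cleanse: normalize whitespace and letter case of the user name."""
    return {**event, "user": event["user"].strip().lower()}


def transform(event):
    """Transform: convert the amount from text to a number."""
    return {**event, "amount": float(event["amount"])}


def reduce_fields(event):
    """Reduce: drop fields that downstream layers do not need."""
    return {"user": event["user"], "amount": event["amount"]}


def integrate(events):
    """Integrate: merge records from all sources into one total per user."""
    totals = defaultdict(float)
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


def ingest(raw_events):
    """Run the full funnel: validate, cleanse, transform, reduce, integrate."""
    valid = (e for e in raw_events if validate(e))
    prepared = (reduce_fields(transform(cleanse(e))) for e in valid)
    return integrate(prepared)


print(ingest(RAW_EVENTS))
```

Two of the five sample events are dropped as noise, and the remaining three collapse into one record per user.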
Distributed Storage Layer
Fault tolerance, Parallelization
HDFS: massively scalable distributed file system
HDFS
HDFS Architecture
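A rough way to picture how HDFS provides both fault tolerance and parallelization: a file is split into fixed-size blocks, and each block is copied to several data nodes. The sketch below uses the Hadoop 2 defaults of 128 MB blocks and a replication factor of 3. The round-robin placement and the names `plan_blocks` and `readable_after_failure` are simplifying assumptions for illustration; real HDFS uses a rack-aware placement policy decided by the NameNode.

```python
import math

BLOCK_SIZE_MB = 128   # Hadoop 2 default block size
REPLICATION = 3       # Hadoop default replication factor


def plan_blocks(file_size_mb, data_nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_id: [nodes holding a replica]} for one file.

    Placement is plain round-robin, which is a simplification of the
    rack-aware policy used by real HDFS.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)]
                   for r in range(replication)]
        for block_id in range(n_blocks)
    }


def readable_after_failure(plan, failed_node):
    """A file stays readable if every block has a replica on a live node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in plan.values())


nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(300, nodes)   # a 300 MB file needs 3 blocks
for block_id, replicas in plan.items():
    print(block_id, replicas)
print(readable_after_failure(plan, "dn3"))
```

Because each block lives on three nodes, losing any single node leaves the file readable, and different blocks can be processed on different nodes at the same time.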
Non-relational, distributed data?
NoSQL
CAP theorem: Consistency, Availability, Partition Tolerance
Ingestion to DFS
Sqoop, Flume, MapReduce, ETL
Infrastructure & Platform Layer
Computing & Scalability
Hadoop?
Vertical Scaling
Horizontal Scaling
MapReduce is the main computation paradigm
MapReduce
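The MapReduce paradigm can be illustrated with the classic word count, run here in a single process. This is a minimal sketch of the map, shuffle and reduce phases, not the Hadoop API: the function names are chosen for this example, and a real job would spread the map and reduce tasks across cluster nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

Each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, which is what lets a framework run them in parallel on many machines.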
Hadoop 2
What's new?
H1 vs. H2
One cluster, distributed storage, distributed scheduler, many types of applications.
Blueprints
NoSQL with HBase
Stream Processing with Storm/Spark
Graph Processing with Giraph
SQL on Hadoop with Impala
Columnar Data Formats
Security Layer
Data needs to be protected
Meet compliance requirements
Individual's privacy
Proper authorization and authentication needed
What can we do?
Use an authentication protocol like Kerberos
Enable file-layer encryption
Use SSL, certificates and trusted keys
Provision with tools like Chef, Puppet or Ansible
Log all communication to detect anomalies
Monitor the whole system
Monitoring Layer
Get a complete picture of our Big Data tech stack
Satisfy SLAs with min downtime
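To make "min downtime" concrete, an availability target can be converted into a downtime budget. The helper below is a small sketch; the function name and the 30-day period are assumptions for illustration, and the targets shown are common examples rather than figures from the slides.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Downtime budget, in minutes, for an availability target over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} min per 30 days")
```

A 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is the kind of threshold a monitoring layer can alert on.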
DataDog
New Relic (Overview)
New Relic (Databases)
Analytics Engine
Co-Existence with Traditional BI
Data warehouse in the traditional way
Distributed MR processing on big data stores
Mediate data in either direction, e.g. use Hive/HBase with Sqoop
Real-time analysis can leverage low-latency NoSQL stores, e.g. Cassandra, Vertica, ...
R may be used for complex statistical algorithms
Search Engines
Huge volume and variety of data
“needle in a haystack”
Need a blazing fast search mechanism to index and search for big data analytics
Elasticsearch, Solr, ...
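Search engines of this kind are built around an inverted index, which maps each term to the documents that contain it, so a query never has to scan the whole haystack. The sketch below shows the idea in plain Python; the sample documents and function names are assumptions for illustration, and it leaves out everything a real engine adds (text analysis, relevance ranking, sharding).

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, keyed by id.
docs = {
    1: "big data analytics",
    2: "data lake storage",
    3: "big data storage",
}
index = build_index(docs)
print(search(index, "big data"))
print(search(index, "storage data"))
print(search(index, "needle"))
```

A lookup touches only the posting sets of the query terms, so its cost depends on how many documents match those terms rather than on the total number of documents.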
Real-time Processing
In memory?
Apache Spark
Storm, Kinesis, Flink, ...
Visualization Layer
Gain insight faster
Look at different aspects of data visually
Tableau
ChartIO
Lambda Architecture
Lambda Architecture / MapR
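The core idea of the Lambda Architecture is that a query merges a precomputed batch view with a small real-time view covering data that arrived since the last batch run. The toy class below sketches that for event counts. The class and method names are assumptions for this example, and it simplifies heavily: the batch job is treated as instantaneous, whereas a real system only discards the real-time data that a finished batch run has covered.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter with a batch layer, a speed layer and a serving query."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from the master dataset
        self.speed_view = Counter()  # incremental counts since the last batch

    def ingest(self, event):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all data, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for event in ["click", "click", "view"]:
    counter.ingest(event)
print(counter.query("click"))   # answered from the speed view only

counter.run_batch()
counter.ingest("click")
print(counter.query("click"))   # batch view plus one recent event
```

Recomputing the batch view from the full master dataset keeps results correctable after bugs, while the speed view keeps queries fresh between batch runs.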
Don't forget
There is no "One Size Fits All" solution
We need
Continuous Development
Thank You :)