AWS Big Data Solution Overview
Ivan Cheng (鄭志帆)
AWS Solutions Architect
What is Big Data?
When your data sets become so large and complex
you have to start innovating around how to
collect, store, process, analyze, and share them.
GB, TB, PB, EB, ZB
Big Data: Unconstrained Growth
Unstructured data growth is explosive
95% of the 1.2 zettabytes of data in the digital universe is unstructured
Machine data and IoT will only steepen the curve
70% of this data is user-generated content
Source: IDC, The Internet of Things: Getting Ready to Embrace Its Impact on the Digital Economy, March 2016.
The Cloud Was Built for Big Data
Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= the Cloud removes constraints
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights
Start here, with a business case
Time to answer (latency)
Cost
Evolution of Analytics
• Retrospective analysis and reporting
• Here-and-now real-time processing and dashboards
• Predictions to enable smart applications
AWS Big Data Benefits
Immediate Availability. Deploy instantly. No hardware to procure,
no infrastructure to maintain & scale.
Broad & Deep Capabilities. Over 50 services and 100s of features
to support virtually any big data application & workload.
Trusted & Secure. Designed to meet the strictest requirements.
Continuously audited, including certifications such as ISO 27001,
FedRAMP, DoD CSM, and PCI DSS.
Hundreds of Partners & Solutions. Get help from a consulting partner
or choose from hundreds of tools and applications across the entire data
management stack.
Collect | Store | Analyze
Services shown: AWS Data Pipeline, AWS Database Migration Service, Amazon EMR, Amazon Glacier, Amazon S3, Amazon Kinesis, AWS Direct Connect, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, AWS IoT, AWS Snowball, Amazon QuickSight, Amazon Athena, Amazon EC2, Amazon Elasticsearch Service, AWS Lambda, AWS Glue
Key AWS Certifications and Assurance Programs
AWS Big Data Customer Success
AWS Big Data Partners
AWS Big Data Service Overview
Data Movement
• AWS Database Migration Service
• AWS Direct Connect
• AWS Import/Export & Snowball
• AWS Storage Gateway
Storage and Databases
• Store an unlimited number of objects
• Designed for 99.999999999% durability
• Serves as a data lake, integrating with other AWS services
(Amazon Kinesis, Amazon Redshift, Amazon EMR, etc.)
• Low cost with tiered storage (Standard, IA, Amazon Glacier)
via lifecycle policy
• Secure: SSL in transit, client-side or server-side encryption at rest
Amazon S3
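The tiered-storage bullet above can be made concrete with a lifecycle rule. This is a minimal sketch, assuming a hypothetical `logs/` prefix and illustrative 30-day and 90-day thresholds (neither comes from the slides). The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the call itself is left commented out so the snippet runs without AWS credentials.

```python
def build_lifecycle_rules(prefix, ia_after_days, glacier_after_days):
    """Return a lifecycle configuration that tiers objects under `prefix`."""
    if ia_after_days >= glacier_after_days:
        raise ValueError("Standard-IA transition must precede Glacier transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Move to Infrequent Access first, then to Glacier.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }

# Prefix and day counts are assumptions chosen for illustration.
config = build_lifecycle_rules("logs/", 30, 90)

# To apply it (requires boto3 and credentials; the bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=config)
```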
• Fully Managed NoSQL Database
• Fast consistent performance (single-digit millisecond latency
at any scale)
• Highly scalable - automatic scaling of throughput capacity
• Highly available and durable
• Store an unlimited amount of data
Amazon DynamoDB
• Fully Managed Relational Database Service
• MySQL and PostgreSQL compatible relational database with up to
5x better performance running on the same hardware
• Security, availability, and reliability of commercial databases at
1/10th the cost
• Designed to offer greater than 99.99% availability.
• Automatically grows storage as needed, from 10 GB up to 64 TB
• Achieve up to 500,000 reads and 100,000 writes per second
Amazon Aurora
• Fully managed petabyte-scale relational, MPP, data warehousing
• Built-in end-to-end security, including SSL connections and cluster
encryption
• Fault-tolerant - automatically recovers from disk and node failures
• Data automatically backed up to Amazon S3
• $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale
from 160 GB to 2 PB of compressed data with just a few clicks
Amazon Redshift
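The pricing bullet mixes an hourly rate with a per-terabyte annual figure. As a worked example using only the two numbers quoted above, and assuming a cluster that runs continuously for a 365-day year (an assumption, not a statement from the slides), the arithmetic looks like this. The second figure simply applies the quoted per-TB rate to 2 PB expressed as 2,000 TB; actual pricing depends on node type and purchase option.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the cluster is never paused

def annual_cost_from_hourly(rate_per_hour, hours=HOURS_PER_YEAR):
    """Yearly cost of a resource billed per hour."""
    return rate_per_hour * hours

def annual_cost_from_capacity(terabytes, rate_per_tb_year):
    """Yearly cost of a resource billed per terabyte per year."""
    return terabytes * rate_per_tb_year

entry_cluster = annual_cost_from_hourly(0.25)          # the $0.25/hour starting rate
two_petabytes = annual_cost_from_capacity(2000, 1000)  # 2 PB at $1,000/TB/year
```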
Analytic Frameworks
• Managed Hadoop framework
• Apache Hadoop, Hive, Spark, Zeppelin, Presto, HBase, Phoenix,
Tez, Flink, etc.
• Auto Scaling clusters with support for on-demand and spot pricing
• Support for end-to-end encryption, IAM/VPC, S3 client-side
encryption with customer managed keys and AWS KMS
• Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis
and Amazon Redshift
Amazon EMR
[Diagram: Amazon EMR (including Pig) accessing Amazon S3 through EMRFS]
• Fully managed, reliable, and scalable Elasticsearch service
• Support for the ELK stack (Elasticsearch, Logstash, Kibana)
• Integration options with other AWS services (CloudWatch
Logs, Amazon DynamoDB, Amazon S3, Amazon Kinesis)
• Use Case: log analytics, full text search, application
monitoring, and more.
Amazon Elasticsearch Service
• Serverless query service for querying data in S3 using
standard SQL with no infrastructure to manage
• Support for multiple data formats, including text, CSV, TSV,
JSON, Avro, ORC, Parquet
• Pay per query, based on the amount of data scanned; there is no
charge when no queries are running. If you compress your data, you
pay less and your queries run faster
Amazon Athena
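Because billing follows the amount of data scanned, compression translates directly into lower query cost. The sketch below illustrates the relationship; the per-terabyte price, the 200 GB dataset, and the 4:1 compression ratio are all hypothetical values chosen for illustration, so check current pricing before relying on any figure.

```python
def query_cost(bytes_scanned, price_per_tb):
    """Cost of one query when billing is proportional to data scanned."""
    terabytes = bytes_scanned / 1e12  # decimal terabytes, an assumption
    return terabytes * price_per_tb

PRICE_PER_TB = 5.0  # hypothetical rate in USD, not taken from the slides

raw_bytes = 200e9                 # 200 GB of uncompressed text (assumed)
compressed_bytes = raw_bytes / 4  # assumed 4:1 compression ratio

cost_raw = query_cost(raw_bytes, PRICE_PER_TB)
cost_compressed = query_cost(compressed_bytes, PRICE_PER_TB)
```

Under these assumptions the compressed dataset costs a quarter as much to scan, which is the effect the last bullet describes.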
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
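Data partitioning lets a query skip whole directories of files. The following is a minimal sketch of Hive-style `key=value` partition paths and of partition pruning; the bucket name and the `year`/`month` layout are assumptions made for illustration.

```python
def partition_path(base, **partitions):
    """Build a Hive-style path such as base/year=2017/month=06/."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{base.rstrip('/')}/{parts}/"

def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    def matches(path):
        segments = dict(s.split("=", 1) for s in path.split("/") if "=" in s)
        return all(segments.get(k) == v for k, v in wanted.items())
    return [p for p in paths if matches(p)]

BASE = "s3://example-bucket/logs"  # hypothetical location
all_paths = [
    partition_path(BASE, year="2017", month="05"),
    partition_path(BASE, year="2017", month="06"),
    partition_path(BASE, year="2016", month="06"),
]

# Only one of the three partitions needs to be read for this filter.
june_2017 = prune(all_paths, year="2017", month="06")
```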
• Fast and cloud-powered Business Analytics
• Easy to use, no infrastructure to manage
• Quick calculations with SPICE
• 1/10th the cost of legacy BI software
• Accessed from any browser or mobile device
Amazon QuickSight
• Fully managed ETL (extract, transform, load) service
• Integrated data catalog, automatic schema discovery, ETL
code generation, flexible job scheduler
• Integrated across a wide range of AWS services (Amazon
RDS, Database running on Amazon EC2, Amazon Athena,
etc.)
AWS Glue
How AWS Glue Works
1. Build your data catalog
2. Generate and edit transformations
3. Schedule and run your jobs
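Step 1 relies on automatic schema discovery. The sketch below is not AWS Glue's crawler; it is a simplified stand-in, using hypothetical sample records and type names, that shows the idea of inferring column types from data and widening a type when records disagree.

```python
def infer_type(value):
    """Map a Python value to a coarse column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

# An integer column that also contains floats is widened to double.
WIDENING = {("bigint", "double"): "double", ("double", "bigint"): "double"}

def infer_schema(records):
    """Infer one type per column, falling back to string on conflicts."""
    schema = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            seen, new = schema.get(column), infer_type(value)
            if seen is None or seen == new:
                schema[column] = new
            else:
                schema[column] = WIDENING.get((seen, new), "string")
    return schema

sample = [
    {"order_id": 1, "amount": 10, "country": "TW"},
    {"order_id": 2, "amount": 12.5, "country": None},
]
schema = infer_schema(sample)
```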
Real-time Analytics
• Fully managed streaming application
• Scalable – handle any amount of streaming data
• Ingest, buffer and process data in real-time
• React quickly – derive insight in seconds
Amazon Kinesis
• Amazon Kinesis Streams: build your own custom applications that
process or analyze streaming data
• Amazon Kinesis Firehose: easily load massive volumes of streaming
data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
• Amazon Kinesis Analytics: easily analyze data streams using
standard SQL queries
Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low
cost
• Build custom real-time applications to process
streaming data
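A custom application typically writes each record with a partition key, and the key determines which shard stores the record. Kinesis maps keys to shards by taking an MD5 hash of the key as a 128-bit integer and matching it against each shard's hash key range. The sketch below reproduces that idea locally; it assumes the shards split the hash space evenly (real streams can have uneven ranges after resharding) and uses hypothetical device IDs.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide

def shard_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash range contains the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the remainder of an uneven split lands in the last shard.
    return min(hash_key // range_size, shard_count - 1)

keys = ["device-1", "device-2", "device-3", "device-1"]
assignments = [shard_for_key(k, shard_count=4) for k in keys]
```

Records that share a partition key always land on the same shard, which is what preserves per-key ordering for consumers.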
Amazon Kinesis Firehose
Reliably ingest and deliver batched, compressed, and encrypted
data to S3, Amazon Redshift, and Amazon Elasticsearch Service
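The batching and compression described above can be sketched in a few lines. This is an illustration of the idea only, not the service's implementation: it buffers by size alone (the service also flushes on a time interval), omits encryption, and uses hypothetical click events and a 1 KB buffer.

```python
import gzip

def batch_and_compress(records, buffer_bytes):
    """Group newline-delimited records into gzip blobs of about buffer_bytes."""
    batches, current, size = [], [], 0
    for record in records:
        line = record.encode("utf-8") + b"\n"
        # Flush the buffer before it would exceed the size threshold.
        if current and size + len(line) > buffer_bytes:
            batches.append(gzip.compress(b"".join(current)))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        batches.append(gzip.compress(b"".join(current)))
    return batches

events = [f'{{"event_id": {i}, "type": "click"}}' for i in range(100)]
blobs = batch_and_compress(events, buffer_bytes=1024)

# Decompressing the blobs in order recovers the original records.
restored = b"".join(gzip.decompress(b) for b in blobs).decode("utf-8").splitlines()
```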
Amazon Kinesis Analytics
Interact with streaming data in real time using SQL
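Streaming queries typically aggregate over time windows. The sketch below is written in Python rather than the service's SQL, and the event tuples and 60-second window are hypothetical; it shows what a tumbling-window count per key computes.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key); windows do not overlap."""
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# (timestamp in seconds, event type)
events = [(0, "click"), (15, "click"), (59, "view"), (60, "click"), (125, "view")]
counts = tumbling_window_counts(events, window_seconds=60)
```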
AWS Marketplace for Big Data Solutions
Hundreds of big data products are immediately available through AWS Marketplace
Advanced Analytics | Database and Data Enablement | Business Intelligence
Fully Integrated | 1-click deployment | Pay-as-you-go pricing
Modern Data Analytics Architecture on AWS
Modern data architecture: insights to enhance business applications and new digital services
Data sources
• Transactions
• Web logs / cookies
• ERP
• Connected devices
• Social media
Ingest
• AWS Database Migration Service
• AWS Direct Connect
• Internet interfaces
• Amazon Kinesis
Scale (Batch)
• Raw data: Amazon S3
• ETL: Amazon EMR, AWS Glue
• Staged data (data lake): Amazon S3
• Advanced analytics: MLlib, deep learning, Amazon ML
Speed (Real-time)
• Event capture: Amazon Kinesis
• Stream analysis: Amazon EMR
• Event scoring: Amazon AI
• Event handler: AWS Lambda
• Response handler: AWS Lambda
Serving
• Data warehouse: Amazon Redshift
• Legacy apps: Amazon RDS
• Schemaless: Amazon Elasticsearch Service
• Direct query: Amazon Athena
• Near-zero latency: Amazon DynamoDB
• Semi/unstructured: Amazon EMR
• Amazon QuickSight
• Amazon API Gateway
Consumers
• Data analysts
• Data scientists
• Business users
• Engagement platforms
• Automation / events
Supporting services
• AWS CloudTrail
• AWS IAM
• Amazon CloudWatch
• AWS KMS
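The architecture splits processing into a batch path (scale) and a real-time path (speed), with a serving layer answering queries. Below is a minimal in-memory sketch of how a serving layer might combine the two, assuming hypothetical page-view events and plain Python counters in place of the AWS services shown.

```python
from collections import Counter

def batch_view(historical_events):
    """Scale path: recompute totals over all events stored so far."""
    return Counter(page for page, _ in historical_events)

def speed_view(recent_events):
    """Speed path: totals for events not yet included in the batch view."""
    return Counter(page for page, _ in recent_events)

def serve(page, batch, speed):
    """Serving layer: answer a query by merging both views."""
    return batch[page] + speed[page]

stored = [("home", 1), ("home", 2), ("cart", 3)]  # (page, timestamp)
arriving = [("home", 4), ("checkout", 5)]

batch, speed = batch_view(stored), speed_view(arriving)
home_views = serve("home", batch, speed)
```

The batch view can be rebuilt from the raw data at any time, while the speed view only has to cover the gap since the last batch run.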
Reference Architecture
Sample Reference Architecture: Data Lake
Components shown:
• Users: Analytics App (Auto Scaling EC2)
• Data Providers: Data Ingestion Services (Auto Scaling EC2)
• Source of Truth (S3)
• Normalization ETL Clusters (EMR)
• Optimization ETL Clusters (EMR)
• Query Optimized (S3)
• Batch Analytic Clusters (EMR)
• Ad Hoc Query Cluster (EMR)
• Query Clusters (EMR)
• Data Marts (Amazon Redshift)
• Athena, Glue
• Shared Data Services: Shared Metastore (RDS), Reference Data (RDS),
Data Catalog & Lineage Services (Auto Scaling EC2),
Cluster Mgt & Workflow Services (Auto Scaling EC2)
Scale: >5 PB, up to 75 billion events per day
Enterprise Data Warehouse
Components shown: Data Sources, Amazon S3 (×2), Amazon EMR, Amazon Athena (×2), Amazon Redshift, Amazon QuickSight
Nordstrom
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Outcomes & insights
• Personalized recommendations within seconds (from 15-20 min)
• Scale the expertise of stylists to all shoppers
• Reduce costs by two orders of magnitude
…
Components shown: Mobile Users, Desktop Users, Online Stylist, Analytics Tools, Amazon Kinesis, AWS Lambda, Amazon DynamoDB, Amazon S3 (Data Storage), Amazon Redshift
Big Data on AWS:
https://aws.amazon.com/big-data/
Thank you!