Anomaly Detection Algorithm and Streaming SQL in Amazon Kinesis Analytics
Denis Batalov, PhD @dbatalov
Sr. Solutions Architect, ML and AI specialist
Amazon Web Services, Luxembourg
IoT: high-volume data streams
The standard solution: thresholds
Orders chart
False alarm
Too many metrics
Metrics inside Amazon
Today you will learn about
1. The Random Cut Forest anomaly detection algorithm
2. The Amazon EC2 spot market for virtual machines
3. Streaming SQL for stream processing
4. Detecting spot market price anomalies with Amazon Kinesis Analytics
Random Cut Tree
Repeat: cutting ends when every point is isolated
Cutting the long side
lots of data
A poor cut
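The cutting procedure on these slides can be sketched in Python. This is a minimal sketch, not the Kinesis Analytics implementation: `build_tree` and `rng` are illustrative names, the points are assumed to be distinct, and the cut dimension is drawn with probability proportional to the side length of the bounding box, which is one way to favour the long side.

```python
import random

def build_tree(points, rng, depth=0):
    """Cut the bounding box at random until every point is isolated.

    Returns a dict mapping each point (a tuple) to the depth of its leaf.
    Assumes all points are distinct.
    """
    if len(points) == 1:
        return {points[0]: depth}
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    sides = [h - l for l, h in zip(lows, highs)]
    # Longer sides are more likely to be cut than shorter ones.
    dim = rng.choices(range(len(sides)), weights=sides)[0]
    cut = rng.uniform(lows[dim], highs[dim])
    left = [p for p in points if p[dim] <= cut]
    right = [p for p in points if p[dim] > cut]
    if not left or not right:
        # The cut landed exactly on an extreme value; draw again.
        return build_tree(points, rng, depth)
    depths = build_tree(left, rng, depth + 1)
    depths.update(build_tree(right, rng, depth + 1))
    return depths

# Illustrative data: a tight cluster plus one far-away point.
demo_rng = random.Random(0)
cluster = [(demo_rng.random(), demo_rng.random()) for _ in range(20)]
far_point = (10.0, 10.0)
leaf_depths = build_tree(cluster + [far_point], demo_rng)
```

A point far from the rest tends to be cut off close to the root, while points inside a dense cluster need many cuts before they are isolated.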
Random Cut Forest
Each tree is built on a random sample
…
Random sampling from a stream
"reservoir sampling" [Vitter]
A random sample of 5 values from a stream?
keep with probability
discard with probability
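Reservoir sampling can be sketched as follows. This is the textbook form (often called Algorithm R), in which the n-th value of the stream is kept with probability k/n for a sample of size k; the slide's own formulas did not survive extraction, and the function name `reservoir_sample` is illustrative.

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k values from a stream of unknown length."""
    sample = []
    for n, value in enumerate(stream, start=1):
        if n <= k:
            # The first k values fill the reservoir.
            sample.append(value)
        elif rng.random() < k / n:
            # Keep the n-th value with probability k/n, evicting a random resident.
            sample[rng.randrange(k)] = value
        # Otherwise the value is discarded (probability 1 - k/n).
    return sample

five_values = reservoir_sample(range(1000), 5, random.Random(42))
```

The sample is updated one value at a time and never needs the full stream in memory, which is what makes it usable on an unbounded stream.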
Random Cut Forest
stream
…
stream
Delete operation
Theorem: the result of a deletion is a tree T' built from T ( )
Insert operation: Case I
Start at the root node
If the value is inside the bounding box, descend the tree along the corresponding branch
Insert operation: Case II
Theorem: the result of an insertion is a tree T' ~ T( )
What is an anomaly or outlier?
Anomaly score
A value is anomalous if inserting it into the tree substantially increases the size of the tree, that is, the sum of the lengths of all branches (or the description length of the data)
normal value:
Anomaly score
anomaly
The algorithm from the top
…
stream: pretend to insert each value and obtain its anomaly score
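The "pretend insert" scoring can be sketched as below. This is a simplified sketch under stated assumptions, not the scoring used by Kinesis Analytics: the score is the increase in the sum of leaf depths that an insertion would cause, averaged over the forest; points are assumed distinct; and `Node`, `random_cut`, `pretend_insert_score` and `anomaly_score` are illustrative names.

```python
import random

def random_cut(lows, highs, rng):
    """Pick a dimension proportionally to side length, then a uniform cut value."""
    sides = [h - l for l, h in zip(lows, highs)]
    dim = rng.choices(range(len(sides)), weights=sides)[0]
    return dim, rng.uniform(lows[dim], highs[dim])

class Node:
    """Random cut tree node over distinct points (tuples); a leaf holds one point."""
    def __init__(self, points, rng):
        dims = range(len(points[0]))
        self.lows = [min(p[d] for p in points) for d in dims]
        self.highs = [max(p[d] for p in points) for d in dims]
        self.size = len(points)
        self.children = None
        while self.size > 1 and self.children is None:
            self.dim, self.cut = random_cut(self.lows, self.highs, rng)
            left = [p for p in points if p[self.dim] <= self.cut]
            right = [p for p in points if p[self.dim] > self.cut]
            if left and right:  # otherwise the cut hit an extreme value; redraw
                self.children = (Node(left, rng), Node(right, rng))

def pretend_insert_score(root, point, rng):
    """Increase in the sum of leaf depths if `point` were inserted (tree unchanged)."""
    node, depth = root, 0
    while True:
        # Bounding box of the current subtree, extended to include the new point.
        lows = [min(l, x) for l, x in zip(node.lows, point)]
        highs = [max(h, x) for h, x in zip(node.highs, point)]
        if lows == highs:
            return 0  # assumption: a duplicate of an existing leaf adds nothing
        dim, cut = random_cut(lows, highs, rng)
        below = cut < node.lows[dim]
        above = cut >= node.highs[dim] and point[dim] > cut
        if below or above:
            # The cut separates the point from the whole subtree: every leaf of
            # the subtree moves one level down and a new leaf appears at depth + 1.
            return node.size + depth + 1
        if node.children is not None:
            # The cut fell inside the existing box: follow the existing branch.
            node = node.children[0] if point[node.dim] <= node.cut else node.children[1]
            depth += 1

def anomaly_score(forest, point, rng):
    """Average the pretend-insert score over all trees in the forest."""
    return sum(pretend_insert_score(t, point, rng) for t in forest) / len(forest)

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(256)]
forest = [Node(rng.sample(data, 64), rng) for _ in range(30)]
normal_score = anomaly_score(forest, (0.5, 0.5), rng)
outlier_score = anomaly_score(forest, (10.0, 10.0), rng)
```

A value that lands far outside the data is usually separated at or near the root, pushing almost every existing leaf one level deeper, so its score is much larger than that of a value inside the cluster.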
Experiments with real data
Taxi rides in New York City
[Figure: numPassengers (0 to 30000), 2014-12-01 to 2014-12-07, with weekdays Mon through Sat marked; annotated times: 8am, 6pm, 4pm, 11pm, 11am]
Data is aggregated every 30 minutes; shingle size: 48
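With 30-minute aggregates, a shingle of 48 consecutive values spans 24 hours, so each point fed to the forest describes a full day of traffic. A minimal sketch of shingling, with the illustrative name `shingles`:

```python
from collections import deque

def shingles(values, size):
    """Yield each run of `size` consecutive values as one tuple (a sliding window)."""
    window = deque(maxlen=size)
    for value in values:
        window.append(value)  # the oldest value drops out automatically
        if len(window) == size:
            yield tuple(window)

# Tiny illustration with a shingle size of 3 instead of 48.
small_example = list(shingles([10, 20, 30, 40, 50], 3))
```

Each shingle overlaps the previous one in all but one value, so a stream of n aggregates yields n - size + 1 shingles.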
[Figure: numPassengers (0 to 45000), 2014-09-17 to 2015-01-14]
Taxi rides in New York City
[Figure: numPassengers (0 to 45000) and Anomaly Score (0 to 100), 2014-09-17 to 2015-01-17]
Taxi rides in New York City
ECG
Dig deeper
"Robust Random Cut Forest Based Anomaly Detection on Streams" [Guha, Mishra, Roy, Schrijvers]
http://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html
Compute Purchasing Models
On-Demand
Pay for compute capacity by the hour with no long-term commitments
For spiky workloads, or to define needs
Reserved
Make a low, one-time payment and receive a significant discount on the hourly charge
For committed utilization
Spot
Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand
For time-insensitive or transient workloads
Dedicated
Launch instances within Amazon VPC that run on hardware dedicated to a single customer
For highly sensitive or compliance related workloads
Free Tier
Get started on AWS with free usage & no commitment
For POCs and getting started
Reserved Instances (RI)
For example:
• Reserve capacity for one or three years
• Pay a low, one-time fee for the capacity reservation
• Receive a significant discount on the hourly charge for your instance
Reserved Instance Payment Options Explained
No Upfront option:
• Up to a 55% discount compared to On-Demand
• Does not require upfront payment
• Low hourly rate for the RI on an ongoing hourly basis
Partial Upfront option:
• Balances the payments of an RI between upfront and hourly
• Provides a higher discount (up to 76%) compared to the No Upfront option
• Pay a very low hourly rate for every hour in the term, regardless of usage
All Upfront option:
• Highest discount compared to On-Demand (up to 77% off)
Reserved Instance vs. On-Demand
[Chart: m3.xlarge 1yr OD/RI Break Even Utilization; x-axis: Utilization Over a Year (30% to 100%); y-axis: cost ($0 to $3,000); series: On Demand, No Upfront, Partial Upfront, All Upfront]
What are the “break-even” points of each of these options in relation to purchasing instances On-Demand?
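The break-even point can be worked out with simple arithmetic: an RI is billed for every hour in the term regardless of usage, while On-Demand is billed only for hours used. The sketch below shows the calculation for a one-year term; the prices are hypothetical placeholders, not actual m3.xlarge rates, and `break_even_utilization` is an illustrative name.

```python
HOURS_PER_YEAR = 8760

def break_even_utilization(od_hourly, ri_upfront, ri_hourly):
    """Fraction of the year an instance must run before a 1-year RI beats On-Demand."""
    # RI cost is fixed: the upfront fee plus the hourly rate for every hour of the term.
    ri_total = ri_upfront + ri_hourly * HOURS_PER_YEAR
    # On-Demand cost grows with usage; break-even is where the two are equal.
    return ri_total / (od_hourly * HOURS_PER_YEAR)

# Hypothetical prices, chosen only to illustrate the calculation.
no_upfront = break_even_utilization(od_hourly=0.10, ri_upfront=0.0, ri_hourly=0.06)
all_upfront = break_even_utilization(od_hourly=0.10, ri_upfront=438.0, ri_hourly=0.0)
```

Below the break-even utilization On-Demand is cheaper; above it the reservation is cheaper, and a larger discount moves the break-even point lower.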
Spot instances
What are Spot instances?
• Spare EC2 instances bid on in hourly increments
• One hour at a time
• Behave exactly like regular instances
Cost Benefits
• Up to 92% off regular on-demand prices per hour
What is the trade-off?
• May be interrupted if that instance is needed for EC2 capacity
• No charge for any partial hour due to termination
Reserved Instances
Spot Instances
Amazon Kinesis platform overview
Amazon Kinesis Streams
Easy administration: Create a stream, set capacity level with shards. Scale to match your data throughput rate & volume.
Build real-time applications: Process streaming data with Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, ....
Low cost: Cost-efficient for workloads of any scale.
Amazon Kinesis Firehose
Zero administration: Capture and deliver streaming data to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service without writing an app or managing infrastructure.
Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery in as little as 60 seconds.
Seamless elasticity: Seamlessly scales to match data throughput without intervention.
Capture and submit streaming data to Firehose
Analyze streaming data using your favorite BI tools
Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon ES
Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis stream or Firehose delivery stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.
Easy scalability: Elastically scales to match data throughput.
Connect to Kinesis streams, Firehose delivery streams
Run standard SQL queries against data streams
Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time
Amazon Kinesis: streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Analytics
For all developers, data scientists
Easily analyze data streams using standard SQL queries
Kinesis Firehose
For all developers, data scientists
Easily load massive volumes of streaming data into S3, Amazon Redshift, or Amazon ES
Kinesis Streams
For technical developers
Collect and stream data for ordered, replayable, real-time processing
Amazon Kinesis Analytics
Kinesis Analytics
Pay for only what you use
Automatic elasticity
Standard SQL for analytics
Real-time processing
Easy to use
Use SQL to build real-time applications
Easily write SQL code to process streaming data
Connect to streaming source
Continuously deliver SQL results
Connect to streaming source
• Streaming data sources include Firehose or Streams
• Input formats include JSON, .csv, variable column, unstructured text
• Each input has a schema; schema is inferred, but you can edit
• Reference data sources (S3) for data enrichment
Write SQL code
• Build streaming applications with one or more SQL statements
• Robust SQL support and advanced analytic functions
• Extensions to the SQL standard to work seamlessly with streaming data
• Support for at-least-once processing semantics
Continuously deliver SQL results
• Send processed data to multiple destinations
• S3, Amazon Redshift, Amazon ES (through Firehose)
• Streams (with AWS Lambda integration for custom destinations)
• End-to-end processing speed as low as sub-second
• Separation of processing and data delivery
Generate time series analytics
• Compute key performance indicators over time windows
• Combine with historical data in S3 or Amazon Redshift
[Diagram labels: Analytics; Streams; Firehose; Amazon Redshift; S3; Streams; Firehose; custom, real-time destinations]
Feed real-time dashboards
• Validate and transform raw data, and then process to calculate meaningful statistics
• Send processed data downstream for visualization in BI and visualization services
[Diagram labels: Amazon QuickSight; Analytics; Amazon ES; Amazon Redshift; Amazon RDS; Streams; Firehose]
Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications
[Diagram labels: Analytics; Streams; Firehose; Streams; Amazon SNS; Amazon CloudWatch; Lambda]
SQL on streaming data
• SQL is an API to your data
• Ask for what you want, system decides how to get it
• For all data, not just “flat” data in a database
• Opportunity for novel data organization and algorithms
• A standard (ANSI 2008, 2011) and the most commonly used data manipulation language
A simple streaming query
• Tweets about the AWS NYC Summit
• Selecting from a STREAM of tweets, an in-application stream
• Each row has a corresponding ROWTIME
SELECT STREAM ROWTIME, author, text
FROM Tweets
WHERE text LIKE '%#AWSNYCSummit%'
A streaming table is a STREAM
• In relational databases, you work with SQL tables
• With Analytics, you work with STREAMs
• SELECT, INSERT, and CREATE can be used with STREAMs
CREATE STREAM Tweets(author VARCHAR(20), text VARCHAR(140));
INSERT INTO Tweets SELECT …
Writing queries on unbounded data sets
• Streams are unbounded data sets
• Need continuous queries, row-by-row or across rows
• WINDOWs define a start and end to the query
SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
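What the windowed query computes can be sketched in plain Python: for every incoming row, count the rows by the same author within the preceding minute. This is only an illustration of the semantics, not how Kinesis Analytics executes the query; `sliding_counts` is an illustrative name, rows are assumed to arrive in time order, and treating the window boundary as inclusive is an assumption.

```python
from collections import defaultdict, deque

def sliding_counts(rows, window_seconds=60):
    """For each (rowtime_seconds, author) row, yield (author, count in the window).

    The window covers the current row and the preceding `window_seconds`,
    partitioned by author.
    """
    recent = defaultdict(deque)  # author -> rowtimes still inside the window
    for rowtime, author in rows:
        times = recent[author]
        times.append(rowtime)
        # Drop rows that have fallen out of the window.
        while times[0] < rowtime - window_seconds:
            times.popleft()
        yield author, len(times)

example_rows = [(0, "a"), (30, "a"), (40, "b"), (61, "a"), (100, "a")]
example_counts = list(sliding_counts(example_rows))
```

As in the streaming query, one output row is produced per input row, and the result never "finishes" because the input is unbounded.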
Anomalies in spot prices
CREATE OR REPLACE PUMP "WEIGHTED_FAMILY_STREAM_PUMP" AS
INSERT INTO "WEIGHTED_FAMILY_STREAM"
SELECT STREAM "ts", "availabilityzone", "instancetype", "family", "size", "magnitude",
    "spotprice"/"magnitude" as "weightedprice", "spotprice"
FROM (
    SELECT STREAM "ts", "availabilityzone", "instancetype",
        instance_family("instancetype") as "family",
        instance_size("instancetype") as "size",
        instance_magnitude("instancetype") as "magnitude",
        "spotprice"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "productdescription" = 'Linux/UNIX')
WHERE "family" = 'C4';
Anomalies in spot prices
CREATE OR REPLACE PUMP "AZ_PRICE_STREAM_PUMP" AS
INSERT INTO "AZ_PRICE_STREAM"
SELECT STREAM "ts", "eu-west-1a-price", "eu-west-1b-price", "eu-west-1c-price",
    "ANOMALY_SCORE" as "anomaly_score"
FROM TABLE(RANDOM_CUT_FOREST(
    CURSOR(SELECT STREAM "ts",
        avg(case when "availabilityzone" = 'eu-west-1a' then "weightedprice" else null end) over w1 as "eu-west-1a-price",
        avg(case when "availabilityzone" = 'eu-west-1b' then "weightedprice" else null end) over w1 as "eu-west-1b-price",
        avg(case when "availabilityzone" = 'eu-west-1c' then "weightedprice" else null end) over w1 as "eu-west-1c-price"
    FROM "WEIGHTED_FAMILY_STREAM"
    WINDOW W1 AS (RANGE INTERVAL '10' MINUTE PRECEDING)),
    100, 100, 10000, 10));
@dbatalov
Questions?
aws.amazon.com/ru
Denis Batalov, PhD, Sr. Solutions Architect, ML and AI specialist
Amazon Web Services, Luxembourg
@awsoblako aws.amazon.com/ru/blogs/rus/