Anomaly Detection Algorithm and Streaming SQL in Amazon Kinesis Analytics — Денис Баталов, PhD, @dbatalov, Sr. Solutions Architect, ML and AI specialist, Amazon Web Services, Luxembourg


Page 1: Денис Баталов

Anomaly Detection Algorithm and Streaming SQL

in Amazon Kinesis Analytics

Денис Баталов, PhD @dbatalov

Sr. Solutions Architect, ML and AI specialist

Amazon Web Services, Luxembourg

Page 2: Денис Баталов

IoT – massive streams of data

Page 3: Денис Баталов

The standard solution: threshold values

[Chart: orders graph with a fixed threshold; a false alarm is flagged; too many metrics to set thresholds by hand; metrics inside Amazon]

Page 4: Денис Баталов

What you will learn today

1. The Random Cut Forest anomaly detection algorithm

2. The Amazon EC2 Spot market for virtual machines

3. Streaming SQL for stream processing

4. Detecting Spot market price anomalies with Amazon Kinesis Analytics

Page 5: Денис Баталов

Random Cut Tree

Cut across the longer side of the bounding box, then repeat: cutting stops once every point is isolated.

[Figure labels: lots of data; an unlucky cut]

Page 6: Денис Баталов

Random Cut Forest

Each tree is built on a random sample of the data

Page 7: Денис Баталов

Random sampling from a stream

"reservoir sampling" [Vitter]

How do you draw a random sample of 5 values from a stream? Keep the n-th arriving value with probability 5/n (evicting a randomly chosen current sample member), and discard it with probability 1 − 5/n.
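A minimal sketch of that reservoir sampling idea (Vitter's Algorithm R with k = 5) in Python; the stream is any iterable, and the function name is illustrative:

import random

def reservoir_sample(stream, k=5):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)                  # fill the reservoir with the first k items
        elif random.random() < k / n:            # keep the n-th item with probability k/n
            sample[random.randrange(k)] = item   # ...evicting a uniformly chosen current member
        # otherwise the item is discarded, with probability 1 - k/n
    return sample

# usage: a uniform sample of 5 values from a stream of 10,000 readings
print(reservoir_sample(range(10_000), k=5))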

Page 8: Денис Баталов

Random Cut Forest

stream (diagram label)

Page 9: Денис Баталов

The delete operation

Theorem: the result of deleting a point is a tree T′ distributed as if it had been built directly on the remaining sample.

Page 10: Денис Баталов

The insert operation – Case I

Start at the root node.

If the new value falls inside the bounding box, descend the tree along the corresponding branch.

Page 11: Денис Баталов

The insert operation – Case II

Theorem: the resulting tree T′ is distributed like a tree built from scratch on the sample plus the new point.

Page 12: Денис Баталов

What is an anomaly, or outlier?

Page 13: Денис Баталов

The anomaly score

A value is anomalous if inserting it into the tree substantially increases the size of the tree, that is, the sum of the lengths of all branches (equivalently, the description length of the data).

a normal value:

Page 14: Денис Баталов

The anomaly score

an anomaly:

Page 15: Денис Баталов

The algorithm from the top

stream: insert each new value hypothetically, obtain its anomaly score
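To make the preceding slides concrete, here is a toy, non-streaming Python sketch (not the Kinesis Analytics implementation): build each tree by recursive random cuts on a sample, then score a point by simulating its insertion and measuring how much the sum of branch lengths would grow. Class and function names are illustrative.

import random

class Node:
    """One node of a random cut tree over a list of points (equal-length tuples)."""
    def __init__(self, points, depth=0):
        self.depth = depth
        self.n = len(points)
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.left = self.right = None
        if self.n > 1:
            # pick a dimension with probability proportional to its side length,
            # then a uniform cut through the bounding box along that dimension
            lengths = [h - l for l, h in zip(self.lo, self.hi)]
            self.dim = random.choices(list(dims), weights=lengths)[0]
            self.cut = random.uniform(self.lo[self.dim], self.hi[self.dim])
            self.left = Node([p for p in points if p[self.dim] <= self.cut], depth + 1)
            self.right = Node([p for p in points if p[self.dim] > self.cut], depth + 1)

def insertion_cost(node, x):
    """Simulate inserting x: by how much would the sum of branch lengths grow?"""
    while True:
        lo = [min(l, v) for l, v in zip(node.lo, x)]
        hi = [max(h, v) for h, v in zip(node.hi, x)]
        lengths = [h - l for l, h in zip(lo, hi)]
        dim = random.choices(range(len(x)), weights=lengths)[0]
        cut = random.uniform(lo[dim], hi[dim])
        separated = cut < node.lo[dim] or cut > node.hi[dim]   # cut falls outside the old box
        if separated or node.left is None:
            # Case II: x becomes a sibling of this subtree; its n leaves each move one
            # level deeper and the new leaf itself sits at depth node.depth + 1
            return node.n + node.depth + 1
        # Case I: x is inside the existing contour -> descend along the matching branch
        node = node.left if x[node.dim] <= node.cut else node.right

def anomaly_score(forest, x):
    return sum(insertion_cost(tree, x) for tree in forest) / len(forest)

# usage: 40 trees, each built on a random sample of 2-D data clustered near the origin
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
forest = [Node(random.sample(data, 256)) for _ in range(40)]
print("typical point:", anomaly_score(forest, (0.1, -0.2)))
print("far-out point:", anomaly_score(forest, (8.0, 8.0)))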

Page 16: Денис Баталов

Experiments with real data

Page 17: Денис Баталов

New York City taxi rides

[Chart: numPassengers in 30-minute buckets over the week of 2014-12-01 to 2014-12-07, roughly 0–30,000 passengers per bucket, with times of day (8am, 4pm, 6pm on weekdays; 11am, 11pm on Saturday) annotated]

Data is aggregated every 30 minutes; shingle size: 48
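Shingle size 48 means each point fed to the forest is a sliding window of 48 consecutive 30-minute counts (one day of context), so the algorithm sees the shape of the curve rather than a single level. A minimal Python sketch of that preprocessing step (illustrative, not the service's internal code):

def shingles(series, size=48):
    """Turn a 1-D series of counts into overlapping windows ("shingles") of `size`
    consecutive values; each window becomes one multi-dimensional point."""
    return [tuple(series[i:i + size]) for i in range(len(series) - size + 1)]

# usage (size 2 only to keep the example tiny; the talk uses 48)
print(shingles([10500, 11020, 9875, 12340], size=2))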

Page 18: Денис Баталов

New York City taxi rides

[Chart: numPassengers in 30-minute buckets from 2014-09-17 to 2015-01-14, roughly 0–45,000 passengers per bucket]

Page 19: Денис Баталов

New York City taxi rides

[Chart: the same numPassengers series from 2014-09-17 to 2015-01-17 (left axis, 0–45,000) overlaid with the Random Cut Forest anomaly score (right axis, 0–100)]

Page 20: Денис Баталов

ECG (electrocardiogram)

Page 22: Денис Баталов

Compute Purchasing Models

On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.

Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.

Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.

Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.

Free Tier: Get started on AWS with free usage and no commitment. For POCs and getting started.

Page 23: Денис Баталов

Reserved Instances (RI)

For example:

• Reserve capacity for one or three years
• Pay a low, one-time fee for the capacity reservation
• Receive a significant discount on the hourly charge for your instance

Page 24: Денис Баталов

Reserved Instance Payment Options Explained

No Upfront option:
• Up to a 55% discount compared to On-Demand
• Does not require an upfront payment
• Low hourly rate for the RI on an ongoing hourly basis

Partial Upfront option:
• Balances the payments of an RI between upfront and hourly
• Provides a higher discount (up to 76%) compared to the No Upfront option
• Pay a very low hourly rate for every hour in the term, regardless of usage

All Upfront option:
• Highest discount compared to On-Demand (up to 77% off)

Page 25: Денис Баталов

Reserved Instance vs. On-Demand

[Chart: m3.xlarge 1-year cost ($0–$3,000) vs. utilization over a year (30%–100%) for On-Demand, No Upfront, Partial Upfront, and All Upfront, showing the OD/RI break-even utilization]

What are the “break-even” points of each of these options in relation to purchasing instances On-Demand?
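As a rough illustration of those break-even points: spread any upfront fee over the hours in the term, add the recurring hourly RI rate, and divide by the On-Demand rate. A small Python sketch; the prices below are hypothetical placeholders, not actual m3.xlarge rates.

HOURS_PER_YEAR = 8760

def break_even_utilization(upfront, hourly_ri, hourly_od, term_hours=HOURS_PER_YEAR):
    """Fraction of the term you must actually run On-Demand before a 1-year RI
    becomes the cheaper option (RI hourly fees accrue for every hour of the term)."""
    effective_ri_rate = upfront / term_hours + hourly_ri   # RI cost per hour of the term
    return effective_ri_rate / hourly_od

# hypothetical prices in USD/hour: On-Demand 0.266; No Upfront RI 0.19;
# Partial Upfront RI 0.10 plus a $600 one-time fee
print(break_even_utilization(upfront=0,   hourly_ri=0.19, hourly_od=0.266))   # ~0.71
print(break_even_utilization(upfront=600, hourly_ri=0.10, hourly_od=0.266))   # ~0.63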

Page 26: Денис Баталов

Spot Instances

What are Spot instances?
• Spare EC2 instances that you bid on in hourly increments
• One hour at a time
• Behave exactly like regular instances

Cost Benefits
• Up to 92% off regular On-Demand prices per hour

What is the trade-off?
• May be interrupted if that capacity is needed by EC2
• No charge for any partial hour due to termination
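The spot price stream analyzed later in this talk can be pulled from the EC2 API. A minimal boto3 sketch; the region, instance types, and time range are illustrative choices:

from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Last 24 hours of Linux/UNIX spot prices for a few C4 sizes; the API returns one
# record per (instance type, availability zone) price change.
resp = ec2.describe_spot_price_history(
    InstanceTypes=["c4.large", "c4.xlarge", "c4.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)
for p in resp["SpotPriceHistory"]:
    print(p["Timestamp"], p["AvailabilityZone"], p["InstanceType"], p["SpotPrice"])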

Page 27: Денис Баталов

Reserved Instances

Page 28: Денис Баталов

Spot Instances

Page 29: Денис Баталов

Amazon Kinesis platform overview

Page 30: Денис Баталов

Amazon Kinesis Streams

Easy administration: Create a stream, set capacity level with shards. Scale to match your data throughput rate & volume.

Build real-time applications: Process streaming data with Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, ....

Low cost: Cost-efficient for workloads of any scale.
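A minimal producer sketch with boto3 for getting records into a stream; the stream name, region, and record fields are placeholders:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

def put_reading(stream_name, reading):
    """Write one JSON record to a Kinesis stream; the PartitionKey determines the shard,
    so records with the same key stay ordered relative to each other."""
    return kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=str(reading["sensor_id"]),
    )

# hypothetical usage
put_reading("sensor-readings", {"sensor_id": 42, "ts": "2016-11-01T12:00:00Z", "value": 3.14})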

Page 31: Денис Баталов

Amazon Kinesis Firehose

Zero administration: Capture and deliver streaming data to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service without writing an app or managing infrastructure.

Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery in as little as 60 seconds.

Seamless elasticity: Seamlessly scales to match data throughput without intervention.

Capture and submit streaming data to Firehose

Analyze streaming data using your favorite BI tools

Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon ES

Page 32: Денис Баталов

Amazon Kinesis Analytics

Apply SQL on streams: Easily connect to a Kinesis stream or Firehose delivery stream and apply SQL skills.

Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.

Easy scalability: Elastically scales to match data throughput.

Connect to Kinesis streams, Firehose delivery streams

Run standard SQL queries against data streams

Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time

Page 33: Денис Баталов

Amazon Kinesis: streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS

Kinesis Analytics (for all developers, data scientists): Easily analyze data streams using standard SQL queries

Kinesis Firehose (for all developers, data scientists): Easily load massive volumes of streaming data into S3, Amazon Redshift, or Amazon ES

Kinesis Streams (for technical developers): Collect and stream data for ordered, replayable, real-time processing

Page 34: Денис Баталов

Amazon Kinesis Analytics

Page 35: Денис Баталов

Kinesis Analytics

Pay for only what you use

Automatic elasticity

Standard SQL for analytics

Real-time processing

Easy to use

Page 36: Денис Баталов

Use SQL to build real-time applications

Easily write SQL code to process streaming data

Connect to streaming source

Continuously deliver SQL results

Page 37: Денис Баталов

Connect to streaming source

• Streaming data sources include Firehose or Streams

• Input formats include JSON, .csv, variable column, unstructured text

• Each input has a schema; schema is inferred, but you can edit

• Reference data sources (S3) for data enrichment

Page 38: Денис Баталов

Write SQL code

• Build streaming applications with one-to-many SQL statements

• Robust SQL support and advanced analytic functions

• Extensions to the SQL standard to work seamlessly with streaming data

• Support for at-least-once processing semantics

Page 39: Денис Баталов

Continuously deliver SQL results

• Send processed data to multiple destinations
• S3, Amazon Redshift, Amazon ES (through Firehose)
• Streams (with AWS Lambda integration for custom destinations)
• End-to-end processing speed as low as sub-second
• Separation of processing and data delivery
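Putting the three steps together (connect a source, supply SQL code, name a destination), here is a rough boto3 sketch against the Kinesis Analytics API; the ARNs, role, schema, and SQL body are placeholders and the input schema is deliberately minimal:

import boto3

ka = boto3.client("kinesisanalytics", region_name="eu-west-1")

ka.create_application(
    ApplicationName="spot-price-anomalies",
    ApplicationCode="""
        CREATE OR REPLACE STREAM "DEST_STREAM" ("ts" VARCHAR(32), "spotprice" DOUBLE);
        CREATE OR REPLACE PUMP "P" AS INSERT INTO "DEST_STREAM"
        SELECT STREAM "ts", "spotprice" FROM "SOURCE_SQL_STREAM_001";
    """,
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",   # becomes SOURCE_SQL_STREAM_001 inside the SQL
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/spot-prices",
            "RoleARN": "arn:aws:iam::123456789012:role/kinesis-analytics-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "ts", "SqlType": "VARCHAR(32)", "Mapping": "$.ts"},
                {"Name": "spotprice", "SqlType": "DOUBLE", "Mapping": "$.spotprice"},
            ],
        },
    }],
    Outputs=[{
        "Name": "DEST_STREAM",   # must match the in-application stream created in the SQL
        "KinesisFirehoseOutput": {
            "ResourceARN": "arn:aws:firehose:eu-west-1:123456789012:deliverystream/results",
            "RoleARN": "arn:aws:iam::123456789012:role/kinesis-analytics-role",
        },
        "DestinationSchema": {"RecordFormatType": "JSON"},
    }],
)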

Page 40: Денис Баталов

Generate time series analytics

• Compute key performance indicators over time windows
• Combine with historical data in S3 or Amazon Redshift

[Diagram: Streams / Firehose → Analytics → Streams / Firehose → Amazon Redshift, S3, custom real-time destinations]

Page 41: Денис Баталов

Feed real-time dashboards

• Validate and transform raw data, and then process to calculate meaningful statistics

• Send processed data downstream for visualization in BI and visualization services

[Diagram: Streams / Firehose → Analytics → Amazon ES, Amazon Redshift, Amazon RDS → Amazon QuickSight]

Page 42: Денис Баталов

Create real-time alarms and notifications

• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs

• Identify events (or a series of events) of interest, and react to the data through alarms and notifications

[Diagram: Streams / Firehose → Analytics → Streams → Lambda → Amazon SNS, Amazon CloudWatch]

Page 43: Денис Баталов

SQL on streaming data

• SQL is an API to your data

• Ask for what you want, system decides how to get it

• For all data, not just “flat” data in a database

• Opportunity for novel data organization and algorithms

• A standard (ANSI 2008, 2011) and the most commonly used data manipulation language

Page 44: Денис Баталов

A simple streaming query

• Tweets about the AWS NYC Summit
• Selecting from a STREAM of tweets, an in-application stream
• Each row has a corresponding ROWTIME

SELECT STREAM ROWTIME, author, text
FROM Tweets
WHERE text LIKE '%#AWSNYCSummit%';

Page 45: Денис Баталов

A streaming table is a STREAM

• In relational databases, you work with SQL tables
• With Analytics, you work with STREAMs
• SELECT, INSERT, and CREATE can be used with STREAMs

CREATE STREAM Tweets(author VARCHAR(20), text VARCHAR(140));

INSERT INTO Tweets SELECT …

Page 46: Денис Баталов

Writing queries on unbounded data sets

• Streams are unbounded data sets
• Need continuous queries, row by row or across rows
• WINDOWs define a start and end to the query

SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);

Page 47: Денис Баталов

Spot price anomalies

CREATE OR REPLACE PUMP "WEIGHTED_FAMILY_STREAM_PUMP" AS
INSERT INTO "WEIGHTED_FAMILY_STREAM"
SELECT STREAM "ts", "availabilityzone", "instancetype", "family", "size", "magnitude",
       "spotprice" / "magnitude" AS "weightedprice", "spotprice"
FROM (SELECT STREAM "ts", "availabilityzone", "instancetype",
             instance_family("instancetype") AS "family",
             instance_size("instancetype") AS "size",
             instance_magnitude("instancetype") AS "magnitude",
             "spotprice"
      FROM "SOURCE_SQL_STREAM_001"
      WHERE "productdescription" = 'Linux/UNIX')
WHERE "family" = 'C4';
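instance_family, instance_size, and instance_magnitude above are user-defined helpers from the demo, not built-ins; the Python sketch below shows the kind of parsing they presumably perform (splitting 'c4.xlarge' into family 'C4', size 'xlarge', and a size multiplier used to normalize the price). The multipliers are illustrative.

# Hypothetical reimplementation of the helper functions referenced in the SQL above:
# they normalize a spot price by instance size so different sizes within one family
# become comparable ("weightedprice" = spotprice / magnitude).
SIZE_MAGNITUDE = {   # illustrative multipliers, roughly proportional to instance size
    "large": 1, "xlarge": 2, "2xlarge": 4, "4xlarge": 8, "8xlarge": 16,
}

def instance_family(instance_type):      # "c4.xlarge" -> "C4"
    return instance_type.split(".")[0].upper()

def instance_size(instance_type):        # "c4.xlarge" -> "xlarge"
    return instance_type.split(".")[1]

def instance_magnitude(instance_type):   # "c4.xlarge" -> 2
    return SIZE_MAGNITUDE[instance_size(instance_type)]

print(instance_family("c4.xlarge"), instance_size("c4.xlarge"), instance_magnitude("c4.xlarge"))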

Page 48: Денис Баталов

Spot price anomalies

CREATE OR REPLACE PUMP "AZ_PRICE_STREAM_PUMP" AS
INSERT INTO "AZ_PRICE_STREAM"
SELECT STREAM "ts", "eu-west-1a-price", "eu-west-1b-price", "eu-west-1c-price",
       "ANOMALY_SCORE" AS "anomaly_score"
FROM TABLE(RANDOM_CUT_FOREST(CURSOR(
    SELECT STREAM "ts",
           avg(CASE WHEN "availabilityzone" = 'eu-west-1a' THEN "weightedprice" ELSE NULL END) OVER w1 AS "eu-west-1a-price",
           avg(CASE WHEN "availabilityzone" = 'eu-west-1b' THEN "weightedprice" ELSE NULL END) OVER w1 AS "eu-west-1b-price",
           avg(CASE WHEN "availabilityzone" = 'eu-west-1c' THEN "weightedprice" ELSE NULL END) OVER w1 AS "eu-west-1c-price"
    FROM "WEIGHTED_FAMILY_STREAM"
    WINDOW w1 AS (RANGE INTERVAL '10' MINUTE PRECEDING)),
  100, 100, 10000, 10));

Page 49: Денис Баталов

@dbatalov

Questions?

aws.amazon.com/ru

Денис Баталов, PhD, Sr. Solutions Architect, ML and AI specialist

Amazon Web Services, Luxembourg

@awsoblako aws.amazon.com/ru/blogs/rus/