Create a serverless architecture for data collection with Python and AWS 9 Apr 2017 David Santucci

Create a serverless architecture for data collection with Python and AWS

9 Apr 2017

David Santucci

About me

David Santucci
Data scientist @ CloudAcademy.com

@davidsantucci

linkedin.com/in/davidsantucci/

Agenda

• Introduction

• Architecture

• Amazon Kinesis Stream

• Amazon Lambda

• Dead Letter Queue (DLQ)

• Conclusions

• Q&A

Introduction

Challenges:

• Collect events from different sources
  • Backend applications
  • Frontend applications
  • Mobile apps

• Store events to different destinations
  • Data Warehouse
  • Third-party services (e.g., Hubspot, Mixpanel, GTM, …)

• Avoid data loss

A serverless architecture

AWS services:

• Kinesis Stream

• Lambda Functions

• SQS

• S3

• Amazon API Gateway

Manage events from multiple sources

Amazon Kinesis Stream

What is Amazon Kinesis Stream?

• Collect and process large streams of data records in real time.

• Typical scenarios for using Streams:
  • Manage multiple producers that push their data feed directly into a stream;
  • Collect real-time analytics and metrics;
  • Process application logs;
  • Create a pipeline with other AWS services (the consumers).

    from time import gmtime, strftime

    import boto3

    client = boto3.client(
        service_name="kinesis",
        region_name="us-east-1",
    )

    # Send 300 test events to the stream, one record per call.
    for i in range(300):
        print("sending event {}".format(i + 1))
        response = client.put_record(
            StreamName="data-collection-stream",
            Data='{"name":"event-%d","data":{"payload":%d}}' % (i, i),
            # Timestamp-based key: it changes once per second.
            PartitionKey=strftime("PK-%Y%m%d-%H%M%S", gmtime()),
        )
        print("response for event {}: {}".format(i + 1, response))

Amazon Kinesis Stream

Amazon Kinesis Stream - Tips

• Use API Gateway as entry point for front-end and mobile.

• Start with a single shard and increase only when needed.

• Output events one by one to avoid data loss.

• Generate PartitionKey using uuid (e.g., for test purposes).
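
The last tip can be sketched as follows. This is a minimal example, not the code used in the talk: `build_record` is a hypothetical helper, and the stream name is simply the one from the previous slide. A random UUID per record spreads records across shards, which is convenient for tests but means related events no longer share a key.

```python
import json
import uuid


def build_record(event, stream_name="data-collection-stream"):
    """Build the keyword arguments for a Kinesis put_record call.

    Hypothetical helper: a random UUID is used as PartitionKey, so
    records are spread across shards instead of following one key.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(uuid.uuid4()),
    }


record = build_record({"name": "event-1", "data": {"payload": 1}})
print(record["PartitionKey"])

# With AWS credentials configured, the record could then be sent with:
#   import boto3
#   boto3.client("kinesis", region_name="us-east-1").put_record(**record)
```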

Amazon Lambda

What is AWS Lambda?

• It processes a single event in real time, without managing servers.

• Highly scalable.

• Fallback strategy in case of errors.

Amazon Lambda - Events routing

Amazon Lambda - Events routing

It works as a router and is triggered directly by Kinesis Streams.

    [
      {
        "destination_name": "mixpanel",
        "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod",
        "enabled_events": [
          "page_view", "search", "button_click", "page_scroll"
        ]
      },
      {
        "destination_name": "hubspotcrm",
        "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod",
        "enabled_events": [
          "login", "logout", "registration", "page_view", "search",
          "email_sent", "email_open"
        ]
      },
      {
        "destination_name": "datawarehouse",
        "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod",
        "enabled_events": [
          "login", "logout", "registration", "page_view", "search",
          "button_click", "page_scroll", "email_sent", "email_open"
        ]
      }
    ]

Amazon Lambda - Events routing

    …
      {
        "destination_name": "datawarehouse",
        "destination_arn": "arn:aws:lambda:region:id:function:name:prod",
        "enabled_events": [
          "login", "logout", "registration", "page_view", "search",
          "button_click", "page_scroll", "email_sent", "email_open"
        ]
      }

Amazon Lambda - Events routing
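
How such rules might be applied can be sketched in a few lines. This is an illustration under assumptions, not the router from the talk: `destinations_for` is a hypothetical helper, the rules are a shortened copy of the ones above, and the event name is assumed to match an entry of `enabled_events` exactly.

```python
import json

# Shortened, hypothetical copy of the routing rules shown above.
ROUTING_RULES = json.loads("""
[
  {"destination_name": "mixpanel",
   "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod",
   "enabled_events": ["page_view", "search"]},
  {"destination_name": "datawarehouse",
   "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod",
   "enabled_events": ["login", "page_view", "search"]}
]
""")


def destinations_for(event_name, rules):
    """Return the names of the destinations enabled for event_name."""
    return [
        rule["destination_name"]
        for rule in rules
        if event_name in rule["enabled_events"]
    ]


print(destinations_for("page_view", ROUTING_RULES))  # ['mixpanel', 'datawarehouse']
print(destinations_for("login", ROUTING_RULES))      # ['datawarehouse']
```

Keeping the rules as plain data means a destination can be enabled or disabled for an event type without redeploying the router.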

Amazon Lambda - events routing

Amazon Lambda - events routing

It provides the logic to connect to the destination services (e.g., HubSpot, Mixpanel, etc.). It uses a custom retry strategy (with exponential delay).

Amazon Lambda - Retry strategy

    import os
    import urllib.parse
    import urllib.request

    # retry and DoNotRetryException are defined on the next slide.


    def lambda_handler(event, context=None):
        try:
            hub_id = os.environ['HUBSPOT_HUB_ID']
        except KeyError:
            # A missing configuration cannot be fixed by retrying.
            raise DoNotRetryException('HUBSPOT_HUB_ID')
        event = format_event_data(event, hub_id)
        process_event(event['data'])
        return "ok"


    def format_event_data(event, hub_id):
        # Last dotted part of the name, e.g. "page_view" becomes "Page View".
        event_id = event["name"].split(".")[-1].replace("_", " ").title()
        event['data'].update({
            '_a': hub_id,
            '_n': event_id,
            'email': event['data']['_email'],
        })
        return event


    @retry
    def process_event(params):
        url = 'http://track.hubspot.com/v1/event?{}'.format(
            urllib.parse.urlencode(params))
        urllib.request.urlopen(url)

Amazon Lambda - Retry strategy

    import functools
    import time


    def retry(func, max_retries=3, backoff_rate=2, scale_factor=.1):
        @functools.wraps(func)
        def func_wrapper(*args, **kwargs):
            attempts = 0
            while True:
                attempts += 1
                try:
                    return func(*args, **kwargs)
                except DoNotRetryException:
                    # Permanent failure: give up immediately.
                    raise
                except Exception:
                    if attempts >= max_retries:
                        # Out of attempts: propagate the last error.
                        raise
                    # Exponential delay: 0.2s, 0.4s, ... with the defaults.
                    time.sleep(backoff_rate ** attempts * scale_factor)
        return func_wrapper


    class DoNotRetryException(Exception):
        """Raised for errors that must not be retried."""

Amazon Lambda - Our tips

• Enable Kinesis Stream as a trigger for other AWS services.

• To preserve the priority, configure the trigger with Batch size: 1 and Starting position: Trim Horizon.

• An S3 file can be used to define the routing rules.

• Invoke the Lambda Functions that work as connectors asynchronously.

• Always create aliases and versions for each Function.

• Use environment variables for configurations.

• Create a custom IAM role for each Function.

• Detect delays in stream processing by monitoring the IteratorAge metric in the Lambda console’s monitoring tab.
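
Several of these tips can be combined in a router sketch. This is a rough outline under assumptions, not the function used in the talk: `load_rules`, `route_records`, and the `RULES_BUCKET` / `RULES_KEY` environment variables are hypothetical names, and the event name is assumed to match `enabled_events` exactly. The boto3 calls (`get_object`, and `invoke` with `InvocationType="Event"` for asynchronous invocation) are standard; clients are passed in so the routing logic can be exercised without AWS.

```python
import base64
import json
import os


def load_rules(s3_client, bucket, key):
    """Read the routing rules (a JSON list) from an S3 object."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


def route_records(kinesis_event, rules, lambda_client):
    """Forward each Kinesis record to every destination enabled for it.

    Returns the number of asynchronous invocations issued.
    """
    invocations = 0
    for record in kinesis_event["Records"]:
        # Kinesis delivers the record payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        event = json.loads(payload)
        for rule in rules:
            if event["name"] in rule["enabled_events"]:
                # InvocationType="Event" makes the call asynchronous.
                lambda_client.invoke(
                    FunctionName=rule["destination_arn"],
                    InvocationType="Event",
                    Payload=payload,
                )
                invocations += 1
    return invocations


def lambda_handler(event, context=None):
    # boto3 is imported here so the helpers above stay usable offline.
    import boto3

    rules = load_rules(
        boto3.client("s3"),
        os.environ["RULES_BUCKET"],  # hypothetical variable names
        os.environ["RULES_KEY"],
    )
    return route_records(event, rules, boto3.client("lambda"))
```

Because the connectors are invoked asynchronously, a failure in one destination does not block the router or the other destinations.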

Dead Letter Queues (DLQ) - Avoid event loss

DLQ - Simple Queue Service (SQS)

What is AWS SQS?

• Lambda automatically retries failed executions for asynchronous invocations.

• Configure Lambda (advanced settings) to forward payloads that were not processed to a dead-letter queue (an SQS queue or an SNS topic).

• We used an SQS queue.
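
The same setting can also be applied from Python instead of the console. The sketch below assumes boto3's Lambda client and its `update_function_configuration` call with a `DeadLetterConfig` target; `set_dead_letter_queue` is a hypothetical wrapper, and the function name and queue ARN in the comment are placeholders.

```python
def set_dead_letter_queue(lambda_client, function_name, queue_arn):
    """Point the function's dead-letter queue at the given SQS queue ARN."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": queue_arn},
    )


# With AWS credentials configured (names below are placeholders):
#   import boto3
#   set_dead_letter_queue(
#       boto3.client("lambda"),
#       "hubspot-connector",
#       "arn:aws:sqs:us-west-2:123456789012:hubspot-connector-dlq",
#   )
```

The function's execution role also needs permission to send messages to the queue, otherwise the configuration is rejected.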

    import json

    import boto3


    def get_events_from_sqs(
            sqs_queue_name,
            region_name='us-west-2',
            purge_messages=False,
            backup_filename='backup.jsonl',
            visibility_timeout=60):
        """
        Create a json backup file of all events in the SQS queue
        with the given 'sqs_queue_name'.

        :sqs_queue_name: the name of the AWS SQS queue to be read via boto3
        :region_name: the region name of the AWS SQS queue to be read via boto3
        :purge_messages: True if messages must be deleted after reading,
            False otherwise
        :backup_filename: the name of the file where to store all SQS messages
        :visibility_timeout: period of time in seconds (unique consumer window)
        :return: the number of processed batch of events
        """
        forwarded = 0
        counter = 0
        sqs = boto3.resource('sqs', region_name=region_name)
        dlq = sqs.get_queue_by_name(QueueName=sqs_queue_name)
        # continues to next slide ..

Amazon Lambda - Events routing

Amazon Lambda - Events routing

        # continues from previous slide ..
        with open(backup_filename, 'a') as filep:
            while True:
                batch_messages = dlq.receive_messages(
                    MessageAttributeNames=['All'],
                    MaxNumberOfMessages=10,
                    WaitTimeSeconds=20,
                    VisibilityTimeout=visibility_timeout,
                )
                if not batch_messages:
                    # Nothing arrived within the long-poll window: stop.
                    break
                for msg in batch_messages:
                    try:
                        line = "{}\n".format(json.dumps({
                            'attributes': msg.message_attributes,
                            'body': msg.body,
                        }))
                        print("Line: ", line)
                        filep.write(line)
                        if purge_messages:
                            print('Deleting message from the queue.')
                            msg.delete()
                        forwarded += 1
                    except Exception as ex:
                        print("Error in processing message %s: %r" % (msg, ex))
                counter += 1
                print('Batch %d processed' % counter)
        return counter

DLQ - Our tips

• Set a DLQ on each Lambda Function that can fail.

• Re-process events sent to DLQ with a custom script.

• Tune DLQ config directly from Lambda Function panel.
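
A re-processing script could look roughly like this. It is a sketch under assumptions, not the script from the talk: `replay_backup` is a hypothetical helper that reads a JSON-lines backup file with one `{"attributes": ..., "body": ...}` object per line, and it assumes each `body` holds the original event as a JSON string.

```python
import json


def replay_backup(backup_filename, handler):
    """Re-process every message stored in a JSON-lines backup file.

    `handler` is any callable taking the decoded message body, for
    example the lambda_handler of the connector that failed.
    Returns a (succeeded, failed) tuple.
    """
    succeeded = failed = 0
    with open(backup_filename) as filep:
        for line in filep:
            if not line.strip():
                continue  # skip blank lines
            message = json.loads(line)
            try:
                handler(json.loads(message["body"]))
                succeeded += 1
            except Exception as ex:
                print("Failed to re-process message: %r" % ex)
                failed += 1
    return succeeded, failed
```

Counting failures instead of stopping at the first one lets a single run drain everything that can be recovered and report what still needs attention.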

Conclusions

Why a serverless architecture?

• Scalability

• Prevent data loss

• Full control on each step

• Costs

Open points:

• Integrate a custom CloudWatch dashboard.

• Configure Firehose for a backup.

• Write a script that manages events sent to DLQs.

• Create a listener for anomaly detection with Kinesis Analytics.

• AWS Step Functions.

WE’RE HIRING!

clad.co/fullstack-dev

Useful links

These slides: Create a serverless architecture for data collection with Python and AWS -> http://clda.co/pycon8-serverless-data-collection

Blog post with code snippets: Building a serverless architecture for data collection with AWS Lambda -> http://clda.co/pycon8-data-collection-blogpost

Serverless Learning Path: Getting Started with Serverless Computing -> http://clda.co/pycon8-serverless-LP

Thank you :)

cloudacademy.com