COLLECTING BIG DATA WITH S3/CLOUDFRONT LOGGING Moty Michaely, VP R&D Xplenty Data Integration-as-a-Service

There are several ways of collecting big data; one of the most promising is S3/CloudFront logging. It's low cost and quick to implement. Let's dive in and see how to set up S3/CloudFront logging with your application.


  • In our recent article, Scale Your Data Collection on the Cloud Like a Champ, we reviewed several ways of collecting big data, the most promising of which was S3/CloudFront logging. It's low cost and quick to implement. Now we'd like to dig deeper and show how to set up S3/CloudFront logging with your application.
  • DEFINE APP DATA Sit back and think: which data would you like to collect? Which app events should be logged? These could be page visits, mouse clicks, logins, errors, etc. Some of them may include parameters, such as the URL of a page visit. Write them all down. Be as thorough as possible so you don't lose any precious data.
  • CREATE AN AWS ACCOUNT If you don't already have an AWS (Amazon Web Services) account, you can sign up here. Registration is free with the basic support package.
  • CREATE AN S3 BUCKET Go to the S3 dashboard and create a bucket for saving the logs. Note that the bucket name must be unique across all of Amazon S3 and adhere to DNS rules: 3-63 characters; only lowercase letters, numbers, hyphens, and periods; no underscores; and it shouldn't look like an IP address. Don't turn on logging, as we will do so via CloudFront. (See the screenshot on the next slide for a visual explanation.)
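Before opening the dashboard, it can help to check a candidate bucket name in code. The following is only a minimal sketch: `isValidBucketName` is a hypothetical helper (not part of any AWS SDK), and it assumes S3's general-purpose bucket naming rules, which allow lowercase letters, digits, hyphens, and periods. S3 has further restrictions (reserved prefixes, for instance) that this sketch does not cover.

```javascript
// Hypothetical helper: checks a candidate bucket name against the rules
// described above (length, allowed characters, not shaped like an IP address).
function isValidBucketName(name) {
  // 3-63 characters overall
  if (name.length < 3 || name.length > 63) return false;
  // Lowercase letters, digits, hyphens, and periods only;
  // the name must start and end with a letter or digit
  if (!/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/.test(name)) return false;
  // No empty DNS labels such as "my..bucket"
  if (name.includes("..")) return false;
  // Must not look like an IPv4 address, e.g. 192.168.5.4
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(name)) return false;
  return true;
}

console.log(isValidBucketName("xplenty-event-logs")); // true
console.log(isValidBucketName("my_bucket"));          // false (underscore)
console.log(isValidBucketName("192.168.5.4"));        // false (looks like an IP)
```

Passing this check does not guarantee that the name is still available; S3 only reports that when you actually create the bucket.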
  • CREATE EVENT IMAGES Set up directories in the image bucket, for example /mouse, to organize events by category, and create 1x1 pixel images (see previous post) for all the events that you defined in the first step, e.g. click.png, login.png, error.png. Don't worry about event parameters at the moment; we will deal with them shortly. All files uploaded to S3 are private by default, so make sure to change the file permissions to public. You may use tools such as CloudBerry Explorer or S3 Browser to do so and much more.
  • CREATE EVENT IMAGES CONT. Set HTTP headers for all the images so that they will be cached by CloudFront, thus saving GET requests from CloudFront edge locations to S3. Go to the relevant bucket, check the image files on the left, click Actions at the top, choose Properties, and open the Metadata section. Add the following metadata line and click save: Cache-Control: max-age=31536000
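The max-age value is expressed in seconds, and 31536000 is simply one 365-day year. A small sketch of the arithmetic, with illustrative variable names:

```javascript
// Derive the Cache-Control value used above: one 365-day year in seconds.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86400
const maxAge = 365 * SECONDS_PER_DAY;
const cacheControl = "max-age=" + maxAge;

console.log(cacheControl); // max-age=31536000
```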
  • CREATE A CLOUDFRONT DISTRIBUTION Creating a CloudFront distribution costs extra, but it's mandatory: it logs the query string, adds extra log info such as edge locations, and helps to deliver files via Amazon's CDN to shorten load times. Access the CloudFront dashboard and create a web distribution for the image S3 bucket. Make sure that Use Origin Cache Headers is set under Object Caching (it's the default setting).
  • CREATE A CLOUDFRONT DISTRIBUTION CONT. Note that the distribution gets a random domain name. It could take a while before it starts working because the DNS servers need to be updated to support it. You can also set a friendlier domain using the Alternate Domain Names (CNAMEs) option under Distribution Settings, though this requires configuring your DNS settings so that your domain points to CloudFront's domain name. See Amazon's documentation for more info.
  • TURN LOGGING ON Still in the CloudFront dashboard, check the distribution on the left, click Distribution Settings at the top, click Edit under the General tab, enable logging, and enter the bucket where you want to store the logs.
  • CODE A FUNCTION TO CALL EVENTS Time to get your hands dirty and write a method that registers events, or call one of your app's developers to do it for you. The code could be on the client side, the server side, or both, depending on the architecture. The method should simply send an asynchronous HTTP GET request to the relevant image URL, e.g. to http://logs.xplenty.com/mouse/click.png (links in this format are for demo purposes only and are not operational). If you need to send additional event parameters, use the query string (don't forget URL encoding), e.g. http://logs.xplenty.com/mouse/click.png?id=login&url=http%3A%2F%2Fwww.example.com%2Flogin
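To illustrate the URL encoding step on its own, here is a rough sketch in plain JavaScript that only builds the event URL and leaves sending the request to the caller. `buildEventUrl` and `BASE_URL` are hypothetical names, and the base URL is the non-operational demo domain used above.

```javascript
// Hypothetical base URL taken from the demo links above (not operational).
const BASE_URL = "http://logs.xplenty.com";

// Builds "<base>/<category>/<action>.png?<encoded params>".
function buildEventUrl(category, action, params) {
  const query = Object.keys(params || {})
    // Skip parameters that were not provided
    .filter((key) => params[key] !== undefined && params[key] !== null)
    // URL-encode both names and values
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])))
    .join("&");
  const path = "/" + encodeURIComponent(category) + "/" + encodeURIComponent(action) + ".png";
  return BASE_URL + path + (query ? "?" + query : "");
}

console.log(buildEventUrl("mouse", "click", { id: "login", url: "http://www.example.com/login" }));
// http://logs.xplenty.com/mouse/click.png?id=login&url=http%3A%2F%2Fwww.example.com%2Flogin
```

Keeping URL construction separate from the request makes it easy to reuse the same function on the client and on the server.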
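  • EXAMPLE CODE TO CALL EVENTS
      // Requires jQuery. Sends a GET request to the event image URL;
      // $.get URL-encodes the data object into the query string.
      $.CloudFrontLog = function (attr) {
        var url = 'http://logs.xplenty.com/' + attr.category + '/' + attr.action + '.png',
            data = { id: attr.id, url: attr.url };
        return $.get(url, data);
      };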
  • CALL THE EVENTS Dig through your app's code and add event calls using the method that you've just written. This will collect the data that you defined in step 1. Here's a jQuery code sample for logging client-side button clicks:
      // Log a "mouse/click" event for every element with the .btn class
      $('.btn').click(function (e) {
        var id = $(this).attr('id');
        $.CloudFrontLog({
          action: 'click',
          category: 'mouse',
          id: id,
          url: location.href
        });
      });
  • TEST Use your staging environment to call events via the application and check that the logs are generated accordingly. Patience, young padawan: it may take an hour or so until Amazon writes them.
  • GO LIVE! Everything should be ready for you to collect big data like a champ - update the production environment and let the logging begin. Don't know what to do with the data? See how to analyze AWS logs in 15 minutes.
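Once log files start arriving, a quick way to confirm that events and their parameters were recorded is to parse a few lines. The sketch below assumes CloudFront's standard log layout: tab-separated values preceded by a `#Fields:` header line that names the columns. It reads the column names from that header rather than hard-coding their order. `parseCloudFrontLog` is a hypothetical helper, and the sample text is made up and heavily abbreviated (real log files are gzip-compressed and contain many more fields). Depending on how CloudFront encodes special characters in the query field, extra decoding may be needed.

```javascript
// Hypothetical helper: turns the text of a (decompressed) log file into
// one object per request, keyed by the names from the "#Fields:" header.
function parseCloudFrontLog(text) {
  let fields = [];
  const events = [];
  for (const line of text.split("\n")) {
    if (line.startsWith("#Fields:")) {
      // Header line: column names separated by spaces
      fields = line.slice("#Fields:".length).trim().split(/\s+/);
    } else if (line.trim() !== "" && !line.startsWith("#")) {
      // Data line: values separated by tabs
      const values = line.split("\t");
      const record = {};
      fields.forEach((name, i) => { record[name] = values[i]; });
      // "-" marks an empty query string
      const query = record["cs-uri-query"];
      record.params = query && query !== "-"
        ? Object.fromEntries(new URLSearchParams(query))
        : {};
      events.push(record);
    }
  }
  return events;
}

// Made-up, abbreviated sample for illustration only
const sample = [
  "#Version: 1.0",
  "#Fields: date time cs-uri-stem sc-status cs-uri-query",
  ["2014-01-01", "12:00:00", "/mouse/click.png", "200", "id=login"].join("\t"),
].join("\n");

const events = parseCloudFrontLog(sample);
console.log(events[0]["cs-uri-stem"], events[0].params.id); // /mouse/click.png login
```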