Social Data and Log Analysis Using MongoDB


2011/03/01 (Tue) #mongotokyo

doryokujin

Self-Introduction

• doryokujin (Takahiro Inoue), Age: 25

• Education: Keio University, Master of Mathematics, March 2011 (maybe...)

• Major: Randomized Algorithms and Probabilistic Analysis

• Company: Geisha Tokyo Entertainment (GTE), Data Mining Engineer (only me, part-time)

• Organized Communities: MongoDB JP, Tokyo Web Mining

My Job

• I’m a Fledgling Data Scientist

• Development of analytical systems for social data

• Development of recommendation systems for social data

• My Interest: Big Data Analysis

• How to generate logs scattered across many servers

• How to store and access the data

• How to analyze and visualize billions of records

Agenda

• My Company’s Analytic Architecture

• How to Handle Access Logs

• How to Handle User Trace Logs

• How to Collaborate with Front Analytic Tools

• My Future Analytic Architecture


Of Course, Everything With: Hadoop and Mongo Map Reduce, Schema-Free documents, a REST Interface and JSON, Capped Collections and Modifier Operations

My Company’s Analytic Architecture

Social Game (Mobile): Omiseyasan

• Players enjoy arranging their own shop (and avatar)

• They communicate with other users through shopping, part-time jobs, ...

• They buy seeds of items to display in their own shop

Data Flow

[Diagram: Flash clients access the Compose Server, which emits four streams: User Game Save Data, Access Logs, User Registration / Charge, and User Trace Logs]

Back-end Architecture

[Diagram: User Registration / Charge, User Trace Logs, Access Logs, and User Game Save Data pass through Pretreatment (trimming, validation, filtering, ...) into MongoDB as a central data server, backed up to S3; PyMongo and Dumbo (Hadoop Streaming) sit on top]

Front-end Architecture

[Diagram: MongoDB serves analysis results through PyMongo and sleepy.mongoose (REST interface) up to a Web UI]

Environment

• MongoDB: 1.6.4

• PyMongo: 1.9

• Hadoop: CDH2 (updating to CDH3 soon)

• Dumbo: a simple Python module for Hadoop Streaming

• Cassandra: 0.6.11

• R, Neo4j, jQuery, Munin, ...

[Data Size (a rough estimate)]

• Access Log: 15 GB/day (gzip), 2,000M PV

• User Trace Log: 5 GB/day (gzip)

How to Handle Access Logs

[Diagram: the back-end architecture, highlighting the Access Logs path: Access Logs → Pretreatment (trimming, validation, filtering, ...) → MongoDB as a data server → backup to S3]

Access Data Flow

[Diagram: Access Logs → Pretreatment (Hadoop) → user_access → 1st Map Reduce → user_pageview → 2nd Map Reduce / Group by → daily_pageview, hourly_pageview, agent_pageview]

Caution: needs MongoDB >= 1.7.4

Hadoop

• Using Hadoop to pretreat the raw records (see the mapper sketch after the sample records below)

• [Map / Reduce]

• Read all records

• Split each record on whitespace ('\s')

• Filter out unnecessary records (such as *.swf)

• Check whether each record is well-formed

• Insert (save) the records into MongoDB

※ write operations won't yet fully utilize all cores

110.44.178.25 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetPage.html;jsessionid=9587B0309581914AB7438A34B1E51125-n15.at3?collection=12&opensocial_app_id=00000&opensocial_owner_id=00000 HTTP/1.0" 200 6773 "-" "DoCoMo/2.0 ***"

110.44.178.26 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/shopping/battle/ShoppingBattleTopPage.html;jsessionid=D901918E3CAE46E6B928A316D1938C3A-n11.ap1?opensocial_app_id=00000&opensocial_owner_id=11111 HTTP/1.0" 200 15254 "-" "DoCoMo/2.0 ***"

110.44.178.27 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetDetailPage;jsessionid=202571F97B444370ECB495C2BCC6A1D5-n14.at11?asset=53&collection=9&opensocial_app_id=00000&opensocial_owner_id=22222 HTTP/1.0" 200 11616 "-" "SoftBank/***"

...(many records)
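A minimal sketch of the pretreatment mapper (assuming Hadoop Streaming with plain Python rather than Dumbo; the regex and the emitted field layout are illustrative, not the production schema):

#!/usr/bin/env python
# mapper.py: pretreat raw access-log records (Hadoop Streaming sketch)
import re
import sys

# Combined-log-format pattern; groups: ip, time, method, path, status, size, agent
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "[^"]*" "([^"]*)"')

for line in sys.stdin:
    m = LOG_RE.match(line)
    if m is None:
        continue  # drop malformed records
    ipaddr, reqtime, method, path, status, size, agent = m.groups()
    if '.swf' in path:
        continue  # filter unnecessary records such as *.swf
    # emit a tab-separated, validated record; a later step loads it into MongoDB
    print('\t'.join([ipaddr, reqtime, method, path, status, size, agent]))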

Access Logs

> db.user_trace.find({user: "7777", date: "2011-02-12"}).limit(1)
    .forEach(printjson)
{
    "_id" : "2011-02-12+05:39:31+7777+18343+Access",
    "lastUpdate" : "2011-02-19",
    "ipaddr" : "202.32.107.166",
    "requestTimeStr" : "12/Feb/2011:05:39:31 +0900",
    "date" : "2011-02-12",
    "time" : "05:39:31",
    "responseBodySize" : 18343,
    "userAgent" : "DoCoMo/2.0 SH07A3(c500;TB;W24H14)",
    "statusCode" : "200",
    "splittedPath" : "/avatar2-gree/MyPage",
    "userId" : "7777",
    "resource" : "/avatar2-gree/MyPage;jsessionid=...?battlecardfreegacha=1&feed=...&opensocial_app_id=...&opensocial_viewer_id=...&opensocial_owner_id=..."
}

Collection: user_trace

1st Map Reduce

• [Aggregation]

• Group by url, date, userId

• Group by url, date, userAgent

• Group by url, date, time

• Group by url, date, statusCode

• Map Reduce operations run in parallel on all shards

map = Code("""

function(){

emit({

path:this.splittedPath,

userId:this.userId,

date:this.date

},1)}

""")

reduce = Code("""

function(key, values){

var count = 0;

values.forEach(function(v) {

count += 1;

});

return {"count": count, "lastUpdate": today};

}

""")

• this.userId

• this.userAgent

• this. timeRange

• this.statusCode

1st Map Reduce with PyMongo

# ( MongoDB >= 1.7.4 )
result = db.user_access.map_reduce(map,
                                   reduce,
                                   merge_output="user_pageview",
                                   full_response=True,
                                   query={"date": date})

• For the output collection there are 4 options (MongoDB >= 1.7.4):

• out : overwrite the collection if it already exists

• merge_output : merge the new data into the old output collection

• reduce_output : a reduce operation is performed on the two values (the same key in the new result and the old collection) and the result is written to the output collection

• full_response (=False) : if True, return the full stats of the operation rather than just the result collection

• inline output: no collection is created and the whole map-reduce happens in RAM; the result set must fit within the 8MB-per-document limit (16MB in 1.8?)

Map Reduce (>= 1.7.4): the out option in JavaScript

• "collectionName" : pass a string naming a collection and the output will replace any existing output collection with that name.

• { merge : "collectionName" } : merges new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one.

• { reduce : "collectionName" } : if documents exist for a given key in both the result set and the old collection, a reduce operation (using the specified reduce function) is performed on the two values and the result is written to the output collection. If a finalize function was provided, it runs after the reduce as well.

• { inline : 1 } : no collection is created and the whole map-reduce operation happens in RAM; the results are returned within the result object. Note this is only possible when the result set fits within the 8MB limit.

http://www.mongodb.org/display/DOCS/MapReduce
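For reference, later PyMongo versions express the same four choices through a single out argument rather than the 1.9-era keywords used above (a sketch; SON preserves the command's key order):

from bson.son import SON

db.user_access.map_reduce(map, reduce, out="user_pageview")                     # replace
db.user_access.map_reduce(map, reduce, out=SON([("merge", "user_pageview")]))  # merge
db.user_access.map_reduce(map, reduce, out=SON([("reduce", "user_pageview")])) # reduce
docs = db.user_access.inline_map_reduce(map, reduce)                           # inline, in RAM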

> db.user_pageview.find({
    "_id.userId": "7777",
    "_id.path": /.*MyPage$/,
    "_id.date": {$lte: "2011-02-12"}
}).limit(1).forEach(printjson)

#####

{
    "_id" : {
        "date" : "2011-02-12",
        "path" : "/avatar2-gree/MyPage",
        "userId" : "7777"
    },
    "value" : {
        "count" : 10,
        "lastUpdate" : "2011-02-19"
    }
}

• Regular expressions and range operators (<, >, <=, >=) both work in queries

Collection: user_pageview

map = Code("""

function(){

emit({

"path" : this._id.path,

"date": this._id.date,

},{

"pv": this.value.count,

"uu": 1

});

}

""")

reduce = Code("""

function(key, values){

var pv = 0;

var uu = 0;

values.forEach(function(v){

pv += v.pv;

uu += v.uu;

});

return {"pv": pv, "uu": uu};

}

""")

2nd Map Reduce with PyMongo

map = Code("""

function(){

emit({

"path" : this._id.path,

"date": this._id.date,

},{

"pv": this.value.count,

"uu": 1

});

}

""")

reduce = Code("""

function(key, values){

var pv = 0;

var uu = 0;

values.forEach(function(v){

pv += v.pv;

uu += v.uu;

});

return {"pv": pv, "uu": uu};

}

""")

The map's emit value and the reduce's return value must share the same keys (otherwise you get {"pv": NaN})

# ( MongoDB >= 1.7.4 )
result = db.user_pageview.map_reduce(map,
                                     reduce,
                                     merge_output="daily_pageview",
                                     full_response=True,
                                     query={"date": date})

> db.daily_pageview.find({
    "_id.date": "2011-02-12",
    "_id.path": /.*MyPage$/
}).limit(1).forEach(printjson)
{
    "_id" : {
        "date" : "2011-02-12",
        "path" : "/avatar2-gree/MyPage"
    },
    "value" : {
        "uu" : 53536,
        "pv" : 539467
    }
}

Collection: daily_pageview

Current Map Reduce is Imperfect

• [Single thread per node]

• Map Reduce doesn't scale across multiple threads

• [Overwrites the output collection]

• Before 1.7.4 it overwrites the old collection (no other options such as "merge" or "reduce")

# map-reduce code emulating merge output (MongoDB < 1.7.4)
result = db.user_access.map_reduce(map,
                                   reduce,
                                   full_response=True,
                                   out="temp_collection",
                                   query={"date": date})

# merge by hand: the docs share _id with the old output, so save() upserts them
for doc in db.temp_collection.find():
    db.user_pageview.save(doc)

How to Handle User Trace Logs

[Diagram: the back-end architecture, highlighting User Trace Logs and User Registration / Charge: logs → Pretreatment (trimming, validation, filtering, ...) → MongoDB as a data server → backup to S3]

User Trace / Charge Data Flow

[Diagram: User Trace Logs and User Registration / Charge → Pretreatment (Hadoop) → user_trace, user_charge → daily_trace, daily_charge]

Hadoop

• Using Hadoop to pretreat the raw records

• [Map / Reduce]

• Split each record on whitespace ('\s')

• Filter out unnecessary records

• Check whether a user behaves dishonestly

• Unify the format so records can be summed up (the raw records are written in a free format)

• Sum up records grouped by "userId" and "actionType" (see the reducer sketch below)

• Insert (save) the records into MongoDB

※ write operations won't yet fully utilize all cores
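A sketch of the sum-up step as a Hadoop Streaming reducer (assuming a hypothetical intermediate format where the mapper emits "userId+actionType<TAB>actionDetail" lines; Hadoop sorts by key before the reducer runs):

#!/usr/bin/env python
# reducer.py: count actionDetail occurrences per (userId, actionType) key
import sys
from itertools import groupby

def parse(lines):
    for line in lines:
        key, detail = line.rstrip('\n').split('\t', 1)  # key = "userId+actionType"
        yield key, detail

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    counts = {}
    for _, detail in group:
        counts[detail] = counts.get(detail, 0) + 1  # e.g. "make item ksutera": 3
    print('%s\t%s' % (key, counts))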

An Example of User Trace Log

UserId ActionType ActionDetail

----- Change -----
ActionLogger a{ChangeP} (Point,1371,1383)
ActionLogger a{ChangeP} (Point,2373,2423)

----- Get -----
ActionLogger a{GetMaterial} (syouhinnomoto,0,-1)
ActionLogger a{GetMaterial} usesyouhinnomoto
ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6)

----- Trade -----
ActionLogger a{Trade} buy 3 itigoke-kis from gree.jp:00000  # seen from the other side, this is a sale

----- Make -----
ActionLogger a{Make} make item kuronekono_n
ActionLogger a{MakeSelect} make item syouhinnomoto
ActionLogger a{MakeSelect} (syouhinnomoto,0,1)

----- PutOn/Off -----
ActionLogger a{PutOff} put off 1 ksuteras
ActionLogger a{PutOn} put 1 burokkus @2500

----- Clear/Clean -----
ActionLogger a{ClearLuckyStar} Clear LuckyItem_1 4 times

----- Gacha -----
ActionLogger a{Gacha} Play gacha with first free play: わくわくおみせ服ガチャ
ActionLogger a{Gacha} Play gacha: わくわくおみせ服ガチャ

The value of "actionDetail" must be unified into a single format

> db.user_trace.find({date: "2011-02-12",
                      actionType: "a{Make}",
                      userId: "7777"}).forEach(printjson)
{
    "_id" : "2011-02-12+7777+a{Make}",
    "date" : "2011-02-12",
    "lastUpdate" : "2011-02-19",
    "userId" : "7777",
    "actionType" : "a{Make}",
    "actionDetail" : {
        "make item ksutera" : 3,
        "make item makaron" : 1,
        "make item huwahuwamimiate" : 1,
        ...
    }
}

Collection: user_trace
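Documents of this shape can also be built incrementally with modifier operations instead of a full rewrite; a minimal PyMongo sketch (the _id scheme follows the document above; db, date, today, and the parsed fields are assumed to exist):

# fold one parsed log line into the per-user daily document
doc_id = "%s+%s+%s" % (date, user_id, action_type)  # e.g. "2011-02-12+7777+a{Make}"
db.user_trace.update(
    {"_id": doc_id},
    {"$set": {"date": date, "userId": user_id,
              "actionType": action_type, "lastUpdate": today},
     "$inc": {"actionDetail.%s" % action_detail: 1}},  # e.g. "make item ksutera"
    upsert=True)  # creates the document the first time the key is seen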

Sum up the per-user values, grouped by "actionType"

> db.daily_trace.find({
    date: {$gte: "2011-02-12", $lte: "2011-02-19"},
    actionType: "a{Make}"}).forEach(printjson)
{
    "_id" : "2011-02-12+group+a{Make}",
    "date" : "2011-02-12",
    "lastUpdate" : "2011-02-19",
    "actionType" : "a{Make}",
    "actionDetail" : {
        "make item kinnokarakuridokei" : 615,
        "make item banjo-" : 377,
        "make item itigoke-ki" : 135904,
        ...
    },
    ...
}
...

Collection: daily_trace

User Charge Log

// Top 10 users by charge amount on 2011-02-12
> db.user_charge.find({date: "2011-02-12"})
    .sort({totalCharge: -1}).limit(10).forEach(printjson)
{
    "_id" : "2011-02-12+7777+Charge",
    "date" : "2011-02-12",
    "lastUpdate" : "2011-02-19",
    "totalCharge" : 10000,
    "userId" : "7777",
    "actionType" : "Charge",
    "boughtItem" : {
        "アクセサリーの素EX" : 13,
        "コネルギー+6000" : 3,
        "アクセサリーの素PRO" : 20
    }
}
{…

Collection: user_charge

Sum up the values grouped by "actionType" over all users

> db.daily_charge.find({date: "2011-02-12", T: "all"})
    .limit(10).forEach(printjson)
{
    "_id" : "2011-02-12+group+Charge+all+all",
    "date" : "2011-02-12",
    "total" : 100000,
    "UU" : 2000,
    "group" : {
        "わくわくポイント" : 1000000,
        "アクセサリー" : 1000000, ...
    },
    "boughtItemNum" : {
        "料理の素EX" : 8,
        "アクセサリーの素" : 730, ...
    },
    "boughtItem" : {
        "料理の素EX" : 10000,
        "アクセサリーの素" : 100000, ...
    }
}

Collection: daily_charge

Categorize Users

[Diagram: user_trace, user_charge, user_savedata, and user_pageview each carry user attribution and feed user_registration and user_category]

• [Categorize Users]

• by play term

• by total amount of charge

• by registration date

• [Take a snapshot of each category's stats per week]

> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
    "_id" : "2010-06-29+7777+Registration",
    "userId" : "7777",
    "actionType" : "Registration",
    "category" : {
        "R1" : "True", # whether the user has resigned
        "T" : "ll" # play-term category
        ...
    },
    "firstCharge" : "2010-07-07", # date of first charge
    "lastLogin" : "2010-09-30", # date of last access
    "playTerm" : 94,
    "totalCumlativeCharge" : 50000, # total charge amount
    "totalMonthCharge" : 10000, # total charge in the most recent month
    ...
}

Collection: user_registration

Tagging Users

> var cross = new Cross() // user-defined function
> MCResign = cross.calc("2011-02-12", "MC", 1)
// each cell is the number of users

Charge (yen) / Term (day):

              0(z)   ~¥1k(s)  ~¥10k(m)  ¥100k~(l)   total
~1day(z)     50000        10         5          0   50015
~1week(s)    50000       100        50          3   50153
~1month(m)  100000       200       100          1  100301
~3month(l)  100000       300        50          6  100356
3month~(ll)      0         0         0          0       0

Collection: user_category
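Cross() is a user-defined helper; a rough PyMongo equivalent of the charge-by-term cross tabulation (a sketch: the z/s/m/l/ll codes follow the table above, and "category.MC" is a hypothetical tag name for the charge class):

# count users in each (play term, charge class) cell
term_codes = ["z", "s", "m", "l", "ll"]
charge_codes = ["z", "s", "m", "l"]
cross = {}
for t in term_codes:
    for c in charge_codes:
        cross[(t, c)] = db.user_registration.find(
            {"category.T": t, "category.MC": c}).count()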

How to Collaborate With Front Analytic Tools

Front-end Architecture

[Diagram: MongoDB → PyMongo / sleepy.mongoose (REST interface) → Web UI]

Web UI and Mongo

[Data Table] jQuery.DataTables

• Want to share a daily summary

• Want to see the data from many viewpoints

• Want an easy implementation

• jQuery.DataTables:

1. Variable-length pagination

2. On-the-fly filtering

3. Multi-column sorting with data type detection

4. Smart handling of column widths

5. Scrolling options for the table viewport

6. ...

[Graph] jQuery.HighCharts

• Want to visualize the data

• Mainly handles time-series data

• Want an easy implementation

• jQuery.HighCharts:

1. Numerous Chart Types

2. Simple Configuration Syntax

3. Multiple Axes

4. Tooltip Labels

5. Zooming

6. ...

sleepy.mongoose

• [REST Interface + Mongo]

• Get data via HTTP GET/POST requests

• sleepy.mongoose

‣ request format: "/db_name/collection_name/_command"

‣ made by a 10gen engineer: @kchodorow

‣ Sleepy.Mongoose: A MongoDB REST Interface

// start the server
> python httpd.py
...listening for connections on http://localhost:27080

// connect to MongoDB
> curl --data server=localhost:27017 'http://localhost:27080/_connect'

// request example
> http://localhost:27080/playshop/daily_charge/_find?criteria={}&limit=10&batch_size=10
{"ok": 1, "results": [{"_id": "…", "date": …}, {"_id": …}], "id": 0}

sleepy.mongoose

[Diagram: JSON flows both ways between Mongo and Ajax through sleepy.mongoose (REST interface)]

• The jQuery library and MongoDB are compatible: both speak JSON

• No need to build HTML tags (such as <table>) by hand

Example: Web UI

R and Mongo

(see the user_registration document shown earlier)

Want to know the relations between user attributes

##### LOAD LIBRARY #####

library(RCurl)

library(rjson)

##### CONF #####

today.str <- format(Sys.time(), "%Y-%m-%d")

url.base <- "http://localhost:27080"

mongo.db <- "playshop"

mongo.col <- "user_registration"

mongo.base <- paste(url.base, mongo.db, mongo.col, sep="/")

mongo.sort <- ""

mongo.limit <- "limit=100000"

mongo.batch <- "batch_size=100000"

R Code: Access MongoDB Using sleepy.mongoose

##### FUNCTION #####

find <- function(url){
    mongo <- fromJSON(getURL(url))
    docs <- mongo$results
    makeTable(docs) # my own function
}

# Example
# Using sleepy.mongoose https://github.com/kchodorow/sleepy.mongoose

mongo.criteria <- "_find?criteria={\"totalCumlativeCharge\":{\"$gt\":0,\"$lte\":1000}}"

mongo.query <- paste(mongo.criteria, mongo.sort,
                     mongo.limit, mongo.batch, sep="&")

url <- paste(mongo.base, mongo.query, sep="/")

user.charge.low <- find(url)


# Result: the 10th document

[[10]]
[[10]]$playTerm
[1] 31

[[10]]$lastUpdate
[1] "2011-02-24"

[[10]]$userId
[1] "7777"

[[10]]$totalCumlativeCharge
[1] 10000

[[10]]$lastLogin
[1] "2011-02-21"

[[10]]$date
[1] "2011-01-22"

[[10]]$`_id`
[1] "2011-02-12+18790376+Registration"

...

The Result

# Result: translate the documents into a table

      playTerm totalWinRate totalCumlativeCharge totalCommitNum totalWinNum
 [1,]       56           42                 1000            533         224
 [2,]       57           33                 1000            127          42
 [3,]       57           35                 1000            654         229
 [4,]       18           31                 1000             49          15
 [5,]       77           35                 1000            982         345
 [6,]       77           45                 1000            339         153
 [7,]       31           44                 1000             70          31
 [8,]       76           39                 1000            229          89
 [9,]       40           21                 1000            430          92
[10,]       26           40                 1000             25          10
...

Make a Data Table from the Result

[Figure: scatter plot matrix for each category (user attribution)]

# Run as a batch command
$ R --vanilla --quiet < mongo2R.R

Munin and MongoDB

My Future Analytic Architecture

[Diagram: Access Logs and User Trace Logs flow through Flume into hourly capped collections; trimming, filtering, and sum-up (Map Reduce and modifier operations) produce user_access / user_trace and daily/hourly_access / daily/hourly_trace in near-real time (hourly)]

Realtime Analysis with MongoDB
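A minimal sketch of the hourly sum-up against a capped collection (assuming a PyMongo version whose find() accepts tailable=True; a tailable cursor keeps following the capped collection as Flume appends, and modifier operations fold each event into an hourly counter; collection and field names are illustrative):

# follow the capped collection and maintain hourly page-view counters
cursor = db.flume_capped.find(tailable=True)
while cursor.alive:
    for doc in cursor:
        hour = doc["timestamp"][:13]  # hour bucket; slicing depends on the timestamp format
        db.hourly_access.update(
            {"_id": hour},
            {"$inc": {"pv": 1}},  # no read-modify-write round trip needed
            upsert=True)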

Flume

[Diagram: Servers A-F send Access Logs and User Trace Logs through Flume collectors into MongoDB via the Flume plugin, hourly or in real time]

> db.flume_capped_21.find().limit(1).forEach(printjson)
{
    "_id" : ObjectId("4d658187de9bd9f24323e1b6"),
    "timestamp" : "Wed Feb 23 2011 21:52:06 GMT+0000 (UTC)",
    "nanoseconds" : NumberLong("562387389278959"),
    "hostname" : "ip-10-131-27-115.ap-southeast-1.compute.internal",
    "priority" : "INFO",
    "message" : "202.32.107.42 - - [14/Feb/2011:04:30:32 +0900] \"GET /avatar2-gree.4d537100/res/swf/avatar/18051727/5/useravatar1582476746.swf?opensocial_app_id=472&opensocial_viewer_id=36858644&opensocial_owner_id=36858644 HTTP/1.1\" 200 33640 \"-\" \"DoCoMo/2.0 SH01C(c500;TB;W24H16)\"",
    "metadata" : {}
}

An Output From the Mongo-Flume Plugin

Mongo Flume Plugin: https://github.com/mongodb/mongo-hadoop/tree/master/flume_plugin

Summary


• Almighty as an Analytic Data Server

• schema-free: social game data keep changing

• rich queries: important for analyzing from many points of view

• powerful aggregation: map reduce

• mongo shell: analyzing from the mongo shell is speedy and handy

• More...

• Scalability: replication and sharding are very easy to set up

• Node.js: enables server-side scripting with Mongo

My Presentations

・"Social App Log Analysis Using MongoDB: making the most of MongoDB, from building the analytics platform to the front-end UI":
http://www.slideshare.net/doryokujin/mongodb-uimongodb

・"An Analytics Front-End Built with MongoDB and Ajax & Social Data Analysis Using a GraphDB":
http://www.slideshare.net/doryokujin/mongodbajaxgraphdb-5774546

・"Social App Log Analysis Using Hadoop and MongoDB":
http://www.slideshare.net/doryokujin/hadoopmongodb

・"A Thorough Introduction to GraphDB: from structure and internals to use cases and a comparison of various GraphDBs":
http://www.slideshare.net/doryokujin/graphdbgraphdb

I ♥ MongoDB JP

• continue to be an organizer of MongoDB JP

• continue to propose many use cases of MongoDB

• ex: Social Data, Log Data, Medical Data, ...

• support MongoDB users

• by document translation, user-group, IRC, blog, book, twitter,...

• boosting services and products using MongoDB

[Contact me]
twitter: doryokujin
skype: doryokujin
mail: mr.stoicman@gmail.com
blog: http://d.hatena.ne.jp/doryokujin/
MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja

Thank you for coming to Mongo Tokyo!!