Using Browsing Behavior Log to Predict User’s Gender Rick, Kent, Koi

Using browsing behavior history to predict user’s gender presentation


Page 1

Using Browsing Behavior Log to Predict User’s Gender

Rick, Kent, Koi

Page 2

Overview
● Huge data burns money (it really does burn money!)
  o 28 million PV / day
  o 7.7 million UV / day
  o 4.4 billion articles in total
  o 4.7 million registered users in total
● Only 2% log in. Who are the other 98%?

Page 3

Problem Definition
• Use the browsing history of only the 2% of logged-in users to predict the gender of the other 98%

[Diagram] Train model: Training Data → Model
[Diagram] Use model to predict: Unknown Cookie → Model → Gender Result

Page 4

Training Flow

[Diagram] Raw Log → Selection → Target Data → Preprocessing → Transformation → Transformed Data → Data Mining → Pattern

Training data selection: take the browsing records of logged-in users from the last three months, keeping only users who viewed at least two different articles.

Use the Naïve Bayes algorithm to build the prediction model.

• Feature Extraction
• Feature Selection
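The selection rule for training data (logged-in users who viewed at least two different articles) can be sketched as follows. This is only an illustration: the function name, the `(user_id, article_id)` log format, and the sample rows are assumptions, and the log is assumed to be already restricted to the three-month window.

```python
from collections import defaultdict


def select_training_users(click_log, min_distinct_articles=2):
    """Return ids of users who viewed at least `min_distinct_articles` different articles.

    click_log: iterable of (user_id, article_id) pairs for logged-in users
    (hypothetical format, assumed pre-filtered to the last three months).
    """
    articles_by_user = defaultdict(set)
    for user_id, article_id in click_log:
        articles_by_user[user_id].add(article_id)
    return {
        user_id
        for user_id, articles in articles_by_user.items()
        if len(articles) >= min_distinct_articles
    }


# Hypothetical sample: u1 saw two articles, u2 saw the same one twice, u3 saw one.
log = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u2", "a1"), ("u3", "a3")]
selected = select_training_users(log)
print(selected)
```

Repeated views of the same article do not count twice because the articles are collected in a set per user.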

Page 5

Prediction Flow

[Diagram] Raw Log → Selection → Predict Data → Preprocessing → Transformation → Transformed Data → Naive Bayes Pattern

Take the browsing records of non-logged-in users from the last three months, which make up about 98% of all site data, and use the Naïve Bayes algorithm to predict gender.

Page 6

Naive Bayes Formula

Roughly speaking, it boils down to seeing which one shows up more often!!
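The formula image on this slide did not survive extraction. For reference, the standard Naive Bayes decision rule with a Bernoulli likelihood (the variant linked on the following slide) can be written as below. The symbols $x_i$ (binary feature $i$), $p_{iy}$ (probability that feature $i$ is present for class $y$), and $n$ (number of features) are notation chosen here, not taken from the slide.

```latex
\hat{y} = \arg\max_{y \in \{\text{male},\, \text{female}\}} P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = p_{iy}^{\,x_i} \, (1 - p_{iy})^{1 - x_i}
```

Since $p_{iy}$ is estimated from how often feature $i$ occurs among users of class $y$, the class whose users show the observed features more often wins.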

Page 7

Naive Bayes in Python Scikit-learn

http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
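A minimal sketch of training and using scikit-learn's `BernoulliNB` on binary features is shown below. The tiny matrix and its column meanings are made up for illustration, and the labels 1 and 2 simply follow the `gender` encoding in the features list; this is not the deck's actual training code.

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature matrix: one row per user, one column per feature
# (1 = the user clicked something with that feature, 0 = did not).
X_train = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
]
y_train = [1, 1, 2, 2]  # gender labels encoded as 1 or 2

model = BernoulliNB(alpha=1.0)  # alpha is the additive (Laplace) smoothing strength
model.fit(X_train, y_train)

# Hypothetical rows for users whose gender is unknown.
X_unknown = [[1, 0, 0], [0, 1, 0]]
predicted = model.predict(X_unknown)
probabilities = model.predict_proba(X_unknown)  # one row per user, one column per class
print(predicted)
```

`predict_proba` is useful if only confident predictions should be written to the cookie profile; the cut-off would be a design choice the slides do not specify.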

Page 8

Raw Data (Matrix?)

Page 9

Training Data Set Overview

Item                  Description                        Comment
Date                  20150223 ~ 20150424
Total Click Counts    10908692
Login User            Male: 149403, Female: 229448
Feature               Before: 2543240, After: 508648     Uses chi-square for feature selection

Page 10

Feature Extraction
• Categorical Feature -> Binary Features
• Example

Before:
Feature Name        Feature Value
Article Type        A, B, C, D, E

After:
Feature Name        Feature Value
Article Type - A    0, 1
Article Type - B    0, 1
Article Type - C    0, 1
Article Type - D    0, 1
Article Type - E    0, 1
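The category-to-binary expansion in this example can be sketched in a few lines. The function name is an assumption; the feature names follow the "Article Type - X" pattern from the table.

```python
ARTICLE_TYPES = ["A", "B", "C", "D", "E"]


def to_binary_features(article_type, categories=ARTICLE_TYPES):
    """Expand one categorical value into one 0/1 feature per possible category."""
    return {
        f"Article Type - {category}": int(article_type == category)
        for category in categories
    }


encoded = to_binary_features("C")
print(encoded)
```

Exactly one of the five binary features is 1 for a single article; this is what makes the data suitable for a Bernoulli model.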

Page 11

Features List

Feature Name      Description                                  Example
gender            gender of the logged-in user                 1 or 2
cat               the article's category                       旅遊 (travel)
url               a blog URL                                   http://kittyfish.pixnet.net/blog/post/345566174
article_author    the blog's author                            kittyfish
article_id        the blog post's unique id                    345566174
hours             the hour of the click event                  6
refers                                                         http://www.google.com/
country           the country predicted from the IP address    tw

Page 12

But ... Too Many Features (burning money again)
• T = 2,450,000 x 2,543,240
• Many features are irrelevant for prediction

[Diagram] 2,543,240 features

Page 13

Feature Selection - Chi Square

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
http://www.slideshare.net/parth241989/chi-square-test-16093013

Chi Square Value    Dependence with Result
Large               High
Small               Low

• 2543240 features -> 508648 features
• Precision 74% -> 81%
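A small sketch of chi-square feature selection with scikit-learn follows. The slide links only to `chi2`; pairing it with `SelectKBest`, the value `k=2`, and the toy matrix are assumptions made for illustration, so this shows the general shape of the step rather than the deck's actual configuration.

```python
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical binary features: columns 0 and 1 depend on the label,
# column 2 is always 1 and therefore carries no information about it.
X = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
]
y = [1, 1, 2, 2]  # gender labels encoded as 1 or 2

scores, p_values = chi2(X, y)  # a larger score means stronger dependence on the label

selector = SelectKBest(chi2, k=2)  # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)
kept_mask = selector.get_support()  # boolean mask over the original columns
print(scores, kept_mask)
```

The uninformative column gets a score of 0 and is dropped, which mirrors the "large value, high dependence" rule in the table.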

Page 14

Which Features Are Important?

feature_name    male_prob   female_prob   male_count   female_count   total   prob_distance
cat_財經企管    0.137798    0.045564      20587        10454          31041   0.184468
cat_美容彩妝    0.062211    0.137009      9294         31436          40730   0.149596
cat_時尚流行    0.079325    0.151936      11851        34861          46712   0.145221
cat_親子育兒    0.079640    0.133178      11898        30557          42455   0.107076
cat_心情日記    0.180942    0.231797      27033        53185          80218   0.101709
cat_國外旅遊    0.152288    0.194490      22752        44625          67377   0.084403
author_XXXXX    0.049975    0.009037      7466         2073           9539    0.081877
cat_食譜分享    0.054607    0.093596      8158         21475          29633   0.077978
cat_圖文創作    0.085483    0.122831      12771        28183          40954   0.074696

Category names: 財經企管 = finance and business management; 美容彩妝 = beauty and makeup; 時尚流行 = fashion and trends; 親子育兒 = parenting; 心情日記 = mood diary; 國外旅遊 = overseas travel; 食譜分享 = recipe sharing; 圖文創作 = illustrated creative works.
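The slide does not define prob_distance, but the values in the table are consistent with 2 x |male_prob - female_prob|, which is the L1 distance between the two Bernoulli distributions (feature present or absent). The sketch below recomputes it under that assumption; the formula is an inference from the numbers, and the three rows are copied from the table.

```python
def prob_distance(male_prob, female_prob):
    """L1 distance between two Bernoulli distributions (present / absent).

    Assumed definition, inferred from the table values rather than stated in the slides.
    """
    present_gap = abs(male_prob - female_prob)
    absent_gap = abs((1 - male_prob) - (1 - female_prob))
    return present_gap + absent_gap


# (feature_name, male_prob, female_prob) copied from the table
rows = [
    ("cat_美容彩妝", 0.062211, 0.137009),
    ("author_XXXXX", 0.049975, 0.009037),
    ("cat_財經企管", 0.137798, 0.045564),
]

# Rank features by how differently the two genders click on them.
ranked = sorted(rows, key=lambda row: prob_distance(row[1], row[2]), reverse=True)
for name, male_prob, female_prob in ranked:
    leaning = "male" if male_prob > female_prob else "female"
    print(f"{name}: distance={prob_distance(male_prob, female_prob):.6f}, leans {leaning}")
```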

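Page 15

Which Features Are Important?
• Category alone gives a first indication of gender tendency
• Certain specific authors and articles are especially useful for identifying male users
• Male click distributions lean toward specific features more strongly than female ones; this agrees with the later online experiment using GA, in which prediction precision for males was higher than for females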

Page 16

Feature Distribution

A small number of features are highly influential, but the long tail of the remaining features still matters and helps gain the last few percentage points.

Page 17

Prediction Set Data Analysis

            Intersection/Training    Intersection/Prediction
hour        100.00%                  91.67%
author      94.37%                   7.79%
country     100.00%                  2.46%
category    100.00%                  ???
article     84.53%                   2.64%
referer     94.50%                   8.76%

Page 18

Real-World Battle Record

Live Experiment on PIXNET Falcon (Advertisement) System

Page 19

Validation by Google Analytics
● Is God?
● How to Use?

[Diagram] A non-registered user → Classification Model → Prediction. Users are split by the prediction (UGD says Male / UGD says Female) into GA Set 1 and GA Set 2, and within each set the prediction is compared with what GA says (GA says Male / GA says Female).

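Page 20

Prediction Set Data Analysis
• Because the Prediction Data is far larger than the Training Data, the intersection ratio is quite high when the Training Set is used as the denominator
• With the Prediction Data as the denominator, however, the intersection ratios for Article, Author, Country, and Referer are all below 10%, as shown in the figure below
• For Article and Author, this is because Pixnet users' reading concentrates on specific articles; other articles get very few clicks, and some have never been viewed by anyone else

[Diagram] Two Prediction Set / Training Set overlap diagrams: one for Article, Author, Referrer and one for Hour & Category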
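The two intersection ratios differ only in the denominator. A minimal sketch, using made-up author ids rather than the deck's data, shows how the same overlap can look high against the training set and low against the prediction set:

```python
def intersection_ratios(training_values, prediction_values):
    """Return (shared / training size, shared / prediction size) for one feature type."""
    training = set(training_values)
    prediction = set(prediction_values)
    shared = training & prediction
    return len(shared) / len(training), len(shared) / len(prediction)


# Hypothetical author ids: 3 of the 4 training authors also occur in a
# prediction set that contains 40 distinct authors.
training_authors = {"a", "b", "c", "d"}
prediction_authors = {"a", "b", "c"} | {f"x{i}" for i in range(37)}

over_training, over_prediction = intersection_ratios(training_authors, prediction_authors)
print(f"Intersection/Training: {over_training:.2%}")
print(f"Intersection/Prediction: {over_prediction:.2%}")
```

A low Intersection/Prediction value means most feature values seen at prediction time never appeared in training, so the model has nothing learned for them.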

Page 21

Implementation - System Architecture

Page 22

Implementation - Technology Inventory List

Technology / Tool          Purpose
Scikit-learn               Machine learning library
Redis                      Cookie profile database
Python                     Programming language
Celery                     Scheduling framework
Redshift                   Data warehouse for large raw data
Django & REST framework    Builds the API service for internal systems

Page 23

Implementation - Performance Tuning
● CPU
  o Batch Prediction
    - 1000x speed-up
  o Parallel Processing
    - Full use of multiple cores: 8x speed-up
    - Python
● Memory
  o Garbage Collection
  o Python - del
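The batch prediction and `del` items can be illustrated with a small sketch: call a vectorized predict function once per batch instead of once per row, and drop each batch as soon as its results are stored. The helper names and the stand-in predictor are hypothetical; in practice `predict_many` would be something like a fitted model's `predict` method, and any speed-up depends on the model and data.

```python
def batched(rows, batch_size):
    """Yield consecutive lists of at most `batch_size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def predict_in_batches(predict_many, rows, batch_size=1000):
    """Call `predict_many` once per batch rather than once per row."""
    results = []
    for batch in batched(rows, batch_size):
        results.extend(predict_many(batch))
        del batch  # release this reference before the next batch is built
    return results


# Hypothetical stand-in for a vectorized model call; it records batch sizes.
call_sizes = []


def fake_predict_many(batch):
    call_sizes.append(len(batch))
    return [sum(row) % 2 for row in batch]


rows = [[i, i + 1] for i in range(2500)]
predictions = predict_in_batches(fake_predict_many, rows, batch_size=1000)
print(call_sizes, len(predictions))
```

Independent batches like these could also be handed to separate worker processes to use several cores, which is one plausible reading of the parallel processing item.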
