by yi00da
Something About Search
Search Analysis
Outline:
Exploratory search analysis
Search evaluation metrics
BM25
Click Model
Learning To Rank
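Search keywords whose length exceeds 40 are filtered out.

Tables 1-1 and 1-2 show that search keywords are composed in much the same way on PC and on mobile.

Table 1-1
Platform   Average keyword length
android    7.44
ios        7.62
pc         8.47

Table 1-2
Platform   Distinct terms per query   1-term share   2-term share   3-term share
android    4.25                       7%             13%            22%
ios        4.07                       8%             14%            23%
pc         4.12                       11%            12%            22%

Table 1-3
Platform   Keyword searches   Unique keywords   Share
android    60925778           3612131           5.9%
ios        19063979           1499656           7.9%
pc         24689234           2356621           9.5%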
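Tables 2-1 and 2-2 show that a user session lasts about 2 hours on PC and Android and about 1 hour on iOS, and that each session contains roughly 2 to 3 queries.
The industry generally treats half an hour as one search session. Sessions are typically used for query transformation, document ranking, and user satisfaction prediction.

Table 2-1
Platform   Average session duration (minutes)
android    108
ios        63
pc         129

Table 2-2
Platform   Mean queries per session   Mean queries per session (sessions limited to at most 100 queries)
android    3.015                      3.005
ios        2.334                      2.325
pc         3.425                      3.362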
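The half-hour convention mentioned above is often applied as an inactivity timeout: a new session starts whenever the gap between two consecutive queries of the same user exceeds 30 minutes. The sketch below assumes that interpretation; the function name `split_sessions` and the sample timestamps are hypothetical and not taken from the slides.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold (the "half an hour" convention).
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Group one user's query timestamps into sessions by inactivity gap."""
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the gap is within the timeout.
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical query times for a single user.
queries = [
    datetime(2016, 9, 1, 20, 0),
    datetime(2016, 9, 1, 20, 10),
    datetime(2016, 9, 1, 21, 30),
]
sessions = split_sessions(queries)
print(len(sessions), [len(s) for s in sessions])
```

With a timeout this short, measured session durations would be much smaller than the 1 to 2 hours reported in Table 2-1, so the session definition should be fixed before comparing numbers.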
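Table 3-1 shows that the average click position in the search results is 7 and the average position of the first click is 4, which indicates that the ranking still needs improvement (the earlier the click position, the better).
Table 3-2 shows that only about 55% of queries are followed by a play, which indicates that the presentation of search results needs improvement (no third-party competitor data is available for comparison).

Table 3-1
Platform   Mean click position   Mean first-click position (per user, per sid, per distinct keyword)
pc         7.08                  4.08

Table 3-2
Platform   Unique queries with a play after search   Total unique queries   Play share   Unique queries with a download   Download share
android    1969604                                   3612131                55%          727141                           20%
ios        756288                                    1499656                50%
pc         1280618                                   2356621                54%          317625                           13% (inaccurate)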
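[Figure: Android top-30 search keywords (bar chart, search-count axis from 0 to 1200000). Labels visible in the extraction: 中国新歌声, 微微一笑很倾城, TFBOYS, 逆流成河, 张杰, 鹿晗, 儿童歌曲, 旋风少女 2, tfboys, 儿歌, 蒙面唱将猜猜猜, 郑源, 冷漠, 刘德华, 大王叫我来巡山]
[Figure: PC top-30 search keywords (bar chart, search-count axis from 0 to 180000). Labels visible in the extraction: 薛之谦, 周杰伦, 逆流成河, 小幸运, 歌在飞, 微微一笑很倾城, 没有你陪伴真的好孤单, 陈奕迅, 张学友, 张杰, 丑八怪, tfboys, 张信哲, 告白气球, 汪峰]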
The charts above show that mobile users and PC users search differently.

Figure 4-1 shows that searches are concentrated between 8 pm and 9 pm.
Table 4-2 shows that the top 100,000 keywords account for about 85% of search volume, so the distribution is heavily long-tailed.

Table 4-2 (share of search volume covered by the top-k keywords)
Platform   top10   top100   top1000   top10000   top100000
pc         3.9%    14.0%    36.5%     66.2%      84.2%
android    6.1%    17.7%    41.5%     70.4%      86.2%
ios        6.5%    18.9%    43.3%     71.8%      87.5%

[Figure 4-1: Share of searches by hour of day, 0:00 to 23:00, y-axis from 0.0% to 9.0%; the peak is at 20:00 with 8.4%]
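The top-k shares in Table 4-2 can be reproduced from per-keyword search counts. A minimal sketch, assuming the counts are available as a plain list; `top_k_share` and the toy counts are hypothetical and only illustrate the arithmetic.

```python
def top_k_share(counts, k):
    """Share of total search volume covered by the k most searched keywords."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Hypothetical per-keyword search counts (not from the slides).
toy_counts = [50, 30, 10, 5, 3, 1, 1]
print(top_k_share(toy_counts, 2))
```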
Search habit
Do users scan documents from top to bottom?
1. The click-through rate (CTR) of the first document is about 0.45, while the CTR of the tenth document is well below 0.05.
2. The document below a click is viewed roughly 50% of the time.
Appendix
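MAP (Mean Average Precision)
Example: suppose there are two topics. Topic 1 has 4 relevant pages and topic 2 has 5 relevant pages. For topic 1 the system retrieves 4 relevant pages, at ranks 1, 2, 4, and 7; for topic 2 it retrieves 3 relevant pages, at ranks 1, 3, and 5. The average precision for topic 1 is (1/1+2/2+3/4+4/7)/4=0.83. The average precision for topic 2 is (1/1+2/3+3/5+0+0)/5=0.45. Therefore MAP=(0.83+0.45)/2=0.64.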
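The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and relevant pages that are never retrieved contribute 0 to the sum, as in the example.

```python
def average_precision(relevant_ranks, num_relevant):
    """Average precision from the ranks of the retrieved relevant pages.

    num_relevant is the total number of relevant pages for the topic;
    relevant pages that were not retrieved contribute 0.
    """
    ranks = sorted(relevant_ranks)
    # Precision at each relevant hit: (hits so far) / (rank of this hit).
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / num_relevant


def mean_average_precision(topics):
    """topics: list of (relevant_ranks, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in topics) / len(topics)


ap1 = average_precision([1, 2, 4, 7], 4)
ap2 = average_precision([1, 3, 5], 5)
map_score = mean_average_precision([([1, 2, 4, 7], 4), ([1, 3, 5], 5)])
print(round(ap1, 2), round(ap2, 2), round(map_score, 2))  # 0.83 0.45 0.64
```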
NDCG (Normalized Discounted Cumulative Gain)
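The slide's NDCG formula did not survive extraction. The sketch below assumes one common variant, with gain 2^rel - 1 and discount log2(rank + 1); other variants use the raw relevance grade as the gain, so treat this as an illustration rather than the deck's exact definition.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with gain 2^rel - 1 and log2(rank + 1) discount."""
    # enumerate starts at 0, so rank = i + 1 and the discount is log2(i + 2).
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG of the first k results divided by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels, listed in ranked order.
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

NDCG needs graded relevance labels per result, which is why it cannot be computed from click logs alone without some labeling step.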
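BM25
The BM25 algorithm is commonly used to score search relevance. Its main idea in one sentence: parse the query into morphemes (terms) $q_i$; then, for each search result $D$, compute the relevance score of each $q_i$ with respect to $D$; finally, take a weighted sum of these per-term scores to obtain the relevance score between the query and $D$.
In general no relevance information is available, so $r$ and $R$ are both 0, and a term rarely occurs more than once in a query, so $qf_i = 1$. Under these simplifications the score is defined as:
$$\mathrm{Score}(Q, D) = \sum_{i=1}^{n} \log\frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{f_i\,(k_1 + 1)}{f_i + K}, \qquad K = k_1\left(1 - b + b \cdot \frac{dl}{avgdl}\right)$$
Here $N$ is the number of documents, $n_i$ the number of documents containing $q_i$, $f_i$ the frequency of $q_i$ in $D$, $dl$ the length of $D$, and $avgdl$ the average document length.
The parameter $b$ controls how strongly document length affects relevance: the larger $b$ is, the greater the influence of document length on the relevance score, and vice versa.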
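The simplified scoring described above (no relevance information, so r = R = 0, and qf_i = 1) can be sketched as follows. The defaults k1 = 1.2 and b = 0.75 are commonly used values, not values given in the slides, and the tokenized toy corpus is hypothetical.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Simplified BM25 with r = R = 0 and qf_i = 1.

    corpus is a list of tokenized documents and is used for N, n_i and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    # Length normalization: larger b means document length matters more.
    big_k = k1 * (1 - b + b * len(doc_terms) / avgdl)
    score = 0.0
    for term in set(query_terms):  # each query term counted once (qf_i = 1)
        n_i = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
        f_i = tf[term]
        score += idf * f_i * (k1 + 1) / (f_i + big_k)
    return score


# Hypothetical tokenized titles, for illustration only.
corpus = [
    ["little", "luck"],
    ["love", "confession", "balloon"],
    ["ugly"],
    ["love", "song"],
    ["balloon", "ride"],
]
query = ["confession", "balloon"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

Note that this IDF form turns negative for terms that occur in more than half of the documents, which is one reason very short texts such as song titles can score unintuitively.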
BM25 with title
As can be seen, BM25 does not work well on song titles.
Click Model
Random Click Model (RCM)
Click-through Rate Models (CTR): Rank-based CTR Model (RCTR), Document-based CTR Model (DCTR)
Position-based Model (PBM), User Browsing Model (UBM)
Cascade Model (CM), Dependent Click Model (DCM), Click Chain Model (CCM), Dynamic Bayesian Network Model (DBN), Simplified DBN Model (SDBN)
Baseline models
1. Random Click Model (RCM)
Any document can be clicked with the same (fixed) probability.
2. Rank-Based CTR Model (RCTR)
The click probability depends on the rank of the document.
3. Document-Based CTR Model (DCTR)
The click-through rate is estimated for each query-document pair. This model is subject to overfitting, because some documents and/or queries were not previously encountered in the click log.
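All three baselines reduce to counting clicks over impressions at different granularities. A minimal sketch, assuming each logged session is a `(query, [(doc_id, clicked), ...])` pair with results listed in ranked order; this log format and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_ctr_models(sessions):
    """Estimate RCM, RCTR and DCTR parameters by counting clicks / impressions."""
    shown = clicks = 0
    rank_shown, rank_clicks = defaultdict(int), defaultdict(int)
    qd_shown, qd_clicks = defaultdict(int), defaultdict(int)
    for query, results in sessions:
        for rank, (doc, clicked) in enumerate(results, start=1):
            shown += 1
            clicks += clicked
            rank_shown[rank] += 1
            rank_clicks[rank] += clicked
            qd_shown[(query, doc)] += 1
            qd_clicks[(query, doc)] += clicked
    rcm = clicks / shown  # one global click probability
    rctr = {r: rank_clicks[r] / rank_shown[r] for r in rank_shown}  # per rank
    dctr = {qd: qd_clicks[qd] / qd_shown[qd] for qd in qd_shown}  # per query-document
    return rcm, rctr, dctr


# Hypothetical click log.
log = [
    ("q", [("a", 1), ("b", 0)]),
    ("q", [("a", 0), ("b", 1)]),
    ("q", [("b", 0), ("a", 1)]),
]
rcm, rctr, dctr = fit_ctr_models(log)
print(rcm, rctr, dctr)
```

DCTR has one parameter per query-document pair, so pairs seen only a few times get noisy estimates and unseen pairs get none, which is the overfitting concern noted above.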
Position-Based Model (PBM)
A document is clicked when the user examines it and is attracted by it: $P(C_d = 1) = P(E_d = 1) \cdot P(A_d = 1)$.
Examination hypothesis. The probability of a user examining a document depends heavily on its rank or position. PBM introduces a set of examination parameters $\gamma_r$, one for each rank. PBM does not depend on the events at previous ranks.
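Under PBM the click probability of a document factorizes into a rank-dependent examination term and a query-document attractiveness term. A minimal sketch with made-up parameter values; the names `gamma` and `alpha` follow common notation and are not taken from the slides.

```python
def pbm_click_probability(examination, attractiveness, rank):
    """P(click) = P(examined at this rank) * P(attractive), with a 1-based rank."""
    return examination[rank - 1] * attractiveness


# Hypothetical parameters: examination probability per rank, one attractiveness value.
gamma = [1.0, 0.7, 0.5]
alpha = 0.4
probs = [pbm_click_probability(gamma, alpha, r) for r in (1, 2, 3)]
print(probs)
```

The same document is therefore less likely to be clicked at a lower rank even though its attractiveness is unchanged, which is how PBM separates position bias from relevance.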
Cascade Model (CM)
Steps:
1. Start from the first document.
2. Examine documents one by one.
3. If a document is clicked, stop.
4. Otherwise, continue.
In particular:
1. CM does not allow sessions with more than one click.
2. CM cannot explain non-linear examination patterns.
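Because the cascade steps make examination fully observed (everything up to and including the first click was examined, and everything in a session without clicks was examined), attractiveness can be estimated by counting. A minimal sketch, assuming sessions are `(query, docs, clicks)` triples; skipping multi-click sessions is a choice made here for illustration, not something the slides prescribe.

```python
from collections import defaultdict


def fit_cascade(sessions):
    """MLE attractiveness per (query, doc) under the cascade model."""
    examined, clicked = defaultdict(int), defaultdict(int)
    for query, docs, clicks in sessions:
        if sum(clicks) > 1:
            continue  # assumption: drop sessions CM cannot represent
        # Examined: ranks up to the first click, or the whole list if no click.
        last = clicks.index(1) if 1 in clicks else len(docs) - 1
        for rank in range(last + 1):
            examined[(query, docs[rank])] += 1
            clicked[(query, docs[rank])] += clicks[rank]
    return {key: clicked[key] / examined[key] for key in examined}


# Hypothetical sessions for one query.
docs = ["a", "b", "c"]
sessions = [
    ("q", docs, [0, 1, 0]),
    ("q", docs, [1, 0, 0]),
    ("q", docs, [0, 0, 0]),
    ("q", docs, [1, 1, 0]),  # multi-click session, skipped
]
attractiveness = fit_cascade(sessions)
print(attractiveness)
```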
So far:
1. CTR models
   + count clicks (simple and fast)
   - do not distinguish examination and attractiveness
2. Position-based model (PBM)
   + examination and attractiveness
   - examination of a document at rank r does not depend on examinations and clicks above r (addressed by the User browsing model)
3. Cascade model (CM)
   + cascade dependency of examination at r on examinations and clicks above r
   - only one click is allowed (addressed by the Dynamic Bayesian network model)
User Browsing Model (UBM)
The examination probability depends not only on the rank of a document $r$, but also on the rank of the previously clicked document $r'$:
$$P(E_r = 1 \mid C_1 = c_1, \dots, C_{r-1} = c_{r-1}) = \gamma_{r r'}, \qquad r' = \max\{k \in \{0, \dots, r-1\} : c_k = 1\}$$
Here $r'$ is the rank of the previously clicked document, or 0 if none of them was clicked; $c_0$ is set to 1 for convenience.
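The only bookkeeping UBM needs per session is r', the rank of the latest click above each position. A minimal sketch; the helper name is hypothetical.

```python
def previous_click_ranks(clicks):
    """For each 1-based rank r, return r': the rank of the latest click above r, or 0."""
    result = []
    last_click = 0  # corresponds to the convention c_0 = 1
    for rank, clicked in enumerate(clicks, start=1):
        result.append(last_click)
        if clicked:
            last_click = rank
    return result


print(previous_click_ranks([0, 1, 0, 1, 0]))  # [0, 0, 2, 2, 4]
```

Each (r, r') pair indexes one examination parameter, so a result list of length n needs on the order of n^2 / 2 parameters instead of the n used by PBM.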
Dynamic Bayesian Network Model (DBN)
Steps:
1. Start from the first document.
2. Examine documents one by one.
3. On a click, the user reads the actual document and may be satisfied.
4. If satisfied, stop.
5. Otherwise, continue with a fixed probability.
In particular:
1. Gamma is the continuation probability for a user who either did not click on a document or clicked but was not satisfied by it.
2. Setting gamma to 1 in DBN gives the Simplified DBN Model (SDBN), which can be estimated by MLE and performs well.
3. Additionally setting the satisfaction probability to 1 in SDBN reduces the model to the Cascade Model (CM).
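The DBN steps above describe a generative process, which can be simulated directly. A minimal sketch, assuming per-position attractiveness and satisfaction probabilities are given; the function name and parameter values are hypothetical.

```python
import random


def simulate_dbn_session(attractiveness, satisfaction, gamma, rng=random):
    """Generate one click vector from the DBN browsing process."""
    clicks = []
    examining = True
    for a, s in zip(attractiveness, satisfaction):
        if not examining:
            clicks.append(0)
            continue
        clicked = rng.random() < a
        clicks.append(int(clicked))
        if clicked and rng.random() < s:
            examining = False  # satisfied: stop
        elif rng.random() >= gamma:
            examining = False  # not satisfied or no click: abandon with prob. 1 - gamma
    return clicks


# With gamma = 1 and satisfaction = 1 the process behaves like the cascade model.
print(simulate_dbn_session([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=1.0))  # [1, 0, 0]
```

Setting `gamma=1.0` gives SDBN behavior, and also fixing every satisfaction value at 1.0 stops the session at the first click, which matches the reduction to CM listed above.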
Parameter Estimation
1. Maximum likelihood estimation (RCM, RCTR, DCTR, DCM, SDBN, CM)
2. Expectation maximization (UBM, PBM, CCM, DBN)
Simplified DBN Model (SDBN) -- MLE
In particular:
1. SDBN assumes that a user examines all documents until the last-clicked one and then abandons the search. In this case, both the attractiveness A and the satisfaction S of SDBN are observed.
2. Attractiveness A: for a given query, the ratio of a document's click count to its impression count (impressions at or before the last click).
3. Satisfaction S: for a given query, the fraction of the document's clicks in which that document was the last click.
Dynamic Bayesian Network Model (DBN) -- EM
In particular:
1. E-step. Given the three parameters, compute the posterior probabilities of A, E, and S. This involves the forward-backward algorithm.
2. M-step. Given the posterior probabilities, update the three parameters.
Result
1. DBN outperforms the other models.
2. An x-axis value of 100 refers to URLs with at least 100 sessions in the training set. With more sessions the priors matter less, and Cascade and DBN improve.
3. Navigational queries show a quality-of-context bias and have many sessions, so position models suffer on them.
Limitations and future research
Limitations:
1. Click models cannot model out-of-order clicks.
2. They are completely blind to query reformulations.
3. They assume a homogeneous user population.
Future research:
1. Why not learn the structure of a click model from data instead of defining it manually?
2. Interactions beyond clicks.
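Appendix: SDBN compute
Only the top 60 search results are used.
Definitions:
1. An impression is a result shown up to the last click.
2. Click satisfaction is defined as described in the notes.
3. att_alpha=0.1, att_beta=250, sat_alpha=0.1, sat_beta=100
Data filtering:
1. A click is defined as a play, an add, or a download.
2. Sessions whose z sequence is more than 10% missing are discarded.
3. Sessions with more than 10000 records for the same mid and sid are filtered out.
4. Only queries with more than 10 sessions are kept.
5. Users are assumed to play from top to bottom; sessions with out-of-order plays are discarded.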
[Diagram: PC behavior log, crawled search-interface data, click model, relevance score]
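Appendix: RCM vs RCTR
[Comparison: RCM, RCTR]
Both global popularity and per-keyword popularity exhibit position bias. Per-keyword popularity is somewhat better, because it accounts for the differences between queries.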
Appendix: SDBN vs CM
For some keywords SDBN performs better. No manually edited labels are available, so NDCG was not computed.
Appendix: NDCG compute