Something about search

by yi00da

Something About Search

Exploratory search analysis, search evaluation metrics, BM25

Click Model

Learning To Rank

Search Analysis

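Search keywords with a length greater than 40 are filtered out.

Table 1-1

Platform   Avg. keyword length
android    7.44
ios        7.62
pc         8.47

Table 1-2

Platform   Distinct terms per query   1 term   2 terms   3 terms
android    4.25                       7%       13%       22%
ios        4.07                       8%       14%       23%
pc         4.12                       11%      12%       22%

Table 1-3

Platform   Keyword searches   Unique keywords   Unique share
android    60925778           3612131           5.9%
ios        19063979           1499656           7.9%
pc         24689234           2356621           9.5%

Tables 1-1 and 1-2 show that the composition of search keywords is similar on PC and on mobile.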

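Table 2-1

Platform   Avg. session duration (minutes)
android    108
ios        63
pc         129

Table 2-2

Platform   Mean queries per session   Mean queries per session (sessions capped at 100 queries)
android    3.015                      3.005
ios        2.334                      2.325
pc         3.425                      3.362

The tables above show that a user session lasts about 2 hours on PC and Android and about 1 hour on iOS, and that each session contains roughly 2-3 queries.

The industry generally treats half an hour as one search session. Sessions are typically used for query transformation, document ranking, and user satisfaction prediction.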
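A minimal sketch of the half-hour session convention, assuming it means "start a new session after 30 minutes of inactivity" (the slides do not say which reading is used). The event layout, the function name `split_sessions`, and the sample queries are hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(events, gap=timedelta(minutes=30)):
    """Group (timestamp, query) events of one user into sessions.

    A new session starts whenever the pause since the previous query
    is longer than `gap` (assumed reading of the half-hour rule).
    """
    sessions = []
    for ts, query in sorted(events):
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, query))
        else:
            sessions.append([(ts, query)])
    return sessions

# Hypothetical log of one user.
events = [
    (datetime(2016, 9, 1, 20, 0), "tfboys"),
    (datetime(2016, 9, 1, 20, 10), "tfboys live"),
    (datetime(2016, 9, 1, 21, 30), "jay chou"),
]
sessions = split_sessions(events)
queries_per_session = [len(s) for s in sessions]
```

Averaging `queries_per_session` over all users gives a figure comparable to the "mean queries per session" column, under the stated assumption.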

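Table 3-1

Platform   Avg. click position   Avg. first-click position (per user, per sid, per distinct keyword)
pc         7.08                  4.08

Table 3-2

Platform   Unique queries with a play   Total unique queries   Play share   Unique queries with a download   Download share
android    1969604                      3612131                55%          727141                           20%
ios        756288                       1499656                50%          -                                -
pc         1280618                      2356621                54%          317625                           13% (inaccurate)

Table 3-1 shows that the average click position in the search results is 7 and the average position of the first click is 4, which suggests the ranking still has room to improve (the earlier the click position, the better).

Table 3-2 shows that only about 55% of queries are followed by a play, which suggests the presentation of search results needs improvement (no third-party competitor data is available for comparison).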

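[Figure: Android top-30 search keywords, bar chart of search counts (axis 0 to 1,200,000). Visible labels: 中国新歌声, 微微一笑很倾城, TFBOYS, 逆流成河, 张杰, 鹿晗, 儿童歌曲 (children's songs), 旋风少女 2, tfboys, 儿歌 (nursery rhymes), 蒙面唱将猜猜猜, 郑源, 冷漠, 刘德华, 大王叫我来巡山]

[Figure: PC top-30 search keywords, bar chart of search counts (axis 0 to 180,000). Visible labels: 薛之谦, 周杰伦, 逆流成河, 小幸运, 歌在飞, 微微一笑很倾城, 没有你陪伴真的好孤单, 陈奕迅, 张学友, 张杰, 丑八怪, tfboys, 张信哲, 告白气球, 汪峰]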

The charts above show that user search behavior differs between mobile and PC.

Figure 4-1, Table 4-2

Figure 4-1 shows that searches are concentrated around 8-9 pm. Table 4-2 shows that the top 100,000 keywords account for about 85% of search volume, so the distribution has a heavy long tail.

Table 4-2: share of search volume covered by the top-N keywords

Platform   top10   top100   top1000   top10000   top100000
pc         3.9%    14.0%    36.5%     66.2%      84.2%
android    6.1%    17.7%    41.5%     70.4%      86.2%
ios        6.5%    18.9%    43.3%     71.8%      87.5%

[Figure 4-1: Share of searches by hour of day (0:00 to 23:00, y-axis 0.0% to 9.0%); the peak is at 20:00 with 8.4%]
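The top-N shares in Table 4-2 can be computed from a raw keyword log roughly as follows. This is only a sketch: the function name `top_k_share` and the toy log are illustrative, not the pipeline used for the slides.

```python
from collections import Counter

def top_k_share(queries, k):
    """Fraction of all searches covered by the k most frequent keywords."""
    counts = Counter(queries)
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Toy log: 10 searches over 4 distinct keywords.
log = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
share_top1 = top_k_share(log, 1)  # 5 of 10 searches
share_top2 = top_k_share(log, 2)  # 8 of 10 searches
```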

Search habit

Do users scan documents from top to bottom?

1. The click-through rate (CTR) of the first document is about 0.45, while the CTR of the tenth document is well below 0.05.
2. The document below a click is viewed roughly 50% of the time.

Search evaluation metrics

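MAP: Mean Average Precision

Example: suppose there are two topics. Topic 1 has 4 relevant pages and topic 2 has 5 relevant pages. For topic 1 a system retrieves 4 relevant pages, at ranks 1, 2, 4, 7; for topic 2 it retrieves 3 relevant pages, at ranks 1, 3, 5. The average precision for topic 1 is (1/1+2/2+3/4+4/7)/4 = 0.83. The average precision for topic 2 is (1/1+2/3+3/5+0+0)/5 = 0.45. So MAP = (0.83+0.45)/2 = 0.64.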
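The worked example can be checked with a few lines of Python. The function names are illustrative; relevant documents that were never retrieved contribute 0 to the sum, exactly as in the example.

```python
def average_precision(relevant_ranks, num_relevant):
    """AP for one topic: precision at each retrieved relevant rank,
    summed and divided by the total number of relevant documents."""
    ranks = sorted(relevant_ranks)
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / num_relevant

def mean_average_precision(topics):
    """topics: list of (relevant_ranks, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in topics) / len(topics)

ap1 = average_precision([1, 2, 4, 7], 4)   # topic 1
ap2 = average_precision([1, 3, 5], 5)      # topic 2
map_score = mean_average_precision([([1, 2, 4, 7], 4), ([1, 3, 5], 5)])
```

Rounded to two decimals, `ap1`, `ap2`, and `map_score` reproduce 0.83, 0.45, and 0.64.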

NDCG: Normalized Discounted Cumulative Gain
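A minimal sketch of NDCG, assuming one common variant: gain 2^rel - 1 and a log2(rank + 1) discount. The slides do not state which gain or discount was intended, so both choices are assumptions.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])   # already in ideal order
swapped = ndcg([0, 1])      # the only relevant document sits at rank 2
```

NDCG needs graded relevance labels for each result, which is why it can only be computed where such labels exist.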



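BM25

The BM25 algorithm is commonly used to score search relevance. Its main idea in one sentence: parse the query into terms qi; then, for each search result D, compute the relevance score of each term qi with respect to D; finally, take a weighted sum of these per-term scores to obtain the relevance score between the query and D.

In general no relevance information is available, i.e. r and R are both 0, and a term rarely occurs more than once in a query, so qfi = 1. Score is then defined as follows:

Here the parameter b controls how strongly document length affects relevance. The larger b is, the more document length affects the relevance score, and vice versa.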
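Under those assumptions (r = R = 0, qfi = 1) the standard BM25 score reduces to Score(Q, D) = sum_i IDF(qi) * fi * (k1 + 1) / (fi + k1 * (1 - b + b * dl / avgdl)), with IDF(qi) = log((N - ni + 0.5) / (ni + 0.5)). Here fi is the frequency of qi in D, dl the length of D, avgdl the average document length, N the number of documents, and ni the number of documents containing qi. A minimal sketch follows; k1 = 1.2 and b = 0.75 are commonly used defaults assumed here, not values from the slides, and the tokenized toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization; b = 0 switches it off entirely.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        s = 0.0
        for q in set(query_terms):
            if tf[q] == 0:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + norm)
        scores.append(s)
    return scores

docs = [["little", "luck"], ["confession", "balloon"], ["ugly"],
        ["little", "star", "song"]]
scores = bm25_scores(["confession"], docs)
```

With this IDF form, a term that occurs in more than half of the documents gets a negative weight; some implementations clamp or shift the IDF to avoid that.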

BM25 with title

As can be seen, BM25 does not work well on song titles.


Click Model

Random Click Model (RCM)
Click-through Rate Models (CTR)
Rank-based CTR Model (RCTR)
Document-based CTR Model (DCTR)
Position-based Model (PBM)
User Browsing Model (UBM)
Cascade Model (CM)
Dependent Click Model (DCM)
Click Chain Model (CCM)
Dynamic Bayesian Network Model (DBN)
Simplified DBN Model (SDBN)

Baseline models

1. Random Click Model (RCM): any document can be clicked with the same (fixed) probability.

2. Rank-based CTR Model (RCTR): the click probability depends on the rank of the document.

3. Document-based CTR Model (DCTR): a separate click-through rate is estimated for each query-document pair. This is prone to overfitting, because some documents and/or queries were never encountered in the click log.
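All three baselines can be fit by simple counting. A minimal sketch, assuming a session is a query plus a ranked list of (document, clicked) pairs; this layout and the function name are illustrative.

```python
from collections import defaultdict

def fit_ctr_models(sessions):
    """Maximum-likelihood estimates for RCM, RCTR and DCTR.

    sessions: list of (query, [(doc_id, clicked), ...]) in rank order.
    """
    clicks = shown = 0
    rank_stats = defaultdict(lambda: [0, 0])  # rank -> [clicks, impressions]
    pair_stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [clicks, impressions]
    for query, results in sessions:
        for rank, (doc, clicked) in enumerate(results, start=1):
            clicks += clicked
            shown += 1
            rank_stats[rank][0] += clicked
            rank_stats[rank][1] += 1
            pair_stats[(query, doc)][0] += clicked
            pair_stats[(query, doc)][1] += 1
    rcm = clicks / shown                                   # one global probability
    rctr = {r: c / n for r, (c, n) in rank_stats.items()}  # one per rank
    dctr = {p: c / n for p, (c, n) in pair_stats.items()}  # one per query-document pair
    return rcm, rctr, dctr

sessions = [
    ("q1", [("a", 1), ("b", 0)]),
    ("q1", [("a", 1), ("b", 1)]),
    ("q2", [("c", 0), ("a", 0)]),
]
rcm, rctr, dctr = fit_ctr_models(sessions)
```

A pair that never appears in the log has no DCTR entry at all, which is the overfitting problem mentioned above.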

Position-Based Model

Position-based model (PBM)

A document is clicked when the user both examines it and finds it attractive.

Examination hypothesis. The probability of a user examining a document depends heavily on its rank or position. PBM introduces a set of examination parameters γ, one for each rank. PBM does not depend on the events at previous ranks.
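In code, the PBM click probability at rank r is the product of the examination parameter for that rank and the attractiveness of the document shown there. A tiny sketch with made-up parameter values:

```python
def pbm_click_probs(attractiveness, examination):
    """P(click at rank r) = examination[r] * attractiveness[r]."""
    return [g * a for g, a in zip(examination, attractiveness)]

# Hypothetical parameters: equally attractive documents at every rank,
# but examination decaying with position.
probs = pbm_click_probs([0.5, 0.5, 0.5], [1.0, 0.6, 0.2])
```

Identical documents get fewer clicks at lower ranks purely because they are examined less often, which is the position bias a plain CTR model cannot separate from attractiveness.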

Cascade Model

Cascade model (CM)

Steps:
1. Start from the first document.
2. Examine documents one by one.
3. If a document is clicked, stop.
4. Otherwise, continue.

In particular:
1. CM does not allow sessions with more than one click.
2. CM cannot explain non-linear examination patterns.
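Because the cascade steps fix which documents were examined (everything up to and including the first click), attractiveness can be estimated by counting. A minimal sketch, assuming the same hypothetical session layout of (query, ranked list of (document, clicked)) with at most one click per session:

```python
from collections import defaultdict

def fit_cascade(sessions):
    """Attractiveness per (query, doc) = clicks / examinations under CM."""
    stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [clicks, examinations]
    for query, results in sessions:
        for doc, clicked in results:
            stats[(query, doc)][0] += clicked
            stats[(query, doc)][1] += 1
            if clicked:
                break  # the user stops; lower documents are not examined
    return {pair: c / n for pair, (c, n) in stats.items()}

sessions = [
    ("q", [("a", 0), ("b", 1), ("c", 0)]),
    ("q", [("a", 1), ("b", 0), ("c", 0)]),
    ("q", [("a", 0), ("b", 0), ("c", 0)]),  # no click: all three examined
]
alpha = fit_cascade(sessions)
```

Document "c" is only counted in the third session, because in the first two the user had already stopped above it.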

So far:

1. CTR models
   + count clicks (simple and fast)
   - do not distinguish examination from attractiveness

2. Position-based model (PBM) -> User browsing model
   + separates examination and attractiveness
   - examination of a document at rank r does not depend on examinations and clicks above r

3. Cascade model (CM) -> Dynamic Bayesian network
   + cascade dependency of examination at r on examinations and clicks above r
   - only one click is allowed

User Browsing Model

User browsing model (UBM)

The examination probability depends not only on the rank r of a document, but also on the rank r' of the previously clicked document.

r' is the rank of the previously clicked document, or 0 if none of the documents above was clicked; c0 is set to 1 for convenience.
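The definition of r' is easy to get wrong by one position, so here is a small sketch that derives it from a click vector (ranks are 1-based; the function name is illustrative):

```python
def previous_click_ranks(clicks):
    """For each rank r = 1..n, return r': the rank of the most recent
    click strictly above r, or 0 if there is none."""
    prev, last = [], 0
    for rank, clicked in enumerate(clicks, start=1):
        prev.append(last)
        if clicked:
            last = rank
    return prev

prev = previous_click_ranks([0, 1, 0, 0, 1, 0])
# prev == [0, 0, 2, 2, 2, 5]
```

Each (r, r') pair then indexes its own examination parameter, so UBM has on the order of n^2 / 2 examination parameters where PBM has n.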

Dynamic Bayesian Model

Dynamic Bayesian network model (DBN)

Steps:
1. Start from the first document.
2. Examine documents one by one.
3. If a document is clicked, the user reads the actual document and may be satisfied.
4. If satisfied, stop.
5. Otherwise, continue with a fixed probability.

In particular:
1. Gamma is the continuation probability for a user who either did not click on a document or clicked but was not satisfied by it.
2. Setting gamma to 1 in DBN gives the Simplified DBN Model (SDBN), which can be fit by MLE and performs well.
3. Additionally setting the satisfaction probability to 1 in SDBN turns the model into the Cascade Model (CM).
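The DBN steps can be read as a small simulation of one user. The sketch below is illustrative only: the per-rank attractiveness and satisfaction lists and the function name are assumptions, not part of the slides.

```python
import random

def simulate_dbn(attractiveness, satisfaction, gamma, rng=random):
    """Return the click vector of one simulated DBN session."""
    clicks = []
    examining = True
    for a, s in zip(attractiveness, satisfaction):
        # A click needs both examination and attraction.
        clicked = examining and rng.random() < a
        clicks.append(int(clicked))
        if not examining:
            continue
        if clicked and rng.random() < s:
            examining = False   # satisfied: stop
        elif rng.random() >= gamma:
            examining = False   # not satisfied and gives up
    return clicks

# gamma = 1 and satisfaction = 1 behaves like the cascade model:
cascade_like = simulate_dbn([1, 1, 1], [1, 1, 1], gamma=1.0)
# gamma = 1 and satisfaction = 0: the user never stops.
never_satisfied = simulate_dbn([1, 1, 1], [0, 0, 0], gamma=1.0)
```

With attractiveness 1 everywhere these two calls are deterministic: the first clicks only the top document, the second clicks all three.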


Parameter Estimation

1. Maximum likelihood estimation (RCM, RCTR, DCTR, DCM, SDBN, CM)
2. Expectation maximization (UBM, PBM, CCM, DBN)

Simplified DBN Model (SDBN) -- MLE

In particular:
1. SDBN assumes that a user examines all documents until the last-clicked one and then abandons the search. In this case, both the attractiveness A and the satisfaction S of SDBN are observed.
2. Attractiveness A: for a given query, the ratio of a document's click count to its impression count (counting only impressions at or before the last click).
3. Satisfaction S: for a given query, the share of the document's clicks that were the last click of the session.
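The two ratios above translate directly into counting code. A minimal sketch, assuming the hypothetical session layout (query, ranked list of (document, clicked)) and assuming that sessions without any click contribute no impressions; that last choice is an assumption, since the slides only define impressions relative to the last click.

```python
from collections import defaultdict

def fit_sdbn(sessions):
    """MLE of SDBN attractiveness and satisfaction per (query, doc)."""
    shown = defaultdict(int)      # impressions at or above the last click
    clicked_n = defaultdict(int)  # clicks
    last_n = defaultdict(int)     # clicks that were the session's last click
    for query, results in sessions:
        click_positions = [i for i, (_, c) in enumerate(results) if c]
        if not click_positions:
            continue  # assumption: no-click sessions are skipped
        last = click_positions[-1]
        for i, (doc, c) in enumerate(results[: last + 1]):
            key = (query, doc)
            shown[key] += 1
            if c:
                clicked_n[key] += 1
                if i == last:
                    last_n[key] += 1
    attractiveness = {k: clicked_n.get(k, 0) / n for k, n in shown.items()}
    satisfaction = {k: last_n.get(k, 0) / n for k, n in clicked_n.items()}
    return attractiveness, satisfaction

sessions = [
    ("q", [("a", 1), ("b", 1), ("c", 0)]),
    ("q", [("a", 1), ("b", 0), ("c", 0)]),
    ("q", [("a", 0), ("b", 1), ("c", 0)]),
]
attractiveness, satisfaction = fit_sdbn(sessions)
```

Document "c" never appears at or above a last click, so it gets no estimate at all. A production version would add smoothing priors for such sparse pairs.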

Dynamic Bayesian model (DBN) -- EM

In particular:
1. E-step. Given the three parameters, compute the posterior probabilities of A, E, and S. This involves the forward-backward algorithm.
2. M-step. Given the posterior probabilities, update the three parameters.

Result

1. DBN outperforms the other models.
2. X-axis = 100 means URLs with at least 100 sessions in the training set. With more sessions the priors matter less, and Cascade and DBN improve.
3. Navigational queries have quality-of-context bias and many sessions; position models suffer.

Limitations and future research

Limitations:
1. Click models cannot model out-of-order clicks.
2. They are completely blind to query reformulations.
3. They assume a homogeneous user population.

Future research:
1. Learn the structure of a click model from data instead of defining it manually.
2. Interactions beyond clicks.


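Appendix: SDBN compute

Only the top 60 search results are used.

1. An impression is defined as a result up to the last click.
2. Click satisfaction is defined as shown in the notes.
3. att_alpha=0.1, att_beta=250, sat_alpha=0.1, sat_beta=100

1. A click is defined as a play, an add, or a download.
2. Sessions missing more than 10% of the z sequence are discarded.
3. Sessions with more than 10000 records for the same mid and sid are filtered out.
4. Only queries with more than 10 sessions are kept.
5. Users are assumed to play from top to bottom; sessions with out-of-order plays are discarded.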

PC behavior logs

Crawled search API data, click model, relevance score

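Appendix: RCM VS RCTR

RCM, RCTR: both global popularity and per-keyword popularity show position bias. Per-keyword popularity does somewhat better, because it accounts for the differences between queries.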

Appendix: SDBN VS CM

For some keywords SDBN works somewhat better. No human-edited labels were available, so NDCG was not computed.

Appendix: NDCG compute

Engineering
