
The 1st China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification


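Page 1: The 1st China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification

Team ThuFit: 周昕宇, 吴育昕, 任杰, 王禺淇, 罗鸿胤
Advisors: 方展鹏, 唐杰

Tsinghua University, Future Internet Interest Team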

Page 2: Task

Given:
  • Partially labeled keywords
  • The first 10 search results for each keyword
  • Keyword-buyer relationships

Goal: predict the class of each unlabeled keyword

Page 3: Data summary

keyword_class.txt
  • 10,787,584 keywords
  • 1,143,928 labeled (10.6%)
  • 9,963,062 unique keywords
  • 33 classes

keyword_users.txt
  • 23,942,643 entries
  • Each entry is a keyword-buyer pair

keyword_titles.txt
  • 21,575,166 entries, of which only 10,787,583 are non-empty
  • Each entry consists of a keyword and its first 10 Baidu search results

[Chart: keyword distribution, labeled 11%, unlabeled 89%]

Page 4: Approach

  • Preprocessing: keyword segmentation
  • Feature extraction: keyword segments, keyword-buyer relation, keyword-segment relation, search result utilization
  • Model: liblinear

Page 5: Keyword segmentation

Keyword segment
  • A substring of a keyword
  • A semantic unit

Segmentation
  • Break a keyword into a set of segments
  • Two ways:
    Exact segmentation: 清华大学 => 清华 / 大学
    Full segmentation: 清华大学 => 清华 / 大学 / 华大 / 清华大学
  • Tool: jieba Chinese word segmentation (结巴中文分词): https://github.com/fxsjy/jieba
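The difference between the two segmentation modes can be illustrated with a toy example. The sketch below does not call jieba. It uses a hypothetical four-word dictionary with made-up frequencies (`FREQ`), lists every dictionary word found in the keyword for full segmentation, and picks the most probable non-overlapping split by dynamic programming for exact segmentation. The function names and frequency values are assumptions chosen only to reproduce the 清华大学 example.

```python
import math

# Toy dictionary with made-up counts (assumption, not real corpus data).
FREQ = {"清华": 50, "大学": 80, "华大": 5, "清华大学": 1}
TOTAL = sum(FREQ.values())


def full_segments(keyword):
    """Full segmentation: every dictionary word that occurs as a substring."""
    n = len(keyword)
    return [keyword[i:j]
            for i in range(n)
            for j in range(i + 1, n + 1)
            if keyword[i:j] in FREQ]


def exact_segments(keyword):
    """Exact segmentation: the most probable non-overlapping split."""
    n = len(keyword)
    # best[i] = (log-probability of the best split of keyword[i:], end of its first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = keyword[i:j]
            # Unknown single characters get a small pseudo-count so a split always exists.
            count = FREQ.get(word, 0.1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    segments, i = [], 0
    while i < n:
        j = best[i][1]
        segments.append(keyword[i:j])
        i = j
    return segments
```

With these toy counts, `exact_segments("清华大学")` yields the two-segment split and `full_segments("清华大学")` additionally yields 华大 and the whole keyword. A real dictionary would use corpus frequencies instead.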

Page 6: Feature Extraction - segment

  • Sparse representation of segments
  • Smoothed TFIDF-based feature
  • N-gram
  • “End-gram”

Page 7: Feature Extraction - TFIDF

  • On this page only: segment = term
  • The definition of [symbol missing] will be given later
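The exact smoothed TFIDF formula from this slide is not preserved here, so the following is only a sketch of one common smoothed variant, with idf = log((1 + N) / (1 + df)) + 1 and L2-normalized vectors. The function name and the choice of smoothing are assumptions, not the team's definition. Each keyword is treated as a document whose terms are its segments.

```python
import math
from collections import Counter


def smoothed_tfidf(docs):
    """docs: list of segment lists. Returns one sparse {segment: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many keywords each segment appears.
    df = Counter(seg for doc in docs for seg in set(doc))
    # Add-one smoothing keeps idf finite and positive (assumed variant).
    idf = {seg: math.log((1 + n_docs) / (1 + d)) + 1.0 for seg, d in df.items()}
    features = []
    for doc in docs:
        tf = Counter(doc)
        vec = {seg: (count / len(doc)) * idf[seg] for seg, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        features.append({seg: w / norm for seg, w in vec.items()})
    return features
```

Storing each vector as a dict keeps the representation sparse, which matters when the segment vocabulary is large.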

Page 8: Feature Extraction - N-gram

  • N-gram: captures some structural information
  • Recall: there are two ways to represent the segmentation of a keyword [symbol missing]: as a set [symbol missing], or as an ordered list [symbol missing] (the one adopted here)
  • 2-gram
  • Limitations: a large character set produces a large keyword set; noise
  • Reduced 2-gram
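A 2-gram over an ordered segment list can be sketched as follows. The function name is hypothetical, and the "reduced 2-gram" is not shown because its definition is not preserved in this transcript.

```python
def segment_ngrams(segments, n=2):
    """Return all runs of n consecutive segments from an ordered segment list."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]
```

Because the input is an ordered list rather than a set, the pairs keep the word order of the keyword, which is the structural information the N-gram feature is meant to capture.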

Page 9: Feature Extraction - End-gram

  • The end-gram is more likely to carry discriminative information
  • Emphasis on the last segment: append a character that does not appear in [symbol missing], e.g. “漢”
  • Examples:
    “rnu209e.tvp2 轴承” (轴承: bearing)
    “hj 系列双锥混合机市场调查报告” (market research report on HJ-series double-cone mixers)
  • Similarly we can define [symbol missing]
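One way to read the end-gram idea is that the last segment gets its own feature token, made distinct from the ordinary segment by an appended marker character. The sketch below follows that reading. The function name is hypothetical, and treating the result as a separate vocabulary entry is an assumption.

```python
END_MARK = "漢"  # marker from the slide; assumed not to occur in any keyword


def end_gram(segments, mark=END_MARK):
    """Return the marked last segment, or None for an empty segment list."""
    if not segments:
        return None
    return segments[-1] + mark
```

For the keyword “rnu209e.tvp2 轴承”, the last segment 轴承 becomes the token 轴承漢, so a classifier can weight "ends with 轴承" separately from "contains 轴承".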

Page 10: Feature Extraction

Where is [symbol missing]? Experiments showed that adding [symbol missing] slightly degrades performance.

Page 11: Keyword-buyer/segment relation

[Diagram: columns of nodes B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]

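Page 12: Keyword-buyer/segment relation

[Diagram: columns of nodes B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]

Class labels attached to each node:
  • Segments: S0: C2; S1: C3; S2: (empty); S3: C2, C3
  • Keywords: K0: C2; K1: (empty); K2: (empty); K3: C3
  • Buyers: B0: C2, C3; B1: (empty); B2: (empty); B3: (empty)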

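Page 13: Keyword-buyer/segment relation

[Diagram: columns of nodes B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]

Class labels attached to each node:
  • Segments: S0: C2; S1: C3, C3; S2: C0; S3: C2, C3, C0, C3
  • Keywords: K0: C2, C0; K1: (empty); K2: C3; K3: C3
  • Buyers: B0: C2, C3; B1: C0, C3; B2: (empty); B3: (empty)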

Page 14: Keyword-buyer relation

  • Assumption: a buyer tends to buy keywords of similar classes
  • From the labeled data, obtain the distribution over classes of the keywords each buyer buys; each buyer thus has a 33-dimensional feature vector
  • For each keyword, its feature vector is the average of the feature vectors of the buyers who buy this keyword
  • Using this feature alone, we get an accuracy of 0.82
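The two steps described on this slide can be sketched as below: first a per-buyer class distribution from labeled keywords, then a per-keyword average over its buyers. The function names and the input shapes (`pairs` as `(keyword, buyer)` tuples, `labels` as a keyword-to-class-index dict) are assumptions for illustration; the real file formats are not specified here.

```python
from collections import defaultdict


def buyer_class_distributions(pairs, labels, n_classes):
    """Class distribution of the labeled keywords each buyer bought."""
    counts = defaultdict(lambda: [0.0] * n_classes)
    for keyword, buyer in pairs:
        if keyword in labels:
            counts[buyer][labels[keyword]] += 1.0
    # Buyers with no labeled keyword never enter `counts`, so sum(v) > 0 here.
    return {b: [c / sum(v) for c in v] for b, v in counts.items()}


def keyword_features(pairs, buyer_dist, n_classes):
    """Average the distributions of all buyers of each keyword."""
    sums = defaultdict(lambda: [0.0] * n_classes)
    n_buyers = defaultdict(int)
    for keyword, buyer in pairs:
        if buyer in buyer_dist:
            for c, p in enumerate(buyer_dist[buyer]):
                sums[keyword][c] += p
            n_buyers[keyword] += 1
    return {k: [s / n_buyers[k] for s in v] for k, v in sums.items()}
```

With 33 classes, `n_classes=33` gives the 33-dimensional vector mentioned above. Note that in this sketch a labeled keyword's own label contributes to its buyers' distributions, so a training setup would probably want to hold that label out.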

Page 15: Keyword-buyer relation

[Diagram: columns of nodes B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]

Page 16: Keyword-buyer relation

  • We also tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships through the segments they share:
    Buyer -> Keyword -> Segment => Buyer -> Segment
  • We further introduced higher-order relations between buyers and keywords, but the improvements were marginal.
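Composing the two links (Buyer -> Keyword and Keyword -> Segment) into a direct Buyer -> Segment relation might look like the following sketch. The function name and the `segments_of` mapping are hypothetical, and using raw counts as the relation strength is an assumption.

```python
from collections import Counter, defaultdict


def buyer_segment_counts(pairs, segments_of):
    """Count how often each segment occurs among the keywords a buyer bought."""
    relation = defaultdict(Counter)
    for keyword, buyer in pairs:
        # Keywords without a known segmentation contribute nothing.
        relation[buyer].update(segments_of.get(keyword, ()))
    return relation
```

Two keywords bought by the same buyer then become comparable through the buyer's segment profile, even if they share no segment directly.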

Page 17: Keyword-segment relation

Reverse the link between keywords and segments:
Keyword -> Segment => Segment -> Keyword
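Reversing the link amounts to building an inverted index from segments to keywords, after which class labels of labeled keywords can be collected per segment. The sketch below shows both steps. The function names are hypothetical, and the example edges in the usage note are assumed, since the diagram edges are not preserved in this transcript.

```python
from collections import defaultdict


def invert_segments(segments_of):
    """Turn {keyword: [segments]} into {segment: [keywords]}."""
    index = defaultdict(list)
    for keyword, segments in segments_of.items():
        for seg in dict.fromkeys(segments):  # drop repeats, keep order
            index[seg].append(keyword)
    return dict(index)


def segment_class_labels(index, labels):
    """Collect the class labels of the labeled keywords containing each segment."""
    return {seg: [labels[k] for k in keywords if k in labels]
            for seg, keywords in index.items()}
```

For instance, with assumed edges K0 -> {S0, S3} and K3 -> {S1, S3} and labels K0: C2, K3: C3, segment S3 collects both C2 and C3. Normalizing these label lists would give each segment a class distribution, analogous to the buyer feature.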

Page 18: Keyword-segment relation

[Diagram: columns of nodes B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]

Page 19: Search Result Utilization

Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/:
  • 1-1828169-5 => 1 / 1828169 / 5
  • 1-1838143-0 => 1 / 1838143 / 0

Their search results, e.g. for 1-1838143-0:

1-1838143-0 全国供货商【 IC37 旗下站】 1-1838143-0 价格 |PDF ... IC 芯片 1-1838143-0 品牌、价格、 PDF 参数 - 电子产品资料 - 买卖 IC 网 PIC16C57-XT/SP145的 IC 、二极管、三极管查询 , 采购 PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0 2000pcs 1005+ 现货 泰科Tyco431829-1 集成电路、连接器、接插件 AMP 欧式背板连接器崧晔达 _ 达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴 供应聚氯乙烯 _连接器 _ 供应聚 崧晔达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴 供应聚氯乙烯 _ 连接器 _ 供应聚氯乙烯批发 _ 供应聚氯乙烯供应 _ 阿里巴巴 上海金庆电子技术有限公司 限位开关 12 福州福铭仪器
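Detecting such keywords with the pattern quoted above is straightforward in Python. In the sketch below the helper names are hypothetical, and splitting on '-' and '_' is an assumption inferred from the two segmentation examples.

```python
import re

# Pattern quoted on the slide: only ASCII letters, digits, '-' and '_'.
WEIRD_KEYWORD = re.compile(r"^[0-9a-zA-Z\-_]{1,}$")


def is_weird(keyword):
    """True if the keyword looks like a serial number or part code."""
    return WEIRD_KEYWORD.match(keyword) is not None


def split_weird(keyword):
    """Split a code-like keyword into its non-empty pieces (assumed rule)."""
    return [part for part in re.split(r"[-_]", keyword) if part]
```

A keyword such as 清华大学 does not match, so only the code-like keywords would be routed to the search-result features.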

Page 20: Search Result Utilization

  • For normal keywords, the keyword itself carries semantic meaning.
  • Keywords with little semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names.
  • The supplementary information from search results yields more accurate results on “weird” keywords.
  • However, these keywords did not seem to be included in the online test.

Page 21: Search Result Utilization

  • Recall: [formula missing]
  • If we add one more term: [formula missing], where [symbol missing] is the search result of [symbol missing]
  • Performance decreased because of the noise introduced
  • Example keyword: “hj 系列双锥混合机市场调查报告”
  • Its search results: “混合设备 HJ 系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 - ... 混合机 - 供应 HJ 系列双锥混合机 - 混合机尽在阿里巴巴 - 常州欧朋干燥 ... HJ 系列双锥混合机厂家 _ 价格 - 食品机械行业网 HJ 系列双锥混合机供应信息 , 常州市步群干燥设备有限公司 HJ 系列双锥混合机 _ 百度百科 HJ 系列双锥混合机 - 常州普耐尔干燥设备有限公司 HJ 系列双锥混合机价格 ( 江苏 常州 )- 盖德化工网 ...”

Page 22: Feature Statistics

  • Dimensionality: 200,000
  • Lower dimensionality brings better generalization ability.

Page 23: Implementation

Life is short, you need Python

Page 24: Model

Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
  • A Library for Large Linear Classification
  • L2-loss logistic regression
  • 33 one-vs-all classifiers, one per class
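The one-vs-all scheme can be illustrated with a small pure-Python stand-in. This is not liblinear's solver. It trains one binary logistic regression per class by stochastic gradient descent with an L2 penalty and predicts the class whose classifier scores highest. The function names and the hyperparameters (`lr`, `l2`, `epochs`) are hypothetical values for a toy example.

```python
import math


def train_one_vs_all(X, y, n_classes, lr=0.5, l2=1e-3, epochs=200):
    """Return one weight vector per class; the bias is the last entry."""
    dim = len(X[0]) + 1
    weights = [[0.0] * dim for _ in range(n_classes)]
    for c, w in enumerate(weights):
        for _ in range(epochs):
            for x, label in zip(X, y):
                xb = list(x) + [1.0]  # append constant bias input
                z = sum(wi * xi for wi, xi in zip(w, xb))
                p = 1.0 / (1.0 + math.exp(-z))
                # Binary target: 1 for class c, 0 for every other class.
                err = p - (1.0 if label == c else 0.0)
                for i, xi in enumerate(xb):
                    w[i] -= lr * (err * xi + l2 * w[i])
    return weights


def predict(weights, x):
    """Pick the class whose binary classifier gives the highest score."""
    xb = list(x) + [1.0]
    scores = [sum(wi * xi for wi, xi in zip(w, xb)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

With 33 classes this trains 33 independent binary models. A dense loop like this would be far too slow for a 200,000-dimensional sparse feature space, which is where a dedicated library such as liblinear comes in.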

Page 25: Experiments and Results

  • We split the labeled data into a training set and a validation set.
  • All following results are local results. Online test results are higher because more training data is used.
  • Because of the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above were employed in our final submission.

Page 26: Experiments and Results

Feature vector constituents                    Accuracy
Keyword-buyer relation                         0.8194
Keyword-segment relation                       0.9019
Keyword-buyer + ([symbol missing] + TFIDF)     0.9537
[symbol missing] + TFIDF                       0.9656
[symbol missing] + TFIDF                       0.9635
[symbol missing] + TFIDF                       0.9725
[symbol missing] + TFIDF                       0.9713

Page 27: Analysis

Page 28: Limitations

Two types of feature:
  • Relation features: use prior knowledge of class labels; low dimension; may be biased toward the training data
  • TFIDF features: no class label information used; high dimension; robust, good generalization ability

However, a simple combination of the two does not work well. Ensemble methods may work around this problem.

Page 29

Thanks!