The 1st China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification


Team ThuFit: 周昕宇, 吴育昕, 任杰, 王禺淇, 罗鸿胤
Advisors: 方展鹏, 唐杰

Tsinghua University Future Internet Interest Team (清华大学未来互联网兴趣团队)

Given:
- Partially labeled keywords
- The first 10 search results for each keyword
- Keyword-buyer relationships

Goal: predict the classes of the unlabeled keywords

Task

keyword_class.txt
- 10,787,584 keywords
- 1,143,928 labeled (10.6%)
- 9,963,062 unique keywords
- 33 classes

keyword_users.txt
- 23,942,643 entries
- Each entry is a keyword-buyer pair

keyword_titles.txt
- 21,575,166 entries, of which only 10,787,583 are non-empty
- Each entry consists of a keyword and its first 10 Baidu search results

Data summary

[Figure: pie chart of the keyword distribution: 11% labeled, 89% unlabeled]

Preprocessing: keyword segmentation

Feature extraction:
- Keyword segments
- Keyword-buyer relation
- Keyword-segment relation
- Search result utilization

Model: liblinear

Approach

Keyword segment
- A sub-string of a keyword
- A semantic unit

Segmentation: break a keyword into a set of segments

Two ways:
- Exact segmentation: 清华大学 => 清华 / 大学
- Full segmentation: 清华大学 => 清华 / 大学 / 华大 / 清华大学

Jieba Chinese word segmentation (结巴中文分词): https://github.com/fxsjy/jieba

Keyword segmentation
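As a rough illustration of full segmentation, the sketch below enumerates every substring of a keyword that appears in a dictionary. It does not use jieba; the function name `full_segmentation` and the four-word `TOY_DICTIONARY` are hypothetical and exist only to reproduce the 清华大学 example.

```python
def full_segmentation(keyword, dictionary):
    """Return every substring of `keyword` that is a dictionary word.

    Segments are listed by start position, then by length.
    """
    segments = []
    for start in range(len(keyword)):
        for end in range(start + 1, len(keyword) + 1):
            piece = keyword[start:end]
            if piece in dictionary:
                segments.append(piece)
    return segments


# Hypothetical toy dictionary; a real segmenter ships a much larger one.
TOY_DICTIONARY = {"清华", "大学", "华大", "清华大学"}

segments = full_segmentation("清华大学", TOY_DICTIONARY)
print(segments)  # ['清华', '清华大学', '华大', '大学']
```

The overlapping segment 华大 is what distinguishes full segmentation from exact segmentation, which would return only non-overlapping pieces.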


- Sparse representation of segments
- Smoothed TFIDF-based feature
- N-gram
- “End-gram”

Feature Extraction - segment

On this slide only: segment = term

Definition of will be given later.

Feature Extraction - TFIDF
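The exact smoothing formula is not preserved here, so the following is only a sketch of one common smoothed TF-IDF variant, with idf(t) = ln((1 + N) / (1 + df(t))) + 1. That formula, the function name `smoothed_tfidf`, and the two toy keywords are assumptions, not the team's definition.

```python
import math
from collections import Counter


def smoothed_tfidf(docs):
    """Compute sparse TF-IDF vectors for a list of term lists.

    Assumed smoothing: idf(t) = ln((1 + N) / (1 + df(t))) + 1,
    where N is the number of documents and df(t) the number of
    documents containing term t.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

    vectors = []
    for terms in docs:
        counts = Counter(terms)
        total = len(terms)
        # Term frequency is normalized by the number of terms in the keyword.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


# Each "document" is the segment list of one keyword (toy data).
vectors = smoothed_tfidf([["清华", "大学"], ["北京", "大学"]])
```

In this toy input 大学 occurs in both keywords, so it receives a lower weight than 清华 or 北京, which each occur in only one.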

N-gram: captures some structural information

Recall: a segmented keyword can be represented in two ways:
- a set
- an ordered list <- we adopt this one

2-gram

Limitations:
- A large character set produces a large keyword set
- Noise

Reduced 2-gram

Feature Extraction - N-gram
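A minimal sketch of 2-gram extraction over the ordered segment list follows. The helper name `segment_ngrams` and the `|` separator are illustrative assumptions, and the reduced 2-gram is not covered because its definition is not given here.

```python
def segment_ngrams(segments, n=2, sep="|"):
    """Join each run of `n` adjacent segments into one n-gram feature."""
    return [sep.join(segments[i:i + n]) for i in range(len(segments) - n + 1)]


bigrams = segment_ngrams(["hj", "系列", "双锥", "混合机"])
print(bigrams)  # ['hj|系列', '系列|双锥', '双锥|混合机']
```

Because every distinct pair of adjacent segments becomes its own feature, the feature space grows quickly, which matches the limitation noted above.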

The end-gram is more likely to carry discriminative information.

Emphasis on the last segment: append a marker character that does not otherwise appear, e.g. “漢”.

Examples:
- rnu209e.tvp2 轴承
- “hj 系列双锥混合机市场调查报告”

Similarly we can define

Feature Extraction - End-gram
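One way to read the end-gram idea is sketched below: the last segment is emitted a second time with the marker character appended, so that it becomes a feature distinct from the same segment in other positions. Treating the end-gram as exactly one trailing segment, and the names `END_MARKER` and `end_gram`, are assumptions.

```python
END_MARKER = "漢"  # assumed not to occur anywhere in the keyword data


def end_gram(segments, marker=END_MARKER):
    """Return the last segment with the marker appended, or None if empty."""
    if not segments:
        return None
    return segments[-1] + marker


segments = ["rnu209e.tvp2", "轴承"]
features = segments + [end_gram(segments)]
print(features)  # ['rnu209e.tvp2', '轴承', '轴承漢']
```

In the bearing example the product code carries little meaning by itself, while the final segment 轴承 names the product type, which is why the last position gets extra weight.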

Where is ?

Experiments showed that, when adding , performance slightly degrades.

Feature Extraction

Keyword-buyer/segment relation

[Figure: graph linking buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3]


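[Figure: graph of buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3, annotated with the class labels reaching each node. S0: C2; S1: C3; S2: none; S3: C2, C3. K0: C2; K1: none; K2: none; K3: C3. B0: C2, C3; B1: none; B2: none; B3: none]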


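[Figure: graph of buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3, annotated with the class labels reaching each node. S0: C2; S1: C3, C3; S2: C0; S3: C2, C3, C0, C3. K0: C2, C0; K1: none; K2: C3; K3: C3. B0: C2, C3; B1: C0, C3; B2: none; B3: none]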

Assumption: a buyer tends to buy keywords of similar classes.
- Obtain, from the labeled data, the class distribution of the keywords each buyer buys.
- Each buyer thus has a 33-dimensional feature vector.
- The feature vector of a keyword is the average of the feature vectors of the buyers who buy that keyword.
- Using only this feature, we get an accuracy of 0.82.

Keyword-buyer relation
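The averaging scheme described for the keyword-buyer relation can be sketched as follows. The function names, the in-memory data layout, and the toy example with 4 classes instead of 33 are assumptions made for illustration only.

```python
from collections import defaultdict


def buyer_class_distributions(keyword_buyers, keyword_labels, n_classes):
    """Class distribution of the labeled keywords each buyer bought."""
    counts = defaultdict(lambda: [0] * n_classes)
    for keyword, buyer in keyword_buyers:
        label = keyword_labels.get(keyword)
        if label is not None:  # unlabeled keywords contribute nothing
            counts[buyer][label] += 1
    return {b: [c / sum(row) for c in row] for b, row in counts.items()}


def keyword_features(keyword_buyers, buyer_dists, n_classes):
    """Average the distributions of all buyers of each keyword."""
    sums = defaultdict(lambda: [0.0] * n_classes)
    n_buyers = defaultdict(int)
    for keyword, buyer in keyword_buyers:
        dist = buyer_dists.get(buyer)
        if dist is None:  # buyer bought no labeled keyword
            continue
        for i, p in enumerate(dist):
            sums[keyword][i] += p
        n_buyers[keyword] += 1
    return {k: [v / n_buyers[k] for v in vec] for k, vec in sums.items()}


# Toy data: buyer B0 bought K0 (class 2), K3 (class 3) and the unlabeled K1.
pairs = [("K0", "B0"), ("K3", "B0"), ("K1", "B0")]
labels = {"K0": 2, "K3": 3}
dists = buyer_class_distributions(pairs, labels, n_classes=4)
features = keyword_features(pairs, dists, n_classes=4)
print(features["K1"])  # [0.0, 0.0, 0.5, 0.5]
```

Note that in this sketch a labeled keyword's own label feeds into its buyers' distributions, so validation keywords should be treated as unlabeled when the distributions are built.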

[Figure: keyword-buyer relation on the graph of buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3]

We also tried to model buyers by the segments of the keywords they bought, and to model keyword-keyword relationships through the segments they share.

Buyer -> Keyword -> Segment => Buyer -> Segment

We further introduced higher-order relations between buyers and keywords, but the improvements were marginal.

Keyword-buyer relation

Reverse the links between segments and keywords:

Keyword -> Segment => Segment -> Keyword

Keyword-segment relation
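Reversing the keyword-to-segment links amounts to building an inverted index. The sketch below covers only that step; how the reversed links are turned into features is not spelled out here, and the function name and toy keywords are hypothetical.

```python
from collections import defaultdict


def invert_keyword_segments(keyword_segments):
    """Map each segment to the list of keywords that contain it."""
    index = defaultdict(list)
    for keyword, segments in keyword_segments.items():
        # dict.fromkeys removes repeated segments while keeping their order.
        for segment in dict.fromkeys(segments):
            index[segment].append(keyword)
    return dict(index)


index = invert_keyword_segments({
    "清华大学": ["清华", "大学"],
    "北京大学": ["北京", "大学"],
})
print(index["大学"])  # ['清华大学', '北京大学']
```

With this index, keywords that share a segment can be reached from one another in two hops, in the same way that keywords sharing a buyer are connected.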

[Figure: keyword-segment relation on the graph of buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3]

Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/:
- 1-1828169-5: 1 1828169 5
- 1-1838143-0: 1 1838143 0

Their search results, for 1-1838143-0:

1-1838143-0 全国供货商【 IC37 旗下站】 1-1838143-0 价格 |PDF ... IC 芯片 1-1838143-0 品牌、价格、 PDF 参数 - 电子产品资料 - 买卖 IC 网 PIC16C57-XT/SP145的 IC 、二极管、三极管查询 , 采购 PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0 2000pcs 1005+ 现货泰科Tyco431829-1 集成电路、连接器、接插件 AMP 欧式背板连接器崧晔达 _ 达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴供应聚氯乙烯 _连接器 _ 供应聚崧晔达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴供应聚氯乙烯 _ 连接器 _ 供应聚氯乙烯批发 _ 供应聚氯乙烯供应 _ 阿里巴巴上海金庆电子技术有限公司限位开关 12 福州福铭仪器

Search Result Utilization
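Detecting such code-like keywords with the pattern quoted above and splitting them into pieces might look like this. Splitting on `-` and `_` is inferred from the two examples, and the helper name is hypothetical.

```python
import re

# Same pattern as quoted on the slide: only digits, ASCII letters, '-' and '_'.
CODE_LIKE = re.compile(r"^[0-9a-zA-Z\-_]{1,}$")


def split_code_keyword(keyword):
    """Split a code-like keyword on '-' and '_'; return None otherwise."""
    if not CODE_LIKE.match(keyword):
        return None
    return [part for part in re.split(r"[-_]", keyword) if part]


print(split_code_keyword("1-1828169-5"))  # ['1', '1828169', '5']
print(split_code_keyword("轴承"))          # None
```

Keywords flagged this way are the ones for which the search result titles are the main source of semantic information.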

For normal keywords, the keyword itself has semantic meaning. Keywords with less semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names. For these “weird” keywords, the supplementary information from search results yields more accurate predictions. However, such keywords did not seem to be included in the online test.


Recall the feature definition. If we add one more term, computed from the search results of the keyword, performance decreases because of the noise introduced.

Example:

“hj 系列双锥混合机市场调查报告” “ 混合设备 HJ 系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 - ... 混合机 -供应 HJ 系列双锥混合机 - 混合机尽在阿里巴巴 - 常州欧朋干燥 ... HJ 系列双锥混合机厂家 _ 价格 - 食品机械行业网 HJ 系列双锥混合机供应信息 , 常州市步群干燥设备有限公司 HJ 系列双锥混合机 _ 百度百科 HJ 系列双锥混合机 - 常州普耐尔干燥设备有限公司 HJ 系列双锥混合机价格 ( 江苏常州 )- 盖德化工网 ...”


Dimensionality: 200,000

Lower dimensionality gives better generalization ability.

Feature Statistics

Life is short, you need Python.

Implementation

Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
- A Library for Large Linear Classification
- L2-regularized logistic regression
- 33 one-vs-all classifiers, one per class

Model
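Liblinear's command-line tools read training data in the sparse LIBSVM text format, one example per line as `label index:value ...` with 1-based feature indices in ascending order. The helper below is a hypothetical sketch of writing one such line from a sparse feature dictionary; it is not the team's actual pipeline.

```python
def to_liblinear_line(label, features):
    """Format one example as 'label index:value ...' in ascending index order.

    `features` maps 1-based integer feature indices to numeric values;
    zero-valued entries are omitted because the format is sparse.
    """
    parts = [str(label)]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


print(to_liblinear_line(3, {5: 0.5, 2: 1.0}))  # 3 2:1 5:0.5
```

A file of such lines can then be passed to liblinear's `train` program, where `-s 0` selects L2-regularized logistic regression and multi-class problems are handled one-vs-rest.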


We split the labeled data into a training set and a validation set. All of the following results are local results; online test results are higher because more training data is used. Due to the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above were used in our final submission.
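A local evaluation of this kind can be sketched as below. The split ratio is not stated, so the 20% validation fraction, the fixed seed, and both function names are assumptions.

```python
import random


def split_labeled(items, validation_fraction=0.2, seed=0):
    """Shuffle labeled items and split them into (training, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predicted, actual):
    """Fraction of validation keywords whose predicted class is correct."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


training, validation = split_labeled(range(10))
print(len(training), len(validation))    # 8 2
print(accuracy([1, 2, 3, 3], [1, 2, 3, 4]))  # 0.75
```

Any feature that depends on labels, such as the buyer class distributions, should be computed from the training part only, otherwise the local accuracy is inflated.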

Experiments and Results

Feature vector constituents        Accuracy
Keyword-buyer relation             0.8194
Keyword-segment relation           0.9019
Keyword-buyer + ( + TFIDF)         0.9537
+ TFIDF                            0.9656
+ TFIDF                            0.9635
+ TFIDF                            0.9725
+ TFIDF                            0.9713


Analysis

Two types of features:

Relation features:
- Use prior knowledge from class label information
- Low dimension
- May be biased toward the training data

TFIDF features:
- No class label information used
- High dimension
- Robust, good generalization ability

However, a simple combination of the two does not work well. Ensemble methods may work around this problem.

Limitations

Thanks!
