Personal information
• Hung-Hsuan Chen 陳弘軒
• PhD, Computer Science and Engineering, the Pennsylvania State University (2008 – 2013)
• MS, BS, Computer Science, National Tsing Hua University (2000 – 2004, 2004 – 2006)
• Recent honors
  - Best paper award, College of Engineering, PSU (2013)
  - Highest F1 score, plagiarism detection competition, PAN (2013)
  - Invited to the Amazon PhD research symposium to present research work at Amazon; single-digit acceptance rate (2013)
  - Travel awards: SIGMOD 2013, ICHI 2013, SBP 2012
Data Science?
From data scientist Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Research interest in general: data analysis and mining
[Figure: publications arranged in a grid by data type (Link Only, Link + Content, Content Only) and task (Ranking Function, Similarity Search, Link Prediction, Info Prop P2P Trans., User Analysis, Text Analysis). Venues shown: DBSocial’13, DMH’13, KDD’12, SAC’13, SBP’12, K-CAP’11, ASONAM’13, JCDL’13, D-Lib’12, JCDL’11 (twice), JCDL’10, WWW’10, MIR’10, ICPADS’05, CLEF’13, MobiDE’09, SIGCOMM’08, TKDD’14, TMIS’14, ECIR’14, AAAI’14, JCDL’14 (twice), WebSci’14, IAAI’14.]
Research experience
• RA, IST, Pennsylvania State University (2008 – now)
  - Large scale text mining + social network analysis
• RA, CSIE, National Taiwan University (08/2011 – 01/2012)
  - Information propagation analysis
• Software engineer intern, Google (05/2010 – 08/2010)
  - Recommender system + user log analysis
• RA, IIS, Academia Sinica (11/2007 – 07/2008)
  - User traffic analysis
• RA, CS, National Tsing Hua University (2004 – 2006)
  - Distributed data stream analysis
Selected recent research
CSSeer
• An open source expert recommender system based on a given digital library
  - Live site (based on CiteSeerX): http://csseer.ist.psu.edu
• The framework is shipped to Dow Chemical
  - Expert discovery based on internal technical reports
• Author disambiguation (by random forest)
  - One romanized name can correspond to different Chinese names: is “Wen-Yi” 雯怡 or 文溢?
  - One author can appear under several name variants: “C. Giles” = “C. Lee Giles” = “Lee Giles” = “C. L. Giles”?
  - Google Scholar suffers from a similar problem
• Keyphrase extraction (by naïve Bayes)
• Expert ranking (by naïve Bayes)
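How to rank experts?
• Rank authors, not documents
  - Text indexing and PageRank-like methods cannot be directly applied
• Ranking efficiency
  - Aggregating author scores on-the-fly is time consuming
  - Compute the scores for unigrams offline
• Ranking quality
  - What is the probability that author a is an expert given a query term q?
  - P(a|q) ∝ Σ_d P(d) P(q|d) P(a|q,d) = Σ_d P(d) P(q|d) P(a|d)
  - The sum runs over documents d; the last step assumes a is independent of q given d, and the omitted factor 1/P(q) is the same for every author, so it does not change the ranking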
[Figure: a current search engine applies a document ranking function to a query term, reordering documents d1, d2, d3, d4, d5 as d3, d2, d5, d1, d4.]
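The author-ranking idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CSSeer's actual implementation: the function names `build_expert_index` and `rank_experts` are hypothetical, P(d) is taken to be uniform, P(q|d) is an unsmoothed term frequency, and P(a|d) splits credit equally among a document's authors. The index is built once offline per unigram, so a query is a lookup plus a sort.

```python
from collections import defaultdict


def build_expert_index(docs):
    """Precompute, offline, score(a, q) = sum_d P(d) * P(q|d) * P(a|d) per unigram q.

    docs: list of dicts with keys "authors" (list of str) and
    "terms" (dict mapping unigram -> count). All estimates are assumptions.
    """
    usable = [d for d in docs if d["authors"] and d["terms"]]
    index = defaultdict(lambda: defaultdict(float))
    for doc in usable:
        p_d = 1.0 / len(usable)                # assumption: uniform document prior
        total = sum(doc["terms"].values())
        p_a_d = 1.0 / len(doc["authors"])      # assumption: equal credit per author
        for term, count in doc["terms"].items():
            p_q_d = count / total              # assumption: unsmoothed term frequency
            for author in doc["authors"]:
                index[term][author] += p_d * p_q_d * p_a_d
    return index


def rank_experts(index, query_term, top_k=10):
    """Return up to top_k (author, score) pairs for one query term, best first."""
    scores = index.get(query_term, {})
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpus (made-up data, for illustration only).
docs = [
    {"authors": ["alice", "bob"], "terms": {"ranking": 2, "graph": 2}},
    {"authors": ["alice"], "terms": {"ranking": 1, "search": 3}},
    {"authors": ["carol"], "terms": {"graph": 4}},
]
index = build_expert_index(docs)
for author, score in rank_experts(index, "ranking"):
    print(author, round(score, 4))
```

Because the scores are proportional to P(a|q) rather than equal to it, they are only meaningful for comparing authors under the same query term.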
Comparison with other expert recommenders
Simulates Google Scholar
Obtain ground truth from: http://arnetminer.org/lab-datasets/expertfinding/
ASCOS
• Discovering similar objects in a network
• Symmetric vs. asymmetric similarity
  - Similarity can be asymmetric
    - Coauthoring behavior: a young researcher might be more interested in collaborating with a strong researcher than vice versa
  - The asymmetric property may reveal the hierarchical relationship between objects
    - Word association network: “fruit” should be the super-class of “banana” and “apple”, but a sub-class of “food”
• Link prediction
  - ASCOS better predicts future collaborations than SimRank and several other state-of-the-art link prediction algorithms
Intuition of ASCOS
• Similarity from i to j is dependent on the similarity score from i’s neighbors to j
  - N(i): the set of neighbors of node i
• Utilize all paths between nodes
• Asymmetric

$$
s_{ij} =
\begin{cases}
\dfrac{c}{|N(i)|} \displaystyle\sum_{k \in N(i)} s_{kj} & \text{if } i \neq j \\
1 & \text{otherwise}
\end{cases}
$$
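The recursive definition above can be evaluated by simple fixed-point iteration. The sketch below is a minimal illustration, not the authors' implementation: the function name `ascos`, the choice c = 0.9 (treated here as a damping constant between 0 and 1), the fixed iteration count, and the convention that a node with no neighbors has score 0 to every other node are all assumptions.

```python
def ascos(neighbors, c=0.9, iterations=200):
    """Iterate s_ij = c / |N(i)| * sum_{k in N(i)} s_kj for i != j, with s_ii = 1.

    neighbors: dict mapping each node to a list of its neighbor nodes.
    Returns a nested dict s where s[i][j] is the similarity from i to j.
    """
    nodes = list(neighbors)
    # Start from the identity: each node is fully similar only to itself.
    s = {i: {j: 1.0 if i == j else 0.0 for j in nodes} for i in nodes}
    for _ in range(iterations):
        new_s = {}
        for i in nodes:
            row = {}
            for j in nodes:
                if i == j:
                    row[j] = 1.0
                elif neighbors[i]:
                    total = sum(s[k][j] for k in neighbors[i])
                    row[j] = c * total / len(neighbors[i])
                else:
                    row[j] = 0.0  # assumption: isolated nodes are similar to nothing
            new_s[i] = row
        s = new_s
    return s


# Toy path graph a - b - c (made-up data, for illustration only).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = ascos(graph)
print("a -> b:", scores["a"]["b"])
print("b -> a:", scores["b"]["a"])
```

On this toy graph the score from a to b differs from the score from b to a, because a has a single neighbor while b splits its weight across two; this is the asymmetry the slide refers to.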
Hierarchical structure inference
The score difference between the neighbor words of “instrument” to the word “instrument” (node i)
The score difference between the neighbor words of “fruit” to the word “fruit” (node i)
Future research directions (and several ongoing projects)
Data science
• Hacking skills in handling big data
  - CiteSeerX, CSSeer, CollabSeer
    - 3 million+ documents
    - 1 million+ authors, 300K+ disambiguated authors
    - 3 billion+ log entries to analyze
  - Google ad logs
    - Several TB per day
• Knowledge in math and stats
  - Various data mining techniques
  - Social network analysis
  - Natural language processing
• Interdisciplinary collaboration
  - Collaborated with Dow Chemical, Alcatel-Lucent, etc.
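Data is changing the world
• WhosCall
  - Telephone number crowd-sourcing
  - Reverse lookup and number identification
• Waze
  - Map crowd-sourcing + Google Map info
  - Automatic road update + real time traffic update
• g0v (零時政府)
  - Taiwan particulate matter pollution map: already used in weather reports on TV news channels
  - Moedict (萌典): Director Lin of the Ministry of Education said, “In terms of applications, it is no longer a question of whether a vendor could build them; these are applications we could never even have imagined.”
• And many others…
  - Netflix, Amazon, Walmart, State Farm, Spotify, Yelp, medical data used in hospitals, etc.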
Potential research projects: IR and DM on MOOC
• MOOC: Massive Open Online Course
• Mining keyphrases from slides + audio lectures
  - Slide texts may not be complete sentences, so POS taggers may not work
  - Slides provide unique style clues for keyphrase extraction
  - Speaker’s voice, tone, and other features may provide other clues for keyphrase extraction
  - Combining the above heterogeneous perspectives to improve performance
• Automatic course topic clustering or classification
• Rely on students’ interactions with MOOC to predict their learning performance
  - Find talented students and slow learners as early as possible
  - Teach each student according to their aptitude (因材施教)
Potential research projects: IR and DM on digital libraries
• Math equation retrieval
  - How to correctly parse equations (from PDF)?
  - How to index equations?
  - Query interface?
• Music score retrieval
  - How to parse music notes?
  - How to index music?
  - Query interface?
• Inferring “meaning” of figures
  - Retrieving x, y labels and the points in a figure could make a search engine more powerful
  - Sample query: what’s the performance of method Y when x = x1?
• “Artificial” paper detection
  - IEEE and Springer withdrew 120 papers
  - Fake papers affect user experience and scientometrics
Teaching
Teaching experience
• Guest lectures in classes
  - Information Retrieval and Search Engines (Spring 2011, Spring 2013, at PSU)
• TA
  - Operating Systems (Fall 2005, at NTHU)
• Guest speaker of various seminars (in addition to conference presentations)
  - Dept of CS, RIT (2014)
  - Amazon (2013)
  - Graduate Exhibitions, PSU (2012, 2013)
  - College of Engineering, PSU (2013)
  - Network Science Seminar, PSU (2013)
Teaching philosophy
• Incorporate research in teaching
  - Students understand the usefulness of what they’ve learned
  - Students understand what’s happening in science
  - Students may bring useful feedback or fresh ideas
• Learning by doing
  - Students may better understand these topics
  - Students usually feel more confident when they implement a concept by themselves
• Competition
  - Online competitions stimulate students’ motivation to think and work hard
Advanced courses I can offer
• Data mining (資料探勘)
  - Overview
  - Evaluation methods
  - Supervised
    - Classification
    - Regression
  - Unsupervised
    - Clustering
    - Density estimation
  - Applications of DM
  - Practical issues
  - Advanced techniques
• Information retrieval and search engines (資料擷取與蒐尋引擎)
  - Overview
  - Retrieval evaluation
  - Concept of documents
  - Text processing
  - Query models and indexing
  - Web crawling and robots.txt
  - Link analysis
Advanced courses I can offer (continued)
• Social network analysis (社群網路分析)
  - Overview
  - Graph theory
  - Properties of real social networks
  - Community detection
  - Link prediction
  - Information propagation
  - Heterogeneous social networks
• Large-scale text document analysis (大規模文字資料分析)
  - Word and document representation
  - Text mining pipeline
  - Fundamentals of NLP
  - Association rules
  - MapReduce framework
  - Keyphrase extraction
  - Duplicate detection
  - Recommender systems
Basic courses I can offer
• Linear algebra (線性代數)
  - Systems of linear equations
  - Vectors and matrices
  - Eigenvalues and eigenvectors
  - Determinants
  - When LA meets DM
    - Matrix decomposition vs recommender systems
    - PCA vs dimension reduction
    - Eigenvector vs PageRank
• Probability and statistics (機率與統計)
  - Intro to probability
  - Random variables and Bayes’ theorem
  - Discrete random variables and PMF
  - Continuous random variables and PDF
  - Joint probability distribution
  - Confidence interval
  - Hypothesis testing
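Other courses I am interested in offering
• Computer programming (計算機程式設計)
• Data structures (資料結構)
• Introduction to database systems (資料庫系統概論)
• Web programming (Web 程式設計)
• Open-source software development in practice (開源軟體開發實務)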