Personal information Hung-Hsuan Chen 陳弘軒 PhD, Computer Science and Engineering, the Pennsylvania State University (2008 – 2013) MS, BS, Computer Science,


Personal information
• Hung-Hsuan Chen 陳弘軒
• PhD, Computer Science and Engineering, the Pennsylvania State University (2008 – 2013)
• BS, MS, Computer Science, National Tsing Hua University (2000 – 2004, 2004 – 2006)
• Recent honors
  - Best paper award, College of Engineering, PSU (2013)
  - Highest F1 score in the plagiarism detection competition, PAN (2013)
  - Invited to the Amazon PhD research symposium to present research work at Amazon; single-digit acceptance rate (2013)
  - Travel awards: SIGMOD 2013, ICHI 2013, SBP 2012


Data Science?

[Figure: the data science Venn diagram, from data scientist Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram]


Research interest in general: data analysis and mining

[Figure: publications arranged in a grid of data type by task.]
Data type: Link Only; Link + Content; Content Only
Task: Ranking Function; Similarity Search; Link Prediction; Info Prop; P2P Trans.; User Analysis; Text Analysis
Venues shown: DBSocial’13, DMH’13, KDD’12, SAC’13, SBP’12, K-CAP’11, ASONAM’13, JCDL’13, D-Lib’12, JCDL’11 (×2), JCDL’10, WWW’10, MIR’10, ICPADS’05, CLEF’13, MobiDE’09, SIGCOMM’08, TKDD’14, TMIS’14, ECIR’14, AAAI’14, JCDL’14 (×2), WebSci’14, IAAI’14


Research experience
• RA, IST, Pennsylvania State University (2008 – now)
  - Large scale text mining + social network analysis
• RA, CSIE, National Taiwan University (08/2011 – 01/2012)
  - Information propagation analysis
• Software engineer intern, Google (05/2010 – 08/2010)
  - Recommender system + user log analysis
• RA, IIS, Academia Sinica (11/2007 – 07/2008)
  - User traffic analysis
• RA, CS, National Tsing Hua University (2004 – 2006)
  - Distributed data stream analysis


Selected recent research


CSSeer
• An open source expert recommender system based on a given digital library
  - Live site (based on CiteSeerX): http://csseer.ist.psu.edu
• The framework has been shipped to Dow Chemical
  - Expert discovery based on internal technical reports
• Author disambiguation (by random forest)
  - The same romanized name can belong to different people: is “Wen-Yi” 雯怡 or 文溢?
  - Are “C. Giles”, “C. Lee Giles”, “Lee Giles”, and “C. L. Giles” the same person?
  - Google Scholar suffers from a similar problem
• Keyphrase extraction (by naïve Bayes)
• Expert ranking (by naïve Bayes)
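The name-variant problem on this slide can be illustrated with a small sketch. It is not the CSSeer method: the slides say only that disambiguation uses a random forest and do not list its features. The function names below (`normalize`, `token_match`, `may_be_same_author`) are hypothetical, and the rule (same last name, given names compatible up to initials and omissions) is an assumption chosen for illustration. A check like this could serve as one pairwise feature or as a candidate filter before a classifier.

```python
def normalize(name):
    """Split a name into lowercase tokens with periods stripped."""
    return [tok.strip(".").lower() for tok in name.split() if tok.strip(".")]


def token_match(a, b):
    """Given-name tokens match if equal, or if one is the initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def is_subsequence(short, long_):
    """True if the tokens of `short` match tokens of `long_` in the same order."""
    remaining = iter(long_)
    return all(any(token_match(s, t) for t in remaining) for s in short)


def may_be_same_author(name_a, name_b):
    """Hypothetical compatibility rule: identical last name, and the shorter
    list of given names is an in-order (initial-tolerant) subset of the longer."""
    a, b = normalize(name_a), normalize(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False
    short, long_ = sorted((a[:-1], b[:-1]), key=len)
    return is_subsequence(short, long_)


if __name__ == "__main__":
    variants = ["C. Giles", "C. Lee Giles", "Lee Giles", "C. L. Giles"]
    for x in variants:
        for y in variants:
            if x < y:
                print(f"{x!r} vs {y!r}: {may_be_same_author(x, y)}")
```

Under this rule “C. Giles” and “Lee Giles” are not directly compatible, although each is compatible with “C. Lee Giles”. String rules alone therefore leave ambiguity, which is one reason to feed such signals into a learned classifier rather than rely on them directly.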


How to rank experts?
• Rank authors, not documents
  - Text indexing and PageRank-like methods cannot be directly applied
• Ranking efficiency
  - Aggregating author scores on-the-fly is time consuming
  - Compute scores for unigram query terms offline
• Ranking quality
  - What is the probability that a is an expert given a query term q?
  - P(a|q) ∝ Σ_d P(d) P(q|d) P(a|q,d) = Σ_d P(d) P(q|d) P(a|d), assuming a is independent of q given d (the constant 1/P(q) does not affect the ranking)

[Figure: a current search engine applies a document ranking function to a query term and returns a ranked list of documents, e.g. d1, d2, d3, d4, d5 reordered as d3, d2, d5, d1, d4.]
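A minimal sketch of the expert score Σ_d P(d) P(q|d) P(a|d), with the per-term scores precomputed offline as the slide suggests. The estimates are assumptions made for illustration, not the CSSeer implementation: P(d) is uniform, P(q|d) is the unsmoothed relative frequency of the term in the document, and P(a|d) is uniform over the document's authors. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_unigram_index(docs):
    """Offline step: for every unigram, accumulate each author's score
    sum_d P(d) * P(q|d) * P(a|d) under the assumptions stated above."""
    index = defaultdict(lambda: defaultdict(float))
    if not docs:
        return index
    p_d = 1.0 / len(docs)  # assumption: uniform document prior
    for doc in docs:
        tokens = doc["text"].lower().split()
        authors = doc["authors"]
        if not tokens or not authors:
            continue
        p_a_given_d = 1.0 / len(authors)  # assumption: authors share credit equally
        for term, count in Counter(tokens).items():
            p_q_given_d = count / len(tokens)  # assumption: no smoothing
            for author in authors:
                index[term][author] += p_d * p_q_given_d * p_a_given_d
    return index


def rank_experts(index, query_term):
    """Online step: look up the precomputed scores and sort authors by score."""
    scores = index.get(query_term.lower(), {})
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus.
toy_docs = [
    {"authors": ["alice", "bob"], "text": "ranking experts with language models"},
    {"authors": ["alice"], "text": "ranking ranking functions"},
    {"authors": ["carol"], "text": "graph similarity search"},
]

if __name__ == "__main__":
    toy_index = build_unigram_index(toy_docs)
    for author, score in rank_experts(toy_index, "ranking"):
        print(author, round(score, 4))
```

The query step is a dictionary lookup plus a sort, so the cost of aggregating over documents is paid once at indexing time. Multi-term queries would need a rule for combining per-term scores, which the slide does not specify.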


Comparison with other expert recommenders

Simulates Google Scholar

Ground truth obtained from: http://arnetminer.org/lab-datasets/expertfinding/


ASCOS
• Discovering similar objects in a network
• Symmetric vs. asymmetric similarity
  - Similarity can be asymmetric
    - Coauthoring behavior: a young researcher might be more interested in collaborating with a strong researcher than vice versa
  - Asymmetric property may reveal the hierarchical relationship between objects
    - Word association network: “fruit” should be the super-class of “banana” and “apple”, but sub-class of “food”
• Link prediction
  - ASCOS better predicts future collaborations than SimRank and several other state-of-the-art link prediction algorithms


Intuition of ASCOS
• Similarity from i to j is dependent on the similarity score from i’s neighbors to j

  $$ s_{ij} = \begin{cases} \dfrac{c}{|N(i)|} \sum_{k \in N(i)} s_{kj} & \text{if } i \neq j \\ 1 & \text{otherwise} \end{cases} $$

  - N(i): the set of neighbors of node i
• Utilize all paths between nodes
• Asymmetric
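The recurrence on this slide (the similarity from i to j is c/|N(i)| times the sum of the similarities from i's neighbors to j, and 1 when i = j) can be solved by simple fixed-point iteration. The sketch below is an illustration under assumptions, not the authors' implementation: the function name `ascos`, the default `c=0.9`, the iteration count, and the toy star graph are all hypothetical, and the graph is treated as unweighted.

```python
def ascos(neighbors, c=0.9, iterations=100):
    """Iterate s_ij = c / |N(i)| * sum_{k in N(i)} s_kj for i != j, with s_ii = 1.

    `neighbors` maps each node to the list of its neighbors.
    `c` is assumed to lie in (0, 1) so that the iteration converges.
    Returns a nested dict s where s[i][j] is the similarity from i to j.
    """
    nodes = list(neighbors)
    # Start from the identity: every node is fully similar only to itself.
    s = {i: {j: 1.0 if i == j else 0.0 for j in nodes} for i in nodes}
    for _ in range(iterations):
        updated = {}
        for i in nodes:
            row = {}
            for j in nodes:
                if i == j:
                    row[j] = 1.0
                elif neighbors[i]:
                    total = sum(s[k][j] for k in neighbors[i])
                    row[j] = c / len(neighbors[i]) * total
                else:
                    row[j] = 0.0  # an isolated node is similar to nothing else
            updated[i] = row
        s = updated
    return s


# Hypothetical toy graph: a hub "h" connected to three leaves.
star = {"h": ["a", "b", "d"], "a": ["h"], "b": ["h"], "d": ["h"]}

if __name__ == "__main__":
    scores = ascos(star)
    print("leaf -> hub:", scores["a"]["h"])
    print("hub -> leaf:", scores["h"]["a"])
```

In this toy graph the leaf-to-hub score is larger than the hub-to-leaf score, because the hub's score toward any one leaf is averaged over all of its neighbors. This is the asymmetry the slide refers to. Dense nested dictionaries keep the sketch short; a large network would call for sparse matrices.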


Hierarchical structure inference

[Figure: score differences between each neighbor word of “instrument” and the word “instrument” (node i).]

[Figure: score differences between each neighbor word of “fruit” and the word “fruit” (node i).]


Future research directions (and several ongoing projects)


Data science
• Hacking skills in handling big data
  - CiteSeerX, CSSeer, CollabSeer
    - 3 million+ documents
    - 1 million+ authors, 300K+ disambiguated authors
    - 3 billion+ log entries to analyze
  - Google ad logs
    - Several TB per day
• Knowledge in math and stats
  - Various data mining techniques
  - Social network analysis
  - Natural language processing
• Interdisciplinary collaboration
  - Collaborated with Dow Chemical, Alcatel-Lucent, etc.


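Data is changing the world
• WhosCall
  - Telephone number crowd-sourcing
  - Reverse lookup and number identification
• Waze
  - Map crowd-sourcing + Google Map info
  - Automatic road update + real time traffic update
• g0v (零時政府)
  - Taiwan particulate matter pollution map: already used in TV news weather reports
  - Moedict (萌典): Director Lin of the Ministry of Education: “In terms of applications, it is no longer a question of whether a vendor could build this; these are applications we could never even have imagined.”
• And many others…
  - Netflix, Amazon, Walmart, State Farm, Spotify, Yelp, medical data used in hospitals, etc.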


Potential research projects: IR and DM on MOOC

• MOOC: Massive Open Online Course
• Mining keyphrases from slides + audio lectures
  - Slide texts may not be complete sentences
    - POS taggers may not work
  - Slides provide unique style clues for keyphrase extraction
  - Speaker’s voice, tone, and other features may provide other clues for keyphrase extraction
  - Combining the above heterogeneous perspectives to improve performance
• Automatic course topic clustering or classification
• Rely on students’ interactions with MOOC to predict their learning performance
  - Find talented students and slow learners as early as possible
  - Teach students according to their aptitude


Potential research projects: IR and DM on digital libraries

• Math equation retrieval
  - How to correctly parse equations (from PDF)?
  - How to index equations?
  - Query interface?
• Music score retrieval
  - How to parse music notes?
  - How to index music?
  - Query interface?
• Inferring “meaning” of figures
  - Retrieving x, y labels and the points in a figure could make a search engine more powerful
  - Sample query: what’s the performance of method Y when x = x1?
• “Artificial” paper detection
  - IEEE and Springer withdrew 120 papers
  - Fake papers affect user experience and scientometrics


Teaching


Teaching experience
• Guest lectures in classes
  - Information Retrieval and Search Engines (Spring 2011, Spring 2013, at PSU)
• TA
  - Operating Systems (Fall 2005, at NTHU)
• Guest speaker at various seminars (in addition to conference presentations)
  - Dept of CS, RIT – 2014
  - Amazon – 2013
  - Graduate Exhibitions, PSU – 2012, 2013
  - College of Engineering, PSU – 2013
  - Network Science Seminar, PSU – 2013


Teaching philosophy
• Incorporate research in teaching
  - Students understand the usefulness of what they’ve learned
  - Students understand what’s happening in science
  - Students may bring useful feedback or fresh ideas
• Learning by doing
  - Students may better understand these topics
  - Students usually feel more confident when they implement a concept by themselves
• Competition
  - Online competitions motivate students to think and work hard


Advanced courses I can offer
• Data mining
  - Overview
  - Evaluation methods
  - Supervised
    - Classification
    - Regression
  - Unsupervised
    - Clustering
    - Density estimation
  - Applications of DM
  - Practical issues
  - Advanced techniques
• Information retrieval and search engines
  - Overview
  - Retrieval evaluation
  - Concept of documents
  - Text processing
  - Query models and indexing
  - Web crawling and robots.txt
  - Link analysis


Advanced courses I can offer
• Social network analysis
  - Overview
  - Graph theory
  - Properties of real social networks
  - Community detection
  - Link prediction
  - Information propagation
  - Heterogeneous social networks
• Large-scale text document analysis
  - Word and document representation
  - Text mining pipeline
  - Fundamentals of NLP
  - Association rules
  - MapReduce framework
  - Keyphrase extraction
  - Duplicate detection
  - Recommender systems


Basic courses I can offer
• Linear algebra
  - Systems of linear equations
  - Vectors and matrices
  - Eigenvalues and eigenvectors
  - Determinants
  - When LA meets DM
    - Matrix decomposition vs recommender systems
    - PCA vs dimension reduction
    - Eigenvector vs PageRank
• Probability and statistics
  - Intro to probability
  - Random variables and Bayes’ theorem
  - Discrete random variables and PMF
  - Continuous random variables and PDF
  - Joint probability distribution
  - Confidence interval
  - Hypothesis testing


Other courses I am interested in offering

• Computer programming
• Data structures
• Introduction to database systems
• Web programming
• Open source software development in practice
