HUMANE INFORMATION SEEKING: GOING BEYOND THE IR WAY
JIN YOUNG KIM @ SNU DCC

Beyond Ranking Optimization, Toward Humane Search (a talk at the SNU Graduate School of Convergence Science and Technology)


Page 1

HUMANE INFORMATION SEEKING: GOING BEYOND THE IR WAY

JIN YOUNG KIM @ SNU DCC

Page 2

Jin Young Kim

• Graduate of SNU EE / Business

• 5th-year Ph.D. student in UMass Computer Science

• Starting as an Applied Researcher at Microsoft Bing

Page 3

Today’s Agenda

• A brief introduction to IR as a research area

• An example of how we design a retrieval model

• Other research projects and recent trends in IR


Page 4

BACKGROUND

An Information Retrieval Primer


Page 5

Information Retrieval?

• The study of how an automated system can enable its users to access, interact with, and make sense of information.

[Diagram: the user issues a query; the system surfaces documents; the user visits them.]

Page 6

IR Research in Context

• Situated between human-interface research and system / analytics research

• Aims at satisfying users' information needs

• Based on large-scale system infrastructure & analytics

• Need for convergence research!

[Diagram: Information Retrieval sits between large-scale system infrastructure, large-scale (text) analytics, and the end-user interface (UX / HCI / InfoViz).]

Page 7

Major Problems in IR

• Matching

• (Keyword) Search: query – document

• Personalized Search: (user+query) – document

• Contextual Advertising: (user+context) – advertisement

• Quality

• Authority / Spam / Freshness

• Various ways to capture them

• Relevance Scoring

• Combination of matching and quality features (see the sketch below)

• Evaluation is critical for optimal performance


[Diagram: the user issues a query; the system surfaces documents; the user visits them.]
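The "Relevance Scoring" bullet above can be made concrete as a linear combination of matching and quality features. Below is a minimal Python sketch; the feature names and weights are illustrative assumptions, not values from the talk.

```python
def relevance_score(matching_features, quality_features, weights):
    """Combine matching features (query-document match) and quality features
    (authority, freshness, etc.) into a single relevance score.
    All feature names and weights here are illustrative."""
    features = {**matching_features, **quality_features}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical example: strong term match, modest authority.
score = relevance_score(
    matching_features={"bm25": 7.2, "title_match": 1.0},
    quality_features={"authority": 0.4, "freshness": 0.8},
    weights={"bm25": 0.5, "title_match": 1.5, "authority": 2.0, "freshness": 0.7},
)
```

In practice the weights are learned, and evaluation (as the slide notes) is what makes tuning such a combination possible.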

Page 8

HUMANE INFORMATION RETRIEVAL

Going Beyond the IR Way

Page 9

You need the freedom of expression. You need someone who understands.

Information seeking requires communication.

Page 10

Information Seeking circa 2012

The search engine accepts keywords only. The search engine doesn't understand you.

Page 11

Toward Humane Information Seeking

• Rich User Modeling: Profile, Context, Behavior

• Rich User Interactions: Search, Browsing, Filtering

Page 12

The HCIR Way: Rich User Modeling, from Query to Session

[Diagram: USER and SYSTEM exchange repeated actions and responses. User actions (filtering / browsing, relevance feedback) accumulate into an interaction history that feeds a user model (profile, context, behavior); the system responds with filtering conditions and related items. The IR way covers a single rich user interaction; the HCIR way models the whole session.]

HCIR = HCI + IR

Page 13

The Rest of the Talk…

• Personal Search: improving search and browsing for known-item finding; evaluating interactions combining search and browsing

• Web Search: user modeling based on reading level and topic; providing non-intrusive recommendations for browsing

• Book Search: analyzing interactions combining search and filtering

Page 14

PERSONAL SEARCH

Retrieval and Evaluation Techniques for Personal Information [Thesis]

Page 15

Example: Desktop Search

Example: Search over Social Media

• Ranking using Multiple Document Types for Desktop Search [SIGIR10]

• Evaluating Search in Personal Social Media Collections [WSDM12]

Page 16

Structured Document Retrieval: Background

• Field Operator / Advanced Search Interface

• User’s search terms are found in multiple fields

Understanding Re-finding Behavior in Naturalistic Email Interaction Logs. Elsweiler, D., Harvey, M., Hacker, M. [SIGIR'11]

Page 17

Structured Document Retrieval: Models

• Document-based Retrieval Model

• Score each document as a whole

• Field-based Retrieval Model

• Combine evidence from each field (see the sketch below)

[Diagram: document-based scoring matches query terms q1…qm against the document as a whole; field-based scoring matches them against each field f1…fn and combines the per-field scores with weights w1…wn.]
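To make the two scoring styles concrete, here is a minimal Python sketch assuming unigram language models with Jelinek-Mercer smoothing. The function names, the smoothing choice, and the fixed-weight mixture are illustrative assumptions, not the talk's exact formulation.

```python
from collections import Counter

def lm_prob(term, tokens, collection_tf, collection_len, lam=0.5):
    """P(term | text) with Jelinek-Mercer smoothing against collection statistics."""
    tf = Counter(tokens)
    p_text = tf[term] / max(len(tokens), 1)
    p_coll = collection_tf.get(term, 0) / max(collection_len, 1)
    return lam * p_text + (1 - lam) * p_coll

def document_based_score(query, doc_fields, collection_tf, collection_len):
    """Score the document as a whole: concatenate fields, multiply term probabilities."""
    all_tokens = [t for tokens in doc_fields.values() for t in tokens]
    score = 1.0
    for q in query:
        score *= lm_prob(q, all_tokens, collection_tf, collection_len)
    return score

def field_based_score(query, doc_fields, field_weights, collection_tf, collection_len):
    """Mixture of field language models: per term, a fixed-weight sum of field scores."""
    score = 1.0
    for q in query:
        score *= sum(field_weights[f] * lm_prob(q, tokens, collection_tf, collection_len)
                     for f, tokens in doc_fields.items())
    return score
```

Note that in the field-based scorer the weights are fixed per field and shared by all query terms; the Field Relevance Model introduced on the next slides relaxes exactly this.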

Page 18


Improved Matching for Structured Documents: Email Search [CIKM09, ECIR09, ECIR12]

• Field Relevance

• A different field matters for a different query term: 'james' is relevant when it occurs in <to>; 'registration' is relevant when it occurs in <subject>

Page 19

Estimating the Field Relevance

• If User Provides Feedback

• Relevant documents provide sufficient information

• If No Feedback is Available

• Combine field-level term statistics from multiple sources (sketched below)

[Diagram: field-level term statistics from the collection and from the top-k retrieved documents are combined to approximate the statistics of truly relevant documents (content, title, from/to fields).]
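A rough sketch of the no-feedback case: per query term, combine field-level term statistics from the collection with those from the top-k retrieved documents, then normalize across fields to get a field-relevance distribution. The data layout and the mixing weight are assumptions for illustration.

```python
def field_relevance(term, coll_stats, topk_stats, mix=0.5):
    """Estimate P(field | term) by mixing two sources of field-level term
    statistics and normalizing over fields.
    coll_stats / topk_stats: dict field -> {term: P(term | field)}."""
    combined = {f: mix * coll_stats[f].get(term, 0.0)
                   + (1 - mix) * topk_stats[f].get(term, 0.0)
                for f in coll_stats}
    total = sum(combined.values()) or 1.0
    return {f: v / total for f, v in combined.items()}
```

For a query like 'james registration', this should ideally put most of the mass for 'james' on the to/from fields and most of the mass for 'registration' on the subject field, as in the Page 18 example.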

Page 20

Retrieval Using the Field Relevance

• Comparison with Previous Work

• Ranking in the Field Relevance Model (sketched below)

[Diagram: previous work applies fixed field weights w1…wn shared by all query terms; the Field Relevance Model uses per-term field weights P(F1|q1)…P(Fn|qm). For each query term, per-field scores are multiplied by these weights and summed; the resulting per-term scores are then multiplied together.]
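Putting the pieces together, a minimal sketch of ranking with the Field Relevance Model, following the diagram's sum-then-multiply structure: per query term, per-field scores are weighted by P(F|q) and summed, and the per-term results are multiplied. The helper signatures are assumed to look like the earlier sketches; the actual model has estimation details not shown here.

```python
def frm_score(query, doc_fields, field_relevance_fn, term_prob_fn):
    """Field Relevance Model scoring (sketch).
    field_relevance_fn(q) -> {field: P(field | q)}  (per-term field weights)
    term_prob_fn(q, tokens) -> smoothed P(q | field text)"""
    score = 1.0
    for q in query:
        weights = field_relevance_fn(q)
        score *= sum(weights.get(f, 0.0) * term_prob_fn(q, tokens)
                     for f, tokens in doc_fields.items())
    return score
```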

Page 21

Evaluating the Field Relevance Model

• Retrieval Effectiveness (Metric: Mean Reciprocal Rank)

          DQL    BM25F  MFLM   FRM-C  FRM-T  FRM-R
  TREC    54.2%  59.7%  60.1%  62.4%  66.8%  79.4%
  IMDB    40.8%  52.4%  61.2%  63.7%  65.7%  70.4%
  Monster 42.9%  27.9%  46.0%  54.2%  55.8%  71.6%

[Chart: the same numbers plotted per collection. DQL, BM25F, and MFLM use fixed field weights; the FRM variants use per-term field weights.]
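For reference, the metric behind the table: Mean Reciprocal Rank averages, over queries, the reciprocal of the rank at which the first relevant document appears. A minimal implementation:

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """rankings: one ranked list of doc ids per query.
    relevant_sets: one set of relevant doc ids per query."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# First relevant result at ranks 1 and 2 gives MRR = (1.0 + 0.5) / 2.
assert mean_reciprocal_rank([["a", "b"], ["c", "d"]], [{"a"}, {"d"}]) == 0.75
```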

Page 22

Evaluation Challenges for Personal Search

• Evaluation of Personal Search

• Each prior study based on its own user study

• No comparative evaluation performed yet

• Solution: Simulated Collections

• Crawl CS department webpages, documents, and calendars

• Recruit department members for a user study

• Collecting User Logs

• DocTrack: a human-computation search game

• Probabilistic User Model: a method for user simulation

[CIKM09, SIGIR10, CIKM11]

Page 23

DocTrack Game

[Screenshot: the DocTrack interface shows a target item and asks players to find it ("Find It!").]

Page 24

Summary so far…

• Query Modeling for Structured Documents

• Using the estimated field relevance improves retrieval

• The user's feedback can help personalize the field relevance

• Evaluation Challenges in Personal Search

• Simulation of the search task using game-like structures

• Related work: 'Find It If You Can' [SIGIR11]

Page 25

WEB SEARCH

Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

[WSDM12]

Page 26

Reading level distribution varies across major topical categories

Page 27

User Modeling by Reading Level and Topic

• Reading Level and Topic

• Reading Level: proficiency (comprehensibility)

• Topic: topical areas of interest

• Profile Construction (sketched below)

• Profile Applications

• Improving personalized search ranking

• Enabling expert content recommendation

[Diagram: per-document distributions P(R|d) and P(T|d) are aggregated over a user's documents into a joint user profile P(R,T|u).]
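A minimal sketch of profile construction under one simplifying assumption: reading level and topic are treated as independent within each document, so each visited document contributes P(R|d) * P(T|d). The paper's exact estimator may differ.

```python
from collections import defaultdict

def build_user_profile(visited_docs):
    """Aggregate per-document distributions into a joint profile P(R,T|u).
    visited_docs: list of (p_reading, p_topic) dict pairs, one per document."""
    profile = defaultdict(float)
    for p_reading, p_topic in visited_docs:
        for r, pr in p_reading.items():
            for t, pt in p_topic.items():
                profile[(r, t)] += pr * pt  # independence assumed within a doc
    total = sum(profile.values()) or 1.0
    return {rt: v / total for rt, v in profile.items()}
```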

Page 28

Profile matching can predict the user's preference over search results

• Metric

• % of the user's preferences predicted by profile matching

• Profile matching measured by the KL-divergence of RT profiles (see the sketch below)

• Results, by the degree of focus in the user profile and by the distance metric between user and website:

  User Group   #Clicks   KLR(u,s)  KLT(u,s)  KLRT(u,s)
  ↑Focused     5,960     59.23%    60.79%    65.27%
               147,195   52.25%    54.20%    54.41%
  ↓Diverse     197,733   52.75%    53.36%    53.63%
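A sketch of the matching step: compute the KL-divergence between the user's RT profile and each site's RT profile, and predict a preference for the closer one. The epsilon smoothing is an assumption to keep the divergence finite.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over a shared outcome space, e.g. (reading level, topic) pairs."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0.0)

def predicted_preference(user_profile, site_a, site_b):
    """Predict the user prefers the site whose profile is closer (lower KL)."""
    return "a" if kl_divergence(user_profile, site_a) < kl_divergence(user_profile, site_b) else "b"
```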

Page 29

Comparing Expert vs. Non-expert URLs

• Expert vs. non-expert URLs taken from [White'09]

[Chart: expert URLs show a higher reading level and lower topic diversity.]

Page 30

Enabling Browsing for Web Search

• SurfCanyon®

• Recommends results based on clicks

Initial results indicate that recommendations are useful in the shopping domain. [Work-in-progress]

Page 31

BOOK SEARCH

Understanding Book Search Behavior on the Web


[Submitted to SIGIR12]

Page 32

Understanding Book Search on the Web

• OpenLibrary

• User-contributed online digital library

• Dataset: 8M records from the web server log


Page 33

Comparison of Navigational Behavior

• Users entering directly show different behaviors from users entering via web search engines

[Charts: navigational behavior of users entering the site directly vs. users entering via Google.]

Page 34

Comparison of Search Behavior

Rich interaction reduces query lengths. Filtering induces more interactions than search.

Page 35

LOOKING ONWARD


Page 36

Where’s the Future? – Social Search

• The New Bing Sidebar makes search a social activity.


Page 37

Where’s the Future? – Semantic Search

• The New Google serves ‘knowledge’ as well as docs.


Page 38

Where’s the Future? – Siri-like Agent

• Siri-like agents let users ask in natural language and act on the answer.


Page 39

An Exciting Future is Awaiting Us!

• Recommended Readings in IR:

• http://www.cs.rmit.edu.au/swirl12

Any Questions?

Page 40

Selected Publications

• Structured Document Retrieval

• A Probabilistic Retrieval Model for Semi-structured Data [ECIR09]

• A Field Relevance Model for Structured Document Retrieval [ECIR11]

• Personal Search

• Retrieval Experiments using Pseudo-Desktop Collections [CIKM09]

• Ranking using Multiple Document Types in Desktop Search [SIGIR10]

• Building a Semantic Representation for Personal Information [CIKM10]

• Evaluating an Associative Browsing Model for Personal Info. [CIKM11]

• Evaluating Search in Personal Social Media Collections [WSDM12]

• Web / Book Search

• Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic [WSDM12]

• Understanding Book Search Behavior on the Web [In submission to SIGIR12]

More at @lifidea, or cs.umass.edu/~jykim

Page 41

My Self-tracking Efforts

• Life-optimization Project (2002-2006)

• LiFiDeA Project (2011-2012)

Page 42

OPTIONAL SLIDES


Page 43

The Great Divide: IR vs. HCI

IR

• Query / Document

• Relevant Results

• Ranking / Suggestions

• Feature Engineering

• Batch Evaluation (TREC)

• SIGIR / CIKM / WSDM

HCI

• User / System

• User Value / Satisfaction

• Interface / Visualization

• Human-centered Design

• User Study

• CHI / UIST / CSCW


Can we learn from each other?

Page 44

The Great Divide: IR vs. RecSys

IR

• Query / Document

• Reactive (given query)

• SIGIR / CIKM / WSDM

RecSys

• User / Item

• Proactive (push item)

• RecSys / KDD / UMAP


Page 45

The Great Divide: IR in CS vs. LIS

IR in CS

• Focus on ranking & relevance optimization

• Batch & quantitative evaluation

• SIGIR / CIKM / WSDM

• UMass / CMU / Glasgow

IR in LIS

• Focus on behavioral study & understanding

• User study & qualitative evaluation

• ASIS&T / JCDL

• UNC / Rutgers / UW

Page 46

Problems & Techniques in IR

• What

• Data: format (documents, records, and linked data), size, dynamics (static, dynamic, streaming)

• User & Domain: end users (web and library), business users (legal, medical, and patent), system components (e.g., IBM Watson)

• Needs: known-item vs. exploratory search, recommendation

• How

• System: indexing and retrieval (platforms for big-data handling)

• Analytics: feature extraction, retrieval model tuning & evaluation

• Presentation: user interface, information visualization

Page 47

More about the Matching Problem

• Finding Representations

• Term vector vs. Term distribution

• Topical category, Reading level, …

• Estimating Representations

• By counting terms

• Using automatic classifiers

• Calculating Matching Scores (see the sketch below)

• Cosine similarity vs. KL-divergence

• Combining multiple representations

[Diagram: the user issues a query; the system surfaces documents; the user visits them.]
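To close the loop on the "Calculating Matching Scores" bullet: cosine similarity suits term-vector representations, while KL-divergence (sketched on Page 28) suits term distributions. A minimal cosine implementation over sparse vectors, for illustration:

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two sparse term vectors (dict: term -> weight)."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```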