Intelligent Photo Retrieval Using Commonsense Semantics Hugo Liu Software Agents Group MIT Media Laboratory [email protected] [email protected] PARC

Intelligent Photo Retrieval Using Commonsense Semantics

Hugo LiuSoftware Agents GroupMIT Media [email protected]

PARC talk08 Apr 2002Palo Alto, CA

Overview

Introduction to ARIAStorytelling with photos (i.e. email, web page)

Automated Learning of Photo AnnotationsBy observation of user’s behaviorParsing, IE heuristics, identifying semantic roles

Proactive Photo RetrievalFuzzy match of user’s current textual context to annotated

photosUse of commonsense reasoning to make retrieval more

robust

Organizing Digital Photos:A Common Approach

TasksRename the JPG fileOrganize it into a directory

DrawbacksManual annotations are time-consumingMay be hard to find relevant photos later

Organizing Digital Photos:Using Software Tools

Kodak’s Picture Easy SoftwareUsers create albums (hierarchical)

DrawbacksUpfront cost imposed on the userPhotos must be put under only one category

category types: places, social grouping, eventsHard to choose categoriesHard to find photos, recall categories, etc.

Why not use annotations?

An annotation is:A keyword associated with an image

Each image can have multiple annotations

Compared with directory structure:A directory represents one classification of a

photoi.e. event, geographical location, etc

Annotations allow for multiple “classifications”

ARIA: Annotation and Retrieval Integration Agent

Semi-automated annotation and retrieval

Allows user to do image-related taski.e.: emailing photos and posting web pagesi.e.: storytelling

AnnotationUses description from text to annotate photos

RetrievalProactively suggests possibly relevant photos

The User Interface

Email Headers

Text Pane

Photo Pane (ordered by relevance)

Double-click to edit annotations

A Scenario of Annotation

John has just returned from a friend’s wedding.

He wants to share his experience with his friends through pictures that he took.

He opens ARIA and the photos from his trip are automatically entered in his ARIA shoebox

He begins to write an email to his friend Jane:

To: [email protected]: My wonderful weekend

Hi Jane!

Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.

Regards, John

A Baseline Approach

Extract all keywords from text surrounding photo

Jane Last Weekend Went Ken Mary’s Wedding San Francisco Picture Gazebo Doesn’t make sense!

Rather brittle!

Our approach

Use concepts, not keywordsIdentify semantically important

conceptsWho: Ken, Mary

What: wedding

When: June 12, 2001, 1pm

Where: gazebo, San Francisco, CA

How did ARIA learn this annotation?

“Hi Jane!

Last weekend, I went to Ken and Mary’s wedding in

San Francisco.

Here’s a picture of the two of them at the gazebo.

The date that this event occurred

The two people pictured

The event

The place

The place

Extracting Semantic Roles from Text

What Understanding Agents are Needed?

“Hi Jane!

Last weekend, I went to Ken and Mary’s wedding in

San Francisco.

Here’s a picture of the two of them at the gazebo.

Temporal UA needs to recognize dateand time expressions

People UA needs KB of names and roles (e.g. brother, tour guide)

Events UA needs events KB

Place UA needs places KB

Anaphora resolutionContext attacher (links previous paragraph)

ARIA’s internal representation of text

Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.

ARIA_DATE_INTERVAL{2001-6-10,2001-6-12} , [PERSON $user] [ACTION go] [TRAJECTORY to [PERSON Ken] and [PERSON Mary] ’s [EVENT wedding] [LOCATION in [PLACE San Francisco]] [SBR .] [PLACE Here] [STATE is] [THING a picture [PROPERTY of [PERSON the two of them] [LOCATION at [PLACE the gazebo]]]] SBR

Mechanism for information extraction

Tokenize and NormalizeAnaphora Resolution / Context AttacherTemporal UA runsPOS Tagging

TBL Tagger (Brill, 1994) on Penn Treebank tagset

Syntactic ParsingLink Grammar (Sleator & Temperley, 1993)outputs a constituent parse

Semantic ProcessingCombines POS info, runs semantic UAssyntactic parse tree semantic parse treeontology of tags for event structure (Jackendoff, 1983, 1990)Not as deeper as deep semantic understanding i.e. UNITRAN (Dorr, 1992), but not as brittle

Frame Extraction

ARIA_DATE_INTERVAL{2001-6-10,2001-6-12} , [PERSON $user]

[ACTION go] [TRAJECTORY to

[PERSON Ken] and [PERSON Mary] ’s [EVENT wedding]

[LOCATION in [PLACE San Francisco]] [SBR .] [PLACE Here] [STATE is] [THING a picture

[PROPERTY of [PERSON the two of them] [LOCATION at

[PLACE the gazebo]]]] SBR

Hold On a Second…

The pragmatist says:Why go through all this effort to parse text? Why not just match

against your dictionary of PLACES, EVENTS, NAMES?

Response That would make a great baseline Getting the constituent structure is helpful in:

Learning bindings useful, e.g. “Here is a picture of George, who was our priest and also a great family friend.”

Exploit structure to induce semantic importance “Here is my wife posing with foobar in front of the Eiffel

Tower

Retrieval

As user types, the photos in the shoebox dynamically reorder (based on relevance)

A baseline approach:“exact matching”Problem is.. not robust

morphological variations (e.g. sisters =/= sister)semantic connections missed(e.g. wedding =/= bride)

Our Approach

Morphological variations easy to solveLemmatize words… e.g. married marry, brides bride

Semantic ConnectionsA good candidate for commonsense reasoning

1. In consumer photography, events and situations tend to be somewhat stereotyped and predictable:

brides are usually found at weddingsa birthday cake may be found at a birthday partypicnics may be located in parks

(note the granularity)2. So, given the annotation “bride”, also add “wedding”

Expand Annotations using Commonsense

Each annotation undergoes common sense expansion, which produces other semantically related annotations. For example:

>>> expand(“bride”)('wedding', '0.3662') ('woman', '0.2023')('ball', '0.1517') ('tree', '0.1517')('snow covered mountain', '0.1517')('flower', '0.1517') ('lake', '0.1517')('cake decoration', '0.1517') ('grass', '0.1517')('groom', '0.1517') ('tender moment', '0.1517')('veil', '0.1517') ('tuxedo', '0.1517')('wedding dress', '0.1517') ('sky', '0.1517')('hair', '0.1517') ('wedding bouquet', '0.1517')

Something that’s not perfect or complete can still be useful!

Retrieval Steps

As user types, use most recent n words to rerank photos by relevance

Fuzzy match n words to annotated photos (expanded with commonsense)

Quality of match determined by simple scoring function

Run temporal UA on n words for temporal retrieval against photo timestamp

More on Commonsense

Source of Common sense Open Mind Common Sense

(OMCS), (Singh, 2002)

OMCS is:Publicly acquired common

knowledge about the worldHas about 400,000

commonsense factsRepresented as semi-

structured English sentences

Examples of OMCS entries

Well-constrained (NP/AP slots)Something you may find in a restaurant is a

waiter.Something that might come after a wedding is a

wedding reception.

Poorly ConstrainedSomething that might happen in a play is you

forget your lines.

More on Open Mind

Comparision:CYC (Lenat, 1998)

1 million hand crafted assertionsrepresented more formally, i.e. as logic

OMCS advantagesPublicly and freely availableLess granular (i.e. more knowledge about social-

level interactions, versus, “to open a door, grasp the door knob and turn it”)

Easy to add to (e.g. personal commonsense)

Open Mind Caveats

More ambiguity than CYCWord senses not disambiguated

Coverage is uneven, and spotty at times

Culturally specific (CYC is too)i.e.: common knowledge for middle-class

americansWedding scenario is a perfect example

Open Mind Redeeming Qualities

Good granularity for our problem domaini.e.: events, social composition, etc.

Many simple, binary relationsOur approach is tolerant to noise

Each commonsense annotation doesn’t count muchIt combines a lot of “partial evidence”

Adaptive photo suggestion is “soft” (AI)Unlike “hard” uses like question answering, and general

problem solving, photo suggestion isn’t mission-critical

Expanding annotations: Under the hood

Use OMCS to build a semantic resourceChoose spreading activation network

Nodes are concepts, edges are semantic relation

e.g.: bride and wedding are connected through foundAt edge

SAN applications to IR: Salton & Buckley (1988)

Enables shallow commonsense reasoning (only applying transitivity inference pattern)

On-the-fly photo suggestion demands efficiency

Building the graph

Sample mapping rule from OMCS to pred-arg structure:

somewhere THING1 can be is PLACE1somewherecanbeTHING1, PLACE10.5, 0.1

Differentiate Forward/Backward weightSmallbig (e.g. bride wedding) preferred

20 mapping rules applied to get 50,000 pred/arg structures (mostly binary)

Mapping Rules

Mapping Rules target this subset of OMCS:

Classification: A dog is a pet

Spatial: San Francisco is part of California

Scene: Things often found together are: restaurant, food, waiters, tables, seats

Purpose: A vacation is for relaxation; Pets are for companionship

Causality: After the wedding ceremony comes the wedding reception.

Emotion: A pet makes you feel happy; Rollercoasters make you feel excited and scared.

Cleaning Args in Pred/Args

Apply a sieve grammar to normalize concepts NOUN PHRASE: (PREP) (DET|POSS-PRON) NOUN (PREP) (DET|POSS-PRON) NOUN NOUN (PREP) NOUN POSS-MARKER (ADJ) NOUN (PREP) (DET|POSS-PRON) NOUN NOUN NOUN (PREP) (DET|POSS-PRON) (ADJ) NOUN PREP NOUN ACTIVITY PHRASE: (PREP) (ADV) VERB (ADV) (PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN (PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN NOUN

Examples: “to cook some dinner” “cook dinner”, “in the playground” “playground”

Building concept node graph

30,000 concept nodes80,000 pairs of directed edgesAverage branching factor is 5

0.2

bride

weddinggroom

0.5

0.1

0.2

0.5

0.1FoundTogether PartOf

PartOf

Spreading activation algorithm

Origin node is annotation, activates with energy=1

Next, nodes 1 hop away are activated, then 2 hops, etc

Nodes will activate if energy meets threshold (e.g. 0.2)Because edges are a measure of supposed semantic

connectedness, lower energy less semantic relevance

Given edge (A,B), activation score of B:)),((*)()( BAedgeweightAASBAS

Reweighting edges

Two opportunities to reweight edges to improve semantic relevanceReinforcementInverse Popularity

First, ReinforcementSimple IdeaNodes connected thru multiple paths are more semantically relatedApplies to k-clusters

(mutually reinforced)A CP PB

Q

Inverse Popularity

Punish a node for having too many children Often symptomatic of word sense collision

(since OMCS is not sense disambiguated)

Sense collisions lead to noise Discount by equation (above-right)

bridesmaid

bride groom

to wedding

to wedding

to beauty

concepts

conceptsconcepts

(bf = 2)

(bf = 129)

)*log(

1

*

actorbranchingFdiscount

discountoldWeightnewWeight

More Example Expansions

>>> expand('london')('england', '0.9618') ('ontario', '0.6108')('europe', '0.4799') ('california', '0.3622')('united kingdom', '0.2644') ('forest', '0.2644')('earth', '0.1244') >>> expand(“symphony”)('concert', '0.5') ('piece of music', '0.4')('theatre', '0.2469')('screw', '0.2244') ('concert hall', '0.2244')('xylophone', '0.1') ('harp', '0.1')('viola', '0.1') ('cello', '0.1')('wind instrument', '0.1') ('bassoon', '0.1')('bass fiddle', '0.1')

Finally, the Demo

Evaluation

Full Evaluation completed on ARIA 1.0 at KodakUsers liked it better than Kodak PictureEasy

Only Informal Evaluation of ARIA 2.0 (with NL and CS)Interesting findingsWhen CS expansion helped, user had positive commentsMutual adaptationWhen CS expansion suggests irrelevant photo, users still thought

computer was being helpful, attributed intentionality to ARIAValidates this as “fail-soft” problem

Related Work

State-of-the-art: FotoFile (Kuchinsky et al, 1999)DOES some image analysis to propose annotationsDOESN’T do observational learning of annotations from text

Watson (Budzik and Hammond, 2000)DOES observe user actions, and automates retrievalDOESN’T learn annotationsNEITHER does adaptive retrieval of semantically connected

photos

OMCS vs. WordNet (Fellbaum, 1998), thesauriOMCS operates on a concept level (e.g. activity) vs. WordNet on

a lexical levelWordNet gives small set of nymic relations, OMCS offers more

variety of relations, describing the real world.

Future Work

Learning personal commonsense to build a user model e.g.: “…Mary, my best friend, …” “Mary is user’s best

friend”

Bulk annotationsAs photo shoebox grows, need to cluster annotations by

time, event (can be done heuristically)Propagate annotations to other photos in cluster

e.g. propagate “european vacation” to all photos from vacation

Pointers, etc.

Access to papers: google for “hugo liu”

Related Projects:GOOSE: A Goal-Oriented Search Engine with Commonsense (AH-2002)MAKEBELIEVE: Using commonsense to generate stories (AAAI-2002)Natural Language Agent for Automatic Meeting Scheduling through Emails (unpublished)

Documents

Intelligent Photo Retrieval Using Commonsense Semantics Hugo Liu Software Agents Group MIT Media Laboratory [email protected] [email protected] PARC