SISTEMI PER LA RICERCASISTEMI PER LA RICERCADI DATI MULTIMEDIALI
Cl di L h S l t O l dClaudio Lucchese e Salvatore OrlandoTerzo Workshop di Dipartimento
Dipartimento di InformaticaUniversità degli studi di Venezia
Ringrazio le seguenti persone, coinvolte nel progetto SAPIR
Raffaele PeregoFausto Rabitti
Paolo BolettieriFabrizio Falchi
Tommaso PiccioliMatteo Mordacchini
SISTEMI P
WHAT AND, MORE IMPORTANTLY, WHY ??!
M lti M di Obj t
PER LA R
ICE
Multi-Media Objects:text, web pages (Google, Yahoo!, …)i (Fli k b ht b Y h ! )
ERC
A DI D
AT
images (Flickr bought by Yahoo!, …)video (YouTube bought by Google, …)audio
TI MU
LTIMEDaudio
etc. ...Why we are interested in MM objects ?
DIA
LI
Why we are interested in MM objects ? 1 image every 10 documents.Increasing multimedia content on the webIncreasing multimedia content on the web.Yet, search in audio-visual content is limited to associated text and metadata annotations. 33
SISTEMI P
OK, BUT GOOGLE ALREADY DOES THE JOB ...
PER LA R
ICE
We mean to search by content !
G l d t t b d ( t ti ) h
ERC
A DI D
AT
Google does text-based (annotations) searchDo we have enough text ?How is the quality of such text ?
TI MU
LTIMEDHow is the quality of such text ?
Is it correlated with the actual content ?
DIA
LI
Look at this:http://www.nmis.isti.cnr.it/khi/pISTI-CNR (Pisa) and Max Planck KunsthistorischeInstitute project on Florentine Coat of Arms
44
SISTEMI P
CONTENT-BASED SEARCH
PER LA R
ICE
Shape Selection
ERC
A DI D
ATTI MU
LTIMED
Results
DIA
LI
55
SISTEMI P
CONTENT-BASED SEARCH
PER LA R
ICE
Shape Selection
ERC
A DI D
ATTI MU
LTIMED
Results
DIA
LI
66
SISTEMI P
WHAT IS THE ADDED VALUE ?
PER LA R
ICE
Content-based search for disambiguation:What about searching for “sapphire” on Flickr ?
ERC
A DI D
ATTI MU
LTIMEDD
IALI
7 of 21
SISTEMI P
CONTENT-BASED, I.E. SIMILARITY SEARCH
PER LA R
ICE
Objects are “unknown”distances between objects is “known”
ERC
A DI D
AT
Metric Space assumption:symmetryid tit
TI MU
LTIMED
identitytriangle inequality
Distance functions inducing a metric space:
DIA
LI
Distance functions inducing a metric space:Minkowski distances, edit distance, jaccard distance...
Typical queries: Range or kNNTypical queries: Range or kNNApplications:
88
Photo SearchSISTEM
I P
SIMILARITY SEARCH
PER LA R
ICE
Objects are “unknown”distances between objects is “known”
query:
ERC
A DI D
AT
Metric Space assumption:symmetryid tit
TI MU
LTIMED
identitytriangle inequality
Distance functions inducing a metric space:
DIA
LI
Distance functions inducing a metric space:Minkowski distances, edit distance, jaccard distance...
Typical queries: Range or kNNTypical queries: Range or kNNApplications:
99
Photo SearchSISTEM
I P
SIMILARITY SEARCHMedical Data Search
PER LA R
ICE
Objects are “unknown”distances between objects is “known”query:
ERC
A DI D
AT
Metric Space assumption:symmetryid tit
TI MU
LTIMED
identitytriangle inequality
Distance functions inducing a metric space:
DIA
LI
Distance functions inducing a metric space:Minkowski distances, edit distance, jaccard distance...
Typical queries: Range or kNNTypical queries: Range or kNNApplications:
1010
Photo SearchSISTEM
I P
SIMILARITY SEARCHMedical Data Search
PER LA R
ICE
Objects are “unknown”distances between objects is “known”
3D Shape Search
ERC
A DI D
AT
Metric Space assumption:symmetryid tit
query:
TI MU
LTIMED
identitytriangle inequality
Distance functions inducing a metric space:
q y(haemoglobin molecule)
DIA
LI
Distance functions inducing a metric space:Minkowski distances, edit distance, jaccard distance...
Typical queries: Range or kNNTypical queries: Range or kNNApplications:
1111
SISTEMI P
SIMILARITY SEARCH
PER LA R
ICE
Objects are “unknown”distances between objects is “known”
M t i S ti
ERC
A DI D
AT
Metric Space assumption:symmetryidentity
TI MU
LTIMEDidentity
triangle inequalityDistance functions inducing a metric space:
DIA
LI
Minkowski distances, edit distance, jaccard distance...Typical queries: Range or kNNApplications:
photos, 3D shapes, medical images but alsotext dna graphs etc etc 12text, dna, graphs, etc. etc. 12
SISTEMI P
FEATURE-BASED APPROACH
PER LA R
ICE
From the object space to the feature space
ERC
A DI D
ATTI MU
LTIMEDD
IALI
1313
SISTEMI P
DATA STRUCTURES
Vantage Point Tree Generalized Hyper-plane
PER LA R
ICEVantage Point Tree yp p
Tree
ERC
A DI D
ATTI MU
LTIMEDD
IALI
1414
SISTEMI P
DATA STRUCTURES
Vantage Point Tree Generalized Hyper-plane
PER LA R
ICEVantage Point Tree yp p
Tree
ERC
A DI D
AT
Multiple sub-trees may hold the solution
TI MU
LTIMEDp y D
IALI
They perform well with high dimensionality datadimensionality data
1515
I HAVE NEVER SEEN ANYTHINGSISTEM
I P
LIKE THAT IN THE WEB !!!
PER LA R
ICE
Why giants like Google and Yahoo! are not using content-based search ?
Recent studies confirm that
ERC
A DI D
ATRecent studies confirm that centralized solutions are not scalable !A single standard PC would need about 12 years
TI MU
LTIMEDg y
to process a collection of 100 million images.Why ?
DIA
LI
Feature extraction is expensive !feature extraction vs. words in a web-page
S hi i i !Searching is expensive !similarity search vs. boolean search
1616
I HAVE NEVER SEEN ANYTHINGSISTEM
I P
LIKE THAT IN THE WEB !!!
PER LA R
ICE
Why giants like Google and Yahoo! are not using content-based search ?
Recent studies confirm that
ERC
A DI D
ATRecent studies confirm that centralized solutions are not scalable !A single standard PC would need about 12 years
TI MU
LTIMEDg y
to process a collection of 100 million images.Why ?
DIA
LI
Feature extraction is expensive !feature extraction vs. words in a web-page
S hi i i !Searching is expensive !similarity search vs. boolean search
1717
SISTEMI P
PARALLEL FEATURE EXTRACTION
W l d Fli k t th I Id
PER LA R
ICE
We crawled Flickr to gather Image IdsA set of clients were deployed on the EDEE Grid:
Retrieve 1000 imaged Ids
ERC
A DI D
ATRetrieve 1000 imaged IdsDownload ImageExtract MPEG-7 features
TI MU
LTIMED
Parse Flicker Photo-page to obtain additional metadataSend metadata to our centralized repository
W h b t 50 illi i ith f t
DIA
LI
We have about 50 millions images with featuresit seems that Flickr as at least 1 billion imagesYahoo! images has at least 2 billion imagesg gand … they grow exponentially !
Our metadata collection is called CoPhIRIt is largely the largest publicly available collection 18It is largely the largest publicly available collection 18
SISTEMI P
PARALLEL SEARCH
PER LA R
ICE
The search space is mapped into a linear intervalThis is mapped onto a distributed network thanks to
DHT b d l ith
ERC
A DI D
AT
some DHT-based algorithmEach node of the network holds an M-Tree
TI MU
LTIMEDD
IALI
1919
SISTEMI P
OUR PROPOSAL
PER LA R
ICE
Metric cache:Cache is widely used in traditional search engines
T i i l
ERC
A DI D
AT
Trivial:Store results previous queries Return results when submitted query is stored
TI MU
LTIMEDReturn results when submitted query is stored
Less trivial ...Store results of previous queries
DIA
LI
Store results of previous queriesBest-effort query answering
(with guarantees)( g )Use past queries to optimize database queries.
2020
SISTEMI P
A CACHE WITH ONE ENTRY ...
PER LA R
ICE
qi is the query in cachewith its k-NN.r is the “radius of the
d(qi,qj)
ERC
A DI D
ATri is the radius of the query”.qj is a new query.
ri
TI MU
LTIMEDqj q y
If d(qi,qj)<rithen the cached objects t di t
qi
Rji
DIA
LI
at distance
Rji = ri - d( qi qj ) from qj
qj
ji i ( qi,qj ) qj
are the top-k’ results of the new query 21the new query. 21
SISTEMI P
A CACHE WITH MANY ENTRIES
Gi th
PER LA R
ICE
Given the new query qxfind the largest Rxicorresponding to the ri
ERC
A DI D
AT
cached query qi.
The cached objects within qi
TI MU
LTIMEDThe cached objects within
distance Rxi from qx are the most similar to the
qi
rj
DIA
LI
query.
Additionally one couldqj
rl
Additionally, one could use other objects in the cache to provide an approximate answer
ql qx
22approximate answer. 22