Video Data Mining: Technology Dimensions and Challenges
Chin-Hui Lee
School of ECE, Georgia Institute of Technology
Atlanta, GA 30332-0250, [email protected]
Talk at NTNU, Jan. 5 2007
2
Evolution of Language and Media
Historic Flow of Knowledge & Civilization
[Timeline figure. Labels: Spoken Language; Written Language (3000 BC); Paper; Print (1450 AD); Telegraph & Telephone; Radio; TV; Recording Media; Electronic Media (1900 AD); Computer & Digital Processing; Internet & WWW; Hyper & Virtual Media? (21st century)]
3
Outline
Web is a rich repository of all information sources
  Multimedia web content: heterogeneous with few constraints
  Multilingual web materials, in many cases on the same page
  Multimodal user interface: multiple human sensory inputs/outputs
  Multidisciplinary
From media processing to data mining
  Document processing/understanding
  Speech and speaker information
  Audio fingerprint and music identification
  Image annotation, indexing and retrieval
  Video data organization, information subscription and delivery
From mining to knowledge discovery and decision making
  The next big thing
Summary
4
Information Explosion on the Web (part of "The World Is Flat" by Friedman)
Over two billion English web documents in 2005
  Still growing exponentially in most languages (>1B for Chinese)
  From text only to multimedia and multilingual documents
Global data storage explosion (HD + network storage)
  In 2002, global information increased by 5 billion GB, about 800 MB per person, enough to fill 500,000 Libraries of Congress
Memory explosion on PCs and portable devices
  From 512 KB for a PC (Gates) to a portable Library of Congress
Cell phone sales close to 500M units in 2003
  Mobile information appliances are commodities and fashions
Today's students are more wired than ever in the US
  90% of those aged 5-17 use computers, 99% of public schools are wired
From Command/Control to Connect/Collaborate
  World leaders and executives access Google and Blackberry
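The storage figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB); the implied population and the implied size of one Library of Congress are derived values, not numbers stated in the talk.

```python
# Sanity check of the 2002 storage figures quoted on the slide.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
NEW_INFO_GB = 5e9        # 5 billion GB of new information in 2002
PER_PERSON_MB = 800      # about 800 MB per person
LIBRARIES = 500_000      # 500,000 Libraries of Congress

# Total divided by the per-person share gives the implied head count.
implied_population = NEW_INFO_GB * 1000 / PER_PERSON_MB

# Total divided by the number of libraries gives the implied library size.
implied_library_tb = NEW_INFO_GB / LIBRARIES / 1000

print(f"Implied population: {implied_population / 1e9:.2f} billion")
print(f"Implied size of one Library of Congress: {implied_library_tb:.0f} TB")
```

Under these assumptions the three figures are mutually consistent: they imply about 6.25 billion people and about 10 TB per Library of Congress.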
5
Web to Accelerate Technology Adoption Rate
Time To Reach 10 Million Customers
[Bar chart, horizontal axis in years (0 to 40). Categories: Fax Machine, New Browser, WWW, PC, CD-ROM, Cellular Phone, VCR, Pager. Bar values as extracted: 1, 4, 6, 7, 9, 9, 22, and 41 years (Pager). Annotations: "100 Years!", "Napster?"]
Sources: Apple, AirTouch Cellular, Info Tech and USA Today
6
Content Web: The New DB & Playground
[Figure labels: Family, Sports, Fun, Knowledge, Personal, Finance, Office]
7
Text Categorization: Topic Identification
[Diagram: an unknown document dj is passed to m classifiers T1, ..., Ti, ..., Tm]
Decisions by m classifiers: T1(dj), ..., Ti(dj), ..., Tm(dj)
Labels of dj for Ci: L1(dj), ..., Li(dj), ..., Lm(dj)
Evaluation (decisions against labels)
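The diagram on this slide (one binary classifier per topic, whose decisions are then scored against labels) can be sketched in a few lines. This is only an illustration: the keyword-based classifiers, the three topics, and the toy documents are hypothetical stand-ins, since the slide does not say which classifier is actually used.

```python
# One binary classifier per topic: T_i(d_j) = 1 if document d_j is judged
# to belong to topic C_i. The keyword scorer is a hypothetical stand-in
# for whatever classifier is actually trained.
from typing import Callable, Dict, List, Set

Classifier = Callable[[str], int]


def make_keyword_classifier(keywords: Set[str], threshold: int = 1) -> Classifier:
    """Return T_i: fires when at least `threshold` topic keywords occur."""
    def classify(document: str) -> int:
        tokens = set(document.lower().split())
        return int(len(tokens & keywords) >= threshold)
    return classify


# m = 3 hypothetical topics C_1..C_m, each with its own classifier T_i.
classifiers: Dict[str, Classifier] = {
    "sports": make_keyword_classifier({"match", "team", "cup"}),
    "finance": make_keyword_classifier({"stock", "market", "bank"}),
    "weather": make_keyword_classifier({"rain", "storm", "sunny"}),
}


def decide(document: str) -> Dict[str, int]:
    """Run all m classifiers on one unknown document d_j."""
    return {topic: clf(document) for topic, clf in classifiers.items()}


def evaluate(docs: List[str], labels: List[Dict[str, int]]) -> Dict[str, float]:
    """Compare decisions T_i(d_j) with labels L_i(d_j); accuracy per topic."""
    correct = {topic: 0 for topic in classifiers}
    for doc, truth in zip(docs, labels):
        decisions = decide(doc)
        for topic in classifiers:
            correct[topic] += int(decisions[topic] == truth[topic])
    return {topic: correct[topic] / len(docs) for topic in classifiers}


docs = [
    "the team won the cup after a tense match",
    "stock market falls as bank shares slide",
    "storm delays the cup final",
]
labels = [
    {"sports": 1, "finance": 0, "weather": 0},
    {"sports": 0, "finance": 1, "weather": 0},
    {"sports": 1, "finance": 0, "weather": 1},
]
accuracy = evaluate(docs, labels)
print(decide(docs[2]))
print(accuracy)
```

Because every topic has its own yes/no decision, a document such as the third one can carry several topic labels at once.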
8
Event Representation & Topic Classification
Video: speech, audio, image, text, and others
[Block diagram. Blocks: Speech; ASR; Image; AIA; Text; Morphological Filtering; Query-Vector Extraction; Text Categorization]
ASR: Automatic Speech Recognition
AIA: Automatic Image Annotation
9
Common Technology Thread: DSP, Feature Extraction & Classifier Learning
[Block diagram. Blocks: Speech/Image/Audio; Text; Media Tokenization; A/V Alphabet Model; A/V Word List; LSA-Based FE/SVD; Feature; Text Document Training Set; TC Classifier Learning; TC Classifier; Audiovisual Classification; Results]
First step: define alphabets and train alphabet models
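The "LSA-Based FE/SVD" block can be illustrated with a small sketch: tokenized documents become a token-by-document count matrix, and a truncated SVD maps each document to a low-dimensional feature vector. Everything concrete here is an assumption made for illustration: the toy token names (`a1`, `b1`, ...), raw counts without any weighting, and the pure-Python power iteration, which stands in for a library SVD routine.

```python
# Tokenize -> token-document counts -> rank-k SVD -> document features.
import math
from collections import Counter
from typing import List, Tuple

Matrix = List[List[float]]


def term_document_matrix(docs: List[List[str]]) -> Tuple[List[str], Matrix]:
    """Rows are tokens (text words or A/V 'words'), columns are documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[float(c[tok]) for c in counts] for tok in vocab]


def matvec(a: Matrix, v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def rmatvec(a: Matrix, u: List[float]) -> List[float]:
    """Compute A^T u."""
    return [sum(a[i][j] * u[i] for i in range(len(a))) for j in range(len(a[0]))]


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def truncated_svd(a: Matrix, k: int, iters: int = 300):
    """Top-k singular triplets (sigma, u, v) by power iteration with deflation."""
    a = [row[:] for row in a]                      # work on a copy
    triplets = []
    for _ in range(k):
        start = [float(j + 1) for j in range(len(a[0]))]  # fixed, non-uniform start
        n0 = norm(start)
        v = [x / n0 for x in start]
        for _step in range(iters):
            w = rmatvec(a, matvec(a, v))           # one power step on A^T A
            n = norm(w)
            if n < 1e-12:
                break
            v = [x / n for x in w]
        av = matvec(a, v)
        sigma = norm(av)
        if sigma < 1e-9:
            break                                  # matrix has rank < k
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        for i, row in enumerate(a):                # deflate: A -= sigma * u * v^T
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return triplets


def document_features(triplets, n_docs: int) -> List[List[float]]:
    """Feature of document j is (sigma_1 * v_1[j], ..., sigma_k * v_k[j])."""
    return [[sigma * v[j] for sigma, _u, v in triplets] for j in range(n_docs)]


def cosine(p: List[float], q: List[float]) -> float:
    return sum(x * y for x, y in zip(p, q)) / (norm(p) * norm(q))


# Hypothetical tokenized documents: two share "a" tokens, two share "b" tokens.
docs = [
    ["a1", "a2", "a1", "a3"],
    ["a1", "a2", "a2"],
    ["b1", "b2", "b1"],
    ["b2", "b1", "b3", "b3"],
]
vocab, matrix = term_document_matrix(docs)
features = document_features(truncated_svd(matrix, k=2), len(docs))
sim_same = cosine(features[0], features[1])
sim_diff = cosine(features[0], features[2])
print(f"same group: {sim_same:.3f}, different group: {sim_diff:.3f}")
```

In this toy setup, documents built from the same tokens end up close together in the reduced space, which is the property a downstream TC classifier would rely on.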
10
First Google, Now Google News
11
Information Present in Speech
In real speech, this information is highly correlated.
Information present                          Technology
Word transcription                           Speech recognition
Meaning                                      Spoken language understanding
Identity of a speaker                        Speaker verification
Emotional state of the speaker
Sex, age, and health of the speaker
Transmission characteristics                 Channel characterization
  (type of microphone, room noise, transmission noise, filtering, distortion, reverberation)
Language that is spoken                      Language identification
12
Speech and Speaker Data Mining
[Diagram: three analyses of a news clip and their outputs]
ASR: speech transcripts ("... Devastating accident")
SID: speaker identity (anchor)
Event Detection: event (explosion)
13
SpeechFind: Speech & Speaker Annotation
Fully searchable online database of spoken word collections spanning the 20th century
http://svoice.colorado.edu (Bowen Zhou)
14
Blink-X: A Video Search Portal
15
Conversational User Interface: R2D2
[Block diagram. Blocks: Voice Input; Speech Recognizer; Language Analyzer; Dialogue Manager; Language Generator & TTS Synthesizer; Voice Output; Output Action. Knowledge sources: Acoustic & Language Models; Semantic Rules; Text Analysis & Pronunciation Rules. Dialogue Manager functions: Database Interaction; History Management; Command Execution; Status Report]
Signal flow: (Speech), (Text), (Meaning), (Text Reply), (Speech)
Applicable to any language, including Mandarin, Minnan, Hakka, and English spoken in Taiwan
16
Multilingual Web of the Future
Internet Users by Language (end of 2004, 800 million)
  English, 35.90%
  Chinese, 13.20%
  Japanese, 8.30%
  German, 6.80%
  Spanish, 6.70%
  French, 4.40%
  Korean, 3.80%
  Italian, 3.60%
  Portuguese, 2.90%
  Dutch, 1.70%
  All Others, 12.70%
Clear trend
  Non-English users are continuously increasing
  Japanese and Chinese are currently (and may continue to be) the two largest non-English groups
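As a quick check of the chart data, the shares can be converted to approximate user counts and ranked. The sketch assumes the 800 million total quoted in the chart title and leaves the aggregate "All Others" slice out of the ranking, since it is not a single language.

```python
# Shares as listed on the slide (end of 2004), in percent.
TOTAL_USERS_MILLIONS = 800
share_percent = {
    "English": 35.9, "Chinese": 13.2, "Japanese": 8.3, "German": 6.8,
    "Spanish": 6.7, "French": 4.4, "Korean": 3.8, "Italian": 3.6,
    "Portuguese": 2.9, "Dutch": 1.7, "All Others": 12.7,
}

# Approximate head count per language, in millions of users.
users_millions = {
    lang: round(TOTAL_USERS_MILLIONS * pct / 100, 1)
    for lang, pct in share_percent.items()
}

# Rank single non-English languages by share (the aggregate slice is excluded).
non_english = sorted(
    (lang for lang in share_percent if lang not in ("English", "All Others")),
    key=lambda lang: share_percent[lang],
    reverse=True,
)

total_share = sum(share_percent.values())
print(f"Shares sum to {total_share:.1f}%")
print("Two largest non-English groups:", non_english[:2])
print("Users (millions):", users_millions)
```

The listed shares add up to 100%, and the ranking agrees with the slide's remark about the two largest non-English groups.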
17
Multilingual Web Pages: An Example
18
Universal Speech Translation: C3PO
[Block diagram. Blocks: Voice Input; Speech Recognizer; Language Analyzer; Machine Translation; Language Generator & TTS Synthesizer; Voice Output. Knowledge sources: Acoustic & Language Models; Semantic Rules; Bilingual Databases; Translation Models; Text Analysis & Pronunciation Rules]
Signal flow: (Speech in Language A), (Text in Language A), (Text Understanding in Language A/B), (Text Reply in Language B), (Speech in Language B)
Applicable to any pair of languages, e.g. (English, Mandarin), (Minnan, Mandarin)
19
Talking Heads
3D Talking Heads
  flexible; easy to show in any pose; faces look cartoon-like.
Sample-based Talking Heads
  look like a real person; require recording of real people; limited in pose that can be shown.
20
Music and Speech Connection
Krishna and Sreenivas (2004) drew parallels between music and speech
  Speech recognition <-> music transcription
  Instrument recognition <-> speaker recognition
  Cocktail separation <-> instrument separation
  Genre classification <-> language classification
Perceptual results exist that support the link between music and language, but the debate is still continuing
21
Subjective Music Genre Classification
ID    Style         Description
S&Y   Motown/Soul   More freedom to anticipate, more syncopation, greater tempo fluctuations. Medium difficulty
O     Country song  Non-prominent drum, much lower correlation between beat and events. Difficult
R     Bossa nova    Syncopated guitar & vocals, very little percussion. Difficult even for humans
M     Jazz swing    Complex & syncopated rhythms. Difficult even for the musically trained
22
Experimental Results (ISMIR 2006)
Overall accuracy was 72.86%
  128 segment models
  4 iterations of segment modeling algorithm
Genre          Classical  Electronic  Rock  Jazz/Blues  Ambient  Recall (%)
Classical      26         0           1     1           2        86.7
Electronic     0          19          9     0           2        63.3
Rock           0          5           24    0           1        80.0
Jazz/Blues     1          2           5     12          1        57.1
Ambient        1          4           2     1           21       72.4
Precision (%)  92.9       63.3        58.5  85.7        77.8
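The recall, precision, and overall accuracy figures on this slide follow directly from the confusion counts. The sketch below recomputes them, assuming rows are the true genre and columns the assigned genre (the reading under which the row-wise recall values match).

```python
# Confusion counts from the slide. Assumption: rows = true genre,
# columns = assigned genre, both in the order given in `genres`.
genres = ["Classical", "Electronic", "Rock", "Jazz/Blues", "Ambient"]
confusion = [
    [26, 0, 1, 1, 2],
    [0, 19, 9, 0, 2],
    [0, 5, 24, 0, 1],
    [1, 2, 5, 12, 1],
    [1, 4, 2, 1, 21],
]

n = len(genres)
# Recall: correct decisions divided by the row total (all clips of that genre).
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
# Precision: correct decisions divided by the column total (all clips assigned).
precision = [
    confusion[i][i] / sum(confusion[r][i] for r in range(n)) for i in range(n)
]
# Overall accuracy: diagonal sum divided by the total number of clips.
accuracy = sum(confusion[i][i] for i in range(n)) / sum(map(sum, confusion))

for g, r, p in zip(genres, recall, precision):
    print(f"{g:<11} recall {100 * r:5.1f}%  precision {100 * p:5.1f}%")
print(f"Overall accuracy: {100 * accuracy:.2f}%")
```

The counts also show where the errors concentrate: Electronic and Rock are confused with each other more often than any other pair.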
23
Self-Generating Web Community
Yahoo
  When you organize, users will come
  Free e-mail: later inspiring Gmail
Napster
  Peer-to-peer networking and information exchange
Google
  Web is the largest database, library and playground
Wikipedia
  Self-established encyclopedia in multiple languages
YouTube
  Pioneering and outlasting Google Video
Many more, but what's next?
24
Wikipedia
25
Concept & Content Based Photo Retrieval
Indexing and retrieval of photos
  Content based example search does not give good performance
  Concept based keyword search
GUI, Speech UI, Multimedia UI
26
Multilingual Image Annotation (IIS)
[Table of example images with their top 4 keywords, laid out in two image columns. Keywords as extracted, row by row:]
(Rainbow) (Weather) (Flower) (Nature)
(Sunflower) (Flower) (Plant) (Desert)
(Seal) (Mammal) (Coast) (Animal)
(Solar System) (Comet) (Tropical Fish) (Universe)
(Waterfall) (Landform) (Nature) (Cockroach)
(Dog) (Mammal) (Pangolin) (Sheep)
27
Automatic Image Annotation
[Pipeline diagram. Blocks: Partition; Feature Extraction; Tokenization; Visual / Verbal Connection Model]
Ground Truth: Bear, Polar, Snow, Tundra
Our Method: Bear, Polar, Snow, Tundra, Ice
29
Education Media: Connexions in Progress
http://cns.rice.edu
Hits in Q4 2002: 250,000 hits/day from 157 countries
>2100 modules, >45 courses (November 2004)
  engineering, computer science, nanotech physics, statistics, math, music, IP
  bio-diversity, botany, bio-info
  BRIT, UNESCO, UN, Sigma Xi, from authors worldwide
30
H & M Project: Knowledge Gathering
A mythology scenario: named after Hugin and Munin (thought and memory), the twin ravens of the Norse god Odin, which circled the earth each day gathering information and knowledge
Another mythology scenario: a suite of access tools (H) that uses stored context and indices (M) to provide useful access to a federation of digital repositories (named after Mimir, the giant guarding the Well of the Highest Wisdom under the root of the World Tree)
31
Video Information Processing
Audio-visual & text features in a learning framework
Domain structure (similar to PoS tagging in NLP)
Education and entertainment applications
[Diagram: analogy between text processing and video processing]
Text: words; part-of-speech tagging; tagged word; identify sentence BD; sentence BD
Video: shot segmentation & feature extraction (face, audio, etc.); shot classification (tagging); Tag_IDs; story segmentation (HMM & rule induction); stories
32
Video Story Segmentation (NUS)
[Diagram: two-level framework]
1st level: shot classification, producing Tag_IDs together with scene change and cue-phrase information
2nd level: story segmentation (HMM based story segmentation; rule induction based story segmentation), producing story units
TRECVID is a community-supported annual open evaluation of technologies for topic detection and tracking of multiple threads of similar stories spanning a period of time, from multiple channels, and covering multilingual sources
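The second-level, HMM based step can be sketched as Viterbi decoding over the sequence of shot tags: each shot is assigned a hidden state, and a story boundary is placed wherever the decoded state marks the start of a story. The two states, the tag names, and all probabilities below are hypothetical values chosen for illustration; the slide does not describe the model actually used.

```python
# Viterbi decoding of story boundaries from first-level shot tags.
# Assumption: all states, tags, and probabilities here are made up.
import math
from typing import List

STATES = ["start", "inside"]   # shot opens a story / shot continues a story
START_P = {"start": 0.9, "inside": 0.1}
TRANS_P = {
    "start": {"start": 0.1, "inside": 0.9},
    "inside": {"start": 0.3, "inside": 0.7},
}
EMIT_P = {
    "start": {"anchor": 0.8, "report": 0.1, "interview": 0.1},
    "inside": {"anchor": 0.1, "report": 0.6, "interview": 0.3},
}


def viterbi(tags: List[str]) -> List[str]:
    """Most likely hidden state per shot, computed in the log domain."""
    best = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][tags[0]])
             for s in STATES}]
    back = []
    for tag in tags[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this shot.
            prev = max(STATES,
                       key=lambda p: best[-1][p] + math.log(TRANS_P[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS_P[prev][s])
                         + math.log(EMIT_P[s][tag]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


shot_tags = ["anchor", "report", "report", "anchor", "interview", "report"]
path = viterbi(shot_tags)
boundaries = [i for i, s in enumerate(path) if s == "start"]
print(path)
print("Story boundaries at shots:", boundaries)
```

With these toy parameters an anchor shot strongly suggests a new story, so the six shots are split into two stories of three shots each.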
33
Video Story Segmentation
[Figure: key frames grouped into Story 1, Story 2, and Story 3]
34
Video Clip Browsing over IP on 3G
35
Possible Query Examples
Search for all video clips containing images of Mother Teresa
  Using indexing information on text, video, speech, and image
Search for all video clips mentioning IBM or containing the IBM logo
  Using indexing information on text, video, speech, image, and logo
Search for all lectures given by Stephen Hawking
  Using indexing information on text, video, speech, and speaker
Search for all recent lectures on the subject of global warming
  Using indexing information on text, video, speech, and image
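Queries like these amount to looking a term up in several per-modality indices and merging the hits. A minimal sketch follows, assuming a hypothetical in-memory index in which every clip stores the terms found by each indexing modality; the clip names and index contents are invented for illustration.

```python
# Multi-modality lookup over a hypothetical per-clip index.
from typing import Dict, List, Set

# For each clip: modality name -> terms detected by that modality.
index: Dict[str, Dict[str, Set[str]]] = {
    "clip1": {"text": {"ibm", "earnings"}, "speech": {"ibm"}, "logo": {"ibm"}},
    "clip2": {"speech": {"global warming"}, "speaker": {"stephen hawking"}},
    "clip3": {"image": {"mother teresa"}, "text": {"calcutta"}},
}


def search(term: str, modalities: List[str]) -> List[str]:
    """Return clips where any of the given modality indices contains the term."""
    term = term.lower()
    return sorted(
        clip for clip, fields in index.items()
        if any(term in fields.get(m, set()) for m in modalities)
    )


print(search("IBM", ["text", "speech", "image", "logo"]))
print(search("Stephen Hawking", ["text", "speech", "speaker"]))
print(search("Mother Teresa", ["text", "speech", "image"]))
```

Keeping one index per modality lets the same query be answered from whichever source recognized the term, whether that is a transcript, a detected logo, or a speaker label.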
36
Web Information Access & Presentation
[Figure: a news page (HTML) with the headline "Sampras volunteers for Davis Cup doubles duty", its links, the extracted news content (text), and a summary]
Web data mining and information extraction: audio, video, image, speech, text, graphics, icons, cartoon, objects, links, anchor text, web logs, etc.
37
A Grand View: Multimodal Access of Multilingual Multimedia Information
[System diagram. Blocks: User Model; User Input (Keyboard, Speech, MM-pad); Speech Recognizer; Text Processing; Audio/Video Recognizer; User Intent Understanding; Q&A Dialogue; User Feedback; Info Fusion; Info Fusion & Retrieval; Multimedia Presentation; Audio/Video Rendering; A/V Browser; Information Appliance; Network; Multimedia Processing; Multimedia Indexing; Raw A/V Database; Indexed A/V Database; Video, Audio, Text]
38
Information Technologies and 4M
Multimedia documents
  Audio, video, speech, image, text, chart, map, etc.
  Indexing, retrieval, presentation, rendering, etc.
Multimodal human-machine interface (HCI)
  Speech, gesture, point and click, pen, MM sketch pad, etc.
  Multiple sensory inputs and feedback
Multilingual information sources
  Multilingual speech and language understanding
  Multilingual presentation, cross-language referencing
Multidisciplinary collaborative research
  Engineers, scientists, artists, psychologists, etc.
  Human factors, behavior science, wide range of topics
39
Other Video Data Mining Applications
  Video recommendation
  Video summarization
  Browsing and visualization of multimedia datasets
  Video content filtering
  User generated content management (YouTube, Google Video)
  Management of meeting/presentation recordings
  Educational multimedia material browsing and retrieval
  Video broadcast monitoring
  User friendly entertainment video browsing
  Video skimming, hierarchical skimming
  Video content description for the visually impaired
  Music thumb-nailing
  Personal photo album thumb-nailing
  Extraction of moods from pictures and sounds
  Personalization of multimodal and multilingual user interfaces
40
Summary
Web is a rich DB for information processing R&D
  Multimedia web content: heterogeneous with few constraints
  Multilingual web materials, in many cases on the same page
  Multimodal user interface: multiple human sensory inputs/outputs
  Multidisciplinary collaboration involving all experts
From media processing to data mining
  Document processing/understanding
  Speech and speaker information
  Audio fingerprint and music identification
  Image annotation, indexing and retrieval
  Video data organization, information subscription and delivery
From mining to knowledge discovery and decision making
  The next business and research frontier
  Plenty of technical challenges that will lead to huge societal impacts