Video Data Mining: Technology Dimensions and …berlin.csie.ntnu.edu.tw/Courses/2006F-SpeechRecognition/...1 Video Data Mining: Technology Dimensions and Challenges Chin-Hui Lee (李錦輝)

1

Video Data Mining: Technology Dimensions and Challenges

Chin-Hui Lee (李錦輝), School of ECE, Georgia Institute of Technology

Atlanta, GA 30332-0250, [email protected]

Talk at NTNU, Jan. 5 2007

2

Evolution of Language and Media

Historic Flow of Knowledge & Civilization

[Timeline diagram; stages as labeled]
Spoken Language
Written Language (3000 BC)
Paper
Print (1450 AD)
Electronic Media (1900 AD)
Telegraph & Telephone
Radio
TV
Recording Media
Computer & Digital Processing
Internet & WWW
Hyper & Virtual Media? (21st Cen.)

3

Outline

Web is a rich repository of all information sources
  Multimedia web content: heterogeneous with few constraints
  Multilingual web materials, in many cases on the same page
  Multimodal user interface: multiple human sensory input/output
  Multidisciplinary
From media processing to data mining
  Document processing/understanding
  Speech and speaker information
  Audio fingerprint and music identification
  Image annotation, indexing and retrieval
  Video data organization, information subscription and delivery
From mining to knowledge discovery and decision making
  The next big thing
Summary

4

Information Explosion on the Web (part of The World Is Flat by Friedman)

Over two billion English web documents in 2005
  Still growing exponentially in most languages (>1B for Chinese)
  From text only to multimedia and multilingual documents
Global data storage explosion (HD + network storage)
  In 2002, global information increased by 5 billion GB, about 800 MB per person, enough to fill 500,000 Libraries of Congress
Memory explosion on PC and portable devices
  From 512 KB for PC (Gates) to a portable Library of Congress
Cell phone sales close to 500M units in 2003
  Mobile information appliances are commodities and fashions
Today's students are more wired than ever in the US
  90% of those aged 5-17 use computers; 99% of public schools are wired
From Command/Control to Connect/Collaborate
  World leaders and executives access Google and BlackBerry
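As a quick consistency check on the storage figures in this slide, the two numbers (5 billion GB in total, about 800 MB per person) imply a population size. The sketch below assumes decimal units (1 GB = 1000 MB), which the slide does not state.

```python
# Figures quoted on the slide.
increase_gb = 5e9        # global information increase in 2002, in GB
per_person_mb = 800      # quoted per-person share, in MB

# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000

# Population implied by dividing the total by the per-person share.
implied_population = increase_gb * MB_PER_GB / per_person_mb
print(f"Implied population: {implied_population / 1e9:.2f} billion")
```

The result, 6.25 billion, is in line with a world population of a little over six billion in 2002, so the two figures agree with each other.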

5

Web to Accelerate Technology Adoption Rate

[Bar chart: Time to Reach 10 Million Customers, in years (axis 0 to 40). Technologies shown: Fax Machine, New Browser, WWW, PC, CD-ROM, Cellular Phone, VCR, Pager. Pager is labeled 41 years; the other bar labels in the extracted text are 1, 4, 6, 7, 9, 9, and 22. Annotations: "100 Years!" and "Napster?"]

Sources: Apple, AirTouch Cellular, Info Tech and USA Today

6

Content Web: The New DB & Playground

Family, Sports, Fun, Knowledge, Personal, Finance, Office

7

Text Categorization: Topic Identification

[Block diagram]
Unknown document d_j
Classifiers T_1, ..., T_i, ..., T_m
Decisions by m classifiers: T_1(d_j), ..., T_i(d_j), ..., T_m(d_j)
Labels of d_j for C_i: L_1(d_j), ..., L_i(d_j), ..., L_m(d_j)
Evaluation (decisions compared with labels)
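The diagram on this slide (m topic classifiers applied to one unknown document, then scored against its labels) can be sketched as follows. This is a minimal illustration, not the speaker's system: the keyword lists, the threshold, and the sample document are all hypothetical stand-ins for trained classifiers and real data.

```python
from typing import Callable, Dict, Set

# Hypothetical keyword sets standing in for trained topic classifiers T_1..T_m.
TOPIC_KEYWORDS: Dict[str, Set[str]] = {
    "sports": {"match", "cup", "team", "score"},
    "finance": {"stock", "market", "bank", "shares"},
    "weather": {"rain", "storm", "forecast", "snow"},
}


def make_classifier(keywords: Set[str], threshold: int = 1) -> Callable[[str], bool]:
    """Build a binary classifier T_i: True if the document matches topic i."""
    def classify(document: str) -> bool:
        tokens = document.lower().split()
        hits = sum(1 for tok in tokens if tok in keywords)
        return hits >= threshold
    return classify


CLASSIFIERS = {topic: make_classifier(kw) for topic, kw in TOPIC_KEYWORDS.items()}


def decide(document: str) -> Dict[str, bool]:
    """Run all m classifiers on one document: T_1(d_j), ..., T_m(d_j)."""
    return {topic: clf(document) for topic, clf in CLASSIFIERS.items()}


def evaluate(decisions: Dict[str, bool], labels: Dict[str, bool]) -> float:
    """Fraction of the m binary decisions that agree with the labels L_i(d_j)."""
    agree = sum(decisions[topic] == labels[topic] for topic in decisions)
    return agree / len(decisions)


doc = "Storm delays cup match as team waits for forecast"
decisions = decide(doc)
labels = {"sports": True, "finance": False, "weather": True}
accuracy = evaluate(decisions, labels)
print(decisions, accuracy)
```

Treating each topic as an independent yes/no decision lets one document carry several topic labels at once, which matches the multi-label setup drawn in the diagram.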

8

Event Representation & Topic Classification

Video: speech, audio, image, text, and others

[Block diagram; blocks as labeled]
Inputs: Speech (through ASR), Image (through AIA), Text
Processing blocks: Morphological Filtering, Query-Vector Extraction, Text Categorization

ASR: Automatic Speech Recognition
AIA: Automatic Image Annotation

9

Common Technology Thread: DSP, Feature Extraction & Classifier Learning

[Block diagram; blocks as labeled]
Inputs: Speech/Image/Audio; Text Document Training Set
Media Tokenization, with A/V Alphabet Model and A/V Word List
LSA-Based FE/SVD
Feature
TC Classifier Learning
TC Classifier
Audiovisual Classification
Results

(LSA: latent semantic analysis; FE: feature extraction; SVD: singular value decomposition; TC: text categorization)

First Step: Define alphabets and train alphabet models
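The LSA-based feature extraction step named on this slide can be illustrated with a truncated SVD of a term-by-document count matrix. This is a minimal sketch under stated assumptions: the token names (`a1`, `v7`, ...) are hypothetical A/V "words" that a tokenizer is assumed to have produced already, and raw counts are used without any weighting.

```python
import numpy as np

# Hypothetical documents, already tokenized into A/V "words".
documents = [
    ["a1", "a2", "a1", "a3"],
    ["a1", "a2", "a2"],
    ["v7", "v8", "v7"],
    ["v8", "v9", "v7", "v8", "v9"],
]

vocab = sorted({tok for doc in documents for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# Term-by-document count matrix (rows: tokens, columns: documents).
counts = np.zeros((len(vocab), len(documents)))
for j, doc in enumerate(documents):
    for tok in doc:
        counts[index[tok], j] += 1

# Truncated SVD: keep the k largest singular values as the latent space.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_features = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_same = cosine(doc_features[0], doc_features[1])  # documents sharing tokens
sim_diff = cosine(doc_features[0], doc_features[2])  # documents with disjoint tokens
print(round(sim_same, 3), round(sim_diff, 3))
```

The resulting low-dimensional vectors are what a downstream classifier would be trained on; documents that share tokens end up close together in the latent space, while documents with disjoint vocabularies stay apart.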

10

First Google, Now Google News

11

Information Present in Speech

In real speech, this information is highly correlated.

Technology                      Information present
Speech recognition              Word transcription
Spoken language understanding   Meaning
Speaker verification            Identity of a speaker
-                               Emotional state of the speaker
-                               Sex, age, and health of the speaker
Channel characterization        Transmission characteristics (type of microphone, room noise, transmission noise, filtering, distortion, reverberation)
Language identification         Language that is spoken

12

Speech and Speaker Data Mining

[Diagram]
ASR -> Speech transcripts: "..Devastating accident"
SID -> Speaker identity: anchor
Event Detection -> Event: explosion

13

SpeechFind: Speech & Speaker Annotation

Fully searchable online database of spoken word collections spanning the 20th century

http://svoice.colorado.edu (Bowen Zhou)

14

Blink-X: A Video Search Portal

15

Conversational User Interface R2D2

[Block diagram]
Voice Input (Speech)
Speech Recognizer, with Acoustic & Language Models -> (Text)
Language Analyzer, with Semantic Rules -> (Meaning)
Dialogue Manager: Database Interaction, History Management, Command Execution, Status Report; Output Action -> (Text Reply)
Language Generator & TTS Synthesizer, with Text Analysis & Pronunciation Rules -> (Speech)
Voice Output

Applicable to any language, including Mandarin, Minnan, Hakka, and English as spoken in Taiwan

16

Multilingual Web of the Future

Internet Users by Language (end of 2004, 800 million)

Clear trend
  Non-English users are continuously increasing
  Japanese and Chinese are currently (and may continue to be) the two largest non-English groups

English      35.90%
Chinese      13.20%
Japanese      8.30%
German        6.80%
Spanish       6.70%
French        4.40%
Korean        3.80%
Italian       3.60%
Portuguese    2.90%
Dutch         1.70%
All Others   12.70%
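The percentages on this slide can be turned into approximate user counts and a combined non-English share. The sketch below uses only the slide's own figures (800 million users, the per-language shares); the variable names are illustrative.

```python
TOTAL_USERS_MILLION = 800  # Internet users at the end of 2004, from the slide

# Shares by language, in percent, as listed on the slide.
share_percent = {
    "English": 35.9, "Chinese": 13.2, "Japanese": 8.3, "German": 6.8,
    "Spanish": 6.7, "French": 4.4, "Korean": 3.8, "Italian": 3.6,
    "Portuguese": 2.9, "Dutch": 1.7, "All Others": 12.7,
}

total_percent = sum(share_percent.values())
non_english_percent = total_percent - share_percent["English"]

# Approximate number of users per language, in millions.
users_million = {
    lang: TOTAL_USERS_MILLION * pct / 100 for lang, pct in share_percent.items()
}

print(f"Shares sum to {total_percent:.1f}%")
print(f"Non-English share: {non_english_percent:.1f}%")
print(f"Chinese-language users: about {users_million['Chinese']:.1f} million")
```

The shares add up to 100%, and non-English users already make up roughly 64% of the total, which supports the trend stated on the slide.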

17

Multilingual Web Pages: An Example

18

Universal Speech Translation C3PO

[Block diagram]
Voice Input (Speech in Language A)
Speech Recognizer, with Acoustic & Language Models -> (Text in Language A)
Language Analyzer, with Semantic Rules -> (Text Understanding in Language A/B)
Machine Translation, with Bilingual Databases and Translation Models -> (Text Reply in Language B)
Language Generator & TTS Synthesizer, with Text Analysis & Pronunciation Rules -> (Speech in Language B)
Voice Output

Applicable to any pair of languages, e.g. (English, Mandarin), (Minnan, Mandarin)

19

Talking Heads

3D Talking Heads: flexible; easy to show in any pose; faces look cartoon-like.
Sample-based Talking Heads: look like a real person; require recording of real people; limited in the poses that can be shown.

20

Music and Speech Connection

Krishna and Sreenivas (2004) drew parallels between music and speech
  Speech recognition <-> music transcription
  Instrument recognition <-> speaker recognition
  Cocktail separation <-> instrument separation
  Genre classification <-> language classification
Perceptual results do exist that support the link between music and language, but the debate is still continuing

21

Subjective Music Genre Classification

ID    Style          Description
S&Y   Motown/Soul    More freedom to anticipate, more syncopation, greater tempo fluctuations; medium difficulty
O     Country song   Non-prominent drum, much lower correlation between beat and events; difficult
R     Bossa nova     Syncopated guitar & vocals, very little percussion; difficult even for humans
M     Jazz swing     Complex & syncopated rhythms; difficult even for the musically trained

22

Experimental Results (ISMIR2006)

Overall accuracy was 72.86%.
  128 segment models
  4 iterations of the segment modeling algorithm

Genre           Classical  Electronic  Rock  Jazz/Blues  Ambient  Recall (%)
Classical       26         0           1     1           2        86.7
Electronic      0          19          9     0           2        63.3
Rock            0          5           24    0           1        80.0
Jazz/Blues      1          2           5     12          1        57.1
Ambient         1          4           2     1           21       72.4
Precision (%)   92.9       63.3        58.5  85.7        77.8
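The recall, precision, and overall accuracy figures on this slide follow directly from the confusion counts in the results table. The sketch below recomputes them, reading each row as one genre's test clips and each column as the assigned genre (the orientation under which the row-wise Recall column matches).

```python
import numpy as np

genres = ["Classical", "Electronic", "Rock", "Jazz/Blues", "Ambient"]

# Confusion counts from the slide's results table, in the order of `genres`.
confusion = np.array([
    [26, 0, 1, 1, 2],    # Classical
    [0, 19, 9, 0, 2],    # Electronic
    [0, 5, 24, 0, 1],    # Rock
    [1, 2, 5, 12, 1],    # Jazz/Blues
    [1, 4, 2, 1, 21],    # Ambient
])

correct = np.diag(confusion)
recall = 100 * correct / confusion.sum(axis=1)      # per row
precision = 100 * correct / confusion.sum(axis=0)   # per column
accuracy = 100 * correct.sum() / confusion.sum()    # all clips

for name, r, p in zip(genres, recall, precision):
    print(f"{name:<11} recall {r:5.1f}%  precision {p:5.1f}%")
print(f"Overall accuracy: {accuracy:.2f}%")
```

The recomputed values agree with the table: 102 of 140 clips are classified correctly, giving the reported 72.86%.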

23

Self-Generating Web Community

Yahoo
  When you organize, users will come
  Free e-mail: later inspiring Gmail
Napster
  Peer-to-peer networking and information exchange
Google
  Web is the largest database, library and playground
Wikipedia: self-established encyclopedia in multiple languages
YouTube
  Pioneering and outlasting Google Video
Many more, but what's next?

24

Wikipedia

25

Concept & Content Based Photo Retrieval

Indexing and retrieval of photos
  Content-based example search does not give good performance
  Concept-based keyword search
GUI, Speech UI, Multimedia UI

26

Multilingual Image Annotation (IIS)

[Table of example images with their top 4 keywords; images not reproduced]

(Rainbow) (Weather) (Flower) (Nature)
(Sunflower) (Flower) (Plant) (Desert)
(Seal) (Mammal) (Coast) (Animal)
(Solar System) (Comet) (Tropical Fish) (Universe)
(Waterfall) (Landform) (Nature) (Cockroach)
(Dog) (Mammal) (Pangolin) (Sheep)

27

Automatic Image Annotation

[Block diagram; stages as labeled]
Partition
Feature Extraction
Tokenization
Visual / Verbal Connection Model

Ground Truth: Bear, Polar, Snow, Tundra
Our Method: Bear, Polar, Snow, Tundra, Ice
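One common way to score an annotation result like the example on this slide is keyword-level precision and recall. The slide itself does not say which metric was used, so the sketch below is only an illustration of that standard scoring applied to the two keyword lists shown.

```python
# Keyword sets taken from the slide's example.
ground_truth = {"Bear", "Polar", "Snow", "Tundra"}
predicted = {"Bear", "Polar", "Snow", "Tundra", "Ice"}

correct = ground_truth & predicted          # keywords found in both sets
extra = predicted - ground_truth            # predicted but not in the ground truth

precision = len(correct) / len(predicted)   # share of predicted keywords that are correct
recall = len(correct) / len(ground_truth)   # share of ground-truth keywords recovered

print(f"precision={precision:.2f} recall={recall:.2f} extra={sorted(extra)}")
```

Here every ground-truth keyword is recovered, and the only miss under this metric is the additional keyword "Ice", which a human judge might still consider reasonable for the image.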

29

Education Media: Connexions in Progress

http://cns.rice.edu
Hits in Q4 2002: 250,000 hits/day from 157 countries
>2100 modules, >45 courses (November 2004)
engineering, computer science, nanotech physics, statistics, math, music, IP, bio-diversity, botany, bio-info
From BRIT, UNESCO, UN, Sigma Xi, and authors worldwide

30

H & M Project: Knowledge Gathering

A mythology scenario: named after Hugin and Munin (thought and memory), the twin ravens of the Norse god Odin, that circled the earth gathering information and knowledge each day

Another mythology scenario: a suite of access tools (H) that uses stored context and indices (M) to provide useful access to a federation of digital repositories (named after Mimir, the giant guarding the Well of the Highest Wisdom under the root of the World Tree)

31

Video Information Processing

Audio-visual & text features in a learning framework
Domain structure (similar to PoS tagging in NLP)
Education and entertainment applications

[Diagram: analogy between text processing and video processing]
Text: words -> part-of-speech tagging -> tagged word -> identify sentence BD -> sentence BD
Video: shot segmentation & feature extraction (face, audio, etc.) -> shot classification (tagging) -> Tag_IDs -> story segmentation (HMM & rule induction) -> stories
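The HMM part of the story segmentation step can be illustrated with Viterbi decoding over a sequence of shot tags. This is a minimal sketch, not the system described in the talk: the two states, the tag names, and every probability below are hypothetical values chosen for illustration, and the rule-induction part is not covered.

```python
import math
from typing import Dict, List

# Hypothetical two-state model: a shot either starts a story or continues one.
STATES = ["start", "inside"]

# All probabilities are made-up illustration values, not trained parameters.
START_P: Dict[str, float] = {"start": 0.9, "inside": 0.1}
TRANS_P: Dict[str, Dict[str, float]] = {
    "start": {"start": 0.1, "inside": 0.9},
    "inside": {"start": 0.3, "inside": 0.7},
}
EMIT_P: Dict[str, Dict[str, float]] = {
    "start": {"anchor": 0.8, "report": 0.1, "interview": 0.1},
    "inside": {"anchor": 0.1, "report": 0.6, "interview": 0.3},
}


def viterbi(tags: List[str]) -> List[str]:
    """Return the most likely state sequence for a list of shot tags."""
    # Log-probabilities of the best path ending in each state at the first shot.
    score = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][tags[0]]) for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at shot t + 1
    for tag in tags[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS_P[p][s]))
            cur[s] = prev[best] + math.log(TRANS_P[best][s]) + math.log(EMIT_P[s][tag])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


shot_tags = ["anchor", "report", "report", "anchor", "interview", "report"]
path = viterbi(shot_tags)
boundaries = [i for i, s in enumerate(path) if s == "start"]
print(path, boundaries)
```

With these illustrative parameters the anchor shots at positions 0 and 3 are decoded as story starts, mirroring how tagged shots play the role that tagged words play when finding sentence boundaries in text.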

32

Video Story Segmentation (NUS)

[Two-level diagram; labels as shown]
Shot classification (1st level)
  Scene change, Cue-phrase, Tag_IDs
Story segmentation (2nd level)
  HMM-based story segmentation
  Rule-induction-based story segmentation
  Story units

TRECVID is a community-supported annual open evaluation of technologies for topic detection and tracking of multiple threads of similar stories, spanning a period of time, coming from multiple channels, and covering multilingual sources

33

Video Story Segmentation

[Figure: key frames grouped into Story 1, Story 2, and Story 3]

34

Video Clip Browsing over IP on 3G

35

Possible Query Examples

Search for all video clips containing images of Mother Teresa
  Using indexing information on text, video, speech, and image
Search for all video clips mentioning IBM or containing the IBM logo
  Using indexing information on text, video, speech, image, and logo
Search for all lectures given by Stephen Hawking
  Using indexing information on text, video, speech, and speaker
Search for all recent lectures on the subject of global warming
  Using indexing information on text, video, speech, and image
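Queries like these combine evidence from several indexes, one per modality. The sketch below shows the idea with a tiny in-memory index; the index layout, the clip IDs, and the `search` helper are all hypothetical and only illustrate merging hits across modalities.

```python
from typing import Dict, List, Set

# Hypothetical per-modality index: modality -> term -> IDs of matching clips.
INDEX: Dict[str, Dict[str, Set[str]]] = {
    "text": {"ibm": {"clip1"}, "global warming": {"clip4"}},
    "speech": {"ibm": {"clip2"}, "global warming": {"clip4", "clip5"}},
    "logo": {"ibm": {"clip3"}},
    "speaker": {"stephen hawking": {"clip6"}},
    "image": {"mother teresa": {"clip7"}},
}


def search(term: str, modalities: List[str]) -> Set[str]:
    """Return clips in which any of the given modalities indexes the term."""
    hits: Set[str] = set()
    for modality in modalities:
        hits |= INDEX.get(modality, {}).get(term.lower(), set())
    return hits


# "Mentioning IBM or containing the IBM logo": merge text, speech, and logo hits.
ibm_clips = search("IBM", ["text", "speech", "logo"])

# "Lectures given by Stephen Hawking": rely on the speaker index.
hawking_clips = search("Stephen Hawking", ["speaker"])

print(sorted(ibm_clips), sorted(hawking_clips))
```

Taking the union across modalities means a clip is found even when only one channel carries the evidence, for example a logo that appears on screen but is never spoken.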

36

Web Information Access & Presentation

[Figure: a news page (HTML) headlined "Sampras volunteers for Davis Cup doubles duty", decomposed into links, news content (text), and summary]

Web data mining and information extraction: audio, video, image, speech, text, graphics, icons, cartoon, objects, links, anchor text, web logs, etc.

37

A Grand View: Multimodal Access of Multilingual Multimedia Information

[System diagram; components as labeled]
User Model; User Input (Keyboard, Speech, MM-pad); User Feedback; Q&A Dialogue; User Intent Understanding
Speech Recognizer; Text Processing; Audio/Video Recognizer; Info Fusion
Multimedia Presentation; Audio/Video Rendering; A/V Browser; Information Appliance
Raw A/V Database (Video, Audio, Text); Multimedia Processing; Multimedia Indexing; Indexed A/V Database; Info Fusion & Retrieval
Network

38

Information Technologies and 4M

Multimedia documents
  Audio, video, speech, image, text, chart, map, etc.
  Indexing, retrieval, presentation, rendering, etc.
Multimodal human-machine interface (HCI)
  Speech, gesture, point-and-click, pen, MM sketch pad, etc.
  Multiple sensory inputs and feedback
Multilingual information sources
  Multilingual speech and language understanding
  Multilingual presentation, cross-language referencing
Multidisciplinary collaborative research
  Engineers, scientists, artists, psychologists, etc.
  Human factors, behavioral science, wide range of topics

39

Other Video Data Mining Applications

Video recommendation
Video summarization
Browsing and visualization of multimedia datasets
Video content filtering
User-generated content management (YouTube, Google Video)
Management of meeting/presentation recordings
Educational multimedia material browsing and retrieval
Video broadcast monitoring
User-friendly entertainment video browsing
Video skimming, hierarchical skimming
Video content description for the visually impaired
Music thumb-nailing
Personal photo album thumb-nailing
Extraction of moods from pictures and sounds
Personalization of multimodal and multilingual user interfaces

40

Summary

Web is a rich DB for information processing R&D
  Multimedia web content: heterogeneous with few constraints
  Multilingual web materials, in many cases on the same page
  Multimodal user interface: multiple human sensory input/output
  Multidisciplinary collaboration involving all experts
From media processing to data mining
  Document processing/understanding
  Speech and speaker information
  Audio fingerprint and music identification
  Image annotation, indexing and retrieval
  Video data organization, information subscription and delivery
From mining to knowledge discovery and decision making
  The next business and research frontier
  Plenty of technical challenges that will lead to huge societal impacts
