Video Data Mining: Technology Dimensions and Challenges. Chin-Hui Lee (李錦輝), School of ECE, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA, [email protected]. Talk at NTNU, Jan. 5, 2007.

  • 1

    Video Data Mining: Technology Dimensions and Challenges

    Chin-Hui Lee (李錦輝), School of ECE, Georgia Institute of Technology

    Atlanta, GA 30332-0250, USA, [email protected]

    Talk at NTNU, Jan. 5, 2007

    2

    Evolution of Language and Media

    [Diagram: Historic Flow of Knowledge & Civilization. Labels shown: Spoken Language; Written Language (3000 BC); Paper; Print (1450 AD); Electronic Media (1900 AD); Telegraph & Telephone; Radio; TV; Recording Media; Computer & Digital Processing; Internet & WWW; Hyper & Virtual Media? (21st century)]

  • 2

    3

    Outline

    Web is a rich repository of all information sources
      Multimedia web content: heterogeneous with few constraints
      Multilingual web materials, in many cases on the same page
      Multimodal user interface: multiple human sensory input/output
      Multidisciplinary

    From media processing to data mining
      Document processing/understanding
      Speech and speaker information
      Audio fingerprint and music identification
      Image annotation, indexing and retrieval
      Video data organization, information subscription and delivery

    From mining to knowledge discovery and decision making
      The next big thing

    Summary

    4

    Information Explosion on the Web (part of The World Is Flat by Friedman)

    Over two billion English web documents in 2005
      Still growing exponentially in most languages (>1B for Chinese)
      From text only to multimedia and multilingual documents

    Global data storage explosion (HD + network storage)
      In 2002, global information increased by 5 billion GB, about 800 MB per person, enough to fill 500,000 Libraries of Congress

    Memory explosion on PCs and portable devices
      From 512 KB for a PC (Gates) to a portable Library of Congress

    Cell phone sales close to 500M units in 2003
      Mobile information appliances are commodities and fashion items

    Today's students are more wired than ever in the US
      90% of those aged 5-17 use computers; 99% of public schools are wired

    From Command/Control to Connect/Collaborate
      World leaders and executives access Google and BlackBerry
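
    The storage figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB), which the slide does not specify, and only derives what the quoted numbers imply: the population behind "800 MB per person" and the size one "Library of Congress" would need to have.

```python
MB_PER_GB = 1000  # decimal convention; an assumption, the slide does not say

new_data_gb = 5e9      # "5 billion GB" of new information in 2002
per_person_mb = 800    # "about 800 MB per person"
libraries = 500_000    # "500,000 Libraries of Congress"

# Population implied by dividing the total by the per-person share.
implied_population = new_data_gb * MB_PER_GB / per_person_mb

# Size of a single "Library of Congress" implied by the comparison.
per_library_gb = new_data_gb / libraries

print(f"implied population: {implied_population / 1e9:.2f} billion")
print(f"implied size of one library: {per_library_gb / 1000:.0f} TB")
```

    Under these assumptions the figures are mutually consistent with a world population of roughly six billion and a library of about 10 TB.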

  • 3

    5

    Web to Accelerate Technology Adoption Rate

    [Bar chart: Time to Reach 10 Million Customers, in years (axis from 0 to 40). Technologies shown: Fax Machine, New Browser, WWW, PC, CD-ROM, Cellular Phone, VCR, Pager (41 years). Other bar labels: 1, 4, 6, 7, 9, 9, 22. Callouts: "100 Years!" and "Napster?". Sources: Apple, AirTouch Cellular, Info Tech and USA Today.]

    6

    Content Web: The New DB & Playground

    [Figure labels: Family, Sports, Fun, Knowledge, Personal, Finance, Office]

  • 4

    7

    Text Categorization Topic Identification

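    [Diagram: an unknown document d_j is passed to m classifiers T_1, ..., T_i, ..., T_m. The decisions by the m classifiers, T_1(d_j), ..., T_i(d_j), ..., T_m(d_j), are compared in an evaluation step with the labels of d_j for the categories C_i: L_1(d_j), ..., L_i(d_j), ..., L_m(d_j).]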
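
    The decide-then-evaluate loop in this diagram can be sketched as follows. This is a minimal illustration, not the method used in the talk: the keyword sets stand in for trained classifiers T_1..T_m, and the topic names, documents, and the `classify` and `evaluate` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists standing in for m trained topic classifiers T_1..T_m.
TOPIC_KEYWORDS = {
    "sports": {"match", "team", "cup"},
    "finance": {"stock", "market", "bank"},
    "science": {"physics", "climate", "research"},
}

def classify(document, min_hits=1):
    """Return the binary decision T_i(d_j) of every topic classifier."""
    words = Counter(document.lower().split())
    return {
        topic: sum(words[w] for w in keywords) >= min_hits
        for topic, keywords in TOPIC_KEYWORDS.items()
    }

def evaluate(documents, labels):
    """Compare decisions T_i(d_j) with labels L_i(d_j); return (precision, recall) per topic."""
    scores = {}
    for topic in TOPIC_KEYWORDS:
        tp = fp = fn = 0
        for doc, true_topics in zip(documents, labels):
            decided = classify(doc)[topic]
            actual = topic in true_topics
            tp += decided and actual
            fp += decided and not actual
            fn += (not decided) and actual
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[topic] = (precision, recall)
    return scores

docs = [
    "The team won the cup match",
    "Stock market rallies as bank shares rise",
    "Climate research team publishes physics results",
]
gold = [{"sports"}, {"finance"}, {"science"}]
results = evaluate(docs, gold)
print(results)
```

    Because each topic has its own binary classifier, a document can receive several labels at once; here the third document is also flagged as sports because it contains "team", which lowers the precision of that classifier.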

    8

    Event Representation & Topic Classification

    [Diagram: video carries speech, audio, image, text, and other streams. Inputs: Speech (via ASR), Image (via AIA), Text. Processing blocks: Morphological Filtering, Query-Vector Extraction, Text Categorization.]

    ASR: Automatic Speech Recognition

    AIA: Automatic Image Annotation

  • 5

    9

    Common Technology Thread: DSP, Feature Extraction & Classifier Learning

    [Diagram labels: Speech/Image/Audio; Media Tokenization; A/V Alphabet Model; A/V Word List; Text; LSA-Based FE/SVD; Feature; Text Document Training Set; TC Classifier Learning; TC Classifier; Audiovisual Classification; Results]

    First step: define alphabets and train alphabet models

    10

    First Google, Now Google News

  • 6

    11

    Information Present in Speech

    Information present                     Technology
    Word transcription                      Speech recognition
    Meaning                                 Spoken language understanding
    Identity of a speaker                   Speaker verification
    Emotional state of the speaker          -
    Sex, age, and health of the speaker     -
    Transmission characteristics            Channel characterization
    Language that is spoken                 Language identification

    Transmission characteristics: type of microphone, room noise, transmission noise, filtering, distortion, reverberation.

    In real speech, this information is highly correlated.

    12

    Speech and Speaker Data Mining

    [Diagram: ASR produces speech transcripts (example: "... devastating accident"); SID produces the speaker identity (example: anchor); event detection produces events (example: explosion).]

  • 7

    13

    SpeechFind: Speech & Speaker Annotation

    Fully searchable online database of spoken word collections spanning the 20th century

    http://svoice.colorado.edu (Bowen Zhou)

    14

    Blink-X: A Video Search Portal

  • 8

    15

    Conversational User Interface: R2D2

    [Diagram: Voice Input (speech) → Speech Recognizer, using acoustic & language models → (text) → Language Analyzer, using semantic rules → (meaning) → Dialogue Manager, with database interaction, history management, command execution, status report, and output action → (text reply) → Language Generator & TTS Synthesizer, using text analysis & pronunciation rules → Voice Output (speech)]

    Applicable to any language, including Mandarin, Minnan, Hakka, and English spoken in Taiwan

    16

    Multilingual Web of the Future

    Internet Users by Language (end of 2004, 800 million)

    English       35.90%
    Chinese       13.20%
    Japanese       8.30%
    German         6.80%
    Spanish        6.70%
    French         4.40%
    Korean         3.80%
    Italian        3.60%
    Portuguese     2.90%
    Dutch          1.70%
    All others    12.70%

    Clear trend
      Non-English users are continuously increasing
      Japanese and Chinese are currently (and may continue to be) the two largest non-English groups

  • 9

    17

    Multilingual Web Pages: An Example

    18

    Universal Speech Translation: C3PO

    [Diagram: Voice Input (speech in language A) → Speech Recognizer, using acoustic & language models → (text in language A) → Language Analyzer, using semantic rules → (text understanding in language A/B) → Machine Translation, using bilingual databases and translation models → (text reply in language B) → Language Generator & TTS Synthesizer, using text analysis & pronunciation rules → Voice Output (speech in language B)]

    Applicable to any pair of languages, e.g. (English, Mandarin), (Minnan, Mandarin)

  • 10

    19

    Talking Heads

    3D talking heads: flexible; easy to show in any pose; faces look cartoon-like.

    Sample-based talking heads: look like a real person; require recordings of real people; limited in the poses that can be shown.

    20

    Music and Speech Connection

    Krishna and Sreenivas (2004) drew parallels between music and speech
      Speech recognition ↔ music transcription
      Instrument recognition ↔ speaker recognition
      Cocktail separation ↔ instrument separation
      Genre classification ↔ language classification

    Perceptual results exist that support a link between music and language, but the debate is still continuing

  • 11

    21

    Subjective Music Genre Classification

    ID    Style         Description
    S&Y   Motown/Soul   More freedom to anticipate, more syncopation, greater tempo fluctuations; medium difficulty
    O     Country song  Non-prominent drum, much lower correlation between beat and events; difficult
    R     Bossa nova    Syncopated guitar & vocals, very little percussion; difficult even for humans
    M     Jazz swing    Complex & syncopated rhythms; difficult even for the musically trained

    22

    Experimental Results (ISMIR2006)

    Overall accuracy was 72.86%.
      128 segment models
      4 iterations of segment modeling algorithm

    Genre          Classical  Electronic  Rock  Jazz/Blues  Ambient  Recall (%)
    Classical         26          0         1       1          2        86.7
    Electronic         0         19         9       0          2        63.3
    Rock               0          5        24       0          1        80.0
    Jazz/Blues         1          2         5      12          1        57.1
    Ambient            1          4         2       1         21        72.4
    Precision (%)    92.9       63.3      58.5    85.7       77.8
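
    The recall, precision, and overall accuracy figures on this slide can be recomputed from the confusion counts. The sketch below reads the table with true genres as rows and predicted genres as columns; that orientation is an assumption, inferred from recall being reported per row and precision per column.

```python
GENRES = ["Classical", "Electronic", "Rock", "Jazz/Blues", "Ambient"]

# Rows: true genre; columns: predicted genre, both in the order of GENRES.
CONFUSION = [
    [26, 0, 1, 1, 2],
    [0, 19, 9, 0, 2],
    [0, 5, 24, 0, 1],
    [1, 2, 5, 12, 1],
    [1, 4, 2, 1, 21],
]

def recall(matrix, i):
    """Percentage of items of true class i that were predicted as class i."""
    return 100.0 * matrix[i][i] / sum(matrix[i])

def precision(matrix, i):
    """Percentage of items predicted as class i that truly belong to class i."""
    return 100.0 * matrix[i][i] / sum(row[i] for row in matrix)

def accuracy(matrix):
    """Percentage of all items that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

for i, genre in enumerate(GENRES):
    print(f"{genre:<11} recall {recall(CONFUSION, i):5.1f}  precision {precision(CONFUSION, i):5.1f}")
print(f"overall accuracy {accuracy(CONFUSION):.2f}")
```

    The diagonal holds 102 of the 140 test items, which gives the 72.86% reported on the slide.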

  • 12

    23

    Self-Generating Web Community

    Yahoo
      When you organize, users will come
      Free e-mail, later inspiring Gmail

    Napster
      Peer-to-peer networking and information exchange

    Google
      The Web is the largest database, library and playground

    Wikipedia
      Self-established encyclopedia in multiple languages

    YouTube
      Pioneering and outlasting Google Video

    Many more, but what's next?

    24

    Wikipedia

  • 13

    25

    Concept & Content Based Photo Retrieval

    Indexing and retrieval of photos
      Content-based example search does not give good performance
      Concept-based keyword search

    GUI, Speech UI, Multimedia UI

    26

    Multilingual Image Annotation (IIS)

    [Table of images with their top 4 keywords; images not reproduced. Keywords as extracted:]

    (Rainbow) (Weather) (Flower) (Nature)

    (Sunflower) (Flower) (Plant) (Desert)

    (Seal) (Mammal) (Coast) (Animal)

    (Solar System) (Comet) (Tropical Fish) (Universe)

    (Waterfall) (Landform) (Nature) (Cockroach)

    (Dog) (Mammal) (Pangolin) (Sheep)

  • 14

    27

    Automatic Image Annotation

    [Diagram: Partition → Feature Extraction → Tokenization → Visual/Verbal Connection Model]

    Ground truth: Bear, Polar, Snow, Tundra

    Our method: Bear, Polar, Snow, Tundra, Ice
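
    The four stages named on this slide can be illustrated with a toy pipeline. The sketch below is not the system from the talk: it assumes grayscale images given as lists of rows, uses mean tile intensity as the only feature, a fixed three-entry codebook for tokenization, and simple co-occurrence counts as a stand-in for the visual/verbal connection model. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

def partition(image, block):
    """Split a 2-D grayscale image (list of rows) into block x block tiles."""
    tiles = []
    for r in range(0, len(image), block):
        for c in range(0, len(image[0]), block):
            tiles.append([row[c:c + block] for row in image[r:r + block]])
    return tiles

def feature(tile):
    """One toy feature per tile: its mean intensity."""
    pixels = [p for row in tile for p in row]
    return sum(pixels) / len(pixels)

def tokenize(value, codebook):
    """Map a feature to the index of the nearest codeword (a 'visual word')."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))

def visual_words(image, block, codebook):
    return [tokenize(feature(t), codebook) for t in partition(image, block)]

def train(examples, block, codebook):
    """Count how often each visual word co-occurs with each keyword."""
    cooccur = defaultdict(Counter)
    for image, keywords in examples:
        for token in visual_words(image, block, codebook):
            cooccur[token].update(keywords)
    return cooccur

def annotate(image, cooccur, block, codebook, top_n=2):
    """Let every visual word in the image vote for the keywords it co-occurred with."""
    votes = Counter()
    for token in visual_words(image, block, codebook):
        votes.update(cooccur.get(token, Counter()))
    return [word for word, _ in votes.most_common(top_n)]

CODEBOOK = [0, 128, 255]  # hypothetical dark / mid / bright codewords
snow = [[250, 250], [240, 255]]
night = [[5, 10], [0, 20]]
model = train([(snow, ["snow", "ice"]), (night, ["night"])], block=1, codebook=CODEBOOK)
labels = annotate([[255, 245], [250, 250]], model, block=1, codebook=CODEBOOK)
print(labels)
```

    A practical system would replace each stage with something stronger (segmentation instead of a grid, richer features, a learned codebook), but the data flow between the stages stays the same.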

  • 15

    29

    Education Media: Connexions in Progress

    http://cns.rice.edu

    Hits in Q4 2002: 250,000 hits/day from 157 countries

    >2100 modules, >45 courses (November 2004)

    engineering, computer science, nanotech, physics, statistics, math, music, IP, bio-diversity, botany, bio-info

    BRIT, UNESCO, UN, Sigma Xi, from authors worldwide

    30

    H & M Project: Knowledge Gathering

    A mythology scenario: named after Hugin and Munin (thought and memory), the twin ravens of the Norse god Odin, which circled the earth each day gathering information and knowledge

    Another mythology scenario: a suite of access tools (H) that uses stored context and indices (M) to provide useful access to a federation of digital repositories (named after Mimir, the giant guarding the Well of the Highest Wisdom under the root of the World Tree)

  • 16

    31

    Video Information Processing
      Audio-visual & text features in a learning framework
      Domain structure (similar to PoS tagging in NLP)
      Education and entertainment applications

    [Diagram: analogy between NLP and video processing. NLP side: words → part-of-speech tagging → tagged words → identify sentence BD → sentence BD. Video side: face, audio, etc. → shot segmentation & feature extraction → shot classification (tagging) → Tag_IDs → story segmentation (HMM & rule induction) → stories.]

    32

    Video Story Segmentation (NUS)

    [Diagram: two-level framework. 1st level: shot classification, producing Tag_IDs alongside scene-change and cue-phrase information. 2nd level: story segmentation, either HMM based or rule-induction based, producing story units.]

    TRECVID is a community-supported annual open evaluation of technologies for topic detection and tracking of multiple threads of similar stories spanning a period of time, coming from multiple channels, and covering multilingual sources
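
    The HMM-based second level can be illustrated with a small Viterbi decoder over shot tags. This is only a sketch under stated assumptions: the two hidden states, the three shot tags, and every probability below are hypothetical placeholders, whereas a real system would estimate them from annotated video and use a richer tag set.

```python
import math

STATES = ("start", "inside")  # hidden state per shot: begins a story, or continues one

# Hypothetical parameters; a real system would estimate them from annotated video.
START_P = {"start": 0.9, "inside": 0.1}
TRANS_P = {
    "start": {"start": 0.1, "inside": 0.9},
    "inside": {"start": 0.3, "inside": 0.7},
}
EMIT_P = {
    "start": {"anchor": 0.7, "report": 0.2, "sports": 0.1},
    "inside": {"anchor": 0.05, "report": 0.6, "sports": 0.35},
}

def viterbi(tags):
    """Return the most likely hidden state sequence for a sequence of shot tags."""
    scores = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][tags[0]]) for s in STATES}]
    back = []
    for tag in tags[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this shot.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS_P[p][s]))
            column[s] = prev[best] + math.log(TRANS_P[best][s]) + math.log(EMIT_P[s][tag])
            pointers[s] = best
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def story_boundaries(tags):
    """Indices of shots decoded as the first shot of a story."""
    return [i for i, s in enumerate(viterbi(tags)) if s == "start"]

shots = ["anchor", "report", "report", "anchor", "report", "sports"]
print(story_boundaries(shots))
```

    With these placeholder parameters the decoder places a story boundary at each anchor shot, which mirrors the intuition that anchor-person shots tend to open news stories.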

  • 17

    33

    Video Story Segmentation

    [Figure labels: Story 1, Story 2, Story 3, Key Frame]

    34

    Video Clip Browsing over IP on 3G

  • 18

    35

    Possible Query Examples

    Search for all video clips containing images of Mother Teresa
      Using indexing information on text, video, speech, and image

    Search for all video clips mentioning IBM or containing the IBM logo
      Using indexing information on text, video, speech, image, and logo

    Search for all lectures given by Stephen Hawking
      Using indexing information on text, video, speech, and speaker

    Search for all recent lectures on the subject of global warming
      Using indexing information on text, video, speech, and image
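
    Queries like these combine evidence from several indexed modalities. The sketch below shows one way such a lookup could work, using a toy inverted index keyed by modality and term. The `ClipIndex` class, the clip identifiers, and the indexed terms are hypothetical and only illustrate the idea of a per-modality index.

```python
from collections import defaultdict

class ClipIndex:
    """Toy inverted index mapping (modality, term) to the clips that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, clip_id, modality, terms):
        """Record that `terms` were detected in `clip_id` through `modality`."""
        for term in terms:
            self._postings[(modality, term.lower())].add(clip_id)

    def search_any(self, term, modalities):
        """Clips where `term` was indexed in at least one of the given modalities."""
        hits = set()
        for modality in modalities:
            hits |= self._postings.get((modality, term.lower()), set())
        return hits

index = ClipIndex()
index.add("clip1", "speech", ["IBM", "earnings"])      # from speech transcripts
index.add("clip2", "logo", ["IBM"])                    # from logo detection
index.add("clip3", "speaker", ["Stephen Hawking"])     # from speaker identification
index.add("clip3", "speech", ["universe"])
index.add("clip4", "text", ["global warming"])         # from on-screen or page text

# "All clips mentioning IBM or containing the IBM logo"
print(sorted(index.search_any("IBM", ["text", "speech", "logo"])))
# "All lectures given by Stephen Hawking"
print(sorted(index.search_any("Stephen Hawking", ["speaker"])))
```

    Keeping the modality in the key lets the same term be matched selectively, so a query can ask for a name as a speaker without also matching clips that merely mention that name.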

    36

    Web Information Access & Presentation

    [Figure: a news page (HTML) headlined "Sampras volunteers for Davis Cup doubles duty", decomposed into links, news content (text), and a summary]

    Web data mining and information extraction: audio, video, image, speech, text, graphics, icons, cartoon, objects, links, anchor text, web logs, etc.

  • 19

    37

    A Grand View: Multimodal Access of Multilingual Multimedia Information

    [Diagram components: User Input (Keyboard, Speech, MM-pad); Speech Recognizer; Text Processing; Audio/Video Recognizer; User Intent Understanding; User Model; User Feedback; Q&A Dialogue; Info Fusion; Info Fusion & Retrieval; Network; Raw A/V Database; Multimedia Processing; Multimedia Indexing; Indexed A/V Database; A/V Browser; Audio/Video Rendering; Multimedia Presentation (Video, Audio, Text); Information Appliance]

    38

    Information Technologies and 4M

    Multimedia documents
      Audio, video, speech, image, text, chart, map, etc.
      Indexing, retrieval, presentation, rendering, etc.

    Multimodal human machine interface (HCI)
      Speech, gesture, point-and-click, pen, MM sketch pad, etc.
      Multiple sensory inputs and feedback

    Multilingual information sources
      Multilingual speech and language understanding
      Multilingual presentation, cross-language referencing

    Multidisciplinary collaborative research
      Engineers, scientists, artists, psychologists, etc.
      Human factors, behavior science, wide range of topics

  • 20

    39

    Other Video Data Mining Applications

      Video recommendation
      Video summarization
      Browsing and visualization of multimedia datasets
      Video content filtering
      User-generated content management (YouTube, Google Video)
      Management of meeting/presentation recordings
      Educational multimedia material browsing and retrieval
      Video broadcast monitoring
      User-friendly entertainment video browsing
      Video skimming, hierarchical skimming
      Video content description for the visually impaired
      Music thumbnailing
      Personal photo album thumbnailing
      Extraction of moods from pictures and sounds
      Personalization of multimodal and multilingual user interfaces

    40

    Summary

    Web is a rich DB for information processing R&D
      Multimedia web content: heterogeneous with few constraints
      Multilingual web materials, in many cases on the same page
      Multimodal user interface: multiple human sensory input/output
      Multidisciplinary collaboration involving all experts

    From media processing to data mining
      Document processing/understanding
      Speech and speaker information
      Audio fingerprint and music identification
      Image annotation, indexing and retrieval
      Video data organization, information subscription and delivery

    From mining to knowledge discovery and decision making
      The next business and research frontier
      Plenty of technical challenges that will lead to huge societal impacts
