
Television News Search and Analysis with Lucene/Solr

Kai Chan <kai@ssc.ucla.edu> Social Sciences Computing

University of California, Los Angeles

Lucene Revolution, May 10, 2012

Communication Studies Archive Background (1)

•  Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic
•  Digital recording since 2005
•  Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools
•  Also download transcripts and web-streamed news programs
•  100 news programs and 600,000 words added each day
•  January 2005 to present
  – 28 networks
  – 1,600 shows
  – 130,000 hours
  – 160,000 news programs
  – 50,000,000 images
  – 880,000,000 words

Why This is Important (1)

•  Researchers
  – Large and unique collection of communication
  – Many modalities
    •  Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news and communication in general in ways impossible before

Why This is Important (2)

•  Non-researchers
  – TV news is about presentation and persuasion
    •  Which happen in daily life also
  – TV is the main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch

Application in Research

•  Communication Studies
  – Amount of coverage for events over time
•  Linguistics
  – Speech and language patterns
•  Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation

Application in Teaching (1)

•  Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories

Application in Teaching (2)

•  Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    •  Amount of time given to candidate/party
    •  Types of response (positive, neutral, negative)
  – Students created their own political ad

Work flow (1): Capture/conversion machines

•  2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
•  Capture video and CC to separate files in real time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
•  Generate image snapshots
•  Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)

[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]

Work flow (2): Storage/static file servers

•  Control server
  – Downloads TV schedules
  – Downloads web-streamed news programs
  – Collects and checks recordings
  – Pushes files to the other servers
•  Video streaming server
•  Backup storage server
•  Image server


Work flow (3): Search server

•  Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
•  Hosts search engine
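The slides do not say how the binary segment and time fields are encoded. As a minimal sketch, assuming each segment is a pair of token positions stored as fixed-width big-endian 32-bit integers (the layout and the names `pack_segments` and `unpack_segments` are hypothetical), the bytes for such a field could be produced and read back like this:

```python
import struct


def pack_segments(segments):
    """Encode [(start, end), ...] token positions as big-endian 32-bit pairs.

    The layout is an assumption for illustration, not the archive's format.
    """
    out = bytearray()
    for start, end in segments:
        if not 0 <= start <= end:
            raise ValueError(f"invalid segment: {(start, end)}")
        out += struct.pack(">II", start, end)
    return bytes(out)


def unpack_segments(blob):
    """Decode bytes produced by pack_segments back into a list of pairs."""
    if len(blob) % 8 != 0:
        raise ValueError("blob length must be a multiple of 8 bytes")
    return [struct.unpack_from(">II", blob, off) for off in range(0, len(blob), 8)]
```

A fixed-width layout keeps decoding trivial at search time; a variable-length encoding would save space at the cost of a more involved reader.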


The search process

[Diagram components]
•  Search server
  – Front end: web server, custom code (PHP)
  – Bridge: PHP-Java Bridge or Solr
  – Back end: custom code (Java), Lucene, Lucene index, MySQL database
•  Video server: video files, video streaming server (Wowza)
•  Image server: web server (Apache), thumbnails and montages
•  User actions: perform searches, watch videos, retrieve thumbnails and montages

Custom query type: Segment-enclosed query (1)

•  Problem 1: search for “X near Z”
•  Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number
•  Problem 2: all matched search words might not be talking about the same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan
•  A news program can contain several stories
  – E.g. local, national, world, weather, sports

[Figure: one news program divided into segments: local story 1, local story 2, commercials, national story 1, commercials, national story 2, health, entertainment, commercials, world story 1, world story 2, weather 1, weather 2, sports]

•  One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
•  Bonus: enables searching/filtering for a particular story type
  – E.g. politics
•  How to mark segments
  – Automated
    •  Computer Science researchers are working on this
    •  Word frequency
    •  Scene change
    •  Black frame and silence
  – Manual segmentation
    •  Watch the video
    •  Decide where a story starts and ends
    •  Mark positions in semi-automated system

[Figure: three segments (seg. 1, seg. 2, seg. 3), each with a begin and an end position, overlaid with five spans (span 1 to span 5)]

•  Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
•  In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
•  Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
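The filtering step in the idea above can be sketched outside Lucene. This is an illustration only: it assumes spans and segments are (start, end) token positions with an exclusive end, and that segments are sorted and do not overlap. The function name `spans_within_segments` is hypothetical and not taken from the production code.

```python
from bisect import bisect_right


def spans_within_segments(spans, segments):
    """Keep only the spans that lie fully inside a single segment.

    spans:    list of (start, end) token positions, end exclusive.
    segments: sorted, non-overlapping list of (start, end), end exclusive.
    """
    segment_starts = [start for start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Find the last segment that begins at or before the span start.
        i = bisect_right(segment_starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept
```

With segments `[(0, 100), (100, 250), (250, 400)]`, a span `(90, 110)` is dropped because it crosses the boundary between the first two segments, while `(120, 130)` is kept.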

Custom query type: Time-enclosed query

[Figure: timeline from 20 s to 60 s in 5 s steps, overlaid with five spans (span 1 to span 5); the spans are labeled <= 20 s, <= 15 s, <= 10 s, <= 35 s, <= 25 s]
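The slides show only a figure for this query type. One plausible reading, stated here as an assumption, is that a span is kept when the time between its first and last word is within a limit, with times taken from the caption timestamps recorded every few seconds. The names `time_at` and `spans_within_time` are hypothetical.

```python
from bisect import bisect_right


def time_at(position, stamps):
    """Return the time in seconds of the last timestamp at or before a token position.

    stamps: sorted list of (token_position, seconds) pairs.
    """
    positions = [pos for pos, _ in stamps]
    i = bisect_right(positions, position) - 1
    if i < 0:
        raise ValueError("position precedes the first timestamp")
    return stamps[i][1]


def spans_within_time(spans, stamps, max_seconds):
    """Keep spans (start, end exclusive) whose estimated duration is within max_seconds."""
    kept = []
    for start, end in spans:
        duration = time_at(end - 1, stamps) - time_at(start, stamps)
        if duration <= max_seconds:
            kept.append((start, end))
    return kept
```

Because timestamps arrive only every 2-3 seconds, the duration is an estimate with roughly that resolution.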

Custom query type: Multi-term regular expression (1)

•  “here is _ _ _ with the (news|story|details|report)”

•  Apply RegEx to a phrase or sentence – Not just individual words

•  Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us
•  Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    •  Want to match a pattern against a phrase/sentence
    •  Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, all documents, and all positions in the index
    •  Very slow
    •  Takes lots of memory
•  What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    •  E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    •  E.g. “_ _ is the _ _ _”
    •  Preserve for correctness
    •  Store word count for each document
    •  Expand each span on both sides
    •  Bounds checking
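To make the intended matching semantics concrete, here is a brute-force, in-memory stand-in. It is not the Lucene translation described above: it scans a token list directly instead of building SpanNearQuery and RegexQuery objects. It assumes pattern terms are separated by spaces and that `_` stands for exactly one word; `parse_pattern` and `find_matches` are hypothetical names.

```python
import re


def parse_pattern(pattern):
    """Split a multi-term pattern into per-term regexes; '_' becomes None (any one word)."""
    return [None if tok == "_" else re.compile(tok) for tok in pattern.split()]


def find_matches(tokens, pattern):
    """Return (start, end) token ranges, end exclusive, where the whole pattern matches."""
    terms = parse_pattern(pattern)
    n = len(terms)
    matches = []
    # Bounds checking: the full window, including leading and trailing
    # placeholders, must fit inside the document.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(term is None or term.fullmatch(word)
               for term, word in zip(terms, window)):
            matches.append((start, start + n))
    return matches
```

A pattern such as “_ _ is the _ _ _” finds nothing in a four-word document, because the leading and trailing placeholders cannot be satisfied within the document bounds.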

•  Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
•  Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help

Custom result format: Occurrence count

[Table: rows are dates (9/14/08, 9/15/08, 9/16/08), columns are words (crisis, meltdown, tsunami, crash); each cell shows “X docs, Y occurrences”]

•  To fill a cell, e.g. (9/15/08, meltdown): go through every span generated by SpanTermQuery(meltdown) filtered by date 9/15/08
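The date-by-word table can be illustrated without Lucene. The sketch below assumes documents are already available as (date, tokens) pairs and counts directly, whereas the slide describes iterating over spans from a SpanTermQuery; the function name `occurrence_table` is hypothetical.

```python
from collections import defaultdict


def occurrence_table(docs, words):
    """Build {(date, word): (doc_count, occurrence_count)} from (date, tokens) pairs."""
    wanted = set(words)
    doc_counts = defaultdict(int)
    occurrence_counts = defaultdict(int)
    for date, tokens in docs:
        per_doc = defaultdict(int)
        for token in tokens:
            if token in wanted:
                per_doc[token] += 1
        for word, count in per_doc.items():
            doc_counts[(date, word)] += 1          # one more matching document
            occurrence_counts[(date, word)] += count  # all hits in that document
    return {key: (doc_counts[key], occurrence_counts[key]) for key in doc_counts}
```

Cells with no matches are simply absent from the result, so a caller rendering the table would treat a missing key as “0 docs, 0 occurrences”.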

Future work: Job queue (1)

•  Research front moving towards analysis of whole database
  – Want full search result set
  – Queries are intensive and take a long time
•  Solution will be beyond increasing timeout
  – Users might close their browsers
  – We might restart the search back end

Future work: Job queue (2)

•  Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota
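Since the job queue is future work, the slides give no design. The following single-process sketch only illustrates three of the listed features (priority, cancelling, and restarting after recoverable errors). Every name in it, including `JobQueue` and `RecoverableError`, is hypothetical, and a real system would run jobs in background workers and persist their state.

```python
import heapq
import itertools


class RecoverableError(Exception):
    """Hypothetical marker for errors that justify retrying a job."""


class JobQueue:
    def __init__(self, max_attempts=3):
        self._heap = []
        self._ids = itertools.count()
        self.max_attempts = max_attempts
        self.status = {}   # job_id -> "queued", "finished", "failed" or "cancelled"
        self.results = {}  # job_id -> return value of a finished job

    def submit(self, func, priority=0):
        """Queue a zero-argument callable; lower priority numbers run first."""
        job_id = next(self._ids)
        heapq.heappush(self._heap, (priority, job_id, func, 1))
        self.status[job_id] = "queued"
        return job_id

    def cancel(self, job_id):
        """Cancel a job that has not finished yet."""
        if self.status.get(job_id) == "queued":
            self.status[job_id] = "cancelled"

    def run_next(self):
        """Run one attempt of the next job; return its id, or None if the queue is empty."""
        while self._heap:
            priority, job_id, func, attempt = heapq.heappop(self._heap)
            if self.status[job_id] == "cancelled":
                continue  # skip cancelled jobs without running them
            try:
                self.results[job_id] = func()
                self.status[job_id] = "finished"
            except RecoverableError:
                if attempt < self.max_attempts:
                    # Put the job back so a later call retries it.
                    heapq.heappush(self._heap, (priority, job_id, func, attempt + 1))
                else:
                    self.status[job_id] = "failed"
            except Exception:
                self.status[job_id] = "failed"
            return job_id
        return None
```

Notification, downloadable results, recurring schedules, and quotas are left out; they would sit on top of the status and results records kept here.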

Future work: Multiple sources and languages (1)

•  Multilingual news programs
  – E.g. some have English + Spanish CC
•  Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
•  Multiple markets
  – E.g. capture TV programs in Denmark and Norway

Future work: Multiple sources and languages (2)

•  Need language detection
  – Libraries exist
•  Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
•  What will proximity search and occurrence counting mean when there are multiple channels/languages?

Future work: Metadata

•  Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
•  Eventually: want them to be searchable

Thank you for coming!

•  Any questions?
•  My e-mail: kai@ssc.ucla.edu
•  Slides available: http://ucla.in/IDJq2u
