Television News Search and Analysis with Lucene/Solr
Kai Chan <[email protected]> Social Sciences Computing
University of California, Los Angeles
Lucene Revolution, May 10, 2012
Communication Studies Archive Background (1)
• Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic
Communication Studies Archive Background (2)
• Digital recording since 2005
• Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools
Communication Studies Archive Background (3)
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
Communication Studies Archive Background (4)
• January 2005 to present
  – 28 networks
  – 1,600 shows
  – 130,000 hours
  – 160,000 news programs
  – 50,000,000 images
  – 880,000,000 words
Why This is Important (1)
• Researchers
  – Large and unique collection of communication
  – Many modalities
    • Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news + communication in general in ways impossible before
Why This is Important (2)
• Non-researchers
  – TV news about presentation and persuasion
    • Which happen in daily life also
  – TV main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch
Application in Research
• Communication Studies
  – Amount of coverage for events over time
• Linguistics
  – Speech and language patterns
• Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation
Application in Teaching (1)
• Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories
Application in Teaching (2)
• Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    • Amount of time given to candidate/party
    • Types of response (positive, neutral, negative)
  – Students created their own political ad.
Work flow (1): Capture/conversion machines
• 2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
• Capture video and CC to separate files in real-time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)
[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]
Work flow (2): Storage/static file servers
• Control server
  – Download TV schedules
  – Download web-streamed news programs
  – Collect and check recordings
  – Pushes files to places
• Video streaming server
• Backup storage server
• Image server
Work flow (3): Search server
• Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
• Hosts search engine
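As a rough illustration of the field layout above, the following sketch packs segment boundaries into a binary value and attaches it to a per-program record. The byte layout (big-endian unsigned 32-bit pairs), the field names, and the sample values are all assumptions made for this example; the slides do not say how the archive actually encodes its binary fields.

```python
import struct

def pack_segments(segments):
    """Pack (start, end) token-position pairs into bytes (hypothetical layout)."""
    flat = [pos for pair in segments for pos in pair]
    # ">" = big-endian, "I" = unsigned 32-bit integer, one per position
    return struct.pack(">%dI" % len(flat), *flat)

def unpack_segments(blob):
    """Inverse of pack_segments: bytes back to a list of (start, end) pairs."""
    count = len(blob) // 4
    flat = struct.unpack(">%dI" % count, blob)
    return list(zip(flat[0::2], flat[1::2]))

# One record per news program; field names and values are illustrative only.
doc = {
    "text": "good evening here is the news",    # main text field, tokenized
    "date": "2008-09-15",                       # separate metadata fields
    "network": "CNN",
    "show": "Example Show",
    "segments": pack_segments([(0, 120), (121, 450)]),  # binary field
}
```

A fixed-width layout like this keeps decoding cheap at search time, which matters if segment data has to be read for every matching document.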
The search process
[Diagram: components and user actions]
• User
  – Perform searches
  – Watch videos
  – Retrieve thumbnails and montages
• Search server
  – Front end: web server, custom code (PHP)
  – Bridge: PHP-Java Bridge or Solr
  – Back end: custom code (Java), Lucene, Lucene index, MySQL database
• Video server
  – Video files
  – Video streaming server (Wowza)
• Image server
  – Web server (Apache)
  – Thumbnails & montages
Custom query type: Segment-enclosed query (1)
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number
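To make the "how to pick Y" problem concrete, here is a small plain-Python sketch of a fixed-window proximity test. It measures distance as the difference in token positions, which is a simplification and not Lucene's exact slop semantics; the sample sentence is invented for illustration.

```python
def within(tokens, x, z, y):
    """Return True if some occurrence of x is within y positions of some z."""
    xs = [i for i, t in enumerate(tokens) if t == x]
    zs = [i for i, t in enumerate(tokens) if t == z]
    return any(abs(i - j) <= y for i in xs for j in zs)

# Invented caption text; the two search words are 8 positions apart.
tokens = "obama will visit kabul next week officials in afghanistan said".split()

hit_small_window = within(tokens, "obama", "afghanistan", 5)   # window too small
hit_large_window = within(tokens, "obama", "afghanistan", 8)   # window just fits
```

The same sentence is a miss at Y = 5 and a hit at Y = 8, and a larger Y starts matching words from unrelated stories, so no single value works for every query.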
Custom query type: Segment-enclosed query (2)
• Problem 2: all matched search words might not be talking about same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan
Custom query type: Segment-enclosed query (3)
• A news program can contain several stories
  – E.g. Local, national, world, weather, sports
Custom query type: Segment-enclosed query (4)
[Diagram: one news program divided into segments: local story 1, local story 2, commercials, national story 1, commercials, national story 2, health, entertainment, commercials, world story 1, world story 2, weather 1, weather 2, sports]
Custom query type: Segment-enclosed query (5)
• One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
  – E.g. Politics
Custom query type: Segment-enclosed query (6)
• How to mark segments
  – Automated
    • Computer Science researchers working on them
    • Word frequency
    • Scene change
    • Black frame and silence
  – Manual segmentation
    • Watch the video
    • Decide where a story starts and ends
    • Mark positions in semi-automated system
Custom query type: Segment-enclosed query (7)
[Diagram: segments 1-3, each with a begin and an end marker, overlaid with spans 1-5]
Custom query type: Segment-enclosed query (8)
• Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
• In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
• Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
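The filtering step in the idea above can be sketched in plain Python, independent of Lucene. This is a minimal sketch, assuming spans and segments are both given as (start, end) token positions with an exclusive end, and that segments are sorted and do not overlap; the function name and sample numbers are made up for illustration and are not the production code.

```python
from bisect import bisect_right

def enclosed_spans(spans, segments):
    """Keep spans lying fully inside one segment.

    spans: (start, end) token positions, end exclusive.
    segments: sorted, non-overlapping (start, end) pairs, end exclusive.
    """
    starts = [s for s, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Find the last segment that begins at or before the span's start.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept

# Three story segments and five candidate spans (sample values).
segments = [(0, 100), (100, 250), (250, 400)]
spans = [(10, 20), (90, 110), (120, 130), (240, 260), (300, 310)]
kept = enclosed_spans(spans, segments)
```

Spans that straddle a segment boundary, such as (90, 110) and (240, 260) here, are dropped, which is what prevents a match from mixing words of two different stories.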
Custom query type: Time-enclosed query
[Diagram: timeline from 20 s to 60 s in 5 s steps, with spans 1-5 labeled <= 20 s, <= 15 s, <= 10 s, <= 35 s, <= 25 s]
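One reading of the diagram is that a time-enclosed query keeps only spans whose words fall within a given number of seconds of each other. The sketch below follows that reading. It assumes the caption timestamps are available as sorted (token position, seconds) pairs, which is a guess at the data shape; the function names and sample values are hypothetical.

```python
from bisect import bisect_right

def time_at(position, marks):
    """Time in seconds of the last timestamp at or before a token position.

    marks: sorted (token_position, seconds) pairs; should start at position 0.
    """
    positions = [p for p, _ in marks]
    i = bisect_right(positions, position) - 1
    return marks[max(i, 0)][1]

def spans_within_seconds(spans, marks, limit):
    """Keep spans whose first and last tokens are at most `limit` s apart."""
    kept = []
    for start, end in spans:                 # end is exclusive
        duration = time_at(end - 1, marks) - time_at(start, marks)
        if duration <= limit:
            kept.append((start, end))
    return kept

# Sample timestamps every few tokens, and three candidate spans.
marks = [(0, 20), (8, 25), (15, 30), (22, 35), (30, 40)]
spans = [(2, 10), (2, 24), (16, 31)]
short_spans = spans_within_seconds(spans, marks, 10)
```

Because timestamps arrive only every 2-3 seconds, durations computed this way are approximate to within one timestamp interval.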
Custom query type: Multi-term regular expression (1)
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
  – Not just individual words
• Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us
Custom query type: Multi-term regular expression (2)
• Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    • Want to match a pattern against a phrase/sentence
    • Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
    • very slow
    • takes lots of memory
Custom query type: Multi-term regular expression (3)
• What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    • E.g. “_ _ is the _ _ _”
    • Preserve for correctness
    • Store word count for each document
    • Expand each span on both sides
    • Bounds checking
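The intended matching semantics can be shown without Lucene. The sketch below is not the translation into SpanNearQuery and RegexQuery described above; it is a brute-force stand-in that applies one regular expression per term slot over a sliding window, assuming pattern pieces are separated by spaces and "_" stands for exactly one whole word. The function names and sample sentence are invented.

```python
import re

def compile_pattern(pattern):
    """Turn a multi-term pattern into one compiled regex per term slot."""
    slots = []
    for piece in pattern.split():
        # "_" is a placeholder for exactly one whole word.
        slots.append(re.compile(r".+" if piece == "_" else piece))
    return slots

def find_matches(tokens, pattern):
    """Return (start, end) token spans (end exclusive) matching the pattern."""
    slots = compile_pattern(pattern)
    n = len(slots)
    hits = []
    # The range bound keeps every window inside the document, so leading
    # and trailing placeholders can never run past either end.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(rx.fullmatch(tok) for rx, tok in zip(slots, window)):
            hits.append((start, start + n))
    return hits

tokens = "and here is our reporter jane with the story from kabul".split()
pattern = "here is _ _ _ with the (news|story|details|report)"
hits = find_matches(tokens, pattern)
```

A scan like this touches every position of every document, which is why translating the pattern into index-backed span queries is the practical route on a large archive.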
Custom query type: Multi-term regular expression (4)
• Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
• Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help
Custom result format: Occurrence count
[Table: rows are dates (9/14/08, 9/15/08, 9/16/08, ...), columns are words (crisis, meltdown, tsunami, crash, ...); cells show “X docs, Y occurrences”]
• To fill a cell: go through every span generated by (SpanTermQuery(meltdown) filtered by date 9/15/08)
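The date-by-word table can be sketched as a simple aggregation. This is an illustrative stand-in that counts tokens in plain Python rather than iterating Lucene spans; the record shape and the sample caption texts are invented for the example.

```python
from collections import defaultdict

def occurrence_table(docs, words):
    """Map (date, word) -> [docs containing word, total occurrences]."""
    table = defaultdict(lambda: [0, 0])
    for doc in docs:
        tokens = doc["text"].lower().split()
        for word in words:
            n = tokens.count(word)
            if n:
                cell = table[(doc["date"], word)]
                cell[0] += 1      # X: one more matching document
                cell[1] += n      # Y: occurrences within it
    return dict(table)

# Made-up sample data, not taken from the archive.
docs = [
    {"date": "9/15/08", "text": "market meltdown fears grow as meltdown spreads"},
    {"date": "9/15/08", "text": "crisis talks continue"},
    {"date": "9/16/08", "text": "meltdown and crisis dominate the crisis coverage"},
]
table = occurrence_table(docs, ["crisis", "meltdown", "tsunami", "crash"])
```

Each cell pairs a document count with an occurrence count, matching the “X docs, Y occurrences” format; combinations with no hits are simply absent from the result.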
Future work: Job queue (1)
• Research front moving towards analysis of whole database
  – Want full search result set
  – Queries are intensive and take a long time
• Solution will be beyond increasing timeout
  – Users might close their browsers
  – We might restart the search back-end
Future work: Job queue (2)
• Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota
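A few of these features (priority, status checking, cancellation, failure tracking) can be sketched in a toy in-memory queue. This is purely a hypothetical illustration of the idea, since the job queue is future work: the class and method names are invented, jobs run synchronously, and nothing is persisted, whereas a real version would need background workers and state that survives a back-end restart.

```python
import heapq
import itertools

class JobQueue:
    """Toy priority job queue; lower priority number runs first."""

    def __init__(self):
        self._heap = []
        self._ids = itertools.count(1)
        self.status = {}    # job id -> "queued" | "cancelled" | "done" | "failed"
        self.results = {}   # job id -> return value of finished jobs

    def submit(self, func, priority=10):
        """Queue a zero-argument callable and return its job id."""
        job_id = next(self._ids)
        # job_id breaks ties, so the callables themselves are never compared.
        heapq.heappush(self._heap, (priority, job_id, func))
        self.status[job_id] = "queued"
        return job_id

    def cancel(self, job_id):
        """Mark a queued job as cancelled; it is skipped when popped."""
        if self.status.get(job_id) == "queued":
            self.status[job_id] = "cancelled"

    def run_next(self):
        """Run the highest-priority queued job; return its id, or None."""
        while self._heap:
            _, job_id, func = heapq.heappop(self._heap)
            if self.status[job_id] == "cancelled":
                continue
            try:
                self.results[job_id] = func()
                self.status[job_id] = "done"
            except Exception:
                self.status[job_id] = "failed"
            return job_id
        return None
```

Keeping status and results keyed by job id is what would let a user close the browser and check back later, which a longer request timeout cannot offer.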
Future work: Multiple sources and languages (1)
• Multilingual news programs
  – E.g. some have English + Spanish CC
• Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
• Multiple markets
  – E.g. Capture TV programs in Denmark and Norway
Future work: Multiple sources and languages (2)
• Need language detection
  – Libraries exist
• Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
Future work: Metadata
• Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
• Eventually: want them to be searchable
Thank you for coming!
• Any questions?
• My e-mail: [email protected]
• Slides available: http://ucla.in/IDJq2u