Presented by Kai Chan (UCLA). See the conference video at http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
The UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed-captioning search and online streaming of videos, and it allows researchers and students in various fields to study television news, images, and language usage in ways that were not possible before. In this presentation, we show the setup of our Lucene/Solr-powered search engine and how it is being used. We discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases, or patterns, grouping results by fields such as month or show, and creating interactive charts. We also discuss our work on extending Lucene’s proximity searches and creating custom query types: segment-enclosed (two or more words, phrases, or patterns occurring within a story-based text segment), time-enclosed (two or more words, phrases, or patterns occurring within a certain time), and multi-word regular expression queries. Finally, we discuss future goals, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).
Television News Search and Analysis with Lucene/Solr
Kai Chan <kai@ssc.ucla.edu>, Social Sciences Computing
University of California, Los Angeles
Lucene Revolution, May 10, 2012
Communication Studies Archive Background (1)
• Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic
Communication Studies Archive Background (2)
• Digital recording since 2005
• Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools
Communication Studies Archive Background (3)
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
Communication Studies Archive Background (4)
• January 2005 to present
  – 28 networks
  – 1,600 shows
  – 130,000 hours
  – 160,000 news programs
  – 50,000,000 images
  – 880,000,000 words
Why This is Important (1)
• Researchers
  – Large and unique collection of communication
  – Many modalities
    • Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news + communication in general in ways impossible before
Why This is Important (2)
• Non-researchers
  – TV news is about presentation and persuasion
    • Which happen in daily life also
  – TV is the main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch
Application in Research
• Communication Studies
  – Amount of coverage for events over time
• Linguistics
  – Speech and language patterns
• Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation
Application in Teaching (1)
• Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories
Application in Teaching (2)
• Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    • Amount of time given to candidate/party
    • Types of response (positive, neutral, negative)
  – Students created their own political ad
Work flow (1): Capture/conversion machines
• 2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
• Capture video and CC to separate files in real time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)
[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]
Work flow (2): Storage/static file servers
• Control server
  – Download TV schedules
  – Download web-streamed news programs
  – Collect and check recordings
  – Push files to their destinations
• Video streaming server
• Backup storage server
• Image server
Work flow (3): Search server
• Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
• Hosts search engine
The search process
[Diagram. Components: user; search server with a front end (web server, custom PHP code), a bridge (PHP-Java Bridge or Solr), and a back end (custom Java code, Lucene, Lucene index, MySQL database); video server (video files, Wowza video streaming server); image server (Apache web server, thumbnails and montages). User actions: perform searches, watch videos, retrieve thumbnails and montages.]
Custom query type: Segment-enclosed query (1)
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number
Custom query type: Segment-enclosed query (2)
• Problem 2: the matched search words might not all be talking about the same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan
Custom query type: Segment-enclosed query (3)
• A news program can contain several stories
  – E.g. local, national, world, weather, sports
Custom query type: Segment-enclosed query (4)
[Diagram: one news program laid out as a sequence of segments: local story 1, local story 2, commercials, national story 1, commercials, national story 2, health, entertainment, commercials, world story 1, world story 2, weather 1, weather 2, sports]
Custom query type: Segment-enclosed query (5)
• One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
  – E.g. politics
Custom query type: Segment-enclosed query (6)
• How to mark segments
  – Automated
    • Computer Science researchers working on them
    • Word frequency
    • Scene change
    • Black frame and silence
  – Manual segmentation
    • Watch the video
    • Decide where a story starts and ends
    • Mark positions in semi-automated system
Custom query type: Segment-enclosed query (7)
[Diagram: three segments (seg. 1, seg. 2, seg. 3), each marked by a begin and an end position, with five candidate spans (span 1 to span 5) laid over them]
Custom query type: Segment-enclosed query (8)
• Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
• In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
• Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
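The filtering idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not the archive's production code: it assumes spans and segments are both given as (start, end) token positions with an exclusive end, and that the segment list is sorted and non-overlapping. The function name `spans_within_segments` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) token positions, end exclusive (assumed)
Segment = Tuple[int, int]   # (start, end) token positions, end exclusive (assumed)


def spans_within_segments(spans: List[Span], segments: List[Segment]) -> List[Span]:
    """Keep only the spans that lie fully inside a single segment.

    Assumes `segments` is sorted by start position and non-overlapping.
    """
    starts = [start for start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Index of the last segment that starts at or before the span start.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


# Three story segments and five candidate spans (made-up positions).
segments = [(0, 100), (100, 250), (250, 400)]
spans = [(10, 20), (95, 105), (120, 130), (240, 260), (300, 310)]
print(spans_within_segments(spans, segments))
```

Spans that straddle a segment boundary, such as (95, 105), are dropped. The binary search keeps the cost per span logarithmic in the number of segments, which matters when a proximity query yields many spans per program.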
Custom query type: Time-enclosed query
[Diagram: timeline marked every 5 s from 20 s to 60 s, with five spans (span 1 to span 5) labeled with durations of <= 20 s, <= 15 s, <= 10 s, <= 35 s, and <= 25 s]
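A time-enclosed filter can be sketched the same way as the segment filter, by converting token positions to times. The representation here is an assumption made for illustration: timestamp markers are (token position, seconds) pairs sorted by position, and a token takes the time of the latest marker at or before it. The names `time_at` and `spans_within_time` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]        # (start, end) token positions, end exclusive (assumed)
Marker = Tuple[int, float]    # (token position, seconds into the program) (assumed)


def time_at(position: int, marker_positions: List[int], marker_times: List[float]) -> float:
    """Time of the latest marker at or before `position` (first marker if none)."""
    i = bisect_right(marker_positions, position) - 1
    return marker_times[max(i, 0)]


def spans_within_time(spans: List[Span], markers: List[Marker], max_seconds: float) -> List[Span]:
    """Keep spans whose first and last tokens are at most `max_seconds` apart.

    Assumes `markers` is non-empty and sorted by token position.
    """
    positions = [p for p, _ in markers]
    times = [t for _, t in markers]
    kept = []
    for start, end in spans:
        # `end` is exclusive, so the last token of the span is at end - 1.
        duration = time_at(end - 1, positions, times) - time_at(start, positions, times)
        if duration <= max_seconds:
            kept.append((start, end))
    return kept


# Made-up markers: a timestamp every few tokens, 5 s apart.
markers = [(0, 20.0), (8, 25.0), (15, 30.0), (22, 35.0), (30, 40.0)]
spans = [(2, 10), (5, 25), (16, 31)]
print(spans_within_time(spans, markers, max_seconds=10))
```

Because timestamps arrive only every few seconds, the computed duration is accurate to the marker spacing, not to the individual word.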
Custom query type: Multi-term regular expression (1)
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
  – Not just individual words
• Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us
Custom query type: Multi-term regular expression (2)
• Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    • Want to match a pattern against a phrase/sentence
    • Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, all documents, and all positions in the index
    • Very slow
    • Takes lots of memory
Custom query type: Multi-term regular expression (3)
• What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    • E.g. “_ _ is the _ _ _”
    • Preserve for correctness
    • Store word count for each document
    • Expand each span on both sides
    • Bounds checking
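The translation step can be illustrated without Lucene. The sketch below is an assumption about how such a parser might look, not the presenter's implementation: it splits a pattern on whitespace, treats `_` as a whole-word placeholder, groups the remaining terms into runs with exact gap sizes between them, and records leading and trailing placeholder counts. `ParsedPattern`, `parse_pattern`, and `expand_span` are hypothetical names; in a real system each run would become a span query over per-term regular expressions.

```python
from typing import List, NamedTuple, Optional, Tuple

PLACEHOLDER = "_"  # stands for exactly one whole word


class ParsedPattern(NamedTuple):
    leading: int              # placeholders before the first real term
    chunks: List[List[str]]   # runs of consecutive real terms (each may be a regex)
    gaps: List[int]           # placeholders between chunk i and chunk i + 1
    trailing: int             # placeholders after the last real term


def parse_pattern(pattern: str) -> ParsedPattern:
    tokens = pattern.split()
    real = [i for i, tok in enumerate(tokens) if tok != PLACEHOLDER]
    if not real:
        raise ValueError("pattern needs at least one non-placeholder term")
    chunks = [[tokens[real[0]]]]
    gaps: List[int] = []
    for prev, cur in zip(real, real[1:]):
        if cur == prev + 1:
            chunks[-1].append(tokens[cur])      # same run of adjacent terms
        else:
            gaps.append(cur - prev - 1)         # exact number of words in between
            chunks.append([tokens[cur]])
    return ParsedPattern(real[0], chunks, gaps, len(tokens) - 1 - real[-1])


def expand_span(span: Tuple[int, int], parsed: ParsedPattern,
                doc_length: int) -> Optional[Tuple[int, int]]:
    """Widen a matched span (end exclusive) by the leading/trailing placeholders.

    Returns None when the document is too short on either side (bounds check).
    """
    start = span[0] - parsed.leading
    end = span[1] + parsed.trailing
    if start < 0 or end > doc_length:
        return None
    return (start, end)


print(parse_pattern("here is _ _ _ with the (news|story|details|report)"))
edge = parse_pattern("_ _ is the _ _ _")
print(expand_span((10, 12), edge, doc_length=100))
```

The bounds check is why a per-document word count has to be stored: a match for “is the” at the very start of a document cannot satisfy two leading placeholders, so that span must be rejected rather than clipped.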
Custom query type: Multi-term regular expression (4)
• Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
• Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help
Custom result format: Occurrence count
[Table: rows are dates (9/14/08, 9/15/08, 9/16/08, ...), columns are words (crisis, meltdown, tsunami, crash, ...); each cell shows “X docs, Y occurrences”]
• To fill a cell, e.g. meltdown on 9/15/08: go through every span generated by SpanTermQuery(meltdown), filtered by date 9/15/08
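The counting itself is a simple aggregation once the spans are available. The sketch below assumes each matching span has already been reduced to a (date, word, document id) triple; that input shape and the name `count_occurrences` are illustrative assumptions, and the sample hits are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, int]   # (date, word, document id), one per matching span (assumed)


def count_occurrences(hits: Iterable[Hit]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map each (date, word) cell to (number of documents, number of occurrences)."""
    occurrences: Dict[Tuple[str, str], int] = defaultdict(int)
    docs: Dict[Tuple[str, str], set] = defaultdict(set)
    for date, word, doc_id in hits:
        key = (date, word)
        occurrences[key] += 1     # every span counts as one occurrence
        docs[key].add(doc_id)     # a document counts once per cell
    return {key: (len(docs[key]), occurrences[key]) for key in occurrences}


# Made-up hits: "meltdown" appears twice in document 1 and once in document 2.
hits = [
    ("9/15/08", "meltdown", 1),
    ("9/15/08", "meltdown", 1),
    ("9/15/08", "meltdown", 2),
    ("9/15/08", "crisis", 1),
    ("9/16/08", "crisis", 3),
]
for (date, word), (n_docs, n_occ) in sorted(count_occurrences(hits).items()):
    print(f"{date} {word}: {n_docs} docs, {n_occ} occurrences")
```

Counting spans rather than documents is what separates “Y occurrences” from “X docs”: a single program that repeats a word many times contributes once to X but many times to Y.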
Future work: Job queue (1)
• Research front moving towards analysis of the whole database
  – Want full search result set
  – Queries are intensive and take a long time
• Solution has to go beyond increasing the timeout
  – Users might close their browsers
  – We might restart the search back end
Future work: Job queue (2)
• Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota
Future work: Multiple sources and languages (1)
• Multilingual news programs
  – E.g. some have English + Spanish CC
• Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
• Multiple markets
  – E.g. capture TV programs in Denmark and Norway
Future work: Multiple sources and languages (2)
• Need language detection
  – Libraries exist
• Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
Future work: Metadata
• Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
• Eventually: want them to be searchable
Thank you for coming!
• Any questions?
• My e-mail: kai@ssc.ucla.edu
• Slides available: http://ucla.in/IDJq2u