
Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>, Social Sciences Computing

University of California, Los Angeles

Lucene Revolution, May 10, 2012

Communication Studies Archive Background (1)

•  Continuation of analog recording of TV news
   – Thousands of tapes since Watergate/1970s
   – Hard to look for a particular news program or topic



•  Digital recording since 2005
•  Capture news programs on computers
   – Video: can be streamed over the Web
   – Closed captioning (“subtitle text”): indexed and searchable
   – Image snapshots
   – Search engine and analysis tools



•  Also download transcripts and web-streamed news programs

•  100 news programs and 600,000 words added each day



•  January 2005 to present
   – 28 networks
   – 1,600 shows
   – 130,000 hours
   – 160,000 news programs
   – 50,000,000 images
   – 880,000,000 words


Why This is Important (1)

•  Researchers
   – Large and unique collection of communication
   – Many modalities
      • Speech, facial expression, body gesture, etc.
   – Different conditions/settings
   – Different networks and communities
   – Allows study of TV news + communication in general in ways impossible before


Why This is Important (2)

•  Non-researchers
   – TV news is about presentation and persuasion
      • Which also happen in daily life
   – TV is the main source of news for many/most
   – Greatly affects the public’s decisions
   – Learn about what we watch


Application in Research

•  Communication Studies
   – Amount of coverage for events over time
•  Linguistic
   – Speech and language patterns
•  Computer Science
   – Object identification
   – Identify news anchors, public figures
   – Story segmentation


Application in Teaching (1)

•  Chicano Studies: Representations of Latinos on the Television News
   – May 1, 2007 immigration march
   – MacArthur Park, Los Angeles, CA
   – 2 days (May 1 & 2, 2007)
   – Framing, stereotyping, metaphor, silencing
   – Reports with screenshots and links to news stories


Application in Teaching (2)

•  Communication Studies: Presidential Communication
   – 2008 presidential primary
   – 6 weeks (Dec 2007 to Feb 2008)
   – Coverage of sound bites
      • Amount of time given to candidate/party
      • Types of response (positive, neutral, negative)
   – Students created their own political ad.


Work flow (1) Capture/conversion machines

•  2 groups, 2 machines per group
   – Keep the best recording
   – 6 TV tuners per machine
•  Capture video and CC to separate files in real-time
   – MPEG-TS (~7 GB/hr)
   – Timestamp every 2-3 seconds
•  Generate image snapshots
•  Convert videos
   – MP4/H.264 (VGA, ~240 MB/hr)

[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]

Work flow (2) Storage/static file servers

•  Control server
   – Download TV schedules
   – Download web-streamed news programs
   – Collect and check recordings
   – Push files to the other servers
•  Video streaming server
•  Backup storage server
•  Image server





Work flow (3) Search server

•  Lucene index updated daily
   – Main text field tokenized
   – Separate fields for date, network, show, etc.
   – Binary fields for segment and time data
•  Hosts search engine





The search process

[Diagram: the user performs searches on the search server, watches videos from the video streaming server (Wowza, video files), and retrieves thumbnails and montages from the image server (Apache web server). On the search server, the front end (web server, custom PHP code) reaches the back end (custom Java code, Lucene, Lucene index, MySQL database) through a bridge (PHP-Java Bridge or Solr).]

Custom query type Segment-enclosed query (1)

•  Problem 1: search for “X near Z”
•  Lucene: search for “X within Y words of Z”
   – How to pick Y?
   – Hard to pick a fixed number



•  Problem 2: the matched search words might not all come from the same story
   – E.g. “Obama AND visit AND Afghanistan”
   – Might match a news program about Obama’s visit to Canada + violence in Afghanistan



•  A news program can contain several stories
   – E.g. Local, national, world, weather, sports



[Diagram: one news program divided into segments: local stories 1 and 2, national stories 1 and 2, health, entertainment, world stories 1 and 2, weather 1 and 2, and sports, with commercials in between]


•  One solution: search for “X and Z within same story segment”
   – Possible with Lucene + story segment info
•  Bonus: enables searching/filtering for a particular story type
   – E.g. Politics



•  How to mark segments
   – Automated
      • Computer Science researchers working on them
      • Word frequency
      • Scene change
      • Black frame and silence
   – Manual segmentation
      • Watch the video
      • Decide where a story starts and ends
      • Mark positions in semi-automated system



[Diagram: three segments (seg. 1 to seg. 3, each with a begin and an end marker) overlaid with spans 1 to 5]


•  Idea
   – Get spans from SpanNearQuery
   – Filter and keep those fully within segments
•  In production: segment info in stored fields
   – As a list of <start position, end position>
   – Simple to implement
   – Reasonably fast searching
•  Alternative: store segment info as terms
   – Possible to find segments by themselves
   – Appears to run much faster
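
The filtering idea can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the production code (which is Java on top of Lucene): the function name `spans_within_segments`, the tuple layout, and the sample positions are all assumptions made for the example. Segments are assumed to be sorted and non-overlapping, as a list of <start position, end position> pairs would be.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) token positions, end exclusive
Segment = Tuple[int, int]   # (start, end) token positions, end exclusive


def spans_within_segments(spans: List[Span], segments: List[Segment]) -> List[Span]:
    """Keep only the spans that lie fully inside a single segment.

    Assumes segments are sorted by start position and do not overlap.
    """
    starts = [s for s, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Find the last segment that starts at or before the span's start.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


# Hypothetical data: three story segments and five candidate spans.
segments = [(0, 40), (55, 120), (130, 200)]
spans = [(5, 12), (35, 60), (60, 70), (118, 131), (150, 199)]
print(spans_within_segments(spans, segments))
```

Spans that cross a segment boundary, or that start in a gap between segments (such as commercials), are dropped; only spans whose terms all fall in one story survive.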


Custom query type Time-enclosed query

[Diagram: timeline from 20 s to 60 s in 5 s steps, with spans 1 to 5 annotated with duration bounds (<= 20 s, <= 15 s, <= 10 s, <= 35 s, <= 25 s)]
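
One way to read a time-enclosed query is as the time-based counterpart of the segment-enclosed query: keep only the spans whose first and last words are at most a given number of seconds apart. The sketch below is plain Python under that assumption. The names `time_at` and `spans_within_seconds`, the (position, seconds) timestamp layout, and the sample values are hypothetical; the only detail taken from the slides is that the captions carry a timestamp every few seconds.

```python
from bisect import bisect_right
from typing import List, Tuple

# (token position, seconds from start of program), sorted by position.
Timestamp = Tuple[int, float]


def time_at(position: int, timestamps: List[Timestamp]) -> float:
    """Return the time of the latest timestamp at or before a token position."""
    positions = [p for p, _ in timestamps]
    i = bisect_right(positions, position) - 1
    return timestamps[max(i, 0)][1]


def spans_within_seconds(spans: List[Tuple[int, int]],
                         timestamps: List[Timestamp],
                         max_seconds: float) -> List[Tuple[int, int]]:
    """Keep spans whose first and last tokens are at most max_seconds apart."""
    kept = []
    for start, end in spans:  # end is exclusive
        duration = time_at(end - 1, timestamps) - time_at(start, timestamps)
        if duration <= max_seconds:
            kept.append((start, end))
    return kept


# Hypothetical data: one timestamp roughly every 7 or 8 tokens.
timestamps = [(0, 20.0), (8, 25.0), (15, 30.0), (23, 35.0), (30, 40.0)]
spans = [(2, 10), (2, 25), (16, 31)]
print(spans_within_seconds(spans, timestamps, 10))
```

Because timestamps are sparse, the duration is only accurate to the timestamp interval, so a limit much shorter than that interval would not be meaningful.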


Custom query type Multi-term regular expression (1)

•  “here is _ _ _ with the (news|story|details|report)”

•  Apply RegEx to a phrase or sentence – Not just individual words

•  Lucene core has regular expression query support
   – Good starting point
   – Not a complete solution for us



•  Problems
   – Some analyzers do not work with RegEx
   – Lucene’s RegEx query classes only apply RegEx to individual terms
      • Want to match a pattern against a phrase/sentence
      • Want placeholders for whole words (not just characters)
   – Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
      • very slow
      • takes lots of memory



•  What we did
   – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
      • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
   – Leading and trailing placeholders
      • E.g. “_ _ is the _ _ _”
      • Preserve for correctness
      • Store word count for each document
      • Expand each span on both sides
      • Bounds checking
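
The matching behavior described above can be imitated in plain Python over a list of words. This is a sketch of the semantics only, not of the Lucene query translation: `find_matches` is a hypothetical name, "_" stands for one whole word, every other pattern token is treated as a regular expression for a single term (so it cannot contain spaces), and the words are assumed to be already lowercased by the analyzer. Leading and trailing placeholders are stripped, the core is matched, and each hit is then expanded on both sides with a bounds check against the document's word count.

```python
import re
from typing import List, Tuple


def find_matches(pattern: str, words: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) word positions (end exclusive) matching the pattern.

    Each whitespace-separated pattern token is either "_" (any one word) or a
    regular expression that must match one whole word.
    """
    tokens = pattern.split()
    # Count leading and trailing placeholders, then match the core between them.
    lead = 0
    while lead < len(tokens) and tokens[lead] == "_":
        lead += 1
    trail = 0
    while trail < len(tokens) - lead and tokens[-1 - trail] == "_":
        trail += 1
    core = [None if t == "_" else re.compile(t)
            for t in tokens[lead:len(tokens) - trail]]

    matches = []
    for i in range(len(words) - len(core) + 1):
        window = words[i:i + len(core)]
        if all(rx is None or rx.fullmatch(w) for rx, w in zip(core, window)):
            # Expand the span on both sides for the stripped placeholders and
            # drop it if it would run past either end of the document.
            start, end = i - lead, i + len(core) + trail
            if start >= 0 and end <= len(words):
                matches.append((start, end))
    return matches


words = "and here is our correspondent jane with the details from london".split()
print(find_matches("here is _ _ _ with the (news|story|details|report)", words))
print(find_matches("_ _ is the _ _ _", "this is the end".split()))
```

The second call returns no match: "is the" is found, but the document is too short to supply two words before it and three after it, which is the case the bounds check exists for.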



•  Regular expression libraries differ in
   – Syntax (e.g. Perl 5-compatible)
   – Capabilities (e.g. back-references)
   – Speed
•  Memory usage
   – Proportional to number of terms matched
   – Increasing available memory might help


Custom result format Occurrence count

[Table: rows are dates (9/14/08, 9/15/08, 9/16/08, ...); columns are words (crisis, meltdown, tsunami, crash); each cell reads “X docs, Y occurrences”]

•  To fill a cell: go through every span generated by (SpanTermQuery(meltdown) filtered by date 9/15/08)
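
The counting behind such a date-by-word table can be sketched in plain Python. This stands in for iterating over Lucene spans and is not the production code: `occurrence_table`, the (date, words) document shape, and the sample sentences are assumptions made for the example. Each occurrence of a term plays the role of one span.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical document shape: (air date, list of lowercased words).
Document = Tuple[str, List[str]]


def occurrence_table(docs: Iterable[Document],
                     terms: List[str]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map (date, term) to (number of documents, number of occurrences)."""
    doc_counts = defaultdict(int)
    occ_counts = defaultdict(int)
    for date, words in docs:
        for term in terms:
            n = words.count(term)  # one "span" per occurrence of the term
            if n:
                doc_counts[(date, term)] += 1
                occ_counts[(date, term)] += n
    return {key: (doc_counts[key], occ_counts[key]) for key in doc_counts}


docs = [
    ("2008-09-15", "the meltdown deepened as the crisis spread".split()),
    ("2008-09-15", "markets crash after meltdown fears meltdown".split()),
    ("2008-09-16", "crisis talks continue".split()),
]
table = occurrence_table(docs, ["crisis", "meltdown", "tsunami", "crash"])
for (date, term), (n_docs, n_occ) in sorted(table.items()):
    print(date, term, f"{n_docs} docs, {n_occ} occurrences")
```

Cells with no occurrences are simply absent from the result, so a caller rendering the full grid would fill them in as "0 docs, 0 occurrences".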


Future work Job queue (1)

•  Research front moving towards analysis of whole database
   – Want full search result set
   – Queries are intensive and take a long time
•  Solution will be beyond increasing timeout
   – Users might close their browsers
   – We might restart the search back-end


Future work Job queue (2)

•  Features
   – Query runs in background
   – Notification when finished/failed
   – Restart queries with recoverable errors
   – Check and cancel jobs
   – Downloadable result
   – Schedule recurring queries
   – Manage job priority and quota


Future work Multiple sources and languages (1)

•  Multilingual news programs
   – E.g. some have English + Spanish CC
•  Multiple text and timestamp sources
   – E.g. CNN transcript available from website
   – Applying speech-to-text to videos
   – Manual correction of text and timestamps
•  Multiple markets
   – E.g. Capture TV programs in Denmark and Norway


Future work Multiple sources and languages (2)

•  Need language detection
   – Libraries exist
•  Search for specific channel
   – Search by language more useful
   – But no fixed channel -> language mapping
•  What will proximity search and occurrence counting mean when there are multiple channels/languages?


Future work Metadata

•  Types of metadata
   – Segment boundary, type and topic
   – Headline and description (from transcripts)
   – Website links
   – Syntactic tags (e.g. part of speech)
   – Generated annotation (e.g. object identification)
   – User annotation (e.g. scene description)
   – Screen text
•  Eventually: want them to be searchable


Thank you for coming!

•  Any questions?
•  My e-mail: [email protected]
•  Slides available: http://ucla.in/IDJq2u
