Television News Search and Analysis with Lucene/Solr. Kai Chan <[email protected]>, Social Sciences Computing, University of California, Los Angeles. Lucene Revolution, May 10, 2012


DESCRIPTION

Presented by Kai Chan | UCLA - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed captioning search and online streaming of videos. The search engine allows researchers and students in various fields to study television news, images and language usage in ways that were not possible before. In this presentation, we will show the setup of our Lucene/Solr-powered search engine, as well as how it is being used. We will discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases or patterns, grouping results by fields such as month or show, and creating interactive charts. We will also discuss our work on extending Lucene’s proximity searches and creating custom query types, such as segment-enclosed (two or more words, phrases or patterns occurring within a story-based text segment), time-enclosed (two or more words, phrases or patterns occurring within a certain time), and multi-word regular expression queries. Future goals will also be discussed, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).


Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>
Social Sciences Computing
University of California, Los Angeles

Lucene Revolution, May 10, 2012

Communication Studies Archive Background (1)

• Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic

Communication Studies Archive Background (2)

• Digital recording since 2005
• Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools

Communication Studies Archive Background (3)

• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day

Communication Studies Archive Background (4)

• January 2005 to present
  – 28 networks
  – 1,600 shows
  – 130,000 hours
  – 160,000 news programs
  – 50,000,000 images
  – 880,000,000 words

Why This is Important (1)

• Researchers
  – Large and unique collection of communication
  – Many modalities
    • Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news + communication in general in ways impossible before

Why This is Important (2)

• Non-researchers
  – TV news is about presentation and persuasion
    • Which also happen in daily life
  – TV is the main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch


Application in Research

• Communication Studies
  – Amount of coverage for events over time
• Linguistics
  – Speech and language patterns
• Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation

Application in Teaching (1)

• Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories

Application in Teaching (2)

• Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    • Amount of time given to candidate/party
    • Types of response (positive, neutral, negative)
  – Students created their own political ad

Work flow (1): Capture/conversion machines

• 2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
• Capture video and CC to separate files in real time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)

[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]
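
The two bitrates above make the case for conversion. The following is a rough back-of-the-envelope sketch, not a figure from the talk: it assumes decimal units (1 GB = 1000 MB) and assumes that all 130,000 hours quoted on the background slide are kept as MP4, which the slides do not state.

```python
# Rough sizing from the figures quoted on the slides (decimal units assumed).
MPEG_TS_GB_PER_HOUR = 7.0   # capture format, "~7 GB/hr"
MP4_MB_PER_HOUR = 240.0     # converted format, "~240 MB/hr"
ARCHIVE_HOURS = 130_000     # collection size from the background slide


def compression_ratio() -> float:
    """How many times smaller one hour of MP4 is than one hour of MPEG-TS."""
    return MPEG_TS_GB_PER_HOUR * 1000 / MP4_MB_PER_HOUR


def archive_size_tb(mb_per_hour: float, hours: int) -> float:
    """Total size in terabytes for `hours` of video at `mb_per_hour`."""
    return mb_per_hour * hours / 1_000_000


print(f"MP4 is about {compression_ratio():.0f}x smaller than MPEG-TS")
print(f"MP4 archive: about {archive_size_tb(MP4_MB_PER_HOUR, ARCHIVE_HOURS):.1f} TB")
```

Under these assumptions the converted archive is on the order of tens of terabytes, while the same hours in MPEG-TS would be roughly 29 times larger.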

Work flow (2): Storage/static file servers

• Control server
  – Download TV schedules
  – Download web-streamed news programs
  – Collect and check recordings
  – Push files to their destinations
• Video streaming server
• Backup storage server
• Image server

Work flow (3): Search server

• Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
• Hosts search engine
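
As an illustration of the field layout above, here is a minimal sketch in Python rather than Lucene code. The class name `ProgramDoc`, the helper names, the example values, and the choice of 4-byte big-endian integers for the binary fields are all assumptions made for the example; the slides do not describe the actual binary encoding.

```python
import struct
from dataclasses import dataclass


def pack_pairs(pairs):
    """Pack (a, b) integer pairs into bytes, 4 bytes per integer."""
    flat = [n for pair in pairs for n in pair]
    return struct.pack(f">{len(flat)}I", *flat)


def unpack_pairs(blob):
    """Inverse of pack_pairs: bytes back to a list of (a, b) pairs."""
    flat = struct.unpack(f">{len(blob) // 4}I", blob)
    return list(zip(flat[0::2], flat[1::2]))


@dataclass
class ProgramDoc:
    text: str          # closed-captioning text; tokenized at index time
    date: str          # separate field, e.g. "2008-09-15"
    network: str       # separate field
    show: str          # separate field
    segments: bytes    # packed (start position, end position) per story
    timestamps: bytes  # packed (token position, seconds from start)


doc = ProgramDoc(
    text="good evening here is the news",
    date="2008-09-15",
    network="EXAMPLE-NET",           # made-up value
    show="Example Evening News",     # made-up value
    segments=pack_pairs([(0, 2), (2, 6)]),
    timestamps=pack_pairs([(0, 0), (2, 3), (4, 6)]),
)
print(unpack_pairs(doc.segments))
```

Keeping segment and time data as position pairs next to the text is what lets a query map a match position back to a story or to a point in the video.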

The search process

[Diagram, components as labeled:]
• User: performs searches, watches videos, retrieves thumbnails and montages
• Search server
  – Front end: web server, custom code (PHP)
  – Bridge: PHP-Java Bridge or Solr
  – Back end: custom code (Java), Lucene, Lucene index, MySQL database
• Video server: video files, video streaming server (Wowza)
• Image server: web server (Apache), thumbnails and montages

Custom query type: Segment-enclosed query (1)

• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number

Custom query type: Segment-enclosed query (2)

• Problem 2: all matched search words might not be talking about the same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan

Custom query type: Segment-enclosed query (3)

• A news program can contain several stories
  – E.g. local, national, world, weather, sports

Custom query type: Segment-enclosed query (4)

[Diagram: one news program drawn as a sequence of blocks labeled local story 1, local story 2, commercials, national story 1, commercials, national story 2, health, entertainment, commercials, world story 1, world story 2, weather 1, weather 2, sports]

Custom query type: Segment-enclosed query (5)

• One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
  – E.g. politics

Custom query type: Segment-enclosed query (6)

• How to mark segments
  – Automated
    • Computer Science researchers working on them
    • Word frequency
    • Scene change
    • Black frame and silence
  – Manual segmentation
    • Watch the video
    • Decide where a story starts and ends
    • Mark positions in semi-automated system

Custom query type: Segment-enclosed query (7)

[Diagram: a text stream divided into segments 1, 2 and 3, each with a begin and an end marker, with spans 1 to 5 drawn against the segment boundaries]

Custom query type: Segment-enclosed query (8)

• Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
• In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
• Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
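
The filtering idea can be sketched in a few lines. This is an illustration in plain Python, not the production Java code: spans and segments are assumed to be (start, end) token positions with an exclusive end, segments are assumed sorted and non-overlapping, and the function name and sample numbers are made up.

```python
from bisect import bisect_right


def spans_within_segments(spans, segments):
    """Keep spans that lie entirely inside one segment.

    spans:    (start, end) token positions, end exclusive, as a proximity
              query would report them.
    segments: sorted, non-overlapping (start, end) token positions,
              end exclusive.
    """
    starts = [s for s, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Index of the last segment that begins at or before the span.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


segments = [(0, 100), (100, 250), (300, 400)]
spans = [(10, 20), (90, 110), (240, 250), (260, 270), (350, 420)]
print(spans_within_segments(spans, segments))  # [(10, 20), (240, 250)]
```

Of the five sample spans, one crosses a segment boundary, one falls in a gap between segments, and one runs past the end of the last segment, so only two survive the filter.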

Custom query type: Time-enclosed query

[Diagram: a time axis from 20 s to 60 s in 5 s steps, with spans 1 to 5 drawn against it and duration limits labeled <= 20 s, <= 15 s, <= 10 s, <= 35 s and <= 25 s]
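
A time-enclosed filter can be sketched the same way as the segment filter, swapping segment boundaries for timestamps. This is an assumed reading of the diagram, not the talk's implementation: timestamps are taken to be sorted (token position, seconds) pairs, and the function names and sample values are hypothetical.

```python
from bisect import bisect_right


def time_at(position, timestamps):
    """Seconds of the latest timestamp at or before a token position.

    timestamps: sorted (token position, seconds) pairs; the first pair is
    assumed to sit at position 0.
    """
    positions = [p for p, _ in timestamps]
    i = bisect_right(positions, position) - 1
    return timestamps[max(i, 0)][1]


def spans_within_time(spans, timestamps, max_seconds):
    """Keep spans whose first and last tokens are at most max_seconds apart."""
    kept = []
    for start, end in spans:  # end is exclusive
        duration = time_at(end - 1, timestamps) - time_at(start, timestamps)
        if duration <= max_seconds:
            kept.append((start, end))
    return kept


timestamps = [(0, 20), (8, 25), (15, 30), (22, 35), (30, 40), (37, 45)]
spans = [(2, 10), (5, 32), (16, 24), (0, 38)]
print(spans_within_time(spans, timestamps, max_seconds=10))  # [(2, 10), (16, 24)]
```

Because the captions carry a timestamp only every 2-3 seconds, a duration computed this way is accurate only to within that granularity.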

Custom query type: Multi-term regular expression (1)

• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
  – Not just individual words
• Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us

Custom query type: Multi-term regular expression (2)

• Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    • Want to match a pattern against a phrase/sentence
    • Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, all documents, and all positions in the index
    • Very slow
    • Takes lots of memory

Custom query type: Multi-term regular expression (3)

• What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    • E.g. “_ _ is the _ _ _”
    • Preserve for correctness
    • Store word count for each document
    • Expand each span on both sides
    • Bounds checking
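
The handling of placeholders can be illustrated without Lucene. The sketch below swaps in a brute-force scan over a list of words for the SpanNearQuery/RegexQuery translation described above, so it shows the semantics only, not the performance. It assumes terms are separated by spaces (so a per-word regular expression cannot itself contain a space); the function names and the sample sentence are made up.

```python
import re


def parse(pattern):
    """Split a multi-term pattern into (leading, core, trailing).

    "_" stands for any one whole word; every other term is a regular
    expression that must match a single word in full.
    """
    terms = pattern.split()
    lead = 0
    while lead < len(terms) and terms[lead] == "_":
        lead += 1
    trail = 0
    while trail < len(terms) - lead and terms[len(terms) - 1 - trail] == "_":
        trail += 1
    return lead, terms[lead:len(terms) - trail], trail


def find(pattern, words):
    """Return (start, end) word positions, end exclusive, of every match."""
    lead, core, trail = parse(pattern)
    if not core:
        raise ValueError("pattern needs at least one non-placeholder term")
    regexes = [None if t == "_" else re.compile(t) for t in core]
    matches = []
    for i in range(len(words) - len(core) + 1):
        window = words[i:i + len(core)]
        if all(r is None or r.fullmatch(w) for r, w in zip(regexes, window)):
            # Expand the core match on both sides for the outer placeholders.
            start, end = i - lead, i + len(core) + trail
            # Bounds check against the document's word count.
            if start >= 0 and end <= len(words):
                matches.append((start, end))
    return matches


words = "and here is our john smith with the report tonight".split()
print(find("here is _ _ _ with the (news|story|details|report)", words))  # [(1, 9)]
print(find("_ _ is our _ _ _", words))  # [(0, 7)]
```

Stripping the outer placeholders before matching avoids the match-everything term query, and the stored word count is what makes the final bounds check possible.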

Custom query type: Multi-term regular expression (4)

• Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
• Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help

Custom result format: Occurrence count

[Table: one row per date (9/14/08, 9/15/08, 9/16/08, ...), one column per word (crisis, meltdown, tsunami, crash); each cell shows “X docs, Y occurrences”]

• For each cell: go through every span generated by, e.g., SpanTermQuery(meltdown) filtered by date 9/15/08
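
The table can be sketched as a nested count. This is a plain-Python illustration on toy data, not the Lucene span iteration itself: each token hit stands in for one span, and the function name, input shape, and sample sentences are made up for the example.

```python
from collections import defaultdict


def occurrence_table(docs, words):
    """Count matching documents and occurrences per (date, word).

    docs: iterable of (date, list of tokens).
    Returns {(date, word): [doc_count, occurrence_count]}.
    """
    table = defaultdict(lambda: [0, 0])
    wanted = set(words)
    for date, tokens in docs:
        per_doc = defaultdict(int)
        for token in tokens:  # each hit plays the role of one span
            if token in wanted:
                per_doc[token] += 1
        for word, n in per_doc.items():
            table[(date, word)][0] += 1  # this document matched
            table[(date, word)][1] += n  # all of its occurrences
    return dict(table)


docs = [
    ("2008-09-15", "the meltdown became a crisis a real crisis".split()),
    ("2008-09-15", "markets crash as crisis spreads".split()),
    ("2008-09-16", "after the crash".split()),
]
table = occurrence_table(docs, ["crisis", "meltdown", "tsunami", "crash"])
print(table[("2008-09-15", "crisis")])  # [2, 3]
```

Cells with no hits are simply absent from the result, which a front end could render as “0 docs, 0 occurrences”.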

Future work: Job queue (1)

• Research front moving towards analysis of whole database
  – Want full search result set
  – Queries are intensive and take a long time
• Solution must go beyond increasing the timeout
  – Users might close their browsers
  – We might restart the search back end

Future work: Job queue (2)

• Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota
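
A few of these features can be sketched with Python's standard `concurrent.futures` module. Since the job queue is future work in the talk, everything here is hypothetical: the `JobQueue` and `RecoverableError` names, the retry limit, and the status strings are assumptions. The sketch also keeps jobs in memory, so unlike the system described above it would not survive a back-end restart; scheduling, priorities, and quotas are left out.

```python
from concurrent.futures import ThreadPoolExecutor


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying."""


def run_with_retry(query_fn, max_attempts=3):
    """Run query_fn, retrying only when it raises RecoverableError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query_fn()
        except RecoverableError:
            if attempt == max_attempts:
                raise


class JobQueue:
    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._next_id = 1

    def submit(self, query_fn, notify):
        """Start a query in the background; notify(job_id, status) runs when it ends."""
        job_id = self._next_id
        self._next_id += 1
        future = self._pool.submit(run_with_retry, query_fn)
        self._jobs[job_id] = future
        future.add_done_callback(lambda f: notify(job_id, self.status(job_id)))
        return job_id

    def status(self, job_id):
        future = self._jobs[job_id]
        if future.cancelled():
            return "cancelled"
        if not future.done():
            return "running or queued"
        return "failed" if future.exception() else "finished"

    def cancel(self, job_id):
        """Only jobs that have not started yet can be cancelled this way."""
        return self._jobs[job_id].cancel()

    def result(self, job_id):
        return self._jobs[job_id].result()  # blocks until the job ends

    def shutdown(self):
        self._pool.shutdown(wait=True)


events = []
attempts = {"n": 0}


def flaky_query():
    """Stand-in for a long search that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RecoverableError("temporary failure")
    return 42


queue = JobQueue()
job = queue.submit(flaky_query, lambda job_id, status: events.append((job_id, status)))
answer = queue.result(job)
queue.shutdown()
print(answer, events)
```

A persistent store for job state and results would be the natural next step, so that closing the browser or restarting the back end does not lose a running analysis.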

Future work: Multiple sources and languages (1)

• Multilingual news programs
  – E.g. some have English + Spanish CC
• Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
• Multiple markets
  – E.g. capture TV programs in Denmark and Norway

Future work: Multiple sources and languages (2)

• Need language detection
  – Libraries exist
• Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?

Future work: Metadata

• Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
• Eventually: want them to be searchable

Thank you for coming!

• Any questions?
• My e-mail: [email protected]
• Slides available: http://ucla.in/IDJq2u