Transcript
Page 1: Television news search and analysis with lucene solr

Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>
Social Sciences Computing
University of California, Los Angeles

Lucene Revolution, May 10, 2012

Page 2: Television news search and analysis with lucene solr

Communication Studies Archive: Background (1)

• Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic

Page 3: Television news search and analysis with lucene solr

Communication Studies Archive: Background (2)

• Digital recording since 2005
• Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools

Page 4: Television news search and analysis with lucene solr

Communication Studies Archive: Background (3)

• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day

Page 5: Television news search and analysis with lucene solr

Communication Studies Archive: Background (4)

• January 2005 to present
  – 28 networks
  – 1,600 shows
  – 130,000 hours
  – 160,000 news programs
  – 50,000,000 images
  – 880,000,000 words

Page 6: Television news search and analysis with lucene solr

Why This is Important (1)

• Researchers
  – Large and unique collection of communication
  – Many modalities
    • Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news + communication in general in ways impossible before

Page 7: Television news search and analysis with lucene solr

Why This is Important (2)

• Non-researchers
  – TV news about presentation and persuasion
    • Which happen in daily life also
  – TV main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch

Page 15: Television news search and analysis with lucene solr

Application in Research

• Communication Studies
  – Amount of coverage for events over time
• Linguistics
  – Speech and language patterns
• Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation

Page 16: Television news search and analysis with lucene solr

Application in Teaching (1)

• Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories

Page 17: Television news search and analysis with lucene solr

Application in Teaching (2)

• Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    • Amount of time given to candidate/party
    • Types of response (positive, neutral, negative)
  – Students created their own political ad

Page 18: Television news search and analysis with lucene solr

Work flow (1): Capture/conversion machines

• 2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
• Capture video and CC to separate files in real time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)

[Diagram: capture/conversion machines, storage/control server, backup storage server, video streaming server, search server, image server]

Page 19: Television news search and analysis with lucene solr

Work flow (2): Storage/static file servers

• Control server
  – Download TV schedules
  – Download web-streamed news programs
  – Collect and check recordings
  – Push files to the other servers
    • Video streaming server
    • Backup storage server
    • Image server

Page 20: Television news search and analysis with lucene solr

Work flow (3): Search server

• Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
• Hosts search engine

Page 21: Television news search and analysis with lucene solr

The search process

[Diagram: the user performs searches, watches videos, and retrieves thumbnails and montages.
 Front end: web server with custom code (PHP).
 Bridge: PHP-Java Bridge or Solr.
 Back end (search server): custom code (Java), Lucene, Lucene index, MySQL database.
 Video server: video files, video streaming server (Wowza).
 Image server: web server (Apache), thumbnails and montages.]

Page 22: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (1)

• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number

Page 23: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (2)

• Problem 2: all matched search words might not be talking about same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan

Page 24: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (3)

• A news program can contain several stories
  – E.g. local, national, world, weather, sports

Page 25: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (4)

[Diagram: one news program divided into segments: local story 1, local story 2, commercials, national story 1, commercials, national story 2, health, entertainment, commercials, world story 1, world story 2, weather 1, weather 2, sports]

Page 26: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (5)

• One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
  – E.g. politics

Page 27: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (6)

• How to mark segments
  – Automated
    • Computer Science researchers working on them
    • Word frequency
    • Scene change
    • Black frame and silence
  – Manual segmentation
    • Watch the video
    • Decide where a story starts and ends
    • Mark positions in semi-automated system

Page 28: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (7)

[Diagram: three segments (seg. 1, seg. 2, seg. 3), each with a begin and an end marker, and five spans (span 1 to span 5) placed relative to the segment boundaries]

Page 29: Television news search and analysis with lucene solr

Custom query type: Segment-enclosed query (8)

• Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
• In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
• Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
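Sketch (not from the slides): the filtering step of the idea above, written in plain Python rather than against the Lucene API. It assumes spans and segments are both given as (start, end) token positions with an exclusive end; the function name and the sample data are hypothetical.

```python
from typing import List, Tuple

# Both spans and segments are (start, end) token positions, end exclusive.
Span = Tuple[int, int]
Segment = Tuple[int, int]


def spans_within_segments(spans: List[Span], segments: List[Segment]) -> List[Span]:
    """Keep only the spans that lie fully inside a single segment."""
    kept = []
    for span_start, span_end in spans:
        for seg_start, seg_end in segments:
            if seg_start <= span_start and span_end <= seg_end:
                kept.append((span_start, span_end))
                break  # a span can be enclosed by at most one segment here
    return kept


# Hypothetical data: three adjacent story segments and five candidate spans.
segments = [(0, 100), (100, 250), (250, 400)]
spans = [(10, 20), (90, 110), (120, 130), (240, 260), (300, 310)]
result = spans_within_segments(spans, segments)
print(result)
```

Spans that straddle a segment boundary, such as (90, 110), are dropped because the matched words belong to two different stories.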

Page 30: Television news search and analysis with lucene solr

Custom query type: Time-enclosed query

[Diagram: a timeline marked every 5 s from 20 s to 60 s, with five spans (span 1 to span 5) and duration limits of <= 10 s, <= 15 s, <= 20 s, <= 25 s, and <= 35 s]
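Sketch (not from the slides): one way a time-enclosed filter could work, in plain Python. It assumes each document carries a sorted list of (token position, seconds) timestamps, in the spirit of the caption timestamps recorded every 2-3 seconds, and that a span's duration is measured between the latest timestamps at or before its first and last tokens. The function names and sample values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]          # (start, end) token positions, end exclusive
Timestamp = Tuple[int, float]   # (token position, seconds into the program)


def time_at(position: int, timestamps: List[Timestamp]) -> float:
    """Return the time of the latest timestamp at or before the position."""
    positions = [p for p, _ in timestamps]
    index = max(bisect_right(positions, position) - 1, 0)
    return timestamps[index][1]


def spans_within_seconds(spans: List[Span], timestamps: List[Timestamp],
                         max_seconds: float) -> List[Span]:
    """Keep spans whose first and last tokens are at most max_seconds apart."""
    kept = []
    for start, end in spans:
        duration = time_at(end - 1, timestamps) - time_at(start, timestamps)
        if duration <= max_seconds:
            kept.append((start, end))
    return kept


# Hypothetical timestamps and spans.
timestamps = [(0, 20.0), (8, 22.5), (15, 25.0), (23, 27.5), (30, 30.0), (60, 40.0)]
spans = [(2, 10), (5, 31), (14, 62)]
result = spans_within_seconds(spans, timestamps, 10.0)
print(result)
```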

Page 31: Television news search and analysis with lucene solr

Custom query type: Multi-term regular expression (1)

• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
  – Not just individual words
• Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us

Page 32: Television news search and analysis with lucene solr

Custom query type: Multi-term regular expression (2)

• Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    • Want to match a pattern against a phrase/sentence
    • Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
    • Very slow
    • Takes lots of memory

Page 33: Television news search and analysis with lucene solr

Custom query type: Multi-term regular expression (3)

• What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    • E.g. “_ _ is the _ _ _”
    • Preserve for correctness
    • Store word count for each document
    • Expand each span on both sides
    • Bounds checking
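Sketch (not from the slides): the matching semantics of a multi-term pattern, shown over a plain token list in Python. This is not the Lucene translation described above; it only illustrates what a match means. It assumes that pattern parts are separated by whitespace, that "_" stands for exactly one whole word, and that every other part is a regular expression applied to a single term. The function names and the sample sentence are hypothetical.

```python
import re
from typing import List, Optional, Pattern, Tuple


def compile_pattern(pattern: str) -> List[Optional[Pattern[str]]]:
    """Split on whitespace; "_" becomes None (any one word), the rest a regex."""
    return [None if part == "_" else re.compile(part) for part in pattern.split()]


def find_matches(tokens: List[str], pattern: str) -> List[Tuple[int, int]]:
    """Return (start, end) token positions, end exclusive, of every match."""
    parts = compile_pattern(pattern)
    width = len(parts)
    matches = []
    # The range bound is the bounds check: leading and trailing placeholders
    # must still fall inside the document.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(p is None or p.fullmatch(t) for p, t in zip(parts, window)):
            matches.append((start, start + width))
    return matches


tokens = "and here is our reporter jane with the story from downtown".split()
result = find_matches(tokens, "here is _ _ _ with the (news|story|details|report)")
print(result)
```

A pattern with leading or trailing placeholders, such as "_ here is _", matches only where the document has enough words on both sides, which is why the word count of each document is needed.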

Page 34: Television news search and analysis with lucene solr

Custom query type: Multi-term regular expression (4)

• Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
• Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help

Page 35: Television news search and analysis with lucene solr

Custom result format: Occurrence count

[Table: rows are dates (9/14/08, 9/15/08, 9/16/08, ...), columns are words (crisis, meltdown, tsunami, crash, ...); each cell shows “X docs, Y occurrences”]

To fill a cell: go through every span generated by (SpanTermQuery(meltdown) filtered by date 9/15/08)
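Sketch (not from the slides): how the cells of such a table could be computed, in plain Python instead of Lucene spans. It assumes each document is a (date, tokens) pair and reports, per date and word, the number of documents containing the word and the total number of occurrences. The function name and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def occurrence_table(docs: List[Tuple[str, List[str]]],
                     words: List[str]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map (date, word) to (document count, occurrence count)."""
    wanted = set(words)
    table = defaultdict(lambda: [0, 0])
    for date, tokens in docs:
        per_doc: Dict[str, int] = {}
        for token in tokens:
            if token in wanted:
                per_doc[token] = per_doc.get(token, 0) + 1
        for word, count in per_doc.items():
            cell = table[(date, word)]
            cell[0] += 1       # one more document contains the word
            cell[1] += count   # add this document's occurrences
    return {key: (docs_n, occ_n) for key, (docs_n, occ_n) in table.items()}


# Hypothetical documents.
docs = [
    ("2008-09-15", "the meltdown deepens as the crisis spreads a meltdown".split()),
    ("2008-09-15", "markets crash amid meltdown fears".split()),
    ("2008-09-16", "crisis talks continue".split()),
]
table = occurrence_table(docs, ["crisis", "meltdown", "tsunami", "crash"])
print(table[("2008-09-15", "meltdown")])
```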

Page 36: Television news search and analysis with lucene solr

Future work: Job queue (1)

• Research front moving towards analysis of whole database
  – Want full search result set
  – Queries are intensive and take a long time
• Solution has to go beyond increasing the timeout
  – Users might close their browsers
  – We might restart the search back end

Page 37: Television news search and analysis with lucene solr

Future work: Job queue (2)

• Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota

Page 38: Television news search and analysis with lucene solr

Future work: Multiple sources and languages (1)

• Multilingual news programs
  – E.g. some have English + Spanish CC
• Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
• Multiple markets
  – E.g. capture TV programs in Denmark and Norway

Page 39: Television news search and analysis with lucene solr

Future work: Multiple sources and languages (2)

• Need language detection
  – Libraries exist
• Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?

Page 40: Television news search and analysis with lucene solr

Future work: Metadata

• Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
• Eventually: want them to be searchable

Page 41: Television news search and analysis with lucene solr

Thank you for coming!

• Any questions?
• My e-mail: [email protected]
• Slides available: http://ucla.in/IDJq2u

