
  • 7/31/2019 I3.2.pptx

    Urbana-Champaign, May 12, 2012

    INARC I3.2 Mid-Year Report
    I3.2: Modeling and Mining of Text-Rich Information Networks

    Dan Roth (Task Co-Lead), Jiawei Han (Task Co-Lead)

    Heng Ji (CUNY), Xifeng Yan (UCSB)

    University of Illinois at Urbana-Champaign
    NS-CTA: INARC


    I3.2: Modeling and Mining of Text-Rich Information Networks

    Key Objectives:

    Structurally model a text-rich info. network and investigate methods for mining knowledge from such networks

    Enhance keyword search and knowledge discovery capability by the text-rich info. network model

    Deliverables:

    Q1: Methodologies for modeling and construction of multi-dimensional, relatively structured info. networks by progressive info. network analysis

    Q2: Models for enhanced text data analysis using relatively structured, heterogeneous info. networks

    Q3: Methods for multi-facet search in text-rich info. networks

    Q4: System prototype demo of the approaches

    Impact:

    Modeling, principles, and methodologies developed for text-rich info. networks will lead to more relevant query results

    Key Technical Innovations:

    Exploitation of mostly unstructured data from reports, along with some relatively structured metadata and the links (e.g., hyperlinks) between reports, to discover key entities associated with a given query. The exploitation builds upon semantic processing (e.g., topic modeling), network analysis (e.g., iTopics) and data mining (e.g., topic/text cubes) technologies

    Efficient algorithms to enrich text mining techniques with the info. network topology

    Information trustworthiness analysis in text-rich info. networks and other text-rich networks

    Role            Researchers

    Lead            D. Roth, UIUC (INARC)
    Lead            J. Han, UIUC (INARC)
    Primary         H. Ji, CUNY (INARC)
    Primary         X. Yan, UCSB (INARC)
    Collaborators   N. Chawla, Notre Dame (SCNARC) (linked with E2.3)
                    J. J. Garcia-Luna-Aceves, UCSC (CNARC)
                    M. Magdon-Ismail, RPI (SCNARC) (linked with S2.1)
                    Z. Wen, IBM (SCNARC)

    Total           $322K


    Advancing the State-of-the-Art of Network Science

    Text-Rich Information Networks: Combining contents & network

    Focused on large heterogeneous information networks

    Collections of news articles from diverse resources, blogs and forums

    Wikipedia, an information network consisting of structured and unstructured data

    Developed state-of-the-art algorithmic tools

    Supporting knowledge acquisition, information extraction, text modeling and integrated information structure discovery

    Utilizing deep text analysis & large-scale statistical models over the content and the structure of the network

    Making use of both explicit network structure and hidden ontological structure (e.g., category structure)

    Advanced our understanding of how to:

    Acquire and extract information from heterogeneous information networks when data is noisy, volatile, uncertain, and incomplete


    Overall Task Organization

    [Diagram: three linked subtasks]

    Subtask 1 (Text-Rich InfoNet Construction): Modeling and construction of multi-dimensional, relatively structured information networks by integrated text and information analysis

    Subtask 2 (Topic Modeling and Discovery with InfoNet): Enhanced text data analysis using relatively structured, heterogeneous information networks

    Subtask 3 (Multi-Facet Search): Multi-facet search in text-rich information networks


    Novelty Claims

    Subtask 1: Modeling and construction of multi-dimensional, relatively structured information networks by integrated text and information analysis

    Explicitly capture the interplay between textual topics and network structure

    Subtask 2: Enhanced text data analysis using relatively structured, heterogeneous information networks

    Novel theories and methods to make text data and information network mutually enhance each other in text understanding and information analysis

    Subtask 3: Multi-facet search in text-rich information networks

    Exploring effective methods for search and mining in text-rich information networks


    Subtask 1: Text-Rich Network Modeling and Construction

    Modeling and construction of multi-dimensional, relatively structured information networks by integrated text and information analysis

    Data Fusion and Information Network Fusion: Web structure mining for integration of web data with info. networks [WWW11, SIGMOD11 demo]

    Wikification (integration of Wikipedia for entity/concept resolution) [ACL11]

    Enrichment & disambiguation of information network

    Dynamic Acquisition of Taxonomic Relations Network [EMNLP10]

    Leverage Semantic Information Network to Enhance Entity Co-reference Resolution and Entity Identification [ACL-HLT11]

    Micro and Macro Collaborative Networks Ranking for Entity and Event Coreference Resolution [EMNLP2011 SUB]

    Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery [TAC10]


    Growing Parallel Paths for Web Structure Mining

    [Figure: HTML DOM paths (HTML, DIV, UL, LI, TABLE, TR, TD, P, A) through Pages A to F, with numbered steps 1 to 6 and link targets X, Y, Z, W, U, V; labels "Path" and "Result"]

    Tim Weninger, Fabio Fumarola, Cindy Xide Lin, Rick Barber, Jiawei Han, and Donato Malerba, "Growing Parallel Paths for Entity-Page Discovery", WWW'11, Mar. 2011

    http://www.cs.uiuc.edu/homes/hanj/pdf/www11_twininger.pdf

    WinaCS: Web Information Network Analysis for Computer Science

    Web structure-guided information extraction and integration

    Integration of DBLP information networks

    Integration of mined web structures with DBLP networks for knowledge-base construction

    Supports intelligent querying & mining

    Tim Weninger, Marina Danilevsky, et al., "WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.

    http://www.cs.uiuc.edu/homes/hanj/pdf/sigmod11_tweninger.pdf

    Wikification: Example: Entity Resolution & Tracking

    Regi Blinker played three matches for Oranje. The left winger started his pro career at Feyenoord and played 400 official matches for Feyenoord, Celtic and Sparta. He retired from football in 2003. Where is he now?


    Wikification [ACL2011]

    Given:

    An information network consisting of news articles and blogs

    Wikipedia: Text, Structured Information, Network (hyperlink) Structure and Ontological (Category) Structure

    Goal:

    Identify all entities and concepts mentioned in articles and blogs

    Disambiguate & map each entity and concept to its appropriate Wikipedia page: Entity (and Concept) Resolution

    Associate with each concept a collection of semantic attributes

    Progressively enrich the information network and enable better access to it

    Approach:

    A global optimization problem that accounts for:

    Local, node-specific information

    Global, node and network structure information

    Ontological network structure

    Machine Learning algorithms determine candidates and rank nodes

    Lev Ratinov, Doug Downey, Mike Anderson, Dan Roth, "Local and Global Algorithms for Disambiguation to Wikipedia", ACL11


    Identification of Taxonomic Relations [EMNLP2010]

    The use of information networks to acquire Taxonomic Relations.

    Given:

    An information network consisting of news articles and blogs

    Pairs of Concepts or Entities

    Make use of: Wikipedia Text, Structured Information, Network (hyperlink) Structure and Ontological (Category) Structure.

    Goal:

    Developing large ontologies is essential to progressively enrich the information network and enable better access to it.

    A huge amount of work has been done on developing stationary networks

    These suffer from low coverage, noise, and brittleness

    A Machine Learning & Optimization based approach

    Exploits the fact that data in heterogeneous information networks is noisy, uncertain, and incomplete.

    Considers multiple relations and makes use of a global constraint optimization process to leverage both Wikipedia and the web.

    Significantly outperforms existing well-known taxonomical networks.

    Examples:

    (Honda, Toyota) are Siblings

    M1A2 is-a Tank is-a Vehicle

    AK-47 is-a Gun

    Quang Do and Dan Roth, "Constraints based Taxonomic Relation Classification", EMNLP10
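    The relation types in the examples above (is-a chains and siblings) can be illustrated with a small lookup over a hand-written hierarchy. This is only a sketch of what the labels mean, not the learning and constraint-optimization approach of the cited paper. The `PARENT_OF` table, the function name `relation`, and the parent concept "Carmaker" are assumptions made for illustration.

```python
# Toy hierarchy: child -> parent. "Carmaker" is a hypothetical parent
# concept added for illustration; the slide only says Honda and Toyota
# are siblings.
PARENT_OF = {
    "M1A2": "Tank",
    "Tank": "Vehicle",
    "AK-47": "Gun",
    "Honda": "Carmaker",
    "Toyota": "Carmaker",
}


def relation(parent_of, a, b):
    """Return the taxonomic relation between concepts a and b."""

    def ancestors(x):
        # Walk up the is-a chain (assumes the hierarchy has no cycles).
        chain = []
        while x in parent_of:
            x = parent_of[x]
            chain.append(x)
        return chain

    if b in ancestors(a):
        return "is-a"          # a is-a b, possibly through several steps
    if a in ancestors(b):
        return "ancestor-of"   # b is-a a
    if a != b and a in parent_of and parent_of[a] == parent_of.get(b):
        return "sibling"       # a and b share the same direct parent
    return "none"


print(relation(PARENT_OF, "M1A2", "Vehicle"))
print(relation(PARENT_OF, "Honda", "Toyota"))
```

    A fixed table like this has exactly the coverage and brittleness problems listed above, which is why the slide argues for acquiring such relations dynamically.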


    Leverage Semantic Information Network to Enhance Entity Coreference Resolution / Entity Identification

    Disambiguation

    Name Variant Clustering

    9.4% absolute improvement in micro-averaged accuracy

    (CUNY) Heng Ji and Ralph Grishman. "Knowledge Base Population: Successful Approaches and Challenges". ACL-HLT2011.

    https://agora.cs.illinois.edu/download/attachments/30425499/5.pdf?version=1&modificationDate=1300050634857

    Micro and Macro Collaborative Networks Ranking for Entity and Event Coreference Resolution

    [Figure: query q with nodes cq1 to cq7 and candidate objects o_A(q) and o_B(q); scores 0.7, 0.4 and 0.3, 0.6; label "correct rank"]

    Previous methods only focused on the target node and one learning theory itself

    Propose a new collaborative network ranking theory which imitates human collaborative learning

    Leverage inter-connections among collaborative entities in information networks

    Automatic profiling for each node

    Construct a collaborative network for each entity based on graph-based clustering

    Rank multiple decisions from collaborative entities (micro) and algorithms (macro) based on global prediction

    7% absolute improvement in micro-averaged accuracy

    On-going CUNY+UIUC work: using topic modeling for entity clustering

    (CUNY) Zheng Chen and Heng Ji. 2011. Collaborative Ranking: A Case Study in Entity Linking. Proc. EMNLP2011 [SUB]


    Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery

    [Figure: "911 Suspect Terrorist Network" and "Terrorist Information Network". Nodes V3 to V16 include Wail Al-Shehri, Waleed Al-Shehri, Abdul Aziz Al-Omari, Abdul Rahman Al-Omari, Mohamed Atta, Khamis Mushait, Al-Qaeda, Boston and Saudi Arabian Airlines; edge labels: origin, member, sibling, residence, pilot; sources: news page, web blog, twitter, forum]

    Discovered 26 roles for persons, 16 roles for organizations and 13 roles for locations

    Markov Logic Networks for cross-slot and cross-query reasoning based on InfoNet and textual linkages to resolve conflicts and predict missing links

    Weight=15:

    Weight=100:

    Maximum Entropy based Learning-to-rank model to re-rank candidate answers

    13%-22% absolute F-measure improvement

    (CUNY) Chen et al. "CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description". Proc. TAC2010 and Lecture Notes in Computer Science, 2010

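    \forall x, y, z:\ \mathrm{Ambiguous}(x, y) \wedge \mathrm{TextualLinkage}(y, z) \wedge \mathrm{Pilot}(x) \wedge \mathrm{Pilot}(z) \Rightarrow \mathrm{Remove}(x)

    \forall x, y, z:\ \mathrm{Sibling}(x, y) \wedge \mathrm{Origin}(y, z) \Rightarrow \mathrm{Origin}(x, z)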

    https://agora.cs.illinois.edu/download/attachments/30425499/hji_app_16.pdf?version=1&modificationDate=1300051108556
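    A rule of the form Sibling(x, y) AND Origin(y, z) => Origin(x, z) predicts a missing link from links that are already known. The sketch below applies that rule as a hard logical closure over toy facts. A Markov Logic Network instead attaches a weight to each rule and treats it as a soft constraint, so this only illustrates the rule's effect, not MLN inference. The function name, the symmetric treatment of Sibling, and the placeholder entities are assumptions.

```python
def infer_origin(siblings, origins):
    """Apply Sibling(x, y) AND Origin(y, z) => Origin(x, z) to a fixed point.

    siblings: iterable of (person, person) pairs
    origins:  iterable of (person, place) pairs
    Returns only the newly inferred (person, place) pairs.
    """
    # Assumption of this sketch: Sibling is symmetric.
    sib = set(siblings) | {(b, a) for a, b in siblings}
    known = set(origins)
    changed = True
    while changed:
        changed = False
        for x, y in sib:
            # Copy the set so it can grow while we scan it.
            for person, place in list(known):
                if person == y and (x, place) not in known:
                    known.add((x, place))
                    changed = True
    return known - set(origins)


# Placeholder entities, not facts taken from the slide.
siblings = [("person_a", "person_b")]
origins = [("person_b", "city_k")]
print(infer_origin(siblings, origins))
```

    With weighted rules, an inferred link like this would be one piece of evidence to be balanced against conflicting evidence rather than a certain conclusion.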

    Subtask 2: Network-Enhanced Text Analysis

    Enhanced text data analysis using relatively structured, heterogeneous information networks

    Progressive Dynamic Information Network Analysis [EMNLP11, ACL-HLT2011 (sub)]

    Integration of Heterogeneous Info. Network and Topic Modeling (Biased Propagation) [KDD11 (sub)]

    Topic Modeling for Active Learning and Inference in Event Network Construction [ACL-HLT2011 (sub)]

    Geographical Topic Discovery & Comparison [WWW11]

    Latent Association Analysis of Document Pairs [KDD11 (sub)]


    Progressive Dynamic Information Network Analysis

    [Figure: example network linking Ali Larijani, Iran Supreme National Security Council, Islamic Republic of Iran Broadcasting, Farideh Motahari, Tehran University and Hassan Rowhani]

    Motivations

    Most information obtained on text-rich InfoNet construction so far is viewed as static, ignoring the temporal dimension of many types of attributes

    Approaches

    Temporal Role Representation

    [T1 T2 T3 T4] =

    New Evaluation Metric

    Local temporal role discovery using new kernel methods based on dependency paths

    Global inference and aggregation to resolve conflicts using Integer Linear Programming (ongoing collaboration with Dan Roth at UIUC)

    Results

    State-of-the-art temporal role classification accuracy and lowest vagueness/over-constraining

    [Figure panels: Baseline; Aggregation over 2 tuples; Aggregation over 10 tuples; Our Approach on Information Networks]

    (CUNY) Javier Artiles, Qi Li, Enrique Amigo and Heng Ji. 2011. Leveraging Cross-document Redundancy for Temporal Information Extraction. EMNLP2011, ACL-HLT2011 [SUB]


    Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks [KDD11 (sub)]

    Problem and Motivation:

    Discover latent topics & identify clusters of multi-typed objects simultaneously

    Treat multi-typed objects differently (e.g., D with rich text & U without explicit text)

    Solution and Contribution:

    Basic idea: biased topic propagation

    Propose a novel TMBP algorithm to directly incorporate a heterogeneous InfoNet, instead of a homogeneous InfoNet, with topic modeling (improves 20%-40% over PLSA)

    [Figure: topic modeling with heterogeneous InfoNet; topic model and biased propagation]

    (UIUC) Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, and Cindy Xide Lin, "Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks", KDD'11 (sub)
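    The idea of propagating topics between text-rich documents and objects that have no text of their own can be sketched as one round of averaging. This is a simplified illustration under stated assumptions, not the TMBP algorithm itself: the function name, the uniform link weights, and the mixing parameter `xi` are hypothetical, and the document topic distributions are taken as given rather than learned.

```python
def propagate_topics(doc_topics, links, xi=0.5):
    """One round of topic propagation between documents and linked objects.

    doc_topics: dict doc -> list of topic probabilities (e.g. from a topic model)
    links:      dict object -> list of documents linked to that object
    xi:         how strongly a document is pulled toward its linked objects
    """
    k = len(next(iter(doc_topics.values())))

    # Objects without text inherit the mean topic distribution of their documents.
    obj_topics = {
        o: [sum(doc_topics[d][z] for d in docs) / len(docs) for z in range(k)]
        for o, docs in links.items()
    }

    # Each document is then smoothed toward the objects it is linked to.
    new_doc_topics = {}
    for d, probs in doc_topics.items():
        owners = [o for o, docs in links.items() if d in docs]
        if not owners:
            new_doc_topics[d] = list(probs)
            continue
        mean = [sum(obj_topics[o][z] for o in owners) / len(owners) for z in range(k)]
        new_doc_topics[d] = [(1 - xi) * p + xi * m for p, m in zip(probs, mean)]
    return obj_topics, new_doc_topics


doc_topics = {"d1": [0.9, 0.1], "d2": [0.7, 0.3], "d3": [0.2, 0.8]}
links = {"u1": ["d1", "d2"], "u2": ["d3"]}
obj_topics, smoothed = propagate_topics(doc_topics, links)
print(obj_topics["u1"], smoothed["d1"])
```

    Because every step is a convex combination of probability vectors, each output remains a valid distribution over topics.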


    Topic Modeling for Active Learning and Inference in Event Network Construction

    Topic modeling can enhance information network construction by grouping similar event types together and converging information distributions

    Using topic modeling, with only 1/4 of the training data we can achieve performance comparable to passive learning

    Cross-document inference within topic clusters provided a 10% improvement over state-of-the-art event extraction, and significant gains over IR-based clustering

    Ongoing work: apply new entity-driven and biased-propagation-based topic modeling methods

    (CUNY + UIUC) Hao Li, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Topically Related Data is Better Data: Topic Modeling for Event Extraction. ACL-HLT2011 [sub]

    [Figure: documents Doc 1 to Doc N grouped into topic clusters, shown with topic words such as Putin, weapons, nuclear, talks, forces, troops, army, military, British, AFP, million, government, dollars, convicted, billion, company, court, sentence, Pyongyang, China, officials, Washington, north, south, Korea, program, United States, Saddam, control, fighting, city, Baghdad, Iraqi, regime, Kurdish, York, case, media. Clusters are linked to event types:]

    Event Type: "Contact"; Trigger: talk, meet, etc.; Arguments: "Entity", "Instrument", "Place", "Time-Within"

    Event Type: "Business"; Trigger: form, dissolve; Arguments: "Org", "Place", "Time-Within", "Agent"

    Event Type: "Attack"; Trigger: blew, attack; Arguments: "Attacker", "Target", "Place", "Time-Within"

    Event Type: "Transaction"; Trigger: Borrow, Launch; Arguments: "Giver", "Recipient", "Money", "Seller", "Artifact", "Buyer"

    Event Type: "Justice"; Trigger: Arrest, Jail; Arguments: "Defendant", "Time-Within", "Adjudicator", "Place"


    Geographical Topic Discovery & Comparison

    Motivation: Analyze GPS-associated documents, e.g., geo-tagged photos and tweets sent from iPhones

    Problem: Given a collection of GPS-associated documents and # of topics K, discover K geo-topics along with the topic distribution in different geo. locations

    Latent Geographical Topic Analysis

    Combine text and GPS location info

    Words that are close to each other are more likely to be in the same region. Words that are in the same region are more likely to be in the same topic

    Regions are not known beforehand. Our framework adapts the region discovery process according to the dataset

    Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang, "Geographical Topic Discovery and Comparison", WWW'11, Mar. 2011

    http://www.cs.uiuc.edu/homes/hanj/pdf/www11_zyin.pdf

    Latent Association Analysis of Document Pairs

    Latent Association Analysis (LAA) mines the topics of two document sets simultaneously, taking the bipartite network between two document sets into consideration

    One of the first attempts to analyze the topic structures of two connected document sets, aiming to infer their mapping network model

    LAA significantly outperforms existing algorithms with 70% accuracy improvement

    [Figure: Topic Simplex for Corpus 1 and Topic Simplex for Corpus 2, linked through a Correlation Factor over Document Pairs]

    Gengxin Miao, et al., "Latent Association Analysis of Document Pairs", KDD11 (sub)


    Subtask 3: Multi-Facet Search and Mining

    Information Network-Based Trustworthiness Analysis [COLING10, Army Sci10 (Best Paper Award), WWW11]

    Progressive Network Analysis for Expert Search (Diffusion through Co-occurrence Relationships for Expert Search on the Web) [SIGIR11 sub]

    Modeling and Exploiting Heterogeneous Sources for Expertise Ranking [SIGIR11 sub]

    Personalized Recommendation on Information Networks [SIGIR11 sub]

    Multi-facet Search in Self-Boosting Information Networks (Demo: Terrorism Network Search and Browsing) [SIGIR11 sub]


    Information Network-Based Trustworthiness Analysis

    Given:

    Multiple information networks: websites, blogs, forums, sensor networks

    Some claims, e.g., [Person A travelled to France], [There is a fire in downtown Chicago]

    Prior beliefs and background knowledge

    Our goal is to:

    Score trustworthiness of claims based on:
      support across multiple (trusted) sources in the network
      source characteristics: reputation, interest-group, verifiability of information, etc.
      prior beliefs and background knowledge

    Rate databases/sources as more/less trustworthy

    Track how the trustworthiness of a fact / database varies with time as the text corpus grows over time

    New framework for incorporating prior knowledge into any fact-finding algorithm

    Done via a Linear Programming approach

    Highly expressive declarative constraints

    Tractable (polynomial time)

    Prior knowledge improves results

    Absolutely essential when the user's judgment varies from the norm

    Dan Roth et al., COLING10, Army Sci10 (Best Paper Award), WWW11
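    To make "fact-finding algorithm" concrete, here is a minimal sketch of a basic sum-based fact-finder in the hubs-and-authorities style, where source trust and claim belief reinforce each other. It deliberately leaves out the prior-knowledge and Linear Programming framework described above. The function name, the toy sources and claims, and the max-normalization are assumptions made for illustration.

```python
def sums_fact_finder(assertions, iterations=20):
    """Iteratively score sources and claims.

    assertions: dict source -> non-empty set of claims that source asserts
    Returns (trust, belief), each scaled so the top score is 1.0.
    """
    claims = {c for cs in assertions.values() for c in cs}
    belief = {c: 1.0 for c in claims}
    trust = {s: 1.0 for s in assertions}
    for _ in range(iterations):
        # A source is as trustworthy as the claims it asserts are believed.
        trust = {s: sum(belief[c] for c in cs) for s, cs in assertions.items()}
        top = max(trust.values())
        trust = {s: t / top for s, t in trust.items()}
        # A claim is believed in proportion to the trust of its sources.
        belief = {
            c: sum(trust[s] for s, cs in assertions.items() if c in cs)
            for c in claims
        }
        top = max(belief.values())
        belief = {c: b / top for c, b in belief.items()}
    return trust, belief


# Toy data: two sites report a fire, one blog denies it.
assertions = {
    "site_a": {"fire_downtown"},
    "site_b": {"fire_downtown"},
    "blog_c": {"no_fire"},
}
trust, belief = sums_fact_finder(assertions)
print(belief["fire_downtown"] > belief["no_fire"])
```

    A plain fact-finder like this simply follows the majority, which is why a mechanism for injecting prior knowledge matters when the user's judgment differs from the norm.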


    Progressive Network Analysis for Expert Search

    Goal: find and rank people who have the expertise described by a user query

    Web pages are noisier and contain more spam than an enterprise corpus. Both relevance and reputation should be considered

    Use a heterogeneous hypergraph to model the co-occurrence relationships among people and words, and devise a heat diffusion model on the hypergraph

    Applied to 0.5B web pages

    Accuracy: 50%-200% improvement over the leading language model methods. Significantly overcomes noise in the Web.

    Ziyu Guan, et al., "Diffusion through Co-occurrence Relationships for Expert Search on the Web", SIGIR11 (sub)
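    Heat diffusion for expert search can be pictured as heat starting on the query words and flowing along co-occurrence links to people. The sketch below is a simplification under stated assumptions: it uses an ordinary weighted graph rather than the heterogeneous hypergraph of the cited work, and the function name, step size `alpha`, step count, and toy edges are hypothetical.

```python
def heat_diffusion(edges, sources, alpha=0.1, steps=10):
    """Discrete heat diffusion on an undirected weighted graph.

    edges:   dict (u, v) -> co-occurrence weight
    sources: nodes that start with heat 1.0 (e.g. the query words)
    alpha:   step size; keep alpha * (max weighted degree) well below 1
    """
    nodes = {n for e in edges for n in e}
    nbrs = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    heat = {n: (1.0 if n in sources else 0.0) for n in nodes}
    for _ in range(steps):
        # Each node gains heat from hotter neighbors and loses it to cooler ones.
        heat = {
            n: heat[n] + alpha * sum(w * (heat[m] - heat[n]) for m, w in nbrs[n].items())
            for n in nodes
        }
    return heat


# Toy co-occurrence counts between words and people.
edges = {
    ("word:retrieval", "alice"): 3.0,
    ("word:retrieval", "bob"): 1.0,
    ("word:cooking", "carol"): 2.0,
}
heat = heat_diffusion(edges, sources={"word:retrieval"})
ranking = sorted(["alice", "bob", "carol"], key=heat.get, reverse=True)
print(ranking)
```

    People who co-occur more strongly with the query words receive more heat within a fixed number of steps, and people with no path to the query receive none.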


    Modeling and Exploiting Heterogeneous Sources for Expertise Ranking

    Problem: How to leverage both heterogeneous network and documents to identify the relevant experts for a given query?

    Baseline: The expertise of a person could be characterized based on his/her associated documents (doc-based method)

    Intuitions:

    Citation graph: Similar documents are likely to have similar relevance to a given query

    Coauthor graph: Two authors are most likely to share similar expertise if they coauthor many papers

    Document-author bipartite graph: mutual reinforcement between documents (x) and authors (y)

    Solution: We formulate a joint regularization framework to incorporate several hypotheses to capture the information of different graphs together with textual documents

    Result: Using DBLP with 2M nodes and 10M edges. Significant improvements over the baseline.

    [Figure: coauthor graph and citation graph; top-10 experts for the query "Information retrieval"]

    Hongbo Deng, et al., "Modeling and Exploiting Heterogeneous Sources for Expertise Ranking", SIGIR11 (sub)
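    The doc-based baseline mentioned above, where a person's expertise is characterized by his or her associated documents, can be sketched as a simple aggregation of document relevance per author. This is an illustration of the baseline idea only, not the joint regularization framework. The function name, the additive scoring, and the toy relevance values are assumptions.

```python
def doc_based_expert_scores(doc_relevance, authorship):
    """Rank authors by the summed query relevance of their documents.

    doc_relevance: dict doc -> relevance of the doc to the query
    authorship:    dict doc -> list of authors of that doc
    Returns a list of (author, score) pairs, best first.
    """
    scores = {}
    for doc, authors in authorship.items():
        for author in authors:
            scores[author] = scores.get(author, 0.0) + doc_relevance.get(doc, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy relevance scores for a single query.
doc_relevance = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
authorship = {"p1": ["alice", "bob"], "p2": ["alice"], "p3": ["carol"]}
for author, score in doc_based_expert_scores(doc_relevance, authorship):
    print(author, round(score, 2))
```

    The baseline ignores citation and coauthor links entirely, which is the gap the graph-based hypotheses above are meant to fill.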


    Multi-Facet Search in Self-Boosting Information Networks (Example: Terrorism Network Search and Browsing)

    Demo: http://blender2.cs.qc.cuny.edu/BlenderGraph/

    Video: http://nlp.cs.qc.cuny.edu/terrorism.m4v

    Facilitate a military analyst in expert finding and in terrorist information search, gathering, control and analysis for any given query

    Entity-topic analyzer for self-expansion and self-boosting: terrorism organization members, status of members (die, arrest, ...) and information networks associated with each member

    (CUNY + UIUC) Sam Anzaroot, Javier Artiles, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Search and Browsing Self-Boosting Information Networks. SIGIR2011 [SUB]

    Military Relevance

    Subtask 1: Text-Rich Network Modeling and Construction

    Object search enhanced by entity disambiguation and role discovery can provide methods for finding groups of soldiers and identifying terrorists with certain expertise

    Subtask 2: Network-Enhanced Text Analysis

    Asymmetric wars and counter-terrorism need to understand text-rich networks

    Text mining for monitoring potential threats and detecting terrorism with entity-topic modeling and event detection and tracking

    Subtask 3: Multi-Facet Search and Mining in Text-Rich Networks

    Most military applications need multi-facet search over text and unstructured data, including emails, reports, telecommunication messages, military-related news and blogs

    Our multi-facet, multi-dimensional information network search and browsing tool has rich functions and provides intelligent network expansions


    I3.2's Collaboration Network

    I3.2: Han, Ji, Roth, Yan (weekly telecons, frequent emails, 5 joint papers)

    [Figure: collaboration links from I3.2 to I1.1 (Roth, Huang), I1.2 (Tarek, Charu), I3.1 (Han, Yan), E2.3 (Han), T1.1 (Adali), T1.4 (Parsons), T1.5 (Lin, Wen), S1.1 (Lin), ARL (Cole, Winkler) and IRC (Leung); link labels: Logic Reasoning for Information Validation, Data & Experiments, Military Data for Topic Analysis]


    Next Six Months and Path Ahead to 2012

    Continue research on mining text-intensive information networks

    Research on three frontiers: (1) integrated classification and clustering in network mining, (2) building a theory of link/relationship analysis in heterogeneous networks, and (3) exploring military applications

    Collaborations with researchers in other networks

    Work with Nitesh Chawla, who has done much work on link prediction, on evaluation of mining methods for clustering and classification of heterogeneous networks

    Work with SCNARC (Boleslaw Szymanski et al.) on using the methods developed here to mine social and cognitive networks

    Research planned for next year, if funded

    Effective theory and methods for mining heterogeneous networks involving social and communication networks

    Network classification and clustering modeling in heterogeneous information, social, and communication networks

    Application of role discovery, network classification, and anomaly detection methods in military applications


    Research Papers (Accepted/Published, 2011)

    1. Tim Weninger, Marina Danilevsky, Fabio Fumarola, Joshua Hailpern, Jiawei Han, et al., "WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11), (system demo paper), Athens, Greece, June 2011.

    2. Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang, "Geographical Topic Discovery and Comparison", Proc. of 2011 Int. World Wide Web Conf. (WWW'11), Hyderabad, India, Mar. 2011 (Full paper).

    3. Tim Weninger, Fabio Fumarola, Cindy Xide Lin, Rick Barber, Jiawei Han, and Donato Malerba, "Growing Parallel Paths for Entity-Page Discovery", Proc. of 2011 Int. World Wide Web Conf. (WWW'11), Hyderabad, India, Mar. 2011 (Poster paper).

    4. Heng Ji and Ralph Grishman. "Knowledge Base Population: Successful Approaches and Challenges". Accepted by Proc. the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT2011), 2011.

    5. Heng Ji, Adam Lee and Wen-Pin Lin. "Information Network Construction and Alignment from Automatically Acquired Comparable Corpora". Invited book chapter for Building and Using Comparable Corpora. Springer, 2011.

    6. Heng Ji, Benoit Favre, Wen-Pin Lin, Dan Gillick, Dilek Hakkani-Tur and Ralph Grishman. 2011. "Open-domain Multi-document Summarization via Information Extraction: Challenges and Prospects". Invited book chapter for Multi-source, Multilingual Information Extraction and Summarisation. Springer.

    7. Lev Ratinov, Doug Downey, Mike Anderson, Dan Roth, "Local and Global Algorithms for Disambiguation to Wikipedia", ACL11.

    8. Y. Chan and D. Roth, "Exploiting Syntactico-Semantic Structures for Relation Extraction", ACL11.

  • 7/31/2019 I3.2.pptx

    30/33

    Research Papers (Published, Sept.-Dec. 2010)

    1. Manish Gupta, Rui Li, Zhijun Yin, and Jiawei Han, "Survey on Social Tagging Techniques", SIGKDD Explorations, 12(1):58-72, 2010.

    2. Lu Liu, Jie Tang, Jiawei Han, Meng Jiang, Shiqiang Yang, "Mining Topic-Level Influence in Heterogeneous Networks", Proc. 2010 ACM Int. Conf. on Information and Knowledge Management (CIKM'10), Toronto, Canada, Oct. 2010.

    3. Tim Weninger, Fabio Fumarola, Jiawei Han, Donato Malerba, "Mapping Web Pages to Database Records via Link Paths", Proc. 2010 ACM Int. Conf. on Information and Knowledge Management (CIKM'10), Toronto, Canada, Oct. 2010.

    4. Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han, "The Wisdom of Social Multimedia: Using Flickr for Prediction and Forecast", Proc. 2010 ACM Multimedia Int. Conf. (ACM-Multimedia'10), Florence, Italy, Oct. 2010.

    5. Zheng Chen, Suzanne Tamang, Adam Lee, Xiang Li, Wen-Pin Lin, Javier Artiles, Matthew Snover, Marissa Passantino and Heng Ji. "CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description". Proc. Text Analysis Conference (TAC2010), 2010.

    6. Hao Li, Xiang Li, Heng Ji and Yuval Marton. 2010. "Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation". Proc. the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC 2010).

    7. J. Pasternack and Dan Roth, "Comprehensive Trust Metrics for Information Networks", Army Science Conf. '10 (Best Paper Award), Dec. 2010.

    8. Q. Do and D. Roth, "Constraints based Taxonomic Relation Classification", EMNLP'10, Oct. 2010.

    Research Papers (Submitted, 2011)

    1. (UIUC + U. Michigan) Cindy Xide Lin, Qiaozhu Mei, Yunliang Jiang, and Jiawei Han, "Inferring the Diffusion and Evolution of Topics in Social Communities", KDD'11 (submitted)

    2. (UIUC) Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, Cindy Xide Lin, "Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks", KDD'11 (submitted)

    3. (UIUC) Zhijun Yin (UIUC), Liangliang Cao (UIUC), Jiawei Han (UIUC), Chengxiang Zhai (UIUC), Thomas Huang (UIUC), "LPTA: A Probabilistic Model for Latent Periodic Topic Analysis", KDD'11 (submitted)

    4. (CUNY + UIUC) Heng Ji and Jiawei Han. 2011. Web-Scale Knowledge Discovery and Information Extraction. Invited Paper for IEEE Special Issue on Web-Scale Multimedia Processing and Applications. In Preparation.

    5. (CUNY + UIUC) Hao Li, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Topically Related Data is Better Data: Topic Modeling for Event Extraction. Submitted to the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT2011).

    6. (CUNY + UIUC) Sam Anzaroot, Javier Artiles, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Search and Browsing Self-Boosting Information Networks. Submitted to the 34th Annual International ACM SIGIR Conference (SIGIR2011).

    7. (CUNY) Javier Artiles, Qi Li, Enrique Amigo and Heng Ji. 2011. Leveraging Cross-document Redundancy for Temporal Information Extraction. Submitted to Empirical Methods in Natural Language Processing (EMNLP2011).

    8. (CUNY) Javier Artiles, Enrique Amigo, Qi Li and Heng Ji. 2011. Evaluating Temporal Information Extraction. Submitted to ACL-HLT2011.

    9. (CUNY) Zheng Chen and Heng Ji. 2011. Collaborative Ranking: A Case Study in Entity Linking. Submitted to Conference on Empirical Methods in Natural Language Processing (EMNLP2011).

    10. (CUNY) Qi Li, Javier Artiles and Heng Ji. 2011. Dependency Paths Kernel for Temporal Relation Classification. Submitted to the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT2011).

    11. (CUNY) Suzanne Tamang and Heng Ji. 2011. Learning-to-Rank for Slot Filling System Combination and Assessment. Submitted to Conference on Empirical Methods in Natural Language Processing (EMNLP2011).

    12. (CUNY) Zheng Chen, Suzanne Tamang, Adam Lee and Heng Ji. 2011. A Toolkit for Knowledge Base Population. Submitted to the 34th Annual International ACM SIGIR Conference (SIGIR2011).

    13. (CUNY) Xiang Li and Heng Ji. 2011. Comment-guided Reinforcement Learning for Slot Filling. Submitted to Conference on Empirical Methods in Natural Language Processing (EMNLP2011).

    Other Technical Contributions

    (Book: UIC + UIUC + CMU) Philip S. Yu, Jiawei Han, and Christos Faloutsos (Editors), LINK MINING: MODELS, ALGORITHMS AND APPLICATIONS, Springer, 2010.

    (UIUC) Jiawei Han has received the Daniel C. Drucker Eminent Faculty Award at UIUC.

    (UCSB) Ms. Gengxin Miao, who was supported by the INARC program, has received an IBM Ph.D. Fellowship for 2011-2012. Gengxin Miao is co-supervised by Xifeng Yan at INARC.

    (CUNY) Heng Ji. CUNY Chancellor's "Salute to Scholar" Award, November 2010.

    (CUNY) Heng Ji. National Science Foundation Research Experiences for Undergraduates, March 2011

    Jiawei Han, "Towards Integrated Mining of Multiple Social and Information Networks" (keynote speech), The 2011 Int. Conf. on Advances in Social Network Analysis and Mining (ASONAM'11), July 2011.

    Jiawei Han, "Exploring the Power of Heterogeneous Information Networks in Data Mining" (keynote speech), The 2011 Int. SIAM Data Mining Conf. (SDM'11), April 2011.

    Jiawei Han, "Construction and Analysis of Web-Based Computer Science Information Networks" (keynote speech), The 2011 Int. Conf. on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, June 2011.

    Latifur Khan, Wei Fan, Jiawei Han, Jing Gao, Mohammad Mehedy Masud, "Data Stream Mining: Challenges and Techniques" (tutorial), The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011), May 2011.

    Jiawei Han, "Web Structure Mining and Information Network Analysis: An Integrated Approach", invited speech at the Third International Workshop on Network Theory: Web Science Meets Network Science, March 2011.

    Heng Ji, "Web-Scale Knowledge Discovery and Population from Unstructured Data", Keynote Speech, ACLCLP 2010 Information Retrieval Conference, December 2010.

    Heng Ji, "Overview of the TAC2010 Knowledge Base Population Track", Keynote Speech at Web People Search (WePS-3) Conference, September 2010.

    Personalized Recommendation on Information Networks

    Concept extraction (diagram labels: Text, Concept)

    Combine text & links in heterogeneous networks

    Find good conceptual associations of user interests; distinguish clean sources and noisy sources

    (UIUC) Chi Wang, et al., Learning Relevance in a Heterogeneous Social Network and Its