John M. Barnar IRFS 040610

  • 8/8/2019 John M. Barnar IRFS 040610

    www.digitalchemistry.co.uk

    Searching the Atoms and Bonds in Chemical Patents

    Presented at IRF Symposium

    Vienna, Austria

    4 June 2010

    Dr John M. Barnard

    Scientific Director

    Digital Chemistry Ltd., UK

    Outline

    - Chemical structures in patents
    - Principles of searching for chemical structures
    - History of chemical structure searching in patents
    - Current developments
      - specific structures vs. Markush structures
      - automatic analysis vs. manual curation
      - online systems vs. in-house systems
    - Retrieval performance evaluation

    Chemical structures in patents

    The most important information in a chemical patent is often the chemical structure disclosed or claimed.

    - specifies atoms and bonds present and the way they are connected
    - integrated mixture of text and structure diagrams

    Markush structures

    [Structure diagram: example Markush structure with variable groups R1, R2 and R3]

    R1 = phenyl / cyclohexyl / ...
    R2 = H / methyl / ...
    R3 = H / Cl / NO2 / ...

    Dr Eugene Markush (1887-1968)

    Patents may include both a Markush structure claim and exemplified specific structures.

    Classes of molecules with common structural features
    - may cover millions (or infinite numbers) of specific structures
    - allow protection of related molecules with common properties
    - named after inventor involved in US legal case in 1924

    Chemical structures in patents

    [Annotated patent extract, labelled: Markush structure; Specific structure; name]

    Markush structures

    Specific structures can be generated by combinatorial assembly of alternatives for each R-group

    [Annotated Markush diagram, labelled: Variable-position attachment; Variable multiplicity; Generic groups; Specific groups; Non-structural description]
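
    The combinatorial assembly described on this slide can be sketched in a few lines. This is a minimal illustration only: it uses the R-group alternatives from the earlier example (with the open-ended "..." entries left out) and represents each specific structure simply as a set of R-group choices, not as a real molecular graph. The function name enumerate_specific is made up for this sketch.

```python
from itertools import product

# Alternatives for each R-group, taken from the example Markush definition
# in this deck; the open-ended "..." alternatives are deliberately omitted.
r_groups = {
    "R1": ["phenyl", "cyclohexyl"],
    "R2": ["H", "methyl"],
    "R3": ["H", "Cl", "NO2"],
}

def enumerate_specific(r_groups):
    """Yield one dict of R-group choices per specific structure."""
    names = list(r_groups)
    # product() walks every combination of one alternative per R-group.
    for combo in product(*(r_groups[name] for name in names)):
        yield dict(zip(names, combo))

structures = list(enumerate_specific(r_groups))
print(len(structures))  # 2 * 2 * 3 = 12 combinations
print(structures[0])
```

    Even this tiny definition yields 12 structures. Each extra R-group multiplies the count, which is why real Markush claims can cover millions of specific structures, and generic groups with unbounded alternatives cannot be enumerated exhaustively at all.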

    Substructure search

    Search a database of chemical structures for all those containing a specified pattern of atoms and bonds (substructure)

    [Structure diagrams: a query substructure and three database molecules, labelled "Retrieved molecule", "Retrieved molecule" and "Molecule not retrieved"]

    Substructure search

    Originally applied to databases of specific structures (single, fully-defined molecules)

    Exact and deterministic search algorithms
    - based on topological graph theory
    - 100% recall
    - 100% precision

    Search retrieves all database molecules that contain the query substructure and none of those that don't.

    Substructure search also possible for Markush structures, but more complicated.
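
    As a rough illustration of the graph-theoretic idea, the sketch below treats each molecule as a graph of labelled atoms and unordered bonds and uses simple backtracking to test whether the query graph can be mapped onto part of a database molecule. It is a toy under stated assumptions: hydrogens and bond orders are ignored, there is no fragment pre-screening, and the three example molecules and all function names are chosen for this sketch only.

```python
def bond(a, b):
    """Unordered bond between two atom indices."""
    return frozenset((a, b))

def contains_substructure(q_atoms, q_bonds, m_atoms, m_bonds):
    """True if every query atom and bond can be mapped into the molecule."""
    q_order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(q_order):
            return True  # all query atoms placed
        q = q_order[len(mapping)]
        for m, element in m_atoms.items():
            # Candidate must have the same element and not be used already.
            if element != q_atoms[q] or m in mapping.values():
                continue
            # Every query bond to an already-mapped atom must exist
            # between the corresponding molecule atoms.
            if all(bond(m, mapping[p]) in m_bonds
                   for p in mapping if bond(q, p) in q_bonds):
                mapping[q] = m
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend({})

# Query: N-C-O (one carbon bonded to both a nitrogen and an oxygen).
query = ({0: "N", 1: "C", 2: "O"}, {bond(0, 1), bond(1, 2)})

# Heavy-atom skeletons only; bond orders are ignored in this sketch.
database = {
    "acetamide": ({0: "C", 1: "C", 2: "O", 3: "N"},
                  {bond(0, 1), bond(1, 2), bond(1, 3)}),
    "ethylamine": ({0: "C", 1: "C", 2: "N"},
                   {bond(0, 1), bond(1, 2)}),
    "ethanolamine": ({0: "N", 1: "C", 2: "C", 3: "O"},
                     {bond(0, 1), bond(1, 2), bond(2, 3)}),
}

hits = [name for name, (atoms, bonds) in database.items()
        if contains_substructure(*query, atoms, bonds)]
print(hits)
```

    Because the match is exact, a molecule is either a hit or not. That is the sense in which search over specific structures gives 100% recall and 100% precision with respect to the query as drawn.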

    Patent searching before 1980

    Chemical Fragmentation Codes
    - substructure fragments used as index terms
    - manually assigned by expert coders
    - applied both to specific and Markush structures
    - search uses Boolean logic for required combinations

    Fragment codes were originally designed for punched cards.

    Connectivity / alternativeness relationships between fragments usually lost

    Poor precision
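
    A Boolean search over fragment codes can be sketched as set operations. The code names and records below are invented for illustration and do not correspond to any real fragmentation code scheme; boolean_search is likewise a hypothetical helper.

```python
# Hypothetical index: record -> set of fragment codes assigned by a coder.
index = {
    "patent_A": {"AMIDE", "PYRIDINE", "METHYL"},
    "patent_B": {"AMIDE", "PYRIDINE"},
    "patent_C": {"PYRIDINE", "METHYL"},
}

def boolean_search(index, required, excluded=frozenset()):
    """Records that carry all required codes and none of the excluded ones."""
    return sorted(
        doc for doc, codes in index.items()
        if required <= codes and not (excluded & codes)
    )

# AND query: both fragments must be indexed for the record.
hits = boolean_search(index, {"AMIDE", "PYRIDINE"})
print(hits)

# AND NOT query.
print(boolean_search(index, {"PYRIDINE"}, excluded={"AMIDE"}))
```

    A hit only says that both codes were assigned somewhere in the record. It does not say that the fragments are bonded to each other, or that they occur in the same alternative of a Markush structure, which is the source of the poor precision noted on the slide.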

    Patent searching in the 1980s

    "Topological" / graphical systems introduced
    - with display of structure diagrams

    Specific Structures
    - Initial work with non-patent databases
      - journal literature
      - "in-house" structures
    - Commercial systems operational by start of decade
      - "public" databases: CAS Online, Système DARC
      - "in-house" data: MDL MACCS etc.

    Markush Structures
    - Sheffield University academic research on patent Markush storage and retrieval
    - Commercial systems and databases launched at end of decade
      - Markush DARC (Derwent / Questel / INPI)
      - MARPAT (Chemical Abstracts)

    Patent searching since 1990

    Commercial search systems
    - Little change
    - still only available online with proprietary databases
    - showing their age with clunky interfaces
    - fragment code systems still widely used

    Databases
    - New databases of specific structures from patents
      - Reaxys (formerly MDL/Elsevier Chemical Patent Database)
      - SURECHEM
    - Machine-readable patent documents available direct from patent offices
    - Some automation in database creation

    Related developments

    Markush applications outside patent field
    - informatics for "combinatorial libraries"
      - specific structure enumeration
      - physicochemical property calculation
      - Markush searching

    Data mining
    - Chemical data extraction from free text and diagrams
      - structure diagram "OCR"
      - chemical nomenclature translation
    - Research work on capture of Markush structures from free-text patents

    New "in-house" systems for patent Markush search under development

    Which way forward?

    Which structures to index and search?
    - Exemplified / enumerated specific structures
    or
    - Markush structure

    Markush structures cover the scope of the patent more comprehensively (better recall), but are more complicated to search, and can lead to poor retrieval precision.

    How to build the databases?
    - Manual input and curation
    or
    - Automatic analysis of full text patent

    At least at present, searchers regard curated databases as the "gold standard" for retrieval performance.

    Different approaches

    [Diagram: databases and data-mining software positioned by structure type (Specific Structures vs. Markush Structures) and by method (Manually Curated vs. Automatically Extracted)]

    Systems shown: MMS, MARPAT, SureChem, CA Registry, Derwent Chemistry Resource, Reaxys, IBM, CLiDE, chemoCR, DecrIPt

    Using specific structures

    Conventional approach
    - Extract specific structures from patent
      - manual curation: CA Registry, Derwent Chemistry Resource
      - automatic extraction: SureChem, IBM
      - combination of both: Reaxys
    - Search using standard substructure search software

    Issues
    - Selection of compounds
      - exemplified
      - "prophetic"
      - anything with a name
    - Effectiveness of automatic nomenclature identification and translation
    - Correctness of systematic names in patent document

    Using specific structures

    Other "text analytics" approaches

    Accelrys/Notiora Work
    - Automatic chemical name to structure conversion
    - Structure "fingerprints" for each molecule based on substructure fragments
    - Logical "OR" of fingerprints for whole patent
    - Structural similarity search based on logical "OR" fingerprints and maximum common substructure

    IBM Work
    - Automatic chemical name to structure conversion
    - Vector representation derived from IUPAC International Chemical Identifier (InChI)
    - Structural similarity search based on comparison of vector representations
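
    The fingerprint idea can be illustrated with plain integers used as bit sets. Everything specific here is an assumption made for the sketch: the fragment vocabulary is invented, and the Tanimoto coefficient is used only as one common choice of similarity measure (the slide does not say which measure the systems described use). The maximum common substructure step is not covered.

```python
from functools import reduce

# Hypothetical fragment vocabulary; one bit position per fragment.
FRAGMENTS = ["amide", "pyridine", "methyl", "chloro", "nitro", "phenyl"]

def fingerprint(fragments):
    """Bit set with one bit switched on per fragment present."""
    bits = 0
    for frag in fragments:
        bits |= 1 << FRAGMENTS.index(frag)
    return bits

def patent_fingerprint(molecule_fps):
    """Logical OR of the fingerprints of all molecules in one patent."""
    return reduce(lambda a, b: a | b, molecule_fps, 0)

def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

# Two molecules extracted from the same (imaginary) patent.
mol_1 = fingerprint(["amide", "pyridine"])
mol_2 = fingerprint(["pyridine", "methyl", "chloro"])
patent_fp = patent_fingerprint([mol_1, mol_2])

query = fingerprint(["amide", "pyridine", "methyl"])
print(tanimoto(query, patent_fp))  # 3 shared bits / 4 bits in union
print(tanimoto(query, mol_1), tanimoto(query, mol_2))
```

    Note that the query scores higher against the OR-ed patent fingerprint than against either molecule alone. Whole-patent fingerprints summarise the chemistry of the document, at the cost of matching combinations of fragments that never occur together in a single molecule.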

    Using Markush structures

    Existing systems
    - Two online systems/databases available since late 1980s
      - Merged Markush Service (Thomson Reuters / Markush DARC)
      - MARPAT (Chemical Abstracts Service / STN)

    Problems
    - Excessively broad Markushes defy existing systems, and give poor recall / precision
    - Searchers often faced with manually sifting 1000+ hits to find 5 or 6 relevant patents

    "R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."

    Searchers' comments

    Discussion in "breakout group" at International Patent Information Conference (IPI-Confex), Venice, Mar 2009

    - Multiple search tools are needed for comprehensive retrieval
    - Search strategies need to focus on the core structure of interest and put up with poor precision
    - Current systems based on automatic extraction and analysis of nomenclature have limited usefulness
    - Suggestions for improvement:
      - ranking of search output
      - more comprehensive indexing of specific structures

    In-house Markush systems?

    Advantages over existing online systems

    Informatics support for drug discovery
    - Integration of patent data with other chemical databases
    - end-user chemist access to patent data
    - Adding patentability criteria to drug design
    - Adjunct or preliminary to existing systems
    - confidentiality advantages

    Data mining
    - Structure activity analysis
    - Physico-chemical property calculation
    - Competitive intelligence
    - Identification of unpatented "gaps" in chemical space

    Possible use of structural similarity and cluster analysis techniques

    In-house Markush systems?

    Prospects

    Software
    - New Markush search systems under development
      - Digital Chemistry Ltd.
      - ChemAxon
    - Also work on selective enumeration of specific structures from Markush
      - DecrIPt Inc.

    Databases
    - Existing curated databases
      - Thomson Reuters have expressed interest in making MMS data available
      - MARPAT database another obvious possibility
    - "Home-grown" databases for specialist purposes
      - input software needed
    - Automatic extraction from patent documents

    Automatic Markush extraction

    Currently a "hot area" for research, after a fallow period
    - complex combined issues of text and image processing, nomenclature translation and semantic analysis

    Sheffield University
    3 publications (1992-97), initially analysing Derwent patent abstracts.

    CLiDE Pro (KeyModule Ltd.)
    Work by A.P. Johnson (2009) extending earlier chemical OCR software.

    Cambridge University, Unilever Centre for Molecular Informatics
    Ongoing work by Murray-Rust group on analysis of full-text patents, extending OPSIN nomenclature translation program.

    chemoCR (Fraunhofer SCAI)
    Recent work on prototype software for Markush "reconstruction" from patent text, with limited success.

    ChemProspector (InfoChem)
    Ongoing research into extraction of Markush structures from patents.

    Commercially-viable operational systems probably still some way off.

    Precision and recall

    Patent databases with specific structures

    Substructure search finds all molecules in database that contain query substructure
    - 100% precision
    - 100% recall

    For patent retrieval, however:
    - Poor precision: database contains irrelevant or trivial molecules from patent text
    - Poor recall: database omits molecules covered by Markush structure
    - Database contains incorrect molecules (errors in nomenclature identification / translation)
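
    For reference, precision and recall can be computed from the set of retrieved documents and the set of relevant ones, using the standard definitions. The numbers below are illustrative only: they echo the "1000+ hits for 5 or 6 relevant patents" situation mentioned earlier, with one missed relevant patent added as a hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    # Precision: fraction of retrieved documents that are relevant.
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative search: 1000 hits (documents 0..999), of which 5 are
# relevant, plus one relevant patent (id 5000) that the search missed.
retrieved = range(1000)
relevant = {0, 1, 2, 3, 4, 5000}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

    In this made-up case the searcher reads 1000 documents to find 5, so precision is half a percent even though recall is fairly high.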

    Precision and recall

    Patent databases with Markush structures

    Poor precision
    - Query substructure matches highly generic description in unimportant part of Markush

    Poor recall
    - Using system search options to avoid matches with highly generic descriptions
      - "broad/narrow translation" (DARC)
      - "match level" (MARPAT)

    [Structure diagram: query substructure] matches "R84 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."

    Patent search evaluation

    Chemical substructure search systems usually give 100% precision and 100% recall
    - retrieval performance evaluation not important

    Not really true for chemical patent searches
    - much more room for argument about whether or not a hit is relevant

    Designers of chemical patent search systems may need to pay more attention to performance evaluation
    - Precision / Recall trade-off
    - Evaluation of hit relevance in context of type of query
    - Consideration of the relative importance of different parts of Markush (what the patent "teaches")

    TREC-CHEM

    - Multi-year evaluation project under auspices of long-running Text Retrieval Conferences (TREC)
    - Uses chemical patent data and queries with relevance judgements
    - Used to compare retrieval experiments performed by different groups
    - Results from first year (2009) presented elsewhere
      - most search approaches based on automatic analysis of patent text, nomenclature extraction etc.
      - some issues identified concerning automated relevance judgements based on cited prior art documents

    TREC-CHEM

    TREC-CHEM is not using data from curated databases
    - these are the current "industry standard" against which new approaches will ultimately be judged
    - TREC rules do not allow commercial databases to be included

    It would be valuable to apply TREC-type evaluation to search systems for patent chemistry that are based on commercial databases
    - MARPAT vs. MMS/Markush DARC
    - existing systems and databases vs. new ones using new techniques, automated data extraction etc.

    There are many potential benefits to practising patent searchers in involving cheminformaticians in the IR-IP debate for which the IRF provides a forum.

    Contact details

    Dr John M. Barnard

    Scientific Director, Digital Chemistry Ltd.
    46 Uppergate Road, Sheffield S6 6BX, UK

    [email protected]
    +44 (0)114 233 3170