Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko, Colin Batchelor, Antony Williams

DESCRIPTION

Describes workflow for validation and standardization for Open PHACTS Chemical Registry System

Page 1: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Karen Karapetyan, Valery Tkachenko, Colin Batchelor, Antony Williams

Page 2: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Validation checks

• Correct file format (SDF, MOL, CDX, etc.)
• “Valid” chemical structure
  • Valid atoms (not query atoms)
  • Valid bonds
  • Valid valences
  • Valid charges
  • SP3 stereo
• Synonyms
  • Names (name to structure)
  • SMILES, InChIs (SMILES/InChI to structure)
• XRefs
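A minimal sketch of how a few of these structure checks could be expressed, assuming a toy in-memory record rather than the registry's actual code. The `Atom` class, the element and valence tables, and the `validate` function are all hypothetical and deliberately incomplete; charged atoms, stereo, synonyms, and XRefs are not checked here.

```python
from dataclasses import dataclass

# Illustrative subsets only (assumptions); a real check would cover the periodic table.
KNOWN_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Na", "K"}
NEUTRAL_VALENCES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "F": {1}, "Cl": {1}}
SUPPORTED_FORMATS = {"sdf", "mol", "cdx"}


@dataclass
class Atom:
    symbol: str
    charge: int = 0
    hydrogens: int = 0  # attached hydrogens that are not listed as separate atoms


def validate(file_format, atoms, bonds):
    """Return a list of issue messages; bonds are (i, j, order) with valid indices."""
    issues = []
    if file_format.lower() not in SUPPORTED_FORMATS:
        issues.append("unsupported file format")

    bond_order_sum = [0] * len(atoms)
    for i, j, order in bonds:
        if order not in (1, 2, 3):
            issues.append(f"invalid bond order {order} between atoms {i} and {j}")
            continue
        bond_order_sum[i] += order
        bond_order_sum[j] += order

    for idx, atom in enumerate(atoms):
        if atom.symbol not in KNOWN_ELEMENTS:
            issues.append(f"atom {idx}: {atom.symbol!r} is not an element (query atom?)")
            continue
        allowed = NEUTRAL_VALENCES.get(atom.symbol)
        valence = bond_order_sum[idx] + atom.hydrogens
        # Only neutral atoms with a known valence table entry are checked.
        if atom.charge == 0 and allowed is not None and valence not in allowed:
            issues.append(f"atom {idx}: unusual valence {valence} for {atom.symbol}")
    return issues


methanol = [Atom("C", hydrogens=3), Atom("O", hydrogens=1)]
print(validate("mol", methanol, [(0, 1, 1)]))
```

Returning a list of issues, instead of failing on the first one, lets every problem in a record be reported at once.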

Page 3: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Severity assigned to every validation issue

Page 4: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Filtering by severity and by issues
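Severity assignment and filtering could look roughly like the sketch below. The severity levels, the issue codes, and the `Issue` and `filter_issues` names are assumptions made for illustration; the slides do not list the actual levels or codes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; ordered so that comparisons work.
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    code: str          # hypothetical issue code, e.g. "invalid_valence"
    severity: Severity
    record_id: int


def filter_issues(issues, min_severity=Severity.INFORMATION, codes=None):
    """Keep issues at or above min_severity and, optionally, with a code in codes."""
    return [
        issue
        for issue in issues
        if issue.severity >= min_severity and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("query_atom", Severity.ERROR, 1),
    Issue("invalid_valence", Severity.ERROR, 2),
    Issue("undefined_stereo", Severity.WARNING, 2),
    Issue("layout_changed", Severity.INFORMATION, 3),
]
errors_only = filter_issues(issues, min_severity=Severity.ERROR)
valence_only = filter_issues(issues, codes={"invalid_valence"})
```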

Page 5: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization – Organometallics/Salts

• Always disconnect N, O, and F from metals:
• Disconnect other nonmetals (all except N, O, F) from transition metals (except Hg)
• Ionize a free Group I or II metal together with a carboxylic acid

[Figure: structure drawings for these rules; recoverable labels are O–, OH, ONa, and Na+.]
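The two disconnection rules can be read as a decision on a single metal–nonmetal bond. The sketch below is one possible reading, not the registry's implementation; the element sets are abbreviated, illustrative assumptions, and `should_disconnect` is a hypothetical name.

```python
# Abbreviated, illustrative element sets (assumptions, not complete tables).
ALWAYS_DISCONNECT = {"N", "O", "F"}
NONMETALS = {"C", "N", "O", "F", "P", "S", "Cl", "Se", "Br", "I"}
TRANSITION_METALS = {"Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
                     "Ru", "Rh", "Pd", "Ag", "Pt", "Au", "Hg"}
OTHER_METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Sn", "Pb"}
METALS = TRANSITION_METALS | OTHER_METALS


def should_disconnect(symbol_a, symbol_b):
    """Decide whether a bond between two atoms should be broken into ions."""
    if symbol_a in METALS and symbol_b in NONMETALS:
        metal, nonmetal = symbol_a, symbol_b
    elif symbol_b in METALS and symbol_a in NONMETALS:
        metal, nonmetal = symbol_b, symbol_a
    else:
        return False  # not a metal-nonmetal bond
    if nonmetal in ALWAYS_DISCONNECT:
        return True   # rule 1: N, O, F are always disconnected from metals
    # rule 2: other nonmetals are disconnected from transition metals, except Hg
    return metal in TRANSITION_METALS and metal != "Hg"


for pair in [("O", "Na"), ("Fe", "C"), ("C", "Hg"), ("C", "Li")]:
    print(pair, should_disconnect(*pair))
```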

Page 6: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization SMIRKS (based on InChI normalization and on FDA SRS)

Examples of InChI normalization

[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]

Examples of FDA SRS rules

[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]

Thiopurine:
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
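Because a SMIRKS rule is a reactant pattern and a product pattern joined by `>>`, a rule table can be sanity-checked before use by comparing the atom-map numbers on both sides. The sketch below uses only the standard library and does not apply the transformations to molecules; the rule keys and the `check_rule` helper are hypothetical names.

```python
import re

# Matches the atom-map number at the end of a bracket atom, e.g. "[O-:5]" -> "5".
ATOM_MAP = re.compile(r"\[[^\]]*:(\d+)\]")

# A few of the rules quoted above; the keys are labels made up for this example.
RULES = {
    "inchi-1": "[*;H+:1]>>[*;H:1]",
    "srs-1": "[n:1]=[O:2]>>[n+:1][O-:2]",
    "srs-2": "[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]",
    "srs-3": "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]",
}


def map_numbers(pattern):
    """Collect the atom-map numbers used in one side of a SMIRKS."""
    return {int(n) for n in ATOM_MAP.findall(pattern)}


def check_rule(smirks):
    """Return (dropped, unmatched): maps only in reactants, maps only in products."""
    reactants, separator, products = smirks.partition(">>")
    if not separator:
        raise ValueError("not a SMIRKS: missing '>>'")
    before, after = map_numbers(reactants), map_numbers(products)
    return before - after, after - before


for name, smirks in RULES.items():
    dropped, unmatched = check_rule(smirks)
    print(name, "dropped:", sorted(dropped), "unmatched:", sorted(unmatched))
```

A mapped atom that appears only on the reactant side is removed by the rule (the acid hydrogen `[H:6]` in the last rule); a map number that appears only on the product side usually indicates a typo.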

Page 7: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization

• Dearomatize
• Double bond with adjacent wiggly single bond
• Fold hydrogen atoms with no up or down bonds

[Figure: example structure drawings for these steps; recoverable atom labels are Cl, NH2, N, H, and O.]

Page 8: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization

• Remove symmetric stereocenters
• Turn off chiral flag if no up or down bonds
• Do layout

[Figure: example structures with NH2 groups; one is annotated “Chiral flag is set”.]
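The chiral-flag rule reduces to a single check over the bond stereo marks. In the sketch below the integer codes follow the V2000 molfile convention for single bonds (1 = up, 4 = either, 6 = down); the function name and the flat list of codes are assumptions for illustration.

```python
# Single-bond stereo codes as used in the V2000 molfile bond block.
UP, EITHER, DOWN = 1, 4, 6


def normalize_chiral_flag(chiral_flag, bond_stereo_codes):
    """Keep the chiral flag only if at least one bond is drawn up or down."""
    has_wedge = any(code in (UP, DOWN) for code in bond_stereo_codes)
    return bool(chiral_flag) and has_wedge


print(normalize_chiral_flag(True, [0, 0, 0]))      # flag set, no wedges
print(normalize_chiral_flag(True, [0, UP, 0]))     # flag set, one up bond
```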

Page 9: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Standardization – partially ionized acids (move the proton from the stronger acid to the weaker one)
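One way to picture this rule: keep the number of ionized acid groups unchanged, but make sure the ionized ones are the strongest acids. The sketch below assumes each acid group carries an estimated pKa (lower means stronger); the slides do not say how acid strength is ranked, so the `AcidGroup` class and the pKa values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class AcidGroup:
    name: str
    pka: float      # assumed estimate; lower pKa means a stronger acid
    ionized: bool   # True if the group is drawn deprotonated


def redistribute_protons(groups):
    """Return new groups where the ionized ones are the strongest acids.

    The total number of ionized groups (the net charge) is preserved.
    """
    n_ionized = sum(group.ionized for group in groups)
    strongest_first = sorted(range(len(groups)), key=lambda i: groups[i].pka)
    to_ionize = set(strongest_first[:n_ionized])
    return [
        AcidGroup(group.name, group.pka, index in to_ionize)
        for index, group in enumerate(groups)
    ]


drawn = [
    AcidGroup("sulfonic acid", -2.0, ionized=False),
    AcidGroup("carboxylic acid", 4.5, ionized=True),
]
fixed = redistribute_protons(drawn)
for group in fixed:
    print(group.name, "ionized" if group.ionized else "protonated")
```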

Page 10: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Parent generation is attempted for each compound

“Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)

Parent / Description / RDF

Charge-Insensitive
  Description: An attempt is made to neutralize ionized acids and bases. Expected to improve continually as new cases appear.
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460;

Isotope-Insensitive
  Description: Isotopes replaced by common weight
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459

Stereo-Insensitive
  Description: SP3 and double bond stereo removed
  RDF: void:linkPredicate skos:closeMatch cheminf:CHEMINF_000456

Tautomer-Insensitive
  Description: Tautomer canonicalization attempts to generate a canonical tautomer
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486;

Super Parent
  Description: Generated by applying all of the above modifications
  RDF: void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458;
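The relationship between the individual parents and the super parent can be sketched as function composition: each parent applies one modification to the compound, and the super parent applies all of them in turn. The `Mol` record below is a toy stand-in that tracks only which kinds of information are present, so the four step functions are placeholders for real neutralization, isotope removal, stereo removal, and tautomer canonicalization; all names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mol:
    skeleton: str
    charges: tuple = ()
    isotopes: tuple = ()
    stereo: tuple = ()
    tautomer: str = "as drawn"


# Placeholder steps: each one only clears or normalizes the relevant field.
def charge_parent(mol):
    return replace(mol, charges=())


def isotope_parent(mol):
    return replace(mol, isotopes=())


def stereo_parent(mol):
    return replace(mol, stereo=())


def tautomer_parent(mol):
    return replace(mol, tautomer="canonical")


PARENT_STEPS = {
    "charge": charge_parent,
    "isotope": isotope_parent,
    "stereo": stereo_parent,
    "tautomer": tautomer_parent,
}


def generate_parents(mol):
    """Return each single-modification parent plus the super parent."""
    parents = {name: step(mol) for name, step in PARENT_STEPS.items()}
    super_parent = mol
    for step in PARENT_STEPS.values():
        super_parent = step(super_parent)
    parents["super"] = super_parent
    return parents


compound = Mol("alanine", charges=("N+", "O-"), isotopes=("13C",), stereo=("S",))
parents = generate_parents(compound)
```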

Page 11: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

[Diagram: deposited substances, compounds, and parents]

Deposited Substances
• SID 1: SDF1, DataSource1, Synonym1, Synonym2, XRef1
• SID 2: SDF2, DataSource2, Synonym1, Synonym3, XRef2

Compounds
• OPS_ID 1: Standardized MOL; DataSource1, DataSource2, Synonym1, Synonym2, Synonym3, XRef1, XRef2
• OPS_ID 2: Standardized MOL; DataSource3, DataSource4, Synonym4, Synonym5, Synonym6, XRef3, XRef4

Parents
• Fragment
• Stereo Parent (OPS_ID 3)
• Isotope Parent (OPS_ID 4)
• Tautomer Parent (OPS_ID 5)
• Charge Parent (OPS_ID 6)
• Super Parent (OPS_ID 7)
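The diagram suggests that substances whose structures standardize to the same molecule share one compound record, which accumulates their data sources, synonyms, and XRefs. A minimal sketch of that merge follows, assuming a hypothetical `structure_key` that identifies the standardized structure; the class and function names are illustrative, not the registry's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: int
    structure_key: str  # hypothetical key of the standardized structure
    data_source: str
    synonyms: list
    xrefs: list


@dataclass
class Compound:
    ops_id: int
    data_sources: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)


def add_unique(target, values):
    """Append values not already present, preserving order."""
    for value in values:
        if value not in target:
            target.append(value)


def register(substances):
    """Group substances by structure key into compound records."""
    compounds = {}
    for substance in substances:
        compound = compounds.get(substance.structure_key)
        if compound is None:
            compound = Compound(ops_id=len(compounds) + 1)
            compounds[substance.structure_key] = compound
        add_unique(compound.data_sources, [substance.data_source])
        add_unique(compound.synonyms, substance.synonyms)
        add_unique(compound.xrefs, substance.xrefs)
    return list(compounds.values())


deposited = [
    Substance(1, "key-A", "DataSource1", ["Synonym1", "Synonym2"], ["XRef1"]),
    Substance(2, "key-A", "DataSource2", ["Synonym1", "Synonym3"], ["XRef2"]),
]
compounds = register(deposited)
```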

Page 12: Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Page 13: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

What do we use as the chemical identity of standardized records (primary compound key)?

• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)

Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI
  • does not distinguish between undefined and unknown stereo
  • by default does some basic tautomer canonicalization (not needed in new model)
  • by default assumes absolute stereo

Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to stereo description
• fixes mobile hydrogens (so tautomers can be distinguished)
• handles “AND-ed” relative stereo
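To illustrate the "needs to be hashed" drawback, the sketch below turns an arbitrarily long SMILES string into a fixed-length key, and assembles the proposed InChI options into a single option string. SHA-256 is an assumption chosen for the example (the slides do not name a hash), the SMILES is assumed to be canonical already, and the `-` option prefix is the command-line convention on Linux (`/` on Windows).

```python
import hashlib

INCHI_OPTIONS = ["SUU", "SLUUD", "FixedH", "SUCF"]


def smiles_key(canonical_smiles):
    """Hash a canonical SMILES into a fixed-length (64 hex characters) key."""
    return hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()


def inchi_option_string(prefix="-"):
    """Join the non-standard InChI options into one option string."""
    return " ".join(prefix + option for option in INCHI_OPTIONS)


key = smiles_key("C[C@H](N)C(=O)O")
print(len(key), key[:12])
print(inchi_option_string())
```

The hash only gives a compact key; two different SMILES strings for the same molecule still hash differently, so canonicalization has to happen first.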

Page 14: Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Thanks

We would appreciate any comments.

For comments or questions [email protected]