Describes the workflow for validation and standardization in the Open PHACTS Chemical Registry System.
Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko, Colin Batchelor, Antony Williams
Validation checks
• Correct file format (SDF, MOL, CDX, etc.)
• “Valid” chemical structure
  • Valid atoms (not query atoms)
  • Valid bonds
  • Valid valences
  • Valid charges
  • SP3 stereo
• Synonyms
  • Names (name to structure)
  • SMILES, InChIs (SMILES/InChI to structure)
• XRefs
• Severity assigned to every validation issue
• Filtering by severity and by issues
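The idea of attaching a severity to every validation issue and then filtering by severity or by issue type can be sketched in a few lines of Python. This is only an illustration: the severity levels, the issue codes and the `filter_issues` helper are hypothetical names, not part of the registry system described in the slides.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; the slides do not name the levels.
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    code: str          # hypothetical issue identifier, e.g. "query_atom"
    severity: Severity
    message: str


def filter_issues(issues, min_severity=Severity.INFO, codes=None):
    """Keep issues at or above min_severity, optionally restricted to given codes."""
    return [
        issue
        for issue in issues
        if issue.severity >= min_severity and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("query_atom", Severity.ERROR, "Structure contains a query atom"),
    Issue("valence", Severity.ERROR, "Unusual valence"),
    Issue("sp3_stereo", Severity.WARNING, "Undefined SP3 stereocenter"),
    Issue("xref", Severity.INFO, "No cross-references supplied"),
]

errors_only = filter_issues(issues, min_severity=Severity.ERROR)
stereo_only = filter_issues(issues, codes={"sp3_stereo"})
```

Using an ordered enum keeps "filter by severity" a simple comparison, while the optional `codes` set covers "filter by issues".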
Standardization – Organometallics/Salts
• Always disconnect N, O, and F from metals
• Disconnect other nonmetals (except N, O, F) from transition metals (except Hg)
• Ionize free metals with carboxylic acids (Group I and II metals)
[Figure: structure diagrams for this slide; surviving labels: O–, OH, ONa, Na+]
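The two disconnection rules on this slide can be read as a small decision function. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, the caller is assumed to have already identified the first atom as a metal, and the transition-metal set is a plain d-block list (which counts Zn, Cd and Hg as transition metals) rather than whatever element classification the registry actually uses.

```python
# Assumption: d-block elements of periods 4-6 stand in for "transition metals".
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}

ALWAYS_DISCONNECT = {"N", "O", "F"}


def should_disconnect(metal: str, nonmetal: str) -> bool:
    """Decide whether a metal-nonmetal bond is disconnected.

    `metal` is assumed to be the symbol of an atom already known to be a metal.
    """
    # Rule 1: N, O and F are always disconnected from metals.
    if nonmetal in ALWAYS_DISCONNECT:
        return True
    # Rule 2: other nonmetals are disconnected from transition metals, except Hg.
    return metal in TRANSITION_METALS and metal != "Hg"
```

For example, `should_disconnect("Na", "O")` and `should_disconnect("Fe", "C")` are both true, while `should_disconnect("Hg", "C")` is false because mercury is the stated exception.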
Standardization – SMIRKS (based on InChI normalization and on FDA SRS)
Examples of InChI normalization
[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
Thiopurine:
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
Standardization
• Dearomatize
• Double bond with adjacent wiggly single bond
• Fold hydrogen atoms with no up or down bonds
Standardization
• Remove symmetric stereocenters
• Turn off chiral flag if no up or down bonds
• Do layout
[Figure: example structures with NH2 groups; annotation: Chiral flag is set]
Standardization – partially ionized acids (move the proton from a stronger acid to a weaker one)
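One way to picture this rule: in a partially ionized molecule the number of ionized acid groups stays the same, but the negative charges are reassigned so that the strongest acids (lowest pKa) are the ionized ones. The sketch below is a toy model only. The `AcidSite` class, the function name and the pKa values are illustrative assumptions, and the slides do not say how acid strength is actually ranked.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcidSite:
    label: str
    pka: float      # illustrative value only; lower means stronger acid
    ionized: bool   # True if the site currently carries the negative charge


def redistribute_protons(sites):
    """Keep the number of ionized sites, but ionize the strongest acids first."""
    n_ionized = sum(site.ionized for site in sites)
    # Indices ordered from strongest acid (lowest pKa) to weakest.
    strongest_first = sorted(range(len(sites)), key=lambda i: sites[i].pka)
    to_ionize = set(strongest_first[:n_ionized])
    return [replace(site, ionized=(i in to_ionize)) for i, site in enumerate(sites)]


# A molecule drawn with the weaker acid ionized and the stronger acid protonated.
before = [
    AcidSite("carboxylic acid", pka=4.5, ionized=True),
    AcidSite("sulfonic acid", pka=-2.0, ionized=False),
]
after = redistribute_protons(before)
```

After the call the sulfonic acid site is the ionized one and the carboxylic acid carries the proton, so the overall charge is unchanged.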
For each compound, parent generation is attempted
Sitzmann et al., “Tautomerism in large databases”, J. Comput. Aided Mol. Des. (2010)
Parent: Charge-Insensitive
  Description: An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460;
Parent: Isotope-Insensitive
  Description: Isotopes are replaced by the common weight.
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
Parent: Stereo-Insensitive
  Description: SP3 and double bond stereo are removed.
  RDF: void:linkPredicate skos:closeMatch; cheminf:CHEMINF_000456
Parent: Tautomer-Insensitive
  Description: Tautomer canonicalization attempts to generate a canonical tautomer.
  RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486;
Parent: Super Parent
  Description: The super parent is generated by applying all of the above modifications.
  RDF: void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458;
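The RDF column above pairs each parent type with a link predicate and a CHEMINF term. As a rough sketch, that mapping can be kept in a lookup table and rendered as a Turtle-style property list. The dictionary keys, the function name and the `:linkset` subject are hypothetical, and the stereo entry is assumed to use `dul:expresses` like the other rows even though the slide text does not show it.

```python
# Parent type -> (void:linkPredicate value, CHEMINF term), copied from the slide.
PARENT_LINKS = {
    "charge": ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "isotope": ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    # Assumption: the stereo row also uses dul:expresses, as the other rows do.
    "stereo": ("skos:closeMatch", "cheminf:CHEMINF_000456"),
    "tautomer": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "super": ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def linkset_description(parent_type: str, subject: str = ":linkset") -> str:
    """Render the link predicate and CHEMINF term for one parent type.

    `subject` is a hypothetical placeholder name for the linkset resource.
    """
    predicate, relation = PARENT_LINKS[parent_type]
    return (
        f"{subject} void:linkPredicate {predicate} ;\n"
        f"    dul:expresses {relation} ."
    )


super_parent_links = linkset_description("super")
```

Only the super parent uses `skos:broadMatch`; every other parent type is linked with `skos:closeMatch`.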
Fragment
Deposited Substances
  SID 1: SDF1, DataSource1, Synonym1, Synonym2, XRef1
  SID 2: SDF2, DataSource2, Synonym1, Synonym3, XRef2
Compounds
  OPS_ID 1: Standardized MOLECULE; DataSource1, DataSource2, Synonym1, Synonym2, Synonym3, XRef1, XRef2
  OPS_ID 2: Standardized MOL; DataSource3, DataSource4, Synonym4, Synonym5, Synonym6, XRef3, XRef4
Parents
  Charge Parent (OPS_ID 6)
  Isotope Parent (OPS_ID 4)
  Stereo Parent (OPS_ID 3)
  Tautomer Parent (OPS_ID 5)
  Super Parent (OPS_ID 7)
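The diagram shows two deposited substances (SID 1 and SID 2) collapsing into one compound (OPS_ID 1) that carries the union of their data sources, synonyms and cross-references. A minimal sketch of that merge is given below. It assumes each substance has already been standardized and reduced to a `structure_key` string. The class names, the field names and the sequential OPS_ID assignment are illustrative, not the registry's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: int
    structure_key: str   # hypothetical key of the standardized structure
    data_source: str
    synonyms: list
    xrefs: list


@dataclass
class Compound:
    ops_id: int
    structure_key: str
    data_sources: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)


def _add_unique(target, items):
    """Append items not already present, preserving first-seen order."""
    for item in items:
        if item not in target:
            target.append(item)


def register(substances):
    """Group substances by standardized structure and merge their metadata."""
    compounds = {}
    for sub in substances:
        comp = compounds.get(sub.structure_key)
        if comp is None:
            # Assumption: OPS_IDs are handed out sequentially as new keys appear.
            comp = Compound(ops_id=len(compounds) + 1, structure_key=sub.structure_key)
            compounds[sub.structure_key] = comp
        _add_unique(comp.data_sources, [sub.data_source])
        _add_unique(comp.synonyms, sub.synonyms)
        _add_unique(comp.xrefs, sub.xrefs)
    return list(compounds.values())


compounds = register([
    Substance(1, "KEY-A", "DataSource1", ["Synonym1", "Synonym2"], ["XRef1"]),
    Substance(2, "KEY-A", "DataSource2", ["Synonym1", "Synonym3"], ["XRef2"]),
])
```

With these inputs a single compound results, holding Synonym1 once even though both substances supplied it.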
What do we use as the chemical identity of the standardized records (primary compound key)?
• Standard InChI/InChIKey (currently used in ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI
  • does not distinguish between undefined and unknown stereo
  • by default, does some basic tautomer canonicalization (not needed in the new model)
  • by default, assumes absolute stereo
Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to stereo description
• fixes mobile hydrogens (so tautomers can be distinguished)
• handles “AND-ed” relative stereo
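The drawback that an identifier string "can be too long" and "needs to be hashed" can be illustrated with a fixed-length digest. The sketch below uses SHA-256 from the Python standard library purely as an example; the function name is hypothetical, and the slides do not say which hash, if any, the registry would use. The same approach would apply to a non-standard InChI string.

```python
import hashlib


def identifier_key(identifier: str) -> str:
    """Return a fixed-length hex digest for a SMILES or InChI string.

    Assumption: SHA-256 is used only for illustration.
    """
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


short_key = identifier_key("CCO")
long_key = identifier_key("C" * 5000)  # an artificially long chain
```

Both keys have the same length regardless of input size, which is what makes a hash usable as a primary key. Note that the key is only as canonical as the string that was hashed.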