Topics in bioinformatics
CS697, Spring 2011
Class 12, Mar-22-2011
Molecular distance measurements
Molecular transformations
Rotation in 3D - matrices
Rotations in 3D: Euler angles
* Rotate the XYZ-system about the Z-axis by α. The X-axis now lies on the line of nodes.
* Rotate the XYZ-system again about the now rotated X-axis by β. The Z-axis is now in its final orientation, and the X-axis remains on the line of nodes.
* Rotate the XYZ-system a third time about the new Z-axis by γ.
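The three steps can be sketched as a product of elementary rotation matrices. This is a minimal sketch, assuming the z-x-z convention with rotations about the moving axes, angles in radians named alpha, beta and gamma, and column vectors; the function names are illustrative and not from the slides.

```python
import math

def rot_z(angle):
    """Rotation matrix about the Z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(angle):
    """Rotation matrix about the X-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def euler_zxz(alpha, beta, gamma):
    """Compose the three rotations: R = Rz(alpha) Rx(beta) Rz(gamma).

    Rotations about the moving axes compose by right-multiplication
    (assumed convention).
    """
    return matmul(matmul(rot_z(alpha), rot_x(beta)), rot_z(gamma))

def apply(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

# The final Z-axis depends only on alpha and beta, as the second step states.
R = euler_zxz(0.3, 0.7, 1.1)
final_z = apply(R, [0.0, 0.0, 1.0])
```

Other Euler conventions (for example z-y-z, or rotations about fixed axes) change the order and choice of the elementary matrices, so the convention should be checked before reusing angles from another source.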
Quaternions
Extension of complex numbers.
h = a + bi + cj + dk, with a, b, c, d real numbers.
i, j, k: imaginary components such that:
i^2 = j^2 = k^2 = -1
ij = k, jk = i, ki = j
ij = -ji, jk = -kj, ki = -ik
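The multiplication rules for i, j and k determine the product of any two quaternions. A minimal sketch, assuming a quaternion is stored as a tuple (a, b, c, d) for a + bi + cj + dk; the name qmul is illustrative.

```python
def qmul(p, q):
    """Hamilton product of two quaternions stored as (a, b, c, d)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )

I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

# The product is not commutative: ij = k but ji = -k.
ij = qmul(I, J)
ji = qmul(J, I)
```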
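Unit quaternions
Unit quaternions: $|h|^2 = a^2 + b^2 + c^2 + d^2 = 1$.
They form a group under quaternion multiplication.
A unit quaternion $h = a + bi + cj + dk$ can be mapped to the 3x3 rotation matrix:
$R = \begin{pmatrix} 1 - 2(c^2 + d^2) & 2(bc - ad) & 2(bd + ac) \\ 2(bc + ad) & 1 - 2(b^2 + d^2) & 2(cd - ab) \\ 2(bd - ac) & 2(cd + ab) & 1 - 2(b^2 + c^2) \end{pmatrix}$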
Quaternions as axis-angle rotation
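A rotation by an angle theta about a unit axis u corresponds to the unit quaternion q = (cos(theta/2), sin(theta/2) u), and a vector v is rotated by computing q v q*, with v embedded as a quaternion with zero real part. The sketch below illustrates this under those standard definitions; the function names are illustrative.

```python
import math

def qmul(p, q):
    """Hamilton product of two quaternions stored as (a, b, c, d)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def qconj(q):
    """Conjugate; for a unit quaternion this is also its inverse."""
    return (q[0], -q[1], -q[2], -q[3])

def axis_angle_quat(axis, theta):
    """Unit quaternion for a rotation by theta (radians) about axis."""
    norm = math.sqrt(sum(x * x for x in axis))
    s = math.sin(theta / 2.0) / norm  # also normalizes the axis
    return (math.cos(theta / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q via q v q*."""
    p = (0.0, v[0], v[1], v[2])
    r = qmul(qmul(q, p), qconj(q))
    return r[1:]

# A quarter turn about the Z-axis sends the X-axis to the Y-axis.
q = axis_angle_quat((0.0, 0.0, 1.0), math.pi / 2.0)
rotated = rotate(q, (1.0, 0.0, 0.0))
```

Composing two rotations is a single quaternion product, which is one reason quaternions are often preferred over Euler angles for chaining rotations.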
Molecular distance measurements
Measuring protein structure similarity
- Visual comparison
- Dihedral angle comparison
- Distance matrix
- RMSD (root mean square distance)
Given two shapes or structures A and B, we are interested in defining a distance, or similarity measure, between A and B.
Is the resulting distance (similarity measure) D a metric?
Triangle inequality: D(A,B) ≤ D(A,C) + D(C,B)
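For reference, the standard conditions for D to be a metric can be written out in full; the triangle inequality is the one that similarity measures most often fail. This is the textbook definition, stated here for all structures A, B and C.

```latex
\begin{align*}
  D(A,B) &\ge 0                 && \text{(non-negativity)} \\
  D(A,B) &= 0 \iff A = B        && \text{(identity)} \\
  D(A,B) &= D(B,A)              && \text{(symmetry)} \\
  D(A,B) &\le D(A,C) + D(C,B)   && \text{(triangle inequality)}
\end{align*}
```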
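Comparing dihedral angles
Torsion angles (φ, ψ) are:
- local by nature
- invariant upon rotation and translation of the molecule
- compact (O(n) angles for a protein of n residues)
But: adding 1 degree to all (φ, ψ) angles can change the overall 3D structure substantially, because small local changes accumulate along the chain.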
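A torsion angle is defined by four consecutive atoms. The sketch below computes it from Cartesian coordinates and can be used to check the invariance under translation; it assumes plain (x, y, z) tuples, returns degrees in (-180, 180], and its sign convention is an assumption that may differ from a given structure-analysis package.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 bond for atoms p0..p3."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / n, b1[1] / n, b1[2] / n)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
angle = dihedral(*atoms)

# Translating every atom leaves the torsion angle unchanged.
shifted = [(x + 5.0, y - 2.0, z + 7.0) for x, y, z in atoms]
angle_shifted = dihedral(*shifted)
```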
Internal distance matrix
[Figure: four points labeled 1 to 4, with example inter-point distances 6.0, 8.1 and 5.9]
Internal distance matrix (2)
Advantages
- invariant with respect to rotation and translation
- can be used to compare proteins
Disadvantages
- the distance matrix is O(n^2) for a protein with n residues
- comparing distance matrices is a hard problem
- insensitive to chirality
Quality Assessment through RMSD
RMSD: Root Mean Squared Deviation
- one of the simplest measures to quantify how different two protein conformations are
- advantage: simple to compute by representing conformations as 3N vectors (N atoms)
- limitations:
  1) limited to conformations of the same protein chain
  2) an atom-atom correspondence is needed for chains of different length
  3) not very descriptive if changes are localized
RMSD: Average atomic distance
- given two conformations of a chain of N atoms
- represent the conformations as two 3N vectors x and y
- RMSD(x,y) is the Euclidean distance between x and y, averaged over the N atoms: $\mathrm{RMSD}(x,y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \|x_i - y_i\|^2}$, where $x_i$ and $y_i$ are the positions of atom i
lRMSD: least RMSD
- the same conformation rigidly transformed in space (translated or rotated) should give an RMSD of 0
- before computing RMSD(x,y), one needs to remove changes due to rigid-body transformations
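A minimal sketch of the plain RMSD, assuming each conformation is a list of (x, y, z) tuples in the same atom order. The second call shows why the rigid-body part has to be removed first: a pure translation of the same conformation already gives a nonzero value.

```python
import math

def rmsd(x, y):
    """Root mean squared deviation between two conformations.

    x and y are equal-length lists of (x, y, z) atom positions with a
    one-to-one atom correspondence; no superposition is performed.
    """
    if len(x) != len(y):
        raise ValueError("conformations must have the same number of atoms")
    total = sum((a - b) ** 2 for p, q in zip(x, y) for a, b in zip(p, q))
    return math.sqrt(total / len(x))

conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(conf_a, conf_b)

# Same shape, shifted by one unit along X: RMSD is 1, not 0.
translated = [(px + 1.0, py, pz) for px, py, pz in conf_a]
value_translated = rmsd(conf_a, translated)
```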
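Protein Structure Superposition
A rigid-body transformation T is a combination of a translation t and a rotation R: T(x) = Rx + t
The quantity to be minimized is:
$E = \frac{1}{N} \sum_{i=1}^{N} \| R a_i + t - b_i \|^2$
where $a_i$ and $b_i$ are the corresponding points of the two point sets A and B.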
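The translation part
E is minimum with respect to t when:
$\frac{\partial E}{\partial t} = \frac{2}{N} \sum_{i=1}^{N} (R a_i + t - b_i) = 0$
Then:
$t = \bar{b} - R \bar{a}$, where $\bar{a} = \frac{1}{N}\sum_{i} a_i$ and $\bar{b} = \frac{1}{N}\sum_{i} b_i$ are the centroids of A and B.
If both data sets A and B have been centered on 0, then t = 0!
Step 1: Translate point sets A and B such that their centroids coincide at the origin of the framework.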
The rotation part
Let $\bar{a}$ and $\bar{b}$ be the centroids of the two point sets, and A and B the 3xN matrices containing the coordinates of the points centered on 0 (column i of A is $a_i - \bar{a}$, column i of B is $b_i - \bar{b}$).
Build covariance matrix:
$C = A B^T$ (3x3 = 3xN times Nx3)
Compute SVD (Singular Value Decomposition) of C:
$C = U D V^T$
U and V are orthogonal matrices, and D is a diagonal matrix containing the singular values. U, V and D are 3x3 matrices.
Define S by:
$S = I$ if $\det(C) > 0$, and $S = \mathrm{diag}(1, 1, -1)$ otherwise.
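The algorithm
1. Center the two point sets A and B
2. Build covariance matrix: $C = A B^T$ (A and B are the 3xN matrices of centered coordinates)
3. Compute SVD (Singular Value Decomposition) of C: $C = U D V^T$
4. Define S: $S = I$ if $\det(C) > 0$, and $S = \mathrm{diag}(1, 1, -1)$ otherwise
5. Compute rotation matrix: $R = V S U^T$
6. Compute lRMSD: $\mathrm{lRMSD} = \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N} \|a_i\|^2 + \sum_{i=1}^{N} \|b_i\|^2 - 2\sum_{k=1}^{3} d_k s_k\right)}$, where $a_i$ and $b_i$ are the centered points and $d_k$, $s_k$ are the diagonal entries of D and S
O(N) in time!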
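The six steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: points are rows of (N, 3) arrays, the rotation acts as R a + t on points of the first set, the covariance matrix is built as the 3xN centered matrix of the first set times the transpose of the second, and the name kabsch is illustrative.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares superposition of P onto Q.

    P, Q: (N, 3) arrays of corresponding points.
    Returns (R, t, lrmsd) such that R @ p + t best matches q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = P.shape[0]

    # 1. Center both point sets on the origin.
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    A = (P - cp).T  # 3 x N
    B = (Q - cq).T  # 3 x N

    # 2. Covariance matrix (3x3 = 3xN times Nx3).
    C = A @ B.T

    # 3. SVD: C = U diag(d) V^T.
    U, d, Vt = np.linalg.svd(C)

    # 4. S avoids an improper rotation (reflection). det(V U^T) has the
    #    same sign as det(C) whenever C is non-singular.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    s = np.array([1.0, 1.0, sign])

    # 5. Rotation matrix, then the translation that matches the centroids.
    R = Vt.T @ np.diag(s) @ U.T
    t = cq - R @ cp

    # 6. lRMSD from the singular values, without transforming the points.
    e = (A ** 2).sum() + (B ** 2).sum() - 2.0 * (d * s).sum()
    lrmsd = np.sqrt(max(e, 0.0) / n)  # clamp tiny negative round-off
    return R, t, lrmsd
```

The work that grows with N is the centering and the 3xN by Nx3 product; the SVD is always on a 3x3 matrix, which is why the method is linear in the number of atoms.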
Some reading material
http://cnx.org/content/m11608/latest/#RMSD
lRMSD has Shortcomings
- lRMSD cannot capture localized changes: if a small perturbation occurs in a part of the structure, e.g. rotation of a hinge connecting two domains, lRMSD will report a large value
- Main reason: lRMSD does not know how to attribute changes to specific atoms of the chain
- lRMSD distributes change equally (through the averaging) to all atoms in a protein chain
- Measuring conformational similarity is an active research area
Other Quality Assessment: Shape Similarity
- Sometimes assessment of cavities on the surface of a protein is more important than description of the rest of the structure, especially when the goal is prediction of a binding site rather than of the entire structure (which can be thought of as a scaffold)
- Methods that assess surface area and solvent accessible surface area, compute volumes, and detect cavities on proteins are very important in the context of binding and docking
- Model each atom as a van der Waals (vdW) sphere, the union of which gives the molecular surface
- Not all of the molecular surface is accessible to solvent. Rolling a solvent ball over the vdW spheres traces out the solvent accessible surface area (SASA)
- SASA is important to quantitatively determine interactions of the protein
Solvent-Accessible Surface Area (SASA)
- Computational geometry methods that use Delaunay triangulations and alpha shapes assess SASA and other geometric descriptors of molecular surfaces, volumes, and cavities
- We will come back to this topic in the context of molecular docking
- Further reading about shape computing at http://cnx.org/content/m11616/latest/
Figure captions: SASA for a 1.4 Å ball; SASA for a 1.5 Å ball. In this example, increasing the probe radius reduces the SASA because there are more cavities that a bulkier ball cannot penetrate.
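The rolling-ball idea can be approximated numerically without any triangulation. The sketch below is not the Delaunay or alpha-shape approach named above; it is a simple point-sampling scheme in the style of Shrake and Rupley, swapped in because it fits in a few lines. Atoms are assumed to be (x, y, z, radius) tuples in Å, and the function names and default values are illustrative.

```python
import math

def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return pts

def sasa(atoms, probe=1.4, n_points=500):
    """Approximate solvent accessible surface area of a set of atoms.

    Each atom sphere is expanded by the probe radius; a sample point on it
    counts as accessible if it lies outside every other expanded sphere.
    """
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * big_r * big_r * exposed / n_points
    return total

isolated = sasa([(0.0, 0.0, 0.0, 1.8)])
pair = sasa([(0.0, 0.0, 0.0, 1.8), (2.0, 0.0, 0.0, 1.8)])
```

The cost is quadratic in the number of atoms as written; practical implementations restrict the inner test to neighboring atoms.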
Ultrafast Shape Recognition (USR)
- Drug design
- Screening a number of potential compounds
- Find a set of molecules which closely resemble a lead molecule from a HUGE database
- Shape similarity may indicate similar binding properties and similar activity
Overview
- Efficient global comparison of molecular shapes
- The molecules are represented as feature vectors, representing the relative positions of the atoms
- Does not require alignment of the molecules
- Suitable for large database search
Feature vector representation
- The shape of a molecule is uniquely determined by the relative positions of the atoms, which are in turn determined by the inter-atomic distances
- The set of distances can be constrained due to forces that hold the atoms together
Strategic feature points
The molecule is described as 4 sets of atomic distance distributions from feature points:
- Center of mass (ctd)
- Point closest to ctd (cst)
- Point farthest from ctd (fct)
- Point farthest from fct (ftf)
The moments of the distributions are calculated and stored as a feature vector.
They give an estimate of the size, compactness and symmetry of the molecule.
Feature vector for a molecule
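A sketch of how such a feature vector could be assembled from the four reference points. Several details are assumptions rather than content of the slides: the unweighted centroid stands in for the center of mass, three moments are kept per distribution (mean, variance, and the signed cube root of the third central moment as a skewness measure), which gives 12 numbers, and the similarity score is modeled on the published USR method. All function names are illustrative.

```python
import math

def _moments(values):
    """Mean, variance and a skewness measure of a list of distances."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, var, skew]

def usr_descriptors(coords):
    """12-number shape descriptor from a list of (x, y, z) atom positions."""
    n = len(coords)
    # ctd: centroid (equal atomic weights assumed).
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to ctd
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from ctd
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        # 4 reference points x N atoms = 4N distances in total.
        descriptor += _moments([math.dist(c, ref) for c in coords])
    return descriptor

def usr_similarity(d1, d2):
    """Score in (0, 1]; 1 means identical descriptors (assumed form)."""
    diff = sum(abs(a - b) for a, b in zip(d1, d2)) / len(d1)
    return 1.0 / (1.0 + diff)

molecule = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0),
            (2.9, 1.4, 0.8), (0.2, 2.1, 1.7)]
desc = usr_descriptors(molecule)

# No alignment is needed: a translated copy has the same descriptor.
moved = [(x + 10.0, y - 5.0, z + 3.0) for x, y, z in molecule]
score = usr_similarity(desc, usr_descriptors(moved))
```

Because each molecule reduces to a short fixed-length vector, a database search only compares vectors, which is what makes the approach suitable for very large libraries.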
Advantages and disadvantages
- Extremely fast due to calculation of only 4N distances and distributions
- Very sensitive to small changes in the molecule shape
- Does not directly account for chemical interactions and atom types
Quality Assessment through LGA
Local-Global Alignment (LGA), introduced by Adam Zemla in 2003, is used in CASP as a more accurate similarity assessment than lRMSD.
LGA generates many different local superpositions to find regions where two conformations are similar: it combines longest continuous segment (LCS) and global distance test (GDT) to find local and global similarities.
- LCS superimposes the longest segments that fit under a selected RMSD cutoff
- GDT complements evaluations made with LCS by searching for the largest (not necessarily continuous) set of equivalent residues that deviate by no more than a specified distance cutoff
Further reading: Zemla A., "LGA - a Method for Finding 3D Similarities in Protein Structures", Nucleic Acids Research, 2003, Vol. 31, No. 13, pp. 3370-3374
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=12824330
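The distance-cutoff idea behind GDT can be illustrated with a much simplified sketch. It is not the LGA procedure: LGA searches over many superpositions, whereas this code assumes the two conformations are already superimposed and simply counts residues within each cutoff. The cutoffs of 1, 2, 4 and 8 Å follow the usual GDT_TS definition; the function names are illustrative.

```python
import math

def fraction_within(x, y, cutoff):
    """Fraction of corresponding residues closer than cutoff (in Å)."""
    close = sum(1 for a, b in zip(x, y) if math.dist(a, b) <= cutoff)
    return close / len(x)

def gdt_ts_fixed(x, y, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-like score (0 to 100) for one fixed superposition.

    x and y are equal-length lists of (x, y, z) positions, one per
    residue (for example C-alpha atoms), already superimposed.
    """
    fractions = [fraction_within(x, y, c) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

model = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.5), (10.0, 0.0, 1.5), (20.0, 0.0, 3.0), (30.0, 0.0, 10.0)]
score = gdt_ts_fixed(model, target)
```

Unlike lRMSD, a single badly placed residue (the last one here) lowers the score by a bounded amount instead of dominating an average of squared distances.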