8/14/2019 IR expert
0022-4766/03/4405-763 $25.00 2003 Plenum Publishing Corporation 763
Journal of Structural Chemistry, Vol. 44, No. 5, pp. 763-770, 2003
Original Russian Text Copyright 2003 by V. N. Piottukh-Peletskii, K. S. Chmutina, and M. V. Korolevich
IR EXPERT: A NEW TYPE OF IR SPECTROSCOPY
INFORMATION SYSTEM FOR SOLVING
SPECTRUM-STRUCTURE PROBLEMS
V. N. Piottukh-Peletskii,¹ K. S. Chmutina,¹ and M. V. Korolevich²
UDC 543.42+681.3
This paper discusses the capabilities that the new information system IR EXPERT offers to chemists and
spectroscopists. Examples of spectrum-structure problems solved with the system are discussed,
namely, generation of structural hypotheses from the IR spectrum of a compound, verification of
these hypotheses, and construction of empirical models of IR spectra from the structure of the
compound.
Key words: IR spectroscopy, structure elucidation, structure-property relationship, analysis of organic
compounds, spectrum simulation.
IR spectroscopy is one of the most popular methods for analyzing organic compounds in modern practice because of
the relatively low cost of IR spectrum measurements and the modest requirements it places on the sample. Spectrum
processing generally follows one of two routes. In the first, the spectrum is compared with an IR spectroscopy database (DB),
nowadays often supplied together with the instrument. The compound is considered identified if the search returns the given
spectrum or a very closely related one. If it is known beforehand that the spectrum is missing from the DB, or if the search
does not find it, the common practice is visual analysis of the spectrum using correlation tables and relying on the personal
experience of the spectroscopist. Unfortunately, correlation tables give very limited data on spectrum-structure correlations,
and these data often cannot be interpreted unambiguously. In many cases, the higher the complexity of the compound, the
lower the reliability of its structure elucidation from the IR spectrum and correlation tables.
An alternative approach is searching the DB for close spectral analogs of the compound and their analysis [1].
Experience in handling IR spectra, as well as special investigations [2-4], showed that spectral similarity (which is often
known to be lower than in the case of an identical spectrum search) entails structural similarity. The structural analogs of
the compound selected in this way are rather valuable hints in solving structure elucidation problems. A number of
publications (e.g., [2, 4, 5]) have reported on attempts at computer-aided analysis of the selected structural analogs to reveal
the fragments of the structural formula hypothetically present in the compound under study. Useful structures, however,
may be output along with structures whose spectrum matches the spectrum under analysis accidentally, which is a serious
complicating factor. The researcher often finds it difficult to classify the structure as useful in view of the combinatorial
complexity of manual or software-based comparison of the output structures.
The lack of techniques for quantitative (statistical) evaluation of the reliability of the results is another challenge to
both IR and some other types of molecular spectroscopy. The systems described in the literature permit structure elucidation
¹N. N. Vorozhtsov Novosibirsk Institute of Organic Chemistry, Siberian Branch, Russian Academy of Sciences; [email protected]. ²B. I. Stepanov Institute of Physics, National Academy of Sciences, Belarus. Translated from Zhurnal Strukturnoi Khimii, Vol. 44, No. 5, pp. 835-842, September-October, 2003. Original article submitted August 20, 2002.
only at the level of individual illustrative problems, which precludes the use of statistical estimates to evaluate the efficiency
of approaches or the reliability of solutions. One of the hindrances to applying statistical evaluation procedures is the intrinsic
complexity of organic compounds. Statistical estimates are inapplicable if the compound is viewed as unique and indivisible. If,
however, it is regarded as describable by a set of independent structural characteristics, the question arises of whether the
given set of characteristics is correct and applicable to (or generally significant for) all DB structures.
In spite of all these difficulties, efforts are undertaken to develop means for qualitative computer-aided analysis of
organic compounds by IR spectra [6, 7]. The main tendency is a transition from strict structure identification problems for
compounds whose spectra are available in the DB to a wider and more productive class of problems arising in analyzing the
spectral and structural analogs.
The aim of this work is to demonstrate the applicability of the new tool in analytical practice, whereby the principles
of spectral and structural analogy in IR spectroscopy are used for structure elucidation of compounds by their IR spectra and
for predicting their spectral properties.
EXPERIMENTAL
All experiments were run with a DB containing ~32,000 IR spectra and corresponding organic structures. The
spectra are given by a full spectral curve and in a descriptor representation specifying the positions and intensities of
absorption bands. The full spectral curve is used for finding the spectra closest in the Euclidean metric and for displaying
and printing spectra. The descriptor representation is used exclusively for finding close spectral analogs. Structural
formulas of organic compounds are represented in the DB by molecular graphs. For each molecular graph, the graph
decomposition program constructs a full set of nonisomorphic connected fragments with 2-7 vertices, in which hydrogen
atoms are not encoded [5]. In the course of the construction, each fragment that is new to the DB is assigned the next
registration number. The structure of the compound is thus described by a list of the registration numbers of its fragments.
The list is regarded as an internal structure representation equivalent to a binary vector of dimensionality M, where M is the
number of fragments registered in the DB, and each vector component takes the value 1 or 0, depending on whether the
corresponding fragment is present in the structure. The number of different fragments registered in the DB of IR EXPERT is ~108,000.
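The registration scheme described above can be illustrated with a minimal Python sketch. It is not the graph decomposition program of [5]: the fragment keys are hypothetical strings standing in for canonical codes of nonisomorphic fragments, and the class and method names are assumptions introduced only for illustration.

```python
from typing import Dict, Iterable, List


class FragmentRegistry:
    """Assigns consecutive registration numbers to fragments as they first appear."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def register(self, fragment_key: str) -> int:
        # A fragment that is new to the registry receives the next free number.
        if fragment_key not in self._ids:
            self._ids[fragment_key] = len(self._ids) + 1
        return self._ids[fragment_key]

    def encode(self, fragment_keys: Iterable[str]) -> List[int]:
        # A structure is stored as the sorted list of its fragment numbers.
        return sorted({self.register(key) for key in fragment_keys})

    def to_binary_vector(self, fragment_ids: Iterable[int]) -> List[int]:
        # Equivalent dense form: component i-1 is 1 if fragment number i is present.
        present = set(fragment_ids)
        return [1 if i in present else 0 for i in range(1, len(self._ids) + 1)]


# Hypothetical fragment keys; a real system would derive them from molecular graphs.
registry = FragmentRegistry()
acetamide = registry.encode(["C-C", "C=O", "C-N", "C-C=O", "N-C=O", "C-C(=O)-N"])
acetone = registry.encode(["C-C", "C=O", "C-C=O", "C-C-C", "C-C(=O)-C"])
print(acetamide, acetone, registry.to_binary_vector(acetone))
```

Storing the sparse list rather than the dense vector is the natural choice here, since a structure contains only a small share of the ~108,000 registered fragments.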
The distance between the structures (the degree of their similarity) is estimated using the relation
$$R = 1 - \frac{2F_{ab}}{F_a + F_b}, \tag{1}$$
where $F_{ab}$ is the number of common fragments in structures a and b; $F_a$ and $F_b$ are the numbers of fragments in each structure
compared.
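As a small worked sketch, relation (1) can be evaluated directly on fragment lists. The code reads the relation as R = 1 - 2F_ab/(F_a + F_b); the function name and the treatment of two empty lists are assumptions made for illustration.

```python
from typing import Iterable


def structural_distance(fragments_a: Iterable[int], fragments_b: Iterable[int]) -> float:
    """Distance R = 1 - 2*F_ab / (F_a + F_b) between two fragment lists."""
    a, b = set(fragments_a), set(fragments_b)
    if not a and not b:
        return 0.0  # assumption: two empty descriptions are treated as identical
    f_ab = len(a & b)  # number of fragments common to both structures
    return 1.0 - 2.0 * f_ab / (len(a) + len(b))


# Two hypothetical structures sharing 3 fragments out of 6 and 5, respectively.
print(structural_distance([1, 2, 3, 4, 5, 6], [1, 2, 4, 7, 8]))
```

With this reading, R is 0 for identical fragment compositions and 1 for structures with no fragments in common.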
Two algorithms are employed to evaluate the distance between IR spectra: comparison of spectral curves in the
Euclidean metric, and comparison of descriptor representations in the metric developed earlier in [10], which reflects the
degree of coincidence between the positions and intensities of absorption bands, S = (A + M + P)/3. Here A is the component
corresponding to the degree of coincidence between the intensities, P is responsible for the degree of coincidence between
band positions, and M reflects the degree of coincidence between the most intense bands.
The software components of the system are linked within the IR EXPERT shell, implemented under Windows for PC. The
database components are implemented with the VIRTA system [9]. This choice was dictated by previous experience with the
system, its efficient handling of variable-length entries, the compactness of the DB, and ease of programming. Systems using
SQL technology are less suitable in this case, because a typical problem to be solved is searching for k close analogs, which
is not very efficient within the framework of SQL queries.
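The search for k close analogs mentioned above can be sketched as a plain linear scan with a bounded selection. This is only an illustration under stated assumptions: spectra are taken to be sampled on a common grid, the library is a hypothetical in-memory dictionary, and nothing here reflects how VIRTA stores or scans its entries.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    # Both curves must be sampled at the same wavenumber points.
    if len(x) != len(y):
        raise ValueError("spectra must be sampled on the same grid")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def k_closest(query: Sequence[float],
              library: Dict[str, Sequence[float]],
              k: int) -> List[Tuple[float, str]]:
    # Score every library spectrum, then keep the k smallest distances.
    scored = ((euclidean_distance(query, curve), name) for name, curve in library.items())
    return heapq.nsmallest(k, scored)


# Hypothetical four-point "spectra" used only to exercise the functions.
library = {"a": [0.0, 0.0, 1.0, 0.0], "b": [0.0, 1.0, 1.0, 0.0], "c": [1.0, 1.0, 1.0, 1.0]}
print(k_closest([0.0, 0.0, 1.0, 0.0], library, k=2))
```

The query is inherently a ranking over all entries by a computed distance, which is why it maps poorly onto ordinary indexed lookups.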
RESULTS AND DISCUSSION
Relationships between the structure of organic compounds and IR spectral data are revealed using the DB by:
analyzing structural characteristics of the compounds whose spectra have some features in common;
analyzing spectral features reflecting certain structural peculiarities of the compounds.
These methods have previously been investigated using an approach based on the principle of spectral and structural analogy
and structure description by full fragment compositions [5, 6, 10]. The success of those studies allowed us to proceed to
the development of the IR EXPERT software complex, designed to help the researcher solve spectrum-structure
problems by IR spectroscopy.
Let us consider the application of this software complex to structure elucidation of organic compounds from spectral data.
GENERATING STRUCTURAL HYPOTHESES
The initial stage of structure elucidation consists in generating hypotheses about a compound, followed by refining
and verifying them in accordance with the above two methods of establishing structure-spectrum relationships.
The first stage is a search of the IR spectroscopy DB for spectral analogs of the compound. The desired (target)
spectrum is given either by a spectral curve (e.g., the curve recorded on an instrument) or by
the positions and intensities of absorption bands. The characteristics of the sample and spectrum are then specified, i.e.,
whether the sample is pure or contains impurities, and whether the spectrum is complete or covers only
a certain spectral range (Fig. 1). The search result (SR) depends strongly on these conditions, as well as on the tolerances for
Fig. 1. Query for IR spectral analog search (a), list of selected spectra (b), and comparison of one of the selected spectra with the desired spectrum (c).
Fig. 2. Fragment composition of one of the structures of the search result (a) and estimated nonaccidental occurrence factors for separate fragments met among ten SR structures (b).
possible deviations in band positions and intensities, on the coincidence threshold used to discard spectra with weak similarity,
and on the subset of the database in which the search is conducted (usually the whole DB is searched).
The typical search time throughout the DB (300 MHz Pentium Celeron) is ~40 s.
The search procedure compares the target spectrum with each DB spectrum and calculates a match factor.
The results with a match factor higher than the given threshold are then ranked and output (Fig. 1b). The user analyzes the
match between a particular spectrum and the target spectrum (represented by a thin line in Fig. 1c). Statistical data suggest
that for match factors larger than 350, the selected compounds coincide or show similarity in their structural characteristics
[10], which is probably true for the case considered as an example.
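The thresholding and ranking step can be sketched as follows. The component scores A, M, and P are taken as given inputs, since their computation is defined in [10] and is not reproduced here; the numeric values, the assumed 0-1000 scale, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Components = Tuple[float, float, float]  # (A, M, P), assumed to lie on a 0-1000 scale


def match_factor(a: float, m: float, p: float) -> float:
    # S = (A + M + P) / 3, as given for the descriptor metric.
    return (a + m + p) / 3.0


def rank_hits(scores: Dict[str, Components], threshold: float) -> List[Tuple[str, float]]:
    # Keep spectra whose match factor exceeds the threshold, best match first.
    hits = [(name, match_factor(*amp)) for name, amp in scores.items()]
    hits = [hit for hit in hits if hit[1] > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical component scores for three DB spectra.
scores = {
    "spec_1": (900.0, 800.0, 700.0),
    "spec_2": (300.0, 200.0, 400.0),
    "spec_3": (500.0, 400.0, 450.0),
}
print(rank_hits(scores, threshold=350.0))
```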
A distinctive feature of the IR EXPERT system is the possibility of visualizing and analyzing the fragment composition of the
compounds selected from the DB along with their structural formulas. Owing to this, the structural hypothesis for
the compound may be compared with the suggested fragment composition output by the system. Figure 2a shows a window
for visualization of the structural formulas of the search result (left) and the fragment composition of the corresponding
compound. If the compound is not identified, or if the identification is not quite reliable, the set of structures selected by
the search procedure (e.g., the first ten structures) can be evaluated from the viewpoint of
the nonaccidental occurrence of fragments in them.
A nonaccidental occurrence factor is the function of the frequency of a given fragment occurring both in the search
result and in the structural formulas of all DB compounds [11, 12]. For the former, the estimate refers to a particular result of
searching (Fig. 2b); along with other statistical estimates, it must be treated with care. There may be cases where, for an
inherently false fragment (i.e., one which is missing in the compound under analysis), the nonaccidental occurrence factor is
close to 1 (in view of the limited accuracy of representation of numbers, the program outputs 1.0).
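The exact form of the nonaccidental occurrence factor is given in [11, 12] and is not reproduced in this text. As a stand-in that uses the same two inputs (frequency in the search result and frequency in the DB), the sketch below takes one minus the binomial probability of seeing the fragment at least that often by chance. The binomial model and all names are assumptions, not the published definition.

```python
from math import comb


def nonaccidental_factor(hits_in_result: int, result_size: int,
                         db_count: int, db_size: int) -> float:
    """1 - P(at least `hits_in_result` occurrences arise by chance).

    Assumed model: each of the `result_size` selected structures contains the
    fragment independently with probability p = db_count / db_size.
    """
    if not 0 <= hits_in_result <= result_size:
        raise ValueError("hits_in_result must lie between 0 and result_size")
    p = db_count / db_size
    chance = sum(
        comb(result_size, j) * p ** j * (1 - p) ** (result_size - j)
        for j in range(hits_in_result, result_size + 1)
    )
    return 1.0 - chance


# A fragment found in 196 of 32,000 DB structures and in 8 of 10 selected structures.
print(nonaccidental_factor(8, 10, 196, 32000))
```

Under this model a rare fragment that recurs in most of the selected structures gives a value indistinguishable from 1 at ordinary output precision, which agrees with the remark that such estimates must be treated with care.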
After the search is completed, the user obtains full information about the selected spectra and structures, as well as
the list of structural fragments that are most likely to occur in the compound. This set of data permits one to reliably identify
the compound even if the spectrum under analysis and a spectrum selected from the database differ appreciably.
If the procedure has failed to identify the compound, the spectroscopist employs the structure generation program [5,
13], which uses additional information about the molecular formula of the compound (available, for example, from
elemental analysis or mass spectrometry data) together with the selected fragments to make up a list of possible structural
hypotheses for the compound.
Also note that the nonaccidental occurrence parameter (given a certain threshold) is used to evaluate the probability
and reliability of recognition of particular fragments when analyzing the results of the gliding search of the DB, in which
the search queries are spectra selected sequentially from the DB [5, 14]. The results of this experiment do not characterize a separate
search; rather, they characterize the spectrum-structure relationships inherent in IR spectroscopy within the framework of the
methods chosen for describing and comparing the spectra and structural formulas of the compounds.
VERIFICATION OF STRUCTURAL HYPOTHESES
In a typical case, the analyst (spectroscopist) has a hypothesis about the structure of a compound or a list of
hypothetical structures. This calls for verification of the correspondence between the suggested structure and the available IR
spectrum of the compound.
Several solutions to this problem were investigated within the framework of the IR EXPERT project. A solution
based on covering a molecular graph with fragments revealed via the spectral search procedure is described in [15].
For applied research, the greatest interest probably lies in simulating the IR spectrum of a hypothetical
structure using the database and in revealing (or verifying) particular spectrum-structure correlations.
As shown in [7, 16, 17], the spectra of the closest structural analogs might be used for spectrum simulations of
compounds with a specified structure if the degree of structural similarity between DB objects and a given object is high
enough. IR EXPERT permits one to find the closest structural analogs using the fragment composition match factor as a
similarity criterion for (I).
Figure 3 gives an example of searching analogs for the given structure (I). As a result, the user obtains a list of the
closest structural analogs with similarity estimates (Fig. 3a).
Based on the list of the analogs satisfying the given threshold of structural similarity, we can construct an empirical
model of the IR spectrum [10, 17] of the compound being tested. In Fig. 3b, the model is represented in descriptor form with
each predicted absorption band corresponding to a vertical bar. The model may be compared with the available spectrum
(represented by a spectral curve) of the compound for verification of the structural hypothesis and for preliminarily discarding
unlikely hypotheses, for example, those constructed by a structure generator.
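One simple way to picture such an empirical model is band voting over the descriptor spectra of the selected analogs. The sketch below is an assumption-laden illustration, not the procedure of [10, 17]: the bin width, the minimum share of analogs that must show a band, and the averaging of positions and intensities are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Band = Tuple[float, float]  # (position in cm^-1, relative intensity in 0..1)


def model_spectrum(analogs: List[List[Band]], bin_width: float = 20.0,
                   min_share: float = 0.6) -> List[Band]:
    """Predict a band wherever a sufficient share of analogs shows one."""
    if not analogs:
        return []
    members: Dict[int, List[Band]] = defaultdict(list)
    voters: Dict[int, Set[int]] = defaultdict(set)
    for index, bands in enumerate(analogs):
        for position, intensity in bands:
            bin_id = int(position // bin_width)
            members[bin_id].append((position, intensity))
            voters[bin_id].add(index)  # each analog votes once per bin
    model: List[Band] = []
    for bin_id in sorted(members):
        if len(voters[bin_id]) / len(analogs) >= min_share:
            positions = [p for p, _ in members[bin_id]]
            intensities = [i for _, i in members[bin_id]]
            model.append((sum(positions) / len(positions),
                          sum(intensities) / len(intensities)))
    return model


# Hypothetical descriptor spectra of three structural analogs.
analogs = [
    [(1712.0, 0.9), (1385.0, 0.4)],
    [(1718.0, 1.0), (720.0, 0.5)],
    [(1705.0, 0.8), (1390.0, 0.3), (2950.0, 0.2)],
]
print(model_spectrum(analogs))
```

Each predicted band corresponds to one vertical bar of a descriptor-form model; bands seen in only one analog are dropped as unsupported.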
It is often necessary to clarify the spectral behavior of a particular fragment or functional group in a given environment.
Particular spectrum-structure correlations may be constructed by examining the common spectral features of the structural
analogs found that contain a particular fragment or list of fragments. In this case we can first
evaluate the number of compounds containing the smaller fragments incorporated in the fragment under analysis. Figure 4 illustrates
this by giving part of a tree of fragments with 2 to 7 vertices generated from four two-vertex fragments: CC, C=O, CN, and
the aromatic CC bond. Near each fragment, the figure gives the number of DB structures containing that fragment.
Fig. 3. Selection of the closest structural analogs (a) and construction of the model spectrum of the compound on their basis (b).
Analysis of the form and frequency of occurrence of fragments (Fig. 4a) allows the researcher to select the most
interesting samples. If there are sufficiently many structures with such fragments, one can proceed to examine their spectral
features.
As an example we give the results obtained by the spectrum and structure analysis for compounds containing a
phthalimide fragment (the middle fragment in the top row, whose frequency of occurrence in DB structures is 196). For all
structures with this fragment, we construct an averaged IR spectrum to reveal the spectral features of the fragment. The averaged
spectrum obtained from full spectral curves is given in Fig. 4c. A significant spectral correlation has been revealed only for
frequencies close to 1700-1800, 1400, and 700 cm⁻¹. Details are obtained from the histogram of the frequencies of occurrence
of absorption bands constructed for descriptor-represented spectra from the given sample.
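The two summaries used here, an averaged curve and a histogram of band occurrence, can be sketched in a few lines. The code assumes that all curves share one wavenumber grid and that each spectrum counts at most once per histogram bin; the 50 cm⁻¹ bin width and the sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def averaged_spectrum(curves: Sequence[Sequence[float]]) -> List[float]:
    # Point-by-point mean of curves sampled on the same wavenumber grid.
    if not curves:
        raise ValueError("at least one spectral curve is required")
    if any(len(curve) != len(curves[0]) for curve in curves):
        raise ValueError("all curves must share the same wavenumber grid")
    return [sum(points) / len(curves) for points in zip(*curves)]


def band_histogram(spectra_bands: Sequence[Sequence[float]],
                   bin_width: int = 50) -> Dict[int, int]:
    # For each bin (keyed by its lower edge), count the spectra with a band in it.
    counts: Counter = Counter()
    for positions in spectra_bands:
        bins = {int(position // bin_width) * bin_width for position in positions}
        counts.update(bins)
    return dict(counts)


# Hypothetical band positions (cm^-1) for three spectra of the sample.
bands = [[1775.0, 1715.0, 1390.0, 720.0], [1770.0, 1710.0, 715.0], [1720.0, 1395.0]]
print(averaged_spectrum([[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]]))
print(band_histogram(bands))
```

The averaged curve shows where absorption is common across the sample, while the histogram indicates how many individual spectra actually contribute a band in each region.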
Note that care should be taken when analyzing structure-spectrum correlations by IR spectrum processing methods
of this kind and in drawing conclusions similar to those in [18, 19]. The main hindrance to revealing significant spectrum-
structure correlations is probably the fact that the fragments occurring in structures are mutually correlated [5]. It is difficult
to account for and eliminate this effect, because this would require artificially constructing a sample of structures in which
all fragments are uniformly represented.
Thus the developed software tools permit the analyst (spectroscopist, researcher) to solve the following main
problems:
search for spectral analogs;
analysis of the structures selected by the search procedure;
generation of lists of most probable structural fragments of the compound under study;
Fig. 4. Selection of a characteristic fragment (a, b) and construction of the corresponding spectral response (c).
IR spectrum simulation of hypothetical structures to compare the simulated spectra with the spectrum of the
compound;
statistical analysis to reveal the spectral characteristics typical of the fragments of the compound.
Hopefully, the described version of the IR EXPERT system will prove useful in both scientific and applied aspects in dealing
with qualitative analysis problems for elucidating structures of the known and new organic compounds. The software