IR expert


  • 8/14/2019 IR expert


    0022-4766/03/4405-763 $25.00 2003 Plenum Publishing Corporation 763

    Journal of Structural Chemistry. Vol. 44, No. 5, pp. 763-770, 2003
    Original Russian Text Copyright 2003 by V. N. Piottukh-Peletskii, K. S. Chmutina, and M. V. Korolevich

    IR EXPERT: A NEW TYPE OF IR SPECTROSCOPY INFORMATION SYSTEM FOR SOLVING SPECTRUM-STRUCTURE PROBLEMS

    V. N. Piottukh-Peletskii,1 K. S. Chmutina,1 and M. V. Korolevich2

    UDC 543.42+681.3

    This paper discusses the possibilities which the new information system IR EXPERT offers to chemists and

    spectroscopists. Examples of spectrum-structure problems solved with the system are discussed,

    namely, generation of structural hypotheses based on the IR spectrum of the compound, verification of

    these hypotheses, and construction of empirical models of IR spectra based on the structure of the

    compound.

    Key words: IR spectroscopy, structure elucidation, structure-property relationship, analysis of organic

    compounds, spectrum simulation.

    IR spectroscopy is one of the most popular methods for analyzing organic compounds in modern practice because of

    the relatively low cost of IR spectrum measurements and the moderate requirements on the properties of the sample. Spectrum

    processing generally follows two directions. The spectrum is compared with an IR spectroscopy database (DB), nowadays

    often supplied together with instruments. The compound is considered identified if the search procedure returns the given

    spectrum or a very closely related one. If it is known beforehand that the spectrum is missing in the DB, or, if it is not found

    via the search procedure, then common practice is its visual analysis using correlation tables based on the personal experience

    of the spectroscopist. Unfortunately, correlation tables give very limited data on spectrum-structure correlations, which often

    cannot be interpreted unambiguously. As a rule, the higher the complexity of the compound, the lower the reliability of its structure elucidation from the IR spectrum and correlation tables.

    An alternative approach is searching the DB for close spectral analogs of the compound and their analysis [1].

    Experience in handling IR spectra, as well as special investigations [2-4], showed that spectral similarity (which is often

    known to be lower than in the case of an identical spectrum search) entails structural similarity. The structural analogs of

    the compound selected in this way are rather valuable hints in solving structure elucidation problems. A number of

    publications (e.g., [2, 4, 5]) have reported on attempts at computer-aided analysis of the selected structural analogs to reveal

    the fragments of the structural formula hypothetically present in the compound under study. Useful structures, however,

    may be output along with structures whose spectrum matches the spectrum under analysis accidentally, which is a serious

    complicating factor. The researcher often finds it difficult to classify the structure as useful in view of the combinatorial

    complexity of manual or software-based comparison of the output structures.

    The lack of techniques for quantitative (statistical) evaluation of the reliability of the results is another challenge to

    both IR and some other types of molecular spectroscopy. The systems described in the literature permit structure elucidation

    1N. N. Vorozhtsov Novosibirsk Institute of Organic Chemistry, Siberian Branch, Russian Academy of Sciences; [email protected]. 2B. I. Stepanov Institute of Physics, National Academy of Sciences, Belarus. Translated from Zhurnal Strukturnoi Khimii, Vol. 44, No. 5, pp. 835-842, September-October, 2003. Original article submitted August 20, 2002.


    at a level of individual imaginative problems, precluding the use of statistical estimates to evaluate approaches for efficiency

    or solutions for reliability. One of the hindrances to the application of statistical evaluation procedures is the intrinsic complexity

    of organic compounds. Statistical estimates are inapplicable if the compound is viewed as unique and indivisible. If, however,

    it is regarded as admitting a description by a set of independent structural characteristics, the question arises of whether the

    given set of characteristics is correct and applicable to (or generally significant for) all DB structures.

    In spite of all these difficulties, efforts are undertaken to develop means for qualitative computer-aided analysis of

    organic compounds by IR spectra [6, 7]. The main tendency is a transition from strict structure identification problems for

    compounds whose spectra are available in the DB to a wider and more productive class of problems arising in analyzing the

    spectral and structural analogs.

    The aim of this work is to demonstrate the applicability of the new tool in analytical practice, whereby the principles

    of spectral and structural analogy in IR spectroscopy are used for structure elucidation of compounds by their IR spectra and

    for predicting their spectral properties.

    EXPERIMENTAL

    All experiments were run with a DB containing ~32,000 IR spectra and corresponding organic structures. The

    spectra are given by a full spectral curve and in a descriptor representation specifying the positions and intensities of

    absorption bands. The full spectral curve is used for seeking spectra with closest Euclidean metrics and for spectrum

    displaying and printing-out. The descriptor representation is used exclusively for seeking close spectral analogs. Structural

    formulas of organic compounds are represented in the DB by molecular graphs. For each molecular graph, the graph

    decomposition program constructs a full set of nonisomorphic connected fragments with 2-7 vertices, in which hydrogen

    atoms are not encoded [5]. In the course of the construction, each fragment that is new to the DB is assigned the next

    registration number. The structure of the compound is thus described by a list of the registration numbers of its fragments.

    The list is regarded as an internal structure representation equivalent to a binary vector with dimensionality M, where M is the

    number of fragments registered in the DB, and each vector component assumes 1 or 0 as a value, depending on the presence of a

    corresponding fragment in the structure. The number of different fragments registered in the DB of IR EXPERT is ~108,000.

    The distance between the structures (the degree of their similarity) is estimated using the relation

    $$R = 1 - \frac{2F_{ab}}{F_a + F_b}, \qquad (1)$$

    where $F_{ab}$ is the number of common fragments in structures a and b; $F_a$ and $F_b$ are the numbers of fragments in each structure

    compared.
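
    Relation (1) can be illustrated with a minimal Python sketch. It assumes that each structure is held as a set of fragment registration numbers; the function name, the sample numbers, and the handling of two empty descriptions are assumptions made for illustration, not part of IR EXPERT.

```python
def structural_distance(fragments_a, fragments_b):
    """Distance R = 1 - 2*F_ab / (F_a + F_b) between two fragment sets."""
    a, b = set(fragments_a), set(fragments_b)
    if not a and not b:
        # Assumption: two empty descriptions are treated as identical.
        return 0.0
    common = len(a & b)  # F_ab, the number of shared fragments
    return 1.0 - 2.0 * common / (len(a) + len(b))


# Hypothetical registration numbers, for illustration only.
structure_a = {1, 2, 3, 4}
structure_b = {3, 4, 5, 6}
print(structural_distance(structure_a, structure_b))  # 0.5
```

    With this relation, identical fragment sets give R = 0 and sets with no common fragments give R = 1.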

    Two algorithms are employed to evaluate the distance between the IR spectra: comparison of spectral curves in

    Euclidean metrics and comparison of descriptor representations in the metric developed earlier in [10] and reflecting the

    degree of coincidence between the positions and intensities of absorption bands, S = (A + M + P)/3. Here A is the component

    corresponding to the degree of coincidence between the intensities, P is responsible for the degree of coincidence between

    band positions, and M reflects the degree of coincidence between the most intense bands.

    The software components of the system are linked within the IR EXPERT shell realized under Windows for PC. The

    program components of the database are implemented with the VIRTA system [9]. This choice of the system was dictated by

    previous experience with the system, its efficient handling of entries of variable length, compactness of the DB, and ease of programming. Systems using SQL technology are less suitable in this case, because a typical problem to be solved is

    searching for k close analogs, which is not very efficient within the framework of SQL queries.
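
    The k-closest-analog query mentioned above can be sketched as a single pass over the database that keeps only the k best candidates. The sketch below is not the VIRTA implementation; the function names, the in-memory dictionary, and the sample entries are assumptions, and the distance follows relation (1) for non-empty fragment sets.

```python
import heapq


def fragment_distance(a, b):
    """Distance 1 - 2*F_ab/(F_a + F_b) between two non-empty fragment sets."""
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


def k_closest(query, database, k, distance=fragment_distance):
    """Return the k (distance, key) pairs closest to the query, nearest first."""
    scored = ((distance(query, entry), key) for key, entry in database.items())
    return heapq.nsmallest(k, scored)


# Hypothetical entries: keys are structure labels, values are fragment sets.
database = {
    "s1": {1, 2, 3},
    "s2": {1, 2, 4},
    "s3": {7, 8, 9},
    "s4": {1, 2, 3, 4},
}
nearest = k_closest({1, 2, 3}, database, k=2)
print([key for _, key in nearest])  # ['s1', 's4']
```

    Every entry has to be scored against the query, which is why a query of this kind does not map naturally onto index-based SQL lookups.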

    RESULTS AND DISCUSSION

    The relationships between the structure of organic compounds and IR spectral data are revealed using the DB by:

    analyzing structural characteristics of the compounds whose spectra have some features in common;


    analyzing spectral features reflecting certain structural peculiarities of the compounds.

    These methods previously have been investigated using an approach based on the principle of spectral and structural analogy

    and structure description by full fragment compositions [5, 6, 10]. The fruitfulness of those studies allowed us to go over to

    the development of the IR EXPERT software complex designed for helping the researcher in solving spectrum-structure

    problems by IR spectroscopy.

    Let us consider the application of this program complex to structure elucidation of organic compounds by spectral data.

    GENERATING STRUCTURAL HYPOTHESES

    The initial stage of structure elucidation consists in generating hypotheses about a compound, followed by refining

    and verifying them in accordance with the above two methods of establishing structure-spectrum relationships.

    The first stage is searching the DB for spectral analogs of the compound using the system based on the IR

    spectroscopy DB. The desired spectrum is given either by a spectral curve (e.g., the curve recorded on an instrument) or by

    the positions and intensities of absorption bands. The user specifies the characteristics of the sample and spectrum, i.e.,

    whether the sample is pure or contains impurities, and whether the spectrum is full or taken as a fragment in

    a certain spectral range (Fig. 1). The search result (SR) depends strongly on these conditions, as well as on the tolerances for

    Fig. 1. Query for IR spectral analog search (a), list of selected spectra (b), and comparison of one of the selected spectra with the desired spectrum (c).


    Fig. 2. Fragment composition of one of the structures of the search result (a) and estimated nonaccidental occurrence factors for separate fragments met among ten SR structures (b).

    possible deviations in band positions and intensities, on the coincidence threshold to discard spectra with weak similarities,

    and on a subset of the database in which the search will be conducted (common practice is search throughout the whole DB).

    The typical search time throughout the DB (300 MHz Pentium Celeron) is ~40 s.

    The search procedure compares the target spectrum with each DB spectrum, and the match factor is calculated.

    Thereafter the results with a match factor higher than the given threshold are ranked and output (Fig. 1b). The user analyzes the

    match between a particular spectrum and the target spectrum (represented by a thin line in Fig. 1c). Statistical data suggest

    that for match factors larger than 350, the selected compounds coincide or show similarity in their structural characteristics

    [10], which is probably true for the case considered as an example.
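
    The thresholding and ranking step can be sketched as follows. The match factors are assumed to have been computed already by the spectral comparison; the function name and the sample values are hypothetical, and the threshold of 350 is simply the value quoted above, used here for illustration.

```python
def rank_hits(match_factors, threshold):
    """Keep spectra whose match factor exceeds the threshold, best match first."""
    hits = [(factor, key) for key, factor in match_factors.items() if factor > threshold]
    return sorted(hits, reverse=True)


# Hypothetical match factors for three DB spectra.
match_factors = {"spectrum_A": 812, "spectrum_B": 340, "spectrum_C": 455}
print(rank_hits(match_factors, threshold=350))
# [(812, 'spectrum_A'), (455, 'spectrum_C')]
```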

    A peculiarity of the IR EXPERT system is the possibility of visualizing and analyzing a fragment composition of the

    compounds selected from the DB along with visualizing their structural formulas. Due to this, the structural hypothesis for

    the compound may be compared with the suggested fragment composition output by the system. Figure 2a shows a window

    for visualization of the structural formulas of the search result (left) and fragment composition of the corresponding compound. If the compound is not identified, or, if identification is not quite reliable, then there arises the possibility of

    evaluating the set of the structures selected by the search procedure (e.g., of the first ten structures) from the viewpoint of

    the nonaccidental occurrence of fragments in them.

    A nonaccidental occurrence factor is a function of the frequency of a given fragment occurring both in the search

    result and in the structural formulas of all DB compounds [11, 12]. For the former, the estimate refers to a particular result of

    searching (Fig. 2b); along with other statistical estimates, it must be treated with care. There may be cases where, for an


    inherently false fragment (i.e., one which is missing in the compound under analysis), the nonaccidental occurrence factor is

    close to 1 (in view of the limited accuracy of representation of numbers, the program outputs 1.0).
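
    The exact form of the factor is defined in [11, 12] and is not reproduced here. Purely as an illustrative stand-in, the sketch below uses a binomial model: it estimates how unlikely it is that a fragment with a given DB frequency would appear at least as often in the search result by chance. The function names and the count of 8 occurrences among 10 structures are assumptions; the DB size and the fragment frequency of 196 are taken from the text.

```python
from math import comb


def chance_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k chance occurrences among n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def nonaccidental_factor(hits_in_result, result_size, db_count, db_size):
    """Illustrative stand-in (not the formula of [11, 12]): 1 - chance probability."""
    p = db_count / db_size  # frequency of the fragment over the whole DB
    return 1.0 - chance_tail(hits_in_result, result_size, p)


# A fragment present in 196 of ~32,000 DB structures, found in 8 of 10 SR structures.
factor = nonaccidental_factor(8, 10, 196, 32000)
print(f"{factor:.1f}")  # 1.0
```

    Under this model a rare fragment that recurs in the search result yields a value indistinguishable from 1 at finite precision, which says nothing about whether the fragment is actually present in the compound under analysis.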

    After the search is completed, the user obtains full information about the selected spectra and structures, as well as

    the list of structural fragments that are most likely to occur in the compound. This set of data permits one to reliably identify

    the compound even if the spectrum under analysis and a spectrum selected from the database differ appreciably.

    If the procedure has failed to identify the compound, the spectroscopist employs the structure generation program [5,

    13] and, using the selected fragments together with additional information about the molecular formula of the compound

    (which is available, for example, from elemental analysis or mass spectrometry data), makes up a list of possible structural hypotheses

    for the compound.

    Also note that the nonaccidental occurrence parameter (given a certain threshold) is used to evaluate the probability

    and reliability of recognition of particular fragments while analyzing the results of the gliding search of DB, when the

    search task is sequential spectrum selection from the DB [5, 14]. The results of this experiment do not characterize a separate

    search; rather, they characterize the spectrum-structure relationships inherent in IR spectroscopy within the framework of the

    methods chosen for describing and comparing the spectra and structural formulas of the compounds.

    VERIFICATION OF STRUCTURAL HYPOTHESES

    In a typical case, the analyst (spectroscopist) has a hypothesis about the structure of a compound or a list of

    hypothetical structures. This calls for verification of the correspondence between the suggested structure and the available IR

    spectrum of the compound.

    Several solutions to this problem were investigated within the framework of the IR EXPERT project. A solution

    based on covering a molecular graph with fragments revealed via the spectral search procedure is described in [15].

    For applied research, the greatest interest probably lies in simulating the IR spectrum of a hypothetical

    structure using the database, and in revealing (or verifying) particular spectrum-structure correlations.

    As shown in [7, 16, 17], the spectra of the closest structural analogs might be used for spectrum simulations of

    compounds with a specified structure if the degree of structural similarity between DB objects and a given object is high

    enough. IR EXPERT permits one to find the closest structural analogs using the fragment composition match factor as a

    similarity criterion for (I).

    Figure 3 gives an example of searching analogs for the given structure (I). As a result, the user obtains a list of the

    closest structural analogs with similarity estimates (Fig. 3a).

    Based on the list of the analogs satisfying the given threshold of structural similarity, we can construct an empirical

    model of the IR spectrum [10, 17] of the compound being tested. In Fig. 3b, the model is represented in descriptor form with

    each predicted absorption band corresponding to a vertical bar. The model may be compared with the available spectrum

    (represented by a spectral curve) of the compound for verification of the structural hypothesis and for preliminarily discarding

    unlikely hypotheses, for example, those constructed by a structure generator.

    It is often necessary to clarify the spectral behavior of a particular fragment or functional group in a given environment. Particular spectrum-structure correlations may be constructed by examining the common spectral features for the revealed list

    of structural analogs, characterized by the presence of a particular fragment or list of fragments. In this case we can first

    evaluate the number of compounds with smaller fragments incorporated in the fragment under analysis. Figure 4 illustrates

    this by giving part of a tree of fragments with 2 to 7 vertices generated from four two-vertex fragments: CC, C=O, CN, and

    the aromatic CC bond. Near each fragment, the figure gives the number of DB structures containing a given fragment.


    Fig. 3. Selection of the closest structural analogs (a) and construction of the model spectrum of the compound on their basis (b).

    Analysis of the form and frequency of occurrence for fragments (Fig. 4a) allows the researcher to select the most

    interesting samples. If there are sufficiently many structures with such fragments, one can proceed to analyze their spectral

    features.

    As an example we give the results obtained by the spectrum and structure analysis for compounds containing a

    phthalimide fragment (medium fragment in the upper line, whose frequency of occurrence in DB structures is 196). For all

    structures with this fragment, we construct an averaged IR spectrum to reveal spectral features of the fragment. The averaged

    spectrum obtained from full spectral curves is given in Fig. 4c. A significant spectral correlation has been revealed only for

    frequencies close to 1700-1800, 1400, and 700 cm⁻¹. Details are obtained from the histogram of the frequencies of occurrence

    of absorption bands constructed for descriptor-represented spectra from the given sample.
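
    The two constructions described above, an averaged spectral curve and a histogram of band occurrence, can be sketched in a few lines. The sketch assumes that all curves are sampled on the same wavenumber grid and that each descriptor-represented spectrum is a list of band positions in cm⁻¹; the function names, the 50 cm⁻¹ bin width, and the sample data are assumptions for illustration.

```python
from collections import Counter


def average_curve(curves):
    """Point-by-point mean of spectral curves sampled on the same grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def band_histogram(band_lists, bin_width=50):
    """For each wavenumber bin, count the spectra having at least one band in it."""
    counts = Counter()
    for bands in band_lists:
        # A set, so that a spectrum is counted once per bin.
        bins = {int(position // bin_width) * bin_width for position in bands}
        counts.update(bins)
    return dict(sorted(counts.items()))


# Hypothetical data: two short curves and three band lists (positions in cm^-1).
print(average_curve([[0.0, 0.5, 1.0], [0.5, 0.5, 0.0]]))  # [0.25, 0.5, 0.5]
histogram = band_histogram([
    [1775, 1715, 1390, 720],
    [1770, 1710, 1400, 715],
    [1720, 1380, 2950],
])
print(histogram)
```

    Bins whose count approaches the number of spectra in the sample point to bands that recur across the structures containing the fragment.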

    Note that care should be taken when analyzing structure-spectrum correlations by IR spectrum processing methods

    of this kind and in making conclusions similar to those in [18, 19]. The main hindrance to revealing significant spectrum-

    structure correlations is probably the fact that the fragments occurring in structures are mutually correlated [5]. It is difficult

    to account for and eliminate this effect because this calls for an artificial construction of sample structures so that all their fragments have a uniform representation.

    Thus the developed software tools permit the analyst (spectroscopist, researcher) to solve the following main

    problems:

    search for spectral analogs;

    analysis of the structures selected by the search procedure;

    generation of lists of most probable structural fragments of the compound under study;


    Fig. 4. Selection of a characteristic fragment (a, b) and construction of the corresponding spectral response (c).

    IR spectrum simulation of hypothetical structures to compare the simulated spectra with the spectrum of the

    compound;

    statistical analysis to reveal the spectral characteristics typical of the fragments of the compound.

    Hopefully, the described version of the IR EXPERT system will prove useful in both scientific and applied aspects in dealing

    with qualitative analysis problems for elucidating structures of known and new organic compounds. The software
