CSDB help: NMR tools

CSDB usage: glyco-NMR tools

This document describes the usage of the CSDB NMR simulator based on the GODDESS software, the algorithms behind the simulation, and various aspects of NMR data processing.

Contents

Usage of the NMR simulation

Statistical NMR simulation

Empirical and hybrid NMR simulation

NMR-based structure elucidation

Technical notes on the NMR data storage in CSDB

Acknowledgements and citing

Usage of the NMR simulation

Input parameters

This feature is available from the NMR simulation link in the Extras section of the main menu, and from the (Sub)structure search form. To simulate the NMR data, you should first enter the structure of interest. Options to do this (1) are described in the (Sub)structure search section of Usage Help. This structure is previewed in SNFG format in area (2) and copied to the structure field (3) as a term in CSDB Linear encoding. The structure can be refined by manual editing of this field.

Figure: Input of the NMR simulation

By default, only the nucleus (5) and solvent (6) parameters are displayed; however, checking More parameters... (4) shows all of them, as in the figure.

Selection of 1H/13C for nucleus (5) will simulate 1D and 2D NMR spectra (COSY, TOCSY, edHSQC, and HMBC) for both nuclei.

The solvent selector (6) restricts simulation to data obtained in a specific solvent. Please note that if the empirical or hybrid method is selected for ¹³C simulation, only water or unrestricted can be selected as a solvent. CSDB contains ca. 4600 NMR spectra recorded in water (mainly of structures typical for bacteria) and ca. 1700 spectra recorded in pyridine (mainly of structures typical for plants), facilitating accurate simulation of the NMR data in these solvents. The distribution of solvents vs. records in the database can be displayed by pressing the Coverage link (6).

There are three modes of simulation regarding quality vs. speed: fast, accurate (default), and extreme. If you see unpredicted signals (question marks) in the assignment, set the quality option (7) to higher quality and press button (10) once again. Note that in the extreme quality mode, the calculation may take up to 15 minutes.

Together with the solvent, additional simulation parameters (8), such as pH or temperature range, allow limiting the database scope to the data obtained under certain experimental conditions. Solvent, quality mode, and additional parameters are applicable to the statistical simulation only, so they are disabled if empirical ¹³C simulation (9) is selected. ¹H NMR frequency (8) is used together with the coupling constant estimation module for rough prediction of the cross-peak widths in 2D spectra.

The More experiments checkbox (11) adds more NMR experiments (DQF COSY, COSY RCT, HSQC-COSY, HSQC-TOCSY, depending on the selected nuclei) to the 2D plots; however, the calculation takes longer.

If the default option (hybrid) is selected for ¹³C simulation (9), both empirical and statistical simulation approaches will be employed to obtain hybrid carbon chemical shifts. The empirical simulation is based on the incremental scheme with steric correction and utilizes internal databases of reference chemical shifts of mono-, di- and trimeric fragments and substitution effects. The statistical simulation is based on sequential generalization of the atomic surrounding of the predicted atom until enough structurally similar fragments are found in CSDB, with subsequent outlier detection and data averaging. The hybrid approach combines the results of empirical and statistical simulations in accordance with trustworthiness reported by both of them. It becomes available if the solvent is water or unrestricted. Proton chemical shifts are always simulated statistically.

Pressing button (10) simulates the NMR spectra and the result is displayed below the form.

Output of empirical 1D data

The empirical NMR simulation uses an incremental scheme to predict ¹³C NMR chemical shifts. Accuracy, advantages and disadvantages of this method are discussed below in this document. The empirical prediction displays a chemical shift assignment table, applied substitution effects, and subspectra of the unsubstituted residues for reference. Every row represents a certain residue in the structure of interest. Columns include:

Linkage = linkage path to the residue from the oligomer reducing end or polymer repeat rightmost residue (1).
Residue = residue name, ring size, configurations and N-acetylation pattern (2).
Trust = simulation trustworthiness per residue (0..100% colored from poor to good) (3). The overall trustworthiness for the whole structure is given below the table (5).
Cn = simulated chemical shifts, reference data and applied effects for every atom in a residue (4).

Next to the table are the sorted spectrum as a series of chemical shifts (one atom = one value) and its graphical plot. If some signals could not be predicted, the warning displays the number of missing peaks.

The table is exportable as TSV for usage in spreadsheet software. Supplementary data (effects etc.) can be hidden for convenient copy-and-pasting.

Output of statistical 1D data

The statistical NMR simulation uses a database-driven scheme to predict ¹³C and ¹H NMR chemical shifts. Accuracy, advantages and disadvantages of this method are discussed below in this document. The statistical prediction displays chemical shift assignment tables, expected simulation accuracy, and links to track the origin of data.

Every row represents a certain residue in the structure of interest. Columns include:

Linkage = linkage data for residue identification (the linkage path to the residue from the oligomer reducing end or from the rightmost residue in the repeating unit of a polymer) (1). Every row also contains a color square identifying the color code of signals in the 2D NMR plots.
Residue = residue name, ring size, and configurations in CSDB linear code (2).
Trust = simulation trustworthiness per residue (0..100% colored from poor to good) (3). The correlation between the trustworthiness level and chemical shift calculation accuracy was established by linear regression. It is used to predict the expected simulation error in ppm, which is displayed below each chemical shift. The overall trustworthiness for the whole structure is given below the table (5).
Cn / Hn = simulated chemical shifts, expected simulation error (in ppm) and trustworthiness metrics (in %) for every signal, the number of database records used to obtain the averaged data, and a link to the processing report (4). The expected error and trustworthiness values (0 = poor, 100 = best) are color-coded to reflect how good the prediction was. Clicking on the number of records below each chemical shift opens a list of references (6). These references allow tracking the source of data to CSDB records and the corresponding original publications (7). The data used in simulation are highlighted (X) in these records.

Next to the ¹³C assignment table are the sorted spectrum as a series of chemical shifts (one atom = one value) and its graphical plot. If some signals could not be predicted, the warning displays the number of missing peaks. ¹H NMR spectra are not plotted.

The table is exportable as TSV for usage in spreadsheet software. Supplementary data (trustworthiness, references) can be hidden for convenient copy-and-pasting.

Output of hybrid 1D data

The hybrid carbon chemical shift simulation implies heuristic mixing of data from both approaches mentioned above according to the inter-approach deviation, trustworthiness, dataset size and other parameters. As it benefits from both simulation methods, it usually provides the most accurate results. This feature is available only if both empirical and statistical simulations have been run, and the solvent in the latter approach was unrestricted or water. The hybrid prediction displays a chemical shift assignment table and the schematic superimposed NMR spectrum.

Every row represents a certain residue in the structure of interest. Columns include:

Linkage = linkage data for residue identification (the linkage path to the residue from the oligomer reducing end or from the rightmost residue in the repeating unit of a polymer) (1).
Residue = residue name, ring size, and configurations in CSDB linear code (2).
Trust = hybridized simulation trustworthiness per residue (0..100% colored from poor to good) (3). The overall trustworthiness for the whole structure is given below the table (5).
Cn = simulated chemical shifts, trustworthiness metrics, and deviation between results from empirical and statistical approaches (Δ) for every atom in a residue (4).

Next to the table are the sorted spectrum as a series of chemical shifts (one atom = one value) and its graphical plot. The reference data from both approaches (red marks = empirical, blue marks = statistical) are added to the spectrum plot (7). The header reports the overall trustworthiness metrics and lists all signals in sorted order (6) for copy-and-pasting. If some signals could not be predicted, the warning displays the number of missing peaks.

The table is exportable as TSV for usage in spreadsheet software.

Output of 2D spectra

If 1H or 1H/13C was specified as a nucleus, 2D NMR spectra are visualized using the predicted chemical shifts and roughly estimated proton coupling constants. Depending on the nucleus and the state of the More NMR experiments checkbox, two to eight 2D spectra are plotted. These experiments cover most of the proton and carbon spin correlations commonly used in glycobiology (COSY, TOCSY, HSQC, HMBC and derived experiments). NOE correlations are currently not supported. As an example, COSY, edHSQC and HMBC are shown in the figure (1). Quick help links beside the experiment names provide brief descriptions of what the signals in each spectrum reflect. In the edHSQC spectrum, positive cross-peaks are displayed as ellipses, while negative ones are displayed as rectangles. Links (2) lead to the derived experiments related to the one displayed.

Links below the spectrum switch useful display options: color mode, signal labels and image resolution.

The color mode switch (3) determines how the signals are colored; it has two states: signal assignment and trust level. The former colors all the signals according to the residue color code, as displayed in the first column of the assignment table (exemplified in COSY and edHSQC). The latter colors all the signals in the range from red to green, reflecting how accurate the simulation of the signal was (exemplified in HMBC).

The peak label switch (4) hides or shows the numbers beside signals. In the assignment color mode, these numbers correspond to the order of carbon atoms in residues. The combination of the color (=residue) and the label (=position in a residue) identifies every atom. NMR spectra, which provide non-direct correlations, may have complex labels, including bicolored ones (e.g. H1/C3) to identify inter-residue cross-peaks. In the trustworthiness color mode, numbers are the trustworthiness metrics. In the figure, HMBC is displayed without labels in the trustworthiness color mode, while COSY and edHSQC are displayed with labels in the assignment color mode.

The Hi-res image link (5) or clicking on a spectrum displays a larger image in a separate window for copy-and-pasting. If some of the signals could not be predicted (they have ? as a chemical shift in the assignment table), they are listed below the spectra in which they should have appeared.

The JDX link (6) exports the 2D spectrum in JCAMP-DX format. Two drop-down options are available:

Live view NMR transfers the spectrum to the Cheminfo online NMR processor, where you can pan and zoom the spectrum or superimpose it with the experimental one.
Download file displays a page where you can analyse raw data and download the JDX file for further processing in external NMR software, such as Mestrelab MestReNova, ACD/Labs NMR viewer or Bruker TopSpin.

Statistical NMR simulation

Overview

The statistical NMR simulation is proposed for simulation of ¹³C and ¹H chemical shifts for glycan structures, including oligomers, polymers and fragments. It outputs the NMR assignment tables, and plots 2D NMR spin correlations commonly used in glycobiology. The approach was reported as a part of Glycan-Optimized Database-Driven Empirical Spectrum Simulation (GODDESS). It has the following key features:

Database-driven modified HOSE scheme at the atom group level
Does not need dedicated databases but uses a regularly updated database (CSDB, >5000 spectra)
Generalizes structural surrounding of the atom under prediction until enough data are found
Averages the found data with outlier removal by the modified iterative Chauvenet criterion with check for shielded outliers
Reports trustworthiness and expected error based on the dataset size and total weight of all generalizations applied
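
The averaging step with outlier removal can be illustrated with a minimal sketch. It implements the plain iterative Chauvenet criterion only; the modifications used by CSDB and its check for shielded outliers are not described here and are not reproduced. The function name and the min_points parameter are assumptions made for illustration.

```python
import math
import statistics


def chauvenet_average(shifts, min_points=3):
    """Average chemical shifts after plain iterative Chauvenet rejection.

    Hypothetical helper: the CSDB engine uses a modified criterion
    that is not reproduced in this sketch.
    """
    data = list(shifts)
    rejected = []
    while len(data) >= min_points:
        mean = statistics.fmean(data)
        sd = statistics.stdev(data)
        if sd == 0:
            break
        # Candidate outlier: the value furthest from the current mean.
        worst = max(data, key=lambda x: abs(x - mean))
        z = abs(worst - mean) / sd
        # Expected number of samples at least this deviant (normal law).
        expected = len(data) * math.erfc(z / math.sqrt(2))
        if expected >= 0.5:
            break  # nothing left to reject
        data.remove(worst)
        rejected.append(worst)
    return statistics.fmean(data), rejected


mean_shift, outliers = chauvenet_average(
    [70.1, 70.2, 70.0, 70.3, 70.1, 70.2, 75.9]
)
```

In this example the value 75.9 is rejected on the first pass, and the remaining six values are averaged.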

Prediction accuracy and trustworthiness

The structural surrounding generalization scheme used is optimized for carbohydrates and their derivatives. It was proven to simulate ¹³C and ¹H chemical shifts with an accuracy that outperforms quantum mechanical methods (e.g. GIAO DFT B3LYP) in large basis sets. The root-mean-square deviation ranges from 0 to 3.0 ppm (¹³C) or from 0 to 0.3 ppm (¹H), depending on how many similar structural fragments have NMR spectra stored in CSDB; typical values obtained on a pool of naturally occurring glycans were 0.9 ppm (¹³C) and 0.1 ppm (¹H). Besides NMR prediction, this approach is applicable to any atomic properties deposited in the database. The details are available in dedicated publications.

The trustworthiness level is estimated for every signal in the range 0..100%, the higher the better. This value is based on:

The total weight of all generalizations applied (see below; the higher the weights are, the lower the trustworthiness is)
Standard deviation of the sampling of all found chemical shifts (the bigger the deviation is, the lower the trustworthiness is)
The number of found chemical shifts (the bigger the dataset is, the better the trustworthiness is)

The expected prediction error (in ppm) is estimated from the trustworthiness values using linear regression. The regression rules were based on the correlations between these two parameters observed for various structure samplings.

Generalization scheme

The simulation engine searches CSDB for the fragment of a structure containing the residue with the atom under prediction, and all the adjacent residues that occupy the neighboring topological nodes. If such a fragment is found, the data for the atom being predicted are checked for statistical outliers and averaged. Otherwise, the engine starts to apply generalizations of the structural surrounding from minor to major ones, until the obtained structural pattern is found in the database. Every act of generalization changes the fragment properties so that more potential structures match it. For example, replacing aDFucp with DFucp is a generalization act, as the former corresponds to a single fully determined fragment, while the latter matches two fragments: aDFucp and bDFucp. The figure below exemplifies some of the possible generalization pathways for prediction of the C4(GlcN) chemical shift in a structure containing the aDManp(1-3)bDGlcpN(1- disaccharide fragment. Each blue arrow implies a single generalization act.

Every structural parameter of a fragment that can be generalized is assigned a weight factor depending on how strong the expected effect of this parameter on the chemical surrounding of the analyzed atom is. The estimation of this dependence empirically accounts for the nature of the parameter and its bond-distance from the predicted atom. The generalization sequence starts from parameters related to the most distal and conformationally flexible groups of atoms to minimize the effect of a generalization act on the predicted atom.

The generalizable descriptors are the following (term residue refers to a residue the predicted atom belongs to):

types of every atom in the residue and in the adjacent residues (hydroxy, amino, deoxy, carboxy, C=C, other)
atomic stereo configurations of chiral centers in the residue and in the adjacent residues (includes generalization of anomeric configurations)
ring size of the residue and the attached sugars (pyranose, furanose, linear)
presence of residue donor(s) and acceptor(s), separately at each atom
absolute configuration of optically active neighboring residues (same as or different from that of the residue itself)
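
A single generalization act, such as the aDFucp to DFucp example above, can be pictured as blanking one descriptor in a search pattern. The sketch below is an illustration only: the ResiduePattern class, its four fields and the tiny in-memory list standing in for the database are assumptions, not the CSDB data model.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResiduePattern:
    """Hypothetical residue descriptor set; None means 'generalized'."""
    anomer: Optional[str]    # 'a' or 'b'
    absolute: Optional[str]  # 'D' or 'L'
    base: str                # e.g. 'Fuc'
    ring: Optional[str]      # 'p' (pyranose) or 'f' (furanose)

    def matches(self, other: "ResiduePattern") -> bool:
        # A generalized (None) field matches any value in the other residue.
        fields = (
            (self.anomer, other.anomer),
            (self.absolute, other.absolute),
            (self.base, other.base),
            (self.ring, other.ring),
        )
        return all(mine is None or mine == theirs for mine, theirs in fields)


exact = ResiduePattern("a", "D", "Fuc", "p")   # aDFucp
generalized = replace(exact, anomer=None)      # DFucp

# Stand-in for database content (illustrative only).
database = [
    ResiduePattern("a", "D", "Fuc", "p"),
    ResiduePattern("b", "D", "Fuc", "p"),
    ResiduePattern("a", "L", "Fuc", "p"),
]
hits_exact = [r for r in database if exact.matches(r)]
hits_generalized = [r for r in database if generalized.matches(r)]
```

The exact pattern matches one stored fragment, while the generalized pattern matches both anomers of the D-configured residue.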

The weight factors were optimized iteratively to give the minimal deviation between the predicted and the experimental chemical shifts in a training set of structures. Every weight factor in a set lies in the range of 0..100 and depends on the following parameters:

type of residue the predicted atom belongs to (weight factors for pyranoses, furanoses, alditols and all other residues were iterated independently)
descriptor used (see above)
total weight of all descriptors located further from the predicted atom than the current descriptor (as an addend)
how far the descriptor of the residue is from the predicted atom (the number of bonds between: none,1,2,3,>3)
how far the descriptor of the adjacent residue is from the linkage point (the number of bonds between: none,1,2,3,>3), and how far the linkage point in the residue itself is from the predicted atom (itself, neighboring, distal)

Three quality vs. speed modes are supported. The logic underlying these modes is as follows:

FAST - the number of generalization steps is limited to 5 per atom, the number of substituent generalization steps is limited to 1.
ACCURATE - the number of generalization steps is unlimited, the number of substituent generalization steps is limited to 1 or 2 (depending on the bond-distance between the predicted atom and the substituent), the maximal total generalization weight is 100.
EXTREME - the number of generalization steps is unlimited, the number of substituent generalization steps is limited to 4, the maximal number of records and the total weight are unlimited.
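
The limits of the three modes can be written down as a small configuration table. This is a sketch under stated assumptions: the QualityMode class and step_allowed function are hypothetical names, None stands for "no limit stated", and the ACCURATE substituent limit is stored as its upper bound of 2 although the text makes it 1 or 2 depending on the bond-distance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QualityMode:
    """Hypothetical container for the per-mode limits listed above."""
    max_steps: Optional[int]            # generalization steps per atom
    max_substituent_steps: int          # substituent generalization steps
    max_total_weight: Optional[float]   # total generalization weight


MODES = {
    "FAST": QualityMode(5, 1, None),
    "ACCURATE": QualityMode(None, 2, 100.0),  # 2 is the upper bound
    "EXTREME": QualityMode(None, 4, None),
}


def step_allowed(mode, steps_done, substituent_steps_done,
                 weight_after_step, is_substituent_step):
    """Return True if one more generalization step fits the mode limits."""
    if mode.max_steps is not None and steps_done >= mode.max_steps:
        return False
    if (is_substituent_step
            and substituent_steps_done >= mode.max_substituent_steps):
        return False
    if (mode.max_total_weight is not None
            and weight_after_step > mode.max_total_weight):
        return False
    return True
```

For instance, a sixth generalization step is refused in FAST mode, whereas EXTREME mode accepts it as long as fewer than four substituent steps have been made.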

Incremental and hybrid NMR simulation

Overview

The incremental NMR simulation is proposed for the prediction of ¹³C chemical shifts in water solution for glycan structures, including oligomers, polymers and fragments. This approach was adapted from Biopolymer Structure Elucidation (BIOPSEL) software reported earlier. It has the following key features:

Incremental scheme with steric correction (derived from BIOPSEL)
Considers structural surrounding (9-13 descriptors)
Uses dedicated chemical shift & effect databases (80 monomers, 2500 dimers & trimers, 150 theoretical effects)
Supports most glycan structural features (widespread and rare sugars, amino acids, alditols, phosphates, lipid moieties, residue modifications, dual linkages, repeat units etc.)
Predicts >10K chemical shifts per second, and is thus suitable for NMR-based structure prediction

The empirical scheme used was proven to provide better accuracy than the quantum-chemical methods that were usually considered the best for NMR chemical shift simulation in organic chemistry, e.g. COSMO+B3LYP/6-311++G(2d,2p). You can expect a root-mean-square error from less than 0.5 ppm (if a glycan contains widespread residues highly populated in the spectroscopic databases) to 3.0 ppm (for rare and unusual glycans and glycoconjugates). See details in a review on glyco-NMR simulation [Chem Soc Rev 2013].

The hybrid approach combines results from the incremental and statistical simulations in accordance with the trustworthiness reported by both approaches. More details on the averaging rules are available below.

Incremental prediction accuracy

The accuracy of simulation is an average of the accuracy values for every residue in a structure, and ranges from 0 to 100%. Spectra predicted with an accuracy above 75% may be considered quite trustworthy. The simulation accuracy of each residue depends on the following:

If a spectrum of an exact structural fragment (dimer or trimer) is found, the accuracy is 100%. Otherwise the simulator applies substitution effects to a spectrum of the unsubstituted residue, and the accuracy is calculated as 25*(3.5-max(EffP)-Prox), where
EffP is a perturbation introduced by the effect application (worst among all effects applied, see below),
Prox is a proximity term summed over all effect pairs (1.0 if two effects were applied on neighboring carbons, 0.5 if two effects were applied on carbons two bonds away from each other).

The effect perturbation term (0 = best, 3 = worst) may have the following values depending on which chemical shift increments were found in the theoretical effect database:

0.0 = used exact effects for the acceptor residue and all donors (5 descriptors per donor),
1.0 = used exact effects for the acceptor ancestor (e.g. Gal instead of Fuc) and all donors OR
1.0 = varied one structural descriptor per donor and used the obtained effects for the acceptor residue
1.5 = varied two structural descriptors per donor and used the obtained effects for the acceptor residue
2.0 = varied one structural descriptor per donor and used the obtained effects for the acceptor ancestor
2.5 = varied two structural descriptors per donor and used the obtained effects for the acceptor ancestor
3.0 = used default increments depending on the bond type
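
The per-residue accuracy formula 25*(3.5-max(EffP)-Prox) and the averaging over residues can be written as a short sketch. The function names are hypothetical, and clamping negative results to 0 is an assumption made here because the accuracy is stated to range from 0 to 100%.

```python
def residue_accuracy(effect_perturbations, proximity,
                     exact_fragment_found=False):
    """Accuracy (%) of one residue in the incremental simulation.

    effect_perturbations: EffP values (0.0..3.0) of all applied effects.
    proximity: the Prox term summed over all effect pairs.
    """
    if exact_fragment_found:
        return 100.0
    worst_effect = max(effect_perturbations, default=0.0)
    accuracy = 25.0 * (3.5 - worst_effect - proximity)
    # Assumption: negative values are clamped to 0.
    return max(0.0, accuracy)


def structure_accuracy(residue_accuracies):
    """Overall accuracy: the average over all residues."""
    return sum(residue_accuracies) / len(residue_accuracies)


# Two effects (exact and 'two descriptors varied') on carbons
# two bonds apart: 25 * (3.5 - 1.5 - 0.5) = 37.5
example = residue_accuracy([0.0, 1.5], 0.5)
overall = structure_accuracy([100.0, example])
```

Note that with exact effects and no proximity penalty the formula yields 87.5%, so 100% is reached only when the spectrum of the exact fragment is found.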

The structural descriptors used and the default increments are listed in the next section.

Incremental simulation algorithm and dedicated databases

The primary spectroscopic database and the substitution effect database contain averaged literature data on chemical shifts, glycosylation and phosphorylation effects. The approximate coverage is 80 residues, 2500 dimers and trimers, and 150 theoretical effects; data are averaged for D₂O solutions at 318 K. The following structural peculiarities are taken into account when searching for particular chemical shifts or substitution effects:

acceptor (which exact residue is substituted)
acceptor anomeric and ring size configurations, if they exist
substitution position
type of donor: pyranose, furanose, alditol, phosphate, O-linked non-carbohydrate, N-linked non-carbohydrate
donor anomeric configuration, if exists
additional groups attached to the donor bonding atom, except a hydroxy function (none/carboxy/C-chain)
function at the donor atom next to a bond (hydroxy/amino/deoxy)
orientation of a substituent at the donor atom next to a bond (axial/equatorial/unoriented or both)
combination of donor and acceptor absolute configurations (same/different/not applicable)

The simulator iterates through all residues in a structure and searches the primary spectroscopic database for chemical shifts characteristic of each residue in the given structural environment. If these data are not found, a subspectrum of the residue is calculated from the spectrum of the unsubstituted residue and substitution effects.

If the desired effect is missing from the database, the type and orientation of a substituent next to the bonding atom in the donor residue are perturbed until the effect is found. If the perturbed effect is not found, the residue under prediction is temporarily replaced with a more widespread residue with the same basic configuration (e.g. Gal instead of FucNAc). If still no effect is found, it is simulated using the default increments.

The default increments are:

+6.0 glycosylation on donor bonding atom
+6.0 glycosylation on alpha-C
-1.0 glycosylation on beta-C
-1.2 amidation on amine alpha-C
-0.3 amidation on amine beta-C
-3.0 amidation on acid alpha-C
+0.6 amidation on acid beta-C
+3.0 amine on alpha-C
+4.0 phosphorylation on alpha-C
-0.5 phosphorylation on beta-C
-1.0 acylation on acid carboxyl
+5.0 acylation on alpha-C
-1.0 acylation on beta-C
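
As an illustration of how such fallback increments act on a subspectrum, the sketch below applies the default glycosylation increments to an unsubstituted residue. Several things are assumptions: alpha-C is read as the substituted carbon and beta-C as its chain neighbours, neighbours are found by carbon number, and the input shifts are illustrative approximate values for a hexopyranose rather than CSDB reference data. The function name is hypothetical.

```python
# Default glycosylation increments from the list above (ppm).
GLYCOSYLATION_ALPHA = 6.0   # on the substituted carbon (assumed reading)
GLYCOSYLATION_BETA = -1.0   # on its neighbouring carbons (assumed reading)


def apply_default_glycosylation(shifts, position):
    """Return a copy of `shifts` with default increments applied.

    shifts: dict mapping carbon number -> chemical shift (ppm)
            of the unsubstituted residue.
    position: number of the carbon bearing the substituent.
    """
    result = dict(shifts)
    result[position] = round(result[position] + GLYCOSYLATION_ALPHA, 2)
    for neighbour in (position - 1, position + 1):
        if neighbour in result:
            result[neighbour] = round(
                result[neighbour] + GLYCOSYLATION_BETA, 2
            )
    return result


# Illustrative approximate shifts, not taken from the CSDB databases.
unsubstituted = {1: 96.7, 2: 75.1, 3: 76.7, 4: 70.6, 5: 76.8, 6: 61.7}
substituted_at_3 = apply_default_glycosylation(unsubstituted, 3)
```

With these inputs, C3 moves downfield by 6.0 ppm and C2 and C4 move upfield by 1.0 ppm each, while the remaining carbons keep their values.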

If a residue subspectrum is calculated using substitution effects (rather than by extracting the exact chemical shifts from the database), chemical shifts of C2 and C5 of non-reducing pyranoses are modified as follows:

-1.4 aldo-C2 (keto-C3) of beta-sugars with equatorial H4,
-1.0 aldo-C2 (keto-C3) of other beta-sugars,
-0.5 aldo-C2 (keto-C3) of alpha-sugars with equatorial H2,
-0.5 aldo-C2 (keto-C3) of alpha-2-aminosugars with equatorial H4,
+0.5 aldo-C5 (keto-C6) of all alpha-sugars.

Glycosylation effects for three widespread sugar configurations (glc, gal, man) are represented most fully, usually making the effect prediction for these basetypes more accurate. The chemical shift database contains limited data on non-N-acetylated aminosugars; thus, you should virtually N-acetylate them for better accuracy.

Hybridization rules

The hybridized spectrum benefits from combining the incremental and the database-driven predictions. The spectrum averaging algorithm uses the following parameters from the empirical (_emp) and statistical (_stat) simulations of every signal:

chemical shifts (δ_emp, δ_stat)
accuracy estimation as trustworthiness values (T_emp, T_stat)
absolute deviation between two chemical shifts Δ=|δ_emp-δ_stat|

In calculation of the averaged chemical shift, three options are possible:

Results from both approaches are close to each other (Δ<0.2). The averaged chemical shift is calculated as the arithmetic mean of two chemical shifts. If T_emp>75 or T_stat>75, the resulting trustworthiness T = 25×(4-Δ); otherwise T = (T_emp+T_stat)/2 - 25×Δ + 25.
Results from both approaches are far from each other (Δ≥0.2). The averaged chemical shift is a linear combination of its empirical and statistical counterparts: δ = K_emp×δ_emp + K_stat×δ_stat, where K_stat = T_stat³/(T_stat³+T_emp³) and K_emp = 1-K_stat. The resulting trustworthiness is a normalized deviation-biased linear combination of its empirical and statistical counterparts: T = (K_emp×T_emp + K_stat×T_stat - 25×Δ×min(K_emp,K_stat)) × (0.9 + 0.2×min(K_emp,K_stat)).
Results from only one approach could be obtained. The averaged chemical shift is equal to that obtained, and trustworthiness is 10% less than that obtained, as it is not confirmed by the other approach.
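
The three cases above can be collected into one function. This is a minimal sketch, not the CSDB implementation: the function name is hypothetical, "10% less" is read as multiplication by 0.9, and equal weights are assumed when both trustworthiness values are zero (a case the rules do not cover).

```python
def hybridize(d_emp, t_emp, d_stat, t_stat):
    """Combine empirical and statistical predictions of one signal.

    d_*: chemical shifts in ppm (None if that approach gave no value).
    t_*: trustworthiness values, 0..100.
    Returns (shift, trustworthiness).
    """
    # Case 3: only one approach produced a value.
    if d_emp is None and d_stat is None:
        return None, None
    if d_emp is None:
        return d_stat, 0.9 * t_stat   # assumption: '10% less' = x0.9
    if d_stat is None:
        return d_emp, 0.9 * t_emp

    delta = abs(d_emp - d_stat)

    # Case 1: the two predictions agree.
    if delta < 0.2:
        shift = (d_emp + d_stat) / 2
        if t_emp > 75 or t_stat > 75:
            trust = 25 * (4 - delta)
        else:
            trust = (t_emp + t_stat) / 2 - 25 * delta + 25
        return shift, trust

    # Case 2: the two predictions disagree.
    total = t_stat ** 3 + t_emp ** 3
    k_stat = t_stat ** 3 / total if total else 0.5  # assumption for 0/0
    k_emp = 1 - k_stat
    k_min = min(k_emp, k_stat)
    shift = k_emp * d_emp + k_stat * d_stat
    trust = ((k_emp * t_emp + k_stat * t_stat - 25 * delta * k_min)
             * (0.9 + 0.2 * k_min))
    return shift, trust
```

For example, hybridize(70.0, 50.0, 71.0, 50.0) falls into the second case with K_emp = K_stat = 0.5 and gives a shift of 70.5 ppm with a trustworthiness of 37.5. The cubic weighting means that the more trusted prediction dominates quickly as the two trustworthiness values diverge.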

The details on how these formulas were obtained are available in dedicated publications [J Chem Inf Model 2014, Analyt Chem 2015].

Acknowledgements and references

The empirical ¹³C NMR simulation feature was developed on the basis of the ideas behind the spectrum simulation module of the BIOPSEL software [1,2]. Within the CSDB project [3,6,9], it was adapted to oligomeric structures, improved to treat keto-sugars and other 'special cases', and given a more accurate incrementation algorithm and a web interface. This and other glycan NMR simulation schemes were reviewed in [4]. The statistical NMR simulation feature is based on the GODDESS carbohydrate generalization scheme [5,7]. The GRASS service [8] allows semi-automated NMR-based structural elucidation and ranking of structural hypotheses.

Please cite using the following template for Experimental part in your paper (leave or remove the 2nd sentence, depending on how you used the tool):

NMR spectrum assignment was done with the help of the chemical shift reference collection and simulation tool for ¹³C [5,7] and ¹H [7] nuclei at the Carbohydrate Structure Database (CSDB) [6]. To refine a set of structural hypotheses, the CSDB structural ranking tool [8] and empirical chemical shift simulation [4] were used.

[1] F.V. Toukach, A.S. Shashkov "Computer-assisted structural analysis of regular glycopolymers on the basis of ¹³C NMR data" (Carbohydr. Res. 2001, 335(2): 101-114. DOI: 10.1016/S0008-6215(01)00214-2)
[2] F.V. Toukach "Computer-assisted structural analysis of glycopolymers" (Proceedings of Eurocarb-12, France, Grenoble, 2003: PA-004)
[3] Ph.V. Toukach "Bacterial Carbohydrate Structure Database 3: principles and realization" (J. Chem. Inf. Model. 2011, 51(1): 159-170. DOI: 10.1021/ci100150d)
[4] Ph.V. Toukach, V.P. Ananikov "Recent advances in computational predictions of NMR parameters for structure elucidation of carbohydrates: methods and limitations" (Chem. Soc. Rev. 2013, 42: 8376-8415. DOI: 10.1039/C3CS60073D)
[5] R.R. Kapaev, K.S. Egorova, Ph.V. Toukach "Carbohydrate structure generalization scheme for database-driven simulation of experimental observables, such as NMR chemical shifts" (J. Chem. Inf. Model. 2014, 54(9): 2594-2611. DOI: 10.1021/ci500267u)
[6] Ph.V. Toukach, K.S. Egorova "Carbohydrate structure database merged from bacterial, archaeal, plant and fungal parts" (Nucleic Acids Res. Database Issue 2016, 44(D1): D1229-D1236. DOI: 10.1093/nar/gkv840)
[7] R.R. Kapaev, Ph.V. Toukach "Improved carbohydrate structure generalization scheme for ¹H and ¹³C NMR simulations" (Anal. Chem. 2015, 87(14): 7006-7010. DOI: 10.1021/acs.analchem.5b01413)
[8] R.R. Kapaev, Ph.V. Toukach "GRASS: semi-automated NMR-based structure elucidation of saccharides" (Bioinformatics 2018, 34(6): 957-963. DOI: 10.1093/bioinformatics/btx696)
[9] Carbohydrate Structure Database web-site, http://csdb.glycoscience.ru.

NMR-based structure elucidation

Overview

This tool aims to help in structural elucidation studies and NMR spectrum assignment. It uses the GRASS algorithm, which stands for Generation, Ranking and Assignment of Saccharide Structures. The tool ranks structural hypotheses according to the fit between the simulated and experimental NMR spectra. To do that, it iterates through all possible carbohydrates and their derivatives limited by the specified constraints. For each generated structure, an empirical ¹³C NMR spectrum is simulated (see details above) and compared to the provided experimental data. Not more than 500 best-fitting structures are further refined using statistical ¹³C NMR spectrum simulation (see details above), to give a few top-matching structures. These structures are displayed as best matches together with the simulated NMR data.

The iterator permutes the following structural parameters:

monomeric residues (if superclasses are specified, e.g. Any hexose)
their absolute, anomeric and ring size configurations (if applicable but specified as unknown)
their N-acetylation pattern (if applicable but specified as allowed)
their linkage positions (those chemically possible and specified as allowed)
structure topology and sequence of residues (which can be further restricted by residue location or branching degree)

Permutation of ALL of these parameters may produce billions of structures, degrading the performance and negating the value of the ranking. The suggested usage of this tool implies specification of as many knowns as possible; this allows for good predictive power for the remaining unknowns. For example, it can deduce anomeric configurations and residue sequence when the exact monomeric composition and linkage positions are provided. Or it can deduce linkage positions and a sequence when a monomeric composition and configurations are provided. However, please don't expect a miracle if you provide no exact monomeric composition or none of the above-listed parameters for structures larger than tetrasaccharides.
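
A rough count shows how quickly the number of candidates grows. The sketch below is a deliberately simplified estimate for linear chains only: it ignores branching, ring size, absolute configuration and chemical validity, and the function name and parameters are assumptions, not part of the GRASS algorithm.

```python
def count_linear_candidates(n_residues, residues_per_position,
                            anomers_per_residue, linkage_positions):
    """Rough number of linear structures for the given unknowns.

    Each residue contributes (identity x anomeric configuration);
    each of the n-1 linkages contributes its possible positions.
    """
    per_residue = residues_per_position * anomers_per_residue
    return (per_residue ** n_residues
            * linkage_positions ** (n_residues - 1))


# Tetrasaccharide with known sequence, unknown anomers and linkages.
known_composition = count_linear_candidates(4, 1, 2, 4)

# Same size, but any of 200 residue types allowed at each position.
unknown_composition = count_linear_candidates(4, 200, 2, 4)
```

Under these assumptions, a tetrasaccharide with a known residue sequence gives 1024 candidates, whereas allowing 200 residue types at each position gives more than 10¹² candidates.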

To improve the credibility of results and make calculation faster, you can apply structural constraints obtained from the other experiments:

monomeric composition (from GC, GLC, HPLC experiments)
polymericity (from any chromatographic experiment)
the number of β-sugars (from counting doublets in the anomeric region of the ¹H NMR spectrum and the analysis of ¹J_CH couplings in pyranosides: α > 166 Hz; β < 166 Hz)
absence of furanoses (from the absence of signals in the region 80..86 ppm of the ¹³C NMR spectrum)
N-acetylation limitations (from the number of shifted signals in a range of 50..56 ppm in HSQC at another pH, or from counting signals at 22.1-22.5 ppm in the ¹³C NMR spectrum of a de-O-acetylated sample)
linkage positions (from the methylation experiment)
absolute configurations (from the optical rotation measurement)
the number of CH₂ groups (from the APT or DEPT-135 spectrum)
partial sequence data (from analysis of hydrolysis products)
phosphorylation degree (from the ³¹P NMR spectrum)
any other data narrowing the scope of the iterated structures
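
The ¹J_CH criterion from the list above can be applied mechanically. The snippet below is a minimal sketch and not part of CSDB: the function names and the input list of couplings are illustrative assumptions, and only the 166 Hz boundary comes from the text.

```python
J_CH_BOUNDARY_HZ = 166.0  # boundary quoted above for pyranosides


def classify_anomer(j_ch_hz):
    """Classify one anomeric center by its one-bond C-H coupling constant (Hz)."""
    if j_ch_hz > J_CH_BOUNDARY_HZ:
        return "alpha"
    if j_ch_hz < J_CH_BOUNDARY_HZ:
        return "beta"
    return "ambiguous"  # exactly on the boundary: decide by other data


def count_beta(couplings_hz):
    """Count anomeric centers classified as beta."""
    return sum(1 for j in couplings_hz if classify_anomer(j) == "beta")
```

The resulting count can then be supplied as the constraint on the number of β-sugars.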

In most cases, natural compounds do not contain exotic features, such as rarely occurring residues or atypical configurations, highly dendritic branching patterns, or huge side chains in polymers. To exclude such structures from iteration, the search scope option Widespread structures only is selected by default. Change this option to generate all structures, including those unlikely to occur in nature (this increases the total number of structures and the calculation time ten- to hundred-fold). Widespread mode applies the following limitations:

If a certain residue is a superclass (e.g. Any residue or Any hexose), only those of its representatives are used that have been reported to occur more than 20 times in natural compounds (as deposited in CSDB). This gives about 200 widespread residues, in contrast to 2300 residues in total.
If absolute configuration exists but is unknown for a certain residue, only widespread values will be used, such as D for glucose, L for most amino acids, both D and L for rhamnose, etc.
If ring size exists but is unknown for a certain residue, only widespread values will be used, such as pyranose for glucose, furanose for ribose, or both for galactose.
Topologies with branched nodes having more than two substituents (except monovalent ones, like Ac) are excluded.
Polymers are not allowed to have more residues (except monovalent ones, like Ac) in side chains than in a backbone.
Linkages between two default outgoing centers of polyvalent residues (e.g. Glc(1-1)Gal or Fru(2-1)Glc) are not allowed. Cases like Glc(1-1)Me are considered as widespread.
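
Some of these limitations are simple threshold checks. The following is a minimal sketch of how the superclass, branching, and side-chain rules could be tested. The function names and the input layout are hypothetical; only the thresholds (more than 20 occurrences, at most two substituents per branching node, side chains not larger than the backbone) are taken from the list above.

```python
MIN_OCCURRENCES = 20   # a residue counts as widespread if reported MORE than 20 times
MAX_SUBSTITUENTS = 2   # per branching node, monovalent residues (e.g. Ac) not counted


def widespread_members(occurrences):
    """Return superclass members that pass the occurrence threshold.

    occurrences maps a residue name to its number of reported occurrences.
    """
    return sorted(name for name, count in occurrences.items()
                  if count > MIN_OCCURRENCES)


def topology_is_widespread(substituents_per_node, backbone_residues,
                           side_chain_residues):
    """Check the branching and side-chain rules (monovalent residues excluded)."""
    if any(n > MAX_SUBSTITUENTS for n in substituents_per_node):
        return False
    return side_chain_residues <= backbone_residues
```

For instance, with made-up counts `{"Glc": 5000, "Xyz": 3}`, only Glc would be kept.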

Predicted structures are ranked according to the similarity of the experimental vs. simulated NMR spectrum (mean absolute deviation, also designated as Δ). Other goodness-of-fit metrics are also reported: RMS deviation, linear correlation factor, and trustworthiness reported by the simulation engine. A few structural parameters do not significantly affect the NMR observables in saccharides. If such parameters are permuted, the resulting structural variants occupy neighboring positions in the result list, and their metrics are close to each other. In this case you are expected to use common sense or additional experiments to identify the proper variant.

The experimental NMR spectrum is allowed to have false signals or missing signals, although the bigger the difference between the expected and provided number of signals, the less accurate the results and the longer the calculation. In this case the deviation between spectra is calculated after the worst-fitting signals are removed from the bigger spectrum. Every such removal (as well as every unpredicted signal in the simulated spectrum) adds a penalty to the resulting metric.
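
The idea of removing worst-fitting signals can be illustrated as follows. This is only a sketch under stated assumptions: the greedy removal strategy, the function names, and the penalty value of 0.5 ppm per removed signal are hypothetical and are not the actual GRASS procedure.

```python
def mean_abs_dev(a, b):
    """Mean absolute deviation between two equally sized sorted spectra."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def trimmed_deviation(experimental, simulated, penalty_ppm=0.5):
    """Compare spectra of unequal size after trimming the bigger one.

    Signals are removed one by one from the bigger spectrum, each time
    choosing the signal whose removal gives the lowest deviation.
    penalty_ppm (a hypothetical value) is added for every removed signal.
    """
    if not experimental or not simulated:
        raise ValueError("both spectra must contain at least one signal")
    big, small = sorted(experimental), sorted(simulated)
    if len(big) < len(small):
        big, small = small, big
    removed = 0
    while len(big) > len(small):
        worst = min(range(len(big)),
                    key=lambda i: mean_abs_dev(big[:i] + big[i + 1:], small))
        del big[worst]
        removed += 1
    return mean_abs_dev(big, small) + penalty_ppm * removed
```

In this sketch a spectrum with one false signal and otherwise perfect agreement gets a deviation equal to one penalty unit, so it still ranks below an exact match.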

If there are many unknowns, the calculation can take a long time. You can close your browser; the calculation will go on in the background, and you will be notified by e-mail when the results are ready. Alternatively, you can periodically check the link to the job you have started, or wait for the results in your browser. If your browser session is closed by timeout, the calculation does not stop, and you can use the provided link to access your results. Besides prediction results, logs of the generated structures and calculation errors are available.

The structure iterator and ranking scheme were developed by Roman Kapaev and Philip Toukach within the CSDB project. On usage, please cite this publication: R.R. Kapaev, Ph.V. Toukach "GRASS: semi-automated NMR-based structure elucidation of saccharides" (Bioinformatics 2018, 34(6): 957-963. DOI: 10.1093/bioinformatics/btx696).

Pros and cons

Advantages of GRASS as compared to other solutions:

Great variety of supported structural features (oligomers and polymers; glycosidic, amidic, and phosphodiester linkages; more than 500 residues, including higher sugars, furanoses, alditols, amino acids, fatty acids etc.).
Flexible structural constraints providing multiple gradations from "only the number of residues is known" to "everything is known except one linkage or one anomericity".
Refinement of predictions with methods reported in 2016 as most accurate for carbohydrate NMR simulation [ref. 7 in dedicated publications]. The simulation module has been validated on thousands of glycopolymers and glycoconjugates.
Prediction of a set of assigned and exportable 1D and 2D NMR spectra and PDB geometry in one click for every generated structural hypothesis.
User friendliness and detailed help
High performance: minutes for common tasks; hours for highly sophisticated tasks with the lack of information (the latter is not recommended).
GRASS accuracy is confirmed by thorough validation on a test pool of over 500 glycans and derivatives (for details see Accuracy and performance).

Limitations of GRASS:

NMR spectra of carbohydrates are insensitive to a few structural features (e.g. if you have a 1-4 bond to a galactose, absolute configurations of bonded residues do not affect the NMR observables [ref.]). Because of this, NMR-based ranking can hardly distinguish structures that differ in these features, and non-NMR data are required for further verification of such hypotheses.
Mirrored sets of absolute configurations of residues (e.g. DGal-DGal-LFuc and LGal-LGal-DFuc) have the same NMR spectra in water. Because of this, common sense or additional experiments are needed to select one of the two sets.
If a composition contains superclasses or exotic residues, not all of the generated structures can be processed in phase 1 due to the lack of empirical data, and thus are missing from the result list. To diminish this limitation, we are continuously working on the expansion of the empirical simulation module to support more residues.
In case of loose structural constraints (if you have many unknown configurations or linkages), discrimination between top-matched structures can be poor, especially for structures with more than five sugar residues. Typical tool usage implies that you know at least the monomeric composition and, preferably, one of the structural patterns (configurations or linkages). In this case the remaining pattern and the residue sequence are predicted.
The algorithm is intolerant to deviations in the number of signals in the experimental spectrum, which can be caused by poor peak picking of weak signals or by overlaps. This means that you must have exactly the same number of signals as the number of carbons in a structure. Allowing one or two signals to be skipped is on our to-do list for 2017.
Although dual linkages (e.g. pyruvate hemiacetal or cyclic chelating lactate) are supported by the simulation engine, such structures are not iterated.
You cannot automatically analyze the raw experimental NMR spectrum from an NMR machine; you have to pick the peaks and enter them manually.
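
Because mirrored sets of absolute configurations give the same spectra in water, it can help to recognize such pairs in a result list. The snippet below is a minimal sketch, assuming that configurations are given as explicit D/L (or R/S) labels, one per residue; the function names are illustrative, and residues whose configuration is implied in the name are not handled.

```python
_MIRROR = {"D": "L", "L": "D", "R": "S", "S": "R"}


def mirror(configs):
    """Return the mirrored set of absolute configurations.

    Labels other than D, L, R, S (e.g. "?" or "" for achiral residues)
    are left unchanged.
    """
    return tuple(_MIRROR.get(c, c) for c in configs)


def indistinguishable_in_water(configs_a, configs_b):
    """True if two configuration sets are identical or mirror images."""
    a, b = tuple(configs_a), tuple(configs_b)
    return a == b or mirror(a) == b
```

For the example above, `("D", "D", "L")` and `("L", "L", "D")` are reported as indistinguishable, so only non-NMR data can select between them.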

Accuracy and performance

GRASS prediction power and performance have been tested on a set of 556 structures with published ¹³C NMR spectra recorded in water and possessing various types of structural features encountered in glycobiological research. If the ¹³C NMR spectrum of a structure was stored in CSDB, it was virtually removed from the database to avoid bias in the statistical NMR prediction. The prediction power was validated in eight modes differing in the knowns/unknowns ratio. The figure displays the percentage of hits, i.e. cases when the correct structure was reported in the list of top-N hypotheses predicted by GRASS:

eight graphs correspond to eight modes; the knowns and unknowns are listed above the graphs. The list of substitutable positions, the number of substituents, and anomericity refer to each residue. The total number of β-sugars and the widespread flag refer to the whole structure. In these tests, monomeric composition (including absolute configuration and ring size, if applicable) and polymer vs. oligomer status were always known. For oligomers it was also known which residue is at the reducing end.
color and shape of icons represent data on structures having different numbers of residues (from mono- to octasaccharide). In this summary, acetic acid residues attached to amino groups were not counted as separate residues. The fewer residues per structure, the better the results (more hits).
horizontal scales represent the size of the top-ranked hypotheses list (1 to 10). Obviously, the longer the top list, the more hits are observed.

As an example, mode 1 implies specification of polymericity, monomeric composition, and the total number of β-sugars per structure unit, as well as the allowed substitution positions, number of substituents, and N-acetylation type (allowed/forbidden/demanded) for each residue. Other parameters are unrestricted by default. In this mode, GRASS provides fair predictions of the structural parameters that are most difficult to determine without full NMR assignment, namely the sequence of residues and their anomeric configurations. For relatively complex structures having five or six residues, correct answers were among the top five matches in ~75% of cases. The prediction power in the most complicated cases (seven or more residues) lies within 60%; however, it can be enhanced by adding more structural constraints.
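
The percentage of hits for a top-N list can be computed as shown below. This is an illustrative sketch only: the function name and the input format (the rank of the correct structure for each test case, or None if it was not found) are assumptions, and the numbers in the usage note are made up.

```python
def top_n_hit_rate(ranks, n):
    """Percentage of test cases whose correct structure is ranked within top n.

    ranks holds the 1-based rank of the correct structure for every test
    case, or None if the correct structure is absent from the result list.
    """
    if not ranks:
        raise ValueError("at least one test case is required")
    hits = sum(1 for rank in ranks if rank is not None and rank <= n)
    return 100.0 * hits / len(ranks)
```

With made-up ranks `[1, 3, None, 7]`, the top-5 hit rate is 50%, and the top-10 hit rate is 75%.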

There is no substantial decrease in accuracy if the total number of β-sugars and/or the number of substituents per residue are not constrained. However, for structures bigger than a trisaccharide, it is highly desirable to restrict the allowed substitution positions using data from the methylation analysis. Providing no structural constraints except the number of residues is reasonable for structures with one or two residues.

Most practical problems in natural carbohydrate research take from several minutes to one hour to calculate. As an example, performance in mode 1 is statistically summarized in the figure on the right.

Usage

The simplest way to use GRASS is to press Add residue as many times as the number of residues in a structure, select residues from drop-down lists, paste ¹³C NMR chemical shifts, press the Go! button, and read the best-fitting structure under position #1 in the result list. Click here to view a screenshot with calculation of a polymer containing one glucose residue and one fucose residue per repeat unit. The screenshot was made with advanced options hidden. To show them, click More options.... The details of GRASS user interface are discussed below.

The upper part of the screen contains job management tools. The Reset (1) link resets the iterator and predictor to the default state.

The prefix field (2) filters the job list to jobs whose names begin with this prefix. The default prefix is MyJob.

The Load (4) link displays the list of recent jobs (5) with names beginning with the specified prefix. The jobs are kept on the server for a month, so please download and back up your job files to keep them permanently. Click on a job file in the list to load a job. The jobs displayed in grey contain only task formulation data but no calculation results. Those containing results are displayed in black and are decorated with the R character. The DL link (6) downloads a flat job file. If you see a progress indicator (7), the corresponding job is currently running. Clicking on this indicator invokes a password-protected job termination dialog. If you load an active job, the progress status will start updating (see below) and the browser will wait for the results.

The Save (3) link saves the current job (including results, if present) on the server. The job name consists of the prefix specified in (2) and the current date and time. Every time you (re)start the calculation, a new timestamp is used, and the job is saved automatically. If you save your job while a previous calculation is still running in your browser session, the session is closed and you start over with another timestamp. However, the previously started job continues running in the background, and you can load it using Load job or the result link provided.

Pressing the Go! button starts the calculation of your job. When the calculation is finished, the results are saved to a file. A link to this file is provided when the calculation is started (4). If the browser session is still open, results are also delivered to the browser. If you close the browser, the calculation goes on in the background and the results are e-mailed to you when they are ready. The result link persists after the session is closed, so you can use it in the future to access your results. Please keep in mind that if you close or reload the page, or save another instance of your job, this link will no longer be displayed. So for long calculations, if you did not provide your e-mail, it is a good idea to save this link to be able to load the job results in the future.

To prevent overloading the server with concurrent tasks started by pressing Go! multiple times, the button is blocked during the calculation. When the results are obtained and fetched to the browser, the button is unblocked. You can save the job and load it again to unblock the Go! button earlier; in this case your previous job will continue running in the background.

During the calculation, its progress is displayed. The first phase includes up to 400 steps, for each of which the number of generated structures is reported (1). If some of these structures failed to be processed (e.g. due to the presence of exotic structural features unsupported by the empirical simulation engine), this number is shown in the form X of Y structures, where Y structures were generated, but only X of them could be processed. The total number of generated structures is reported below (2). The second phase consists of refining up to 500 best-fitting structures by the statistical simulation and forming a few top matches.
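
The two phases amount to a coarse-to-fine ranking. The sketch below illustrates the idea only and is not the GRASS implementation: `empirical_dev` and `statistical_dev` are hypothetical callables standing in for the two simulation engines, and a return value of None models a structure that the empirical engine could not process.

```python
def two_phase_rank(structures, empirical_dev, statistical_dev,
                   refine_limit=500, top_n=10):
    """Rank structures by a fast score, then refine the best ones.

    Returns (top matches as (deviation, structure) pairs, number of
    structures that failed in the first phase).
    """
    scored, failed = [], 0
    for structure in structures:
        deviation = empirical_dev(structure)   # phase 1: fast estimate
        if deviation is None:                  # unsupported structure
            failed += 1
            continue
        scored.append((deviation, structure))
    scored.sort(key=lambda pair: pair[0])
    shortlist = [structure for _, structure in scored[:refine_limit]]
    # phase 2: slower, more accurate simulation of the shortlist only
    refined = sorted(((statistical_dev(s), s) for s in shortlist),
                     key=lambda pair: pair[0])
    return refined[:top_n], failed
```

The default of 500 refined structures follows the text; the expensive simulation is never run on structures outside the shortlist, which is what keeps the second phase bounded.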

By pressing the stop sign (3) you can terminate the current job and release your IP address for your other jobs. However, if you save the job (which creates a new timestamp), load another job, or close the browser, the STOP sign will no longer be available, and you will not be able to abort the calculation of your job, although you can see which jobs are running in the Load job dialog. When a stop is requested, a temporary new window displays a confirmation of job abortion or an error message.

A rough estimate of how long you have to wait for the results is displayed below the process status. If the estimated job duration exceeds twelve hours, you are not allowed to run the calculation for free. To overcome this limitation, reduce the number of unknowns in your task (apply more structural constraints), or register your e-mail as a priority user for $30 per month. Regular users can run up to two simultaneous jobs (for no longer than 12 hours each). Priority users, identified by e-mail and client IP address, can run up to four simultaneous jobs for no longer than five days each.
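
The quotas above can be summarized as a simple admission check. This is a sketch with hypothetical names, not the server code; only the numeric limits come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobLimits:
    max_jobs: int      # simultaneous jobs
    max_hours: float   # longest allowed estimated duration of one job


REGULAR = JobLimits(max_jobs=2, max_hours=12)
PRIORITY = JobLimits(max_jobs=4, max_hours=5 * 24)


def may_start_job(estimated_hours, running_jobs, priority_user):
    """Check whether one more job may be started under the quotas above."""
    limits = PRIORITY if priority_user else REGULAR
    return (running_jobs < limits.max_jobs
            and estimated_hours <= limits.max_hours)
```

For example, a job estimated at 13 hours is rejected for a regular user but accepted for a priority user.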

Usage: input

Below the job tools you are expected to input the Structure generation constraints and calculation options (see the screenshot below). You can display fewer options by clicking Hide (2). Some fields in this form can change implicitly in response to user input in other fields. Every time this happens, or when user input is considered non-optimal, a warning sign appears near the field that requires attention. Clicking on this sign displays an explanation of the warning. To avoid unexpected results, it is recommended to check all warnings before starting the calculation.

The only obligatory and non-iterated parameter is the number of residues in an oligomeric structure or in a polymer repeating unit. The default form has only one residue. You can add residues by clicking the Add residue link (1), and remove them by clicking the cross sign (13) at the right of the corresponding row. If a certain residue is present in the structure more than once, add it several times. Please note that monovalent substituents, such as acetic acid, are always considered as distinct residues, regardless of whether they are part of an OAc, NAc, or other group. Every row in the form provides constraints on a single residue (identified by an order number at the beginning of the row):

The residue name (5) allows selection of residues from a drop-down list containing about a hundred of the most widespread (>20 occurrences in CSDB) residues categorized by type. The order number (at the beginning of each row) is NOT correlated to the location of a residue in iterated structures. For fast navigation in this list, start typing the residue name. If the desired residue is missing, use the first or the last line (Show all residues) to access the full list of residues. The drop-down list also contains superclasses, e.g. Any hexose, to iterate through all known residues belonging to a certain group. To reduce calculation time and improve discrimination between top-matched structures, it is recommended to avoid superclasses (especially highly populated ones), and to apply configurational constraints (anomeric, absolute, and ring size) for all residues whose configurations are known. As an example, the presence of the superclass Any residue increases the total number of iterated structures by ca. 200 times in widespread mode and by ca. 2000 times in all structures mode.
α/β selector of the anomeric configuration (3) is displayed for cyclic sugar residues and can be set to α, β, or ? (unknown).
D/L selector of the absolute configuration (4) is displayed for optically active residues and can be set to D, L, or ? (unknown). Some residues have R and S options instead of D and L. For some other residues (e.g. Kdo, L-gro-D-manHep, or Abe) absolute configuration is already implied in the residue name. In these cases the selector is not displayed. If a residue has a common absolute configuration (e.g. D for glucose), it is preselected on residue name change, but you can still modify it or set to unknown.
Ring form selector (6) is displayed for sugar residues and can be set to pyranose, furanose, open-chain, alditol, or ? (unknown). Alditols are included here to reduce the length of a residue list, and when you select alditol, it actually changes the residue itself, rather than its ring form. If a residue has a common ring form (e.g. pyranose for glucose), it is preselected on residue name change, but you can still modify it or set to unknown.
Checkboxes in the Allowed linkages block (8) indicate which positions in the residue can be substituted during structural permutations. All generated structures are checked for chemical feasibility. The number of options depends on the carbon skeleton size of a residue. All carbons beyond C6 are treated together as C7+ (i.e. these positions can either be all allowed or all forbidden). Please note that checking a certain position allows substitution at this position but does not demand it, so all boxes checked mean no limitations. Unchecking a box forbids substitution at this position. Depending on the deoxygenation pattern, ring size, and functionalization of a residue and on its location in the structure (e.g. terminal), some of these positions can be locked in either checked or unchecked state. The outgoing bond from the default center (C1 in aldoses and most non-sugar residues, C2 in ketoses, both C1 and C2 in amino acids or the like) is always allowed unless the residue location is the reducing end. If you check C1, you also allow ingoing bonds at this position, as in Glc(1-1)Glc. Phosphoric and sulfuric acid residues are always allowed to form one bond, and they are allowed to form two bonds if the location is not terminal or reducing. The link None or All (8) on the right allows setting or clearing all chemically possible positions at once.
The Min in and Max in drop-down lists (9) specify the minimal and maximal number of substituents (donors) that can be attached to the residue. Outgoing bond acceptors are not considered as substituents. The default value (?) means no limitations (0 for Min in; 9 for Max in). A hyphen (-) in Max in means that the residue cannot have ingoing linkages. These selectors are shown only for residues that have at least one substitutable position, except the default donating atom (e.g. C1); the number of options depends on the substitutable positions and the total number of residues.
Location selector (10) restricts the location that a residue can occupy in iterated structures. Possible variants are Any, Terminal (at the non-reducing end of an oligomer or a side chain), Reducing (at the reducing end; available for oligomers only), and Not red. (any location except the reducing end). Only one residue can be Reducing. Please note that if a residue has an alkyl (or other) group at the anomeric center, it is no longer considered as reducing, because the alcohol residue itself becomes pseudo-reducing. For monovalent residues this selector is not shown.
The N-acetylation type selector (11) declares whether free amino groups of a residue should be acetylated. The following options are possible: Demanded (always N-acetylate), Forbidden (do not N-acetylate), and Allowed (generate both variants). If the default option (Allowed) is selected, this residue will be present in iterated structures both as N-acetylated and non-N-acetylated, if this does not contradict other constraints, such as the number of acetic acid residues. When Demanded is selected, acetic acid residues are added or reused, and attached to the current residue as to an acceptor. When you select Forbidden, connected acetic acid residues are removed. When you select Allowed, only the connection is removed, but acetic acid residues remain in the composition.
Acceptors field (12) lists the order numbers of residues to which the residue can donate by its outgoing linkage. It is not shown for a residue at the oligomer reducing end, which cannot have acceptors. Any means no limitations. Clicking on this field displays a menu where you can select allowed acceptors from the existing rows. For multiple selection, click with the Ctrl key held down. To finish selecting, click outside the menu. When you specify certain acceptors, the partial structures are visualized by arrows connecting residue icons in the preview area (7). Please note that a residue can have itself as an acceptor, implying that it is linked to another instance of the same residue in the next repeating unit of a polymer. If the structural scope includes oligomers only, self-linkage is disabled.
The cross sign (13) at the end of the row removes a residue from the monomeric composition. It is shown if there are two or more residues, excluding demanded N-acetyl groups. When you remove a residue, the order numbers of the remaining residues change, and if specific connections between residues were specified, they are recalculated accordingly. Removal of a residue with demanded N-acetate removes the connected acetic acid residue as well. To remove the connected N-acetate you should first disconnect it, or select forbidden as the N-acetylation state of the acceptor residue.
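
The per-residue constraints described in this list can be thought of as one record per row. The following is a minimal sketch of such a record with a couple of consistency checks; the class, field, and function names are hypothetical and do not reflect the actual job file format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class ResidueConstraint:
    name: str                       # residue or superclass, e.g. "Glc"
    anomeric: str = "?"             # "α", "β", or "?" (unknown)
    absolute: str = "?"             # "D", "L", "R", "S", or "?"
    ring: str = "?"                 # e.g. "pyranose", "furanose", or "?"
    allowed_positions: FrozenSet[str] = frozenset()
    min_in: int = 0                 # "?" in Min in means 0
    max_in: int = 9                 # "?" in Max in means 9
    location: str = "Any"           # Any, Terminal, Reducing, Not red.
    n_acetylation: str = "Allowed"  # Demanded, Forbidden, Allowed
    acceptors: Optional[FrozenSet[int]] = None  # None means Any


def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    if sum(row.location == "Reducing" for row in rows) > 1:
        problems.append("only one residue can be Reducing")
    for number, row in enumerate(rows, start=1):
        if row.min_in > row.max_in:
            problems.append(f"row {number}: Min in exceeds Max in")
        if row.acceptors is not None and any(
                not 1 <= a <= len(rows) for a in row.acceptors):
            problems.append(f"row {number}: acceptor refers to a missing row")
    return problems
```

An empty list returned by `check_rows` means that none of these particular inconsistencies were found; it does not replace the warnings shown in the form.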

For your reference, the summary of a residue, including its name with all configurations applied and an SNFG icon, is displayed in the preview area (7). If partial substructures were specified, they are represented by arrows connecting the residues. More than one outgoing arrow means that a residue can have any of the depicted acceptors. No outgoing arrows mean no limitations.

The table below residue constraints (search depth and scope) allows refining the structure iteration by providing structure-wide parameters:

Search depth (14) can be Widespread structures only (default) or All possible structures. The former option significantly speeds up the calculation and keeps the output from being cluttered by exotic structures. The differences between these modes are detailed above in the "Overview" section.
Next two checkboxes define polymericity (15) of the generated structures. You can check oligomers, polymers, or both. However, in most cases a researcher knows whether a structure to elucidate is oligomeric or polymeric, so please uncheck the inappropriate scope to speed up the calculation. Among polymeric structures, only those are iterated that have a distinct repeating unit (regular polymers).
β-anomers selectors (16) limit the search to structures having the specified (equal, greater or less) number of sugar residues in β-anomeric form (usually determined from the ¹J_CH or ³J_HH coupling constant measurement). If all residues that can have anomeric forms have explicitly specified anomeric configuration, this field is blocked with the calculated value. The minimal and maximal numbers in the drop-down list reflect the current number of known and unknown anomericities.
CH₂ carbons selector (17) limits the search to those structures containing the specified number of CH₂ groups (usually determined from the APT or DEPT-135 experiment). If residue composition does not contain superclasses, whose members can have different number of CH₂ groups, this field is blocked with the calculated value. The minimal and maximal numbers in the drop-down list reflect the current superclass pattern.
No furanoses flag (18) allows exclusion of all furanose-containing structures from iteration (following the absence of signals in the characteristic region of the ¹³C NMR spectrum). If ring sizes of all residues are specified explicitly, or there is at least one explicit furanose, this checkbox is blocked in the checked or unchecked state.
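
The limits of the β-anomers selector follow directly from the per-residue anomeric settings. A minimal sketch, assuming each residue is described by "α", "β", "?" (unknown), or None (no anomeric forms); the function name is illustrative.

```python
def beta_count_range(anomeric_configs):
    """Return the (minimal, maximal) possible number of β-anomers.

    Known β residues always count; every unknown ("?") residue may or
    may not be β. Residues without anomeric forms (None) are ignored.
    """
    known_beta = sum(1 for c in anomeric_configs if c == "β")
    unknown = sum(1 for c in anomeric_configs if c == "?")
    return known_beta, known_beta + unknown
```

When the two numbers coincide, there is nothing left to choose, which corresponds to the field being blocked with the calculated value.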

To Find best matching structures you are expected to input the unassigned experimental ¹³C NMR spectrum recorded in water (19). All generated structures are matched against this spectrum. Type or paste space- or newline-separated chemical shifts in arbitrary order. One digit after the decimal point is enough. For coinciding signals of multiple integral intensity, specify the chemical shift more than once. Please don't forget the signals of acetates and quaternary carbons (if the peaks are too small to pick, you can use 175 for every -COOH group). The expected number of signals depends on the specified monomeric composition; it can be a range if there are superclasses whose members have a variable number of carbon atoms. This range is displayed above the spectrum together with the current number of signals, which is highlighted in red if it does not conform to the expected limits. The maximal difference between the number of signals in the experimental and the simulated spectrum (22) defaults to 1 signal. If you select no signals, you must specify exactly all signals, and structures having a different number of non-equivalent carbons are filtered out. Selecting a greater maximal difference (1 or 2 signals) allows skipping true signals or adding false signals in the experimental spectrum, and thus introduces tolerance to signal picking errors and allows broader interpretation of superclasses. However, this decreases accuracy and slows down the calculation.
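
Before pasting a peak list, it can be checked locally. The sketch below parses space- or newline-separated shifts and compares their number with the expected range; the function names and the three-level status are assumptions made for illustration and are not the checks performed by the form.

```python
def parse_shifts(text):
    """Parse space- or newline-separated chemical shifts (ppm)."""
    return [float(token) for token in text.split()]


def signal_count_status(n_signals, expected_min, expected_max,
                        max_difference=1):
    """Compare the number of signals with the expected range.

    Returns "ok" inside the range, "tolerated" if the count is outside
    the range by no more than max_difference, and "rejected" otherwise.
    """
    if expected_min <= n_signals <= expected_max:
        return "ok"
    if (expected_min - max_difference <= n_signals
            <= expected_max + max_difference):
        return "tolerated"
    return "rejected"
```

Repeated values are kept as separate signals, which matches the instruction to enter a coinciding shift more than once.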

Two flow-control options include:

Save N best-fitting structures (21) to select the size of the result list. The actual size of this top list can be less than requested, if not enough structures could be generated or not enough structures passed the preliminary check. The preliminary check fails if the empirical estimate of inaccuracy exceeds 5 ppm. If you uncheck the checkbox, structures will not be compared to the experimental spectrum, which can be used for faster generation of the structure pool only, to study the diversity under the specified constraints.
Save generated structures (22) option tells the iterator to log all the iterated structures (in CSDB format) to a file. After the calculation is finished, this log can be shown or downloaded by clicking the Structure log link below the result table. This log can be used to analyze which structures were generated, and which of them failed to be processed by the NMR simulation engine (for example, if the results that you expected are missing from the top list). Please note that for complex tasks with many unknowns, the structure log can be huge, which slows down the result output in a browser.

Calculation of a complex job may take longer than your browser timeout, but it continues running in the background after the session ends. You will be notified by e-mail at the address you entered (24) when the results are ready, and the link to load the results will be provided. If you don't enter an e-mail address, you will not be notified, but you can periodically check the link displayed after you pressed the Go! button (23). If you are a priority user, you must specify the e-mail address to be recognized.

Usage: output

As soon as the calculation is finished, the results are saved on the server, and the page is reloaded to display them in a table containing the desired number of best-matching structures, one compound per row. The rows are sorted according to the deviation between experimental and simulated NMR spectra.

The left column of the result table displays metrics reflecting how well the structure in this row fits the experimental spectrum. It has several lines (1):

#N - the rank of a structure in the result list. #1 is the best-matching structure, #2 is the second best, etc.
Δ ~ N.NN - mean absolute deviation of the experimental vs. simulated spectrum, in ppm (the average distance between pairwise signals in two sorted unassigned spectra, the lower is better).
Corr = 0.NNN - linear correlation coefficient between the unassigned sorted experimental vs. simulated spectra (full match is 1.000).
RMS dev = N.NN - root-mean-square deviation of the experimental vs. simulated spectrum, in ppm.
Trust = NN% - trustworthiness of the statistical simulation of the ¹³C NMR spectrum, in percent (detailed in the Accuracy and trustworthiness section of Statistical NMR simulation). This metric is colored from red (0%) to green (100%).
If not enough spectra could be predicted statistically but there are empirical results from the first phase of simulation, results can contain such structures as well. In this case, only rank, Δ_EMP (mean deviation from empirical calculation), and Trust value from empirical calculation are displayed.
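
The three numeric metrics can be reproduced for any pair of unassigned spectra of equal size. This is a minimal sketch that follows the definitions above (both spectra are sorted and compared pairwise); the function name and the returned dictionary keys are illustrative.

```python
import math


def fit_metrics(experimental, simulated):
    """Mean absolute deviation, RMS deviation, and linear correlation.

    Both spectra are lists of chemical shifts (ppm) of equal length;
    they are sorted and compared signal by signal.
    """
    if len(experimental) != len(simulated) or not experimental:
        raise ValueError("spectra must be non-empty and of equal size")
    x, y = sorted(experimental), sorted(simulated)
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mean_abs = sum(abs(d) for d in diffs) / n
    rms = math.sqrt(sum(d * d for d in diffs) / n)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # correlation is undefined if all shifts in a spectrum are equal
    corr = sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")
    return {"delta": mean_abs, "rms_dev": rms, "corr": corr}
```

Since the RMS deviation weights large individual errors more strongly than Δ, a structure with one badly fitting signal can have a small Δ but a noticeably larger RMS deviation.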

The second column shows the predicted structure (structural hypothesis) in SNFG graphic format (2). You can switch between SNFG (graphic) and SweetDB (pseudographic) representations by clicking on Structure as text / Graphic structure link (4). Holding a mouse cursor over a graphical structure pops up a balloon with CSDB linear code of this structure.

Clicking on the 3D icon (5) opens an atomic model of the structure as predicted by Sweet-II 3D modeler at Glycosciences.DE. Further you can download atomic coordinates in PDB format or visually explore the model to estimate NOEs.

The Sim assignment button (3) opens a separate window, where a predicted structure is passed to the NMR simulation module. As a result, you get assigned ¹³C and ¹H spectra, results from three simulation methods (empirical, statistical, and hybrid), expected accuracy and trustworthiness on a per-atom basis, and plotted 2D NMR spectra based on chemical shift simulation and rough estimation of coupling constants. You can then export the simulated 2D spectra to process them externally and compare them to the experimental ones.

The statistically simulated ¹³C NMR spectrum (7) is schematically plotted in black. The line height depends only on the number of overlapped signals. These multiple-intensity signals have a small horizontal line at their lower end, indicating the range from which coinciding signals were gathered. The experimental spectrum (6), derived from the chemical shifts you entered, is displayed in grey above the simulated one for reference. Both spectra are aligned against the same scale for easy visual comparison. The sorted simulated chemical shifts (and other comments, e.g. warnings about missing signals) are listed below the spectrum for copying and pasting. If you need chemical shifts in the order of assignment to residues and atoms, press the Sim assignment button and use the export TSV feature.

Below the table you will find a copy of the link to load the task and its calculation results (8), and links to show the structure and error logs.

Clicking on Show structure log (9) displays a report on structure generation. It contains the number of steps at phase one, the number of structures generated at each step, and the number of structures refined in phase two. The number of structures for which empirical simulation failed to run (e.g., due to the presence of exotic residues) is also included. Clicking the link once again hides the structure log. If you checked the option Save generated structures, all generated structures are also saved to this log in CSDB Linear format. The prefix failed means that a structure could not be processed, and thus it is a priori missing from the results. For complex tasks, there can be millions of structures in this log, so if you chose to save generated structures, please be patient while your browser opens the log.

Clicking on Show error log (10) displays the error log. Besides error messages explaining why a certain structure could not be processed, it contains memory and performance data. Clicking the link once again hides the error log.

Structure iteration algorithm

1. The structure iteration begins with processing user-defined residue names, anomeric and absolute configurations, and ring forms. For each node of the structure composition, a set of candidate residues is generated based on these constraints. For example, if the residue is rhamnose, the anomeric configuration (α/β) is undefined (?), the absolute configuration is D, and the ring form is pyranose, the set for this residue will contain two instances, α- and β-D-rhamnopyranose. If the no furanoses option is set, the resulting residue sets will lack residues in furanose form. In the widespread mode, only common residues will be included in the residue sets where possible. For example, if the residue name is glucose, the anomeric configuration is α, the absolute configuration is ?, and the ring form is ?, the resulting residue set will include only α-D-glucopyranose; however, both D- and L- forms of α-glucofuranose will be generated if the ring form is explicitly set to furanose (for which no widespread residues are possible).
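The expansion of undefined fields in step 1 can be illustrated with a minimal Python sketch. The function name expand_residue, its parameters, and the short residue notation are hypothetical and chosen only for illustration; this is not the CSDB implementation, and the widespread mode (which needs knowledge of which residues are common) is deliberately not modeled.

```python
from itertools import product

def expand_residue(name, anomer="?", absolute="?", ring="?", no_furanoses=False):
    """Expand undefined ('?') fields into all candidate residues.

    Hypothetical helper: 'a'/'b' stand for alpha/beta, 'D'/'L' for the
    absolute configuration, 'p'/'f' for pyranose/furanose.
    """
    anomers = ["a", "b"] if anomer == "?" else [anomer]
    absolutes = ["D", "L"] if absolute == "?" else [absolute]
    rings = ["p", "f"] if ring == "?" else [ring]
    if no_furanoses:
        # Mimics the "no furanoses" option: drop furanose ring forms.
        rings = [r for r in rings if r != "f"]
    return [f"{a}{c}{name}{r}" for a, c, r in product(anomers, absolutes, rings)]

# Rhamnose, anomeric configuration undefined, D, pyranose:
print(expand_residue("Rha", anomer="?", absolute="D", ring="p"))
# → ['aDRhap', 'bDRhap']
```

The printed set corresponds to the α- and β-D-rhamnopyranose example given above.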

2. The next step is obtaining a list of appropriate structure topologies having a given number of residues. A topology is a directed graph, which represents how residues are linked with each other, ignoring the nature of the residues. All residues are considered as identical objects (nodes), which may have one or no outgoing linkages (acceptors) and up to four incoming linkages (substituents). To explore topologies in graphic form, click here.

Here and below, we represent topologies as strings of digits (in bold). Within the strings, the digit position (starting from 1 at the left of string) is the node order, and the digit itself stands for the order of a node it donates to. If a node has no outgoing bonds (reducing end), its digit is zero. Here are several examples of the encoded topologies:

linear trimeric repeating unit of a polymer: 312
linear trisaccharide: 012
branched trisaccharide: 011
pentasaccharide repeating unit including a single-residue side branch: 51134
hexasaccharide repeating unit with one single-residue side chain per each of the three backbone residues: 511335
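The encoding can be read mechanically, as the following Python sketch shows. The helper names are hypothetical, and the sketch assumes one digit per node (at most nine residues), which holds for the examples above but is an assumption about the general case.

```python
def decode_topology(code):
    """Map each node (1-based position) to the node it donates to (0 = reducing end)."""
    return {pos: int(digit) for pos, digit in enumerate(code, start=1)}

def substituents(code):
    """Return, for every node, the list of nodes that donate to it."""
    subs = {pos: [] for pos in range(1, len(code) + 1)}
    for node, acceptor in decode_topology(code).items():
        if acceptor != 0:
            subs[acceptor].append(node)
    return subs

def is_repeating_unit(code):
    """A topology with no zero digit has no reducing end, i.e. it closes into a repeat."""
    return "0" not in code

print(decode_topology("012"))    # → {1: 0, 2: 1, 3: 2}
print(substituents("011"))       # → {1: [2, 3], 2: [], 3: []}
print(is_repeating_unit("312"))  # → True
```

For 011, node 1 carries two substituents (nodes 2 and 3), which is what makes this trisaccharide branched.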

At this stage, the total number of residues, the structural scope (polymers/oligomers), the search depth (widespread or all possible structures) and the number of terminal residues are taken into account. In the widespread mode, topologies are filtered out if they a priori meet at least one of the following criteria:

there are more residues in side chains (not counting monovalent ones) than in the backbone. For example, if the structural scope is polymers and the residue composition does not include monovalent residues, 112 will be excluded.
any node in a topology has three or more non-monovalent substituents. For example, 4111 will be excluded if there are no monovalent residues in the structure.
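The two criteria can be sketched in Python for the simplest case. The sketch assumes a repeating-unit (polymer) topology with one digit per node and no monovalent residues in the composition, and it takes the backbone to be the cycle of the graph; the function names are hypothetical and the real engine may determine the backbone differently.

```python
def backbone_nodes(code):
    """Nodes on the cycle of a repeating-unit topology (assumes no zero digit)."""
    acceptor = {pos: int(d) for pos, d in enumerate(code, start=1)}
    seen, node = [], 1
    while node not in seen:          # walk along outgoing linkages until a node repeats
        seen.append(node)
        node = acceptor[node]
    return set(seen[seen.index(node):])

def passes_widespread_filters(code):
    """Apply both criteria, assuming every residue is non-monovalent."""
    acceptor = {pos: int(d) for pos, d in enumerate(code, start=1)}
    backbone = backbone_nodes(code)
    side_chain = set(acceptor) - backbone
    if len(side_chain) > len(backbone):      # criterion 1
        return False
    counts = {}
    for target in acceptor.values():
        counts[target] = counts.get(target, 0) + 1
    return all(n < 3 for n in counts.values())  # criterion 2

print(passes_widespread_filters("112"))    # → False (two side-chain nodes, one backbone node)
print(passes_widespread_filters("4111"))   # → False (node 1 has three substituents)
print(passes_widespread_filters("51134"))  # → True
```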

For each topology that passed through these filters, a set of all possible transposition operators (T-operators) is derived. A T-operator defines a transposition (of nodes within topology), which does not affect any structure having this topology. For example, structure aDGalp(1-4)[bDManp(1-3)]bDGlcp, which has topology 011, does not change if node 2 (galactose) and node 3 (mannose) are swapped, and substitution positions are swapped accordingly. Hence, there is a T-operator for 011, standing for exchanging nodes 2 and 3.
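One way to picture T-operators is as the non-trivial automorphisms of the topology graph, that is, node permutations that preserve every linkage. The brute-force Python sketch below rests on that interpretation, which is an assumption rather than a description of the CSDB engine; the function name is hypothetical, and the enumeration is only practical for small topologies.

```python
from itertools import permutations

def t_operators(code):
    """List node permutations (other than identity) that map the topology onto itself."""
    parent = [0] + [int(d) for d in code]    # parent[0] = 0 stands for "no acceptor"
    n = len(code)
    identity = tuple(range(1, n + 1))
    operators = []
    for perm in permutations(range(1, n + 1)):
        image = (0,) + perm                  # image[i] is where node i is moved; 0 stays 0
        if perm != identity and all(
            parent[image[i]] == image[parent[i]] for i in range(1, n + 1)
        ):
            operators.append(perm)
    return operators

print(t_operators("011"))  # → [(1, 3, 2)]
print(t_operators("012"))  # → []
```

For 011 the single operator (1, 3, 2) exchanges nodes 2 and 3, matching the example above; the linear trisaccharide 012 has none.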

3. For each topology matching the constraints, all possible combinations of residue set arrangements (RSAs) are generated. Duplicate arrangements are filtered out by applying T-operators to each new arrangement and checking whether it has already been generated. At this stage, allowed acceptors, allowed locations (reducing end, terminal, etc.) and minimal/maximal numbers of substituents for each residue are taken into account. In most cases, when all residue sets include residues of the same valency (this is true for any residue name except the ANY superclass), RSAs in the widespread mode are filtered out if they contain a residue with three or more non-monovalent substituents.

4. For each RSA, all combinations of particular residues are generated. Duplicate instances are filtered out by applying T-operators. Here, user-defined N-acetylation patterns (allowed, demanded or forbidden linkage of acetic acid residues to amino groups), the total number of β-sugars, the total number of signals in the experimental spectrum (total number of non-equivalent carbons), and the total number of CH₂ carbons are utilized for structure filtering. The resulting combinations that have residues with three or more non-monovalent substituents are filtered out in case they have not been excluded during step 3.

5. For each combination of residues, all possible combinations of substitution positions are generated. Duplicate instances are filtered out by applying T-operators. The substitution positions are iterated in accordance with the user input and the chemical possibility of residues to form ester, ether or amide bonds. In the widespread mode, linkages between default outgoing centers (e.g., Glc(1-1)Glc) of two non-monovalent residues are excluded.

The above notes are summarized in the figure:

Technical notes on the NMR data storage in CSDB

NMR spectrum encoding

This section describes the format of the NMR data in the CSDB dump file (although in the database the spectra are stored as relational tables). You need this only if you wish to upload data to CSDB.

The NMRH and NMRC fields describe assigned ¹H and ¹³C NMR spectra, respectively. The subspectra of each residue should be separated by a double slash (//) and have the following format:
#<residue linkage>_<residue name> <chemical shifts>. The residue names and their linkage paths should match the structure record (see 'Structure encoding' for details).

<Residue linkage> is a comma-separated sequence of linkage positions that leads to the residue from the root of the structure (e.g. #3,2_A for residue A in the structure A(1-2)B(1-3)C). The root is either the reducing end residue or the rightmost residue in the polymer repeating unit. The linkage of the root residue is always #. Each linkage position is the number of the corresponding carbon, or zero if there is no such number (e.g. for phosphoric acid residues). For carbon enumeration, please refer to the Monomeric namespace subdatabase. If the linkage is unknown, use a question mark (?), as in the structure record. To build the linkage through fuzzy residues, use variants with smaller substitution positions (e.g. #2,4_A but not #2,6_A in the structure A(1-2)<B(1-4)|B(1-6)>C). For dually-linked residues, use the first value as a linkage (e.g. #4_A but not #6_A for the structure A(2-4:2-6)B).

<Chemical shifts> is a space-separated list of chemical shifts of this residue in ascending order of carbon atom number. To resolve ambiguities regarding signal order or equivalent atoms, consult the atomic patterns at the Monomeric namespace subdatabase (generally, the order conforms to IUPAC numbering). If a certain chemical shift is unknown, use a question mark (?). In ¹H NMR spectra, there may be more than one proton signal, or none, corresponding to certain carbon numbers. In this case, separate multiple values by a hyphen (-) or use a single hyphen if there is no signal (e.g. -COOH groups or quaternary carbons). Please note that every carbon in the residue must have a corresponding field in the ¹³C or ¹H NMR subspectrum (- or ? or value), unless residue signals have not been published at all (in this case, leave the subspectrum blank, e.g. #2_Ac //).

For example, the ¹H NMR assignment table for -4)aLFucp(1-P-3)[Ac(1-5)]aXNeup(2- may look like this:
#4,0_aLFucp 5.01 3.80 4.30 4.25 4.70 1.21// #4_P // #5_Ac - 2.02// #_aXNeup - - 2.32-2.38 3.60 3.40 4.40 ? 3.62 3.50-3.55 //
Here, H7 of neuraminic acid is unassigned.
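If you prepare such fields programmatically, it may help to parse them back and check that every carbon has a field. The following Python sketch reads the format described above; the function name and the returned dictionary layout are hypothetical, and the sketch assumes that chemical shifts are non-negative (a negative value would collide with the hyphen separator).

```python
def parse_nmr_field(field):
    """Parse an NMRH/NMRC dump field into a list of per-residue subspectra."""
    residues = []
    for chunk in field.split("//"):
        chunk = chunk.strip()
        if not chunk:
            continue                          # skip the empty tail after the last //
        label, *tokens = chunk.split()
        path, _, name = label.lstrip("#").partition("_")
        shifts = []
        for token in tokens:
            if token == "-":
                shifts.append([])             # no signal for this carbon
            elif token == "?":
                shifts.append(None)           # shift unknown / unassigned
            else:
                shifts.append([float(v) for v in token.split("-")])
        residues.append({
            "path": path.split(",") if path else [],  # kept as strings: may contain '?'
            "name": name,
            "shifts": shifts,
        })
    return residues

example = ("#4,0_aLFucp 5.01 3.80 4.30 4.25 4.70 1.21// #4_P // #5_Ac - 2.02// "
           "#_aXNeup - - 2.32-2.38 3.60 3.40 4.40 ? 3.62 3.50-3.55 //")
for residue in parse_nmr_field(example):
    print(residue["path"], residue["name"], len(residue["shifts"]))
```

In the parsed result, an empty list marks a carbon without protons, None marks an unknown shift, and a residue with no published signals (such as P here) has an empty shift list.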

NMR spectrum comparison

During the search for NMR signals in the database (see Usage for help on the NMR search interface), spectra are ranked according to their similarity to the search term. To calculate the similarity between two spectra, the CSDB engine forms all possible subspectra of the larger NMR spectrum with the number of signals equal to that found in the smaller spectrum, and the best-fitting subspectrum is used to calculate the similarity value. Similarity is the reciprocal of the mean pairwise deviation (in ppm) between signals in the sorted spectra: 1 means the average difference is 1 ppm, 10 means it is 0.1 ppm, 0.1 means it is 10 ppm, etc. A value of 1000 stands for full similarity (exact match of chemical shifts). Please note that the similarity may be very high if there are only a few signals in the smaller spectrum, but they fit well. Good similarity values for carbon spectra are 1 and above; good values for proton spectra are 5 and above.
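The calculation described above can be sketched in Python as follows. This is a brute-force illustration, not the CSDB code: the function name is hypothetical, the enumeration of subspectra grows combinatorially and is only suitable for short signal lists, and capping all values at 1000 (rather than only exact matches) is an assumption.

```python
from itertools import combinations

def similarity(spectrum_a, spectrum_b, cap=1000.0):
    """Reciprocal of the mean pairwise deviation (ppm) for the best-fitting subspectrum."""
    small, large = sorted((spectrum_a, spectrum_b), key=len)
    if not small:
        raise ValueError("both spectra must contain at least one signal")
    small = sorted(small)
    # Try every subspectrum of the larger spectrum that has as many signals
    # as the smaller one, and keep the lowest mean deviation.
    best = min(
        sum(abs(x - y) for x, y in zip(small, sorted(sub))) / len(small)
        for sub in combinations(large, len(small))
    )
    if best == 0:
        return cap                    # exact match of chemical shifts
    return min(cap, 1.0 / best)

print(similarity([100.0], [101.0]))                   # → 1.0
print(similarity([100.0, 70.0], [70.0, 100.0, 60.0])) # → 1000.0
```

The second call illustrates the caveat in the text: a two-signal query that happens to match two signals of a larger spectrum exactly gets the maximum score.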

If more than one spectrum is assigned to a compound (e.g. under different conditions or in different publications), the similarity between this compound and the given spectrum is calculated as the average of the similarities of all spectra assigned to it for a given nucleus.
