CSDB usage: statistical tools
This section describes tools for statistical analysis and clustering based on the CSDB content. These features are available from the Extras section of the menu. For basic help, refer to the CSDB usage help.
This feature provides the distribution of monomeric or dimeric fragments in structures from specified taxonomic group(s) and data on their uniqueness and location in the structures.
Rank selector (1) determines the rank of taxa to analyze: Domain, Phylum, Class, Genus, Species (exemplified on the screenshot) or Strain. Depending on the selected rank, the taxa available in the database are displayed as a list (5). For domains, no list is displayed, and domains are selected using the Display groups checkboxes (2). For other ranks, the Display groups checkboxes restrict the taxa to those belonging to the checked domains. For species or strains, an additional list of genera (3) is provided for easier selection. As soon as a genus is selected, the list of species or strains (5) is updated. For every rank lower than domain, multiple taxa can be selected with the Ctrl key. To select the entire list of taxa, press the Select all link (4).
Option group (6) determines which fragments will be included in the statistics:
- Combine anomeric forms, when checked, treats different anomers as a single residue without anomeric configuration, rather than producing separate residues for different anomeric forms. This option affects monosaccharide residues in cyclic forms only.
- Include undefined configs includes fragments containing residues with unknown anomeric, absolute or ring size configuration in the result. By default, such fragments are excluded.
- Include only saccharides excludes fragments containing at least one residue other than a monosaccharide. Please note that acetylated aminosugars are interpreted as a dimeric fragment containing a non-sugar residue Ac and a sugar residue. However, their sugar moiety will be included as a monomer (or as a dimer, if it is substituted by another sugar residue). Checking this box will modify the state of other checkboxes.
- Include monovalent residues includes fragments containing (or consisting of) monovalent residues (including acetyl groups) in the result.
- Include aglycons in oligomers includes aglycons as monomers and aglycons linked to the reducing end residues as dimers. The aglycons are stored separately from structures and are processed only if they are attached to the reducing end of mono- or oligomeric structures. Since aglycons in polymers are not attached to every instance of the repeating unit, they are always excluded from the fragment abundance analysis.
- Include aliases includes fragments containing residues represented by superclasses (HEX, PEN, LIP etc.) and aliases (Sug, Subst, Subst1 etc.).
- Explain 'Subst' aliases replaces all Subst aliases with actual alias values and treats them as different residues. When unchecked, all Subst aliases from all structures are combined under a single residue name like Subst or Subst1.
Option group (7) determines whether the position of fragments in a structure is distinguished:
- Distinguish inline/terminal/reducing treats fragments according not only to their nature but also to their position at the reducing or non-reducing end (or at both ends). Monomers or dimers that occupy different positions are processed as different fragments and form separate statistical data in the result.
- Distinguish branching degree. When the previous option is checked, this option distinguishes not only the position but also the branching degree of a fragment, which is the number of its substituents, including or excluding monovalent ones (see the next option). The branching degree provides a finer differentiation than inline/terminal/reducing alone.
- ...and ignore monovalent substituents determines whether to count monovalent residues like acetyl or methyl groups when calculating the branching degree. This option makes sense only when the two previous options are checked.
Checkbox (8) limits the results to those fragments that are unique to the selected taxa. The scope of the uniqueness analysis is defined by the upper-group rank selector (9). Allowed variants are all biota (unique among all structures in the database), its kingdom (unique within the domain the selected taxa belong to) and its phylum (unique within the phylum the selected taxa belong to). Depending on the taxon rank chosen in the rank selector (1), some options in the upper-group rank selector may be disabled.
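The uniqueness filter can be pictured as a set difference: a fragment is unique to the selected taxa if it occurs in none of the other taxa within the chosen upper group. The following Python sketch is illustrative only; the function name, the in-memory representation and the toy data are assumptions, not CSDB internals.

```python
def unique_fragments(selected, others):
    """Return fragments found in the selected taxa but in no other taxon of the scope.

    selected, others: dicts mapping a taxon name to the set of fragments
    found in its structures (hypothetical in-memory representation).
    """
    in_selected = set().union(*selected.values()) if selected else set()
    in_others = set().union(*others.values()) if others else set()
    return in_selected - in_others


# Toy data, invented for illustration only.
selected = {"Taxon A": {"aDGlcp", "bDGalp", "aLRhap"}}
others = {"Taxon B": {"aDGlcp"}, "Taxon C": {"bDGalp", "bDManp"}}
print(sorted(unique_fragments(selected, others)))  # ['aLRhap']
```

Widening the upper group (for example from phylum to all biota) simply adds more taxa to `others`, so the set of unique fragments can only shrink.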
The Monomer namespace link (12) displays the list of monomers present in the database, abundance data on their various forms and residue properties.
The buttons run the query for monomers (10) or dimers (11) to display the result page (screenshot for a sample query is on the right).
On the result page, the header displays the cumulative data on the generated statistics (1) and the list of selected taxa and their rank (2). The restrictions used to select the fragments are summarized below (3).
The table of results includes the following columns:
- Position of the fragment in the structure and its branching degree if it was queried (4). This column is present if any position-related option was set in the query. Fragments at the reducing ends and fragments that are identical to the whole structure are highlighted in pink; unsubstituted (terminal) fragments are highlighted in cyan. If this column is present, identical fragments with different positions occupy different rows.
- Donor (left) residue of the dimeric fragment (5). This column is present in the table of dimeric fragments only.
- Linkage between residues in the dimeric fragment (6). The word aglycon means that a dimer consists of a residue at the reducing end and its aglycon. This column is present in the table of dimeric fragments only.
- Acceptor (right) residue of the dimeric fragment or a residue comprising the monomeric fragment (7).
On the sample screenshot, fragments with different anomeric configurations were combined. Superclasses, if the corresponding option was set in the query, are displayed in blue. Residues with missing anomeric (if not combined), absolute and ring size configurations are greyed.
- Abundance of fragments (at indicated positions) in the structures from the selected taxa (8). The value in parentheses is the share (%) of the total abundance of all fragments matching the query (see the last row).
- Compound IDs (9) are links to the corresponding compounds. The number of these IDs may be lower than the fragment abundance, as fragments can be present in the whole structure or in the polymer repeating unit more than once.
- Abundance in selected taxa (10) is a distribution of fragments among particular taxa of the selected rank, visualized as a light-green histogram. If only one taxon was selected in the query, every cell in this column will have a single bar with 100% abundance.
Clicking on any of the column headers (4)-(8) re-sorts the table by the selected column. By default, the table is sorted by descending abundance.
The accessory link Export TSV (11) exports the results as tab-separated values for copy-pasting into Microsoft Excel or other table software. The accessory link Monomers or Dimers (12) runs the same query for fragments of the other size (one or two residues) than those present in the table.
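An exported TSV file can also be processed with a script instead of a spreadsheet. The sketch below uses Python's standard csv module; the column names Residue and Abundance and the sample values are hypothetical, since the exact layout of the exported file is not described here. It also recomputes the percentage share of each fragment in the total abundance.

```python
import csv
import io

# Hypothetical excerpt of an exported file; real column names and values may differ.
sample = "Residue\tAbundance\n?DGlcp\t120\n?DGalp\t80\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_tsv(sample)
total = sum(int(r["Abundance"]) for r in rows)
for r in rows:
    share = 100 * int(r["Abundance"]) / total
    print(f'{r["Residue"]}\t{r["Abundance"]}\t{share:.1f}%')
```

For a file saved to disk, the same reader can be fed with `open(path, newline="")` instead of `io.StringIO`.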
The monomer namespace table provides information on the naming and properties of the residues that make up all structures in the database. You may need this data if you type structural queries in the expert form and you are unsure whether a particular residue can be encoded and how its name is spelled.
- Residue column (4) lists the residue base names without configurations.
- Size column (5) reflects the size of the sugar carbon skeleton (from di to dec) and may contain the reserved values of mva (monovalent non-sugar residues) and nsu (polyvalent non-sugar residues). Asterisk (*) means that a residue has no variable absolute configuration (the chirality is encoded in the name or the residue is not optically active).
- Type column (6) reflects the type of residue: ald = cyclic aldose, ket = cyclic ketose, opn = open chain sugar, sug = aldose, ketose or open-chain sugar (for superclasses), ol = alditol, alk = alkyl, lip = acyl, sph = sphingoid, ino = inositol derivative, pep = amino acid, unspecified = none of the above.
- Abundance column (7) lists how many times the residue in any of its forms appears in the database. Multiple instances within a single compound are counted as many times as the residue appears in the structure.
- Structures column (8) lists how many distinct compounds the residue appears in (multiple instances within a single compound are counted once).
- Description or atomic pattern column (9) contains IUPAC names and comments, so you can use the browser search function to locate a residue of interest.
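The difference between the Abundance and Structures columns can be shown with a short Python sketch. The dictionary of compounds is a hypothetical stand-in for the database, and the residue lists are invented.

```python
from collections import Counter


def residue_counts(compounds):
    """Return (abundance, structures) counters for residues.

    compounds: dict mapping a compound ID to the list of residues in its
    structure, with repeats (hypothetical representation).
    """
    abundance = Counter()   # every instance counts
    structures = Counter()  # each compound counts once per residue
    for residues in compounds.values():
        abundance.update(residues)
        structures.update(set(residues))
    return abundance, structures


# Compound 1 contains Glc twice, compound 2 contains it once.
compounds = {1: ["Glc", "Glc", "Gal"], 2: ["Glc", "Rha"]}
abundance, structures = residue_counts(compounds)
print(abundance["Glc"], structures["Glc"])  # 3 2
```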
By clicking on the column headers the table can be sorted alphabetically (default), by residue properties or by abundance. Explanations of the residue class (Size, Type), the atomic pattern and other details are given below the table.
Every residue name can be expanded by clicking on + to view the abundance-decreasing list of anomeric, absolute and ring size forms in which the residue is present (4). Every form of the residue has an atomic pattern (10) displayed in the last column and a link to the corresponding MonosaccharideDB entry (11) if available. The links Expand all and Collapse all (2) in the page header expand or collapse all residues in the table.
Atomic pattern (10) is characteristic for atoms in the residue, sorted by ascending carbon numbers from left to right. Every atom is represented by a single character: o = >CH-OH (hydroxy) or -CH(OH)2 (dihydroxy), n = >CH-NH- (amino), a = -COOH or -XOnOH, e.g. phosphate (acid), d = -CH2- or >CH- or >C< (deoxy), D = -CH= or >C= (deoxy sp2) or tertiary sp-C, O = -C(OH)= (hydroxy sp2), N = -C(NHR)= (amino sp2) or -CONH2 (amide), x = other carbon (>C=O, etc.), ? = any (in superclasses), or unclear due to indefinite ring size configurations, or yet unknown.
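As an illustration, an atomic pattern can be decoded character by character. The mapping below abbreviates the meanings listed above; the assumptions that the first character corresponds to C1 and that every carbon has exactly one character, as well as the example pattern, are made for this sketch only.

```python
# Character meanings as listed in the text above (abbreviated).
ATOM_CODES = {
    "o": "hydroxy",
    "n": "amino",
    "a": "acid",
    "d": "deoxy",
    "D": "deoxy sp2",
    "O": "hydroxy sp2",
    "N": "amino sp2 or amide",
    "x": "other carbon",
    "?": "any or unknown",
}


def describe_pattern(pattern):
    """Return (carbon number, description) pairs for an atomic pattern string."""
    return [
        (i, ATOM_CODES.get(ch, "unrecognized"))
        for i, ch in enumerate(pattern, start=1)
    ]


# "xnoooo" is a made-up pattern used only to exercise the function.
for carbon, meaning in describe_pattern("xnoooo"):
    print(f"C{carbon}: {meaning}")
```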
The link Residue subdatabase dump (1) displays raw data on all residues parsable by the CSDB engine (not only those present in structures in the database). For more details on the residue names, please refer to Structure encoding, section 2.
This tool displays database coverage data for a given taxonomic group. The rank of this group can be selected using the rank selector (1). Allowed ranks are all biota (no limitations, default), domain, phylum, class and genus. As soon as the rank is selected, the list of matching taxa (3) is shown or updated according to the domain filter (2). For the domain rank, the domain filter itself serves as the selector. Please note that although all domains have checkboxes, there is no point in checking domains beyond the scope of the database (bacteria, archaea and protista in Bacterial CSDB; plants and fungi in Plant&Fungal CSDB).
Publication year span checkbox (4) filters results to those published within the specified range only. Structure type checkbox (5) filters results to those containing structures of the specified type only (mono- and oligomers, all polymers, monomers and homopolymers, cyclic polymers or biological repeating units). The Display coverage button (6) processes the query.
Results are presented as a table containing the following columns:
- Selected taxa (column header is a rank). If a single taxon was selected, all rows contain the same name in this column. If all biota was selected, this column is absent.
- Their subtaxa (column header is a rank). Every taxon is split into subtaxa of the lower rank, and the results are grouped by these subtaxa. If the selected rank was genus, two tables are displayed: one for genera split into species (which include all strains) and one for genera split into species and strains.
- Structures. Number of structures for the corresponding subtaxon found in the database.
- Publications. Number of publications in which these structures are present.
- Organisms. Number of taxonomically distinct organisms or groups of organisms from which these structures were obtained.
- NMR spectra. Number of NMR spectra for these structures present in the database.
Numbers of structures, publications and organisms are links to the corresponding CSDB result pages containing all instances of these data entries. Next to each number, in parentheses, is its share of the total number of entries found; this value is also shown as a light-green horizontal bar on the histogram. The cumulative values are shown in the last row. Clicking on a column header re-sorts the table by that column.
This feature generates a distance matrix between mono- or dimeric fragment pools from taxa populated in both databases (bacterial and plant&fungal). Based on this matrix, the taxa are clustered into groups, and the corresponding dendrograms are displayed. The exported matrix can be used for clustering of taxa according to the glycans they biosynthesize, visualized as phenetic trees or processed externally.
First, the program generates a list of taxa, in accordance with the specified constraints (Scope settings and General settings), and a list of structural fragments that will be included in the analysis, in accordance with the specified constraints (General settings and Fragment pool settings). With these two lists prepared, the program builds binary occurrence codes that reflect the occurrence of particular fragments in structures assigned to organisms belonging to every taxon under analysis. These occurrence codes are compared to give Hamming distances between taxa, which are normalized by the exploration degree of the two taxa being compared (how many structures are assigned to them). These distances form a dissimilarity matrix used for cluster analysis of taxa and building of phenetic graphs.
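The comparison step can be sketched in Python as follows. The sketch computes raw Hamming distances only; the normalization by exploration degree is left out because its exact formula is not given in this section. Function names and the toy occurrence codes are assumptions made for illustration.

```python
def hamming(code_a, code_b):
    """Count positions at which two equal-length bit-codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("occurrence codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def distance_matrix(codes):
    """Build a symmetric matrix of raw (unnormalized) Hamming distances.

    codes: dict mapping a taxon name to its occurrence code (a string of 0/1).
    """
    taxa = sorted(codes)
    return taxa, [[hamming(codes[r], codes[c]) for c in taxa] for r in taxa]


# Toy occurrence codes (invented): one bit per fragment, same fragment order.
codes = {"Taxon A": "11010", "Taxon B": "10011", "Taxon C": "11010"}
taxa, matrix = distance_matrix(codes)
for name, row in zip(taxa, matrix):
    print(name, row)
```

In this toy example Taxon A and Taxon C share the same code and are at distance 0, while Taxon B differs from each of them in two fragments.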
GENERAL SETTINGS include options and thresholds for the generation of the taxon list, fragment list and occurrence codes:
- The Rank selector (4) specifies the rank of taxa that will be compared (Kingdom, Phylum, Genus, Species or Strain). Depending on the scope settings, some of the upper ranks may be disabled. If Species is selected (as on the exemplary screenshot), an additional link Specify exact species and a list of currently specified species appear on the right. This link allows picking names from the total list of species present in the bacterial and plant&fungal databases. This option operates on the global level and resets the scope settings to All biota. The list can be copy-pasted in the child window to avoid repeating the species selection process.
- Two Taxon population thresholds (5) define how populated a taxon should be for inclusion in the analysis. The population is the number of structures assigned to organisms belonging to the taxon or its subtaxa. The first of these two fields sets the minimum for this number in absolute units (number of structures), and the second one sets it in relative units (number of structures normalized by the total number of structures in the database a taxon is deposited in). Checkboxes select which threshold to use; both can be used together.
- Two Abundance thresholds (6) define how 'popular' a fragment should be to be qualified as present in biota and thus included in the analysis. Higher abundance thresholds bias the analysis to widespread residues, which helps to avoid analytical artifacts and atypical, rarely occurring fragments. The first threshold is a minimal number of structures in which the fragment should be present in the databases. The second threshold is a minimal number of instances of the fragment in all structures in the databases. As a structure may have more than one identical fragment, the second threshold is sensible only if it is greater than the first. Both these thresholds are applied during the fragment pool generation.
- Fragment presence threshold (7) defines how frequently a fragment should appear in the structures from a taxon to be qualified as present in this taxon. If the fragment population in a taxon is equal to or higher than this threshold, the occurrence code for this taxon will have 1 (true) at the position corresponding to this fragment; otherwise it will have 0 (false) at this position.
- Size of fragments (8) selects which fragment size (monomeric or dimeric) is used for the analysis. Analysis on monomers suggests clustering according to global structural composition (may be associated with synthases of taxa), while analysis on dimers suggests clustering according to linkages present in glycans (may be associated with transferases of taxa).
- Type of structures (9) is a filter applied to the databases before processing to change the scope of structures queried for the fragments. Allowed options are any (default), only polymers, only oligomers or optimized. Usually, structures of different types are located in different compartments of living cells. Optimized implies most biologically active structures in every domain: polymers from bacteria, fungi and archaea, and oligomers from other kingdoms. Please note that changing this option affects the pool of structures on which the other thresholds operate, and thus optimal thresholds are different for different types of structures.
- Format (10) defines how the dissimilarity matrix is exported (R, FITCH or TSV). If the default R-project format is selected, the visualization of results as a dendrogram is done automatically on the result page. FITCH matrices can be processed by multiple clustering software, while the TSV format is the most universal one.
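The effect of the fragment presence threshold on an occurrence code can be sketched as below. Treating the threshold as a plain count of occurrences is an assumption, as are the function name and the toy data.

```python
def occurrence_code(fragment_counts, fragment_order, presence_threshold):
    """Return a bit string with 1 where a fragment meets the presence threshold.

    fragment_counts: dict mapping a fragment to how often it occurs in the
    structures of one taxon (treating the threshold as a count is an assumption).
    fragment_order: list fixing the bit position of each fragment.
    """
    return "".join(
        "1" if fragment_counts.get(f, 0) >= presence_threshold else "0"
        for f in fragment_order
    )


# Toy data, invented for illustration only.
order = ["aDGlcp", "bDGlcp", "aLRhap"]
counts = {"aDGlcp": 5, "aLRhap": 1}
print(occurrence_code(counts, order, presence_threshold=2))  # 100
print(occurrence_code(counts, order, presence_threshold=1))  # 101
```

Raising the threshold turns rarely observed fragments into zeros, which makes the codes less sensitive to single atypical structures.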
SCOPE SETTINGS allow the analysis to be run inside a particular taxonomic group, which affects the selection of taxa. This taxonomic group may include one or more taxa selected in the taxon selector (3). The group-rank selector (1) switches this list (3) to the desired rank and updates its content according to both databases combined. Allowed ranks are All biota (no limitations), Domain, Phylum, Class and Genus. This rank makes sense only if it is higher than the rank of taxa for analysis selected in general settings (4). Group filter (2) limits list (3) to those taxa that belong to the checked domains. Multiple taxa can be selected with the Ctrl key. The taxa for the analysis will include only subtaxa of the taxa specified in the scope settings.
FRAGMENT POOL SETTINGS (11) include options for generation of the fragment list and are analogous to those used in the Fragment abundance tool:
- Combine anomeric forms, when checked, treats different anomers as a single residue without anomeric configuration, rather than producing separate residues for different anomeric forms. This option does not affect residues that have no anomeric forms.
- Exclude underdetermined residues omits fragments containing residues with unknown anomeric (if not combined), absolute or ring size configurations from the analysis.
- Exclude monovalent residues omits fragments containing (or consisting of) monovalent residues (including acetyl groups on aminosugars) from the analysis.
- Exclude superclasses omits fragments containing (or consisting of) residues represented by superclasses (like HEX) or aliases (like Subst).
- Differentiate aliases replaces all Subst aliases with actual alias values prior to the analysis. When checked, aliases are treated as different residues depending on their actual values; otherwise all Subst aliases are combined under a single residue name like Subst or Subst1.
- Sugars only omits fragments containing at least one residue other than a monosaccharide from the analysis. Please note that acetylated aminosugars are interpreted as a dimeric fragment containing a non-sugar residue Ac and a sugar residue.
- Exclude aglycons omits fragments with residues at the reducing end classified as aglycons. If unchecked, aglycons from mono- and oligomeric structures undergo the analysis together with other residues.
- Differentiate location (currently disabled) processes identical fragments at different locations in the structure (inline, terminal or reducing) as different fragments.
- Strict comparison (currently always enabled) implies that to decide if two fragments are the same or different, strict comparison of configurations is performed, e.g. ?DGlcp (unknown anomer) is not equal to aDGlcp. When unchecked, residues with known configurations are considered a subset of those with unknown.
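The difference between strict and relaxed comparison can be illustrated with a small Python function. The position-by-position matching of residue names, with '?' acting as a wildcard, is a simplification assumed for this sketch and is not necessarily how the CSDB engine compares residues.

```python
def same_residue(a, b, strict=True):
    """Compare two residue names such as 'aDGlcp' and '?DGlcp'.

    strict=True: names must be identical.
    strict=False: a '?' in either name matches any character at that position
    (a simplification of the subset rule described above).
    """
    if strict:
        return a == b
    return len(a) == len(b) and all(
        x == y or "?" in (x, y) for x, y in zip(a, b)
    )


print(same_residue("?DGlcp", "aDGlcp"))                # False
print(same_residue("?DGlcp", "aDGlcp", strict=False))  # True
print(same_residue("bDGlcp", "aDGlcp", strict=False))  # False
```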
Pressing the Clusterize button (12) runs the analysis. The specified restrictions affect the total number of processed taxa and fragments, so the calculation may take from 30 seconds to 15 minutes.
When the analysis is done, the results are displayed (an example is shown on the right screenshot). They start with an overview of the taxonomic scope (1) (on which supergroup(s) of organisms the data were obtained). Reports on the number of generated fragments of the desired size and on the number of generated taxa of the desired rank (2) indicate that taxon and fragment pools were prepared without errors.
The occurrence bit-code generation report (3) includes the Show link that displays a table with the taxa and their bit-codes, as well as a separate list of the used fragments. For easy correlation, bits in the occurrence codes and fragments in the list are split in groups of five and have the same order. The dissimilarity matrix generation report (4) also has the Show link that displays the distance matrix in the selected format.
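The grouping by five makes it easy to match a bit to its fragment by eye. The same correlation can be done programmatically, as in the sketch below; the occurrence code and the placeholder fragment names F1 to F10 are invented.

```python
def group_by_five(items):
    """Split a sequence into consecutive chunks of five (the last may be shorter)."""
    return [items[i:i + 5] for i in range(0, len(items), 5)]


def present_fragments(code, fragments):
    """Return the fragments whose bit in the occurrence code is 1."""
    return [f for f, bit in zip(fragments, code) if bit == "1"]


# Invented occurrence code and placeholder fragment names in the same order.
code = "1101001110"
fragments = [f"F{i}" for i in range(1, 11)]

print(" ".join(group_by_five(code)))       # 11010 01110
print(present_fragments(code, fragments))  # ['F1', 'F2', 'F4', 'F7', 'F8', 'F9']
```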
The next two blocks show a copy of the Calculation parameters (5) used in the analysis and Coverage data on used taxa (6). The latter data are displayed in a tab-separated table with taxon names and database markers ((BA) = bacterial, (PF) = plant&fungal), the number of organisms within a taxon and its subtaxa (second column), and the number of structures assigned to these organisms (third column).
If the dissimilarity matrix format was chosen as R-project, a dendrogram reflecting the results of clustering (7) and additional options (9)-(12) are displayed. Regardless of the matrix format, the analysis results (except a dendrogram tree) are stored on the CSDB server in two files referenced by persistent links (8). The first (TXT file) contains the dump of the input parameters, generated dissimilarity matrix and coverage data on taxa. The second file (data in the specified format) contains the dissimilarity matrix alone, for processing in other software. If you need these files, please copy the links or the job name and use them within six months, or download the files.
For R-formatted matrices, additional options are available to rebuild a dendrogram from the same dissimilarity matrix without re-running the analysis. Graph type selector (9) specifies the dendrogram type: Phylogram (rectangular stems), Cladogram (angular stems), Unrooted tree (as on the screenshot) or Circular tree (with the root in the center). The leaves (=taxa) of the built phenetic tree are colored according to the number of largest clusters specified in selector (10). The button Rebuild dendrogram (11) updates image (7) and exports the phenetic tree in the format chosen by selector (12). Allowed formats are No export (default), Newick tree and Nexus tree. After the export, a link to the corresponding file appears in link area (8). The image update process opens another browser window, and if there were no errors, it is closed automatically after the image is refreshed.
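An exported Newick tree is a plain-text string of nested parentheses, so the taxa at its leaves can be listed with a few lines of Python. The sketch below handles only simple unquoted names; the sample tree is invented, and the exact naming used in CSDB exports (for example, how spaces in taxon names are written) is not covered here.

```python
import re


def newick_leaves(newick):
    """Return leaf names from a simple Newick string (unquoted names only).

    A leaf name directly follows '(' or ','; labels of internal nodes
    follow ')' and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Invented example tree with branch lengths.
tree = "(Taxon_A:0.1,(Taxon_B:0.2,Taxon_C:0.3):0.4);"
print(newick_leaves(tree))  # ['Taxon_A', 'Taxon_B', 'Taxon_C']
```

For anything beyond a quick check, a dedicated phylogenetics library is a safer choice than a regular expression.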
The CPU time needed for calculation (13) is provided for reference.