CSDB usage: statistical tools

This section describes tools for statistical analysis and clustering based on the CSDB content. These features are available from the Extras section of the menu. For basic help, refer to the CSDB usage help.
  • Fragment abundance
  • Monomer namespace
  • Coverage data
  • Taxon clustering
    Fragment abundance

    This feature provides the distribution of monomeric or dimeric fragments in structures from specified taxonomic group(s) and data on their uniqueness and location in the structures.

    Fragment abundance form

    Rank selector (1) determines the rank of taxa to analyze: Domain, Phylum, Class, Genus, Species (as in the screenshot) or Strain. Depending on the selected rank, the taxa available in the database are displayed as a list (5). For domains, no list is displayed, and domains are selected using the Display groups checkboxes (2). For other ranks, the Display groups checkboxes restrict the list to taxa from the checked domains. For species or strains, an additional list of genera (3) is provided for easier selection. As soon as a genus is selected, the list of species or strains (5) is updated. For every rank lower than domain, multiple taxa can be selected by holding the Ctrl key. To select the entire list of taxa, click the Select all link (4).

    Option group (6) determines which fragments will be included in the statistics:

    Option group (7) determines whether the position of fragments in a structure is distinguished:

    Checking box (8) limits the results to fragments that are unique to the selected taxa. The scope of the uniqueness analysis is defined by the upper-group rank selector (9). Allowed variants are all biota (unique among all structures in the database), its kingdom (unique within the domain the selected taxa belong to) and its phylum (unique within the phylum the selected taxa belong to). Depending on the taxon rank selected in (1), some options in the upper-group rank selector may be disabled.
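    The uniqueness filter described above can be pictured as a set difference. The following is a minimal sketch, not the CSDB implementation: the function name, the dictionary layout (taxon name mapped to a set of fragment labels) and the toy data are all assumptions made for illustration.

```python
def unique_fragments(fragments_by_taxon, selected, scope):
    """Return fragments found in the selected taxa but in no other taxon of the scope.

    fragments_by_taxon: dict mapping a taxon name to a set of fragment labels (assumed layout).
    selected: taxa chosen by the user.
    scope: all taxa of the upper group (all biota, a domain or a phylum).
    """
    selected = set(selected)
    in_selected = set()
    for taxon in selected:
        in_selected |= fragments_by_taxon.get(taxon, set())
    elsewhere = set()
    for taxon in scope:
        if taxon not in selected:
            elsewhere |= fragments_by_taxon.get(taxon, set())
    return in_selected - elsewhere


# Hypothetical toy data; the labels F1..F4 are placeholders, not CSDB residue names.
toy = {
    "Taxon A": {"F1", "F2", "F3"},
    "Taxon B": {"F1", "F4"},
    "Taxon C": {"F2"},
}
unique_in_all = unique_fragments(toy, ["Taxon A"], toy.keys())
unique_in_group = unique_fragments(toy, ["Taxon A"], ["Taxon A", "Taxon B"])
```

    With a wider scope fewer fragments remain unique: in this toy data only F3 is unique to Taxon A among all three taxa, while both F2 and F3 are unique when the scope is limited to Taxon A and Taxon B.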

    The Monomer namespace link (12) displays the list of monomers present in the database, abundance data on their various forms and residue properties.

    The buttons run the query for monomers (10) or dimers (11) and display the result page (a screenshot of a sample query is shown on the right).

    Fragment abundance result

    On the result page, the header displays the cumulative data on the generated statistics (1) and the list of selected taxa and their rank (2). The restrictions used to select the fragments are summarized below (3).

    The table of results includes the following columns:

    Clicking on any of the column headers (4)-(8) re-sorts the table by the selected column. By default, the table is sorted by descending abundance.

    The accessory link Export TSV (11) exports the results as tab-separated values for copy-pasting into Microsoft Excel or other spreadsheet software. The accessory link Monomers or Dimers (12) runs the same query for fragments of the other size: dimers if the table lists monomers, and monomers if it lists dimers.

    Monomer namespace
    Monomeric namespace

    The monomer namespace table provides information on the naming and properties of the residues that make up all structures in the database. You may need this data if you type structural queries in the expert form and are unsure whether a particular residue can be encoded or how its name is spelled.

    Clicking on the column headers sorts the table alphabetically (default), by residue properties or by abundance. The residue class (Size, Type), the Atomic pattern and other details are explained below the table.

    Every residue name can be expanded by clicking on + to view the list of anomeric, absolute-configuration and ring-size forms in which the residue is present (4), in order of decreasing abundance. Every form of the residue has an atomic pattern (10) displayed in the last column and a link to the corresponding MonosaccharideDB entry (11) if available. The links Expand all and Collapse all (2) in the page header expand or collapse all residues in the table.

    Atomic pattern (10) characterizes the atoms in the residue, sorted by ascending carbon number from left to right. Every atom is represented by a single character:
      • o = >CH-OH (hydroxy) or -CH(OH)2 (dihydroxy)
      • n = >CH-NH- (amino)
      • a = -COOH or -XOnOH, e.g. phosphate (acid)
      • d = -CH2- or >CH- or >C< (deoxy)
      • D = -CH= or >C= (deoxy sp2) or tertiary sp-C
      • O = -C(OH)= (hydroxy sp2)
      • N = -C(NHR)= (amino sp2) or -CONH2 (amide)
      • x = other carbon (>C=O, etc.)
      • ? = any (in superclasses), or unclear due to indefinite ring size configurations, or yet unknown
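    To read an atomic pattern character by character, a small lookup table is enough. The sketch below is only an illustration: the short descriptions are abridged from the legend above, the function name is hypothetical, numbering positions from 1 is an assumption, and the sample pattern is made up rather than taken from a real residue.

```python
# One-character codes of the atomic pattern, abridged from the legend in this section.
ATOM_CODES = {
    "o": "hydroxy or dihydroxy",
    "n": "amino",
    "a": "acid",
    "d": "deoxy",
    "D": "deoxy sp2 or tertiary sp-C",
    "O": "hydroxy sp2",
    "N": "amino sp2 or amide",
    "x": "other carbon",
    "?": "any, unclear or yet unknown",
}


def describe_pattern(pattern):
    """Return (position, description) pairs, reading the pattern left to right.

    Positions are counted from 1 (an assumption of this sketch).
    """
    described = []
    for position, char in enumerate(pattern, start=1):
        if char not in ATOM_CODES:
            raise ValueError(f"unknown atomic pattern character: {char!r}")
        described.append((position, ATOM_CODES[char]))
    return described


# Made-up pattern used only to exercise the function.
for position, meaning in describe_pattern("xnood"):
    print(position, meaning)
```

    The codes are case-sensitive, so o and O (or d and D, n and N) must not be normalized before lookup.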

    The link Residue subdatabase dump (1) displays raw data on all residues parsable by the CSDB engine (not only those present in structures in the database). For more details on the residue names, please refer to Structure encoding, section 2.

    Coverage data
    Coverage statistics form

    This tool displays database coverage data for a given taxonomic group. The rank of this group can be selected using rank selector (1). Allowed ranks are all biota (no limitations, default), domain, phylum, class and genus. As soon as the rank is selected, the list of matching taxa (3) is shown or updated according to domain filter (2). For domains, the domain filter itself is used as a selector. Note that although all domains have checkboxes, there is no point in checking domains beyond the scope of the database (bacteria, archaea and protista in Bacterial CSDB; plants and fungi in Plant&Fungal CSDB).

    Publication year span checkbox (4) filters results to those published within the specified range only. Structure type checkbox (5) filters results to those containing structures of the specified type only (mono- and oligomers, all polymers, monomers and homopolymers, cyclic polymers or biological repeating units). The Display coverage button (6) processes the query.

    Results are presented as a table containing the following columns:

    The numbers of structures, publications and organisms are links to the corresponding CSDB result pages containing all instances of these data entries. Next to each number, in parentheses, is its share of the total number of entries found. This share is also shown as a light-green horizontal bar on the histogram. The cumulative values are shown in the last row. Clicking on a column header re-sorts the table by that column.

    Taxon clustering

    This feature generates a distance matrix between mono- or dimeric fragment pools from taxa populated in both databases (bacterial and plant&fungal). Based on this matrix, the taxa are clustered into groups, and the corresponding dendrograms are displayed. The exported matrix can be used to cluster taxa according to the glycans they biosynthesize, visualized as phenetic trees or processed externally.

    First, the program generates a list of taxa in accordance with the specified constraints (Scope settings and General settings), and a list of structural fragments to be included in the analysis, also in accordance with the specified constraints (General settings and Fragment pool settings). With these two lists prepared, the program builds binary occurrence codes that reflect the occurrence of particular fragments in structures assigned to organisms belonging to every taxon under analysis. These occurrence codes are compared to give Hamming distances between taxa, which are normalized by the exploration degree of the two taxa being compared (how many structures are assigned to them). These distances form a dissimilarity matrix used for cluster analysis of taxa and for building phenetic graphs.
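    The occurrence codes and Hamming distances described above can be sketched as follows. This is an illustration under stated assumptions, not the CSDB code: the function names, the data layout and the toy data are hypothetical. The normalization by exploration degree is left as an optional hook, because its formula is not given in this section.

```python
def occurrence_code(taxon_fragments, fragment_list):
    """Binary code: 1 if the fragment occurs in the taxon, 0 otherwise, in fragment_list order."""
    return [1 if fragment in taxon_fragments else 0 for fragment in fragment_list]


def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("occurrence codes must have equal length")
    return sum(x != y for x, y in zip(code_a, code_b))


def dissimilarity_matrix(fragments_by_taxon, fragment_list,
                         structure_counts=None, normalize=None):
    """Pairwise distances between taxa.

    normalize: optional callable (raw_distance, n_structures_a, n_structures_b) -> number.
    The actual normalization used by CSDB is not stated here, so none is applied by default.
    """
    taxa = sorted(fragments_by_taxon)
    codes = {t: occurrence_code(fragments_by_taxon[t], fragment_list) for t in taxa}
    matrix = [[0] * len(taxa) for _ in taxa]
    for i, a in enumerate(taxa):
        for j, b in enumerate(taxa):
            distance = hamming(codes[a], codes[b])
            if normalize is not None:
                distance = normalize(distance, structure_counts[a], structure_counts[b])
            matrix[i][j] = distance
    return taxa, codes, matrix


# Hypothetical toy data; F1..F4 are placeholder fragment labels.
fragment_list = ["F1", "F2", "F3", "F4"]
fragments_by_taxon = {
    "Taxon A": {"F1", "F2"},
    "Taxon B": {"F2", "F3"},
    "Taxon C": {"F1", "F2", "F4"},
}
taxa, codes, matrix = dissimilarity_matrix(fragments_by_taxon, fragment_list)
```

    Because every code is built against the same ordered fragment list, the codes have equal length and position k always refers to the same fragment, which is what makes the Hamming distance meaningful.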

    Taxon clustering form

    GENERAL SETTINGS include options and thresholds for the generation of the taxon list, fragment list and occurrence codes:

    SCOPE SETTINGS restrict the analysis to a particular taxonomic group, which affects the selection of taxa. This taxonomic group may include one or more taxa selected in the taxon selector (3). The group-rank selector (1) switches this list (3) to the desired rank and updates its content according to both databases combined. Allowed ranks are All biota (no limitations), Domain, Phylum, Class and Genus. This rank makes sense only if it is higher than the rank of taxa for analysis selected in general settings (4). Group filter (2) limits list (3) to taxa that belong to the checked domains. Multiple taxa can be selected by holding the Ctrl key. The taxa for the analysis will include only subtaxa of the taxa specified in the scope settings.

    FRAGMENT POOL SETTINGS (11) include options for generation of the fragment list and are analogous to those used in the Fragment abundance tool:

    Taxon clustering result

    Pressing the Clusterize button (12) runs the analysis. The specified restrictions affect the total number of processed taxa and fragments, so the calculation may take from 30 seconds to 15 minutes.

    When the analysis is done, the results are displayed (an example is shown in the screenshot on the right). They start with an overview of the taxonomic scope (1), that is, the supergroup(s) of organisms for which the data were obtained. Reports on the number of generated fragments of the desired size and on the number of generated taxa of the desired rank (2) indicate that the taxon and fragment pools were prepared without errors.

    The occurrence bit-code generation report (3) includes the Show link that displays a table with the taxa and their bit-codes, as well as a separate list of the fragments used. For easy correlation, bits in the occurrence codes and fragments in the list are split into groups of five and have the same order. The dissimilarity matrix generation report (4) also has the Show link that displays the distance matrix in the selected format.
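    The grouping of bits and fragments into fives can be mimicked with two short helpers. This is only a sketch: the function names are hypothetical, and the use of a space as the group separator is an assumption about the display, not a documented detail.

```python
def group_bits(code, size=5):
    """Join bits into a string and split it into groups of `size` (space-separated, assumed)."""
    bits = "".join(str(bit) for bit in code)
    return " ".join(bits[i:i + size] for i in range(0, len(bits), size))


def group_fragments(fragments, size=5):
    """Split the fragment list into consecutive groups of `size`, keeping the order."""
    return [fragments[i:i + size] for i in range(0, len(fragments), size)]


# Placeholder fragment labels and a made-up occurrence code of the same length.
fragments = ["F1", "F2", "F3", "F4", "F5", "F6", "F7"]
code = [1, 0, 1, 1, 0, 0, 1]
grouped_code = group_bits(code)
grouped_fragments = group_fragments(fragments)
```

    Since both helpers cut at the same positions, the n-th bit of the m-th bit group corresponds to the n-th fragment of the m-th fragment group.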

    The next two blocks show a copy of the Calculation parameters (5) used in the analysis and Coverage data on used taxa (6). The latter data are displayed in a tab-separated table with taxon names and database markers ((BA) = bacterial, (PF) = plant&fungal), the number of organisms within a taxon and its subtaxa (second column), and the number of structures assigned to these organisms (third column).
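    If the coverage table is copied out for further processing, it can be read with a few lines of code. The sketch below assumes that the database markers are appended to the taxon name in the first column, that the table has no header row and that the counts are plain integers; these layout details, the function name and the sample rows are assumptions, not documented facts.

```python
import csv
import io
import re

# (BA) = bacterial, (PF) = plant&fungal, as explained in the text.
MARKER = re.compile(r"\((BA|PF)\)")


def parse_coverage(text):
    """Parse tab-separated coverage rows: name with markers, organisms, structures."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        name_field, organisms, structures = fields[0], fields[1], fields[2]
        rows.append({
            "taxon": MARKER.sub("", name_field).strip(),
            "databases": MARKER.findall(name_field),
            "organisms": int(organisms),
            "structures": int(structures),
        })
    return rows


# Made-up sample rows, not real CSDB output.
sample = "Taxon one (BA)\t12\t340\nTaxon two (BA) (PF)\t3\t27\n"
coverage = parse_coverage(sample)
```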

    If R-project was chosen as the dissimilarity matrix format, a dendrogram reflecting the results of clustering (7) and additional options (9)-(12) are displayed. Regardless of the matrix format, the analysis results (except the dendrogram) are stored on the CSDB server in two files referenced by persistent links (8). The first (TXT file) contains the dump of the input parameters, the generated dissimilarity matrix and the coverage data on taxa. The second file (data in the specified format) contains the dissimilarity matrix alone, for processing in other software. If you need these files, copy the links or the job name and use them within six months, or download the files.

    For R-formatted matrices, additional options are available to rebuild a dendrogram from the same dissimilarity matrix without re-running the analysis. Graph type selector (9) specifies the dendrogram type: Phylogram (rectangular stems), Cladogram (angular stems), Unrooted tree (as on the screenshot) or Circular tree (with the root in the center). The leaves (taxa) of the resulting phenetic tree are colored according to the number of largest clusters specified in selector (10). The button Rebuild dendrogram (11) updates image (7) and exports the phenetic tree in the format chosen by selector (12). Allowed formats are No export (default), Newick tree and Nexus tree. After the export, a link to the corresponding file appears in link area (8). The image update opens another browser window, which closes automatically once the image is refreshed if there were no errors.
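    For readers who process the exported dissimilarity matrix outside CSDB, the sketch below shows how a matrix can be turned into a tree topology in Newick notation. It uses average-linkage (UPGMA) clustering purely as an illustration; the clustering method used by CSDB is not stated in this section, and the function name, input layout and toy distances are assumptions. Branch lengths are omitted, and taxon names are assumed to be unique and free of characters that need quoting in Newick (parentheses, commas, semicolons).

```python
def upgma_newick(labels, dist):
    """Average-linkage clustering; returns the tree topology as a Newick string.

    labels: list of unique taxon names.
    dist: dict with dist[(a, b)] for every pair where a precedes b in labels.
    """
    sizes = {name: 1 for name in labels}  # cluster (as Newick text) -> number of leaves
    d = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            d[frozenset((a, b))] = dist[(a, b)]
    while len(sizes) > 1:
        # Closest pair of clusters; ties are broken alphabetically for reproducible output.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for c in sizes:
            if c in (a, b):
                continue
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            # Size-weighted mean of the distances from the two merged clusters.
            d[frozenset((merged, c))] = (sizes[a] * d_ac + sizes[b] * d_bc) / new_size
        del d[pair]
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Made-up distances between three placeholder taxa.
toy_dist = {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 3}
tree = upgma_newick(["A", "B", "C"], toy_dist)
print(tree)
```

    Here A and C are merged first because they are the closest pair, and B joins the resulting cluster afterwards. The resulting string can be opened in any tree viewer that reads Newick.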

    The CPU time needed for calculation (13) is provided for reference.