CSDB usage: basic operations

This section describes details of basic user operations in CSDB. For additional help, refer to Extras and Maintenance help.

Content:

Menu

The menu is what you see on the left side of the screen when you enter the project web-site. If you don't see the menu, the frame support is probably turned off in your browser.

Red numbers indicate how many distinct structures are currently present in CSDB and in how many publications. Four sections stand for the following operation groups:


Performing the queries
Search scope

In every search form you are expected to fill in the search terms, select the scope and run the query. The search terms depend on the type of search and are explained in the subsequent sections. The Go button (6) processes the query. Text field (7) informs the search engine how many result records should be output per page (default is 30).

Every search form, except the ID search, has a selector identifying the search scope. This feature allows refining the queries by their intersection or combination with queries of different types. If the default value whole database (1) is selected, the search will be performed through the whole database content.

If there has been a previous query within the current browser session, two more options become available: Search in the result of the previous query (2) and Combine with the result of the previous query (3). A summary of the previous query results (4) is displayed below the selector.
Search in... intersects the current and previous queries (logical AND), e.g. you can first search for some substructure and then refine the results using some bibliographic data by selection of the search scope as (2).
Combine with... combines the current and previous queries (logical OR), e.g. you can first search for some substructure and then extend the results using some other substructure or other data by selection of the search scope as (3).
If the Negate search checkbox (5) is checked, the current query will be negated (logical NOT), i.e. the database returns all results that do NOT match the search criteria. This option applies to the results after processing all other search-dependent options, so use it carefully if multiple search criteria are specified. Combining this option with Search in... produces the logical AND NOT operation.
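
The scope options can be read as set operations on result IDs. The following is a minimal sketch of that reading, not the CSDB implementation; the function name, the scope labels and the `universe` argument are hypothetical and chosen for illustration only.

```python
def apply_scope(current, previous, universe, scope="whole", negate=False):
    """Combine ID sets the way the search-scope selector is described.

    current  -- IDs matching the current search terms
    previous -- IDs returned by the previous query
    universe -- all IDs of this type in the database
    scope    -- "whole", "in_previous" (AND) or "combine" (OR); hypothetical labels
    """
    # Negation is applied to the current query as a whole (logical NOT).
    matched = (universe - current) if negate else set(current)
    if scope == "in_previous":
        return matched & previous      # Search in... (logical AND)
    if scope == "combine":
        return matched | previous      # Combine with... (logical OR)
    return matched                     # whole database


universe = set(range(1, 11))
previous = {1, 2, 3, 4}
current = {3, 4, 5}
and_not = apply_scope(current, previous, universe, "in_previous", negate=True)
```

Here `and_not` holds the IDs of the previous result that do not match the current terms, which corresponds to the AND NOT operation described above.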

Note: if you intersect or combine search results of different types more than twice, the final step may return more results than expected. For example, imagine that you searched for the substructure "bDQuipN4N" and got 100 structures (compound IDs). Then you searched for publications using the AND scope, applied the criterion "published after 2005", and got 20 papers (publication IDs) containing the structures from the previous query. After this you searched for organisms using the AND NOT scope, specified the "Shigella" genus, and got 1000 taxa (organism IDs), which is many more than you expected. This is because the criterion "AND NOT Shigella" was applied only to the results of the previous search, with no regard to the search before it. The returned results are therefore those organisms that do not belong to the Shigella genus and that are described in the 20 papers found at the second step. However, those 20 papers might contain other structures (not only the "bDQuipN4N"-containing structures found at the first step) assigned to other organisms, giving you 1000 taxa in total. This behavior applies to search series in which you intersect results three or more times and at least two steps use a search type different from the previous one (e.g. "structures AND articles AND organisms", or "structures AND articles AND structures"). Series like "structures AND structures AND articles AND articles" are not subject to this behavior because the search type changes only once.
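
The effect described in this note can be reproduced on a toy data set. The sketch below is only an illustration under stated assumptions: the records, compound names, paper names and genera are invented, and the three steps imitate the chained searches with plain Python sets.

```python
# Toy records: (compound, paper, year, organism genus). All values are invented.
RECORDS = [
    ("bDQuipN4N-glycan", "paper1", 2010, "Shigella"),
    ("bDQuipN4N-glycan", "paper1", 2010, "Escherichia"),
    ("other-glycan",     "paper1", 2010, "Proteus"),
    ("bDQuipN4N-glycan", "paper2", 1999, "Shigella"),
    ("other-glycan",     "paper3", 2012, "Salmonella"),
]

# Step 1: structure search -> compound IDs.
compounds = {c for c, _, _, _ in RECORDS if "bDQuipN4N" in c}

# Step 2 (AND): papers published after 2005 that contain a compound from step 1.
papers = {p for c, p, y, _ in RECORDS if c in compounds and y > 2005}

# Step 3 (AND NOT): organisms outside Shigella in the papers from step 2.
# Only `papers` is consulted; the compound set from step 1 is no longer applied.
organisms = {o for _, p, _, o in RECORDS if p in papers and o != "Shigella"}

# What one might have expected: step 1 still restricting the result.
expected = {o for c, p, _, o in RECORDS
            if c in compounds and p in papers and o != "Shigella"}
```

In this toy case `organisms` also contains Proteus, which is linked to paper1 only through a structure that does not contain the searched fragment, whereas `expected` contains Escherichia alone.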


Search using IDs
ID search

This form allows data retrieval using unique CSDB identifiers for record, publication, compound or organism. The record ID is equal to the value of the ID field in the CSDB dump file. Enter ID or ID range in text field (2), separating different IDs or ID ranges with commas, e.g. 1,2. The range is identified by hyphen, e.g. 1-10.
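
As an illustration of the ID list syntax, the following sketch expands such a string into individual IDs. It is an assumption about how the syntax can be read, not the parser used by CSDB; the function name is hypothetical.

```python
def parse_id_list(text):
    """Expand a string such as "1,2,5-10" into a sorted list of integer IDs."""
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A hyphen marks an inclusive range, e.g. 1-10.
            low, high = (int(x) for x in part.split("-", 1))
            ids.update(range(low, high + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


example = parse_id_list("3, 7-9")
```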

By changing the search scope (1), you can search for other CSDB IDs: structure IDs, publication IDs and organism IDs. The remaining IDs (source, spectrum, relation) are proposed for RDF generation only (you cannot use them for search queries).

In most cases, users do not know these IDs. The ID search may be useful for fast access to the data that have been previously searched for, if the ID was remembered. To make a link to a certain record from a webpage, you should use the following syntax <A HREF="http://csdb.glycoscience.ru/{DB_name}/core/search_id.php?id_list={my_ID}&mode={my_search_scope}">{my_link_name}</A>, where {my_search_scope} = record (default) or structure or publication or organism, and {DB_name} = bacterial or plant_fungal.
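
A link of this form can also be generated programmatically. The sketch below follows the URL pattern given above; the function name and the validation of arguments are illustrative assumptions, and whether the server accepts ID ranges in the id_list parameter is assumed from the form description rather than confirmed.

```python
from urllib.parse import urlencode

BASE = "http://csdb.glycoscience.ru/{db}/core/search_id.php"
MODES = {"record", "structure", "publication", "organism"}
DATABASES = {"bacterial", "plant_fungal"}


def csdb_id_link(id_list, mode="record", db="bacterial"):
    """Build a search_id.php URL for the given ID list, search scope and database."""
    if mode not in MODES:
        raise ValueError("unknown search scope: " + mode)
    if db not in DATABASES:
        raise ValueError("unknown database: " + db)
    # Keep commas readable instead of percent-encoding them.
    query = urlencode({"id_list": id_list, "mode": mode}, safe=",")
    return BASE.format(db=db) + "?" + query


link = csdb_id_link("1,2")
```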

Link (3) produces an RDF feed for selected IDs; link (4) produces metadata in the Thomson Reuters DCI XML format.


(Sub)structure search

This form lets you search the database by fragments of chemical structure. As you enter the structure by one of these methods, pressing Return the structure to the search page... returns you to the structure search form with the search term field (7) pre-filled. The result of your input is converted to the CSDB linear encoding and placed in field (8). This field remains editable, so you can use the generated structure as a starting point for manual editing. As soon as the CSDB linear encoding is present, the structure is previewed in area (7) in the graphic SNFG format. If the structure was entered manually but could not be parsed, parsing errors are displayed instead.

Literal term (9) limits the search to those compounds that have the specified text in their CSDB linear encoding (including aglycons and 'Subst' aliases). Two checkboxes define where to look for the text: in aglycons, structural annotations, aliases and structure linear codes (10), and/or in trivial names of compounds (11). If no structure is specified but a literal term is given, all structures matching this text will be returned, as if the structure was ANY.

By default the query engine interprets your term as a substructure, i.e. those structures will be returned that contain the specified fragment (including fragments located across polymer repeating unit borders). To limit search only to those structures that match the search term exactly (are equal, rather than contain) use Treat search term as... selector (12) to choose complete structure. Open linkages in the search term are interpreted differently depending on the status of this option. In case of search for a fragment, open linkages are considered links to ANY residues, e.g. -3)aGlcp(1- is equivalent to the ANY(?-3)aGlcp(1-?)ANY fragment. In case of search for a complete structure, open linkages indicate the repeating unit of a polymer, e.g. -3)aGlcp(1- is equivalent to [-3)aGlcp(1-]n.
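
The two readings of open linkages can be written down as a small string rewrite. This sketch only reproduces the two equivalences quoted above for a term that is open on both ends; the function name is hypothetical and the code does not parse the CSDB linear encoding.

```python
def interpret_open_linkages(term, complete=False):
    """Rewrite a term with open linkages on both ends, following the two examples above."""
    if not (term.startswith("-") and term.endswith("-")):
        return term  # not open on both ends: left unchanged in this sketch
    if complete:
        # Complete structure: open linkages denote a polymer repeating unit.
        return "[" + term + "]n"
    # Fragment: open linkages are links to ANY residues.
    return "ANY(?" + term + "?)ANY"


as_fragment = interpret_open_linkages("-3)aGlcp(1-")
as_polymer = interpret_open_linkages("-3)aGlcp(1-", complete=True)
```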

In case of search for the complete structure of a heteropolymer, only those records will be found where this polymer is described using the same repeating unit frame as in the search request. If the complete search term is a polymeric structure, two result sets will be combined:

The comparison of complete structures may be strict or fuzzy, according to the selected option. The strict comparison implies an exact match between the search term and the structure, except for the repeating unit positioning in polymers.

The Search for molecule types selector (13) specifies among which structure types the search is performed (monomers, oligomers, polymers, biological repeating units, fragments/motifs, etc.). All molecule types (default) means no limitations.

The Search for structures with published NMR data only checkbox (14) limits the search to those compounds that have an NMR assignment table stored. The Compound class checkbox and drop-down list (15) limit the search to a certain compound class. The Restrict taxonomical domain checkbox and drop-down list (16) limit the search to structures assigned to organisms from a certain taxonomic kingdom.

There are additional tools, which do not process the query:
The Predict NMR link (17) simulates 13C and 1H NMR spectra for a given structure. For more information, please refer to the NMR prediction help.
The Sweet 3D model link (18) sends the structure to the Sweet-II engine for 3D visualization. The Glycam model link (18) sends the structure to AMBER/GLYCAM for conformational processing and generation of atomic coordinates.


Structure wizard

The wizard allows construction of structural queries without knowledge of the CSDB structure encoding language. The first step is to select a suitable topology from drop-down list (1). The wizard supports topologies of up to four residues (besides monovalent substituents). The graphic representation of the topology (2) is displayed on the right. As soon as the topology is selected, the appropriate number of residue sections appears below (three on the screenshot). The linear preview of the structure is displayed next to the topology (3).

Each residue section header (4) includes the position of this residue within selected topology (e.g. Residue A) and the constructed residue name with all configurations and substitutions applied.

The constructed structural query term is displayed at the bottom of the page (17). Pressing Return the above structure to the structure search page... (18) passes it to the parent window and closes the wizard.


Composition search

This form allows searching structures by their residue composition, e.g. from MS data. The default composition is a single hexose residue. Drop-down list (1) lets you select a residue base type (e.g. HEX, Glc, GlcN etc.) without configurations and ring size. Only the most widespread residues are included. If you need a residue that is missing from this list, select the last line of the list, complete list. Control (2) specifies how many instances of this residue should be in the desired composition. The selected residue and its amount are displayed on the right (3).

The first part of the residue list contains superclasses (PEN, HEX etc.). These entries should be used when the unit identity is unclear. Please note that the HEXN superclass is contained within the HEX superclass, which may produce redundant results when you select HEXN.

Two buttons let you increase (Add unit, (4)) or decrease (Remove unit, (5)) the number of different units in the composition. The total amount of units is displayed in the header. If the composition contains multiple instances of the same type, e.g. two Glucose residues, you should select the residue Glc using (1) and its amount 2 using (2), rather than select 1 x Glc twice.

The Search for molecule types selector (6) specifies among which structure types the search is performed (monomers, oligomers, polymers, biological repeating units, fragments/motifs, etc.). All molecule types (default) means no limitations.

The Search for complete composition checkbox (7) controls the exclusiveness of the search. If it is checked (default), only the structures that contain no other residues except those specified in the composition will be returned, i.e. the input is interpreted as the composition of a complete molecule. If this box is unchecked, the input is interpreted as the composition of a structural fragment, and more structures are returned, including those containing other residues besides those specified.
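
The difference between the two modes can be sketched with multisets of residue names. This is an illustrative reading only: the function name is hypothetical, superclasses such as HEX are not handled, and treating the complete mode as exact equality of residue counts is an assumption.

```python
from collections import Counter


def matches_composition(structure_residues, query, complete=True):
    """Compare the residues of a structure with a queried composition.

    structure_residues -- list of residue names in the structure, e.g. ["Glc", "Glc", "GlcN"]
    query              -- mapping of residue name to requested amount, e.g. {"Glc": 2}
    """
    have = Counter(structure_residues)
    want = Counter(query)
    if complete:
        # Complete composition: no residues other than those requested.
        return have == want
    # Fragment composition: requested residues must be present, extras are allowed.
    return all(have[residue] >= amount for residue, amount in want.items())


query = {"Glc": 2, "GlcN": 1}
exact = matches_composition(["Glc", "Glc", "GlcN"], query)
with_extra = matches_composition(["Glc", "Glc", "GlcN", "Rha"], query)
as_fragment = matches_composition(["Glc", "Glc", "GlcN", "Rha"], query, complete=False)
```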

The Compound class checkbox and drop-down list (8) limit the search to a certain compound class. The Restrict taxonomical domain checkbox and drop-down list (9) limit the search to structures assigned to organisms from a certain taxonomic kingdom.


Structural motif search

The search for structural motifs is based on ranking structures according to how many disaccharides (residues and their linkage are considered using fuzzy logic and without modifications and monovalent substituents) they have in common with the structure in the search term. This feature is not yet supported.


Taxonomic search

This form allows retrieving particular organisms and associated data by their taxonomic names. Alphabetical lists of genera (2), species (3) and strains/serogroups (4) provide taxonomic specification (position of an organism in the tree of life). For fast navigation in the genera list, type first characters of a genus name in quick succession.

The lists are generated according to the biological domains selected by the checkboxes in the upper row (1). By default, mainly microorganisms are included (no plants and animals). Selecting a genus (2) is mandatory. The default positions for species (3) and strains (4) are Any, which means no limitations. The second position in the list of species, sp., is used to find microorganisms with the specified genus and strain but an unassigned species name. As soon as the genus is selected, the species and strain lists are updated accordingly. As the strain/serogroup list may be rather long, there is a text field (5) for entering the strain directly (or a part of it, to search for a taxon by an incomplete strain specification). The default value is * (no limitations).

If the organism was reclassified or a taxon was renamed, you can use any of the names to identify the organism. However, not all of the organism name remappings and synonyms are currently stored in CSDB.

The Search among host organisms checkbox (6) will search the specified taxon among host organisms, rather than among associated organisms themselves. E.g., if you specify Mus musculus and check this box, the search will return this organism and structures that were found in all microorganisms infecting mice or extracted from mice.

On checking Use NCBI Tax ID (7), taxon selection lists disappear, and you can identify an organism or a taxonomic group by the NCBI Taxonomy Database ID. If Including taxonomic children (8) is checked, all organisms that belong to the specified group and have lower ranks will also be included (e.g. search for Proteus with this checkbox on will return Proteus penneri, Proteus mirabilis, Proteus sp. 1234 etc.). If you use selection lists rather than NCBI Tax ID, taxonomic children are always included.

The link List of organisms (9) displays the complete list of organisms that are present in the database. The form Process taxonomy in NCBI... (10) retrieves the data from the NCBI Taxonomy Database using the selected genus and species as criteria. The data displayed are: scientific name of the organism, synonymic names, rank and taxonomic lineage.


Bibliographic search

This form is intended for searching by bibliographic data and keywords. If search criteria are provided in several sections of this form (e.g. authors and title), the intersection of queries will be returned. The queries are case-insensitive and accent-independent (but please note that accented consonants may be stored as a combination of Latin characters in publication metadata).

The Authors field (1) lets you input the author name(s). Click on a character in the helper (2) to input national symbols missing from your keyboard. To avoid spelling errors, it is recommended to use the author index available from the Author index button (3). For the author index window to appear, at least two first characters of the author family name (4) should be specified beside the button. The list is restricted to the author names beginning with the specified character combination. As soon as you click a certain author in this list, it is copied to the bibliographic form. The author field supports query language with term grouping and wildcards (click here for details). The sample query "Armstrong L" AND (Einstein) will find publications written together by Armstrong L and Einstein (with any initials).


The Title field (5) lets you define which words should be present in the publication title. This field supports query language with term grouping, wildcards and logical operations (click here for details). The sample query capsul* OR C?S will find publications containing at least one of the following words in the title: capsule, capsules, capsular, CPS, COS.
If the checkbox search also in abstract (6) is checked, publication abstracts will be analyzed for the specified terms together with the titles. Please note that not all of the publications within CSDB have abstracts stored.

To limit the search to publications with certain keywords assigned, use the Keywords field (7). The list of keywords assigned to every publication matches the keyword list published in the paper. The query language is supported.
If the checkbox search also in title (8) is checked, the publication title will be analyzed for the specified terms together with the keyword list.

Checkbox (14) filters out publications with no structure elucidation described. If this option is checked, publications with additional studies on the structures elucidated elsewhere will not be returned.

The Restrict taxonomical domain checkbox and drop-down list (15) limit the search to publications describing organisms from a certain taxonomic kingdom.

Pressing the PubMed XML link (16) converts the specified bibliographic data to the NCBI PubMed XML format and outputs it in a separate window.


Query language

The query language is supported in the Authors, Title and Keywords fields of the bibliographic search form and utilizes the following syntax:

  • Quotation marks (") make terms inseparable, e.g. "structure elucidation" will return data where these words come together. Always quote a term if it contains spaces, i.e. consists of several words that should come together in the specified order, e.g. "Einstein A". Wildcards (* and ?) are not supported inside quotes.
  • Asterisk (*) replaces any number (including zero) of alphanumeric characters, e.g. Capsule* will return data containing words Capsule, Capsules, Capsule-like etc.
  • Question mark (?) replaces exactly one alphanumeric character, e.g. ?-antigen will return data containing words O-antigen, K-antigen etc.

    The Title field used to search for terms in publication title and abstract supports additional query syntax:

  • Ampersand (&) intersects the terms using logical AND, e.g. (Term1 & Term2) will return data containing both Term1 and Term2. This is the default option.
  • Vertical bar (|) combines the terms using logical OR, e.g. (Term1 | Term2) will return data containing either Term1 or Term2, or both.
  • Caret (^) applies logical NOT to the term, e.g. ^Term1 will return data NOT containing Term1.
  • Parentheses are used to group subqueries into more complex expressions and to declare preference of logical operations, e.g. (Term1 | Term2) & (^Term3).
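
The wildcard rules listed above can be approximated with regular expressions. The sketch below is an assumption-laden illustration, not the CSDB query engine: the function names are hypothetical, only a single unquoted term is handled, and each term is compared with one whole word, so how the real engine splits text into words (for instance around hyphens) is not modelled.

```python
import re


def wildcard_to_regex(term):
    """Translate * and ? in a single unquoted term into a compiled regular expression."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append("[A-Za-z0-9]*")   # any number of alphanumeric characters
        elif ch == "?":
            parts.append("[A-Za-z0-9]")    # exactly one alphanumeric character
        else:
            parts.append(re.escape(ch))    # everything else is matched literally
    # Queries are described as case-insensitive.
    return re.compile("".join(parts), re.IGNORECASE)


def word_matches(term, word):
    """Return True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None


hits = [w for w in ["capsule", "Capsular", "CPS", "COS", "CS"]
        if word_matches("capsul*", w) or word_matches("C?S", w)]
```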


    NMR data search

    This form allows searching for compounds with NMR spectra containing the specified signals. Selector (1) allows selection of a particular nucleus. The subspectrum to search for should be typed in window (3). You can separate signals with spaces or new line characters, the sorting is not required; the allowed characters are numerals and decimal dot. The threshold field (2) allows output filtering according to the spectra similarity (see below). Only the compounds with the similarity higher than this threshold will be returned. Good values for carbon spectra are 1 and above; good values for proton spectra are 5 and above.

    If checkbox (4) is checked (the default), only the spectra that contain the specified signals within one residue will be returned; otherwise, the spectra are analyzed without assignment of subspectra to residues. Please note that with this option unchecked the search may take an extremely long time. Generally, the fewer signals you specify and the less widespread their chemical shifts are, the faster the search is.

    To compare the spectra, the CSDB engine forms all possible subspectra of the stored NMR spectrum, the size of the subspectrum being equal to the number of signals in the user input (3). The best-fitting subspectrum is used to calculate the similarity value. Similarity is an estimate of the inverse average deviation between signals, e.g. 1 means the average difference is 1 ppm, 10 means 0.1 ppm, 0.1 means 10 ppm etc. 0 stands for no similarity at all, and 1000 for full similarity (exact match of chemical shifts). If a compound has more than one spectrum (e.g. in different conditions or in different publications), the similarity for this compound is calculated as the average of the similarity values of all spectra assigned to it for the given nucleus.
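
The similarity measure can be sketched as follows. This is a brute-force illustration of the description above, not the CSDB engine: the function names are hypothetical, signals are paired in sorted order, a stored spectrum with fewer signals than the input is given similarity 0 by assumption, and the value is capped at 1000 for an exact match.

```python
from itertools import combinations

MAX_SIMILARITY = 1000.0


def similarity(query_shifts, stored_shifts):
    """Inverse average deviation (ppm) between the input and the best-fitting subspectrum."""
    query = sorted(query_shifts)
    n = len(query)
    if n == 0 or n > len(stored_shifts):
        return 0.0  # assumption: nothing to compare means no similarity
    best = None
    # Enumerate every subspectrum of the stored spectrum with n signals.
    for sub in combinations(sorted(stored_shifts), n):
        # Both sequences are sorted, so signals are paired in ascending order.
        deviation = sum(abs(q - s) for q, s in zip(query, sub)) / n
        if best is None or deviation < best:
            best = deviation
    if best == 0:
        return MAX_SIMILARITY
    return min(MAX_SIMILARITY, 1.0 / best)


def compound_similarity(query_shifts, spectra):
    """Average the similarity over all spectra of one compound for the same nucleus."""
    values = [similarity(query_shifts, spectrum) for spectrum in spectra]
    return sum(values) / len(values) if values else 0.0


one_ppm_off = similarity([100.0], [101.0, 70.0])
exact_match = similarity([100.0, 70.0], [70.0, 100.0, 60.0])
```

Enumerating all subspectra grows combinatorially with the spectrum size, which is consistent with the remark that unrestricted searches may take a long time; a production implementation would need a more economical matching strategy.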

    If the user input has fewer signals than an ideal experimental subspectrum, which may occur due to signal overlap, the similarity will be lower but still nonzero. To improve accuracy, it is recommended to input chemical shifts of signals with double integral intensity twice.

    The output compounds are sorted in descending order of similarity, and all associated spectra for the given nucleus are displayed. Chemical shifts close to those from the search term (3) are highlighted. The highlight threshold is ±0.4 ppm for carbon chemical shifts and ±0.2 ppm for proton ones.
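
The highlighting rule can be expressed in a few lines. In this sketch the names are hypothetical and treating the threshold as inclusive is an assumption.

```python
# Highlight thresholds in ppm, as stated above.
HIGHLIGHT_THRESHOLD_PPM = {"13C": 0.4, "1H": 0.2}


def highlighted(stored_shifts, query_shifts, nucleus):
    """Return the stored chemical shifts that lie within the threshold of any input signal."""
    limit = HIGHLIGHT_THRESHOLD_PPM[nucleus]
    return [s for s in stored_shifts
            if any(abs(s - q) <= limit for q in query_shifts)]


carbon = highlighted([100.3, 101.0, 70.1], [100.0, 70.0], "13C")
proton = highlighted([4.5, 4.9], [4.6], "1H")
```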


    Output of results
    Output header & footer

    Every search request returns a number of data units (compounds, publications, organisms etc.), and their appearance depends on the type of your search. However, the header and footer (shown on the right) remain similar. Every data unit can be displayed in two forms: collapsed (main data only) and expanded (all related data). Clicking on Collapse this record or Expand this record at the bottom of the data unit switches its display style. Red numbers in the first line of the header (1) show how many units (compounds, publications, organisms, IDs) have been found and how many of them are displayed on this page. If there are more results than the per-page parameter, the Previous and Next links (2) allow navigation through the result output pages. Expand all records (3) (or collapse all records, depending on the current state) switches the display style of all data units at once.

    The footer of the output page contains a link to resort records (4), according to the selected criterion (publication year, microorganism name, etc. - depends on the search type). The resorting is applied to all the records returned, not only to those displayed on the current page. The New query link (5) returns you to the search form of the same type as the one that produced this output.

    Record output

    When you search for a structural fragment or composition or NMR spectra, data units are COMPOUNDS found in CSDB, plus major compound-related data (compound ID, structural formula in the Sweet-DB format, structure type, aglycon, molecular weight, chemical formula, trivial name, NMR data, compound class and references to other structural databases) and a list of publications in which this compound is described. Every publication in this sublist is accompanied with a link to a CSDB record ID (and a list of associated organisms within this record). This record ID covers this compound within this publication and allows access to complete data originating from the paper.

    When you search for bibliographic data, data units are PUBLICATIONS found in CSDB, plus major publication-related data (article ID, authors, title, issue data, keywords, publisher, involved institutions, corresponding author's email, used methods and references to other bibliographic databases) and the list of compounds that are described in this publication. Every compound in this sublist is accompanied with a link to a CSDB record ID (and a list of associated organisms within this record). This record ID covers this compound within this publication and allows access to complete data originating from the paper.

    When you search for taxonomic data, data units are ORGANISMS found in CSDB, plus major organism-related data (organism ID, systematic name, taxonomic domain, phylum and references to other taxonomic databases) and a list of compounds that are associated with this organism. Every compound in this sublist is accompanied with a link to a CSDB record ID (and an associated publication within this record). This record ID covers this compound in association with this organism and allows access to complete data originating from the paper.

    When you search for a CSDB record ID or follow the CSDB record ID links in the three previous cases, data units are RECORDS displaying all data available. A record is a combination of a compound and a bibliographic reference; it may have one or more taxonomic references.

    A screenshot on the left shows an example of a completely filled record in the expanded form:

  • (1) is a number of this record among the records found, (2) is a CSDB ID of this record.
  • authors' list (3), article, thesis or chapter title (4) and issue data (5) stand for the bibliographic reference associated with this record. Journal issue data are: journal name, volume, subvolume if any, year, page range. Book and chapter issue data are: book title (incl. series title and volume), editors, publisher, year, chapter number, page range. Symposium proceedings issue data are: symposium name, publisher if available, place, year.
  • (6) is a graphic representation of the structure in the SNFG format. Clicking on Show as text (7) explains the meaning of the icons used. Clicking on Show as text (7) displays the structure in the pseudo-graphic extended SweetDB format. You can visually compare the graphic vs. pseudo-graphic representations of the sample structure in the figure on the right. The CSDB linear code is not used for visualization; it is given for reference.
  • Taxonomical annotations (8) refer to one or more organisms (genus, species, strain if available) associated with the structure. If there are organism remappings (newer synonyms if an organism was reclassified/renamed after the publication, and older synonyms that identify the same organism in older publications), they are displayed in parentheses (previously named..., later renamed to...) (9). The taxonomical cross-references (10) link you to the NCBI Taxonomy database.
  • Extended taxonomic and medical information (11) includes: taxonomic domain and phylum; host organism occupied by the discussed microorganisms; organ or tissue in the associated or host organism from which the compound was extracted; disease of the host associated with the compound or microorganism. Medical terms are linked to the MeSH database.
  • Extended bibliographic information (12) includes: a flag identifying whether the structure was elucidated within the associated publication; www-address of the publication; bibliographic cross-references (PubMed ID, DOI, journal or chapter NLM ID); publisher company; corresponding author's email; list of authors' affiliations. Two white boxes display the publication abstract (13) and keywords (14).
  • Extended compound information (15) includes: structure type (oligo, mono, repeating unit etc.); polymerization degree or molecular weight; chemical formula; location of the structure inside the article; originally published erroneous structure (to keep it searchable, if the structure was revised); aglycon information; trivial name; and compound class.
  • The next block stands for other publication-specific information on the structure (16): experimental methods described in the paper; biological activity information; enzymes that release or process the structure; availability of synthetic, biosynthetic, genetic and conformation data in the paper; other comments (e.g. errors in this publication found elsewhere).
  • The related records (17) is a list of links to the CSDB records that contain the same or related structures (products of degradation, molecule parts, biochemically related, etc.) or the same publication. The links in this section include cross-references to other databases (CCSD, GlycomeDB, CAS registry, patent numbers, etc.), if available.
  • The NMR information includes: temperature, solvent, chemical shift reference if not TMS (NMR conditions) (18) used in 1D NMR experiments, and NMR signal assignment tables (19) for 1H and 13C. The linkage column in these tables indicates the path to the residue from the reducing end or from the rightmost residue in the repeating unit. An experimental 13C NMR spectrum is schematically plotted beside the table (20).
  • The record is followed by three structure-related tools: converter to other glycan-encoding languages (GlycoCT, LINUCS, GLYDE, etc.) (21); 13C and 1H 1D/2D NMR spectra simulator (22) (click here for details); and 3D visualization and/or conformation analysis by Sweet-II engine or by AMBER/GLYCAM (23).


