Database documentation

This section describes the architecture of CSDB, data formats and other technical details. For user help, please refer to Basic usage and Advanced features.

Contents:

  • Database scheme
  • Topology generator
  • Dump format
  • Structure encoding

    Database scheme

    CSDB utilizes a MySQL relational database to store data. The connection table approach is used for structures.
    Click on the entity relationship scheme to enlarge it (PDF).

    CSDB entities

    The example database is explained (entities + data) in the figure (click to enlarge to A3 PDF):

    CSDB schema

    Topology generator

    CSDB contains four special tables for caching residue connection topologies (unlisted on the scheme). To view pre-generated topologies, click here.

    Dump format

    The CSDB dump is a backup of the database and a reference file for all the database content. It is a human-readable text file, and content correction and error tracking are usually performed on it. The dump file may contain one or more records (a complete dump contains all CSDB records) separated by two blank lines. Lines starting with the # symbol are treated as comments and are not processed.

    Every record is represented by several lines, one per field. Each line starts with the field name followed by a colon (:) and the field content. Line breaks inside the field content are not allowed. Each field is explained below. Please refer to the data submission page and to the example record at the end of this page for more details on field meanings. Fields that cannot be empty are in bold.
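
    The rules above (comment lines, one field per line, records separated by two blank lines) can be sketched as a small reader. This is a minimal Python illustration, not CSDB's own parser; the function name parse_dump is hypothetical, and ignoring leading whitespace and single blank lines inside a record are assumptions made here.

```python
def parse_dump(text):
    """Split dump text into records, each a dict of field name -> content."""
    records = []
    current = {}
    blank_run = 0
    for raw in text.splitlines():
        line = raw.strip()  # assumption: surrounding whitespace is not significant
        if line.startswith("#"):
            continue  # comment lines are not processed
        if not line:
            blank_run += 1
            continue
        # Two or more blank lines before this line close the previous record.
        if blank_run >= 2 and current:
            records.append(current)
            current = {}
        blank_run = 0
        # Split on the first colon only: field content may itself contain colons.
        name, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'FIELD: content' line: {line!r}")
        current[name.strip()] = content.strip()
    if current:
        records.append(current)
    return records
```

    Splitting on the first colon only keeps values such as PMID:123456789 or a URL intact inside the field content.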

    Field: Explanation
    ID: Unique CSDB record ID. Records that have unresolved problems should be marked with a star (*); they are not processed until corrected. Records associated with publications that contain no information on the structure of natural carbohydrates should be marked with two stars and an explanation of why the record is excluded from the database (e.g. ID: 100 ** structure not found); such records are not processed at all.
    TH: 1 if the structure was elucidated or revised in the associated publication (even by poor methods),
    0 if the structure discussed in the associated publication was elucidated elsewhere (e.g. associated publication is a review or biological study of a known structure).
    If the structure is supposed or arbitrary but the authors obtained structural data and published the structure in the paper, use TH=1 and ST2=MOTIF.
    If only a fragment was elucidated, while a bigger structure is known or supposed from elsewhere, use TH=1 and ST2=FRAGMENT for the fragment itself, and TH=0 for the bigger structure.
    If the structure was elucidated erroneously, use TH=1, ST1=proper structure, ST1ORIG=erroneous structure. In this case, please specify in NT how the structure was revised (based on the NMR data, revised later from [ref...], etc.).
    AU: Comma-separated list of authors. Last name goes first, then initials without dots. National characters are supported (e.g. Müller AB).
    TI: Title of publication (article, chapter, book, symposium thesis). Please do not capitalize every word. If the title was copy-pasted from a PDF, check it for missing spaces and merged characters.
    JN: Journal name without abbreviations. If unsure, please use the bibliographic search form to check the spelling.
    For book publications, use the following format: JN: BOOK: book name; Eds.: comma-separated editors; Publisher.
    For symposium publications, use the following format: JN: SYMP: symposium name (symposium number : year : place); Eds.: comma-separated editors; Publisher.
    If editor or publisher is unknown, type semicolons only, e.g. SYMP: Eurocarb (17th : 2013: Tel-Aviv);;
    PY: Publication year.
    VL: Volume number. If the volume imprint contains a subvolume/issue, e.g. 36(5), specify the subvolume/issue in parentheses.
    For books, specify the volume number outside parentheses and the chapter number inside parentheses. The book series number should be included in the book title (JN).
    PG: Hyphen-separated start and end page numbers. If the end page is unknown, use the start page only.
    RL: Cross-references to bibliography: comma-separated list of resource-identifier pairs (the resource name is before the colon, the reference is after the colon, e.g. PMID:123456789). Allowed resources are: PMID (NCBI PubMed ID), DOI (DOI code), URL (www address), NLMID (NCBI NLM ID of book or chapter).
    EA: E-mail of the corresponding author, e.g. address@gmail.com; My Name <my_email@my_server.ru>
    AD: Semicolon-separated list of author affiliations (institution, city, country). Each affiliation should be listed only once, the order is arbitrary.
    AB: Publication abstract. Please replace line breaks with spaces. National characters are supported. If the abstract was copy-pasted from a PDF, check it for missing spaces, merged characters and extra line breaks. All chemical structures should be either specified in the linear form or replaced with the word /structure/.
    ST1: Chemical structure in CSDB linear encoding (see Structure encoding rules for details).
    If you used Subst for a fully defined substructure, except in the simplest cases, it is strongly recommended to add a SMILES code after the second equals sign, e.g. aDRibf(1-3)Subst // Subst = questin = SMILES CC1=CC(=C2C(=C1)C(=O)C3=CC(=CC(=C3C2=O)OC)O)O
    (optional) ST1ORIG: Originally published erroneous chemical structure in CSDB linear encoding (required only if ST1 contains the revised structure).
    ST2: Structure type: OLIGO (oligomeric structure), CHEM (chemical repeating unit of polymer), BIOL (proven biological repeating unit), SBIOL (suggested biological repeating unit), MONO (oligomeric structure with a single carbohydrate entity), HOMO (repeating unit of homopolymer), CYCLO (repeating unit of cyclic polymer), FRAGMENT (poly- or oligomeric fragment of a bigger structure), MOTIF (supposed, idealized or arbitrary structure; exemplary structure with a certain pattern of side chains, where multiple interpretations are possible; exemplary structure with explicit values of n/m/k/etc. indicating the polymericity of subfragments - in this case add a comment, e.g. NT: motif of polymeric structure 29 when n=m=1, see ID #####). A biological repeating unit can be suggested based on the same structure from the same organism in another record (BIOL) or on general knowledge, e.g. if a repeating unit contains a single aminosugar, its biological repeat is likely to start with this aminosugar. In E. coli, the presence of the genes GNE or GNU indicates that the biological repeat starts with GalNAc.
    ST3: For polymers: Polymerization degree preceded by n=, e.g. n=12-15, or molecular mass in Daltons. Ranges and relations are supported, e.g. 10000-30000 or >30000.
    For oligomers: Molecular mass with or without ion type in square brackets, e.g. 9813 [M+H]+.
    Separate multiple values by comma.
    SL: Structure location inside the publication, e.g. structure 1, HPLC fraction 2, compound 7a, fig. 3, p.456, etc.
    AG: Aglycon information (what is attached to the reducing end and at which position, if known), e.g. (->6') lipid A or inner core, bDGalpN C6. If the aglycon is a single residue present in the monomeric namespace (e.g. Allyl) or more than one residue is attached to the aglycon, encode it in the structure (ST1). The AG field is intended for describing aglycons whose structure could not be fully determined, or aglycons that cap the reducing end of a polymeric glycan. In all other cases it is advised to encode the aglycon as a Subst alias and add its SMILES code in the explanation (see the example above in ST1).
    Leading parentheses indicate the attachment site in the aglycon (e.g. (->3) sapogenin F1). Greek letters, apostrophes and quotes are allowed.
    MF: Molecular formula (for mono- and oligosaccharides only). Carbons first, then hydrogens, then alphabetically.
    NMRH: 1H NMR assignment data (see NMR data storage for details). Paste the structure into the NMR template generator to generate a template to fill with chemical shifts. If the spectrum was published but the chemical shifts were not digitized or not yet deposited in CSDB, type present in publication. Proton enumeration within residues should follow carbon enumeration. Chemical shifts of two protons attached to the same carbon atom should be separated by a hyphen, in ascending order.
    The spectrum should correspond to the exact structure specified in ST1. Please avoid situations like a spectrum deposited for an oligomer, while structure is a polymer with the same repeating unit (because capping groups are different), or a spectrum deposited for a fragment, while structure contains other moieties attached, etc. If the structure contains non-stoichiometric moieties (residues denoted by %), provide spectra for the structure variant with these moieties attached.
    NMRC: 13C NMR assignment data (see NMR data storage for details). Paste structure into NMR template generator to generate a template to fill with chemical shifts. If the spectrum is published but not yet deposited in CSDB, type present in publication. For carbon enumeration, refer to Monomer namespace.
    The spectrum should correspond to the exact structure specified in ST1. Please avoid situations like a spectrum deposited for an oligomer, while structure is a polymer with the same repeating unit (because capping groups are different), or a spectrum deposited for a fragment, while structure contains other moieties attached, etc. If the structure contains non-stoichiometric moieties (residues denoted by %), provide spectra for the structure variant with these moieties attached.
    NMRT: Temperature used to carry out NMR experiments, in kelvins. The nucleus can be specified in parentheses. If carbon and proton spectra were recorded at different temperatures, separate the values with a comma, e.g. 313(H), 298(C).
    NMRS: Solvent used to carry out NMR experiments (chemical formula or abbreviation). Separate the components of mixed solvents with a slash (e.g. CD3OD / D2O), listing the major component first. If ratios are known, write them before the solvents: 90%D2O / 10%H2O / 25mM NaCl or vol 67%CDCl3 / vol 33%DMSO-d6, etc. If the reference standard is not TMS, specify it as another solvent: D2O / DSS. The pH value can be appended after a semicolon: D2O; pD 7.5.
    SO: Semicolon-separated list of biological sources of the structure, without abbreviations. The first word in every source is interpreted as genus, the second is for species, all the other words (from the third on) are combined serogroup and/or strain. If the species name is unknown, while strain is known, specify sp. for species. If only genus or genus/species information is available, incomplete taxonomic annotations are allowed. If only a higher rank is known (e.g. family), parenthesize it. For hybrid organisms, combine two taxons with star (A * B). If a taxon was renamed or an organism was reclassified after the data were published, specify the newer name in curly brackets after the organism, e.g. Enterobacter sakazakii O6 {NEW: Cronobacter sakazakii O6}. Older synonyms are specified as {OLD: ....}. The published taxon name should be always outside curly brackets; any number of OLD and/or NEW terms can be combined inside curly brackets.
    KD: Before slash (/): Taxonomic domain (bacteria,protista,archaea,fungi,algae,plant,animal,mammal,human).
    After slash (/): Taxonomic phylum.
    If there are multiple organisms of different domains or phyla, separate values with comma (in this case number of values should correspond to the number of organisms, including remappings in curly brackets).
    OTI: Organ or tissue in the host organism from which the structure was extracted, or organelle of the microorganism.
    DSS: Disease of the host organism associated with the structure, or disease of the patient from which the structure was extracted.
    HO: Systematic name (genus and species) of the host organism in which the microorganism (specified in SO) was found.
    NC: (Trivial) name of the compound. Greek letters, apostrophes and quotes are supported. If multiple, separate by semicolon.
    CC: Comma-separated list of classes and roles of the compound, e.g. O-antigen, EPS, phosphoglycolipid, GPI-anchor etc. For spelling and word order of popular classes, please refer to Class abundancy table.
    MT: Comma-separated list of methods that were used to elucidate or process the structure. For spelling and word order of popular methods, please refer to Method abundancy table.
    BA: Biological activity of the compound (free text), including binding information and serological data. If unsure, please indicate whether these data are present in the paper, e.g. serological data.
    EI: Comma-separated list of enzymes that release or process the structure.
    BG: Availability of biosynthetic and genetic data in the associated publication (biochemical data, genetic data)
    SY: Availability of data on laboratory or industrial synthesis of the compound (chemical, chemical fragmentary, enzymatic, enzymatic in vivo, chemoenzymatic, chemical and enzymatic, synthesis (unknown), fragmentary (unknown), modelling, biosynthesis; leave blank if no synthesis is discussed in associated publication).
    KW: Comma-separated list of keywords from the associated publication. Greek letters, apostrophes and quotes are supported.
    NT: Any comments that do not fit to other fields (e.g. errors found in the article).
    3D: Availability of 3D-structure and conformation data (conformation data, computer modelling, dynamics etc.)
    RR: Comma-separated list of related CSDB record IDs:
    - other structures compared with this structure in the publication,
    - fragments of this structure,
    - similar structures having only minor differences with this structure,
    - structures of the same type from the same organism,
    - bigger structures, which this structure is a fragment of,
    - other subunits of the same molecule,
    - etc.
    If multiple structures are published in a review, cross-link them only if they are produced by the same organism or are closely interrelated. If a referenced record is in another domain, precede it with the letter B, P or F (depending on the dump it is contained in: bacterial, plant or fungal). If a reference should not be imported, but you still wish to keep it (e.g. the referenced record is temporarily excluded from the dump by adding a star), add * afterwards. Example: 1234,B5678,100500*,12345
    DB: Cross-references to other structural databases: comma-separated list of resource-identifier pairs (the resource name is before the colon, the reference is after the colon, e.g. CA:123456789). Widespread resource identifiers are CCSD (Carbbank ID), GlycomeDB (GlycomeDB ID), CA (Chemical Abstracts access number), CA-RN (CA substance registry number), US-PT (USA patent number), ProtDB (Protein Databank ID), GenDB (GenBank ID).
    TAX: Comma-separated list of cross-references to the NCBI Taxonomy database, according to the order of organisms in the SO field. The number of values should be equal to the number of taxons in SO, excluding taxon remappings in curly brackets. Please specify NCBI TaxID of the organism (strain). If no NCBI TaxID exists for the organism, use NCBI TaxID of the species (or genus if species is missing from the NCBI taxonomy database) in parentheses. If sp. is used for species and NCBI TaxID of the exact organism is unknown, use NCBI TaxID of the genus. For hybrid organisms, combine two TaxIDs with star (100*101). If a genus is missing from the NCBI Taxonomy database, please use a TaxID of the kingdom and negate it (e.g. -2 for bacteria). A value of -1 means organism is unknown.
    U1: Last name of the record submitter (yours).
    U2: Record submission date (DD.MM.YYYY).
    U3: RefManID of this paper in the local database of the Carbohydrate Chemistry Lab, Zelinsky Institute. Add PDF if a full-text PDF is present.
    U4: Comma-separated list of RefManIDs of related papers in the local database of the Carbohydrate Chemistry Lab, Zelinsky Institute.
    U5: If the record was imported from another database (e.g. Carbbank), U5 lists errors found in the original record.
    U6: Filename (with extension) of the publication full text in CSDB local cache.
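
    Several of the field values described above follow small, regular formats. The Python sketch below shows how a few of them (ID, RL/DB, RR and U2) might be read. It is an illustration only, not CSDB's own code: all function names and the status labels "ok", "unresolved" and "excluded" are hypothetical, and it assumes that the stars in ID follow the number and that references in RL/DB contain no commas.

```python
import re
from datetime import datetime

ID_VALUE = re.compile(r"^(\d+)\s*(\*{0,2})\s*(.*)$")
RR_ITEM = re.compile(r"^([BPF]?)(\d+)(\*?)$")


def parse_record_id(value):
    """ID: number, optional * or **, optional explanation."""
    m = ID_VALUE.match(value.strip())
    if m is None:
        raise ValueError(f"bad ID value: {value!r}")
    number, stars, note = m.groups()
    status = {"": "ok", "*": "unresolved", "**": "excluded"}[stars]
    return int(number), status, note


def parse_xrefs(value):
    """RL or DB: comma-separated resource:reference pairs."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so that URLs stay intact.
        resource, sep, ref = item.partition(":")
        if not sep:
            raise ValueError(f"missing colon in cross-reference: {item!r}")
        pairs.append((resource.strip(), ref.strip()))
    return pairs


def parse_related(value):
    """RR: IDs with an optional B/P/F domain prefix and a trailing * flag."""
    result = []
    for item in value.split(","):
        m = RR_ITEM.match(item.strip())
        if m is None:
            raise ValueError(f"bad related-record reference: {item!r}")
        prefix, number, star = m.groups()
        # Tuple: (domain prefix or None, record ID, True if not to be imported)
        result.append((prefix or None, int(number), star == "*"))
    return result


def parse_submission_date(value):
    """U2: submission date in DD.MM.YYYY format."""
    return datetime.strptime(value.strip(), "%d.%m.%Y").date()
```

    For instance, parse_related("1234,B5678,100500*,12345") yields four tuples, of which the second carries the prefix "B" and the third has its do-not-import flag set.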

     
    Example dump of a fully filled hypothetical record:

    #CSDB dump file - example
    
    ID: 50000
    TH: 1
    AU: Toukach FV, Akvinski F
    TI: Antigen-induced suppression of sexual identification of worras infected by Deliriums trementii
    JN: Pseudoscience Letters
    PY: 1943
    VL: 489(12)
    PG: 9723-9725
    RL: PMID:123456789, DOI:10.1000/12345.6789, URL: http://toukach.ru/rus/hs_main.htm
    EA: never@cnt.ru
    AD: N.D. Zelinsky Institute of Organic Chemistry, Moscow, Russia
    AB: The structure of the O-specific polysaccharide of Deliriumus trementii O66 has been elucidated using 2D-NMR approach, including... The studies of biological activity revealed the effect of this glycopolymer on sexual behaviour of Worra-worras bla-bla-bla...
    ST1: -4)[Subst(?-3)]a?Fucp(1-P-4)[10%Ac(1-5)aXNeup(2-6),]bDGlcpN(1- // Subst = phosphorylated mannan
    ST1ORIG: -4)[Subst(?-3)]a?Fucp(1-P-4)[aXNeup(2-6)]bDGlcpN(1- // Subst = mannan (erroneous structure)
    ST2: SBIOL
    ST3: n=45
    SL: HPLC fraction 2
    AG: core oligosaccharide
    MF: C25 H42 N2 O21 P
    NMRH: #4,0,3_Subst // #4,0_a?Fucp   5.01 3.80 4.30 4.25 4.70 1.21// #4_P // #6,5_10%Ac   - 2.02// #6_aXNeup   - 4.97 2.32-2.38 3.60 3.40 4.40 3.60 3.62 3.50-3.55// #2_Ac   - 2.03// #3_Me   1.33// #0_bDGlcpN   4.87 3.33 3.85 4.00 4.15 3.80-3.90// 
    NMRC: #4,0,3_Subst // #4,0_a?Fucp   96.5 68.6 71.0 81.9 67.8 15.9// #4_P // #6,5_10%Ac   175.0 23.0// #6_aXNeup   174.1 107.2 41.0 69.0 52.9 73.9 69.2 72.6 63.6// #2_Ac   175.0 23.2// #3_Me   16.3// #0_bDGlcpN   103.0 55.6 78.9 70.9 75.4 67.1// 
    NMRT: 308
    NMRS: 90%D2O / 10%H2O / 0.01M CD3COOD; pD 5.5
    SO: Deliriumus trementii O66 PCM2004, Deliriumus sp. 12345  {NEW: Deliriumus mirabilis 12345, OLD: Worrodeliriumus sp. 12345}
    KD: bacteria / Proteobacteria
    OTI: brain
    DSS: sexual misidentification, paranoya
    HO: Worra worra
    NC: khrenobiose,okhrenobiose
    CC: O-antigen
    MT: FAB-MS, MALDI-TOF, methylation, 1D-NMR, 2D-NMR, 13C-NMR, HF solvolysis, partial hydrolysis, Smith degradation
    BA: causes suppression of Worra-worra's natural instincts, binds to monoclonal antibody 1B1, serological data available
    EI: CoQ-III
    BG: biochemical and genetic data
    SY: chemical fragmentary
    KW: worra,Deliriumus trementii,structure,polysaccharide
    NT: structure was revised (see ID 50001)
    3D: conformation data, computer modelling, dynamics
    RR: 50001
    DB: CCSD:123456,CA-RN:300-4183,ProtDB:PIR1 - ABC456,SpecDB:2-12-85-0-6
    TAX: 6661313,(6661314)
    U1: Toukach
    U2: 31.12.2004
    U3: 55555
    U4: 66666
    U5: incorrect structure
    U6: id50000.pdf
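
    A record such as the one above could also be produced programmatically. The Python sketch below is an illustration only, not CSDB's own tooling; format_record and format_dump are hypothetical helper names. It follows the rules stated in this section: one "FIELD: content" line per field, no line breaks inside field content, comment lines starting with #, and two blank lines between records.

```python
def _one_line(text, what):
    """Return text as str, rejecting embedded line breaks."""
    text = str(text)
    if "\n" in text or "\r" in text:
        raise ValueError(f"line break not allowed in {what}")
    return text


def format_record(fields):
    """Render one record (mapping of field name -> content) as dump lines."""
    lines = []
    for name, content in fields.items():
        lines.append(f"{name}: {_one_line(content, 'field ' + name)}")
    return "\n".join(lines)


def format_dump(records, comment=None):
    """Render several records, separated by two blank lines."""
    # Joining with three newlines leaves exactly two blank lines between records.
    body = "\n\n\n".join(format_record(r) for r in records)
    head = f"#{_one_line(comment, 'comment')}\n\n" if comment else ""
    return head + body + "\n"
```

    Rejecting line breaks rather than silently replacing them is a design choice of this sketch; for the AB field the text above asks submitters to replace line breaks with spaces before the value is stored.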
    
    

    Structure encoding

    A human-readable one-line structure encoding language was developed within the CSDB project to help error tracking and to simplify data posting and input/output operations. The comparison of CSDB Linear with other languages is shown in the table (red = worse, green = better):

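    Language | Project | Approach | Complete | Unambiguous | Controllable | Parseable | Fuzziness support
    IUPAC | |
    IUPAC extended | Sweet-DB, Carbbank | pseudo-graphics
    GLYDE 1.2 | | XML
    CabosML | JCGGDB | XML | ?
    GlycoCT | GlycomeDB |
    Linear Code | CFG | URL
    LINUCS | Glycosciences.de |
    KCF | KEGG |
    WURCS | JCGGDB, ChEBI, PDB | URL
    CSDB Linear | CSDB | URL
    (The color-coded rating cells of the original table are not reproduced here.)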

    The main advantages of CSDB Linear are built-in support for multiple non-carbohydrate constituents and the combination of unambiguity with human readability. Its disadvantages are the lack of support for nested repeating units, for the combination of repeating and non-repeating parts in a single structure, and for undetermined side chain attachment sites. A detailed language description is available in the dedicated help section: Structure encoding.
