CSDB structure linear encoding

This section describes the rules for encoding of carbohydrate and derivative structures in a single line ('CSDB linear code'). You may need this information in three cases:



Residues are described by the sequence of terms like <residue name>(<linkage>). In the case of the reducing end residue, the expression in parentheses is not required. For example, A(1-3)B(1-4)C is a linear fragment in which residue A substitutes position 3 of residue B by its first position, residue B substitutes position 4 of residue C by its first position, and residue C substitutes nothing. Here and below, Latin capitals stand for residue names.
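The term sequence described above can be illustrated with a small sketch. It is not part of CSDB; the function name parse_linear and the regular expression are assumptions made for illustration, and the sketch handles only unbranched oligomers with plain numeric or ? positions.

```python
import re

# One term: a residue name followed by an optional "(goes_by-goes_to)" linkage.
# Hypothetical pattern; it ignores branches, polymers and phosphate bridges.
TERM = re.compile(r"([^()\[\],]+)(?:\((\d+|\?)-(\d+|\?)\))?")


def parse_linear(code):
    """Split an unbranched oligomer into (residue, goes_by, goes_to) tuples."""
    terms = []
    pos = 0
    while pos < len(code):
        match = TERM.match(code, pos)
        if match is None:
            raise ValueError(f"cannot parse term at offset {pos}")
        terms.append((match.group(1), match.group(2), match.group(3)))
        pos = match.end()
    return terms
```

For A(1-3)B(1-4)C the sketch yields three terms; the last one (the reducing end residue C) has no linkage, so both of its positions are None.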

Topology and linkage

If the structure is polymeric, the leftmost and rightmost residues should have open linkages, e.g. -2)A(1-3)B(1-4)C(1- denotes the same structure as above, except that it represents repeating units linked to each other by a 1-2 linkage.
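As a rough illustration of the open-linkage convention, the following sketch recognises a polymeric repeating unit and extracts the linkage between units. The name repeat_linkage and the pattern are assumptions, and open ends that contain phosphate or sulphate bridges are deliberately not handled.

```python
import re

# Hypothetical pattern: "-<goes_to>)" at the left end, "(<goes_by>-" at the right end.
POLYMER = re.compile(r"-(\d+|\?)\)(.*)\((\d+|\?)-", re.S)


def repeat_linkage(code):
    """Return (goes_by, goes_to) of the inter-unit linkage, or None for an oligomer."""
    match = POLYMER.fullmatch(code)
    if match is None:
        return None
    goes_to, _body, goes_by = match.groups()
    return goes_by, goes_to
```

For -2)A(1-3)B(1-4)C(1- this returns the 1-2 linkage as a pair of strings, while the oligomer A(1-3)B(1-4)C gives None.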

If there are branching points, one chain is always considered the main one, and the others are side chains. To distinguish which chains are main and which are side, refer to the corresponding section below. The side chains are enclosed in square brackets together with their linkage parentheses. For example, A(1-3)[B(1-4)]C is a branched fragment in which residue A substitutes position 3 of residue C, while residue B substitutes position 4 of residue C. Several side chains attached to one residue are separated by commas. Side chains may themselves be linear or branched, and all combinations of nested square brackets are allowed, e.g. -7)B(1-3)[D(1-3)[E(2-6)]C(1-6),G(2-6:2-4)F(1-4)]A(1- is a topology depicted on the right.
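The bracket and comma rules above can be sketched as a simple depth counter. The helper name top_level_side_chains is hypothetical; it only separates the outermost side chains and leaves nested ones inside the returned strings.

```python
def top_level_side_chains(code):
    """Return the side chains found in outermost square brackets.

    Chains that share one pair of brackets are split on commas at depth 1;
    nested brackets are kept verbatim inside the chain they belong to.
    """
    chains = []
    depth = 0
    start = None
    for i, ch in enumerate(code):
        if ch == "[":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']'")
            if depth == 0:
                chains.append(code[start:i])
                start = None
        elif ch == "," and depth == 1:
            chains.append(code[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced '['")
    return chains
```

Applied to the nested example above, the sketch returns the two side chains attached to residue A: D(1-3)[E(2-6)]C(1-6) and G(2-6:2-4)F(1-4).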


The linkage between residues is specified in parentheses to the right of the donor residue (<goes by>-<goes to>), where goes by denotes the position (carbon number) by which this residue substitutes another residue (usually 1 or 2), and goes to denotes which position of the linked residue is substituted. For example, 1-O-methyl,2-O-α-D-manno-β-D-glucoside is represented by aDMan(1-2)bDGlcp(1-1)Me.

Outgoing (goes_by) and ingoing (goes_to) linkage positions may be numbers or question marks (?) if unknown. The ingoing linkage position to Subst(N) aliases may contain position modifiers (a,b,c,',").

It is assumed that the linkage is formed with elimination of water or ammonia, giving an ester (including glycosidic and phosphodiester), ether, amide or amine bond. To specify a carbon-carbon linkage, add the character C after the outgoing position (e.g. bDGlc(1C-6)bDFucp).

If a residue forms more than one outgoing linkage to its acceptor residue (pyruvates, biphosphates etc.), use a colon to separate the linkages inside the parentheses, e.g. xRPyr(2-6:2-4)aDGal means the 4,6-pyruvated galactose. The higher position in the acceptor always goes first (A(2-6:2-4)B but not A(2-4:2-6)B). Biphosphates and bisulphates have 0 as their <goes by> index, e.g. P(0-6:0-4)bDGalp.
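The linkage rules above (numeric or ? positions, the optional C marker, and colon-separated multiple linkages with the higher acceptor position first) can be checked with a short sketch. The function parse_linkage and its dictionary keys are assumptions for illustration; position modifiers for Subst(N) aliases and phosphate bridges are not covered.

```python
import re

# Hypothetical pattern for one bond: goes_by, optional "C" marker, goes_to.
BOND = re.compile(r"(\d+|\?)(C?)-(\d+|\?)")


def parse_linkage(text):
    """Parse the inside of a linkage parenthesis, e.g. '1-3', '1C-6' or '2-6:2-4'."""
    bonds = []
    for part in text.split(":"):
        match = BOND.fullmatch(part)
        if match is None:
            raise ValueError(f"bad linkage: {part!r}")
        goes_by, c_flag, goes_to = match.groups()
        bonds.append({
            "goes_by": goes_by,
            "goes_to": goes_to,
            "carbon_carbon": c_flag == "C",
        })
    # The higher acceptor position must be written first; unknown (?) is skipped.
    known = [int(b["goes_to"]) for b in bonds if b["goes_to"] != "?"]
    if known != sorted(known, reverse=True):
        raise ValueError("higher acceptor position must go first")
    return bonds
```

With this sketch, 2-6:2-4 and 0-6:0-4 are accepted, whereas 2-4:2-6 is rejected because the acceptor positions are in ascending order.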

Except for bisubstitution, phosphates and sulphates should be included in the linkage parentheses like this: aDGlc(1-P-4)bLFuc (in the chain) or P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end). Longer chains (di-, tri- etc.) are allowed: aDGlc(1-P-P-4)bLFuc or S-S-4)bLFuc.
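A minimal sketch of how an in-chain linkage with a phosphate or sulphate bridge might be split into its parts. The name split_bridged_linkage is hypothetical, and the terminal forms (P-4) at the non-reducing end, (1-P at the reducing end) are outside its scope.

```python
def split_bridged_linkage(text):
    """Split an in-chain linkage such as '1-P-4' into (goes_by, bridge, goes_to).

    The bridge is the list of P/S units between the two positions; it is
    empty for an ordinary linkage such as '1-3'.
    """
    parts = text.split("-")
    if len(parts) < 2:
        raise ValueError(f"not a linkage: {text!r}")
    goes_by, *bridge, goes_to = parts
    for unit in bridge:
        if unit not in ("P", "S"):
            raise ValueError(f"unexpected bridge residue: {unit!r}")
    return goes_by, bridge, goes_to
```

The linkage of aDGlc(1-P-P-4)bLFuc, for instance, splits into donor position 1, a bridge of two phosphates, and acceptor position 4.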

Names of residues

Each residue name is composed of several fields that follow each other without separators:

[Figure: Residue naming]

Monosaccharide base names are abbreviations of trivial names indicating, together with the absolute configuration, the stereochemistry of all chiral carbons. Below are the most widespread examples:

Non-sugar base names are abbreviations of their trivial names:

Examples: aD6dTalA, aXKdo, xLGro, xDManN-ol, Ac, xRPyr, ?DFucN3N.


Lipid base names

The naming system for lipid residues matches the general naming system described above. The symbol l should be used in the anomeric configuration field. For most lipids there are reserved names like Pam, Ole, Vac etc. (the complete list is available in the aliphatic acid section of the complete residue list). If there is no reserved base name, the following rules may be used to construct new base names:

Distinguishing between main and side chains; side chain order

The chain is called normal if it is not secondary. The chain is secondary if it starts with any of the following residues (has it at the reducing end or consists only of it):

To keep this encoding unambiguous, you should pay attention to which chains are encoded as main and which are the side ones.

When comparing substitution positions, the following special cases exist:

Non-stoichiometrical linkages

A linkage is non-stoichiometric if its donating residue is present in a non-stoichiometric amount in polymer repeating units or the structure represents a mixture of oligomers. In this case, the residue name should be preceded by the stoichiometry degree as a percentage (e.g. 40%bDGlc). A single percent sign without a number (e.g. %Ac) means that the residue is present in a non-stoichiometric quantity but the exact amount is not known. Phosphate and sulphate residues can be preceded by a percentage as well, e.g. xDRib-ol(1-50%P-4)bDGalp.

The percentage is applied only to the outgoing linkage of the residue, which means that if a residue is substituted, all its child chains are non-stoichiometric, too. For example:

The root residue of an oligomer cannot be non-stoichiometric. To encode structures like A-B-%C, use A[%C]B or %C[A]B.
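The percentage prefix described in this section can be separated from the residue name with a few lines of code. This is only a sketch: split_stoichiometry is a hypothetical helper, and it assumes that the stoichiometry degree is written as a whole number.

```python
def split_stoichiometry(term):
    """Return (residue_name, percent, is_stoichiometric) for one residue term.

    percent is None when no number is given (a bare '%' means the amount is
    unknown, no '%' at all means the residue is stoichiometric).
    """
    head, sep, tail = term.partition("%")
    if not sep:
        return term, None, True
    if head and not head.isdigit():
        raise ValueError(f"bad stoichiometry prefix: {head!r}")
    percent = int(head) if head else None
    return tail, percent, False
```

Under these assumptions, 40%bDGlc gives the name bDGlc with a degree of 40, and %Ac gives the name Ac with an unknown degree.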

Structural fuzziness

Unknown anomeric or absolute configurations and ring sizes can be encoded at the level of the residue name (see above). For unknown linkage positions, use a question mark, e.g. Subst1(?-?)bLFucp or xXEtN(1-P-?)aXKdop. For residues that are not fully elucidated, see the Superclasses section.

Structural uncertainties

Fuzziness at the topological level can be encoded by one of two syntactic constructions: <<A(n-m)|B(p-q)>> for exclusive combination (XOR) and <A(n-m)|B(p-q)> for inclusive combination (OR). This means that e.g. <<A(1-3)|B(1-4)>>C is a disaccharide in which either C3 of residue C is substituted by residue A or C4 of residue C is substituted by residue B, but not both at once, while <A(1-3)|B(1-4)>C is a structure in which C3 of residue C is substituted by residue A, or C4 of residue C is substituted by residue B, or both these positions are substituted by A and B, respectively.
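The difference between the exclusive and inclusive constructions can be made concrete by enumerating the substituent sets that each one permits. The sketch below is an illustration only: fuzzy_choices is a hypothetical name, nested blocks are not handled, and treating an inclusive block with more than two variants as "any non-empty subset" is an assumption that extends the two-variant definition above.

```python
from itertools import combinations


def fuzzy_choices(block):
    """Enumerate the substituent sets allowed by one un-nested fuzzy block."""
    if block.startswith("<<") and block.endswith(">>"):
        variants = block[2:-2].split("|")
        # XOR: exactly one variant is present.
        return [(v,) for v in variants]
    if block.startswith("<") and block.endswith(">"):
        variants = block[1:-1].split("|")
        # OR: any non-empty subset of the variants (assumed for more than two).
        return [
            combo
            for size in range(1, len(variants) + 1)
            for combo in combinations(variants, size)
        ]
    raise ValueError(f"not a fuzzy block: {block!r}")
```

For <<A(1-3)|B(1-4)>> the sketch lists two possibilities, and for <A(1-3)|B(1-4)> it lists three, the third being both substituents at once.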

A fuzzy residue enclosed in angle brackets can itself be substituted, e.g. D(1-2)<<A(1-3)|B(1-4)>>C means that the structure is either D(1-2)A(1-3)C or D(1-2)B(1-4)C. If a variant inside the substituted fuzzy block is longer than one residue, it is assumed that the residue at its non-reducing end is substituted, i.e. A<BC|D>E is interpreted as ADE or ABCE, but not as ADE or A[B]CE. It is recommended to avoid substituted fuzzy residues and instead to add the substituent to each variant in the fuzzy set (use E[<DA|DB>]C instead of E[D<A|B>]C). If this is impossible, always use chains that are as short as possible inside a fuzzy block, i.e. -ED<A|B>C- but not -E<DA|DB>C-.

More than two variants inside a fuzzy block are allowed, e.g. <A|B|C>.

Angle brackets can be nested up to one level, e.g. <<D|<<A|B>>|C>> or <<<<A|B>>|C>> or <<<A|B>|C>> or <<<A|B>>|C> or <<C|<A|B>>> etc.

Fuzzy residues at the reducing end or at the rightmost position in the polymer repeating unit are not supported (use ...[<<lXDco(2-1)|lXLin(2-1)>>]aDGalp, but NOT ...aDGalp(1-2)<<lXDco|lXLin>>).

Monovalent and inorganic acid residues

All monovalent substituents (Ac, Me, Et, Fo and other residues that CANNOT be substituted) should be described as separate residues, e.g. aDGal(1-3)bDGlcNAc should be recorded as aDGal(1-3)[Ac(1-2)]bDGlcN. If a monovalent residue is an aglycon at the reducing end, write it as follows: aDGlc(1-Me. During the substructure search (but not during the data upload or automated exchange) you can specify monovalent residues in the usual way, e.g. bDQuiNAc3NAc4Ac.

Except for bisubstitution, phosphates and sulphates should be included in the linkage parentheses like this: aDGlc(1-P-4)bLFuc (in the chain) or P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end). Longer chains (di-, tri- etc.) are allowed: aDGlc(1-P-P-4)bLFuc or S-S-4)bLFuc.

Aliases and superclasses

If an exact residue at a certain position in the structure is unknown, a superclass name can be used instead of a residue name. Superclasses do not require anomeric and absolute configurations and ring size. The following superclasses are supported:

  • TET - any tetrose
  • PEN - any pentose
  • HEX - any hexose
  • HEP - any heptose
  • OCT - any octose
  • NON - any nonose
  • SUG - any sugar
  • ALK - any alkyl chain
  • LIP - any fatty acyl
  • CER - any N-acylated sphingoid
  • SPH - any sphingoid
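For quick lookups, the superclass list above can be kept in a small table. The dictionary and the helper describe_superclass are illustrative assumptions, as is the choice of an exact, case-sensitive match.

```python
# Superclass names and meanings copied from the list above.
SUPERCLASSES = {
    "TET": "any tetrose",
    "PEN": "any pentose",
    "HEX": "any hexose",
    "HEP": "any heptose",
    "OCT": "any octose",
    "NON": "any nonose",
    "SUG": "any sugar",
    "ALK": "any alkyl chain",
    "LIP": "any fatty acyl",
    "CER": "any N-acylated sphingoid",
    "SPH": "any sphingoid",
}


def describe_superclass(name):
    """Return the meaning of a superclass name, or None for other residue names."""
    return SUPERCLASSES.get(name)
```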

For rarely occurring sugars, non-carbohydrate residues, or other residues that are not stored in the monomer database, use aliases. There are several allowed alias types:

All aliases other than Subst or SubstN should have anomeric and absolute configurations and ring size (which may be ?=unknown). All aliases should be explained in the comment section of the encoding after two slashes (//), e.g. aDGlc(1-?)Subst // Subst = Lipid A. Several alias explanations are separated by semicolons, e.g. Subst1(?-6)aDGlc(1-?)Subst2 // Subst1 = acyl or polyglycerol; Subst2 = Lipid A.
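The comment convention for alias explanations lends itself to a simple split on //, ; and =. The sketch below assumes that an explanation never contains a semicolon itself; split_aliases is a hypothetical helper name.

```python
def split_aliases(encoding):
    """Separate a structure from its alias explanations.

    Returns (structure, aliases) where aliases maps each alias name to the
    free-text explanation given after '//'.
    """
    structure, sep, comment = encoding.partition("//")
    aliases = {}
    if sep:
        for item in comment.split(";"):
            if not item.strip():
                continue
            name, eq, meaning = item.partition("=")
            if not eq:
                raise ValueError(f"alias explanation without '=': {item!r}")
            aliases[name.strip()] = meaning.strip()
    return structure.strip(), aliases
```

For the two-alias example above, this yields the bare structure Subst1(?-6)aDGlc(1-?)Subst2 together with a mapping for Subst1 and Subst2.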

If the Subst or SubstN alias stands for aglycon (a moiety at the reducing end that cannot be encoded using the rules above) and it is attached to an unknown or default (C1 of aldoses, C2 of ketoses) position, it should be encoded in the aglycon (AG) field rather than in the structure (ST1), i.e.:

ST1: aDGalp
AG: lipid A
but not ST1: aDGalp(1-?)Subst // Subst = Lipid A

Greek letters, single and double quotes are allowed in Subst(N) explanations (ST1) and aglycons (AG). Substitution positions with single or double quotes or letters (3', 4", 15a) are allowed for Subst(N) residues only.

Aglyca and unit types

There is a special field AG to store aglycon information. Please use it only if the aglycon cannot be encoded in the structure itself, i.e. the aglycon is a residue not supported by the database, a non-encodable group of residues, or a superclass (see above). The aglycon attachment is assumed to be at C1 for aldoses and C2 for ketoses. If possible, always try to encode a residue in the structure rather than in the aglycon:

ST1: aDGalp(1-P-P-5)xXnucA
but not ST1: aDGalp

If the carbohydrate moiety attachment position is not the default one, the linking atom in the aglycon is specified as follows:
AG: (->3) 3-hydroxy-2,4-diamino-toluene (aglycon is attached via its C3).

The unit type is stored separately in the ST2 field and may be the following:

Polymerization degree should be specified in ST3 if known, e.g. ST3: n=5-12
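A value such as n=5-12 can be read into a numeric range with a short sketch. Only the range form appears in the text above; accepting a single value such as n=5 is an assumption, and parse_degree is a hypothetical name.

```python
import re


def parse_degree(value):
    """Parse an ST3 value like 'n=5-12' into an inclusive (low, high) range."""
    match = re.fullmatch(r"n=(\d+)(?:-(\d+))?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised polymerization degree: {value!r}")
    low = int(match.group(1))
    # A single number (assumed form) is treated as a range of one value.
    high = int(match.group(2)) if match.group(2) else low
    if high < low:
        raise ValueError("upper bound is below lower bound")
    return low, high
```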