This section describes the rules for encoding of carbohydrate and derivative structures in a single line ('CSDB linear code'). You may need this information in three cases:
Residues are described by the sequence of terms like <residue name>(<linkage>). In the case of the reducing end residue, the expression in parentheses is not required. For example, A(1-3)B(1-4)C is a linear fragment in which residue A substitutes position 3 of residue B by its first position, residue B substitutes position 4 of residue C by its first position, and residue C substitutes nothing. Here and below, Latin capitals stand for residue names.
If the structure is polymeric, the leftmost and rightmost residues should have open linkages, e.g. -2)A(1-3)B(1-4)C(1- means the same structure as above, with the only difference that it represents repeating units linked to each other by the 1-2 linkage.
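The term syntax above can be illustrated with a small parser. This is only a sketch under stated assumptions, not the parser used by CSDB: the function names `parse_linear` and `parse_repeat` are hypothetical, and only unbranched structures with plain `(<goes by>-<goes to>)` linkages are handled (no phosphates in the linkage, no brackets, no fuzzy blocks).

```python
import re

# One term: a residue name followed by an optional "(goes_by-goes_to)" linkage.
# Assumption: residue names contain no parentheses, brackets or commas.
TERM = re.compile(r"([^()\[\],]+)(?:\(([0-9?]+)-([0-9?]+)\))?")

# Repeating unit: an open ingoing linkage on the left, an open outgoing one on the right.
REPEAT = re.compile(r"^-([0-9?]+)\)(.*)\(([0-9?]+)-$")


def parse_linear(code):
    """Split an unbranched oligomer into (residue, goes_by, goes_to) tuples."""
    terms = []
    pos = 0
    while pos < len(code):
        m = TERM.match(code, pos)
        if m is None:
            raise ValueError(f"cannot parse at offset {pos}: {code[pos:]!r}")
        terms.append((m.group(1), m.group(2), m.group(3)))
        pos = m.end()
    return terms


def parse_repeat(code):
    """Parse a polymer repeating unit such as -2)A(1-3)B(1-4)C(1-."""
    m = REPEAT.match(code)
    if m is None:
        raise ValueError("not a repeating unit: " + code)
    terms = parse_linear(m.group(2))
    # The rightmost residue links to the leftmost residue of the next unit.
    name, _, _ = terms[-1]
    terms[-1] = (name, m.group(3), m.group(1))
    return terms


print(parse_linear("A(1-3)B(1-4)C"))
print(parse_repeat("-2)A(1-3)B(1-4)C(1-"))
```

In the oligomer the reducing-end residue C carries no linkage, while in the repeating unit its outgoing position 1 is paired with the ingoing position 2 taken from the left edge.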
If there are branching points, one chain is always considered the main one, and the others are side chains. To distinguish which chains are main and which are side, refer to the corresponding section below. Side chains are enclosed in square brackets together with their linkage parentheses. For example, A(1-3)[B(1-4)]C is a branched fragment in which residue A substitutes position 3 of residue C, while residue B substitutes position 4 of residue C. Several side chains attached to one residue are separated by commas. Side chains may themselves be linear or branched, and all combinations of nested square brackets are allowed, e.g. -7)B(1-3)[D(1-3)[E(2-6)]C(1-6),G(2-6:2-4)F(1-4)]A(1- is a topology depicted on the right.
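The bracket rules can be read as a small recursive grammar: a bracketed group placed before a residue lists extra donors of that residue. The following is a minimal sketch of that reading for oligomers only. The names `parse_tree` and `edges` are hypothetical, the linkage text is kept as an uninterpreted string, and polymer edges and fuzzy blocks are not handled.

```python
def parse_tree(code, i=0):
    """Parse one chain starting at offset i; return (root_node, next_offset).

    A node is a dict with the residue name, its outgoing linkage text
    (None for the reducing end) and the list of residues substituting it.
    """
    prev = None  # residue of this chain parsed so far; it substitutes the next one
    while True:
        children = [prev] if prev is not None else []
        if i < len(code) and code[i] == "[":
            i += 1
            while True:  # comma-separated side chains
                side, i = parse_tree(code, i)
                children.append(side)
                if i < len(code) and code[i] == ",":
                    i += 1
                elif i < len(code) and code[i] == "]":
                    i += 1
                    break
                else:
                    raise ValueError(f"expected ',' or ']' at offset {i}")
        j = i
        while j < len(code) and code[j] not in "()[],":
            j += 1
        if j == i:
            raise ValueError(f"expected a residue name at offset {i}")
        name, i = code[i:j], j
        link = None
        if i < len(code) and code[i] == "(":
            close = code.find(")", i)
            if close < 0:
                raise ValueError("unclosed linkage parenthesis")
            link, i = code[i + 1:close], close + 1
        node = {"name": name, "link": link, "children": children}
        if i >= len(code) or code[i] in ",]":
            return node, i  # end of the structure or of a side chain
        prev = node


def edges(node):
    """List (donor, linkage, acceptor) triples of the whole tree."""
    out = []
    for child in node["children"]:
        out.append((child["name"], child["link"], node["name"]))
        out.extend(edges(child))
    return out


root, _ = parse_tree("A(1-3)[B(1-4)]C")
print(edges(root))
```

For the nested example with its polymer edges removed, B(1-3)[D(1-3)[E(2-6)]C(1-6),G(2-6:2-4)F(1-4)]A, the same functions report that B, C and F substitute A, D and E substitute C, and G substitutes F.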
The linkage between residues is specified in parentheses to the right of the donor residue (<goes by>-<goes to>), where goes by denotes the position (carbon number) by which this residue substitutes another residue (usually 1 or 2), and goes to denotes which position of the linked residue is substituted. For example, 1-O-methyl,2-O-α-D-manno-β-D-glucoside is represented by aDMan(1-2)bDGlcp(1-1)Me.
Outgoing (goes_by) and ingoing (goes_to) linkage positions may be numbers or question marks (?) if unknown. The ingoing linkage position to Subst(N) aliases may contain position modifiers (a,b,c,',").
It is assumed that the linkage is formed with elimination of water or ammonia, giving an ester (including glycosidic and phosphodiester), ether, amide or amine bond. To specify a carbon-carbon linkage, use the C character after the outgoing position (e.g. bD1dGlc(1C-7)Subst // Subst = trans-zeatin). Please note that, since there is no OH group at C1 of a sugar residue in C-glycosides, it is encoded as a 1-deoxy-monosaccharide (1dGlc). To specify a carbon-nitrogen linkage in N-glycosides and N-linked glycoproteins, please use 1N derivatives of monosaccharides to keep consistency with the NMR simulation modules (e.g. bDGlc1N(1-4)xLAsn).
If a residue forms more than one outgoing linkage to its acceptor residue (pyruvates, biphosphates etc.), use a colon to separate the linkages inside the parentheses, e.g. xRPyr(2-6:2-4)aDGal means the 4,6-pyruvated galactose. The higher position in the acceptor always goes first (A(2-6:2-4)B but not A(2-4:2-6)B). Biphosphates and bisulphates have 0 as their <goes by> index, e.g. P(0-6:0-4)bDGalp.
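The ordering rule for multiple linkages can be checked mechanically. The sketch below uses hypothetical helper names (`parse_linkage`, `is_canonical`) and assumes that all acceptor positions are plain numbers; unknown positions written as ? would need separate handling.

```python
def parse_linkage(text):
    """Split the content of a linkage parenthesis, e.g. "2-6:2-4", into pairs."""
    pairs = []
    for part in text.split(":"):
        goes_by, goes_to = part.split("-")
        pairs.append((goes_by, goes_to))
    return pairs


def is_canonical(text):
    """True if the acceptor positions are listed from highest to lowest."""
    acceptor = [int(goes_to) for _, goes_to in parse_linkage(text)]
    return acceptor == sorted(acceptor, reverse=True)


print(is_canonical("2-6:2-4"))  # order used in xRPyr(2-6:2-4)aDGal
print(is_canonical("2-4:2-6"))  # order the rule forbids
```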
Except for bisubstitution, phosphates and sulphates should be included in the linkage parentheses like this: aDGlc(1-P-4)bLFuc (in the chain) or P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end). Longer chains (di-, tri- etc.) are allowed: aDGlc(1-P-P-4)bLFuc or S-S-4)bLFuc.
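As an illustration of the in-chain form, the content of such a linkage parenthesis can be split into the outgoing position, the bridging phosphate or sulphate residues, and the ingoing position. This is a sketch with a hypothetical function name (`split_bridge`); it covers only the complete in-chain form, not the open forms at the non-reducing or reducing end.

```python
def split_bridge(link):
    """Split "1-P-P-4" into ("1", ["P", "P"], "4"); a plain "1-3" has no bridge."""
    parts = link.split("-")
    if len(parts) < 2:
        raise ValueError("a linkage needs an outgoing and an ingoing position")
    return parts[0], parts[1:-1], parts[-1]


print(split_bridge("1-P-4"))    # linkage of aDGlc(1-P-4)bLFuc
print(split_bridge("1-P-P-4"))  # linkage of aDGlc(1-P-P-4)bLFuc
```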
Each residue name is composed of several fields following each other without separators:
Monosaccharide base names are abbreviations of trivial names indicating, together with the absolute configuration, stereochemistry of all chiral carbons. Below are most widespread examples:
Non-sugar base names are abbreviations of their trivial names:
Examples: aD6dTalA, aXKdo, xLGro, xDManN-ol, Ac, xRPyr, ?DFucN3N.
You can try to combine these fields and examine more names here.
Click here for the monomer namespace table
The naming system for lipid residues matches the general naming system described above. l should be used for anomeric configuration. For most lipids there are reserved names like Pam, Ole, Vac etc. (the complete list is available in the aliphatic acid section of the complete residue list). If there is no reserved base name, the following rules may be used to construct new base names:
The chain is called normal if it is not secondary. The chain is secondary if it starts with any of the following residues (has it at the reducing end or consists only of it):
To keep this encoding unambiguous, you should pay attention to which chains are encoded as main and which are the side ones.
When comparing substitution positions, the following special cases exist:
A linkage is non-stoichiometric if its donating residue is present in a non-stoichiometric amount in polymer repeating units or the structure represents a mixture of oligomers. In this case, the residue name should be preceded by the stoichiometry degree in percent (e.g. 40%bDGlc). A single percent sign without a number (e.g. %Ac) means that the residue is present in a non-stoichiometric quantity but the exact amount is not known. Phosphate and sulphate residues can be preceded by a percentage as well, e.g. xDRib-ol(1-50%P-4)bDGalp.
The percentage is applied only to the outgoing linkage of the residue, which means that if a residue is substituted, all its child chains are non-stoichiometric, too. For example:
The root residue of an oligomer cannot be non-stoichiometric. To encode structures like A-B-%C, use A[%C]B or %C[A]B.
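The percentage prefix can be separated from the residue name with a few lines of code. This is a sketch only; `split_stoichiometry` and the three status labels are hypothetical names chosen for the example.

```python
def split_stoichiometry(residue):
    """Return (name, status, percent) for a residue with an optional % prefix."""
    head, sign, tail = residue.partition("%")
    if not sign:
        return residue, "stoichiometric", None
    if head == "":
        return tail, "unknown amount", None  # bare percent sign, e.g. %Ac
    return tail, "partial", float(head)      # e.g. 40%bDGlc


for example in ("40%bDGlc", "%Ac", "bDGlc"):
    print(example, "->", split_stoichiometry(example))
```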
Unknown anomeric or absolute configurations and ring sizes can be encoded on the level of the residue name (see above). For unknown linkage positions, use a question mark, e.g. Subst1(?-?)bLFucp or xXEtN(1-P-?)aXKdop. For residues that are not fully elucidated, see the Superclasses section.
Fuzziness at the topological level can be encoded by one of two syntactic constructions: <<A(n-m)|B(p-q)>> for exclusive combination (XOR) and <A(n-m)|B(p-q)> for inclusive combination (OR). This means that e.g. <<A(1-3)|B(1-4)>>C is a disaccharide, in which either C3 of residue C is substituted by residue A or C4 of residue C is substituted by residue B, but not both at once, while <A(1-3)|B(1-4)>C is a disaccharide, in which C3 of residue C is substituted by residue A, or C4 of residue C is substituted by residue B, or both these positions are substituted by A and B, accordingly.
A fuzzy residue enclosed in angle brackets can itself be substituted, e.g. D(1-2)<<A(1-3)|B(1-4)>>C means that the structure is either D(1-2)A(1-3)C or D(1-2)B(1-4)C. If a variant inside a substituted fuzzy block is longer than one residue, it is assumed that the residue on its non-reducing end is substituted, i.e. A<BC|D>E is interpreted as ADE or ABCE, but not as ADE or A[B]CE. If a residue on the non-reducing end is monovalent or is a phosphate or sulfate, it is omitted from consideration and the substitution focus is moved to its acceptor. If there is more than one non-monovalent non-reducing end, the substitution focus is selected arbitrarily. To reduce ambiguity and complexity of interpretation, it is recommended to avoid substituted fuzzy residues and to add the substituent to each of the variants in a fuzzy set (use E[<DA|DB>]C instead of E[D<A|B>]C). If that is impossible, always use chains as short as possible inside a fuzzy block, i.e. -ED<A|B>C- but not -E<DA|DB>C-.
If the stoichiometry of branch reducing ends is specified inside a fuzzy block, e.g. -2)D(1-2)[<<E(1-4)30%A(1-3)|70%B(1-3)>>]C(1-, it is interpreted as the ratio of variants. If the total sum of these values is less than 100%, or the logic is OR, the whole fuzzy set can be missing. For example, <30%A(1-3)|70%B(1-4)>C means one of four variants (A(1-3)C, B(1-4)C, A(1-3)[B(1-4)]C, and C), while <<30%A(1-3)|70%B(1-4)>>C means one of two variants (A(1-3)C, B(1-4)C) and <<30%A(1-3)|60%B(1-4)>>C means one of three variants (A(1-3)C, B(1-4)C, and C). If a fuzzy set may become empty (variant C in the previous examples), it cannot be substituted.
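The counting in these examples can be reproduced with a short enumeration. The sketch below applies only to fuzzy sets in which every variant carries a percentage; the function name `fuzzy_outcomes` and the representation of a variant as a (label, percent) pair are assumptions made for the example.

```python
from itertools import combinations


def fuzzy_outcomes(variants, exclusive):
    """List the possible combinations of variants present on the acceptor.

    variants  -- list of (label, percent) pairs
    exclusive -- True for <<...>> (XOR), False for <...> (OR)
    An empty tuple stands for the unsubstituted acceptor.
    """
    labels = [label for label, _ in variants]
    total = sum(percent for _, percent in variants)
    if exclusive:
        outcomes = [(label,) for label in labels]
        may_be_empty = total < 100
    else:
        outcomes = [combo
                    for size in range(1, len(labels) + 1)
                    for combo in combinations(labels, size)]
        may_be_empty = True
    if may_be_empty:
        outcomes.append(())
    return outcomes


print(len(fuzzy_outcomes([("A(1-3)", 30), ("B(1-4)", 70)], exclusive=False)))
print(len(fuzzy_outcomes([("A(1-3)", 30), ("B(1-4)", 70)], exclusive=True)))
print(len(fuzzy_outcomes([("A(1-3)", 30), ("B(1-4)", 60)], exclusive=True)))
```

The three calls correspond to the three examples in the text and yield four, two and three outcomes respectively.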
More than two variants inside a fuzzy block are allowed, e.g. <A|B|C>.
Angle brackets can be nested up to one level, e.g. <<D|<<A|B>>|C>> or <<<<A|B>>|C>> or <<<A|B>|C>> or <<<A|B>>|C> or <<C|<A|B>>> etc.
Fuzzy residues at the reducing end or at the rightmost position in the polymer repeating unit are not supported (use ...[<<lX3HODco(3-1)|lX3HOMyr(3-1)>>]aDGalp, but NOT ...aDGalp(1-3)<<lX3HODco|lX3HOMyr>>).
All monovalent substituents (Ac, Me, Et, Fo and other residues that CANNOT be substituted) should be described as separate residues, e.g. aDGal(1-3)bDGlcNAc should be recorded as aDGal(1-3)[Ac(1-2)]bDGlcN. If a monovalent residue is an aglycon at the reducing end, write it as follows: aDGlc(1-Me. During the substructure search (but not during the data upload or automated exchange) you can specify monovalent residues in the usual way, e.g. bDQuiNAc3NAc4Ac.
If an exact residue at a certain position in the structure is unknown, a superclass name can be used instead of a residue name. Superclasses do not require anomeric and absolute configurations and ring size. The following superclasses are supported:
For rarely occurring sugars, non-carbohydrate residues, or other residues that are not stored in the monomer database, use aliases. There are several allowed alias types:
All aliases other than Subst or SubstN should have anomeric and absolute configurations and ring size (which may be ?=unknown). All aliases should be explained in the comment section of the encoding after two slashes (//), e.g. aDGlc(1-?)Subst // Subst = Lipid A. Several alias explanations are separated by semicolon, e.g. Subst1(?-6)aDGlc(1-?)Subst2 // Subst1 = acyl or polyglycerol; Subst2 = Lipid A.
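Separating the structure from its alias explanations is straightforward. The following sketch uses a hypothetical function name (`split_aliases`) and assumes that explanations themselves contain no semicolons.

```python
def split_aliases(record):
    """Split "structure // Alias1 = text; Alias2 = text" into its two parts."""
    structure, sep, comment = record.partition("//")
    aliases = {}
    if sep:
        for item in comment.split(";"):
            if not item.strip():
                continue
            name, eq, meaning = item.partition("=")
            if not eq:
                raise ValueError(f"alias explanation without '=': {item!r}")
            aliases[name.strip()] = meaning.strip()
    return structure.strip(), aliases


structure, aliases = split_aliases(
    "Subst1(?-6)aDGlc(1-?)Subst2 // Subst1 = acyl or polyglycerol; Subst2 = Lipid A"
)
print(structure)
print(aliases)
```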
If the Subst or SubstN alias stands for aglycon (a moiety at the reducing end that cannot be encoded using the rules above) and it is attached to an unknown or default (C1 of aldoses, C2 of ketoses) position, it should be encoded in the aglycon (AG) field rather than in the structure (ST1), i.e.:
AG: lipid A
but not
ST1: aDGalp(1-?)Subst // Subst = Lipid A
Greek letters, single and double quotes are allowed in Subst(N) explanations (ST1) and aglycons (AG). Substitution positions with single or double quotes or letters (3', 4", 15a) are allowed for Subst(N) residues only.
There is a special field AG to store aglycon information. Please use it only if the aglycon cannot be encoded in the structure itself, i.e. the aglycon is a residue not supported by the database, a non-encodable group of residues, or a superclass (see above). The aglycon attachment is assumed to be C1 for aldoses and C2 for ketoses. If possible, always try to encode a residue in the structure rather than in the aglycon:
but not ST1: aDGalp
If the attachment position of the carbohydrate moiety is not the default one, the linking atom in the aglycon is specified in the following way:
AG: (->3) 3-hydroxy-2,4-diamino-toluene (aglycon is attached via its C3).
The unit type is stored separately in the ST2 field and may be the following:
Polymerization degree should be specified in ST3 if known, e.g. ST3: n=5-12