CSDB help

CSDB structure linear encoding

This section describes the rules for encoding of carbohydrate and derivative structures in a single line ('CSDB linear code'). You may need this information in four cases:

if you plan to use an expert form of query when searching for a substructure
if you are going to submit your data to CSDB
if you are establishing an automated cross-database data exchange
if you query CSDB at the SQL level

Content:

Topology
Linkages
Repeating (sub)structures
Names of residues
Lipid base names
Monovalent and inorganic acid residues
Order of substituents
Non-stoichiometrical linkages
Ambiguities and alternative substructures
Superclasses and aliases
Aglyca
Atom numbering
Unit types and polymerization degree
Notation redundancy

Topology

Residues are described by a sequence of terms like <residue name>(<linkage>). In the case of the reducing end residue, the expression in parentheses is not required. For example, A(1-3)B(1-4)C is a linear fragment, in which residue A substitutes position 3 of residue B by its first position, residue B substitutes position 4 of residue C by its first position, and residue C substitutes nothing. Here and below, Latin capitals stand for residue names.

Topology and linkage

The logic of residue connection is a tree graph. The rightmost residue in an oligomeric structure (reducing end) is its root. Every residue except the root must substitute a single residue. Any residue, including the root, can have zero or more substituents. A substituent is the residue on the left (e.g. a glycosidic bond donor, or a child in a tree graph); the substituted residue is the one on the right (e.g. a glycosidic bond acceptor, or a parent in a tree graph). Polymeric structures have a pseudo-root, which substitutes another instance of the same repeating unit, thus forming a cyclic graph.

If there are branching points, one chain is always considered as main, and others are side chains. To distinguish which chains are main and which are side, refer to the corresponding section below. The side chains are enclosed in square brackets together with parentheses indicating a linkage. For example, ^^A(1-3)[B(1-4)]C is a branched fragment in which residue A substitutes position 3 of residue C, while residue B substitutes position 4 of residue C. In substructure search requests, the ^^ prefix indicates that the marked residue is unsubstituted. Several side chains attached to one residue are separated by commas. Side chains can also be linear or branched, and all combinations of nesting square brackets are allowed, e.g. -7)B(1-3)[D(1-3)[E(2-6)]C(1-6),G(2-6:2-4)F(1-4)]A(1-.
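The bracket rules above can be illustrated with a small recursive parser. This is only a sketch under simplifying assumptions, not the parser used by CSDB: residue names must not contain commas, brackets or parentheses, and polymer open linkages, the ^^ prefix, repeats and fuzzy blocks are not handled. The names Residue, parse_chain, parse and edges are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Residue:
    name: str
    linkage: Optional[str] = None          # outgoing linkage, e.g. "1-3"; None for the root
    children: List["Residue"] = field(default_factory=list)  # substituents of this residue


def parse_chain(text: str, i: int = 0) -> Tuple[Optional[Residue], int]:
    """Parse one chain starting at index i; return (rightmost residue, next index)."""
    prev = None
    while i < len(text) and text[i] not in ",]":
        side = []
        if text[i] == "[":                 # side chains precede the residue they substitute
            i += 1
            while True:
                child, i = parse_chain(text, i)
                side.append(child)
                if i >= len(text):
                    raise ValueError("unclosed '['")
                closing = text[i]
                i += 1
                if closing == "]":
                    break
        j = i
        while j < len(text) and text[j] not in "([,]":
            j += 1
        name = text[i:j]
        if not name:
            raise ValueError("residue name expected at index %d" % i)
        linkage = None
        if j < len(text) and text[j] == "(":
            k = text.index(")", j)
            linkage = text[j + 1:k]
            j = k + 1
        node = Residue(name, linkage)
        if prev is not None:
            node.children.append(prev)     # main-chain donor to the left
        node.children.extend(side)
        prev, i = node, j
    return prev, i


def parse(text: str) -> Residue:
    root, end = parse_chain(text)
    if root is None or end != len(text):
        raise ValueError("could not parse the whole string")
    return root


def edges(root: Residue):
    """List (donor, linkage, acceptor) triples of the tree."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            out.append((child.name, child.linkage, node.name))
            stack.append(child)
    return out
```

For example, parse("A(1-3)[B(1-4)]C") returns residue C as the root, with A and B as its substituents.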

Linkages

The linkage between residues is specified in parentheses to the right of the donor residue (<goes by>-<goes to>), where goes by denotes the position (carbon number) by which this residue substitutes another residue (usually 1 or 2), and goes to denotes which position of the linked residue is substituted. For example, 1-O-methyl,2-O-α-D-manno-β-D-glucoside is represented by aDMan(1-2)bDGlcp(1-1)Me.

Outgoing (goes by) and ingoing (goes to) linkage positions can be numbers or question marks (?) if unknown. The ingoing linkage position to Subst(N) aliases can contain position modifiers (a,b,c,',").

It is assumed that any linkage is formed with elimination of water or ammonia, giving an ester (including glycosidic and phosphodiester), ether, amide or amine bond. To specify a carbon-carbon linkage, add the character C after the outgoing position (e.g. bD1dGlcp(1C-4)Subst // Subst = 2-aminopyridine = SMILES N{2}c1c{4}cccn1). Please note that since there is no OH group at C1 of a sugar residue in C-glycosides, it is encoded as a 1-deoxy-monosaccharide (1dGlc). To specify a carbon-nitrogen linkage in N-glycosides and N-linked glycoproteins, please use 1N derivatives of monosaccharides to keep consistency with NMR simulation modules (e.g. bDGlc1N(1-4)xLAsn).

If a residue forms more than one outgoing linkage to its acceptor residue (pyruvates, biphosphates etc.), use a colon to separate the linkages inside the parentheses, e.g. xRPyr(2-6:2-4)aDGal means the 4,6-pyruvated galactose. The higher position in the acceptor always goes first (A(2-6:2-4)B but not A(2-4:2-6)B). Biphosphates and bisulphates have 0 as their <goes by> index, e.g. P(0-6:0-4)bDGalp.

Except for bisubstitution, phosphates and sulphates should be included into the linkage parenthesis like this: aDGlc(1-P-4)bLFuc (in the chain) or P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end). Longer chains (di-, tri- etc.) are allowed: aDGlc(1-P-P-4)bLFuc or S-S-4)bLFuc
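The content of a linkage parenthesis described above (positions, colon-separated multiple bonds, inline P/S bridges and the C marker) can be split into its parts as follows. This is a minimal sketch; the function names and the dictionary keys are hypothetical, and the descending-order check assumes purely numeric acceptor positions.

```python
def parse_linkage(text):
    """Split the text between parentheses, e.g. '2-6:2-4' or '1-P-P-4', into bonds."""
    bonds = []
    for part in text.split(":"):              # a colon separates multiple bonds to one acceptor
        tokens = part.split("-")
        if len(tokens) < 2:
            raise ValueError("expected <goes by>-<goes to>, got %r" % part)
        out_pos, in_pos = tokens[0], tokens[-1]
        c_linked = out_pos.endswith("C")      # '1C-4' marks a carbon-carbon linkage
        bonds.append({
            "out": out_pos[:-1] if c_linked else out_pos,
            "bridge": tokens[1:-1],           # inline phosphates/sulphates, e.g. ['P', 'P']
            "in": in_pos,
            "c_linked": c_linked,
        })
    return bonds


def acceptor_positions_descending(bonds):
    """Check the rule that the higher acceptor position goes first (numeric positions only)."""
    positions = [int(b["in"]) for b in bonds]
    return positions == sorted(positions, reverse=True)
```

For instance, parse_linkage("1-P-4") yields one bond with out "1", in "4" and a single "P" bridge, while parse_linkage("?-?") keeps both unknown positions as question marks.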

Although linkages are undirected, it is strongly recommended to follow left-to-right logic of substitution to comply with a tree representation of carbohydrates. To do this, arrange residues so that anomeric centers (or other default linkage atoms in non-carbohydrate residues) are the outgoing linkage positions, e.g. to record a disaccharide built of two aldoses A and B use A(1-2)B but not B(2-1)A (and C(1-3)[A(1-2)]B but not C(1-3)B(2-1)A). In some special cases (e.g. linkage between two anomeric centers: aDGlcp(1-2)bDFruf) it is impossible to apply this rule. In this case, minimize the number of inverted linkages in a structure.

If a linkage produces another stereo-center, which was achiral in a free residue, and its configuration cannot be encoded as anomericity or absolute configuration (e.g. xSPyr for S-linked pyruvic acid diacetal), explain a newborn stereocenter as S or R after two slashes: xDGlca(1-1)Me // xDGlca = S-D-glucose (open). The same explanation can be used for carbon-carbon linkage producing a newborn stereo-center in Subst alias: bDGlcp(1C-2)Subst // Subst = 2S-1-aminopropane = SMILES C(N){2}CC (please note that 2S is not encoded in SMILES).

Repeating (sub)structures

If a structure is reported as a repeating unit, the leftmost and rightmost residues should have open linkages, e.g. -2)A(1-3)B(1-4)C(1- means ABC fragment is repeated, and repeating units are linked to each other by 1-2 linkage. For polymerization frame positioning, see section Unit types and polymerization degree.

The repeating parts inside bigger structures can be described using the syntax A/B/n=?/C, where B is a repeated part of the structure, n=? is a repeat count (can be a number, range, or question mark if unknown), A is a capping part at non-reducing end, and C is a core part at reducing end. The whole structure can be polymeric as well (i.e. it can have open linkages at left and right), different repeating parts can be sequential or nested up to one level.

The whole structure, any side chain, or any repeating part cannot start with a repeating part at its right end; to encode this, extract one repeat out of a repeating part (use e.g. /AB/n=9/AB instead of /AB/n=10/). Please note that the whole repeat should be extracted, not only its rightmost residue.
The repeating part cannot have another repeating part at its left end (use e.g. /AB/AB/n=9/C/n=20/D instead of //AB/n=10/C/n=20/D).
There can be only one chain attached to a repeating part on the left. If the leftmost residue in the leftmost instance of a repeating part has more than one chain attached, extract it out of a repeating unit, i.e. use A[B]C/C/n=9/D instead of A[B]/C/n=10/D

Examples:

aDGalp(1-2)/aDGlcpA(1-4)/n=20/aDManp - poly-4-glucuronic acid capped with a 2-linked galactose and attached to a monomannose core by a-1-4 bond.
/aDGlcp(1-4)aLRhap(1-3)/n=0-5/aDManp - a disaccharide repeated from 0 to 5 times, attached to mannose, non-capped. The inter-repeat linkage is 1-3.
Me(1-3)/[Ac(1-2)]aDGlcpN(1-4)[Ac(1-3)]aLRhap(1-3)/n=?/aDGlcp(1-4)aLRhap - repeated disaccharide with some monovalent side chains and a methyl cap.
-3)[Ac(1-6),/aDMan(1-2)/n=4/aDManp(1-2)]bDGlcp(1- - mannose repeated 5 times in an uncapped side chain; the branched glucose has another side chain (6-acetate) as well; the whole structure is also repeated.
/bDManp(1-4)/n=5//aDGlcp(1-4)/n=?/aDManp - two sequential repeats.
/aDFucp(1-6)/bDManp(1-4)/n=3/aDGlcp(1-4)/n=10/aDGalp - nested repeats (total composition: 10 Fuc, 30 Man, 10 Glc, 1 Gal).
aDGalp(1-2)/<<aDGlcpA(1-4)|aDGalpA(1-4)>>/n=?/aDManp - combination of a repeating part and a fuzzy residue (see below).
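For repeating parts with a known numeric repeat count, the /B/n=N/ syntax can be unrolled into a plain oligomer by text substitution. The sketch below is a naive illustration, not the CSDB implementation: it works purely on the string, leaves ranges and unknown counts (n=0-5, n=?) untouched, and resolves nested repeats from the inside out. The function name expand_repeats is hypothetical.

```python
import re

# A repeating part: '/', the repeated text (no slashes), '/n=<number>/'
REPEAT = re.compile(r"/([^/]+)/n=(\d+)/")


def expand_repeats(code):
    """Unroll every /B/n=N/ block with a numeric N; inner blocks are expanded first."""
    while True:
        expanded = REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), code)
        if expanded == code:      # nothing left to expand
            return code
        code = expanded
```

Applied to the nested example /aDFucp(1-6)/bDManp(1-4)/n=3/aDGlcp(1-4)/n=10/aDGalp, the result contains 10 Fuc, 30 Man, 10 Glc and 1 Gal residues, in agreement with the composition stated above.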

Names of residues

Each residue name is composed of several fields following each other without separators:

Residue namespace. Obligatory fields are in red.

Anomeric configuration. It is one character: a = alpha, b = beta, l = lipid residue, x = this residue does not have anomeric forms or is a mixture of anomers, ? = unknown. Monovalent residues do not require this field.
Absolute configuration. It is one character, D, L, R, S, X = this residue does not have anomeric forms, ? = unknown. Note that if the absolute configuration is a part of the residue base name (e.g. LDManHep), you should specify X as its absolute configuration: aXLDManHep. Monovalent residues do not require this field.
Residue base name, including deoxygenation information if any. Several characters, first capitalized, others lowercase.
Ring size modifier. It is one character: p = pyranose, f = furanose, a = open-chain, ? = unknown or in any form. In most cases (except for base names ending with the character p, f or a, e.g. Ala or Rha), the question mark can be omitted.
Double bond modifier (Xen, where X is the lower carbon number of those forming a double bond), if present.
Amino group modifiers - zero or more capital N. If the position of amino group differs from 2, it should be specified before the capital N, e.g. aLRha4N. If the amino group is implied by the residue base name, you don't need to modify it, e.g. xLLys but NOT xLLysN6N.
Capital A if a residue is an uronic acid.
All other modifiers (-ulosonic, -ulosaric, -onic, -X-ulo, XF, XCMe, etc., where X is the modifier position) in alphanumeric order, if any. Several modifiers are allowed, e.g. aLDmanHepp4enN3NA6F.
-ol modifier if a residue is an alditol (which is not implied by the residue base name as in e.g. glycerol (Gro)). This modifier is incompatible with ring size and some other modifiers.

Monosaccharide base names are abbreviations of trivial names encoding (together with absolute configuration) stereochemistry of all chiral carbons. Below are some widespread examples:

Zero or one chiral carbon: ethylene glycol (Etg), glycerol (Gro), tetrose with a single degeneration of stereochemistry (?dgroTet, where ? = deoxy position)
Two chiral carbons: tetroses (Ery,Thrs), pentoses with a single degeneration of stereochemistry (Rul,Xul,?dthrPen,?dEryPen, where ? = deoxy position), hexoses with a double degeneration of stereochemistry
Three chiral carbons: aldo-pentoses (Rib,Ara,Xyl,Lyx), keto-hexoses (Fru,Psi,Sor,Tag), hexoses with a single degeneration of stereochemistry (Abe,Col,Tyv,Par,?dxxxHex, where ? = deoxy position, xxx = pentose base name)
Four chiral carbons: aldo-hexoses (Glc,Man,Gal,All,Alt,Tal,Gul,Ido), 6-deoxy aldo-hexoses (Qui,Rha,Fuc,6dxxx where xxx = hexose base name), 6-deoxy aldo-heptoses (6dxxxHep where xxx = hexose base name), higher sugars (Kdo)
Five chiral carbons: aldo-heptoses (??xxxHep, where ?? = absolute configurations of the tail and main parts, respectively, and xxx = hexose base name. E.g. LDmanHep is L-glycero-D-manno-heptose), higher sugars (Ko,Neu,Pse,Leg, other nonulosonic acids)

Non-sugar base names are abbreviations of their trivial names:

Alkyl (Me,Et,Pr etc.) and other monovalent substituents (Fo,Ph,Bn,Allyl,etc.)
Acyl substituents (Ac,Lac,?HOBut, where ? = hydroxy position, etc.), including fatty acids and sphingoids (Sph, Sphn etc.). For lipid base names, see the next section.
Inositol and derivatives (Ino etc.)
Amino acids (named accordingly to the protein three-letter code: Gly,Ala,Ser etc.)
Nucleosides (nucX, nucdX where X = nitrogenous base one-letter code)
Other residues (EtN,Cho,Pyr,Mal,Suc etc.)

Examples: aD6dTalA, aXKdo, xLGro, xDManN-ol, Ac, xRPyr, ?DFucN3N.
You can try to combine these fields and examine more names here.
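The field order described above can be illustrated by splitting a residue name with a regular expression. This is a sketch that assumes a tiny, hypothetical vocabulary of base names (BASES); the real vocabulary is the monomer namespace, and modifiers are returned here as one unparsed string. Matching against a fixed list of base names avoids the ambiguity of names ending with p, f or a, such as Rha.

```python
import re

# Hypothetical, deliberately incomplete list of base names used for illustration only
BASES = ("Glc", "Man", "Gal", "Rha", "Fuc", "Tal", "Kdo", "Gro")

NAME = re.compile(
    r"([abxl?])"                                   # anomeric configuration
    r"([DLRSX?])"                                  # absolute configuration
    r"((?:\d+d)?(?:" + "|".join(BASES) + r"))"     # base name with optional deoxy prefix
    r"([pfa?])?"                                   # ring size modifier (optional)
    r"(.*)"                                        # all remaining modifiers, unparsed
)


def split_residue_name(name):
    """Return the fields of a non-monovalent residue name, or None if it does not fit."""
    m = NAME.fullmatch(name)
    if m is None:
        return None
    anomer, absolute, base, ring, modifiers = m.groups()
    return {"anomer": anomer, "absolute": absolute, "base": base,
            "ring": ring, "modifiers": modifiers}
```

For example, split_residue_name("aLRha4N") gives the base name Rha with no ring size and the modifier string 4N. Monovalent residues such as Ac have no configuration fields and are not matched by this pattern.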

Lipid residue names

The naming of lipid residues matches the general naming described above. l should be used for anomeric configuration. For most lipids there are reserved names like Pam, Ole, Vac etc. (the complete list is available in the aliphatic acid section of the complete residue list). If there is no reserved base name, the following rules can be used to construct the new ones:

the name skeleton is C<number of carbons>, e.g. C17, where <number of carbons> is the total number of carbons in all subchains of the residue.
the skeleton can be appended with prefixes and postfixes in the alphanumeric order, except that the hydroxy- prefix should be the closest to the skeleton.
the postfix for double bond is the following: ={<comma-separated list of double bond positions>}, e.g. C17={9,11} means a C17:2 fatty acyl with double bonds in positions 9 and 11. If a double bond stereochemistry is known, it can be specified as t for E-configuration (trans-) and c for Z-configuration (cis-), e.g. C24={t9,c11,17}.
the postfix for a C-cycle is c{<comma-separated cycle closure positions>}, e.g. C17c{9,11} means a C17:2 fatty acyl with C9 and C11 attached to each other.
Hydroxy functions are encoded with the <n>HO prefix, where n is the attachment position. If there are several hydroxy functions, the attachment positions are separated by commas, e.g. 3,15HOC17 means 3,15-dihydroxy-heptadecanoic acid.
Branching sites are indicated by one or more of the following prefixes: i for iso-branching (at the pre-last carbon), ai for anteiso-branching (at the pre-pre-last carbon), b for branching at an unknown position and <X>b<Y> for other types of branching, where X is the branching position and Y is the length of a side subchain, e.g. 13b1 means methyl group at C13. Please note that the number after capital C still remains the TOTAL number of carbons, including side subchains, e.g. 2b14b1iC13 means 2,4,9-trimethyl-decanoic acid.
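A part of these lipid naming rules can be read back with a regular expression. The sketch below covers only the carbon count, the hydroxy prefix and the double bond postfix; cycles and branching prefixes are deliberately left out, and the function name parse_lipid_base and the returned keys are hypothetical.

```python
import re

LIPID = re.compile(
    r"(?:(\d+(?:,\d+)*)HO)?"     # optional hydroxy prefix, e.g. '3,15HO'
    r"C(\d+)"                    # skeleton: total number of carbons
    r"(?:=\{([^}]*)\})?"         # optional double bond postfix, e.g. '={t9,c11,17}'
)

STEREO = {"t": "E", "c": "Z"}    # t = trans (E), c = cis (Z)


def parse_lipid_base(name):
    """Parse names like 'C17={9,11}' or '3,15HOC17'; returns None if the name does not fit."""
    m = LIPID.fullmatch(name)
    if m is None:
        return None
    hydroxy, carbons, bonds = m.groups()
    double_bonds = []
    for token in (bonds.split(",") if bonds else []):
        if token[0] in STEREO:
            double_bonds.append((STEREO[token[0]], int(token[1:])))
        else:
            double_bonds.append((None, int(token)))   # stereochemistry not specified
    return {
        "carbons": int(carbons),
        "hydroxy": [int(p) for p in hydroxy.split(",")] if hydroxy else [],
        "double_bonds": double_bonds,
    }
```

For example, parse_lipid_base("C24={t9,c11,17}") reports 24 carbons and three double bonds, of which the first is E, the second is Z and the third has unspecified stereochemistry.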

Monovalent and inorganic acid residues

All monovalent substituents (Ac, Me, Et, Fo and other residues that CANNOT be substituted) should be described as separate residues, e.g. aDGal(1-3)bDGlcNAc should be recorded as aDGal(1-3)[Ac(1-2)]bDGlcN. If a monovalent residue is an aglycon at the reducing end, write it as follows: aDGlc(1-1)Me. During the substructure search (but not during the data upload or automated exchange) you can specify monovalent residues in a common way, e.g. bDQuiNAc3NAc4Ac.

Phosphates and sulphates should be included into the linkage parentheses like this: aDGlc(1-P-4)bLFuc (in the chain) or P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end). Longer chains (di-, tri- etc.) are allowed: aDGlc(1-P-P-4)bLFuc or S-S-4)bLFuc. However, in case of biphosphates or multiple substitutions, a regular syntax is allowed with 0 as a substitution position, e.g. NH2(1-0)P(0-2:0-3)bDGlcp

Distinguishing between main and side chains; side chain order

The chain is called normal if it is not secondary. The chain is secondary if it starts with any of the following residues (has it at the reducing end or consists only of it):

monovalent residues (Ac, Me, etc.)
phosphates (P) and sulphates (S)
Subst and SubstN aliases (see below). These aliases are "more secondary" than the others, e.g. if there are both Subst aliases and monovalent residues, one of the latter is considered a main chain.

To keep this encoding unambiguous, you should pay attention to which chains are encoded as main and which are the side ones.

In polymers, a chain that forms the polymer backbone is always main.
In oligomers and fragments of structure:
If one chain is normal and other ones are secondary, the former chain is taken for main, e.g. ...aDGal(1-6)[Ac(1-4)]bDGlc... but NOT Ac(1-4)[...aDGal(1-6)]bDGlc...
If all chains are normal or all are secondary, the chain substituting the position with a smaller number is taken for main, e.g. ...Ac(1-3)[Ac(1-4)]bDGlc... but NOT ...Ac(1-4)[Ac(1-3)]bDGlc...
If there are several side chains attached to one residue, the order in which they follow inside square brackets is substitution position descending, e.g. ...[Ac(1-6),Ac(1-2)]... but NOT ...[Ac(1-2),Ac(1-6)]...
Variants inside the sets of alternative substructures (fuzzy inserts) are sorted the same way as side chains (substitution position descending).

When comparing substitution positions, the following special cases exist:

Question mark (?) is always greater than any numeric position
If a donor is a fuzzy residue, this may result in a fuzzy substitution position in the acceptor. Such position has the same rank as a question mark, e.g. <...-5)|..-7)> is greater than ..-8).
If substitution positions are the same (e.g., both are ? or they result from a fuzzy block like <A(1-5)|B(1-5)>), the alphanumeric comparison of the names of reducing-end residues is applied. If the names are also the same, the sort order is arbitrary.
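The ordering of side chains inside square brackets can be sketched as a sort key. Assumptions in this illustration: each chain ends with a single linkage whose acceptor position is a number or a question mark (fuzzy blocks and positions with quotes or letters are not handled), and ties are broken by ascending alphanumeric order of the reducing-end residue name, a direction the rules above do not state explicitly. The function names are hypothetical.

```python
import re

# The last residue of a chain and its linkage, e.g. 'bDGlc(1-4)' at the end of the string
TAIL = re.compile(r"([^()\[\],]+)\(([^()]*)\)$")


def side_chain_key(chain):
    m = TAIL.search(chain)
    if m is None:
        raise ValueError("chain must end with a linkage: %r" % chain)
    name, linkage = m.group(1), m.group(2)
    position = linkage.split(":")[0].rsplit("-", 1)[-1]   # acceptor ('goes to') position
    rank = float("inf") if position == "?" else int(position)
    return (-rank, name)          # descending position; '?' outranks any number


def order_side_chains(chains):
    return sorted(chains, key=side_chain_key)
```

For example, order_side_chains(["Ac(1-2)", "Ac(1-6)"]) puts Ac(1-6) first, matching the [Ac(1-6),Ac(1-2)] order required above.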

Non-stoichiometrical linkages

A linkage is non-stoichiometric if its donating residue is present in a non-stoichiometric amount in polymer repeating units or the structure represents a mixture of oligomers. In this case, the residue name should be preceded by the stoichiometry degree in percent (e.g. 40%bDGlc). A single percent sign without a number (e.g. %Ac) means that the residue is present in a non-stoichiometric quantity but the exact amount is not known. Inline phosphate and sulphate residues can be preceded by a percentage as well, implying non-stoichiometry of the bond between the inorganic acid residue and its acceptor, e.g. xDRib-ol(1-50%P-4)bDGalp.

The percentage is applied only to the outgoing linkage of the residue, which means that if a residue is substituted, all its children chains are non-stoichiometrical, too. For example:

40%A(n-m)B(x-y)C means that 40% of residues B are substituted with residue A (A moiety in the whole structure is 40%),
A(n-m)40%B(x-y)C means that 40% of residues C are substituted with AB chain (A and B moieties in the whole structure are 40% each),
40%A(n-m)40%B(x-y)C means that 40% of residues C are substituted with residue B, of which 40% is substituted with residue A (B moiety in the whole structure is 40%, A moiety is 40%*40% = 16%)

The residue within a polymer backbone and the root residue in an oligomer cannot be non-stoichiometric. To encode structures like A-B-%C, use A[%C]B or %C[A]B.
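The propagation of percentages along a chain can be made concrete with a short calculation. The sketch below handles only linear chains with numeric percentages; a bare % sign, side chains and inline phosphates are not covered, and residue names are assumed to start with a letter. The function name residue_fractions is hypothetical.

```python
import re

# Optional percentage, residue name, optional linkage in parentheses
TERM = re.compile(r"(?:(\d+(?:\.\d+)?)%)?([A-Za-z][A-Za-z0-9]*)(?:\(([^)]*)\))?")


def residue_fractions(chain):
    """Return [(name, fraction)] left to right; each fraction is relative to the root residue."""
    terms = [(m.group(2), m.group(1)) for m in TERM.finditer(chain)]
    fractions = []
    running = 1.0
    for name, percent in reversed(terms):      # walk from the root (rightmost) to the left
        if percent is not None:
            running *= float(percent) / 100.0  # a percentage also limits everything upstream
        fractions.append((name, running))
    fractions.reverse()
    return fractions
```

For 40%A(1-2)40%B(1-3)C this gives 16% for A, 40% for B and 100% for C, while A(1-2)40%B(1-3)C gives 40% for both A and B.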

Alternative substructures and other fuzziness

Unknown anomeric or absolute configurations and ring sizes can be encoded at the residue name level (see above). For unknown linkage positions, use a question mark, e.g. Subst1(?-?)bLFucp or xXEtN(1-P-?)aXKdop. For residues that are not fully elucidated, see the Superclasses section.

Structural uncertainties and SMILES-encoded alias

Fuzziness at the topological level can be encoded by one of two syntactic constructions: <<A(n-m)|B(p-q)>> for exclusive combination (XOR) and <A(n-m)|B(p-q)> for inclusive combination (OR). This means that e.g. <<A(1-3)|B(1-4)>>C is a disaccharide, in which either C3 of residue C is substituted by residue A or C4 of residue C is substituted by residue B, but not both at once, while <A(1-3)|B(1-4)>C is a disaccharide, in which C3 of residue C is substituted by residue A, or C4 of residue C is substituted by residue B, or both these positions are substituted by A and B, accordingly.
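The difference between the exclusive and inclusive forms can be shown by listing the structures each one stands for. This is a simplified sketch: variants are passed as a list of strings already sorted by ascending acceptor position, so that the first selected variant becomes the main chain and the others become side chains in descending order; percentages and nested blocks are not handled. The function name enumerate_variants is hypothetical.

```python
from itertools import combinations


def enumerate_variants(variants, acceptor, exclusive):
    """List structures encoded by <<v1|v2|...>>acceptor (XOR) or <v1|v2|...>acceptor (OR)."""
    sizes = [1] if exclusive else range(1, len(variants) + 1)
    results = []
    for size in sizes:
        for subset in combinations(variants, size):
            main, side = subset[0], subset[1:]
            # remaining variants become side chains, higher acceptor positions first
            branch = "[" + ",".join(reversed(side)) + "]" if side else ""
            results.append(main + branch + acceptor)
    return results
```

With variants A(1-3) and B(1-4) on acceptor C, the exclusive form yields A(1-3)C and B(1-4)C, and the inclusive form additionally yields A(1-3)[B(1-4)]C.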

A fuzzy residue enclosed in angle brackets can be substituted itself, e.g. D(1-2)<<A(1-3)|B(1-4)>>C means that the structure is either D(1-2)A(1-3)C or D(1-2)B(1-4)C. If a variant inside a substituted fuzzy block is longer than one residue, it is assumed that the residue on its non-reducing end is substituted, i.e. A<BC|D>E is interpreted as ADE or ABCE, but not as ADE or A[B]CE. If a residue at the non-reducing end is monovalent or phosphate or sulfate, it is omitted from consideration and substitution focus is moved to its acceptor. If there is more than one non-monovalent non-reducing end, the substitution focus is selected arbitrarily. To reduce ambiguity and complexity of interpretation, it is recommended to avoid substituted fuzzy residues and to add substitution to each of its variants in a fuzzy set (use E[<DA|DB>]C instead of E[D<A|B>]C). But if it is impossible, always use as short chains inside a fuzzy block as possible, i.e. -ED<A|B>C- but not -E<DA|DB>C-

If stoichiometry of branch reducing ends is specified inside a fuzzy block, e.g. -2)D(1-2)[<<E(1-4)30%A(1-3)|70%B(1-3)>>]C(1-, it is interpreted as ratio of variants. If the total sum of these values is less than 100%, or logic is OR, the whole fuzzy set can be missing. For example, <30%A(1-3)|70%B(1-4)>C means one of four variants (A(1-3)C, B(1-4)C, A(1-3)[B(1-4)]C, and C), while <<30%A(1-3)|70%B(1-4)>>C means one of two variants (A(1-3)C, B(1-4)C) and <<30%A(1-3)|60%B(1-4)>>C means one of three variants (A(1-3)C, B(1-4)C, and C). If a fuzzy set can become empty (variant C in previous examples), it cannot be substituted.

More than two variants inside a fuzzy block are allowed, e.g. <A|B|C>.

Angle brackets can be nested up to one level, e.g. <<D|<<A|B>>|C>> or <<<<A|B>>|C>> or <<<A|B>|C>> or <<<A|B>>|C> or <<C|<A|B>>> etc.

Possible risk: alternative sets are proposed mainly to specify a single variable residue. Although more sophisticated features are supported, it is recommended to avoid them, as their interpretation can be ambiguous or misleading for chemists:

long alternative variants, when shorter variants are possible (use <<A(1-2)|C(1-2)>>B(1-3) but not <<A(1-2)B(1-3)|C(1-2)B(1-3)>>);
alternative sets, where one is a part of another (use %A(1-2)B(1-3)C but not <<A(1-2)B(1-3)|B(1-3)>>C as it is the same as non-stoichiometrical presence of A);
side chains terminating alternative variants (<<[A]B|C>>);
proximal combinations of alternative sets (<<..>>) and subrepeating units (/../n=../).

Unsupported:

Fuzzy residues at the reducing end or at the rightmost position in a polymer repeating unit (use ...[<<lX3HODco(3-1)|lX3HOMyr(3-1)>>]aDGalp, but NOT ...aDGalp(1-3)<<lX3HODco|lX3HOMyr>>).
Alternative sites of attachment of a structural terminus to carrier residues of the core structure. You can use <<AC[B]D|AB[C]D>>, where A is attached either to B or to C residue in a branched trimer C[B]D (please note, that in case of OR-logic, the record <AC[B]D|AB[C]D> is equivalent to %AC[%AB]D). However, bigger constructs should be described as separate structures, especially if multiple termini have unknown mapping to multiple carriers, which leads to combinatorial growth of the number of variants.

Residue superclasses and aliases

If an exact residue at a certain position in the structure is unknown, a superclass name can be used instead of a residue name. Superclasses do not require anomeric and absolute configurations and ring size. The following superclasses are supported:

TET - any tetrose
PEN - any pentose
HEX - any hexose
HEP - any heptose
OCT - any octose
NON - any nonose

SUG - any sugar
ALK - any alkyl chain
LIP - any fatty acyl
CER - any N-acylated sphingoid
SPH - any sphingoid

For rarely occurring sugars or non-carbohydrate residues or other residues that are not stored in the monomer database, use aliases. There are several allowed alias types:

Sug - some new sugar
Hex - some new hexose
Subst - other substituent
SubstN where N is a number - other substituent

All aliases other than Subst or SubstN should have anomeric and absolute configurations and ring size (which can be ?=unknown). All aliases should be explained in the comment section of the encoding after two slashes (//), e.g. aDGlc(1-4)Subst // Subst = Kdo-rich core, ID 12345. Several alias explanations are separated by semicolon, e.g. Subst1(?-6)aDGlc(1-1)Subst2 // Subst1 = acyl or polyglycerol; Subst2 = 3-hydroxy-2,4-diamino-toluene = SMILES ....

For fully defined aliases, add the SMILES code after the second equals sign if possible. Although the trivial / IUPAC name can be omitted (e.g., Subst(1-6)aDGlcp // Subst = SMILES C=C{1}C(O)CO ), it is recommended to keep trivial or short systematic names for clarity.

Substitution positions should be indicated with a curly-bracketed atom number before the corresponding carbon: aDGlcp(1-3)Subst // Subst = laricytrin = SMILES O{3}C1=C(C2=CC(OC)={54}C(O){55}C(O)=C2)OC3=C({5}C(O)=C{7}C(O)=C3)C1=O. Details:

If a carbon is surrounded by square brackets, place {...} before the opening bracket (use ...{1}[C@H]... but not ...[{1}C@H]...)
If substitution position is unknown, indicate all substitutable positions in SMILES, e.g. aDGlcp(1-?)Subst // Subst = 3-hydroxy-5-hydroxymethylpyridine = SMILES N1C{3}C(O)CC({7}CO)C1
If an alias forms an outgoing bond only, the linked position should be specified nevertheless: Subst(1-6)aDGlcp // Subst = 1,2-dihydroxy-but-3-ene = SMILES C=C{1}C(O)CO
For atom numbers with non-numeric suffixes, add 50 for an apostrophe, 100 for double quotes, 150 for 'a', 200 for 'b', and 250 for 'c': aDGlcp(1-4')Subst // Subst = 2-(4'-hydroxyphenyl)-benzoic acid = SMILES O{54}C(C=C1)=CC=C1C2=CC=CC=C2{7}C(O)=O
If a hetero atom is substituted, you should still indicate the carbon to which this hetero atom is attached. However, when multiple heteroatoms are attached to a single carbon, curly brackets before this carbon can lead to unexpected results, e.g. both C2 and C5 carbons in 2,5-dihydroxypyrrole will produce an N-linked residue. If correct bond position cannot be achieved via carbons, it is allowed to place curly brackets before heteroatoms. The same approach can be used if incorrect structures are generated for a bond to a heteroatom inside a cycle. In all suspicious cases, please check the generation of a structural formula explicitly with the Check structures tool, and vary the curly bracket position to achieve a correct result.
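The offset rule for suffixed atom numbers and the curly-bracket markers in SMILES can be sketched in a few lines. The function names smiles_atom_number and marked_positions are hypothetical; the offsets are the ones listed above.

```python
import re

# Offsets added to the numeric part of a position, per suffix
OFFSETS = {"": 0, "'": 50, '"': 100, "a": 150, "b": 200, "c": 250}


def smiles_atom_number(position):
    """Convert a linkage position such as 4', 4" or 15a to the number used in {...}."""
    m = re.fullmatch(r"(\d+)(['\"abc]?)", position)
    if m is None:
        raise ValueError("unsupported position: %r" % position)
    return int(m.group(1)) + OFFSETS[m.group(2)]


def marked_positions(smiles):
    """Return the atom numbers given in curly brackets, in order of appearance."""
    return [int(n) for n in re.findall(r"\{(\d+)\}", smiles)]
```

For the 2-(4'-hydroxyphenyl)-benzoic acid example, the position 4' maps to 54, which is one of the two numbers marked in its SMILES string.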

You can use the online JSME structure editor to draw a structure and generate its SMILES code. To identify a substituted atom easily, mark the carbon of interest with the editor's atom-marking button, and it will appear in square brackets with a number in the resulting SMILES. The editor also provides buttons for generating SMILES and for importing an existing SMILES code to check it (the button icons are not reproduced here). Please keep in mind that you should draw complete residues including all -OH or -NH₂ groups as they appear before water is eliminated upon linkage (e.g. draw methanol if you need to SMILESify an O-linked methyl group).

Greek letters, single and double quotes are allowed in Subst(N) explanations (ST1) and aglycons (AG). Substitution positions with single or double quotes or letters (3', 4", 15a) are allowed for Subst(N) residues only.

Aglyca

There is a special field AG to store aglycon information. Please use it only if the aglycon cannot be encoded in the structure itself, i.e. the aglycon is a residue not supported by the database, or a non-encodable group of residues, or a superclass (see above). The aglycon attachment is assumed to be C1 for aldoses and C2 for ketoses. If possible, always try to encode a residue in the structure rather than in the aglycon:

ST1: aDGalp(1-P-P-5)xXnucA AG: but not ST1: aDGalp AG: ADP

However, if the Subst or SubstN alias stands for aglycon (a moiety at the reducing end that cannot be encoded using the rules above) and it is attached to an unknown or default (C1 of aldoses, C2 of ketoses) position, it should be encoded in the aglycon (AG) field rather than in the structure (ST1), i.e.:

ST1: aDGalp AG: lipid A but not ST1: aDGalp(1-?)Subst // Subst =Lipid A AG:

If known, the linking atom in the aglycon is specified the following way:
AG: (->3) 2,3-dihydroxy fatty acid (a fatty acid with unknown length/saturation degree is attached via hydroxyl at its C3).

In O- and N-glycans, encode the amino acid in ST1 if it is known, or write a general rule in AG otherwise. Please note the modifier 1N in N-glycans, reflecting that there is no -OH group at the attachment site:

ST1: aDGlcp1N(1-4)xLAsn AG: protein CC: N-glycan
or ST1: aDGlcp(1-3)xLThr AG: protein CC: O-glycan
or ST1: aDGlcp AG: (->3) L-Thr/L-Ser (protein) CC: O-glycan

In all glycosides other than O-linked ones, use prefixes/suffixes at the anomeric position to indicate elimination of the -OH group: 1N suffix for N-glycosides; 1S suffix for S-glycosides; 1d prefix (or 1, in case there are more deoxygenated carbons) for C-glycosides, e.g. bDGlcp1N(1-4)xLAsn; bDGlcp1S(1-3)xLCys; bD1dGlcp(1C-3)Ph; bD1,2,6daraHexp(1C-1)Subst // Subst = ethane = SMILES {1}CC

Atom numbering

Atom numbering is used to derive substitution positions. For residues in the CSDB vocabulary it should follow the atomic pattern specified in Monomeric namespace. In simple cases like common monosaccharides or amino acids it corresponds to IUPAC recommendations.

For atypical aglycons and Subst aliases, you can use any numbering scheme; however, substitution positions in CSDB Linear (ST1 field), in SMILES (in curly brackets), and in NMRH/NMRC data prefixes should match each other. Usage of atom numbering common for compounds/residues of the same class is recommended. If it is missing or not obvious, use the same numbering as provided by the authors of the annotated paper. It is also advisable to use the same numbering scheme for similar aglycons in different records.

Unit types and polymerization degree

The unit type is stored separately in the ST2 field and can be the following:

OLIGO - oligomeric structure
MONO - oligomeric structure containing a single carbohydrate residue (and possibly, non-carbohydrate substituents)
CHEM - chemical repeating unit of a polymer (polymerization frame is unknown)
BIOL - proven biological repeating unit of a polymer (polymerization frame is proven)
SBIOL - suggested biological repeating unit of a polymer (polymerization frame is suggested using analogs)
CYCLO - repeating unit of a cyclic polymer (polymerization frame does not matter)
HOMO - repeating unit of a homopolymer (polymerization frame is obvious)
FRAGMENT - poly- or oligomeric fragment of a bigger carbohydrate structure, which has not been extracted/analysed as a compound sample. An O- or N-glycan from a glycoprotein is not a fragment if it is discussed separately.
MOTIF - poly- or oligomeric structural motif: supposed, idealized or arbitrary structure; exemplary structure with certain pattern of side chains, where multiple interpretations are possible; exemplary structure with explicit values of n/m/k/etc indicating the polymericity of subfragments

Chemical vs. biological repeating units. A chemical repeating unit does not imply knowledge of the repeating frame positioning, e.g. -ABC-, -BCA-, and -CAB- are the same polymers. In most papers where a primary structure of glycopolymers is elucidated, authors do not track the end groups, and thus provide a structure of a chemical repeating unit (CHEM) only. In contrast, if authors determine what exact residues occupy the reducing and non-reducing ends of a polymer (e.g. if they know from the biochemical context by which residue the polymer is attached to the core or other carrier), the three above variants are NOT the same structure anymore, and in this case, the declared repeating unit is called a biological one (BIOL). Sometimes a biological repeating unit is not proven experimentally but can be suggested (SBIOL) based on the same structure from the same organism in some other record (BIOL) or on general knowledge, e.g. if a repeating unit contains a single aminosugar, its biological repeat is likely to start with this aminosugar.
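The statement that -ABC-, -BCA- and -CAB- are the same chemical polymer amounts to comparing repeating units up to a cyclic shift of the polymerization frame. The sketch below does this for unbranched repeating units only; side chains, repeats and fuzzy blocks are not handled, and the function names are hypothetical.

```python
import re


def backbone_units(unit):
    """Turn '-2)A(1-3)B(1-4)C(1-' into [('A', '1-3'), ('B', '1-4'), ('C', '1-2')]."""
    m = re.fullmatch(r"-([^)]+)\)(.*)\(([^(-]+)-", unit)
    if m is None:
        raise ValueError("not a repeating unit with open linkages: %r" % unit)
    closing_in, body, closing_out = m.groups()
    # close the frame: the rightmost residue substitutes the leftmost one of the next unit
    closed = body + "(" + closing_out + "-" + closing_in + ")"
    return re.findall(r"([^()]+)\(([^()]+)\)", closed)


def same_chemical_polymer(unit1, unit2):
    """True if the two repeating units differ only by the position of the frame."""
    a, b = backbone_units(unit1), backbone_units(unit2)
    return len(a) == len(b) and any(a[i:] + a[:i] == b for i in range(len(a)))
```

For example, -2)A(1-3)B(1-4)C(1- and -3)B(1-4)C(1-2)A(1- compare as equal, which is the CHEM view; for BIOL units the frame matters and such a comparison should not be applied.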

Polymerization degree should be specified in the ST3 field if known. Ranges, lists and relation symbols are supported, e.g. ST3: n=5-7, 12 or ST3: n>100. The number indicates how many times the whole structure is repeated; for subrepeating units inside the structure, see section Repeating (sub)structures.

The n character tells that the value is a polymerization degree. If there is no n character, the value is interpreted as molecular mass in Daltons, e.g. ST3: 1686.3 [M+H]+ (hydrogenated molecular ion) or ST3: 70000-100000 (70-100 kDa).
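The two readings of the ST3 value can be told apart by the leading n character. The following sketch illustrates this for the examples above; it assumes a single-character relation symbol and integer values, does not handle an unknown degree, and the function name parse_st3 and the returned keys are hypothetical.

```python
def parse_st3(value):
    """Interpret an ST3 value as a polymerization degree (leading 'n') or a molecular mass."""
    value = value.strip()
    if not value.startswith("n"):
        # no 'n' character: the text is a molecular mass in Daltons, kept as written
        return {"kind": "mass", "text": value}
    relation, rest = value[1], value[2:]       # e.g. '=' and '5-7, 12'
    values = []
    for part in rest.split(","):               # lists are comma-separated
        low, _, high = part.strip().partition("-")
        values.append((int(low), int(high) if high else int(low)))
    return {"kind": "degree", "relation": relation, "values": values}
```

For example, parse_st3("n=5-7, 12") returns the ranges (5, 7) and (12, 12) with the relation "=", while parse_st3("70000-100000") is reported as a mass.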

Notation redundancy

Some structures can be recorded in multiple ways, especially when a full record includes other fields than ST1 (CSDB Linear code). If selection rules are not defined, please use the following criteria:

If a whole structure is repeated, use as few residues in the repeating unit as possible, e.g. α-cyclodextrin is ST1: -4)aDGlcp(1- ST2: CYCLO ST3: n=6 but not ST1: -4)aDGlcp(1-4)aDGlcp(1- ST2: CYCLO ST3: n=3.
Do not extract units from a repeating sub-sequence if it can be avoided: /A/n=?/A is correct while /A/n=?/AA and A/A/n=?/A are redundant.
For known n a sub-repeat syntax is optional as it is translated to oligomers automatically. Please apply the same criterion (residue count): /A/n=3/B is better than AAAB. In case of two residues, a sub-repeat syntax has no benefit in terms of readability.
The choice between a small structure repeated a few times and an oligomer is arbitrary and usually follows the authors of the original publication. For more than four residues, a repeating sequence is preferable (ST1: -3)bDGlcp(1- ST2: HOMO but not bDGlcp(1-3)bDGlcp(1-3)bDGlcp(1-3)bDGlcp(1-3)bDGlcp ST2: OLIGO); however, the database is not fully normalized against this criterion yet.
Use left-to-right order of donor-to-acceptor linkage: aDGlcp(1-2)bDGlcp but not bDGlcp(2-1)aDGlcp.
If donor and acceptor cannot be distinguished, place a residue linked by its default outgoing position at left: P-3)xDRib-ol but not xDRib-ol(3-P.
If two default outgoing positions are linked, use common sense (e.g. aDGlcp(1-2)bDFruf for sucrose) or alphanumeric sorting.
If you find other examples of CSDB Linear encoding redundancy, please report it to netboxtoukach.ru, and we will update this section.
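The first criterion above (as few residues in the repeating unit as possible) can be sketched as a search for the shortest period of the backbone. In this illustration the repeating unit is given as a list of (residue, linkage) pairs, with the last pair carrying the inter-unit linkage; this input format and the function name minimize_repeat are assumptions made for the example.

```python
def minimize_repeat(units, n):
    """Reduce a repeating unit to its shortest period and scale the repeat count n."""
    size = len(units)
    for period in range(1, size + 1):
        # the unit is redundant if it is a whole number of copies of a shorter one
        if size % period == 0 and units == units[:period] * (size // period):
            return units[:period], n * (size // period)
    return units, n
```

For the cyclodextrin example, the unit -4)aDGlcp(1-4)aDGlcp(1- with n=3 reduces to -4)aDGlcp(1- with n=6.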