About the MAP Format
1. What is the MAP Format?
The Modification and Annotation in Proteins (MAP) format is a proposed extension of the widely-used FASTA format for representing protein sequences. The FASTA format represents protein sequences using single-letter amino acid codes and has remained popular due to its simplicity and broad support. However, FASTA is limited to the 20 standard L-amino acids and cannot explicitly represent complex modifications or non-standard residues within sequences. To address these limitations, the MAP format was developed as an enhanced, backward-compatible format that can capture a richer set of information about protein sequences.
In MAP format, users can represent modified, non-natural, and annotated residues directly within a protein sequence string by using special curly brace tags. This allows encoding of post-translational modifications, chemical modifications, non-standard amino acids, and sites of molecular interaction or mutation, all within the sequence. The MAP format remains compatible with existing computational tools for sequence analysis, storage, and retrieval by preserving the base amino acid sequence text, but augments it with embedded annotations. This manual provides a comprehensive guide to using the MAP format, including detailed descriptions of each tag type, examples of usage, and supported features.
In summary, the MAP format has two primary objectives:
- Residue-Level Annotation: This provides detailed annotation of individual residues in a protein, such as a residue that interacts with DNA or one that carries a post-translational modification. It covers modified amino acids, special residues, interaction sites, insertions, deletions, and point mutations, all marked by inserting descriptive curly-brace tags directly in the sequence.
- Protein-Level Annotation: This provides annotation of the protein as a whole, such as whether it is a DNA-binding protein or a membrane protein. This information is stored as metadata in the header line. The header (definition line) can include well-defined tags in curly braces (e.g., {org:Homo sapiens}, {func:Electron transport}) that capture high-level attributes of the protein such as its organism of origin, functional role, or source database, without altering the protein’s primary identifier or sequence.
Each of these aspects is supported by specific tagging conventions described in the sections below.
2. Purpose of MAP Format
The MAP format is designed to incorporate additional structural and functional information into protein sequence entries. In particular, MAP allows inclusion of the following types of information within a protein sequence entry:
- Chemical Modifications: Representation of chemical alterations such as cyclic linkages (e.g., disulfide bonds or head-to-tail cyclization), N- or C-terminal modifications, inclusion of D-amino acids, common post-translational modifications, and other non-natural modifications of amino acid side chains.
- Annotated Residues: Marking of specific residues that are involved in interactions with other molecules (for example, residues that bind ATP, DNA, metal ions, lipids, or other proteins). This helps identify functional sites like binding pockets or interaction interfaces.
- Mutation Information: Indication of sequence variants, including point mutations (amino acid substitutions), insertions, and deletions at specific positions. This is useful for documenting isoforms, engineered mutants, or polymorphisms within a protein sequence.
- Protein-Level Metadata: Integration of structured metadata about the protein as a whole (e.g., source organism, function, subcellular location, database ID) in the header line of the entry.
By embedding this information in the sequence, MAP format aims to create a more informative representation that can be easily parsed by machines while remaining human-readable. Researchers and bioinformaticians can use MAP to capture essential protein metadata and modifications without needing separate records or databases for annotations.
3. Overview of the Format
A MAP format entry is structured like a FASTA entry and consists of two parts: a header line followed by one or more sequence lines. The key difference is the use of curly braces { } in the sequence to denote annotations or modifications of the standard residues.
3.1 Header Line
- It always starts with a > symbol.
- Right after the >, you'll find the protein's identifier (like a name or code).
- Then, you can optionally include metadata tags enclosed in curly braces {}. Each tag gives you a specific piece of information about the entire protein.
- Multiple tags are separated by spaces.
| Header Line | Explanation |
|---|---|
| >Protein123 {org:Homo sapiens} {func:Electron transport} | Protein123 is the ID. {org:Homo sapiens} means the protein is from humans. {func:Electron transport} describes its function. |
3.2 Sequence Lines
- These lines contain the one-letter codes for the amino acids in the protein.
- The magic of MAP happens here! If an amino acid has a modification or annotation, it's immediately followed by a tag in curly braces {...} that describes it.
- These tags provide details about a specific amino acid without breaking the main sequence.
| Sequence with Tags | Explanation |
|---|---|
| ACS{ptm:Phos} | The Serine (S) is phosphorylated, indicated by the {ptm:Phos} tag. |
| A{mut:R} | The Alanine (A) has been mutated to Arginine (R), shown by {mut:R}. |
This two-tiered structure embeds fine-grained, residue-level details in the sequence itself, while broader descriptors of the protein are recorded in a structured way in the header. The core sequence is retained (so tools expecting FASTA can still read the letters), and the additional information is encapsulated in an easy-to-identify format. In the following sections, we describe the format in more detail, including how it builds on FASTA and the specific tags used for various annotations.
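The header-plus-sequence layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name `read_map_entries` is hypothetical, blank lines are skipped, and sequence lines are simply concatenated, so a tag is assumed not to contain whitespace that matters across line breaks. The sample lines are illustrative inputs.

```python
def read_map_entries(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    The header is returned without the leading '>'; all sequence lines
    that follow it are joined into one string, tags included.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


sample = [
    ">Protein123 {org:Homo sapiens} {func:Electron transport}",
    "ACS{ptm:Phos}",
    "DEF",
]
entries = list(read_map_entries(sample))
print(entries)
```

Because the reader only looks at the leading `>` character, it also accepts plain FASTA input, which matches the structure described in this section.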
4. How MAP Extends the FASTA Format
MAP builds upon the simple structure of FASTA to add more layers of information.
4.1 Traditional FASTA Format
- It's a basic text format for representing sequences.
- Each entry has a header line (starting with >) with an identifier and maybe a description.
- The header is followed by sequence lines containing only the standard 20 amino acid letters (or nucleotide letters). No modifications are directly indicated in the sequence.
| FASTA Entry Line | Explanation |
|---|---|
| >sp\|P12345\|FOSB_HUMAN FosB (Human) protein | Header with protein ID and description. |
| MSTFVNQSLARQRATKMRNRQKNNARERKAEGKSRRK | The amino acid sequence (composed solely of the 20 standard amino acid one-letter codes, with no modifications indicated). |
4.2 Extension of FASTA for MAP
- MAP keeps the basic FASTA structure, so it's still compatible with tools that read FASTA.
- Header Enhancement: MAP headers can include the same information as FASTA, but they can also contain structured metadata within curly braces {} to give more context about the whole protein.
- Sequence Enhancement: This is the key difference. MAP uses curly braces {} within the sequence line to directly annotate modifications or special features of individual amino acids.
| MAP Entry Line | Explanation |
|---|---|
| >Peptide001 {org:Homo sapiens} {func:Signal peptide} | Header with ID, organism (Homo sapiens), and function (Signal peptide). |
| ARGCS{ptm:Phos}DEFGHIK{cyc:N-C} | Sequence with annotations: the Serine (S) carries a post-translational modification, phosphorylation ({ptm:Phos}), and the {cyc:N-C} tag after the last amino acid (K) marks a head-to-tail cyclization between the N-terminus and C-terminus, forming a cyclic peptide. |
The MAP format example above illustrates how additional information is compactly represented: the fifth residue is annotated as phosphorylated, and the entire peptide is noted to be cyclic. A standard FASTA parser would ignore the {...} tags (or treat them as unknown characters), but the core sequence "ARGCSDEFGHIK" is still present and in order. By extending FASTA in this way, MAP format allows richer detail while preserving backward compatibility. In the following sections, we describe the various types of residue-level modifications and annotations supported in the MAP format, as well as the syntax for each tag type. We also cover how to annotate protein-level information in the header line.
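Recovering the core sequence from a MAP string amounts to deleting every curly-brace tag. The sketch below assumes that tags never nest and never contain a closing brace; the function name `strip_map_tags` is hypothetical, and the input string is written to match the description above (phosphorylated fifth residue, head-to-tail cyclization tag after the final K).

```python
import re

# Assumption: tags are flat, i.e. no '{' or '}' appears inside a tag.
TAG_PATTERN = re.compile(r"\{[^{}]*\}")


def strip_map_tags(sequence: str) -> str:
    """Remove every {...} tag, leaving only the residue letters."""
    return TAG_PATTERN.sub("", sequence)


core = strip_map_tags("ARGCS{ptm:Phos}DEFGHIK{cyc:N-C}")
print(core)  # ARGCSDEFGHIK
```

A tool that expects plain FASTA could be fed the stripped string, while the tagged string is kept alongside it for annotation-aware processing.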
5. Residue-Level Modifications & Annotations
Residue-level modifications are chemical changes or non-standard features of specific amino acids in the sequence. The corresponding tags are placed right after an amino acid in the sequence and describe specific changes or features of that amino acid. The general format for modification tags is {prefix:Code}. The following categories are covered:
- Cyclization (e.g., disulfide bonds or head-to-tail cyclization)
- Terminal modifications (modifications at the N-terminus or C-terminus of the protein)
- D-amino acids (non-standard chirality of amino acids)
- Common post-translational modifications (PTMs)
- Other non-natural chemical modifications (synthetic modifications, e.g., PEGylation)
- Non-natural amino acid residues (unnatural amino acids incorporated into the sequence)
- Conjugation of macromolecules (attachment of larger moieties like lipids or other biomolecules)
- Isotopic and fluorescent labels
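Since every residue-level tag follows the {prefix:Code} pattern and sits right after the residue it describes, a sequence can be split into residues with their attached tags. The following is a minimal sketch, not a reference parser: `tokenize` and the dictionary keys are hypothetical names, tags are assumed to be flat (no nested braces), and a tag that appears before any residue is ignored.

```python
import re

# Either a whole {...} tag or a single residue letter.
PIECE = re.compile(r"\{([^{}]*)\}|([A-Za-z])")


def tokenize(sequence):
    """Return one dict per residue: 1-based position, letter, attached tags."""
    tokens = []
    for match in PIECE.finditer(sequence):
        tag_body, residue = match.groups()
        if residue is not None:
            tokens.append({"pos": len(tokens) + 1, "res": residue, "tags": []})
        elif tokens:
            # Split "prefix:Code"; a bare tag such as {d} has no code.
            prefix, _, code = tag_body.partition(":")
            tokens[-1]["tags"].append((prefix, code or None))
    return tokens


tokens = tokenize("ACS{ptm:Phos}DER{ptm:Me}FGHIK")
tagged = [t for t in tokens if t["tags"]]
print(tagged)
```

Keeping the tags as (prefix, code) pairs leaves interpretation of each prefix to later, tag-specific steps.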
5.1 Cyclization in Proteins/Peptides
Cyclization means forming a loop or ring structure within a peptide. The tag is {cyc:}.
| Cyclization Type | Tag | Description | Example Sequence |
|---|---|---|---|
| Head-to-Tail | {cyc:N-C} | Link between the N-terminus (start) and C-terminus (end). | ACDAFIHIK |
| Disulfide Bonding | {cyc:X-Y} | Link (usually a disulfide bond) between amino acids at positions X and Y. | ACDCFGHIK |
| Multiple Cycles | {cyc:X-Y,Z-W} | Multiple cyclizations. | ACDECCHIKC |
| Undefined Cyclization | {cyc} | A cyclic structure where the specific type isn't detailed. | ACDECCHIKC |
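The tag forms in the table can be decoded into a list of links. This sketch assumes that X, Y, Z, and W are written as 1-based integer positions and that N and C denote the termini; the manual does not state the numbering base, and the name `parse_cyc` and the numeric example are hypothetical.

```python
def parse_cyc(tag):
    """Parse a cyclization tag into a list of (start, end) links.

    'N' and 'C' are kept as strings; numeric endpoints become ints.
    A bare {cyc} (undefined cyclization) yields an empty list.
    """
    prefix, _, code = tag.strip("{}").partition(":")
    if prefix != "cyc":
        raise ValueError(f"not a cyclization tag: {tag}")
    if not code:
        return []
    links = []
    for part in code.split(","):
        start, _, end = part.strip().partition("-")
        links.append(tuple(int(x) if x.isdigit() else x for x in (start, end)))
    return links


print(parse_cyc("{cyc:N-C}"))       # [('N', 'C')]
print(parse_cyc("{cyc:2-4,5-10}"))  # [(2, 4), (5, 10)]
print(parse_cyc("{cyc}"))           # []
```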
5.2 Terminal Modifications
Proteins often undergo modifications at their N-terminus or C-terminus, which can alter stability, localization, or activity. For example, N-terminal acetylation and C-terminal amidation are common modifications in peptides and proteins that can protect against degradation or alter function. In MAP format, terminal modifications are indicated by {nt:Type} for N-terminal modifications and {ct:Type} for C-terminal modifications.
- The {nt:...} tag denotes a modification of the N-terminus. Conceptually it applies to the start of the chain, before the first residue, but in practice the tag is appended after the sequence string for formatting reasons, to avoid confusion at the very beginning of the line.
- Similarly, a {ct:...} tag is placed at the end of the sequence to denote a C-terminal modification.
Common N-terminal modifications include amidation, formylation, acetylation, glycosylation, and methylation of the N-terminus. Common C-terminal modifications include amidation, glycosylation, and addition of other groups such as carboxylate modifications or peptide tags.
Examples of N and C-terminal modifications:
| Modification Type | Tag Prefix | Example Tag | Description | Example Sequence |
|---|---|---|---|---|
| N-terminal | {nt:} | {nt:Amid} | N-terminal amidation. | ACDIFA |
| N-terminal | {nt:} | {nt:Glyco} | N-terminal glycosylation (sugar attached). | PQRAD |
| N-terminal | {nt:} | {nt:Acet} | N-terminal acetylation (adding an acetyl group CH3CO-). | LMNIF |
| N-terminal | {nt:} | {nt:Formyl} | N-terminal formylation (adding a formyl group to the N-terminal amine). | LMNIAC |
| N-terminal | {nt:} | {nt:Me} | N-terminal methylation (adding a methyl group to the N-terminal amine). | QWERT |
| Undefined N-term | {nt} | {nt} | Some unspecified modification at the N-terminus. | ACDEFG |
| C-terminal | {ct:} | {ct:Amid} | C-terminal amidation (converting the -COOH to -CONH2). | ACDIF |
| C-terminal | {ct:} | {ct:Glyco} | C-terminal glycosylation (sugar attached). | PQRACD |
| C-terminal | {ct:} | {ct:Acet} | C-terminal acetylation. | LMNIAC |
| Undefined C-term | {ct} | {ct} | An unspecified C-terminal modification. | ACDEFG |
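Terminal tags can be pulled out of a sequence string with a single regular expression. This is a sketch only: `terminal_modifications` is a hypothetical name, the string "undefined" is an arbitrary marker chosen here for bare {nt} or {ct} tags, and the input is an illustrative combination of a sequence and tags from the table, not an example taken from the manual.

```python
import re

# Matches {nt}, {ct}, {nt:Code} and {ct:Code}.
TERMINAL = re.compile(r"\{(nt|ct)(?::([^{}]*))?\}")


def terminal_modifications(sequence):
    """Return the N- and C-terminal modification codes found in a MAP string."""
    mods = {"nt": None, "ct": None}
    for end, code in TERMINAL.findall(sequence):
        mods[end] = code or "undefined"
    return mods


print(terminal_modifications("LMNIF{nt:Acet}{ct:Amid}"))
# {'nt': 'Acet', 'ct': 'Amid'}
```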
5.3 D-Amino Acids
Most amino acids in natural proteins are in the L configuration (chirality). The {d} tag indicates that an amino acid has the "mirror image" D-form structure instead of the usual L-form. D-amino acids are not commonly found in proteins produced by ribosomes, though they do occur in some peptides (like certain bacterial cell wall peptides and some neuropeptides). Incorporating D-amino acids into a peptide can dramatically increase resistance to proteases and change the peptide’s structure and activity.
- In the MAP format, a D-amino acid is indicated by the tag {d} placed immediately after the amino acid letter. This tag signifies that the residue is the D-enantiomer instead of the normal L-form.
| Tag | Description | Example Sequence |
|---|---|---|
| {d} | Indicates the preceding amino acid is a D-amino acid. | GA |
Complete example:
>Peptide2
A{d}CDEF{d}GHIK
In this example, the sequence has a D-alanine at position 1 and a D-phenylalanine at position 5 (counting A as position 1). All other residues are L-form by default.
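Positions of D-residues can be found by counting residue letters while skipping over tags. This is a minimal sketch assuming 1-based positions and flat tags; the function name `d_residue_positions` is hypothetical.

```python
import re

# Either a whole {...} tag or a single residue letter.
PIECE = re.compile(r"\{([^{}]*)\}|([A-Za-z])")


def d_residue_positions(sequence):
    """Return (position, residue) for every residue followed by a {d} tag."""
    found = []
    position, residue = 0, None
    for match in PIECE.finditer(sequence):
        tag_body, letter = match.groups()
        if letter is not None:
            position, residue = position + 1, letter
        elif tag_body == "d" and residue is not None:
            found.append((position, residue))
    return found


print(d_residue_positions("A{d}CDEF{d}GHIK"))  # [(1, 'A'), (5, 'F')]
```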
5.4 Post-Translational Modifications (PTMs)
- Post-translational modifications (PTMs) are covalent modifications to amino acids that occur after a protein has been synthesized (translated).
- PTMs are crucial for regulating protein function, stability, localization, and interactions. Common PTMs include phosphorylation, glycosylation, acetylation, methylation, ubiquitination, sumoylation, hydroxylation, and lipidation (among others).
- In MAP format, a post-translational modification on a specific residue is denoted with a {ptm:Code} tag immediately after that residue.
- The ptm prefix indicates a PTM, and the code specifies the type of modification. Here are some common PTM tags and their meanings:
| PTM Code | Modification Type | Affected Residues (Common) | Example Sequence |
|---|---|---|---|
| Phos | Phosphorylation | Serine (S), Threonine (T), Tyrosine (Y) | GRAS |
| Glyc | Glycosylation | Asparagine (N) | GRPN |
| Ac | Acetylation | Lysine (K), N-terminus | GGLK |
| Me | Methylation | Lysine (K), Arginine (R) | MKGR |
| Ub | Ubiquitination | Lysine (K) | SK |
| Sumo | Sumoylation | Lysine (K) | FKPV |
| OH | Hydroxylation | Proline (P), Lysine (K) | FKPV |
| palm | Palmitoylation | Cysteine (C) | GGLKC |
| (None) | Undefined PTM | Any | GRPN |
Note: The choice of codes (Phos, Glyc, Ac, Me, Ub, Sumo, OH, palm, etc.) is based on abbreviations of the modification names. These are suggestions; if a community or database standard exists, those should be followed for consistency. For example, one might use {ptm:Phos} as shown, or a different code like {ptm:p} if a standard arises, but in this manual we use the longer abbreviations for clarity.
Complete example:
>Peptide002
ACS{ptm:Phos}DER{ptm:Me}FGHIK
In this example, the sequence contains two modifications: the serine (S) is phosphorylated, and the arginine (R) is methylated. Each modification is indicated right after the modified residue (S{ptm:Phos} and R{ptm:Me} respectively).
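PTM tags can be listed and compared against the commonly affected residues from the table above. The sketch assumes that the {ptm:...} tag directly follows its residue with no other tag in between; `list_ptms` and `COMMON_TARGETS` are hypothetical names, and the N-terminus case of acetylation is not modeled.

```python
import re

# Commonly affected residues per PTM code, copied from the table above.
COMMON_TARGETS = {
    "Phos": "STY", "Glyc": "N", "Ac": "K", "Me": "KR",
    "Ub": "K", "Sumo": "K", "OH": "PK", "palm": "C",
}

# A residue letter immediately followed by {ptm} or {ptm:Code}.
PTM = re.compile(r"([A-Za-z])\{ptm(?::([^{}]*))?\}")


def list_ptms(sequence):
    """Return (residue, code, typical) for each PTM tag in the sequence.

    'typical' is True/False for known codes and None when the code is
    missing or not in COMMON_TARGETS, since there is nothing to check.
    """
    result = []
    for residue, code in PTM.findall(sequence):
        if code in COMMON_TARGETS:
            typical = residue in COMMON_TARGETS[code]
        else:
            typical = None
        result.append((residue, code or None, typical))
    return result


print(list_ptms("ACS{ptm:Phos}DER{ptm:Me}FGHIK"))
# [('S', 'Phos', True), ('R', 'Me', True)]
```

A False value is only a hint that the code and residue form an unusual pair; the format itself does not forbid such combinations.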
5.5 Non-Natural Modifications (NNM)
Non-natural modifications (often abbreviated here as NNM) are chemical modifications that are not typical biological PTMs. These include synthetic modifications like attaching polymers, adding fluorescent probes (though we cover labels in 5.8 separately), or other chemical groups that are not gene-encoded modifications. These modifications are often used in drug development or protein engineering to enhance stability or to attach drugs/labels.
In MAP, we use the {nnm:Code} tag for non-natural modifications. Some examples:
| NNM Code | Modification Type | Example Sequence |
|---|---|---|
| PEG | PEGylation | AGRK |
| Fluoro | Fluorination | STY |
| Biotin | Biotinylation | RKCK |
| (undefined) | An undefined non-natural modification. | GRPN |
If the specific modification is not among the predefined codes, {nnm} alone can be used to denote "there is a non-natural modification here, type unspecified." For example, GRPN{nnm}QR would mean the asparagine has some non-natural modification, but we’re not detailing what it is in the sequence.
Complete example:
>Peptide002
CK{nnm:PEG}DY{nnm:PMe}FC{nnm:Biotin}HIK
In this sequence, multiple non-natural modifications are present: a PEGylation on K, a phos-methyl on Y, and a biotin on C. This illustrates that many tags can coexist in one sequence, each immediately following the residue it modifies.
5.6 Non-Natural Residues (NNR)
Non-natural residues (NNRs) are amino acids not among the standard 20 genetically encoded amino acids. These could be amino acid analogs or synthetic amino acids incorporated via chemical peptide synthesis or specialized translational machinery (such as tRNA suppressor techniques in synthetic biology). NNRs allow for fine-tuned control of protein properties, such as adding functional groups not normally present, enhancing stability, or creating binding sites.
In MAP format, non-natural amino acids are indicated with the {nnr:Code} tag at the position where the non-natural amino acid occurs.
Most non-natural amino acids have no standard one-letter code, but the sequence still needs a letter at that position for FASTA compatibility. This manual therefore assumes that the sequence uses a placeholder letter (typically X) and that the {nnr:Name} tag clarifies what the residue actually is.
Some rare amino acids have been assigned one-letter codes in certain contexts (like U for selenocysteine and O for pyrrolysine, which are actually natural but rare amino acids). For truly non-natural residues, X is the usual placeholder.
Common examples of NNR tags (the code after nnr: is usually a short abbreviation of the residue name):
| NNR Code | Non-Natural Residue | Example Sequence |
|---|---|---|
| Nle | Norleucine | LNX |
| Hph | Homophenylalanine | LFX |
| Can | Canavanine | GRX |
| Orn | Ornithine | AKX |
| Cit | Citrulline | KKRX |
| Har | Homoarginine | MKRX |
| Aze | Azetidine-2-carboxylic acid | NPX |
| βAla | β-Alanine | HAX |
| Aib | α-Aminoisobutyric acid | HAX |
| Bpa | p-Benzoyl-L-phenylalanine | LFX |
| Cha | Cyclohexylalanine | GFX |
| Fpa | 4-Fluorophenylalanine | KFX |
| Nal | 2-Naphthylalanine | DFX |
| (Unknown) | Undefined non-natural residue (NNR) | LLLX |
Since each non-natural residue usually replaces a standard amino acid position, how do we denote it in the sequence string? One approach is to use X as the letter in the sequence for any non-standard amino acid and rely on the {nnr:Name} tag to specify what it actually is. Alternatively, if the unnatural amino acid is similar to a natural one, one might use that letter and then clarify. To avoid confusion, using X is a clear way to mark “this is not one of the 20 standard amino acids,” and then the tag clarifies which one it is.
Note: The above examples use X as a generic placeholder for the unnatural amino acid in the actual sequence string. If a particular community or tool allows certain letters for certain non-standard amino acids, those could be used. For instance, sometimes O is used for pyrrolysine and U for selenocysteine in actual sequences (which are standard in an extended sense). The MAP format is flexible; the key is that the tag {nnr:Name} clarifies what the residue is.
5.7 Conjugation of Macromolecules
Proteins can be conjugated to other molecules such as polymers, lipids, or other biomolecules (for example, attaching a protein to DNA or to a drug molecule). These conjugations often involve a covalent bond between a specific amino acid and another molecular entity. In the MAP format, we denote such attachments with the {conj:Code} tag, where conj stands for conjugation.
| Conjugation Code | Attached Molecule Type | Example Sequence |
|---|---|---|
| Lipid | Lipid | G |
| Mal | Maleimide conjugate | AC |
| DBCO | Dibenzocyclooctyne | K |
The conjugation tag is meant to capture attachments that are not small chemical groups (those would be in PTMs or NNM) but larger entities or specialized linkers.
5.8 Isotopic & Fluorescent Labeling
Isotopic labeling and fluorescent labeling are special cases of modifications often used in experimental contexts:
- Isotopic labeling: incorporating stable isotopes like ^13C, ^15N, ^2H (deuterium), or ^18O into specific positions in the protein for NMR studies or mass spectrometry. These do not change the chemical structure in a way that alters the amino acid letter (it’s still the same amino acid, just heavier), but it is useful to annotate them.
- Fluorescent labeling: attaching a fluorescent dye (such as fluorescein, rhodamine, Cy5, FITC, etc.) to a residue (commonly lysine or cysteine) for detection or imaging purposes.
In MAP format, these can be represented with an {iso:Code} tag (think of iso as indicating either isotope or possibly "label"). We choose iso as a prefix to cover these labeling cases for simplicity.
| Label Code | Label Type | Example Sequence |
|---|---|---|
| 13C | Carbon-13 isotope | CA |
| 15N | Nitrogen-15 isotope | GAQ |
| 2H | Deuterium isotope | L |
| 18O | Oxygen-18 isotope | AS |
| Fluorescein | Fluorescein dye | ASK |
| Cy5 | Cyanine5 dye | Y |
| FITC | Fluorescein isothiocyanate | S |
Complete example:
>Peptide010
AK{iso:Fluorescein}DEFA{iso:13C}HIK
This indicates the lysine (K) is labeled with a fluorescein dye, and the alanine (A) later in the sequence is ^13C-labeled. In practice, one might label multiple residues or uniformly label a protein. The MAP format allows indicating specific labeled sites if needed.
Having covered various chemical modifications and special residues (Sections 5.1 through 5.8), we now move on to annotation of functional sites—namely, residues that interact with other molecules—and to indicating sequence variants. These are not chemical modifications per se, but rather annotations of residue function or changes in the primary sequence.
6. Residue Annotations (Interaction Sites)
Beyond chemical modifications, it is often useful to annotate residues that have certain functional roles or interactions. For instance, an enzyme active site, a DNA-binding residue, or a metal-binding site are not "modified" in a chemical sense, but we want to mark them as important. The MAP format includes an {IR:...} tag for such interacting residues, which can denote interactions with various biomolecules or ligands.
The prefix IR stands for "Interaction Residue" (or one might think "Interacts with..."). The code following IR: specifies the type of interaction or the molecule with which the residue interacts.
| IR Code | Interacting Molecule | Example Sequence |
|---|---|---|
| DNA | DNA | AKL |
| RNA | RNA | CDE |
| Pro | Protein | HIJ |
| ATP | ATP | MN |
| GTP | GTP | QR |
| UTP | UTP | UV |
| NAD | NAD | YZ |
| FAD | FAD | CD |
| FMN | FMN | GH |
| Heme | Heme | KL |
These tags allow marking residues that are functionally important because they bind or contact some other entity. For example, in a DNA-binding protein, one might annotate all the residues that contact the DNA with {IR:DNA} to easily see where the DNA-binding interface is along the sequence.
The code after IR: can be fairly flexible, representing small molecules (ATP, GTP, heme), macromolecules (DNA, RNA, Protein), or even specific molecule names or identifiers. For instance, one could use {IR:Ca} for a calcium-binding residue, or {IR:Zn} for a zinc-binding residue, even though those weren’t explicitly listed above. As another example, if a protein interacts with a specific small molecule inhibitor or drug, you might even use something like {IR:DrugName} as a tag.
Complete example usage in a sequence with multiple interactions:
>Peptide011
ACD{IR:ATP}FK{IR:PI}HI{IR:miR-21}AT{IR:DNA}
Breaking down this example:
- D{IR:ATP} – aspartate interacting with ATP.
- K{IR:PI} – lysine interacting with "PI". Here, PI could stand for phosphatidylinositol (a lipid) or inorganic phosphate. Given context, it likely means a phosphatidylinositol (a lipid signaling molecule), showing that the format could allow such abbreviations.
- I{IR:miR-21} – isoleucine interacting with miR-21, which implies this residue is involved in binding a microRNA (miR-21). This is an example of using a specific molecule name (miR-21) in the tag.
- T{IR:DNA} – threonine interacting with DNA.
This illustrates that the IR tag can be customized for a wide range of interaction partners. The MAP format user should ensure that any custom codes used after IR: are understandable (perhaps by providing a legend or comments elsewhere, since the format itself doesn’t enforce a controlled vocabulary for the ligand names). In summary, the {IR:...} tags allow annotation of residues by their interaction partners, enriching the sequence with functional information beyond just the amino acid modifications.
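Interaction tags can be grouped by partner to get an overview of binding sites along a sequence. This is a minimal sketch assuming 1-based positions and flat tags; `interaction_sites` is a hypothetical name, and partner codes are taken verbatim from the tag because the format defines no controlled vocabulary.

```python
import re
from collections import defaultdict

# Either a whole {...} tag or a single residue letter.
PIECE = re.compile(r"\{([^{}]*)\}|([A-Za-z])")


def interaction_sites(sequence):
    """Map each IR partner code to the (position, residue) pairs tagged with it."""
    sites = defaultdict(list)
    position, residue = 0, None
    for match in PIECE.finditer(sequence):
        tag_body, letter = match.groups()
        if letter is not None:
            position, residue = position + 1, letter
        elif tag_body.startswith("IR:") and residue is not None:
            sites[tag_body[3:]].append((position, residue))
    return dict(sites)


sites = interaction_sites("ACD{IR:ATP}FK{IR:PI}HI{IR:miR-21}AT{IR:DNA}")
print(sites)
# {'ATP': [(3, 'D')], 'PI': [(5, 'K')], 'miR-21': [(7, 'I')], 'DNA': [(9, 'T')]}
```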
7. Protein Variants (Mutations, Insertions, Deletions)
Proteins often exist in multiple variant forms due to genetic mutations, alternative splicing, or engineered changes. The MAP format provides a way to represent mutations, insertions, and deletions relative to a reference sequence, directly within the sequence string. This is particularly useful for documenting how a variant differs from a canonical sequence without writing two sequences or using diff formats.
We use three types of tags for variants:
- {mut:...} for point mutations or replacements (the prefix "mut" indicates that the preceding residue(s) are replaced by the ones listed in the tag).
- {ins:...} for insertions (the prefix "ins" indicates that the residues inside the tag are inserted at that position).
- {del:...} for deletions (the prefix "del" indicates that the residues listed are deleted from the sequence).
The way to interpret these in a sequence context is as follows:
| Variant Type | Tag | Description |
|---|---|---|
| Mutation | {mut:...} | Original residue(s) replaced by new residues. |
| Insertion | {ins:...} | Residues inserted at a position. |
| Deletion | {del:...} | Residues deleted from the sequence. |
Examples of variant tags:
Point Mutation (single amino acid substitution):
- Example: GPA{mut:R}KV
- Here, the original sequence has A at that position, and {mut:R} indicates that this alanine is mutated to arginine (R). The interpretation is that in the variant, A is replaced by R. So the variant sequence would read as GPRKV. The MAP notation shows the original letter A and then notes the mutation to R.
Multiple Residue Mutation (replacement of a sequence segment):
- Example: AGRTH{mut:MKN}DGG
- In this sequence, the segment RTH is followed by {mut:MKN}. This indicates that the segment "RTH" is replaced by "MKN". The letters RTH are the original residues, and MKN are the new residues in the variant. So originally it was AGRTHDGG, and in the variant it would be AGMKNDGG. The MAP format shows what was there and what it changed to in one string. After the mutation tag, the sequence continues with DGG, which are unchanged residues following the mutation site.
Insertion:
- Example: AG{ins:G}TCA
- This indicates that a glycine (G) is inserted after the sequence "AG" and before "TCA". In other words, relative to the reference, an extra "G" is present at that position. The original context was "AGTCA"; with the insertion it becomes "AGGTCA". The MAP notation shows the insertion at the position between G and T of the original sequence.
- Example (multiple insertion): ATRG{ins:MK}TCA
- This means after "ATRG" and before "TCA", the dipeptide "MK" is inserted. If the original was "ATRGTCA", the new sequence becomes "ATRGMKTCA".
Deletion:
- Example (single deletion): ARGM{del}KLV
- Here, an M is followed by {del}, indicating that M is deleted in the variant. The original sequence has "M" at that position, but the variant does not. Essentially, the variant sequence would read as "ARGKLV" (with the M missing). The MAP notation keeps the M in the sequence string to show context but marks it as deleted.
- Example (multiple deletion): AGH{del:RTH}DGG
- Suppose the original sequence is AGHRTHDGG. In the MAP notation we write AGH{del:RTH}DGG to indicate that the segment "RTH" is deleted. In the variant, that segment is gone, so the variant sequence would be AGHDGG. We have shown the segment "RTH" inside the braces to explicitly state which residues are removed. (An alternative could have been writing AGHRTH{del}DGG by placing the tag after the sequence to remove, but the notation we use here with listing them in the tag is clearer for multiple residues.)
Clarification of Usage:
- Use {mut:X} after a single residue to indicate it changes to X.
- Use {mut:XYZ} after a sequence segment to indicate those residues change to XYZ (which can be a different length than the original segment).
- Use {ins:XYZ} at the position of insertion (after the preceding original residue) to indicate XYZ residues are inserted.
- Use {del} after a single residue to indicate that residue is removed. For multiple residues, this manual suggests listing the deleted residues inside the tag, as {del:XYZ}, placed at the position where they are removed. Whichever form is used, apply it consistently.
These variant annotations allow a single MAP sequence to explicitly document a change relative to a reference. Stripping away the curly brace annotations typically yields the original (unmodified) sequence; the exception is the {del:XYZ} form, where the deleted residues appear only inside the tag. The tags themselves describe how the variant differs.
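The variant rules above can be sketched as a function that rebuilds both the reference and the variant sequence from one MAP string. Two assumptions are made and are not stated by the manual: a {mut:...} tag replaces exactly as many preceding residues as it lists, and variant tags do not directly follow one another. The name `expand_variant` is hypothetical.

```python
import re

# Either a whole {...} tag or a single residue letter.
PIECE = re.compile(r"\{([^{}]*)\}|([A-Za-z])")


def expand_variant(sequence):
    """Return (reference, variant) sequences for a MAP string with variant tags."""
    reference, variant = [], []
    for match in PIECE.finditer(sequence):
        tag_body, letter = match.groups()
        if letter is not None:
            reference.append(letter)
            variant.append(letter)
            continue
        prefix, _, code = tag_body.partition(":")
        if prefix == "mut":
            # Assumption: replace as many preceding residues as the tag lists.
            del variant[len(variant) - len(code):]
            variant.extend(code)
        elif prefix == "ins":
            variant.extend(code)
        elif prefix == "del" and code:
            # Residues listed inside the tag exist only in the reference.
            reference.extend(code)
        elif prefix == "del" and variant:
            # Bare {del}: the preceding residue is absent from the variant.
            variant.pop()
        # Any other tag (ptm, IR, ...) is ignored in this sketch.
    return "".join(reference), "".join(variant)


for example in ["GPA{mut:R}KV", "AGRTH{mut:MKN}DGG", "AG{ins:G}TCA",
                "ARGM{del}KLV", "AGH{del:RTH}DGG"]:
    print(example, expand_variant(example))
```

A replacement whose length differs from the original segment cannot be resolved under the first assumption, because the tag alone does not say how many preceding residues it covers.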
Summary of variant tags:
| Variant Type | Tag | Description | Example | Explanation |
|---|---|---|---|---|
| Point Mutation | {mut:Code} | Single amino acid substitution | GPA{mut:R}KV | Alanine (A) is replaced by Arginine (R). |
| Multiple Residue Mutation | {mut:Code} | Replacement of a sequence segment | AGRTH{mut:MKN}DGG | Segment RTH is replaced by MKN. |
| Insertion | {ins:Code} | Single amino acid insertion | AG{ins:G}TCA | Glycine (G) is inserted after AG and before TCA. |
| Multiple Insertion | {ins:Code} | Multiple amino acid insertion | ATRG{ins:MK}TCA | Dipeptide MK is inserted after ATRG and before TCA. |
| Single Deletion | {del} | Single amino acid deletion | ARGM{del}KLV | Methionine (M) is deleted. |
| Multiple Deletion | {del:Code} | Multiple amino acid deletion | AGH{del:RTH}DGG | Segment RTH is deleted. |
8. Protein-Level Annotations in the Header
While the previous sections dealt with residue-level annotations (within the sequence line), the MAP format also supports annotations at the protein level, which are included in the header line of the entry. In a traditional FASTA header, everything after the > is an unstructured description. MAP format adds structure to the header by using curly brace tags in the header line to indicate specific types of metadata about the protein. These header tags allow one to record information such as the protein’s origin, function, source database, etc., in a machine-readable way. They appear after the protein name or identifier in the header line.
The general format is:
>ProteinID {tag1:Value1} {tag2:Value2} {tag3:Value3} ...
Here, ProteinID (or name) is written as usual (without braces), and then each metadata item is given as a {keyword:value} pair in braces.
Standardized header tags in MAP format include:
| Tag | Description | Example |
|---|---|---|
| {id:...} | A unique identifier for the protein. | {id:P99999} |
| {name:...} | The full protein name or gene name. | {name:Cytochrome c} |
| {org:...} | The organism from which the protein is derived. | {org:Homo sapiens} |
| {func:...} | A brief functional description. | {func:Electron transport} |
| {loc:...} | Subcellular localization. | {loc:Mitochondrion} |
| {src:...} | Source or database. | {src:UniProt} |
| {len:...} | Length of the protein sequence. | {len:104} |
| {exp:...} | Experimental status. | {exp:Validated} |
| {bind:...} | Binding partner category. | {bind:DNA} |
| {target:...} | Target indication. | {target:drug} |
| {note:...} | Miscellaneous notes. | {note:Synthetic variant} |
These tags are optional and can be included as needed. They provide a structured way to include what would normally be free-text comments in a FASTA header.
Example of a MAP header with multiple tags:
>CYCS_HUMAN {id:P99999} {name:Cytochrome c} {org:Homo sapiens} {func:Electron transport} {loc:Mitochondrion} {src:UniProt} {len:104} {exp:Validated}
Breaking this down:
- CYCS_HUMAN – the primary identifier (could be a shorthand or a concatenation of gene name and species in this case). This is left untagged because it's the “label” of the entry.
- {id:P99999} – a unique ID (here a UniProt-style accession for human cytochrome c; P99999 is used for illustration).
- {name:Cytochrome c} – the common name of the protein.
- {org:Homo sapiens} – the organism is Homo sapiens.
- {func:Electron transport} – functionally, it is involved in electron transport.
- {loc:Mitochondrion} – localized in the mitochondrion.
- {src:UniProt} – the source of this sequence information is the UniProt database.
- {len:104} – the sequence length is 104 amino acids.
- {exp:Validated} – this sequence has been experimentally validated (as opposed to predicted).
After the header line, the sequence would follow on the next line, potentially with the modifications as described in earlier sections. The header tags do not affect the sequence; they are purely metadata.
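The example header can be split into its untagged label and its tags with a short regular expression. The sketch below is one possible approach, not an official parser. The function name `parse_map_header` is hypothetical, and it assumes that values contain no curly braces and that each keyword appears at most once.

```python
import re

# One {keyword:value} pair; assumes values contain no curly braces.
HEADER_TAG = re.compile(r"\{([^{}:]+):([^{}]*)\}")

def parse_map_header(line: str) -> tuple[str, dict[str, str]]:
    """Split a MAP header line into (label, {keyword: value})."""
    if not line.startswith(">"):
        raise ValueError("a header line must start with '>'")
    body = line[1:].strip()
    first_brace = body.find("{")
    # Everything before the first tag is the untagged primary identifier.
    label = (body if first_brace == -1 else body[:first_brace]).strip()
    tags = {key.strip(): value.strip() for key, value in HEADER_TAG.findall(body)}
    return label, tags

header = (">CYCS_HUMAN {id:P99999} {name:Cytochrome c} {org:Homo sapiens} "
          "{func:Electron transport} {loc:Mitochondrion} {src:UniProt} "
          "{len:104} {exp:Validated}")
label, tags = parse_map_header(header)
print(label, tags["org"], tags["len"])
```

All values are returned as strings, so a caller that needs the length as a number would convert `tags["len"]` itself.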
Usage notes:
- All header tags should be enclosed in their own {} braces and separated by spaces.
- The order of tags is not enforced, but keeping a consistent order such as the one shown (id, name, org, func, loc, etc.) helps readability.
- If certain tags are not applicable or unknown, they can be omitted. There’s no requirement to include all these tags for every entry.
- Users can define additional custom tags if needed (for example, {disease:...} to indicate an associated disease if it’s a variant found in disease, or {isoform:...} to indicate the isoform number). The {note:...} tag is a general catch-all for any such information not covered by a specific tag.
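The usage notes above can be sketched as a small header writer. The name `format_map_header` and the `SUGGESTED_ORDER` list are hypothetical: the order follows the tag table in this section, and placing custom tags after the standard ones in alphabetical order is an arbitrary choice for this example, not a rule of the format.

```python
# Order taken from the tag table in this section; not mandated by the format.
SUGGESTED_ORDER = ["id", "name", "org", "func", "loc", "src",
                   "len", "exp", "bind", "target", "note"]

def format_map_header(label: str, tags: dict[str, object]) -> str:
    """Build '>label {k:v} {k:v} ...' with each tag in its own braces."""
    known = [key for key in SUGGESTED_ORDER if key in tags]
    # Custom tags (e.g. isoform, disease) go last, alphabetically (assumption).
    custom = sorted(key for key in tags if key not in SUGGESTED_ORDER)
    parts = [">" + label]
    for key in known + custom:
        value = str(tags[key])
        if "{" in value or "}" in value:
            raise ValueError(f"value of '{key}' must not contain braces")
        parts.append("{" + key + ":" + value + "}")
    # Tags are separated by single spaces, as the usage notes require.
    return " ".join(parts)

print(format_map_header("CYCS_HUMAN",
                        {"org": "Homo sapiens", "id": "P99999", "isoform": 2}))
```

Tags that are unknown or not applicable are simply left out of the dictionary, so they never appear in the header.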
By structuring the header, one can programmatically parse out this metadata easily, which is a big advantage over free-form FASTA headers. It ensures key info about each protein is accessible in a consistent way.
9. Conclusion
The MAP format extends the simplicity of FASTA with powerful annotation capabilities, allowing detailed description of protein modifications, variants, and metadata within the sequence record itself. By using curly-brace tags with clear prefixes, MAP maintains human readability and machine parsability. Researchers can encode complex information—such as post-translational modifications ({ptm:} tags), binding sites ({IR:} tags), or mutations ({mut:}, {ins:}, {del:} tags)—directly alongside sequence data. Likewise, protein-level details can be embedded in the header in a structured manner.
In summary, the MAP format is a flexible and backward-compatible way to represent enriched protein sequence information. It facilitates comprehensive documentation of protein variants and modifications, which is invaluable for data exchange in proteomics, protein engineering, and bioinformatics. Users of the MAP format are encouraged to follow the conventions outlined in this manual to ensure consistency and to update the format as needed for emerging types of annotations. With standardized usage, MAP format can greatly enhance the utility of sequence databases by marrying sequence data with contextual biological information in one place.