Biological Data Formats: FASTA, FASTQ, GenBank, SAM/BAM & PDB

Biological data formats represent and store biological information, and bioinformatics and computational biology rely on a range of file formats to do so. This article introduces the common formats that Biopython can handle, including FASTA, FASTQ, GenBank, SAM/BAM, and PDB, and describes the structure and features of each.

Common Biological Data Formats

FASTA Format: Simple text-based format for representing nucleotide or protein sequences. Consists of a header line starting with ‘>’ and the sequence data.
GenBank Format: Standard format for representing DNA or RNA sequences along with annotations. Contains sequence data, features, and metadata in a structured manner.
FASTQ Format: Used to store high-throughput sequencing data, including DNA reads and their quality scores. Contains sequence reads, base qualities, and additional information.
PDB Format: Protein Data Bank format for representing protein structures. Contains atomic coordinates, atom types, and other structural information.

FASTA Format for Nucleotide Sequences

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by one or more lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column.

The simplicity of the FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages. File extensions: file.fa, file.fasta, file.fsa

>NC_000011.10:c2161209-2159779 Homo sapiens chromosome 11, GRCh38.p14 Primary Assembly
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG
GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG
CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC

#Homo sapiens chromosome 11, GRCh38.p14 Primary Assembly
#NCBI Reference Sequence: NC_000011.10
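
Because the layout is so simple, a FASTA file can be read with a few lines of plain Python. The following is a minimal sketch rather than a full parser: the function name parse_fasta and the two toy records are made up for illustration, and it assumes every record starts with a “>” line.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (hypothetical records, not real database entries).
fasta_text = """>seq1 toy nucleotide record
AGCCCTCCAGGACAGG
CTGCATCAGAAGAGGC
>seq2 toy peptide record
MALWMRLLPLLALLAL
"""

records = list(parse_fasta(fasta_text))
for header, sequence in records:
    print(header, len(sequence))
```

Wrapped sequence lines are joined back into a single string, which is why a record is not tied to a fixed number of lines.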

SAM (Sequence Alignment Map)

The SAM format consists of one header section and one alignment section. The lines in the header section start with the character ‘@’, and the lines in the alignment section do not. All lines are TAB delimited.

Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information. Example: file.sam

An annotated example of SAM format. (A) An example alignment result. (B) The alignment result represented in SAM format. Black color represents the SAM information, and colorful text is the annotation. | CC: zyxue.github.io
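
The 11 mandatory fields are named QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL in the SAM specification. As a rough sketch of how a TAB-delimited alignment line maps onto them, the snippet below splits a toy SAM text into header and alignment lines. The helper name parse_sam_line and the sample data are illustrative assumptions, and no validation beyond a field count is attempted.

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 TAB-separated fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, left unparsed
    return record


# Toy SAM text: two header lines and one alignment line (hypothetical data).
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1\n"
)

headers = [ln for ln in sam_text.splitlines() if ln.startswith("@")]
alignments = [parse_sam_line(ln) for ln in sam_text.splitlines() if not ln.startswith("@")]
print(alignments[0]["QNAME"], alignments[0]["POS"], alignments[0]["CIGAR"])
```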

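Currently, SeqIO does not read or write SAM/BAM files. Recent Biopython releases handle SAM through the Bio.Align module instead, and BAM files are usually processed with dedicated tools such as SAMtools.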

FASTQ file format

FASTQ format is a text-based format specifically designed to store biological sequences (usually nucleotide sequences) along with their corresponding quality scores. Here’s a closer look at its structure and how it differs from FASTA:

Structure: A FASTQ file typically uses four lines for each sequence:

Header Line (begins with “@”): This line identifies the sequence with an ID and an optional description. It’s similar to the header line in FASTA format.
Sequence Line: This line contains the actual sequence of nucleotides, typically represented by single-letter codes (A, C, G, T, or U).
Separator Line (begins with “+”): This line separates the sequence from its quality scores. It may optionally repeat the sequence identifier, but otherwise carries no data.
Quality Score Line: This line encodes the quality score for each base in the sequence. Quality scores are typically represented by ASCII characters with a specific encoding scheme (like Sanger or Illumina). Higher scores indicate higher confidence in the base call.
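
To make the quality line concrete, here is a small sketch that decodes quality characters into Phred scores and error probabilities. It assumes the Sanger encoding, where the score is the ASCII code minus 33 (some older Illumina pipelines used an offset of 64), and the relation P = 10^(-Q/10) between a Phred score Q and the error probability P. The function names are chosen for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# The wrapped second quality line of the first record in the example below.
scores = phred_scores("IIII9IG9IC")
print(scores)

for score in sorted(set(scores)):
    print(score, error_probability(score))
```

With this encoding the character “I” corresponds to a score of 40, or an estimated error rate of 1 in 10,000, while “9” corresponds to a score of 24.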

Key Differences from FASTA:

Quality Scores: FASTQ includes quality scores, which are absent in FASTA. This makes FASTQ ideal for analyzing data from high-throughput sequencing technologies where base call accuracy can vary.
Line Count: A FASTQ record typically takes four lines, while a FASTA record has one header line followed by one or more sequence lines.

File extensions: file.fastq, file.sanfastq, file.fq

Here’s an example of a FASTQ file with two records:

@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTC
AAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIII
IIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI

FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. The sequence and the quality scores are usually written on a single line each, although the format permits wrapping, as in the first record above.

FASTQ format and Phred quality scores. cc: gencoded.com

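Biopython and FASTQ: Biopython’s SeqIO module can both read and write FASTQ files. Writing requires each record to carry per-base quality scores, which SeqIO stores in the record’s letter_annotations under the key “phred_quality”. The format name “fastq” refers to the Sanger encoding, while “fastq-solexa” and “fastq-illumina” select the older encodings.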
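
The following sketch shows one way to handle a FASTQ record with SeqIO, assuming Biopython is installed. It parses a single toy record from an in-memory string (the read name and bases are invented), looks up the decoded Phred scores, and writes the record back out. SeqIO.write() accepts “fastq” as an output format as long as the record carries phred_quality values, and the same record can also be exported as FASTA, which drops the qualities.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTQ record (hypothetical read, Sanger-encoded qualities).
fastq_text = "@read1\nGATTACA\n+\nIIII9IG\n"

# SeqIO.read() expects exactly one record in the input.
record = SeqIO.read(StringIO(fastq_text), "fastq")
qualities = record.letter_annotations["phred_quality"]
print(record.id, record.seq, qualities)

# Write the record back out as FASTQ.
fastq_handle = StringIO()
SeqIO.write(record, fastq_handle, "fastq")
fastq_out = fastq_handle.getvalue()

# Convert to FASTA; the quality scores are not part of that format.
fasta_out = record.format("fasta")
print(fasta_out)
```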

Protein Data Bank (PDB) file format

The Protein Data Bank (PDB) file format describes the three-dimensional (3D) structures of molecules held in the Protein Data Bank. It has been succeeded by the mmCIF format, which is now the default format used by the Protein Data Bank. The PDB format provides for the description and annotation of protein and nucleic acid structures, including atomic coordinates, secondary structure assignments, and atomic connectivity.

The primary information stored in the PDB archive consists of coordinate files for biological molecules. These files list the atoms in each protein and their 3D location in space. These files are available in several formats (PDB, mmCIF, XML). A typical PDB formatted file includes a large “header” section of text that summarizes the protein, citation information, and the details of the structure solution, followed by the sequence and a long list of the atoms and their coordinates.

PPT – Structure Databases: The Protein Data Bank (slideserve.com)

Example of file.pdb

HEADER    EXTRACELLULAR MATRIX                    22-JAN-98   1A3I
TITLE     X-RAY CRYSTALLOGRAPHIC DETERMINATION OF A COLLAGEN-LIKE
TITLE    2 PEPTIDE WITH THE REPEATING SEQUENCE (PRO-PRO-GLY)
...
EXPDTA    X-RAY DIFFRACTION
AUTHOR    R.Z.KRAMER,L.VITAGLIANO,J.BELLA,R.BERISIO,L.MAZZARELLA,
AUTHOR   2 B.BRODSKY,A.ZAGARI,H.M.BERMAN
...
REMARK 350 BIOMOLECULE: 1
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B, C
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
...
SEQRES   1 A    9  PRO PRO GLY PRO PRO GLY PRO PRO GLY
SEQRES   1 B    6  PRO PRO GLY PRO PRO GLY
SEQRES   1 C    6  PRO PRO GLY PRO PRO GLY
...
ATOM      1  N   PRO A   1       8.316  21.206  21.530  1.00 17.44           N
ATOM      2  CA  PRO A   1       7.608  20.729  20.336  1.00 17.44           C
ATOM      3  C   PRO A   1       8.487  20.707  19.092  1.00 17.44           C
ATOM      4  O   PRO A   1       9.466  21.457  19.005  1.00 17.44           O
ATOM      5  CB  PRO A   1       6.460  21.723  20.211  1.00 22.26           C
...
HETATM  130  C   ACY   401       3.682  22.541  11.236  1.00 21.19           C
HETATM  131  O   ACY   401       2.807  23.097  10.553  1.00 21.19           O
HETATM  132  OXT ACY   401       4.306  23.101  12.291  1.00 21.19           O
...
#from a file describing the structure of a synthetic collagen-like peptide
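
ATOM and HETATM records use a fixed-column layout, so fields are read by character position rather than by splitting on whitespace. Below is a minimal sketch of that idea using the first two ATOM lines from the example above. The function name parse_atom_line and the dictionary keys are illustrative, and only a subset of the columns is extracted.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the fixed PDB column layout."""
    return {
        "record": line[0:6].strip(),       # columns 1-6: record type
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "name": line[12:16].strip(),       # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21].strip(),         # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: x coordinate
        "y": float(line[38:46]),           # columns 39-46: y coordinate
        "z": float(line[46:54]),           # columns 47-54: z coordinate
        "occupancy": float(line[54:60]),   # columns 55-60: occupancy
        "b_factor": float(line[60:66]),    # columns 61-66: temperature factor
        "element": line[76:78].strip(),    # columns 77-78: element symbol
    }


pdb_text = """\
ATOM      1  N   PRO A   1       8.316  21.206  21.530  1.00 17.44           N
ATOM      2  CA  PRO A   1       7.608  20.729  20.336  1.00 17.44           C
"""

atoms = [
    parse_atom_line(line)
    for line in pdb_text.splitlines()
    if line.startswith(("ATOM", "HETATM"))
]
for atom in atoms:
    print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

For real work, Biopython’s Bio.PDB module provides a full parser; the sketch only illustrates how the coordinate lines are laid out.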

GenBank file format

GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word “LOCUS”. The start of the sequence section is marked by a line beginning with the word “ORIGIN” and the end of the section is marked by a line with only “//”.

File extensions: file.gb, file.genbank

*Sample GenBank DNA file highlighting the three main sections.*

LOCUS       AAU03518                 237 bp    DNA     linear   PLN 04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
VERSION     U03518.1
KEYWORDS    .
SOURCE      Aspergillus awamori
  ORGANISM  Aspergillus awamori
            Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
            Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae;
            Aspergillus.
REFERENCE   1  (bases 1 to 237)
  AUTHORS   Borsuk,P., Gniadkowski,M., Kucharski,R., Bisko,M., Kanabus,M.,
            Stepien,P.P. and Bartnik,E.
  TITLE     Evolutionary conservation of the transcribed spacer sequences of
            the rDNA repeat unit in three species of the genus Aspergillus
  JOURNAL   Acta Biochim. Pol. 41 (1), 73-77 (1994)
   PUBMED   8030378
REFERENCE   2  (bases 1 to 237)
  AUTHORS   Borsuk,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-NOV-1993) Borsuk P., University of Warsaw, Department
            of Genetics, Al.Ujazdowskie 4, Warsaw, Pl-00478, Poland
FEATURES             Location/Qualifiers
     source          1..237
                     /organism="Aspergillus awamori"
                     /mol_type="genomic DNA"
                     /strain="wild type"
                     /db_xref="taxon:105351"
                     /clone="clone pAaw1"
                     /tissue_type="mycelium"
                     /clone_lib="A.awamori Sau3AI partial digest genomic
                     library in pUC18"
                     /dev_stage="mycelia, young"
     rRNA            <1..20
                     /product="18S ribosomal RNA"
     misc_RNA        21..205
                     /product="internal transcribed spacer 1"
     rRNA            206..>237
                     /product="5.8S ribosomal RNA"
ORIGIN      
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
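
The LOCUS, ORIGIN and “//” markers are enough to separate the annotation section from the sequence section. The sketch below does this in plain Python for a shortened toy record. The function name split_genbank and the record contents are assumptions made for illustration, and the annotation lines are returned as-is rather than interpreted.

```python
def split_genbank(text):
    """Return (annotation_lines, sequence) for a single GenBank record."""
    annotation, sequence_parts = [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if line.startswith("ORIGIN"):
            in_sequence = True  # sequence section starts on the next line
            continue
        if in_sequence:
            # Each sequence line starts with a position number, which is dropped.
            sequence_parts.extend(line.split()[1:])
        else:
            annotation.append(line)
    return annotation, "".join(sequence_parts)


# Shortened toy record (hypothetical, not a real GenBank entry).
genbank_text = """\
LOCUS       TOY0001                   30 bp    DNA     linear   PLN 04-FEB-1995
DEFINITION  Hypothetical toy record.
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg
//
"""

annotation, sequence = split_genbank(genbank_text)
print(annotation[0].split()[1], len(sequence))
```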

Reading and Writing Biological Data with Sequence Input/Output

Biopython’s Sequence Input/Output (SeqIO) module provides a convenient way to read and write biological data in various formats. SeqIO.read() reads a file that contains exactly one record. SeqIO.parse() iterates over the records in a file that may contain many. SeqIO.write() writes one or more records to a file in a specified format.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)

rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)

rec3 = SeqRecord(
    Seq(
        "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
        "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
        "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
        "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
        "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
        "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
        "TGEGLEWGVLFGFGPGLTVETVVLHSVAT",
    ),
    id="gi|13925890|gb|AAK49457.1|",
    description="chalcone synthase [Nicotiana tabacum]",
)

my_records = [rec1, rec2, rec3]

SeqIO.write(my_records, "my_example.faa", "fasta")

#output file

>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT

#Code adapted from Biopython v1.85.dev0

Reading Sequences from a FASTA file

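The SeqIO.parse() function reads multiple sequences from a FASTA file and returns an iterator of SeqRecord objects. Each record represents a single sequence, with attributes such as id (the identifier taken from the header line), description (the full header line), and seq (the sequence data).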
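
A short sketch of reading FASTA records with SeqIO.parse(), assuming Biopython is installed. To keep it self-contained, the records are read from an in-memory string through StringIO; with a real file, a path such as the my_example.faa file written earlier could be passed instead. The two toy records are invented for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTA input (hypothetical records).
fasta_text = """\
>seq1 toy nucleotide record
AGCCCTCCAGGACAGG
CTGCATCAGAAGAGGC
>seq2 second toy record
GGGCTCAGGATTCCAG
"""

# SeqIO.parse() yields one SeqRecord per FASTA entry.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for record in records:
    print(record.id, len(record.seq), record.description)

ids = [record.id for record in records]
first_sequence = str(records[0].seq)
```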

Format	Readable with SeqIO	Writable with SeqIO
FASTA	Yes	Yes
FASTQ	Yes	Yes (records need per-base quality scores)
GenBank	Yes (basic)	Yes (basic)
SAM/BAM	No	No
PDB	Yes (sequences only)	No

Capabilities of Biopython’s SeqIO module (as of 03/03/24). Outside SeqIO, Biopython handles SAM alignments through Bio.Align and full 3D structures through Bio.PDB.

References

Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne, The Protein Data Bank, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Pages 235–242, https://doi.org/10.1093/nar/28.1.235
Benjamin Hepp, Violette Da Cunha, Florence Lorieux, Jacques Oberto, BAGET 2.0: an updated web tool for the effortless retrieval of prokaryotic gene context and sequence, Bioinformatics, Volume 37, Issue 17, September 2021, Pages 2750–2752, https://doi.org/10.1093/bioinformatics/btab082
Documentation · Biopython
FASTQ format gencoded.com
FASTA format – Wikipedia
