This section introduces the common biological data formats supported by Biopython, including FASTA, GenBank, FASTQ, and PDB, and describes the structure and features of each. Biological data formats represent and store biological information, and bioinformatics and computational biology rely on many of them. Biopython provides support for reading and writing several of these formats.
Common Biological Data Formats
- FASTA Format: Simple text-based format for representing nucleotide or protein sequences. Consists of a header line starting with ‘>’ and the sequence data.
- GenBank Format: Standard format for representing DNA or RNA sequences along with annotations. Contains sequence data, features, and metadata in a structured manner.
- FASTQ Format: Used to store high-throughput sequencing data, including DNA reads and their quality scores. Contains sequence reads, base qualities, and additional information.
- PDB Format: Protein Data Bank format for representing protein structures. Contains atomic coordinates, atom types, and other structural information.
FASTA Format for Nucleotide Sequences
FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by one or more lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column.
The simplicity of the FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages. File extensions: file.fa, file.fasta, file.fsa
>NC_000011.10:c2161209-2159779 Homo sapiens chromosome 11, GRCh38.p14 Primary Assembly
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG
GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG
CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC
#Homo sapiens chromosome 11, GRCh38.p14 Primary Assembly
#NCBI Reference Sequence: NC_000011.10
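To illustrate how little machinery the format needs, here is a minimal sketch of a FASTA reader in plain Python. The function names `parse_fasta` and `_make_record` are hypothetical helpers written for this illustration, not part of Biopython, and the sample text reuses the header and the first bases of the record above, re-wrapped onto short lines to show that wrapped sequence lines are joined.

```python
def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited word of the header;
    # the rest of the line is a free-text description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield _make_record(header, chunks)


example = (
    ">NC_000011.10:c2161209-2159779 Homo sapiens chromosome 11, "
    "GRCh38.p14 Primary Assembly\n"
    "AGCCCTCCAGGACAGGCTGCATCAGAAGAGG\n"
    "CCATCAAGCAGGTCTGTTCC\n"
)

records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

For real work, Biopython's own FASTA parser is the better choice; this sketch only shows the structure of the format.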
SAM (Sequence Alignment Map)
The SAM format consists of one header section and one alignment section. The lines in the header section start with the character ‘@’, and the lines in the alignment section do not. All lines are TAB delimited.
Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information. Example: file.sam

Currently, SeqIO does not support SAM/BAM files directly. In recent Biopython releases, SAM alignments are handled by the separate Bio.Align module instead.
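The header/alignment split and the 11 mandatory fields can be sketched in plain Python. This is an illustrative sketch, not a full SAM parser: `parse_sam` is a hypothetical helper, and `toy_sam` is a small hand-written record (including an illustrative optional `NM:i:1` tag) rather than real alignment output.

```python
# Names of the 11 mandatory SAM columns, in order.
MANDATORY_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam(text):
    """Split SAM text into header lines and parsed alignment records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # header section
            continue
        columns = line.split("\t")  # all lines are TAB delimited
        record = dict(zip(MANDATORY_FIELDS, columns[:11]))
        for name in INTEGER_FIELDS:
            record[name] = int(record[name])
        record["optional"] = columns[11:]  # variable number of extra fields
        alignments.append(record)
    return header, alignments


toy_sam = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1\n"
)

header, alignments = parse_sam(toy_sam)
first = alignments[0]
# Bit 0x10 of FLAG marks a read aligned to the reverse strand.
is_reverse = bool(first["FLAG"] & 0x10)
print(len(header), first["QNAME"], first["POS"], first["CIGAR"], is_reverse)
```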
FASTQ file format
FASTQ format is a text-based format specifically designed to store biological sequences (usually nucleotide sequences) along with their corresponding quality scores. Here’s a closer look at its structure and how it differs from FASTA:
Structure: A FASTQ file typically has four lines for each sequence:
- Header Line (begins with “@”): This line identifies the sequence with an ID and an optional description. It’s similar to the header line in FASTA format.
- Sequence Line: This line contains the actual sequence of nucleotides, typically represented by single-letter codes (A, C, G, T, or U).
- Separator Line (begins with “+”): This line marks the end of the sequence. It usually contains nothing else, although it may optionally repeat the identifier from the header line.
- Quality Score Line: This line encodes the quality score for each base in the sequence. Quality scores are typically represented by ASCII characters with a specific encoding scheme (like Sanger or Illumina). Higher scores indicate higher confidence in the base call.
Key Differences from FASTA:
- Quality Scores: FASTQ includes quality scores, which are absent in FASTA. This makes FASTQ ideal for analyzing data from high-throughput sequencing technologies where base call accuracy can vary.
- Line Count: FASTQ records usually span four lines, while a FASTA record has one header line followed by one or more lines of sequence data.
File extensions: file.fastq, file.sanfastq, file.fq
Here’s an example of a FASTQ file with two records:
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTC
AAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIII
IIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. The sequence and quality scores are usually put into a single line each.
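The quality line can be decoded with a few lines of Python. The sketch below assumes the Sanger encoding (Phred scores stored as ASCII characters with an offset of 33); other encodings use different offsets, so the `offset` parameter is an assumption to check against your data. `phred_scores` and `error_probability` are hypothetical helper names.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(character) - offset for character in quality_line]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Second quality line of the first record in the example above.
scores = phred_scores("IIII9IG9IC")
print(scores)
print([round(error_probability(score), 5) for score in scores])
```

Under this assumption “I” decodes to a score of 40 (about one error in 10,000 calls) and “9” to a score of 24.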

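Biopython and FASTQ: Biopython’s SeqIO module can both read and write FASTQ files. The format name “fastq” (or “fastq-sanger”) selects the Sanger encoding, while “fastq-solexa” and “fastq-illumina” select the older Solexa and Illumina encodings.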
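As a small sketch of FASTQ handling with SeqIO, the snippet below parses a one-record FASTQ string, inspects the decoded quality scores, and writes the record back out in FASTQ format. The record `toy_fastq` is invented for illustration, and in-memory `StringIO` handles stand in for real files.

```python
from io import StringIO

from Bio import SeqIO

# A single invented record in Sanger-encoded FASTQ.
toy_fastq = "@read1\nGATTACA\n+\nIIII9IG\n"

records = list(SeqIO.parse(StringIO(toy_fastq), "fastq"))
# Per-base Phred scores are stored alongside the sequence.
qualities = records[0].letter_annotations["phred_quality"]
print(records[0].id, records[0].seq, qualities)

# SeqIO.write() returns the number of records written.
buffer = StringIO()
count = SeqIO.write(records, buffer, "fastq")
written = buffer.getvalue()

# Reading the written text back should give the same sequence and scores.
reparsed = list(SeqIO.parse(StringIO(written), "fastq"))
```

Passing a filename instead of a handle works the same way for both `SeqIO.parse()` and `SeqIO.write()`.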
Protein Data Bank (PDB) file format
The Protein Data Bank (PDB) file format describes the three-dimensional (3D) structures of molecules held in the Protein Data Bank. It has been succeeded by the mmCIF format, which is now the default format used by the Protein Data Bank. The PDB format provides for description and annotation of protein and nucleic acid structures, including atomic coordinates, secondary structure assignments, and atomic connectivity.
The primary information stored in the PDB archive consists of coordinate files for biological molecules. These files list the atoms in each protein and their 3D location in space. These files are available in several formats (PDB, mmCIF, XML). A typical PDB formatted file includes a large “header” section of text that summarizes the protein, citation information, and the details of the structure solution, followed by the sequence and a long list of the atoms and their coordinates.

Example of file.pdb
HEADER EXTRACELLULAR MATRIX 22-JAN-98 1A3I
TITLE X-RAY CRYSTALLOGRAPHIC DETERMINATION OF A COLLAGEN-LIKE
TITLE 2 PEPTIDE WITH THE REPEATING SEQUENCE (PRO-PRO-GLY)
...
EXPDTA X-RAY DIFFRACTION
AUTHOR R.Z.KRAMER,L.VITAGLIANO,J.BELLA,R.BERISIO,L.MAZZARELLA,
AUTHOR 2 B.BRODSKY,A.ZAGARI,H.M.BERMAN
...
REMARK 350 BIOMOLECULE: 1
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B, C
REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
...
SEQRES 1 A 9 PRO PRO GLY PRO PRO GLY PRO PRO GLY
SEQRES 1 B 6 PRO PRO GLY PRO PRO GLY
SEQRES 1 C 6 PRO PRO GLY PRO PRO GLY
...
ATOM 1 N PRO A 1 8.316 21.206 21.530 1.00 17.44 N
ATOM 2 CA PRO A 1 7.608 20.729 20.336 1.00 17.44 C
ATOM 3 C PRO A 1 8.487 20.707 19.092 1.00 17.44 C
ATOM 4 O PRO A 1 9.466 21.457 19.005 1.00 17.44 O
ATOM 5 CB PRO A 1 6.460 21.723 20.211 1.00 22.26 C
...
HETATM 130 C ACY 401 3.682 22.541 11.236 1.00 21.19 C
HETATM 131 O ACY 401 2.807 23.097 10.553 1.00 21.19 O
HETATM 132 OXT ACY 401 4.306 23.101 12.291 1.00 21.19 O
...
#from a file describing the structure of a synthetic collagen-like peptide
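PDB records are fixed-width: each value lives in a specific column range, so the spacing in a real file matters (the listing above shows the fields but not their exact columns). The sketch below builds one correctly aligned ATOM line and reads it back by column slices. The column positions follow the published PDB format description; `format_atom` and `parse_atom` are hypothetical helpers, and only a subset of fields is handled.

```python
def format_atom(serial, name, res_name, chain, res_seq,
                x, y, z, occupancy, b_factor, element):
    """Build an ATOM record with each field in its fixed column range."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain:1}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
        f"          {element:>2}"
    )


def parse_atom(line):
    """Read selected fields of an ATOM/HETATM record by column position."""
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
        "occupancy": float(line[54:60]), # columns 55-60
        "b_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),  # columns 77-78
    }


# First atom of the example above; one-letter element names start in column 14.
line = format_atom(1, " N", "PRO", "A", 1, 8.316, 21.206, 21.530, 1.00, 17.44, "N")
atom = parse_atom(line)
print(line)
print(atom["name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

In practice Biopython's Bio.PDB module handles this parsing; the sketch only shows why column alignment matters.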
GenBank file format
GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word “LOCUS”. The start of the sequence section is marked by a line beginning with the word “ORIGIN” and the end of the section is marked by a line with only “//”.
File extensions: file.gb, file.genbank

LOCUS AAU03518 237 bp DNA linear PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
VERSION U03518.1
KEYWORDS .
SOURCE Aspergillus awamori
ORGANISM Aspergillus awamori
Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae;
Aspergillus.
REFERENCE 1 (bases 1 to 237)
AUTHORS Borsuk,P., Gniadkowski,M., Kucharski,R., Bisko,M., Kanabus,M.,
Stepien,P.P. and Bartnik,E.
TITLE Evolutionary conservation of the transcribed spacer sequences of
the rDNA repeat unit in three species of the genus Aspergillus
JOURNAL Acta Biochim. Pol. 41 (1), 73-77 (1994)
PUBMED 8030378
REFERENCE 2 (bases 1 to 237)
AUTHORS Borsuk,P.
TITLE Direct Submission
JOURNAL Submitted (17-NOV-1993) Borsuk P., University of Warsaw, Department
of Genetics, Al.Ujazdowskie 4, Warsaw, Pl-00478, Poland
FEATURES Location/Qualifiers
source 1..237
/organism="Aspergillus awamori"
/mol_type="genomic DNA"
/strain="wild type"
/db_xref="taxon:105351"
/clone="clone pAaw1"
/tissue_type="mycelium"
/clone_lib="A.awamori Sau3AI partial digest genomic
library in pUC18"
/dev_stage="mycelia, young"
rRNA <1..20
/product="18S ribosomal RNA"
misc_RNA 21..205
/product="internal transcribed spacer 1"
rRNA 206..>237
/product="5.8S ribosomal RNA"
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
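The three markers described above (“LOCUS”, “ORIGIN”, and “//”) are enough to separate a record into its two sections. The following is a minimal sketch in plain Python; `split_genbank` is a hypothetical helper, and `toy_genbank` is an invented, heavily shortened record used only for illustration.

```python
def split_genbank(text):
    """Return (annotation_lines, sequence) for the first record in GenBank text."""
    annotation, sequence_parts = [], []
    section = None
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"  # annotation section starts here
        elif line.startswith("ORIGIN"):
            section = "sequence"    # sequence section starts on the next line
            continue
        elif line.startswith("//"):
            break                   # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # Drop position numbers and spaces, keep only the letters.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
    return annotation, "".join(sequence_parts)


toy_genbank = (
    "LOCUS       TOY0001                   20 bp    DNA     linear   SYN 01-JAN-2000\n"
    "DEFINITION  Toy record for illustration only.\n"
    "ORIGIN\n"
    "        1 aacctgcgga aggatcatta\n"
    "//\n"
)

annotation, sequence = split_genbank(toy_genbank)
print(len(annotation), sequence, len(sequence))
```

Biopython's GenBank parser goes much further, turning the FEATURES table into structured objects; this sketch only covers the section boundaries.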
Reading and Writing Biological Data with Sequence Input/Output
The Sequence Input/Output (SeqIO) module provides a convenient way to read and write biological data in various formats:
- SeqIO.read() reads a file that contains exactly one record and returns that record.
- SeqIO.parse() reads multiple records from a file, yielding them one at a time.
- SeqIO.write() writes sequences to a file in a specified format.
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# Build three protein records by hand.
rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)
rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)
rec3 = SeqRecord(
    Seq(
        "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
        "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
        "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
        "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
        "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
        "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
        "TGEGLEWGVLFGFGPGLTVETVVLHSVAT",
    ),
    id="gi|13925890|gb|AAK49457.1|",
    description="chalcone synthase [Nicotiana tabacum]",
)

# Write all three records to one FASTA file.
my_records = [rec1, rec2, rec3]
SeqIO.write(my_records, "my_example.faa", "fasta")
#output file
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT
#
#Code Adapted from Biopython v: 1.85.dev0
Reading Sequences from a FASTA file
The SeqIO.parse() function reads multiple sequences from a FASTA file. Each record object represents a single sequence, with attributes such as id (the first word of the header line), description (the full header line), and seq (the sequence data).
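A short sketch of this in practice: the snippet below parses a two-record FASTA string with `SeqIO.parse()` and then shows that `SeqIO.read()` refuses input containing more than one record. The FASTA text `toy_fasta` is invented for illustration, and a `StringIO` handle stands in for a file on disk.

```python
from io import StringIO

from Bio import SeqIO

# Two invented records; the first sequence is wrapped across two lines.
toy_fasta = (
    ">seq1 first toy record\n"
    "ACGTACGT\n"
    "ACGT\n"
    ">seq2 second toy record\n"
    "TTGGCCAA\n"
)

records = list(SeqIO.parse(StringIO(toy_fasta), "fasta"))
for record in records:
    print(record.id, len(record.seq), record.description)

# SeqIO.read() expects exactly one record and raises ValueError otherwise.
try:
    SeqIO.read(StringIO(toy_fasta), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
print("SeqIO.read() rejected the multi-record input:", read_failed)
```

With a real file, the handle can be replaced by a path such as the my_example.faa file written earlier.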
Format | Readable with SeqIO | Writable with SeqIO |
FASTA | Yes | Yes |
FASTQ | Yes | Yes |
GenBank | Yes | Yes |
SAM/BAM | No (not a SeqIO format) | No |
PDB | Yes (sequence only, via “pdb-seqres” or “pdb-atom”) | No (structures are written with Bio.PDB) |
Table: format support in Biopython’s SeqIO module
References
- Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
- Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne, The Protein Data Bank, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Pages 235–242, https://doi.org/10.1093/nar/28.1.235
- Benjamin Hepp, Violette Da Cunha, Florence Lorieux, Jacques Oberto, BAGET 2.0: an updated web tool for the effortless retrieval of prokaryotic gene context and sequence, Bioinformatics, Volume 37, Issue 17, September 2021, Pages 2750–2752, https://doi.org/10.1093/bioinformatics/btab082
- Documentation · Biopython
- FASTQ format gencoded.com
- FASTA format – Wikipedia