Biopython Tutorial: How to Calculate GC Content in FASTA Files

Introduction

FASTA is a widely used text-based format for representing nucleotide and protein sequences. Understanding the composition of these sequences is important in many areas of biological research, particularly genomics and bioinformatics. One key metric derived from a nucleotide sequence is its GC content: the percentage of bases in a DNA molecule that are either guanine (G) or cytosine (C).
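As a quick illustration of that definition, the sketch below computes GC content for a short, made-up sequence using plain Python string methods. The function name gc_percent and the example sequence are illustrative assumptions, not part of Biopython.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()  # treat lowercase bases the same as uppercase
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence) * 100


# Hypothetical 8-base example: 4 of the 8 bases are G or C
example = "ATGCGCAT"
print(f"GC content: {gc_percent(example):.2f}%")  # GC content: 50.00%
```

This simple version divides by the full sequence length, so any ambiguous symbols such as N would count toward the denominator.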

Importance of GC Content

GC content is an important parameter that can affect the stability and structure of DNA, influence gene expression, and provide insights into the evolutionary relationships between organisms. High GC content tends to increase the thermal stability of double-stranded DNA: a G-C base pair forms three hydrogen bonds, whereas an A-T base pair forms two. Conversely, low GC content may be associated with certain genomic characteristics, such as increased mutation rates or specific adaptations to environmental conditions.

Overview of Bio.SeqIO Module

The Biopython library offers a comprehensive suite of tools for biological computation. Its sequence input/output module, Bio.SeqIO, is particularly useful for reading and writing sequence data in various formats (e.g., FASTA and GenBank). As described in the Biopython Tutorial and Cookbook, the module provides a convenient interface for accessing sequence records and their associated annotations, and Biopython supplies objects to represent the biological sequences themselves.

Usage of Bio.SeqIO

The Bio.SeqIO module allows for straightforward manipulation of sequence data. Users can read FASTA files, extract sequences, and perform operations on them with minimal code. The following functions are central to using Bio.SeqIO:

SeqIO.parse(): This function reads sequence records from a file in a specified format, allowing iteration over each record. Each sequence record has metadata, such as an ID and description, and includes the sequence data itself.
SeqIO.write(): This function writes sequence records to a specified output file in a designated format.

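from Bio import SeqIO

# Path to the FASTA file to read
file_path = "example.fasta"

# SeqIO.parse() returns an iterator of sequence records
records = SeqIO.parse(file_path, "fasta")
for record in records:
    print("Header:", record.id)  # identifier taken from the header line
    print("Sequence:", record.seq)  # the sequence data itself
    print()

Output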
Header: sp|P25730|FMS1_ECOLI
Sequence: MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPN

Header: sp|P15488|FMS3_ECOLI
Sequence: LTLSNTGVSNTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNS
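
SeqIO.write() is described above but not demonstrated, so here is a minimal sketch of it. The record IDs and sequences are made up for illustration, and the example writes to an in-memory handle instead of a file so that it leaves nothing on disk; it assumes Biopython is installed.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records; the IDs and sequences are illustrative only
records = [
    SeqRecord(Seq("ATGCGTACGT"), id="example_1", description=""),
    SeqRecord(Seq("TTAACCGGAA"), id="example_2", description=""),
]

# Write to an in-memory text handle; a filename could be passed instead
handle = StringIO()
written = SeqIO.write(records, handle, "fasta")
print("Records written:", written)

# Read the same data back with SeqIO.parse() to confirm the round trip
handle.seek(0)
for record in SeqIO.parse(handle, "fasta"):
    print(record.id, record.seq)
```

SeqIO.write() returns the number of records it wrote, which is a convenient sanity check after exporting a file.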

Biopython Program to Calculate GC Content of a FASTA File

Requirements

A text editor or integrated development environment (IDE) for writing and executing Python scripts (e.g., VSCode, PyCharm, Jupyter Notebook).
Any operating system that supports Python, including Windows, macOS, and Linux.
Dependencies: Biopython installed in your Python 3 environment. If it is not installed, run pip install biopython in your terminal. Python itself can be downloaded from python.org. Recent Biopython releases require a newer Python than 3.6, so check the release notes for the version you install.
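
One way to confirm the dependency is in place is to ask Python directly. The helper below is a small sketch (the function name biopython_version is an illustrative assumption); it reports the installed Biopython version or tells you that the package is missing.

```python
import importlib.util


def biopython_version():
    """Return the installed Biopython version string, or None if it is missing."""
    # find_spec() looks the package up without importing it
    if importlib.util.find_spec("Bio") is None:
        return None
    import Bio  # Biopython's top-level package is named "Bio"

    return Bio.__version__


version = biopython_version()
if version is None:
    print("Biopython is not installed; run: pip install biopython")
else:
    print("Biopython version:", version)
```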

Here’s the FASTA (sequences.fasta) file that will be used in this tutorial.

>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA

>ref|NC_005214.1|:1000-1900
ATGCGTACGTTGACCTTGAATGGCTTAGCTCAGTGACGTAGCTGTAACGGTACGCGTGTATACATCC
TTAATCCGTTAGCAGCTCTGACGATGCCGATCCTGTAAGTTGACGTAAGTCGTGATGACGCGTCTAC
TACGATACATGCTGTGAGGCCATCAGTCTGACATCGATCGTACGCGTACGTACGTAAGGAGCTAACG
TAACTGACGTATACGTGACGAGTTGACGATCGAACGATCGAATCGTCGATGATCGACGACGTACGAC
TAGGTAACGTAGCTAGCGTGACGCGTGATGACGAGCAGACGAGCGTGATCGTACGATCAGCTTACGT

>ref|NC_005215.1|:1500-2800
ATGACGTAGCGTACGTAGCTGATCGTAGCGTGATGCTACGCGTACGTAGCGTAGCTGATCGTAGCAG
TTAGCTAGCTAGCTAGTAGCTAGTCAGTGATCGTACGTAGCGTACGATGACTAGCTAGCGTGATGAC
GTTAGCTAGCTAGTAGCTAGCGTAGCTAGTACGTAGCGTACGTTAGCTGACTAGCTGAGCTAGTAGG
CAGATCGTACGGTACGTAGCGTACGGTACGTACGTAGCTAGCTGATGATGCGTAGCTGACTGAGCTT
GAGCTGACTACGTGACGTAGCAGCGTGAGTACGATCGTACGACGATCGTACGTACGATCGTGACGT

>ref|NC_005216.1|:2000-3900
ATGTACGTAGCTAGTACGTTAGGTCAGCTGACGGTACGGTACGTAGGAGCTAGTCGTCGAGCGTTAG
TAGCTAGCGTAGTGACGACGTACGAGTACGTGACGTAGTAGCTGACGTCGATGACGATCGTCGACGT
ACGTTAGCTAGGTCAGTACGTAGCGTACGTAGTGAGCGTCGACGATCGACGTTAGCAGTGACGAGTC
GACGACGTACGAGTAGCGTACGTTAGCTAGCTGATGACGTCGACGATCGTACGAGTGAGTCGAGTAC
GTAGACGTACGTCGTCGATCGTAGCTAGCTGACGTACGTCGTGATGTCGTAGCTAGCGACGTCGAT

>ref|NC_005217.1|:1200-2500
ATGTCGATAGTGACGTGTCGACTGACGTAGCGTACGTTAGTCGACGTAGTACGATGATGACGTAGC
TAGTGAGTAGTAGGTCGACGTTGACGATAGTCAGCGTACGATCGACGTCGACGTAGTAGCTGACGTA
TAGTAGGTCGTAGTGATGTCGAGCTAGCTGACGTAGTGACGACGAGTCGTCGACGTTAGCGTAGCTG
CTGACGTGACGATGAGCTAGTAGTCGACGTAGCGTGACGAGCTAGTGACGTTAGTCGACGTGATCGT
GTAGCTAGCTAGTGTCGAGTCGTAGCTAGCTGACGACGGTAGTACGTGACGTAGTACGACGTAGCT

Counting GC Content

Biopython's Bio.SeqUtils module provides a ready-made function for this calculation, so there is no need to write it by hand as gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100. Older tutorials use Bio.SeqUtils.GC(), which returned a percentage directly; that function was deprecated in Biopython 1.80 and removed in later releases. Its replacement, gc_fraction(), returns a fraction between 0 and 1, so the code below multiplies the result by 100.

# Import the necessary modules
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Define the path to the FASTA file
fasta_file_path = "sequences.fasta"
gc_content_dict = {}

# Parse the FASTA file
for record in SeqIO.parse(fasta_file_path, "fasta"):
    # gc_fraction() returns a value between 0 and 1; convert it to a percentage
    gc_content = gc_fraction(record.seq) * 100
    gc_content_dict[record.id] = gc_content

# Print GC content for each sequence
for seq_id, gc in gc_content_dict.items():
    print(f"Sequence ID: {seq_id}, GC Content: {gc:.2f}%")

Output

Sequence ID: ref|NC_005213.1|:883-2691, GC Content: 29.71%
Sequence ID: ref|NC_005214.1|:1000-1900, GC Content: 50.45%
Sequence ID: ref|NC_005215.1|:1500-2800, GC Content: 51.50%
Sequence ID: ref|NC_005216.1|:2000-3900, GC Content: 53.59%
Sequence ID: ref|NC_005217.1|:1200-2500, GC Content: 52.25%
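
Once the percentages are in a dictionary, ordinary Python is enough to compare them. The sketch below hard-codes the values printed above (in a real script you would reuse gc_content_dict directly) and reports the highest, lowest, and mean GC content; the variable names are illustrative.

```python
from statistics import mean

# GC percentages copied from the output above
gc_content_dict = {
    "ref|NC_005213.1|:883-2691": 29.71,
    "ref|NC_005214.1|:1000-1900": 50.45,
    "ref|NC_005215.1|:1500-2800": 51.50,
    "ref|NC_005216.1|:2000-3900": 53.59,
    "ref|NC_005217.1|:1200-2500": 52.25,
}

# Rank the sequences from highest to lowest GC content
ranked = sorted(gc_content_dict.items(), key=lambda item: item[1], reverse=True)
for seq_id, gc in ranked:
    print(f"{seq_id}\t{gc:.2f}%")

highest_id, highest_gc = ranked[0]
lowest_id, lowest_gc = ranked[-1]
average_gc = mean(gc_content_dict.values())

print(f"Highest: {highest_id} ({highest_gc:.2f}%)")
print(f"Lowest: {lowest_id} ({lowest_gc:.2f}%)")
print(f"Mean GC content: {average_gc:.2f}%")
```

Note that this is an unweighted mean of the per-sequence percentages, not the GC content of all sequences pooled together.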

GC content graph

To plot the GC content as a bar chart, we will use matplotlib's pyplot interface.

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction  # replaces the older GC() function, which newer Biopython releases no longer provide
import matplotlib.pyplot as plt  # pyplot is matplotlib's plotting interface

fasta_file_path = "sequences.fasta"
gc_content_dict = {}

for record in SeqIO.parse(fasta_file_path, "fasta"):
    # gc_fraction() returns a value between 0 and 1; convert it to a percentage
    gc_content = gc_fraction(record.seq) * 100
    gc_content_dict[record.id] = gc_content

for seq_id, gc in gc_content_dict.items():
    print(f"Sequence ID: {seq_id}, GC Content: {gc:.2f}%")

# Plotting the GC content
plt.figure(figsize=(10, 6))  # Set the figure size
plt.bar(list(gc_content_dict.keys()), list(gc_content_dict.values()), color='skyblue')  # Create a bar plot (lists avoid passing dict views)
plt.title('GC Content of Sequences')  # Set the title
plt.xlabel('Sequence ID')  # Set the x-axis label
plt.ylabel('GC Content (%)')  # Set the y-axis label
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better visibility
plt.tight_layout()  # Adjust layout to make room for labels
plt.grid(axis='y')  # Add a grid for better readability

# Show the plot
plt.show()

Figure: GC content of each sequence, plotted as a bar chart with matplotlib.

Summary

This article showed how to calculate GC content in FASTA files using the Biopython library. It outlined why GC content matters in molecular biology, where it affects DNA stability and gene expression, used the Bio.SeqIO module to read FASTA files, and computed GC content with the Bio.SeqUtils module. It also included code for visualizing the results as a bar plot with the Matplotlib library.

Reference

Cock, P. J. A., et al. Biopython Tutorial and Cookbook. Biopython.org, 2009.
