Biopython Tutorial: How to Calculate GC Content in FASTA Files

Beaven
Last updated: 31/10/24
By Beaven - Senior Editor Biopython
Highlights
  • Learn how to calculate GC content in FASTA nucleotide files using Biopython with this comprehensive step-by-step guide. GC content is crucial for understanding DNA stability, gene regulation, evolutionary relationships, mutation rates, and phylogenetic studies, making it a key factor in molecular biology and bioinformatics.

Introduction

FASTA is a widely used text-based format for representing nucleotide and protein sequences. Understanding the composition of these sequences is crucial in various fields of biological research, particularly in genomics and bioinformatics. One of the key metrics derived from nucleotide sequences is the GC content: the percentage of nucleotide bases in a DNA molecule that are either guanine (G) or cytosine (C).
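The definition above can be written out directly in plain Python. This is only a minimal sketch of the arithmetic; the function name gc_percent is a hypothetical helper chosen for illustration, not part of Biopython.

```python
def gc_percent(sequence):
    """Percentage of bases in `sequence` that are G or C (case-insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence) * 100


# 4 of the 6 bases in this toy sequence are G or C
print(f"{gc_percent('ATGCGC'):.2f}%")
```

Upper-casing first means that lowercase (soft-masked) bases are counted as well, which a bare str.count("G") would miss.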

Importance of GC Content

GC content is an important parameter that can affect the stability and structure of DNA, influence gene expression, and provide insights into the evolutionary relationships between organisms. High GC content tends to increase the stability of DNA because each G-C base pair is held together by three hydrogen bonds, whereas each A-T pair has only two. Conversely, low GC content may be associated with certain genomic characteristics, such as increased mutation rates or specific adaptations to environmental conditions.

Overview of Bio.SeqIO Module

The Biopython library offers a comprehensive suite of tools for biological computation. Its sequence input/output module, Bio.SeqIO, is particularly useful for reading and writing sequence data in various formats, such as FASTA and GenBank. As described in the Biopython Tutorial and Cookbook, the module provides a convenient interface for accessing sequence records and their associated annotations, with each record represented as an object (a SeqRecord) that holds the sequence and its metadata.

Usage of Bio.SeqIO

The Bio.SeqIO module allows for straightforward manipulation of sequence data. Users can read FASTA files, extract sequences, and perform operations on them with minimal code. The following functions are central to using Bio.SeqIO:

  1. SeqIO.parse(): This function reads sequence records from a file in a specified format, allowing iteration over each record. Each sequence record has metadata, such as an ID and description, and includes the sequence data itself.
  2. SeqIO.write(): This function writes sequence records to a specified output file in a designated format.
from Bio import SeqIO

file_path = "example.fasta"
records = SeqIO.parse(file_path, "fasta")
for record in records:
    print("Header:", record.id)
    print("Sequence:", record.seq)
    print()

Output
Header: sp|P25730|FMS1_ECOLI
Sequence: MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPN

Header: sp|P15488|FMS3_ECOLI
Sequence: LTLSNTGVSNTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNS
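
SeqIO.write() is described above but not demonstrated. The following is a minimal sketch of a write-then-read round trip. It assumes an in-memory text handle (io.StringIO) instead of a real file, and the two records (demo_1, demo_2) are made up purely for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two short, made-up records used purely for illustration
records = [
    SeqRecord(Seq("ATGCGC"), id="demo_1", description=""),
    SeqRecord(Seq("ATATAT"), id="demo_2", description=""),
]

# Write the records in FASTA format to an in-memory text handle;
# SeqIO.write() returns the number of records written
handle = StringIO()
written = SeqIO.write(records, handle, "fasta")

# Rewind and read the records back with SeqIO.parse() to confirm the round trip
handle.seek(0)
round_trip = {rec.id: str(rec.seq) for rec in SeqIO.parse(handle, "fasta")}

print("Records written:", written)
print(round_trip)
```

To write to disk instead, a file name such as "output.fasta" (a placeholder) can be passed in place of the handle.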

Biopython Program to Calculate GC Content of a FASTA File

Requirements

  1. A text editor or integrated development environment (IDE) for writing and executing Python scripts (e.g., VSCode, PyCharm, Jupyter Notebook).
  2. This protocol can be executed on any operating system that supports Python, including Windows, macOS, and Linux.
  3. Dependencies: Ensure that Biopython is installed in your Python environment (Python 3.6 or higher). If it is not installed, run pip install biopython in your terminal. Python itself can be downloaded from python.org.
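
Before running the tutorial scripts, the requirements above can be checked from Python itself. This is a small sketch using only the standard library; the function name environment_ready is a hypothetical helper, and "Bio" is the import name of the Biopython package.

```python
import importlib.util
import sys


def environment_ready(min_python=(3, 6), package="Bio"):
    """Return True if the interpreter is new enough and `package` can be imported."""
    python_ok = sys.version_info >= min_python
    # find_spec() returns None when a top-level package is not installed
    package_ok = importlib.util.find_spec(package) is not None
    return python_ok and package_ok


print("Python version:", sys.version.split()[0])
print("Ready for this tutorial:", environment_ready())
```

If the last line prints False, installing Biopython with pip install biopython in the same environment is the usual fix.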

Here’s the FASTA (sequences.fasta) file that will be used in this tutorial.

>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA

>ref|NC_005214.1|:1000-1900
ATGCGTACGTTGACCTTGAATGGCTTAGCTCAGTGACGTAGCTGTAACGGTACGCGTGTATACATCC
TTAATCCGTTAGCAGCTCTGACGATGCCGATCCTGTAAGTTGACGTAAGTCGTGATGACGCGTCTAC
TACGATACATGCTGTGAGGCCATCAGTCTGACATCGATCGTACGCGTACGTACGTAAGGAGCTAACG
TAACTGACGTATACGTGACGAGTTGACGATCGAACGATCGAATCGTCGATGATCGACGACGTACGAC
TAGGTAACGTAGCTAGCGTGACGCGTGATGACGAGCAGACGAGCGTGATCGTACGATCAGCTTACGT

>ref|NC_005215.1|:1500-2800
ATGACGTAGCGTACGTAGCTGATCGTAGCGTGATGCTACGCGTACGTAGCGTAGCTGATCGTAGCAG
TTAGCTAGCTAGCTAGTAGCTAGTCAGTGATCGTACGTAGCGTACGATGACTAGCTAGCGTGATGAC
GTTAGCTAGCTAGTAGCTAGCGTAGCTAGTACGTAGCGTACGTTAGCTGACTAGCTGAGCTAGTAGG
CAGATCGTACGGTACGTAGCGTACGGTACGTACGTAGCTAGCTGATGATGCGTAGCTGACTGAGCTT
GAGCTGACTACGTGACGTAGCAGCGTGAGTACGATCGTACGACGATCGTACGTACGATCGTGACGT

>ref|NC_005216.1|:2000-3900
ATGTACGTAGCTAGTACGTTAGGTCAGCTGACGGTACGGTACGTAGGAGCTAGTCGTCGAGCGTTAG
TAGCTAGCGTAGTGACGACGTACGAGTACGTGACGTAGTAGCTGACGTCGATGACGATCGTCGACGT
ACGTTAGCTAGGTCAGTACGTAGCGTACGTAGTGAGCGTCGACGATCGACGTTAGCAGTGACGAGTC
GACGACGTACGAGTAGCGTACGTTAGCTAGCTGATGACGTCGACGATCGTACGAGTGAGTCGAGTAC
GTAGACGTACGTCGTCGATCGTAGCTAGCTGACGTACGTCGTGATGTCGTAGCTAGCGACGTCGAT

>ref|NC_005217.1|:1200-2500
ATGTCGATAGTGACGTGTCGACTGACGTAGCGTACGTTAGTCGACGTAGTACGATGATGACGTAGC
TAGTGAGTAGTAGGTCGACGTTGACGATAGTCAGCGTACGATCGACGTCGACGTAGTAGCTGACGTA
TAGTAGGTCGTAGTGATGTCGAGCTAGCTGACGTAGTGACGACGAGTCGTCGACGTTAGCGTAGCTG
CTGACGTGACGATGAGCTAGTAGTCGACGTAGCGTGACGAGCTAGTGACGTTAGTCGACGTGATCGT
GTAGCTAGCTAGTGTCGAGTCGTAGCTAGCTGACGACGGTAGTACGTGACGTAGTACGACGTAGCT

Counting GC Content

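The script below uses the Bio.SeqUtils.GC() function to calculate the GC percentage rather than the manual formula gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100. Note that GC() was deprecated in Biopython 1.80 and removed in later releases; its replacement, Bio.SeqUtils.gc_fraction(), returns a fraction between 0 and 1 that must be multiplied by 100 to give a percentage.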

# Import the necessary modules
from Bio import SeqIO
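# GC() is missing from newer Biopython releases; fall back to gc_fraction() there
try:
    from Bio.SeqUtils import GC
except ImportError:
    from Bio.SeqUtils import gc_fraction

    def GC(sequence):
        """Return GC content as a percentage, like the older GC() helper."""
        return 100 * gc_fraction(sequence)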

# Define the path to the FASTA file
fasta_file_path = "sequences.fasta"
gc_content_dict = {}

# Parse the FASTA file
for record in SeqIO.parse(fasta_file_path, "fasta"):
    sequence = str(record.seq)  # Convert sequence to a string
    gc_content = GC(sequence)  # Calculate GC content using Bio.SeqUtils.GC()
    gc_content_dict[record.id] = gc_content

# Print GC content for each sequence
for seq_id, gc in gc_content_dict.items():
    print(f"Sequence ID: {seq_id}, GC Content: {gc:.2f}%")

Output

Sequence ID: ref|NC_005213.1|:883-2691, GC Content: 29.71%
Sequence ID: ref|NC_005214.1|:1000-1900, GC Content: 50.45%
Sequence ID: ref|NC_005215.1|:1500-2800, GC Content: 51.50%
Sequence ID: ref|NC_005216.1|:2000-3900, GC Content: 53.59%
Sequence ID: ref|NC_005217.1|:1200-2500, GC Content: 52.25%
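
Once the per-sequence values are in a dictionary, they can be summarised with the standard library. The sketch below assumes the five percentages listed in the output above, copied by hand into a dictionary of the same shape as gc_content_dict.

```python
from statistics import mean

# GC percentages copied from the output listed above
gc_content_dict = {
    "ref|NC_005213.1|:883-2691": 29.71,
    "ref|NC_005214.1|:1000-1900": 50.45,
    "ref|NC_005215.1|:1500-2800": 51.50,
    "ref|NC_005216.1|:2000-3900": 53.59,
    "ref|NC_005217.1|:1200-2500": 52.25,
}

# Record IDs with the lowest and highest GC content
lowest_id = min(gc_content_dict, key=gc_content_dict.get)
highest_id = max(gc_content_dict, key=gc_content_dict.get)

# Unweighted mean of the per-sequence percentages
average_gc = mean(gc_content_dict.values())

print(f"Lowest GC:  {lowest_id} ({gc_content_dict[lowest_id]:.2f}%)")
print(f"Highest GC: {highest_id} ({gc_content_dict[highest_id]:.2f}%)")
print(f"Mean GC:    {average_gc:.2f}%")
```

The mean here treats every sequence equally; it is not the GC content of all sequences pooled together, which would require weighting each value by sequence length.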

GC content graph

To plot the GC content as a bar graph, we will use the matplotlib (pylab) module.

from Bio import SeqIO
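# GC() is missing from newer Biopython releases; fall back to gc_fraction() there
try:
    from Bio.SeqUtils import GC
except ImportError:
    from Bio.SeqUtils import gc_fraction

    def GC(sequence):
        """Return GC content as a percentage, like the older GC() helper."""
        return 100 * gc_fraction(sequence)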
import matplotlib.pylab as plt #matplotlib (pylab) to plot the graph

fasta_file_path = "sequences.fasta"
gc_content_dict = {}

for record in SeqIO.parse(fasta_file_path, "fasta"):
    sequence = str(record.seq)
    gc_content = GC(sequence)
    gc_content_dict[record.id] = gc_content

for seq_id, gc in gc_content_dict.items():
    print(f"Sequence ID: {seq_id}, GC Content: {gc:.2f}%")

# Plotting the GC content
plt.figure(figsize=(10, 6))  # Set the figure size
plt.bar(list(gc_content_dict.keys()), list(gc_content_dict.values()), color='skyblue')  # Create a bar plot (lists keep older matplotlib versions happy)
plt.title('GC Content of Sequences')  # Set the title
plt.xlabel('Sequence ID')  # Set the x-axis label
plt.ylabel('GC Content (%)')  # Set the y-axis label
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better visibility
plt.tight_layout()  # Adjust layout to make room for labels
plt.grid(axis='y')  # Add a grid for better readability

# Show the plot
plt.show()

Figure: GC content graph using matplotlib (pylab)

Summary

The article provides a detailed guide on calculating GC content in FASTA files using the Biopython library. It highlights the significance of GC content in molecular biology, affecting DNA stability and gene expression. The tutorial outlines the use of the Bio.SeqIO module for reading FASTA files and demonstrates how to compute GC content using the Bio.SeqUtils.GC() function. Additionally, it includes code examples for visualizing the GC content through bar plots using the Matplotlib library.


Reference

  1. Cock, P. J. A., et al. Biopython Tutorial and Cookbook. Biopython.org, 2009.
