
How to use SPAdes Genome Assembler tutorial

Last updated: 28/10/24
By Beaven - Senior Editor Tools

This protocol provides a step-by-step guide to performing de novo genome assembly with SPAdes genome assembler on a Windows Subsystem for Linux (WSL) setup using Ubuntu. SPAdes (St. Petersburg genome assembler) is a versatile toolkit for assembling and analyzing sequencing data from Illumina and IonTorrent platforms. SPAdes Assembly Toolkit pipelines also support a hybrid mode, allowing the integration of long reads from PacBio and Oxford Nanopore as supplementary data, enhancing assembly quality and accuracy.

This protocol emphasizes the setup, input requirements, command usage, and troubleshooting tips tailored to a WSL environment.

Materials

  1. A Windows PC with WSL enabled.
  2. Software:
    • WSL: Installed with an Ubuntu environment. Setup Guide: WSL Installation
    • SPAdes: Version 4.0.0 or later, installed in the WSL Ubuntu environment.
  3. Data: Illumina paired-end FASTQ files (SRR17344889 from NCBI SRA database for demonstration).

Supported Input Data Types

Data Type                  Description                     Recommended Coverage
Illumina paired-end reads  Standard genomic libraries      50x or higher
Illumina mate-pair reads   Long-insert libraries           10x-30x
PacBio CLR reads           Long reads for hybrid assembly  10x-20x
Oxford Nanopore reads      Long reads for hybrid assembly  10x-20x
Ion Torrent reads          Single-end reads                40x or higher
Sanger reads               Legacy data support             Any coverage
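
To judge whether a dataset reaches the recommended coverage, depth can be approximated as the total number of sequenced bases divided by the expected genome size. The sketch below is only an illustration: estimate_coverage is a hypothetical helper (not a SPAdes command), it assumes uncompressed FASTQ files with four lines per record, and the genome size has to be supplied by you.

```shell
# estimate_coverage is a hypothetical helper, not part of SPAdes.
# Usage: estimate_coverage <genome_size_bp> <reads.fastq> [more.fastq ...]
# Assumes uncompressed FASTQ with exactly four lines per record.
estimate_coverage() {
    local genome_size=$1
    shift
    # The sequence is the 2nd line of every 4-line FASTQ record.
    awk -v g="$genome_size" '
        FNR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }
    ' "$@"
}
```

For example, `estimate_coverage 5000000 SRR17344889_1.fastq SRR17344889_2.fastq` would print the approximate depth for an assumed 5 Mb genome; replace the genome size with the value expected for your organism.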

Procedure

Install WSL and Set Up Ubuntu

  1. Enable WSL on your Windows system. Ensure your Windows version supports WSL2 for better performance.
  2. Install Ubuntu from Microsoft Store and set up the environment by updating the package list:
sudo apt update && sudo apt upgrade -y

Install SPAdes in Ubuntu

  1. Install SPAdes by running the following command:
sudo apt install spades
  2. Verify the installation:
spades.py -v
SPAdes requires a 64-bit Linux system or macOS with Python (3.8 or higher) pre-installed. As an alternative to the Ubuntu package, you can download the prebuilt binaries or download the source code and compile it yourself. The version packaged for Ubuntu may be older than 4.0.0, so check the output of spades.py -v.
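
Before moving on, it can help to confirm that the commands this protocol calls are actually on the PATH. A minimal sketch, where check_tools is a hypothetical helper name and the list of tools to pass is up to you:

```shell
# check_tools is a hypothetical helper: it reports, for each command name
# given, whether it can be found on the PATH, and returns non-zero if any
# of them is missing.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call for this protocol might be `check_tools python3 spades.py prefetch fastq-dump`.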

Prepare Input Data

We'll download a real dataset using the prefetch command from sra-tools. Any small dataset from the NCBI SRA database will work; this protocol uses accession SRR17344889.

Open your terminal and create a working directory, sra_data, for the dataset:

mkdir ~/sra_data
cd ~/sra_data

Download the dataset using prefetch:

 prefetch SRR17344889

After downloading, convert the SRA file to FASTQ format with the fastq-dump command, using the same accession that was passed to prefetch:

fastq-dump --split-files SRR17344889

This will create two FASTQ files:

SRR17344889_1.fastq
SRR17344889_2.fastq
Assess data quality (using tools like FastQC) to ensure that the reads are clean and of high quality.
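
Alongside FastQC, a quick sanity check is that both paired-end files contain the same number of reads; mismatched counts usually point to an incomplete download or conversion. The sketch below assumes uncompressed FASTQ with four lines per record, and count_reads and check_pairs are hypothetical helper names.

```shell
# count_reads: number of records in an uncompressed 4-line FASTQ file.
# (hypothetical helper, not part of sra-tools or SPAdes)
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# check_pairs: print both counts and succeed only if the forward and
# reverse files hold the same number of reads.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "forward: $n1 reads, reverse: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For the demonstration data this would be called as `check_pairs SRR17344889_1.fastq SRR17344889_2.fastq`.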

Perform Assembly with SPAdes

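Run SPAdes with a command adjusted for your dataset. The example below assumes the SPAdes 4.0.0 Linux binaries were unpacked to ~/SPAdes-4.0.0-Linux; if you installed SPAdes with apt, call spades.py directly instead of ./spades.py. Set -t and -m no higher than the cores and RAM available to WSL.

cd ~/SPAdes-4.0.0-Linux/bin
./spades.py \
    -1 ~/sra_data/SRR17344889_1.fastq \
    -2 ~/sra_data/SRR17344889_2.fastq \
    --careful \
    -t 16 \
    -m 64 \
    -k 21,33,55,77 \
    -o ../spades_output

Breakdown of Command Options:
-1 <filename>: Specifies the first read file (forward reads).
-2 <filename>: Specifies the second read file (reverse reads).
--careful: Tries to reduce the number of mismatches and short indels in the assembly.
-t 16: Number of threads.
-m 64: Memory limit in GB.
-k 21,33,55,77: K-mer sizes used for assembly.
-o ../spades_output: The output directory for the assembly results.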
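
The -t and -m values in the command should reflect what WSL actually exposes to Ubuntu. The following is a minimal sketch for picking them, assuming a Linux environment with nproc and /proc/meminfo; both helper names are hypothetical, and leaving roughly 10% of RAM unclaimed is an illustrative choice, not a SPAdes recommendation.

```shell
# mem_limit_gb is a hypothetical helper: given total RAM in kB (the unit
# used by /proc/meminfo), print 90% of it in whole GB, with a floor of 1.
# The 90% figure is an illustrative assumption.
mem_limit_gb() {
    awk -v kb="$1" 'BEGIN {
        gb = int(kb * 0.9 / 1024 / 1024)
        if (gb < 1) gb = 1
        print gb
    }'
}

# suggest_resources prints candidate -t and -m values for this machine.
# Linux only: it relies on nproc and /proc/meminfo.
suggest_resources() {
    local total_kb
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    echo "-t $(nproc) -m $(mem_limit_gb "$total_kb")"
}
```

Running `suggest_resources` inside the Ubuntu shell prints a pair of options that can be substituted for -t 16 and -m 64 in the command.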

SPAdes Command Line Options

Option                   Description                                                             Default Value  Example Usage
--isolate                Runs SPAdes in isolate mode for standard isolate genome assembly        –              --isolate
-k                       List of k-mer sizes (must be odd and less than 128)                     auto           -k 21,33,55,77
--careful                Reduces number of mismatches and short indels                           OFF            --careful
-o                       Directory to store all the resulting files                              –              -o /path/to/output
--pe1-1, --pe1-2         Forward and reverse paired-end reads, first library                     –              --pe1-1 reads1.fastq --pe1-2 reads2.fastq
--mp1-1, --mp1-2         Forward and reverse mate-pair reads, first library                      –              --mp1-1 mp1.fastq --mp1-2 mp2.fastq
--s1                     Single-end reads, first library                                         –              --s1 singles.fastq
--pacbio                 PacBio reads                                                            –              --pacbio reads.fastq
--nanopore               Oxford Nanopore reads                                                   –              --nanopore reads.fastq
-t, --threads            Number of threads to use                                                16             -t 24
-m, --memory             Memory limit in Gb                                                      250            -m 128
--only-assembler         Runs only assembly (without read error correction)                      OFF            --only-assembler
--only-error-correction  Runs only read error correction (without assembly)                      OFF            --only-error-correction
--cov-cutoff             Coverage cutoff value (auto, off, or numeric)                           off            --cov-cutoff 10
--phred-offset           PHRED quality offset in the input reads (33 or 64)                      auto-detect    --phred-offset 64
--meta                   Runs SPAdes in metagenomic mode                                         OFF            --meta
--plasmid                Runs plasmidSPAdes pipeline for plasmid detection                       OFF            --plasmid
--rna                    Runs rnaSPAdes pipeline for RNA assembly                                OFF            --rna
--checkpoints            Save intermediate check-points ('last', 'all', or 'off')                LAST           --checkpoints all
--continue               Continues run from the last available check-point                       OFF            --continue
--restart-from           Restart run with updated options starting from a specified check-point  –              --restart-from k33
--disable-gzip-output    Forces uncompressed output                                              OFF            --disable-gzip-output
--disable-rr             Disables repeat resolution stage                                        OFF            --disable-rr
-h, --help               Shows help message                                                      –              --help
--version                Shows version number                                                    –              --version
Source: Command Line Options – SPAdes Assembly Toolkit, n.d.

Notes:

  1. All paired-end and mate-pair libraries can be input using separate pairs of files (forward and reverse) or an interleaved file.
  2. Multiple libraries can be specified using incrementing numbers (e.g., --pe1-1, --pe2-1, etc.).
  3. The k value must be odd and less than 128 due to the implementation of the de Bruijn graph algorithm.
  4. Memory limit should be set according to the available system resources and genome size.

Review Output and Assess Assembly Quality

cd ../spades_output
ls

Look for key files, particularly:

output_dir/
├── corrected/          # Error-corrected reads
├── scaffolds.fasta     # Final scaffolds
├── contigs.fasta       # Final contigs
├── assembly_graph.fastg # Assembly graph in FASTG format
├── contigs.paths       # Paths in the assembly graph
├── scaffolds.paths     # Scaffold paths
├── params.txt          # Parameters used
└── spades.log          # Log file

To preview the assembled contigs:

head contigs.fasta

The output should contain high-quality, assembled contigs in FASTA format. Further validation with tools like QUAST is recommended to confirm assembly accuracy.
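
For a first look before running QUAST, basic contiguity figures can be computed directly from the FASTA file. The sketch below is a rough check only, not a substitute for QUAST; assembly_stats is a hypothetical helper that reports the number of contigs, the total length, and N50.

```shell
# assembly_stats is a hypothetical helper: it prints contig count, total
# length, and N50 for a FASTA file (sequences may span several lines).
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # N50: length of the contig at which the running sum of
            # descending lengths first reaches half of the total.
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
        }
    '
}
```

Called as `assembly_stats contigs.fasta` (or on scaffolds.fasta) from inside the output directory, it prints a single summary line.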


Keywords

SPAdes, De Novo Assembly, Genome Assembly, WSL, Ubuntu, Bioinformatics, Illumina, Next-Generation Sequencing, Linux, Protocols


Disclaimer: This protocol and example commands are provided for demonstration purposes only. Please ensure all inputs, parameters, and file paths are carefully tailored to your specific data and experimental setup. For research applications, verify the software version, test dataset suitability, and parameter optimization according to your project requirements.

