How to use SPAdes Genome Assembler tutorial

This protocol provides a step-by-step guide to performing de novo genome assembly with the SPAdes genome assembler on a Windows Subsystem for Linux (WSL) setup running Ubuntu. SPAdes (St. Petersburg genome assembler) is a versatile toolkit for assembling and analyzing sequencing data from the Illumina and Ion Torrent platforms. The SPAdes Assembly Toolkit pipelines also support a hybrid mode, in which long reads from PacBio and Oxford Nanopore are supplied as supplementary data to improve assembly quality and accuracy.

This protocol emphasizes the setup, input requirements, command usage, and troubleshooting tips tailored to a WSL environment.

Materials

A Windows PC with WSL enabled.
Software:
- WSL: Installed with an Ubuntu environment. Setup Guide: WSL Installation
- SPAdes: Version 4.0.0 or later, installed in the WSL Ubuntu environment.
Data: Illumina paired-end FASTQ files (SRR17344889 from NCBI SRA database for demonstration).

Supported Input Data Types

Data Type	Description	Recommended Coverage
Illumina paired-end reads	Standard genomic libraries	50x or higher
Illumina mate-pair reads	Long-insert libraries	10x-30x
PacBio CLR reads	Long reads for hybrid assembly	10x-20x
Oxford Nanopore reads	Long reads for hybrid assembly	10x-20x
Ion Torrent reads	Single-end reads	40x or higher
Sanger reads	Legacy data support	Any coverage

Procedure

Install WSL and Set Up Ubuntu

Enable WSL on your Windows system. Ensure your Windows version supports WSL2 for better performance.
Install Ubuntu from Microsoft Store and set up the environment by updating the package list:

sudo apt update && sudo apt upgrade -y
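
SPAdes later takes a thread count (-t) and a memory limit in GB (-m), so it can be useful to check what the WSL instance actually exposes before going further. The following is a minimal sketch rather than part of the original protocol: the function name show_resources is hypothetical, and it assumes a Linux-style /proc/meminfo is available, as it normally is in Ubuntu under WSL2.

```shell
# Hypothetical helper: report CPU cores and total RAM visible inside WSL.
show_resources() {
    local cores mem_kb mem_gb
    cores=$(nproc)
    # MemTotal is reported in kB in /proc/meminfo.
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    # Integer division: the result is rounded down to whole GB.
    mem_gb=$(( mem_kb / 1024 / 1024 ))
    echo "cores=${cores} mem_gb=${mem_gb}"
}

show_resources
```

WSL2 may expose less memory than the Windows host has installed. If the reported value looks too small for your genome, the limit can usually be raised in the .wslconfig file on the Windows side.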

Install SPAdes in Ubuntu

Install SPAdes by running the following command:

sudo apt install spades

Verify the installation by printing the version number:

spades.py -v

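SPAdes requires a 64-bit Linux system or macOS with Python 3.8 or higher pre-installed. The version packaged for Ubuntu may be older than 4.0.0, so check the version number reported above. To obtain a specific release, you can either download the pre-built binaries or download the source code and compile it yourself.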
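
Before moving on, it may help to confirm that every command this protocol relies on is actually on the PATH. The sketch below is illustrative only: have_tool is a hypothetical helper name. The prefetch and fastq-dump commands used in the next step come from the SRA Toolkit, which this protocol does not install explicitly. On Ubuntu it is typically packaged as sra-toolkit, an assumption worth confirming with apt search.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the tools used in this protocol are available.
for tool in python3 spades.py prefetch fastq-dump; do
    if have_tool "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done
```

For a fuller check, SPAdes also provides a built-in self-test, spades.py --test, which assembles a small bundled dataset.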

Prepare Input Data

This step downloads a real dataset into a directory named sra_data using the prefetch command from the SRA Toolkit (sra-tools). Choose a small dataset from the NCBI SRA database; this protocol uses the accession number SRR17344889.

Set up the working directory. Open your terminal and create a directory for the dataset:

mkdir ~/sra_data
cd ~/sra_data

Download the dataset using prefetch:

prefetch SRR17344889

After downloading, convert the SRA file to FASTQ format with the fastq-dump command. The --split-files option writes the forward and reverse reads to separate files:

fastq-dump --split-files SRR17344889

This will create two FASTQ files:

SRR17344889_1.fastq
SRR17344889_2.fastq
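
A quick sanity check at this point is to confirm that both files contain the same number of reads, since a mismatch usually indicates a truncated download or conversion. This is a minimal sketch assuming uncompressed FASTQ with the standard four lines per record; count_reads and check_pair are hypothetical helper names, not part of SPAdes or the SRA Toolkit.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file,
# assuming the standard four lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Hypothetical helper: succeed only if both files hold the same number of reads.
check_pair() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "forward=${n1} reverse=${n2}"
    [ "$n1" -eq "$n2" ]
}

# Only run the check when the example files are present.
if [ -f SRR17344889_1.fastq ] && [ -f SRR17344889_2.fastq ]; then
    check_pair SRR17344889_1.fastq SRR17344889_2.fastq || echo "read counts differ"
fi
```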

Assess data quality (using tools like FastQC) to ensure that the reads are clean and of high quality.

Perform Assembly with SPAdes

Run SPAdes with a command adjusted for your dataset. Set the number of threads (-t) and the memory limit in GB (-m) to values your WSL instance can actually provide:

cd ~
# If you installed from the binary archive rather than apt,
# call ~/SPAdes-4.0.0-Linux/bin/spades.py instead of spades.py.
spades.py \
    -1 sra_data/SRR17344889_1.fastq \
    -2 sra_data/SRR17344889_2.fastq \
    --careful \
    -t 16 \
    -m 64 \
    -k 21,33,55,77 \
    -o spades_output

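Breakdown of Command Options:
-1 <filename>: Specifies the first read file (forward reads).
-2 <filename>: Specifies the second read file (reverse reads).
--careful: Reduces the number of mismatches and short indels in the assembly.
-t <int>: Number of threads to use.
-m <int>: Memory limit in GB.
-k <int,int,...>: K-mer sizes to use for assembly.
-o <directory>: The output directory for the assembly results.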
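
Because an assembly can run for a long time, it can be worth reviewing the exact command and confirming that the input files exist before launching it. The following is a minimal sketch of such a dry run. The function spades_cmd is a hypothetical helper that only prints a command using the options described above and never executes SPAdes.

```shell
# Hypothetical helper: print the SPAdes command for a paired-end run
# without executing it, so paths and limits can be reviewed first.
# Arguments: forward reads, reverse reads, output dir, threads, memory (GB).
spades_cmd() {
    local r1 r2 out threads mem f
    r1=$1; r2=$2; out=$3; threads=$4; mem=$5
    # Refuse to build a command that points at missing input files.
    for f in "$r1" "$r2"; do
        if [ ! -f "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    echo "spades.py -1 $r1 -2 $r2 --careful -t $threads -m $mem -o $out"
}
```

For example, running spades_cmd sra_data/SRR17344889_1.fastq sra_data/SRR17344889_2.fastq spades_output 8 16 from the home directory prints the command for an 8-thread, 16 GB run, or reports which input file is missing.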

SPAdes Command Line Options

Option	Description	Default Value	Example Usage
`--isolate`	Runs SPAdes in isolate mode for standard isolate genome assembly	–	`--isolate`
`-k`	List of k-mer sizes (must be odd, less than 128, and listed in ascending order)	auto	`-k 21,33,55,77`
`--careful`	Reduces number of mismatches and short indels	OFF	`--careful`
`-o`	Directory to store all the resulting files	–	`-o /path/to/output`
`--pe1-1`, `--pe1-2`	Forward and reverse paired-end reads first library	–	`--pe1-1 reads1.fastq --pe1-2 reads2.fastq`
`--mp1-1`, `--mp1-2`	Forward and reverse mate-pair reads first library	–	`--mp1-1 mp1.fastq --mp1-2 mp2.fastq`
`--s1`	Single-end reads first library	–	`--s1 singles.fastq`
`--pacbio`	PacBio reads	–	`--pacbio reads.fastq`
`--nanopore`	Oxford Nanopore reads	–	`--nanopore reads.fastq`
`-t`, `--threads`	Number of threads to use	16	`-t 24`
`-m`, `--memory`	Memory limit in GB	250	`-m 128`
`--only-assembler`	Runs only assembly (without read error correction)	OFF	`--only-assembler`
`--only-error-correction`	Runs only read error correction (without assembly)	OFF	`--only-error-correction`
`--cov-cutoff`	Coverage cutoff value (auto, off, or numeric)	AUTO	`--cov-cutoff 10`
`--phred-offset`	PHRED quality offset in the input reads (33 or 64)	33	`--phred-offset 64`
`--meta`	Runs SPAdes in metagenomic mode	OFF	`--meta`
`--plasmid`	Runs plasmidSPAdes pipeline for plasmid detection	OFF	`--plasmid`
`--rna`	Runs rnaSPAdes pipeline for RNA assembly	OFF	`--rna`
`--checkpoints`	Save intermediate check-points (‘last’, ‘all’, or ‘off’)	LAST	`--checkpoints all`
`--continue`	Continues run from the last available check-point	OFF	`--continue`
`--restart-from`	Restart run with updated options starting from a specified check-point	–	`--restart-from k33`
`--disable-gzip-output`	Forces uncompressed output	OFF	`--disable-gzip-output`
`--disable-rr`	Disables repeat resolution stage	OFF	`--disable-rr`
`-h`, `--help`	Shows help message	–	`--help`
`--version`	Shows version number	–	`--version`

Command Line Options – SPAdes Assembly Toolkit, n.d.

Notes:

All paired-end and mate-pair libraries can be input using separate pairs of files (forward and reverse) or an interleaved file.
Multiple libraries can be specified using incrementing numbers (e.g., --pe1-1, --pe2-1, etc.).
The k value must be odd and less than 128 due to the implementation of the de Bruijn graph algorithm.
Memory limit should be set according to the available system resources and genome size.
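
The k-mer constraint in the notes above is easy to check before starting a run. This minimal sketch tests only the two conditions stated there (odd, and less than 128). The function name valid_kmers is hypothetical and is not a SPAdes command.

```shell
# Hypothetical helper: succeed only if every value in a comma-separated
# k-mer list is an odd integer below 128.
valid_kmers() {
    local k
    # Reject empty lists and stray commas (leading, trailing, or doubled).
    case "$1" in
        ''|,*|*,|*,,*) return 1 ;;
    esac
    for k in $(echo "$1" | tr ',' ' '); do
        # Reject values with a leading zero or any non-digit character.
        case "$k" in
            0*|*[!0-9]*) return 1 ;;
        esac
        [ "$k" -lt 128 ] || return 1
        [ $(( k % 2 )) -eq 1 ] || return 1
    done
    return 0
}

valid_kmers "21,33,55,77" && echo "ok: 21,33,55,77"
valid_kmers "21,32,55" || echo "rejected: 21,32,55"
```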

Review Output and Assess Assembly Quality

cd spades_output
ls

Look for key files, particularly:

spades_output/
├── corrected/          # Error-corrected reads
├── scaffolds.fasta     # Final scaffolds
├── contigs.fasta       # Final contigs
├── assembly_graph.fastg # Assembly graph in FASTG format
├── contigs.paths       # Paths in the assembly graph
├── scaffolds.paths     # Scaffold paths
├── params.txt          # Parameters used
└── spades.log          # Log file

head -n 20 contigs.fasta

The output should contain high-quality, assembled contigs in FASTA format. Further validation with tools like QUAST is recommended to confirm assembly accuracy.
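
For a quick first look before running a dedicated tool, basic contiguity statistics can be computed directly from the FASTA file. The following is a minimal sketch using awk and sort; fasta_stats is a hypothetical helper name. It reports the number of contigs, the total assembled length, and the N50, taken here as the length of the contig at which the running sum of lengths (sorted longest first) reaches half of the total.

```shell
# Hypothetical helper: print contig count, total length, and N50 for a FASTA file.
# Step 1 (first awk): emit one sequence length per record; sequences may span lines.
# Step 2 (sort): order the lengths from longest to shortest.
# Step 3 (second awk): N50 is the length at which the running sum of the
# sorted lengths first reaches half of the total.
fasta_stats() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             run = 0; n50 = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { n50 = lens[i]; break }
             }
             printf "contigs=%d total_bp=%d n50=%d\n", NR, total, n50
         }'
}
```

Calling fasta_stats contigs.fasta from inside the output directory prints a one-line summary. This does not replace a QUAST report, which also covers misassemblies and comparison against a reference.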

Keywords

SPAdes, De Novo Assembly, Genome Assembly, WSL, Ubuntu, Bioinformatics, Illumina, Next-Generation Sequencing, Linux, Protocols

Disclaimer: This protocol and example commands are provided for demonstration purposes only. Please ensure all inputs, parameters, and file paths are carefully tailored to your specific data and experimental setup. For research applications, verify the software version, test dataset suitability, and parameter optimization according to your project requirements.

References

Prjibelski, Andrey, et al. “Using SPAdes De Novo Assembler.” Current Protocols in Bioinformatics, vol. 70, no. 1, 2020, p. e102, https://doi.org/10.1002/cpbi.102. Accessed 28 Oct. 2024.

Further resources

Bankevich A, et al. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477.
Prjibelski AD, et al. (2020). Using SPAdes De Novo Assembler. Current Protocols in Bioinformatics, 70:e102.
Antipov D, et al. (2016). hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics, 32(7):1009-1015.
Wick RR, et al. (2017). Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology, 13(6):e1005595.
Vasilinetc I, et al. (2015). IonHammer: Homopolymer-Aware De Novo Genome Assembly for Ion Torrent Reads. BMC Bioinformatics, 16(Suppl 1):S7.
