This protocol provides a step-by-step guide to de novo genome assembly with the SPAdes genome assembler on Windows Subsystem for Linux (WSL) running Ubuntu. SPAdes (St. Petersburg genome assembler) is a versatile toolkit for assembling and analyzing sequencing data from Illumina and Ion Torrent platforms. Its pipelines also support a hybrid mode, in which long reads from PacBio and Oxford Nanopore are supplied as supplementary data to improve assembly quality and accuracy.
This protocol emphasizes the setup, input requirements, command usage, and troubleshooting tips tailored to a WSL environment.
Materials
- A Windows PC with WSL enabled.
- Software:
  - WSL: Installed with an Ubuntu environment. Setup Guide: WSL Installation
  - SPAdes: Version 4.0.0 or later, installed in the WSL Ubuntu environment.
  - SRA Toolkit (sra-tools): Provides the prefetch and fastq-dump commands used to download the demonstration data.
- Data: Illumina paired-end FASTQ files (SRR17344889 from the NCBI SRA database for demonstration).
Supported Input Data Types
Data Type | Description | Recommended Coverage |
---|---|---|
Illumina paired-end reads | Standard genomic libraries | 50x or higher |
Illumina mate-pair reads | Long-insert libraries | 10x-30x |
PacBio CLR reads | Long reads for hybrid assembly | 10x-20x |
Oxford Nanopore reads | Long reads for hybrid assembly | 10x-20x |
Ion Torrent reads | Single-end reads | 40x or higher |
Sanger reads | Legacy data support | Any coverage |
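To compare a dataset against the recommended coverage values in the table, coverage can be approximated as total sequenced bases divided by genome size. The following is a minimal sketch, assuming uncompressed FASTQ files with four lines per record. The function name estimate_coverage and the genome size argument are illustrative and not part of SPAdes.

```shell
# estimate_coverage GENOME_SIZE_BP FASTQ [FASTQ...]
# Prints (total bases in all FASTQ files) / (genome size), one decimal place.
# Assumption: uncompressed FASTQ with exactly 4 lines per record.
estimate_coverage() {
    genome_size="$1"
    shift
    # FNR restarts at 1 for each file; line 2 of every record is the sequence.
    awk -v g="$genome_size" '
        FNR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }
    ' "$@"
}

# Usage (5000000 is a placeholder genome size; substitute the expected
# size of your organism):
#   estimate_coverage 5000000 SRR17344889_1.fastq SRR17344889_2.fastq
```

The genome size has to come from prior knowledge of the organism, so the result is only a rough guide for choosing between the ranges in the table.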
Procedure
Install WSL and Set Up Ubuntu
- Enable WSL on your Windows system. Ensure your Windows version supports WSL2 for better performance.
- Install Ubuntu from Microsoft Store and set up the environment by updating the package list:
sudo apt update && sudo apt upgrade -y
Install SPAdes in Ubuntu
- Install SPAdes by running the following command:
sudo apt install spades
- Verify the installation:
spades.py -v
SPAdes requires a 64-bit Linux system or macOS with Python 3.8 or higher pre-installed. To obtain SPAdes, you can either download the binaries or download the source code and compile it yourself.
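Because the Materials call for SPAdes 4.0.0 or later, and a distribution package may lag behind the latest release, it can help to compare the installed version against that minimum. This is a minimal sketch that relies on GNU sort -V (available on Ubuntu). The helper names version_ge and extract_version are illustrative, and the exact wording of the spades.py -v output is an assumption, so the version number is extracted with a generic pattern.

```shell
# version_ge A B: succeeds if version A is greater than or equal to version B.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# extract_version: reads text on stdin and prints the first X.Y.Z number found.
extract_version() {
    grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage (assumes spades.py is on PATH):
#   installed=$(spades.py -v 2>&1 | extract_version)
#   version_ge "$installed" "4.0.0" || echo "SPAdes $installed is older than 4.0.0"
```

If the check reports an older version, downloading the binaries or compiling from source, as described above, is the alternative.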
Prepare Input Data
We'll download a real dataset from the NCBI SRA database using the prefetch command from the SRA Toolkit (sra-tools; on Ubuntu it can be installed with sudo apt install sra-toolkit). This protocol uses a small dataset with the accession number SRR17344889.
Set up the working directory. Open your terminal and create a directory named sra_data for the dataset:
mkdir ~/sra_data
cd ~/sra_data
Download the dataset using prefetch:
prefetch SRR17344889
After downloading, convert the SRA file to FASTQ format using the fastq-dump command:
fastq-dump --split-files SRR17344889
This will create two FASTQ files:
SRR17344889_1.fastq
SRR17344889_2.fastq
Assess data quality (using tools like FastQC) to ensure that the reads are clean and of high quality.
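Alongside a FastQC report, a quick sanity check is that the forward and reverse files contain the same number of reads, since paired-end assembly expects matching pairs. The sketch below assumes uncompressed four-line FASTQ records. The function names count_reads and check_pairs are illustrative.

```shell
# count_reads FILE: number of records in an uncompressed 4-line FASTQ file.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# check_pairs R1 R2: succeeds only if both files hold the same number of reads.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" = "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads" >&2
        return 1
    fi
}

# Usage:
#   check_pairs SRR17344889_1.fastq SRR17344889_2.fastq
```

A mismatch usually points to an interrupted download or conversion, in which case rerunning the conversion step is the first thing to try.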
Perform Assembly with SPAdes
Run SPAdes with a command adjusted for your dataset:
# -t: number of threads; -m: memory limit in GB; -k: k-mer sizes for assembly.
# If SPAdes was installed with apt, replace the full path below with spades.py.
cd ~
~/SPAdes-4.0.0-Linux/bin/spades.py \
-1 sra_data/SRR17344889_1.fastq \
-2 sra_data/SRR17344889_2.fastq \
--careful \
-t 16 \
-m 64 \
-k 21,33,55,77 \
-o spades_output
Breakdown of Command Options:
-1 <filename>: Specifies the first read file (forward reads).
-2 <filename>: Specifies the second read file (reverse reads).
--careful: Reduces the number of mismatches and short indels in the assembly.
-t 16: Number of threads.
-m 64: Memory limit in GB.
-k 21,33,55,77: K-mer sizes used for assembly.
-o spades_output: The output directory for the assembly results.
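The -t and -m values in the command above should not exceed what the WSL instance actually provides, and WSL2 typically exposes only part of the host's RAM. The following sketch derives candidate values from the running Ubuntu environment. The 2 GB headroom is an arbitrary assumption, not a SPAdes recommendation.

```shell
# Logical CPU cores visible inside the WSL instance.
threads=$(nproc)

# Total memory in whole GB (MemTotal in /proc/meminfo is reported in kB).
total_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
total_gb=${total_gb:-0}

# Leave about 2 GB of headroom for the system (assumption); never go below 1 GB.
mem_gb=1
if [ "$total_gb" -gt 3 ]; then
    mem_gb=$((total_gb - 2))
fi

echo "Suggested options: -t $threads -m $mem_gb"
```

The printed values can be substituted for -t 16 and -m 64 when the machine has fewer cores or less memory than the example assumes.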
SPAdes Command Line Options
Option | Description | Default Value | Example Usage |
---|---|---|---|
--isolate | Runs SPAdes in isolate mode for standard isolate genome assembly (not compatible with --careful) | – | --isolate |
-k | List of k-mer sizes (must be odd and less than 128) | 21,33,55 | -k 21,33,55,77 |
--careful | Reduces number of mismatches and short indels | OFF | --careful |
-o | Directory to store all the resulting files | – | -o /path/to/output |
--pe1-1 , --pe1-2 | Forward and reverse paired-end reads first library | – | --pe1-1 reads1.fastq --pe1-2 reads2.fastq |
--mp1-1 , --mp1-2 | Forward and reverse mate-pair reads first library | – | --mp1-1 mp1.fastq --mp1-2 mp2.fastq |
--s1 | Single-end reads first library | – | --s1 singles.fastq |
--pacbio | PacBio reads | – | --pacbio reads.fastq |
--nanopore | Oxford Nanopore reads | – | --nanopore reads.fastq |
-t , --threads | Number of threads to use | 16 | -t 24 |
-m , --memory | Memory limit in GB | 250 | -m 128 |
--only-assembler | Runs only assembly (without read error correction) | OFF | --only-assembler |
--only-error-correction | Runs only read error correction (without assembly) | OFF | --only-error-correction |
--cov-cutoff | Read coverage cutoff value (auto, off, or a positive number) | OFF | --cov-cutoff 10 |
--phred-offset | PHRED quality offset in the input reads (33 or 64) | auto-detected | --phred-offset 64 |
--meta | Runs SPAdes in metagenomic mode | OFF | --meta |
--plasmid | Runs plasmidSPAdes pipeline for plasmid detection | OFF | --plasmid |
--rna | Runs rnaSPAdes pipeline for RNA assembly | OFF | --rna |
--checkpoints | Save intermediate check-points (‘last’, ‘all’, or ‘off’) | LAST | --checkpoints all |
--continue | Continues run from the last available check-point | OFF | --continue |
--restart-from | Restart run with updated options starting from a specified check-point | – | --restart-from k33 |
--disable-gzip-output | Forces the error correction module to write uncompressed corrected reads | OFF | --disable-gzip-output |
--disable-rr | Disables repeat resolution stage | OFF | --disable-rr |
-h , --help | Shows help message | – | --help |
--version | Shows version number | – | --version |
Notes:
- All paired-end and mate-pair libraries can be input using separate pairs of files (forward and reverse) or an interleaved file.
- Multiple libraries can be specified using incrementing numbers (e.g., --pe1-1, --pe2-1, etc.).
- The k value must be odd and less than 128 due to the implementation of the de Bruijn graph algorithm.
- Memory limit should be set according to the available system resources and genome size.
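The constraints on k can be checked before a long run is started. This is a minimal sketch of such a check. The function name validate_kmers is illustrative, and the ascending-order rule comes from the SPAdes manual rather than from the notes above.

```shell
# validate_kmers "21,33,55": succeeds if every k is an odd integer below 128
# and the list is in ascending order.
validate_kmers() {
    prev=0
    for k in $(printf '%s' "$1" | tr ',' ' '); do
        case "$k" in
            ''|*[!0-9]*) echo "not an integer: $k" >&2; return 1 ;;
        esac
        if [ $((k % 2)) -eq 0 ]; then
            echo "k must be odd: $k" >&2; return 1
        fi
        if [ "$k" -ge 128 ]; then
            echo "k must be less than 128: $k" >&2; return 1
        fi
        if [ "$k" -le "$prev" ]; then
            echo "k values must be ascending: $k" >&2; return 1
        fi
        prev=$k
    done
    return 0
}

# Usage:
#   validate_kmers "21,33,55,77" && echo "k-mer list looks valid"
```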
Review Output and Assess Assembly Quality
cd spades_output
ls
Look for key files, particularly:
spades_output/
├── corrected/ # Error-corrected reads
├── scaffolds.fasta # Final scaffolds
├── contigs.fasta # Final contigs
├── assembly_graph.fastg # Assembly graph in FASTG format
├── contigs.paths # Paths in the assembly graph
├── scaffolds.paths # Scaffold paths
├── params.txt # Parameters used
└── spades.log # Log file
head contigs.fasta
The output should contain high-quality, assembled contigs in FASTA format. Further validation with tools like QUAST is recommended to confirm assembly accuracy.
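Before running a full QUAST report, a few basic statistics can be computed directly from the FASTA file. The sketch below reports the number of contigs, the total length, the largest contig, and N50 (the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size). The function name assembly_stats is illustrative, and the output is a rough check rather than a replacement for QUAST.

```shell
# assembly_stats FASTA: prints contig count, total length, largest contig, N50.
assembly_stats() {
    # First pass: print one length per sequence (handles multi-line records).
    awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }
    ' "$1" | sort -rn | awk '
        # Second pass: lengths arrive sorted from longest to shortest.
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d largest=%d N50=%d\n", NR, total, lens[1], n50
        }'
}

# Usage:
#   assembly_stats contigs.fasta
#   assembly_stats scaffolds.fasta
```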
Keywords
SPAdes, De Novo Assembly, Genome Assembly, WSL, Ubuntu, Bioinformatics, Illumina, Next-Generation Sequencing, Linux, Protocols
Disclaimer: This protocol and example commands are provided for demonstration purposes only. Please ensure all inputs, parameters, and file paths are carefully tailored to your specific data and experimental setup. For research applications, verify the software version, test dataset suitability, and parameter optimization according to your project requirements.
References
Prjibelski, Andrey, et al. “Using SPAdes De Novo Assembler.” Current Protocols in Bioinformatics, vol. 70, no. 1, 2020, p. e102, https://doi.org/10.1002/cpbi.102. Accessed 28 Oct. 2024.
Further resources
- Bankevich A, et al. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477.
- Prjibelski AD, et al. (2020). Using SPAdes De Novo Assembler. Current Protocols in Bioinformatics, 70:e102.
- Antipov D, et al. (2016). hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics, 32(7):1009-1015.
- Wick RR, et al. (2017). Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology, 13(6):e1005595.
- Vasilinetc I, et al. (2015). IonHammer: Homopolymer-Aware De Novo Genome Assembly for Ion Torrent Reads. BMC Bioinformatics, 16(Suppl 1):S7.