5 de enero de 2017

HS-BLASTN as replacement of MEGABLAST

Hi,
this is hopefully the last of a series of posts where I explored software choices that might replace BLAST+ programs in some scenarios. Today I'll write about HS-BLASTN, a parallel nucleotide local aligner reported to accelerate the default BLASTN algorithm (megablast), while producing the same results:


HS-BLASTN algorithm overview, taken from http://nar.oxfordjournals.org/content/early/2015/08/06/nar.gkv784.long
Note that megablast is a fast choice for intra-species comparisons and typically retrieves sequence matches with nucleotide identities greater than 70% (see Figure 3 in http://eead-csic-compbio.github.io/get_homologues/manual-est).

We will benchmark HS-BLASTN (version hs-blastn-0.0.5+) using the same Hordeum vulgare and Arabidopsis thaliana sequences used in a previous post.

$ hs-blastn index bur-0.fasta
#[IndexBuilder] Time elapsed: 89.002952 secs.
#-rw-rw-r-- 1 contrera contrera 442M Jan  5 09:59 bur-0.fasta.bwt
#-rw-rw-r-- 1 contrera contrera  13M Jan  5 09:58 bur-0.fasta.header
#-rw-rw-r-- 1 contrera contrera 116M Jan  5 09:59 bur-0.fasta.sa
#-rw-rw-r-- 1 contrera contrera  58M Jan  5 09:58 bur-0.fasta.sequence

$ ncbi-blast-2.2.30+/bin/makeblastdb -in bur-0.fasta -dbtype nucl
#Adding sequences from FASTA; added 67259 sequences in 5.51543 seconds.
#-rw-rw-r-- 1 contrera contrera  12M Jan  5 10:06 bur-0.fasta.nhr
#-rw-rw-r-- 1 contrera contrera 789K Jan  5 10:06 bur-0.fasta.nin
#-rw-rw-r-- 1 contrera contrera  15M Jan  5 10:06 bur-0.fasta.nsq

It can be seen that indexing a FASTA file with 67K sequences is about 16x slower with HS-BLASTN than with standard NCBI mableblastdb, and produces much larger index files (629Mb vs 28Mb). Now let's review search performance:
 
hs-blastn align -query SBCC073_fLF.fasta \
  -db bur-0.fasta -evalue 0.00001 -outfmt 6 -max_target_seqs 5 \
  -out SBCC073_fLF.bur-0.hsblastn
#[HS-BLASTN] done. Elpased time: 9.0055 secs.

hs-blastn align -query SBCC073_fLF.fasta \
  -db bur-0.fasta -evalue 0.00001 -outfmt 6 -max_target_seqs 5 \
  -out SBCC073_fLF.bur-0.hsblastn -num_threads 20
#[HS-BLASTN] done. Elpased time: 1.6599 secs.

time ncbi-blast-2.2.30+/bin/blastn -task megablast -query SBCC073_fLF.fasta \
  -db bur-0.fasta -evalue 0.00001 -soft_masking true -outfmt 6 -max_target_seqs 5 \
  -out SBCC073_fLF.bur-0.blastn
#real 0m33.943s

perl _split_blast.pl 20 2000 ncbi-blast-2.2.30+/bin/blastn \
  -task megablast -query SBCC073_fLF.fasta -db bur-0.fasta -evalue 0.00001 -soft_masking true \
  -outfmt 6 -max_target_seqs 5 -out SBCC073_fLF.bur-0.blastn 
# runtime:  7 wallclock secs ( 2.81 usr  0.50 sys + 60.47 cusr  7.28 csys = 71.06 CPU)

It can be seen that HS-BLASTN is ~4x faster using a single thread. About the same speedup is obtained when 20 threads are used and BLASTN is parallelized with help from an external script (_split_blast.pl). There are small nuisances, such as the fact that no soft-masking is available, or the ocasional non-stable output order of hits with same score, but it seems worth it nevertheless,
have a good weekend,
Bruno