this is hopefully the last of a series of posts where I explored software choices that might replace BLAST+ programs in some scenarios. Today I'll write about HS-BLASTN, a parallel nucleotide local aligner reported to accelerate the default BLASTN algorithm (megablast), while producing the same results:
HS-BLASTN algorithm overview, taken from http://nar.oxfordjournals.org/content/early/2015/08/06/nar.gkv784.long |
We will benchmark HS-BLASTN (version hs-blastn-0.0.5+) using the same Hordeum vulgare and Arabidopsis thaliana sequences used in a previous post.
$ hs-blastn index bur-0.fasta #[IndexBuilder] Time elapsed: 89.002952 secs. #-rw-rw-r-- 1 contrera contrera 442M Jan 5 09:59 bur-0.fasta.bwt #-rw-rw-r-- 1 contrera contrera 13M Jan 5 09:58 bur-0.fasta.header #-rw-rw-r-- 1 contrera contrera 116M Jan 5 09:59 bur-0.fasta.sa #-rw-rw-r-- 1 contrera contrera 58M Jan 5 09:58 bur-0.fasta.sequence $ ncbi-blast-2.2.30+/bin/makeblastdb -in bur-0.fasta -dbtype nucl #Adding sequences from FASTA; added 67259 sequences in 5.51543 seconds. #-rw-rw-r-- 1 contrera contrera 12M Jan 5 10:06 bur-0.fasta.nhr #-rw-rw-r-- 1 contrera contrera 789K Jan 5 10:06 bur-0.fasta.nin #-rw-rw-r-- 1 contrera contrera 15M Jan 5 10:06 bur-0.fasta.nsq
It can be seen that indexing a FASTA file with 67K sequences is about 16x slower with HS-BLASTN than with standard NCBI mableblastdb, and produces much larger index files (629Mb vs 28Mb). Now let's review search performance:
hs-blastn align -query SBCC073_fLF.fasta \ -db bur-0.fasta -evalue 0.00001 -outfmt 6 -max_target_seqs 5 \ -out SBCC073_fLF.bur-0.hsblastn #[HS-BLASTN] done. Elpased time: 9.0055 secs. hs-blastn align -query SBCC073_fLF.fasta \ -db bur-0.fasta -evalue 0.00001 -outfmt 6 -max_target_seqs 5 \ -out SBCC073_fLF.bur-0.hsblastn -num_threads 20 #[HS-BLASTN] done. Elpased time: 1.6599 secs. time ncbi-blast-2.2.30+/bin/blastn -task megablast -query SBCC073_fLF.fasta \ -db bur-0.fasta -evalue 0.00001 -soft_masking true -outfmt 6 -max_target_seqs 5 \ -out SBCC073_fLF.bur-0.blastn #real 0m33.943s perl _split_blast.pl 20 2000 ncbi-blast-2.2.30+/bin/blastn \ -task megablast -query SBCC073_fLF.fasta -db bur-0.fasta -evalue 0.00001 -soft_masking true \ -outfmt 6 -max_target_seqs 5 -out SBCC073_fLF.bur-0.blastn # runtime: 7 wallclock secs ( 2.81 usr 0.50 sys + 60.47 cusr 7.28 csys = 71.06 CPU)
It can be seen that HS-BLASTN is ~4x faster using a single thread. About the same speedup is obtained when 20 threads are used and BLASTN is parallelized with help from an external script (_split_blast.pl). There are small nuisances, such as the fact that no soft-masking is available, or the ocasional non-stable output order of hits with same score, but it seems worth it nevertheless,
have a good weekend,
Bruno