Hi,
About a year ago my colleague Javier Tamames told me about a piece of software called DIAMOND, which can be thought of as a replacement for some BLAST tools, particularly BLASTP and BLASTX.
Authors' benchmark of DIAMOND, taken from http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html
Recently we tested the BLASTX alternative in-house, and here I summarize the results of finding protein-coding segments in transcripts from Arabidopsis thaliana (n=67,259) and barley (Hordeum vulgare, n=76,362) by comparing them to Swissprot, downloaded from ftp.uniprot.org.
You can get or build DIAMOND from https://github.com/bbuchfink/diamond.
First, we had to format Swissprot to support searches:
$ makeblastdb -in uniprot_sprot.fasta -dbtype prot
# produces 3 files

$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta
# produces 1 file
Then we could run the actual nucleotide-to-peptide sequence searches, allocating 30 CPU cores. Note that files bur-0.fasta and SBCC073_fLF.fasta contain the A. thaliana and barley transcripts, respectively:
$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.tsv \
    --max-target-seqs 1 --evalue 0.00001 \
    --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp sseq
#Total time = 36.2444s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.sensitive.tsv \
    --sensitive ...
#Total time = 144.636s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.more.tsv \
    --more-sensitive ...
#Total time = 194.35s

$ time ncbi-blast-2.2.30+/bin/blastx -num_threads 30 -db ~/db/swissprot \
    -query bur-0.fasta -out bur-0.blastx.tsv \
    -max_target_seqs 1 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'
#real 351m30.313s
#user 8294m28.727s
#sys 16m26.575s
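As a rough check of the speed-up, the timings reported above can be turned into ratios. The snippet below is only a sketch in Python: the helper `to_seconds` is a hypothetical name, and it compares DIAMOND's own "Total time" with the `real` time reported by `time` for BLASTX, which are not measured in exactly the same way.

```python
import re

def to_seconds(t: str) -> float:
    """Convert strings such as '351m30.313s' or '36.2444s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?([\d.]+)s", t)
    if m is None:
        raise ValueError(f"unrecognized time: {t!r}")
    minutes = int(m.group(1) or 0)
    return 60 * minutes + float(m.group(2))

# Wall-clock times copied from the A. thaliana runs above (30 cores each).
blastx = to_seconds("351m30.313s")
diamond = {
    "default": to_seconds("36.2444s"),
    "--sensitive": to_seconds("144.636s"),
    "--more-sensitive": to_seconds("194.35s"),
}

# Speed-up of each DIAMOND mode relative to BLASTX.
speedup = {mode: blastx / secs for mode, secs in diamond.items()}
for mode, ratio in speedup.items():
    print(f"{mode:18s} {ratio:6.0f}x faster than BLASTX")
```

With these numbers the ratios come out at roughly 580x, 146x and 109x for the default, --sensitive and --more-sensitive modes.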
Similar searches were performed with the barley sequences, and the results were then compared with the help of a Perl script, which required alignments to be 5% longer/shorter to be considered significantly longer or shorter, respectively. The total BLASTX alignments were 44,970 for A. thaliana and 32,887 for barley:
diamond_strategy                   matched  same_hit  same_length  longer  shorter
bur-0.diamond.tsv                  0.935    0.895     0.921        0.031   0.047
bur-0.diamond.sensitive.tsv        0.973    0.902     0.919        0.037   0.044
bur-0.diamond.more.tsv             0.973    0.902     0.918        0.037   0.045
SBCC073_fLF.diamond.tsv            0.889    0.807     0.852        0.071   0.077
SBCC073_fLF.diamond.sensitive.tsv  0.960    0.831     0.856        0.076   0.067
SBCC073_fLF.diamond.more.tsv       0.961    0.831     0.856        0.076   0.067
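The Perl script itself is not shown here, so the following Python sketch is only a guess at the kind of comparison it performed, not the actual code. The function names `best_hits` and `compare`, the `tol` parameter, and the choice of denominators (matched over all BLASTX queries, everything else over matched queries) are assumptions. It relies on the tabular output used above, where column 1 is the query, column 2 the subject and column 4 the alignment length.

```python
import csv

def best_hits(path, length_col=3):
    """Read a tabular (outfmt 6) file and keep the first alignment per query.

    Returns {qseqid: (sseqid, alignment_length)}.
    """
    hits = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if row and row[0] not in hits:
                hits[row[0]] = (row[1], int(row[length_col]))
    return hits

def compare(blastx, diamond, tol=0.05):
    """Summarize DIAMOND hits relative to BLASTX hits (assumed definitions).

    An alignment counts as longer/shorter only if its length differs
    from the BLASTX alignment by more than `tol` (5% by default).
    """
    matched = [q for q in blastx if q in diamond]
    same_hit = longer = shorter = 0
    for q in matched:
        b_subject, b_len = blastx[q]
        d_subject, d_len = diamond[q]
        same_hit += d_subject == b_subject
        if d_len > b_len * (1 + tol):
            longer += 1
        elif d_len < b_len * (1 - tol):
            shorter += 1
    n = len(matched)
    frac = lambda k: k / n if n else 0.0
    return {
        "matched": n / len(blastx) if blastx else 0.0,
        "same_hit": frac(same_hit),
        "same_length": frac(n - longer - shorter),
        "longer": frac(longer),
        "shorter": frac(shorter),
    }

# Toy data (made up) in place of best_hits("bur-0.blastx.tsv") etc.
toy_blastx = {"t1": ("P1", 100), "t2": ("P2", 200), "t3": ("P3", 100), "t4": ("P4", 50)}
toy_diamond = {"t1": ("P1", 102), "t2": ("P9", 230), "t3": ("P3", 90)}
print(compare(toy_blastx, toy_diamond))
```

On real data the two dictionaries would come from `best_hits("bur-0.blastx.tsv")` and `best_hits("bur-0.diamond.tsv")`; since both searches were run with a single target sequence per query, keeping the first alignment per query seems a reasonable simplification.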
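All in all, our tests suggest that DIAMOND reduces computing time by about 2 orders of magnitude (351 min for BLASTX vs 145 s for DIAMOND in --sensitive mode with A. thaliana), with a loss of sensitivity of up to 4% in --sensitive mode. In most cases the alignments hit the same protein sequences and have the same length as those of BLASTX.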
See you,
Bruno