Hi,
About a year ago my colleague Javier Tamames told me about a piece of software called DIAMOND, which can be thought of as a replacement for some BLAST tools, particularly BLASTP and BLASTX.
We recently tested the BLASTX alternative in-house, and here I summarize the results of finding protein-coding segments in transcripts from Arabidopsis thaliana (n=67,259) and barley (Hordeum vulgare, n=76,362) by comparing them to Swissprot, downloaded from ftp.uniprot.org.
You can get or build DIAMOND from https://github.com/bbuchfink/diamond.
First, we had to format Swissprot to support searches:
$ makeblastdb -in uniprot_sprot.fasta -dbtype prot # produces 3 files
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta # produces 1 file
Then we could run the actual nucleotide-to-peptide sequence searches, allocating 30 CPU cores. Note that files bur-0.fasta and SBCC073_fLF.fasta contain the A. thaliana and barley transcripts, respectively:
$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.tsv \
--max-target-seqs 1 --evalue 0.00001 \
--outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp sseq
#Total time = 36.2444s
$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.sensitive.tsv \
--sensitive ...
#Total time = 144.636s
$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.more.tsv \
--more-sensitive ...
#Total time = 194.35s
$ time ncbi-blast-2.2.30+/bin/blastx -num_threads 30 -db ~/db/swissprot \
-query bur-0.fasta -out bur-0.blastx.tsv \
-max_target_seqs 1 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'
#real 351m30.313s
#user 8294m28.727s
#sys 16m26.575s
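To put the timings above side by side, here is a minimal sketch that converts the BLASTX wall-clock ("real") time to seconds and divides it by each DIAMOND "Total time". The helper names to_seconds and speedup are made up for this illustration, and the two kinds of timing are measured differently (shell time vs DIAMOND's own report), so the ratios are only approximate:

```shell
# Convert a `time`-style duration such as 351m30.313s into seconds.
# (to_seconds is a hypothetical helper, not part of BLAST or DIAMOND.)
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Fold speedup: BLASTX seconds divided by DIAMOND seconds, rounded.
speedup() {
    awk -v blast="$1" -v diamond="$2" 'BEGIN { printf "%.0f\n", blast / diamond }'
}

blastx_s=$(to_seconds 351m30.313s)   # BLASTX "real" time from above
speedup "$blastx_s" 36.2444          # default mode
speedup "$blastx_s" 144.636          # --sensitive
speedup "$blastx_s" 194.35           # --more-sensitive
```

With the A. thaliana numbers this prints 582, 146 and 109, so even the most sensitive mode is roughly 100 times faster than BLASTX here.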
Similar searches were performed with barley sequences, and the results were then compared with the help of a Perl script, which required alignments to be 5% longer or shorter to be considered significantly longer or shorter, respectively. The total BLASTX alignments were 44,970 for A. thaliana and 32,887 for barley:
diamond_strategy                   matched  same_hit  same_length  longer  shorter
bur-0.diamond.tsv                  0.935    0.895     0.921        0.031   0.047
bur-0.diamond.sensitive.tsv        0.973    0.902     0.919        0.037   0.044
bur-0.diamond.more.tsv             0.973    0.902     0.918        0.037   0.045
SBCC073_fLF.diamond.tsv            0.889    0.807     0.852        0.071   0.077
SBCC073_fLF.diamond.sensitive.tsv  0.960    0.831     0.856        0.076   0.067
SBCC073_fLF.diamond.more.tsv       0.961    0.831     0.856        0.076   0.067
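The Perl script itself is not shown here, but the kind of comparison behind the table can be sketched with awk. This is only an approximation under several assumptions of mine: both files are in tabular format 6 with query, subject and alignment length in columns 1, 2 and 4; only the first line per query is used; matched is relative to the BLASTX queries with a hit, while the other fractions are relative to the matched queries; and longer/shorter refer to the DIAMOND alignment compared to the BLASTX one. The function name compare_hits is hypothetical:

```shell
# compare_hits BLASTX.tsv DIAMOND.tsv
# Sketch only: not the original Perl script, see assumptions above.
compare_hits() {
    awk -F'\t' '
        FILENAME == ARGV[1] {             # first file: BLASTX hits
            if (!($1 in bhit)) { bhit[$1] = $2; blen[$1] = $4; total++ }
            next
        }
        ($1 in bhit) && !($1 in seen) {   # second file: DIAMOND hits
            seen[$1] = 1; matched++
            if ($2 == bhit[$1]) same_hit++
            # 5% tolerance on alignment length (assumed direction)
            if      ($4 > 1.05 * blen[$1]) longer++
            else if ($4 < 0.95 * blen[$1]) shorter++
            else                           same_length++
        }
        END {
            if (!total || !matched) { print "no comparable hits"; exit 1 }
            printf "matched=%.3f same_hit=%.3f same_length=%.3f longer=%.3f shorter=%.3f\n",
                matched / total, same_hit / matched, same_length / matched,
                longer / matched, shorter / matched
        }' "$1" "$2"
}

# Example call with the files produced above:
# compare_hits bur-0.blastx.tsv bur-0.diamond.tsv
```

Since the denominators used by the real script are not stated, the numbers from this sketch need not reproduce the table exactly.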
All in all, our tests suggest a reduction in computing time of about 2 orders of magnitude, with a loss of sensitivity of up to 4% in the sensitive modes, and alignments of the same length as those of BLASTX in most cases.
See you,
Bruno