In a recent post I discussed the performance of DIAMOND when aligning nucleotide sequences against a peptide database. Since this tool can also run protein similarity searches, I wanted to test that as well. Here I summarize the results.
$ makeblastdb -in uniprot_sprot.fasta -dbtype prot
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta
As before, we used DIAMOND v0.8.25, obtained from https://github.com/bbuchfink/diamond.
The actual sequence similarity searches were performed as follows:
$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.tsv \
  --max-target-seqs 1000 --evalue 0.00001 \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend \
  sstart send evalue bitscore qcovhsp sseq
$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.sensitive.tsv \
  --sensitive ...
$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.more.tsv \
  --more-sensitive ...
$ time ncbi-blast-2.2.30+/bin/blastp -num_threads 30 -db swissprot \
  -query bur-0.faa -out bur-0.blastp.tsv -max_target_seqs 1000 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'
# calling _split_blast.pl as used in get_homologues
$ get_homologues-est/_split_blast.pl 30 250 ncbi-blast-2.2.30+/bin/blastp \
  -db swissprot -query bur-0.faa -out bur-0.split.blastp.tsv \
  -max_target_seqs 1000 -evalue 0.00001 -outfmt \'6 std qcovhsp sseq\'
The same commands were then run with Mtuberculosis_H37Rv.faa in place of bur-0.faa, using 8 CPU cores instead of 30. The resulting alignments were compared with a script (available upon request). Wall-clock time and matched hits are reported in relation to BLASTP (control). Alignment comparisons were performed with all BLASTP hits (red) and with the subset of hits with identity >= 50% (blue).
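The comparison script itself is not shown here, so the following is only a minimal Python sketch of how such a comparison could be done, not the script actually used. It assumes that "matched" is the fraction of BLASTP query/subject pairs also reported by DIAMOND, that "=length", "longer" and "shorter" compare the DIAMOND alignment length to the BLASTP one for matched pairs, and that only the first HSP listed per pair counts. The function names read_hits and compare are hypothetical.

```python
import io


def read_hits(handle, min_identity=0.0):
    """Return {(qseqid, sseqid): alignment length} from tabular output.

    Columns are expected in the order used above: qseqid, sseqid, pident,
    length, ... Only the first HSP listed for each pair is considered
    (an assumption of this sketch); pairs whose first HSP has a percent
    identity below min_identity are dropped.
    """
    hits = {}
    seen = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 4:
            continue
        pair = (fields[0], fields[1])
        if pair in seen:
            continue
        seen.add(pair)
        if float(fields[2]) >= min_identity:
            hits[pair] = int(fields[3])
    return hits


def compare(control, test):
    """Summarize how the test hits relate to the control (BLASTP) hits."""
    shared = [pair for pair in control if pair in test]
    n = len(shared)
    same = sum(1 for pair in shared if test[pair] == control[pair])
    longer = sum(1 for pair in shared if test[pair] > control[pair])
    return {
        "matched": n / len(control) if control else 0.0,
        "=length": same / n if n else 0.0,
        "longer": longer / n if n else 0.0,
        "shorter": (n - same - longer) / n if n else 0.0,
    }


# Tiny made-up example (not real data) to show the call pattern.
blastp_tsv = "q1\ts1\t90.0\t100\nq1\ts2\t40.0\t80\nq2\ts3\t75.0\t50\n"
diamond_tsv = "q1\ts1\t91.0\t100\nq2\ts3\t70.0\t55\n"
control = read_hits(io.StringIO(blastp_tsv))
test = read_hits(io.StringIO(diamond_tsv))
print(compare(control, test))
```

With real files, the handles would come from open("bur-0.blastp.tsv") and open("bur-0.diamond.tsv"); passing min_identity=50.0 when reading the BLASTP file restricts the control set to hits with identity >= 50%.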
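The take-home message is that DIAMOND in "more sensitive" mode was roughly 20x to 60x faster than BLASTP in these tests, while recovering about 15% fewer of the BLASTP hits overall. That loss shrinks to 4-9% when only hits with identity >= 50% are considered, that is, when the more remote homologues are left out.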
                                    [ all hits ]                          [ IDENTITY:50% ]
search_strategy           time(s)   matched  =length  longer  shorter    matched  =length  longer  shorter
control (NCBI BLASTP)     16440
bur-0.diamond             55        0.402    0.804    0.109   0.088      0.848    0.916    0.063   0.021
bur-0.diamond.sensitive   192       0.774    0.794    0.116   0.090      0.906    0.916    0.062   0.022
bur-0.diamond.more        263       0.832    0.797    0.115   0.088      0.910    0.916    0.062   0.022
bur-0.split.blastp        6915      1.000

control (NCBI BLASTP)     2657
Mtub.diamond              34        0.483    0.853    0.075   0.072      0.930    0.932    0.051   0.017
Mtub.diamond.sensitive    124       0.832    0.823    0.091   0.087      0.957    0.931    0.052   0.017
Mtub.diamond.more         141       0.853    0.821    0.092   0.086      0.958    0.930    0.052   0.017
Mtub.split.blastp         2334      1.000
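To make the speed and sensitivity trade-off easier to read off the table, here is a small Python sketch that derives the speed-up and the fraction of missed BLASTP hits for each DIAMOND run. The times and "matched" values (all hits) are copied from the table above; the names CONTROL_SECONDS, RUNS and summarize are only illustrative.

```python
# Wall-clock seconds of the BLASTP control for each dataset (from the table).
CONTROL_SECONDS = {"bur-0": 16440, "Mtub": 2657}

# (wall-clock seconds, fraction of all BLASTP hits matched) for each DIAMOND run.
RUNS = {
    "bur-0.diamond": (55, 0.402),
    "bur-0.diamond.sensitive": (192, 0.774),
    "bur-0.diamond.more": (263, 0.832),
    "Mtub.diamond": (34, 0.483),
    "Mtub.diamond.sensitive": (124, 0.832),
    "Mtub.diamond.more": (141, 0.853),
}


def summarize(name):
    """Return (speed-up over BLASTP, fraction of BLASTP hits missed)."""
    seconds, matched = RUNS[name]
    dataset = name.split(".diamond")[0]  # "bur-0" or "Mtub"
    speedup = CONTROL_SECONDS[dataset] / seconds
    missed = 1.0 - matched
    return speedup, missed


for name in RUNS:
    speedup, missed = summarize(name)
    print(f"{name:26s} {speedup:6.1f}x faster, {missed:6.1%} of BLASTP hits missed")
```

Note that the two datasets were run with different numbers of threads (30 and 8), so speed-ups are only comparable within a dataset.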
Bruno
PS: Here is another test, done by David A Wilkinson, from Massey University, New Zealand:
"I ran get_homologues in conjunction with blastp, diamond in "standard mode" and diamond in "more-precise mode" for a small dataset. The dataset includes 20 genomes from the genus Leptospira - importantly, they are not all from the same "species", and for some pairwise comparisons will have relatively distant genomes and gene identities...
I used the ORTHOMCL algorithm in get_homologues for clustering. My numbers are in total agreement with published literature. Here are the comparative results:
                     | #clusters | #core | #softcore | #shell | #cloud
BLASTP               | 10398     | 1523  | 2197      | 2269   | 5932
DIAMOND_STANDARD     | 10829     | 1421  | 2146      | 2328   | 6355
DIAMOND_MORE_PRECISE | 10410     | 1535  | 2215      | 2269   | 5926

                     | %clusters | %core | %softcore | %shell | %cloud
BLASTP               | 100%      | 100%  | 100%      | 100%   | 100%
DIAMOND_STANDARD     | 104%      | 93%   | 98%       | 103%   | 107%
DIAMOND_MORE_PRECISE | 100%      | 101%  | 101%      | 100%   | 100%
"