#!/perl/bioinfo: diamond

19 de diciembre de 2016

DIAMOND as alternative to BLASTP

Hi,
in a recent post I discussed the performance of DIAMOND when aligning nucleotide sequences against a peptide database. However, as this tool can also do protein similarity searches, I wanted to test that too. Here I summarize those results.

We used all protein sequences encoded in ecotype Bur-0 of Arabidopsis thaliana (n=53,027) and the proteome of Mycobacterium tuberculosis strain H37Rv (n=3,906). We scanned these against Swissprot, downloaded from ftp.uniprot.org and formatted as follows:

$ makeblastdb -in uniprot_sprot.fasta -dbtype prot                   
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta

As earlier, we used v0.8.25, obtained from https://github.com/bbuchfink/diamond.

The actual sequence similarity searches were performed as follows:

$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.tsv \
  --max-target-seqs 1000 --evalue 0.00001 \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend \
  sstart send evalue bitscore qcovhsp sseq 

$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.sensitive.tsv \
  --sensitive ...

$ diamond blastp -p 30 -d swissprot -q bur-0.faa -o bur-0.diamond.more.tsv \
  --more-sensitive ...

$ time ncbi-blast-2.2.30+/bin/blastp -num_threads 30 -db swissprot \
  -query bur-0.faa -out bur-0.blastp.tsv 
  -max_target_seqs 1000 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'

# calling _split_blast.pl as used in get_homologues

$ get_homologues-est/_split_blast.pl 30 250 ncbi-blast-2.2.30+/bin/blastp \
  -db swissprot -query bur-0.faa -out bur-0.split.blastp.tsv \
  -max_target_seqs 1000 -evalue 0.00001 -outfmt \'6 std qcovhsp sseq\'

Those commands were then run replacing bur-0.faa with Mtuberculosis_H37Rv.faa and using 8 CPU cores instead of 30. The obtained alignments were compared with a script (available upon request). Wall-clock time and matched hits are computed in relation to BLASTP (control). Alignment comparisons were performed with all BLASTP hits (red) and the subset of hits with identity >= 50% (blue).

The take home message is that DIAMOND "more sensitive" is 20x to 100x faster than BLASTP in these tests, with roughly 15% less sensitive overall, which is reduced to 5-9% when more remote homologues matter.

     search_strategy    time(s) matched =length longer shorter matched =length longer shorter

control (NCBI BLASTP)     16440  [         all hits          ]  [       IDENTITY:50%        ]
bur-0.diamond                55   0.402   0.804  0.109   0.088   0.848   0.916  0.063   0.021
bur-0.diamond.sensitive     192   0.774   0.794  0.116   0.090   0.906   0.916  0.062   0.022
bur-0.diamond.more          263   0.832   0.797  0.115   0.088   0.910   0.916  0.062   0.022
bur-0.split.blastp         6915   1.000    
                                          
control (NCBI BLASTP)      2657  [         all hits          ]  [       IDENTITY:50%        ]
Mtub.diamond                 34   0.483   0.853  0.075   0.072   0.930   0.932  0.051   0.017
Mtub.diamond.sensitive      124   0.832   0.823  0.091   0.087   0.957   0.931  0.052   0.017
Mtub.diamond.more           141   0.853   0.821  0.092   0.086   0.958   0.930  0.052   0.017
Mtub.split.blastp          2334   1.000

Bruno

PS I post another test done by David A Wilkinson, from Massey University, New Zealand:

"I ran get homologues in conjunction with blastp, diamond in "standard mode" and diamond in "more-precise mode" for a small dataset.The dataset includes 20 genomes from the genus Leptospira - importantly, they are not all from the same "species", and for some pairwise comparisons will have relatively distant genomes and gene identities...

I used the ORTHOMCL algorithm in get-homologues for clustering. My numbers are in total agreement with published literature.Here are the comparative results:

	#clusters	#core	#softcore	#shell	#cloud
BLASTP	10398	1523	2197	2269	5932
DIAMOND_STANDARD	10829	1421	2146	2328	6355
DIAMOND_MORE_PRECISE	10410	1535	2215	2269	5926

	%clusters	%core	%softcore	%shell	%cloud
BLASTP	100%	100%	100%	100%	100%
DIAMOND_STANDARD	104%	93%	98%	103%	107%
DIAMOND_MORE_PRECISE	100%	101%	101%	100%	100%

7 de noviembre de 2016

DIAMOND as replacement of BLASTX

Hi,
about a year ago my colleague Javier Tamames told me about a piece of software called DIAMOND which could be thought of as a replacement of some BLAST tools, particularly BLASTP and BLASTX.

Author's benchmark of DIAMOND, taken from http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html .

Recently we tested in-house the BLASTX alternative and I summarize here the results of finding protein-coding segments in transcripts from Arabidopsis thaliana (n=67,259) and barley (Hordeum vulgare, n=76,362) by comparing them to Swissprot, downloaded from ftp.uniprot.org.

You can get or build DIAMOND from https://github.com/bbuchfink/diamond.

First, we had to format Swissprot to support searches:

$ makeblastdb -in uniprot_sprot.fasta -dbtype prot                   # produces 3 files
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta   # produces 1 file

Then we could run the actual nucleotide-to-peptide sequence searches allocating 30 CPU cores. Note that files bur-0.fasta and SBCC073_fLF.fasta contain the A.thaliana and barley transcripts:

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.tsv \
  --max-target-seqs 1 --evalue 0.00001 \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp sseq 
#Total time = 36.2444s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.sensitive.tsv \
  --sensitive ...

#Total time = 144.636s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.more.tsv \
  --more-sensitive ...

#Total time = 194.35s

$time ncbi-blast-2.2.30+/bin/blastx -num_threads 30 -db ~/db/swissprot \

  -query bur-0.fasta -out bur-0.blastx.tsv \
  -max_target_seqs 1 -evalue 0.00001 -outfmt '6 std qcovhsp sseq' 
#real    351m30.313s
#user    8294m28.727s
#sys    16m26.575s

Similar searches were performed with barley sequences and the results then compared with help from a Perl script which required aligments to be 5% longer/shorter to be considered significantly loger or shorter, respectively. The total BLASTX alignments were 44,970 for A.thaliana and 32,887 for barley:

diamond_strategy                 matched same_hit  same_length longer  shorter
bur-0.diamond.tsv                  0.935    0.895        0.921  0.031    0.047
bur-0.diamond.sensitive.tsv        0.973    0.902        0.919  0.037    0.044
bur-0.diamond.more.tsv             0.973    0.902        0.918  0.037    0.045

SBCC073_fLF.diamond.tsv            0.889    0.807        0.852  0.071    0.077
SBCC073_fLF.diamond.sensitive.tsv  0.960    0.831        0.856  0.076    0.067
SBCC073_fLF.diamond.more.tsv       0.961    0.831        0.856  0.076    0.067

All in all, our tests suggest a reduction in computing time of 2 orders of magnitude with a 4% loss of sensitivity and alignments to protein sequences of the same length in most cases.

See you,
Bruno