#!/perl/bioinfo: ORF finding

November 7, 2016

DIAMOND as a replacement for BLASTX

Hi,
about a year ago my colleague Javier Tamames told me about a piece of software called DIAMOND, which can be thought of as a replacement for some BLAST tools, particularly BLASTP and BLASTX.

Authors' benchmark of DIAMOND, taken from http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html

Recently we tested the BLASTX alternative in-house, and here I summarize the results of finding protein-coding segments in transcripts from Arabidopsis thaliana (n=67,259) and barley (Hordeum vulgare, n=76,362) by comparing them to Swissprot, downloaded from ftp.uniprot.org.

You can get or build DIAMOND from https://github.com/bbuchfink/diamond.

First, we had to format Swissprot to support searches:

$ makeblastdb -in uniprot_sprot.fasta -dbtype prot                   # produces 3 files
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta   # produces 1 file

Then we could run the actual nucleotide-to-peptide sequence searches, allocating 30 CPU cores. Note that the files bur-0.fasta and SBCC073_fLF.fasta contain the A. thaliana and barley transcripts, respectively:

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.tsv \
  --max-target-seqs 1 --evalue 0.00001 \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp sseq 
#Total time = 36.2444s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.sensitive.tsv \
  --sensitive ...

#Total time = 144.636s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.more.tsv \
  --more-sensitive ...

#Total time = 194.35s

$ time ncbi-blast-2.2.30+/bin/blastx -num_threads 30 -db ~/db/swissprot \
  -query bur-0.fasta -out bur-0.blastx.tsv \
  -max_target_seqs 1 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'
#real    351m30.313s
#user    8294m28.727s
#sys    16m26.575s
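
Both programs write the same 14 tab-separated columns here, because `std` in the BLAST+ format string stands for the twelve default columns that the DIAMOND command lists explicitly. A minimal Python sketch for loading such a file could look like this. The function names are made up for illustration, and it assumes that the first line reported for a query is the one to keep:

```python
# Column names copied from the --outfmt line of the DIAMOND command.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen qstart qend "
           "sstart send evalue bitscore qcovhsp sseq").split()

INT_COLS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLS = {"pident", "evalue", "bitscore", "qcovhsp"}


def parse_hit(line):
    """Turn one tab-separated output line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    hit = {}
    for name, value in zip(COLUMNS, fields):
        if name in INT_COLS:
            hit[name] = int(value)
        elif name in FLOAT_COLS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


def read_best_hits(lines):
    """Map each query id to its first reported alignment (an assumption)."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        hit = parse_hit(line)
        best.setdefault(hit["qseqid"], hit)
    return best


if __name__ == "__main__":
    # Invented example line, not taken from the real results.
    example = "tx1\tsp|P00001|X\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-50\t400.2\t95\tMKV\n"
    print(read_best_hits([example])["tx1"]["length"])
```

With real data one would pass an open file handle, for example `read_best_hits(open("bur-0.diamond.tsv"))`.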

Similar searches were performed with the barley sequences, and the results were then compared with the help of a Perl script, which required alignments to be 5% longer or shorter to be considered significantly longer or shorter, respectively. BLASTX produced a total of 44,970 alignments for A. thaliana and 32,887 for barley:

diamond_strategy                 matched same_hit  same_length longer  shorter
bur-0.diamond.tsv                  0.935    0.895        0.921  0.031    0.047
bur-0.diamond.sensitive.tsv        0.973    0.902        0.919  0.037    0.044
bur-0.diamond.more.tsv             0.973    0.902        0.918  0.037    0.045

SBCC073_fLF.diamond.tsv            0.889    0.807        0.852  0.071    0.077
SBCC073_fLF.diamond.sensitive.tsv  0.960    0.831        0.856  0.076    0.067
SBCC073_fLF.diamond.more.tsv       0.961    0.831        0.856  0.076    0.067
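
The Perl script itself is not shown, so here is only a rough Python sketch of how such a comparison could be done. It rests on several assumptions of mine: BLASTX is taken as the reference, the `length` column is what gets compared, `matched` is relative to all BLASTX alignments, and the remaining columns are relative to the matched queries (which would fit the last three columns adding up to roughly 1). All names are hypothetical:

```python
def classify_length(diamond_len, blastx_len, tolerance=0.05):
    """Label a DIAMOND alignment relative to the BLASTX one for the same query."""
    if diamond_len > blastx_len * (1 + tolerance):
        return "longer"
    if diamond_len < blastx_len * (1 - tolerance):
        return "shorter"
    return "same_length"


def compare(blastx, diamond, tolerance=0.05):
    """Summarize agreement between two result sets.

    Both arguments map a query id to a (subject id, alignment length) tuple.
    """
    counts = {"matched": 0, "same_hit": 0,
              "same_length": 0, "longer": 0, "shorter": 0}
    for query, (b_subject, b_len) in blastx.items():
        if query not in diamond:
            continue  # BLASTX found an alignment, DIAMOND did not
        d_subject, d_len = diamond[query]
        counts["matched"] += 1
        if d_subject == b_subject:
            counts["same_hit"] += 1
        counts[classify_length(d_len, b_len, tolerance)] += 1

    total = len(blastx)
    matched = counts["matched"]
    fractions = {"matched": matched / total if total else 0.0}
    for key in ("same_hit", "same_length", "longer", "shorter"):
        # Assumption: these fractions are relative to the matched queries.
        fractions[key] = counts[key] / matched if matched else 0.0
    return fractions


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    blastx = {"q1": ("A", 100), "q2": ("B", 100), "q3": ("C", 100), "q4": ("D", 100)}
    diamond = {"q1": ("A", 102), "q2": ("X", 110), "q3": ("C", 90)}
    for name, value in compare(blastx, diamond).items():
        print(f"{name:12s} {value:.3f}")
```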

All in all, our tests suggest a reduction in computing time of about two orders of magnitude, with a loss of sensitivity of about 3-4% in the sensitive modes (about 7-11% in the default mode), and alignments of the same length as those of BLASTX in most cases.
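
The speed-up can be checked directly from the timings reported above for the A. thaliana transcripts. This small Python sketch assumes that DIAMOND's "Total time" is wall-clock time and therefore comparable to the `real` value printed by `time`; the function names are made up:

```python
import math
import re


def time_to_seconds(text):
    """Convert a shell `time` value such as '351m30.313s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedups(reference_seconds, timings):
    """Return how many times faster each run is than the reference."""
    return {mode: reference_seconds / seconds for mode, seconds in timings.items()}


# Timings copied from the runs above (A. thaliana, 30 cores).
BLASTX_REAL = "351m30.313s"
DIAMOND_SECONDS = {
    "default": 36.2444,
    "--sensitive": 144.636,
    "--more-sensitive": 194.35,
}

if __name__ == "__main__":
    reference = time_to_seconds(BLASTX_REAL)
    for mode, factor in speedups(reference, DIAMOND_SECONDS).items():
        print(f"{mode:17s} {factor:6.0f}x  (10^{math.log10(factor):.1f})")
```

Under that assumption the factors range from a bit over 100-fold for the most sensitive mode to nearly 600-fold for the default mode.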

See you,
Bruno