#!/perl/bioinfo: HMMER

26 de septiembre de 2016

PHMMER como alternativa a BLASTP

Hola,
hoy quería hablar de herramientas de búsqueda de secuencias de proteína, uno de los caballos de batalla en nuestro trabajo. Es un tema ya tocado en este blog, por ejemplo cuando hablamos de deltablast y cs-blast, pero que sigue siendo de actualidad.

En esta ocasión ha sido un artículo de Saripella et al el que me lo ha vuelto a recordar, comparando algoritmos que usan perfiles de secuencia (CS-BLAST, HHSEARCH and PHMMER) con algoritmos convencionales (BLASTP, USEARCH, UBLAST and FASTA). Es uno de esos artículos aburridos pero necesarios, donde se compara por vía independiente la validez de los mejores algoritmos para esta tarea, y se guardan los tiempos de cálculo empleados por cada uno de ellos (usando un core de CPU), para anotar dominios conocidos de secuencias de SwissProt extraídos de 3 repositorios complementarios (Pfam, Superfamily y CATH):

Figura original de http://bioinformatics.oxfordjournals.org/content/32/17/2636.full.

En el artículo se calculan áreas bajo curvas ROC para caracterizar a los diferentes algoritmos. De todos ellos destacaría dos:

1) BLASTP por ser muy rápido y muy preciso, con áreas de 0.908, 0.857 y 0.878 para las 3 colecciones de dominios y el tiempo que se muestra en la gráfica para buscar 100 secuencias al azar.

2) PHMMER por ser marginalmente más lento que BLASTP pero con una ganancia en área de aproximadamente el 3% (0.922,0.903,0.903).

Además, PHMMER es muy fácil de usar. Lo descargas de http://eddylab.org/software/hmmer3/CURRENT/ y solamente necesitas un archivo FASTA con la(s) secuencia(s) problema y otro con un repositorio de secuencias como SwissProt:

$ hmmer-3.1b1-linux-intel-x86_64/src/phmmer problema.faa uniprot_sprot.fasta

Hasta pronto,
Bruno

8 de febrero de 2012

HMMER 3.0, HMMER 2.3.2 or PfamScan?

Last time that I annotated Pfam domains into footprintDB database I used the program hmmpfam from the HMMER 2.3.2 software package. But now, many things have changed, Pfam version has moved from 23 to 26, and the current HMM file can't be used directly with HMMER 2, it needs to be converted with a simple tool of HMMER 3 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

HMMER 3 is a nice software tool that is hundreds of times faster than its predecesor, it takes 20 minutes in my Quad-Core computer the same calculation that took like 2 hours in a 28 node cluster.

So I have decided to move to modern times, but cautiously, because last time I tried HMMER 3 I had not wanted results, so I have done my own benchmark that I'm going to explain...

First, I downloaded a test set of protein sequences from 3Dfootprint, as I work with DNA binding proteins, I downloaded all of them from this archive (currently 2007 FASTA sequences).

To calculate the Pfam domains I used the last version of HMMER 3.0 from http://hmmer.janelia.org/software, my old version of HMMER 2.3.2 (something similar can be found here: http://hmmer.janelia.org/software/archive) and pfam_scan.pl script used by Pfam team to create their database in the Sanger Institute. Also I downloaded the last version of Hidden Markov Models from Pfam version 26 and converted it to use with HMMER 3 (hmmpress Pfam-A.hmm) and with HMMER 2 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

Then I started the testing with the 3 programs, with and without using thresholds in the HMMER param options:
HMMER 2 with thresholds: hmmpfam --acc --cut_ga Pfam-A.hmm protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 2 default: hmmpfam --acc Pfam-A.hmm protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 3 with thresholds: hmmscan --acc --cpu 8 --notextw --cut_ga -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
HMMER 3 default: hmmscan --acc --cpu 8 --notextw -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
PfamScan (default thresholds): pfam_scan.pl -align -cpu 8 -hmm Pfam-A.hmm -fasta protein_sequence_complexes.faa -outfile protein_sequence_complexes.pfamscan

The final conclusion is that I'll use HMMER 3 with thresholds, it's because the calculation time is 200 times faster that HMMER 2 (Figure 1) and both retrive more or less the same number of domains for the main transcription factor families (Figure 2).

It's remarkable that HMMER 3 without thresholds is very much sensitive, detecting near the double number of domains than the rest of the techniques (Figure 1), but most of them are undesired domains that interfere with the identification of the important ones.

HMMER 2 results with and without thresholds are comparable (Figure 1), both of them detect most of the transcription factor domains (Figure 2), even a bit more than HMMER 3 with thresholds, without including spurious domains even without thresholds (Figure 1).

PfamScan detects less domains than the rest of the techniques (Figure 1), although it uses HMMER 3 internally, this is because it doesn't annotate overlapping domains, but also because it has very strict thresholds that in many cases fail to detect real transcription factor domains (Figure 2). We have noticed this problem in a particula recent study of the transcription factor YY1 (1UBD chain C), if we search in Pfam webserver its sequence (chain C) we obtain only 3 of the 4 real Zinc Finger domains, we must find the 4th Zinc Finger domain among the 'insignificant Pfam-A Matches'.

I hope these results help to decide to people like me dubbing among moving to HMMER 3, use PfamScan or continue using HMMER 2.

Figure 1. Statistics of several parameters with the 5 calculation methods.

Figure 2. Numer of retrieved domains for different transcription factor families with the different methods.