#!/perl/bioinfo: hmmpfam

February 8, 2012

HMMER 3.0, HMMER 2.3.2 or PfamScan?

The last time I annotated Pfam domains in the footprintDB database, I used the hmmpfam program from the HMMER 2.3.2 software package. Since then many things have changed: Pfam has moved from version 23 to 26, and the current HMM file can't be used directly with HMMER 2. It first needs to be converted with a simple HMMER 3 tool (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

HMMER 3 is a nice software tool that is hundreds of times faster than its predecessor: on my quad-core computer it takes 20 minutes to do the same calculation that took about 2 hours on a 28-node cluster.

So I have decided to move to modern times, but cautiously, because the last time I tried HMMER 3 I did not get the results I wanted. I have therefore run my own benchmark, which I explain here.

First, I downloaded a test set of protein sequences from 3Dfootprint. As I work with DNA-binding proteins, I downloaded all of them from this archive (currently 2007 FASTA sequences).
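A quick sanity check on a downloaded set like this one is to count its FASTA header lines. The following is only a minimal sketch in Python; the function name and the identifiers in the example are made up for illustration.

```python
import io

def count_fasta_records(handle):
    """Count sequences in a FASTA stream by counting lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example; identifiers and residues are invented
example = io.StringIO(">seq_1\nMKTAYIAK\n>seq_2\nGSHMLEDP\nAAKK\n")
print(count_fasta_records(example))
```

For a real file, the same function can be called on an open file handle, for example `with open("protein_sequence_complexes.faa") as fh: print(count_fasta_records(fh))`.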

To calculate the Pfam domains I used the latest version of HMMER 3.0 from http://hmmer.janelia.org/software, my old version of HMMER 2.3.2 (something similar can be found here: http://hmmer.janelia.org/software/archive) and the pfam_scan.pl script used by the Pfam team at the Sanger Institute to create their database. I also downloaded the latest Hidden Markov Models from Pfam version 26 and prepared them for use with HMMER 3 (hmmpress Pfam-A.hmm) and with HMMER 2 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

Then I tested the 3 programs, with and without thresholds in the HMMER options:
HMMER 2 with thresholds: hmmpfam --acc --cut_ga Pfam_ls_26 protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 2 default: hmmpfam --acc Pfam_ls_26 protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 3 with thresholds: hmmscan --acc --cpu 8 --notextw --cut_ga -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
HMMER 3 default: hmmscan --acc --cpu 8 --notextw -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
PfamScan (default thresholds): pfam_scan.pl -align -cpu 8 -hmm Pfam-A.hmm -fasta protein_sequence_complexes.faa -outfile protein_sequence_complexes.pfamscan
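The commands above write the human-readable report with -o. To count domains per family afterwards, one option is to also ask hmmscan for its tabular per-domain output (the --domtblout option) and parse that instead. The sketch below assumes such a table is available; the function name is hypothetical and the example rows are invented, not real results.

```python
from collections import Counter

def count_domains_per_family(domtbl_lines):
    """Count reported domains per family in hmmscan --domtblout output."""
    counts = Counter()
    for line in domtbl_lines:
        # Skip blank lines and the '#' header/comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        family = fields[0]   # column 1: target (HMM) name when produced by hmmscan
        counts[family] += 1  # every remaining row describes one domain hit
    return counts

# Invented rows for illustration only (truncated to the first columns)
example = [
    "# target name  accession  tlen  query name ...",
    "FamilyA  PF00000.1  23  query_1  -  124",
    "FamilyA  PF00000.1  23  query_1  -  124",
    "FamilyB  PF99999.1  57  query_2  -  210",
]
for family, n in sorted(count_domains_per_family(example).items()):
    print(family, n)
```

Counting rows rather than distinct queries is deliberate here, because the benchmark compares the number of domains retrieved, and one protein can carry several copies of the same domain.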

My final conclusion is that I'll use HMMER 3 with thresholds, because the calculation is 200 times faster than with HMMER 2 (Figure 1) and both retrieve more or less the same number of domains for the main transcription factor families (Figure 2).

Remarkably, HMMER 3 without thresholds is far more sensitive, detecting nearly twice as many domains as the other methods (Figure 1), but most of them are undesired domains that interfere with the identification of the important ones.
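To make the effect of the thresholds concrete: with --cut_ga, HMMER uses the gathering (GA) bit scores stored in each model, one for the whole sequence and one for each domain. The sketch below reads the GA lines from an HMMER 3 format HMM file and checks a hit against them. It is only an illustration under assumptions: the function names are hypothetical, the model in the example is invented, and the real filtering is done inside hmmscan itself.

```python
def read_gathering_thresholds(hmm_lines):
    """Map model NAME -> (sequence GA, domain GA) from HMMER 3 format HMM text."""
    thresholds = {}
    name = None
    for line in hmm_lines:
        if line.startswith("NAME"):
            name = line.split()[1]
        elif line.startswith("GA") and name is not None:
            # Pfam files end the GA line with ';', so strip it before parsing
            values = line.replace(";", " ").split()[1:3]
            thresholds[name] = (float(values[0]), float(values[1]))
        elif line.startswith("//"):
            name = None  # end of the current model record
    return thresholds

def passes_ga(name, seq_score, dom_score, thresholds):
    """True if both bit scores reach the model's gathering thresholds."""
    seq_ga, dom_ga = thresholds[name]
    return seq_score >= seq_ga and dom_score >= dom_ga

# Invented model header, for illustration only
example_hmm = [
    "NAME  ExampleFam",
    "ACC   PF00000.1",
    "GA    21.50 20.00;",
    "//",
]
ga = read_gathering_thresholds(example_hmm)
print(passes_ga("ExampleFam", 35.2, 18.0, ga))  # domain score is below its GA
```

A hit like the one in the example would appear in the default output but be dropped under --cut_ga, which is the kind of extra domain that inflates the counts when no thresholds are used.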

HMMER 2 results with and without thresholds are comparable (Figure 1). Both runs detect most of the transcription factor domains (Figure 2), even slightly more than HMMER 3 with thresholds, and HMMER 2 does not report spurious domains even when run without thresholds (Figure 1).

PfamScan detects fewer domains than the other methods (Figure 1), although it uses HMMER 3 internally. This is partly because it doesn't annotate overlapping domains, but also because it has very strict thresholds that in many cases fail to detect real transcription factor domains (Figure 2). We noticed this problem in a recent study of the transcription factor YY1 (1UBD chain C): searching its sequence on the Pfam web server returns only 3 of the 4 real zinc finger domains, and the 4th has to be found among the 'insignificant Pfam-A Matches'.
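The overlap effect can be illustrated with a simple greedy filter: sort the hits by score and keep a hit only if it does not overlap one already kept. This is a simplified sketch, not the pfam_scan.pl code. As far as I know, PfamScan only resolves overlaps between families of the same clan, which this example ignores. The function name and the hits are invented.

```python
def remove_overlaps(hits):
    """Keep the best-scoring non-overlapping hits.

    hits: list of (family, start, end, score) with inclusive coordinates.
    """
    kept = []
    # Visit hits from the highest to the lowest bit score
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        _, start, end, _ = hit
        # Keep the hit only if it lies entirely before or after every kept hit
        if all(end < k_start or start > k_end for _, k_start, k_end, _ in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h[1])  # report in sequence order

# Invented hits: the first two overlap, so only the better one survives
example_hits = [
    ("FamilyA", 1, 50, 30.0),
    ("FamilyB", 40, 80, 45.0),
    ("FamilyC", 90, 120, 10.0),
]
for hit in remove_overlaps(example_hits):
    print(hit)
```

Any filter of this kind reduces the number of reported domains compared with plain hmmscan output, even before stricter thresholds come into play.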

I hope these results help people who, like me, are hesitating between moving to HMMER 3, using PfamScan or continuing with HMMER 2.

Figure 1. Statistics of several parameters with the 5 calculation methods.

Figure 2. Number of retrieved domains for different transcription factor families with the different methods.