Mostrando entradas con la etiqueta proteins. Mostrar todas las entradas
Mostrando entradas con la etiqueta proteins. Mostrar todas las entradas

8 de febrero de 2012

HMMER 3.0, HMMER 2.3.2 or PfamScan?

Last time that I annotated Pfam domains into footprintDB database I used the program hmmpfam from the HMMER 2.3.2 software package. But now, many things have changed, Pfam version has moved from 23 to 26, and the current HMM file can't be used directly with HMMER 2, it needs to be converted with a simple tool of HMMER 3 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

HMMER 3 is a nice software tool that is hundreds of times faster than its predecesor, it takes 20 minutes in my Quad-Core computer the same calculation that took like 2 hours in a 28 node cluster.

So I have decided to move to modern times, but cautiously, because last time I tried HMMER 3 I had not wanted results, so I have done my own benchmark that I'm going to explain...

First, I downloaded a test set of protein sequences from 3Dfootprint, as I work with DNA binding proteins, I downloaded all of them from this archive (currently 2007 FASTA sequences).

To calculate the Pfam domains I used the last version of HMMER 3.0 from http://hmmer.janelia.org/software, my old version of HMMER 2.3.2 (something similar can be found here: http://hmmer.janelia.org/software/archive) and pfam_scan.pl script used by Pfam team to create their database in the Sanger Institute. Also I downloaded the last version of Hidden Markov Models from Pfam version 26 and converted it to use with HMMER 3 (hmmpress Pfam-A.hmm) and with HMMER 2 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

Then I started the testing with the 3 programs, with and without using thresholds in the HMMER param options:
HMMER 2 with thresholds: hmmpfam --acc --cut_ga Pfam-A.hmm protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 2 default: hmmpfam --acc Pfam-A.hmm protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 3 with thresholds:  hmmscan --acc --cpu 8 --notextw --cut_ga -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
HMMER 3 default: hmmscan --acc --cpu 8 --notextw -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
PfamScan (default thresholds): pfam_scan.pl -align -cpu 8  -hmm Pfam-A.hmm -fasta protein_sequence_complexes.faa -outfile protein_sequence_complexes.pfamscan

The final conclusion is that I'll use HMMER 3 with thresholds, it's because the calculation time is 200 times faster that HMMER 2 (Figure 1) and both retrive more or less the same number of domains for the main transcription factor families (Figure 2).

It's remarkable that HMMER 3 without thresholds is very much sensitive, detecting near the double number of domains than the rest of the techniques (Figure 1), but most of them are undesired domains that interfere with the identification of the important ones.

HMMER 2 results with and without thresholds are comparable (Figure 1), both of them detect most of the transcription factor domains (Figure 2), even a bit more than HMMER 3 with thresholds, without including spurious domains even without thresholds (Figure 1).

PfamScan detects less domains than the rest of the techniques (Figure 1), although it uses HMMER 3 internally, this is because it doesn't annotate overlapping domains, but also because it has very strict thresholds that in many cases fail to detect real transcription factor domains (Figure 2). We have noticed this problem in a particula recent study of the transcription factor YY1 (1UBD chain C), if we search in Pfam webserver its sequence (chain C) we obtain only 3 of the 4 real Zinc Finger domains, we must find the 4th Zinc Finger domain among the 'insignificant Pfam-A Matches'.

I hope these results help to decide to people like me dubbing among moving to HMMER 3, use PfamScan or continue using HMMER 2.

Figure 1. Statistics of several parameters with the 5 calculation methods.
Figure 2. Numer of retrieved domains for different transcription factor families with the different methods.

31 de enero de 2012

BIFI 2012 - V Congreso Internacional - Dianas proteicas: Descubrimiento de Compuestos Bioactivos

Del 1 al 4 de febrero de 2012 se celebrará en Zaragoza la 5º Congreso Internacional del Instituto de Biocomputación y Física de Sistemas Complejos (BIFI).

Este año el tema central será el Descubrimiento de Fármacos, cubriendo desde los pasos iniciales de investigación en laboratorio hasta los estudios preclínicos: nuevas dianas proteicas, validación de dianas, nuevas metodologías y herramientas de caracterización tanto estructural como funcional y cribado computacional de moléculas. La conferencia servirá de punto de encuentro de investigadores del campo del descubrimiento de fármacos, donde se discutirán los avances más recientes y retos futuros.

Nuestro laboratorio asistirá al evento con un seminario titulado "Protein-DNA interface prediction techniques: performance and potential in protein engineering" y un póster titulado: "In vivo DNA binding pattern of Rex-1 in mouse embryonic stem cells" realizado en colaboración con el Departamento de Veterinaria de la Universidad de Zaragoza.

English version:

The V International Conference of the Institute for Biocomputation and Physics of Complex Systems (BIFI) on February 1-4, 2012.

The meeting will be an international conference on Drug Discovery from a protein perspective, covering most of the initial steps in drug discovery and preclinical studies (new protein targets, protein target validation, new methodologies and tools for structural and functional characterization, experimental and computational high-throughput screening, etc.). We wish the conference to represent a venue for gathering active researchers on drug discovery, with strong roots in the scientific and academic communities to discuss recent developments and future challenges in the field.

Our laboratory will participate in the event with a talk titled "Protein-DNA interface prediction techniques: performance and potential in protein engineering" and a poster titled: "In vivo DNA binding pattern of Rex-1 in mouse embryonic stem cells" in collaboration with the Veterinary Department of the University of Zaragoza.

3 de febrero de 2011

footprintDB - online database of transcription factors and DNA binding motifs

I want to introduce you the new bioinformatic contribution of our lab to the science world: footprintDB (http://floresta.eead.csic.es/footprintdb/)

footprintDB is a database with 2905 unique DNA-binding proteins (mostly transcription factors, TFs) and 4001 DNA-binding motifs extracted from the literature and other repositories.

The binding interfaces of (most) proteins in the database are inferred from the collection of protein-DNA complexes described in 3D-footprint.

footprintDB predicts:
  1. Transcription factors which bind a specific DNA site or motif
  2. DNA motifs likely to recognised by a specific DNA-binding protein
 As summarized in the schema:


We encourage you to register in footprintDB and test it by yourself: http://floresta.eead.csic.es/footprintdb/index.php?user_register