#!/perl/bioinfo

March 23, 2012

Correlation between transcriptome and proteome

Hi,
a question that already caused debate in the days of RNA chips (microarrays), and that has recently resurfaced, is to what extent we can get an idea of cellular activity if we only measure gene expression as mRNAs, when in reality many functions are carried out by proteins.
For example, Tian et al. (2004) reported correlations of 40% in mammalian hematopoietic cell lines.

However, in many fields microarrays are now being replaced by RNAseq experiments, so it is pertinent to ask the question again, given that this technology is expected to be superior. Two very recent papers help us assess precisely this.

The work of Jiang et al. (2011), which proposes using controls to correct the expression levels measured by RNAseq, includes a figure summarizing the typical distribution of gene expression values obtained by RNAseq:

Original figure from Jiang et al. (2011). Panel A shows the equivalence between 'copies per cell' and 'fragments per kilobase per million mapped fragments' (FPKM). Panel B shows the histogram of expressed genes, which overlaps on the left (a shoulder) with the values of genes expressed at less than one copy per cell. The data were obtained from material extracted from a Drosophila melanogaster cell line.

The work of Nagaraj et al. (2011), based on material from the well-known human cell line HeLa, finds a similar distribution of mRNAs, which suggests that it may be a standard distribution, and additionally correlates it with peptide abundance values measured by high-resolution mass spectrometry. The following figure summarizes a lot of experimental data:

Original figure from Nagaraj et al. (2011).

Panel A indeed reproduces the shoulder observed in the D. melanogaster RNAseq data, and panel B shows the equivalent proteomics data. The Venn diagram in C is interesting, because it shows that many messengers are not observed as protein, and that there are a few proteins whose messenger is absent. Panel D shows the functional distribution of the expressed genes and, finally, the scatter plot in E shows the correlation observed between both types of measurement, with a Spearman correlation coefficient of 0.6.
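For readers who have not computed it before, a Spearman coefficient is simply a Pearson correlation calculated on ranks instead of raw values. Here is a minimal Perl sketch; the numbers are toy values invented for the example (they are NOT data from Nagaraj et al.) and the subroutine names are my own:

```perl
use strict;
use warnings;

# Toy values, invented for illustration: hypothetical mRNA (FPKM) and
# protein abundances for six genes.
my @mrna    = (0.5, 2.0, 8.0, 15.0, 40.0, 120.0);
my @protein = (1e3, 5e2, 2e4, 9e3, 3e5, 1e5);

# Convert values to ranks (1 = smallest); tied values share their average rank.
sub ranks {
    my @v = @_;
    my @order = sort { $v[$a] <=> $v[$b] } 0 .. $#v;
    my @rank;
    my $i = 0;
    while ($i < @order) {
        my $j = $i;
        $j++ while $j < $#order && $v[ $order[$j + 1] ] == $v[ $order[$i] ];
        my $avg = ($i + $j) / 2 + 1;
        $rank[ $order[$_] ] = $avg for $i .. $j;
        $i = $j + 1;
    }
    return @rank;
}

# Pearson correlation of two equally long lists (array references).
sub pearson {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @$x;
    $mean_y += $_ / $n for @$y;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $k (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$k] - $mean_x, $y->[$k] - $mean_y);
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    return $sxy / sqrt($sxx * $syy);
}

# Spearman = Pearson on the ranks.
sub spearman {
    my ($x, $y) = @_;
    return pearson([ ranks(@$x) ], [ ranks(@$y) ]);
}

printf "Spearman rho = %.3f\n", spearman(\@mrna, \@protein);
```

Because only ranks matter, the coefficient is unaffected by the very different scales (and log transformations) of the two kinds of measurement.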
regards,
Bruno

Later updates:
[1] This work studies how expression levels are inherited in the fly
[2] This article reports that orthologous genes from different organisms show higher correlations in their protein levels ([protein]) than in their mRNA levels ([mRNA])

March 6, 2012

Postdoc openings in phylogenomics

Hi,
I'm copying here two job offers that reached me through bioinformatibericos:

Postdoctoral Position in Evolutionary Genomics at IFCA (Universidad of Santander-CSIC)

A postdoctoral position is available immediately to work with Rafael Zardoya (Madrid), Julio Rozas (Barcelona), David Posada (Vigo), and Jesus Marco (Santander) on a common project related to gastropod phylogenomics. The position has a term of 1 year and mobility among the referred labs.

Prerequisites: A PhD in Biology, Chemistry, Computer Science or related fields, with proven skills in computational biology.

Priority will be given to candidates with previous experience in NGS data analysis (assembling, gene annotation, etc), and with expertise in bioinformatics programming languages (Perl, Python, C, etc).

Some experience with Phylogenetics and/or Comparative Genomics or evolutionary biology in general, and with Relational databases (MySQL), is desirable.

If interested please contact Rafael Zardoya, Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales, rafaz@mncn.csic.es. Application review will begin on March 5, 2012 and continue until the position is filled. To apply, please send the following:

1. A curriculum vitae

2. Names of 2 referees willing to provide a letter of recommendation upon request

3. A brief statement of research interests and goals.

Postdoc on chromatin biophysics (Structural Genomics Group, National Center for Genomic Analysis)

The successful candidate will team with other members of our group in our efforts towards elucidating the 3D structure of genomic domains and genomes. Recent publications from the group in this field include:

- Umbarger, M. A., Toro, E., Wright, M. A., Porreca, G. J., Baù, D., Hong, S.-H., Fero, M. J., et al. (2011). Molecular Cell, 44(2), 252–264

- Marti-Renom, M. A., & Mirny, L. A. (2011). PLoS Computational Biology, 7(7), e1002125

- Sanyal, A., Baù, D., Martí-Renom, M. A., & Dekker, J. (2011). Current Opinion in Cell Biology. doi:10.1016/j.ceb.2011.03.009

- Baù, D., Sanyal, A., Lajoie, B. R., Capriotti, E., Byron, M., Lawrence, J. B., Dekker, J., et al. (2011). Nature Structural & Molecular Biology, 18(1), 107–114.

I would appreciate it if you could forward this announcement to anyone who may be interested,

Marc A. Marti-Renom, Group Leader

Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona

Tel +34 934 033 743 Fax +34 934 037 279

email mmarti@pcb.ub.cat web http://marciuslab.org & http://cnag.cat

February 28, 2012

Reading the Ramachandran plot

Hi,
last week I set my students of the Licenciatura en Ciencias Genómicas the problem of assigning secondary structure to the residues of a protein, once the dihedral angles psi and phi of the peptide backbone have been measured. For this task the logical choice is to use the Ramachandran plot, such as this one taken from Wikipedia:

The naive solution consists of delimiting the regions of the plot by means of conditions such as 'if phi > X AND psi < Y' then the residue is in state Z. However, to be accurate we would need many conditions like this one, given the irregular shapes in the plot.
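A minimal sketch of this naive approach could look as follows. It uses just two rectangular regions whose limits are rough guesses of mine rather than curated boundaries, and the names @RULES and naive_ss are made up for the example:

```perl
use strict;
use warnings;

# Each rule is [state, phi_min, phi_max, psi_min, psi_max] in degrees.
# The limits are rough, illustrative guesses, not curated boundaries.
my @RULES = (
    [ 'H', -160, -20, -120,  50 ],   # right-handed helical region
    [ 'E', -180, -45,   90, 180 ],   # beta-strand region
);

# Return the state of the first box that contains the (phi,psi) pair.
sub naive_ss {
    my ($phi, $psi) = @_;
    for my $rule (@RULES) {
        my ($state, $phi_min, $phi_max, $psi_min, $psi_max) = @$rule;
        return $state
            if $phi >= $phi_min && $phi <= $phi_max
            && $psi >= $psi_min && $psi <= $psi_max;
    }
    return 'C';   # anything outside the boxes is called coil
}

printf "%4d %4d -> %s\n", @$_, naive_ss(@$_)
    for ([-60, -45], [-120, 130], [60, 40]);
```

Following the irregular outlines properly would mean adding many more boxes to @RULES, which is precisely the drawback of this approach.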

An interesting solution, proposed by F. Peñaloza, consists of actually reading the map, as we do ourselves. What he did was to obtain a square Ramachandran plot with a few known colors,

Ramachandran plot in RGB by F. Peñaloza.

Previous plot with the RGB colors used

and then convert it to ASCII text (with something like text-image). With this map in memory, it is easy to obtain the secondary structure for given phi,psi coordinates.
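To illustrate the text-map idea, here is a small sketch of my own (not F. Peñaloza's code). The 12x12 map is a hypothetical toy with one letter per 30x30 degree cell, drawn by eye; a real map would come from the converted image and be much finer. The names @MAP and ss_from_map are assumptions made for the example:

```perl
use strict;
use warnings;

# Hypothetical 12x12 text map, one character per 30x30 degree cell.
# Rows run from psi=+180 (top) to psi=-180 (bottom), columns from
# phi=-180 (left) to phi=+180 (right). Letters placed by eye, for
# illustration only.
my @MAP = map { [ split //, $_ ] } qw(
    EEEEECCCCCCC
    EEEEECCCCCCC
    EEEEECCCCCCC
    EEEECCCCCCCC
    CCCCCCCCCCCC
    CHHHHCCCCCCC
    CHHHHCCCCCCC
    CHHHHCCCCCCC
    CHHHCCCCCCCC
    CCCCCCCCCCCC
    CCCCCCCCCCCC
    EEEECCCCCCCC
);

# Look up the map cell that contains the (phi,psi) pair, in degrees.
sub ss_from_map {
    my ($phi, $psi) = @_;
    die "# valid psi/phi are [-180,+180]\n"
        if abs($phi) > 180 || abs($psi) > 180;
    my $nrows = @MAP;
    my $ncols = @{ $MAP[0] };
    # rows are numbered from the top, so psi is flipped
    my $row = int( (180 - $psi) * $nrows / 360 );
    my $col = int( ($phi + 180) * $ncols / 360 );
    $row = $nrows - 1 if $row > $nrows - 1;   # psi = -180 edge
    $col = $ncols - 1 if $col > $ncols - 1;   # phi = +180 edge
    return $MAP[$row][$col];
}

printf "%4d %4d -> %s\n", @$_, ss_from_map(@$_)
    for ([-65, -40], [-120, 135], [60, 40]);
```

The resolution of the answer is that of the map, so a finer grid gives more faithful region borders at no extra cost in code.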

Taking advantage of Perl's GD module we can read the image directly, as the following code shows:

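 use strict;
 use warnings;
 use GD;

 # square Ramachandran plot painted with a few known RGB colors
 my $RAMAPLOT = 'ramachandran.png';
 my $plot = GD::Image->newFromPng($RAMAPLOT) || die "# cannot read $RAMAPLOT\n";
 my ($maxphi,$maxpsi) = checkPlot($plot); # width (phi axis), height (psi axis)

 my $inputPSI = 160;
 my $inputPHI = -45;
 printf("The secondary structure state for psi=%s,phi=%s is: %s\n",
    $inputPSI,$inputPHI, getSimpleSS($maxpsi,$maxphi, $inputPSI,$inputPHI));

 # returns E (strand), H (helix) or C (coil) for a pair of angles in degrees
 sub getSimpleSS
 {
   my ($maxpsi,$maxphi,$psi,$phi) = @_;
   my %colorSS = ( # RGB color in the plot => secondary structure state
     '236,30,36','E',
     '164,74,164','E',
     '156,218,236','H',
     '4,162,236','H',
     #'180,230,28','L'
     );

   if($psi > 180 || $psi < -180 || $phi > 180 || $phi < -180)
   {
     die "# valid psi/phi are [-180,+180]\n";
   }

   # scale angles to pixels: phi grows left to right, psi grows bottom to top,
   # whereas pixel rows are numbered from the top
   my $y = int( (0.5 - (0.5 * $psi/180)) * $maxpsi );
   my $x = int( (0.5 + (0.5 * $phi/180)) * $maxphi );

   # psi=-180 and phi=180 would fall one pixel outside the image
   if($y > $maxpsi-1){ $y = $maxpsi-1 }
   if($x > $maxphi-1){ $x = $maxphi-1 }

   my @RGB = $plot->rgb( $plot->getPixel($x,$y) );
   return $colorSS{"$RGB[0],$RGB[1],$RGB[2]"} || 'C';
 }

 # returns (width,height) of the plot, optionally printing all pixel colors
 sub checkPlot
 {
   my ($plot,$print_pixels) = @_;
   my ($maxphi,$maxpsi) = $plot->getBounds();
   if(!$print_pixels){ return ($maxphi,$maxpsi) }

   foreach my $phi ( 0 .. $maxphi-1 )
   {
     foreach my $psi ( 0 .. $maxpsi-1 )
     {
       my @RGB = $plot->rgb( $plot->getPixel($phi,$psi));
       print "$RGB[0] $RGB[1] $RGB[2]\n";
     }
   }
   return ($maxphi,$maxpsi);
 }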

Regards, Bruno

February 8, 2012

HMMER 3.0, HMMER 2.3.2 or PfamScan?

The last time I annotated Pfam domains for the footprintDB database, I used the program hmmpfam from the HMMER 2.3.2 software package. Many things have changed since then: Pfam has moved from version 23 to 26, and the current HMM file can't be used directly with HMMER 2; it first has to be converted with a simple HMMER 3 tool (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

HMMER 3 is a nice software tool that is hundreds of times faster than its predecessor: on my quad-core computer it takes 20 minutes to run the same calculation that took about 2 hours on a 28-node cluster.

So I have decided to move with the times, but cautiously, because the last time I tried HMMER 3 I got undesired results. I have therefore run my own benchmark, which I explain below.

First, I downloaded a test set of protein sequences from 3Dfootprint. As I work with DNA-binding proteins, I downloaded all of them from this archive (currently 2007 FASTA sequences).

To calculate the Pfam domains I used the latest version of HMMER 3.0 from http://hmmer.janelia.org/software, my old version of HMMER 2.3.2 (something similar can be found here: http://hmmer.janelia.org/software/archive) and the pfam_scan.pl script used by the Pfam team at the Sanger Institute to create their database. I also downloaded the latest Hidden Markov Models from Pfam version 26 and prepared them for use with HMMER 3 (hmmpress Pfam-A.hmm) and with HMMER 2 (hmmconvert -2 Pfam-A.hmm > Pfam_ls_26).

Then I tested the 3 programs, with and without thresholds in the HMMER options (HMMER 2 reads the converted file Pfam_ls_26):
HMMER 2 with thresholds: hmmpfam --acc --cut_ga Pfam_ls_26 protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 2 default: hmmpfam --acc Pfam_ls_26 protein_sequence_complexes.faa > protein_sequence_complexes.hmmscan
HMMER 3 with thresholds: hmmscan --acc --cpu 8 --notextw --cut_ga -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
HMMER 3 default: hmmscan --acc --cpu 8 --notextw -o protein_sequence_complexes.hmmscan Pfam-A.hmm protein_sequence_complexes.faa
PfamScan (default thresholds): pfam_scan.pl -align -cpu 8 -hmm Pfam-A.hmm -fasta protein_sequence_complexes.faa -outfile protein_sequence_complexes.pfamscan
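To count the domains found for each sequence, one option (not what the commands above do, since they write the human-readable report with -o) is to also ask hmmscan for its tabular per-domain output with --domtblout and parse that. The sketch below assumes that format: whitespace-separated columns with the target name first, the query name fourth and the independent E-value thirteenth. The example lines and the name count_domains are invented for illustration, and the E-value filter is a simpler criterion than the gathering thresholds applied by --cut_ga:

```perl
use strict;
use warnings;

# Invented example lines laid out like hmmscan --domtblout output:
# 22 whitespace-separated columns followed by a free-text description.
my $DOMTBL = <<'END';
# target name accession tlen query name accession qlen E-value score bias ...
zf-C2H2 PF00096 23 seqA - 120 1.0e-20 70.0 0.5 1 3 1.0e-06 4.0e-04 20.0 0.1 1 23 10 32 10 32 0.95 Zinc finger, C2H2 type
zf-C2H2 PF00096 23 seqA - 120 1.0e-20 70.0 0.5 2 3 1.0e-08 2.0e-06 28.0 0.2 1 23 40 62 40 62 0.97 Zinc finger, C2H2 type
zf-C2H2 PF00096 23 seqA - 120 1.0e-20 70.0 0.5 3 3 0.002 0.5 8.0 0.0 2 20 70 88 69 90 0.80 Zinc finger, C2H2 type
Homeobox PF00046 57 seqB - 80 1.0e-18 65.0 0.1 1 1 1.0e-19 1.0e-15 64.0 0.1 1 57 12 68 12 68 0.99 Homeobox domain
END

# Count domains per query and family, optionally keeping only those
# whose independent E-value is not above $max_ievalue.
sub count_domains {
    my ($text, $max_ievalue) = @_;
    my %count;   # $count{query}{family} = number of domains kept
    for my $line (split /\n/, $text) {
        next if $line =~ /^#/ || $line !~ /\S/;   # skip comments and blanks
        my @col = split ' ', $line, 23;           # keep description in one piece
        my ($family, $query, $ievalue) = @col[0, 3, 12];
        next if defined $max_ievalue && $ievalue > $max_ievalue;
        $count{$query}{$family}++;
    }
    return \%count;
}

my $kept = count_domains($DOMTBL, 0.01);
for my $query (sort keys %$kept) {
    for my $family (sort keys %{ $kept->{$query} }) {
        print "$query\t$family\t$kept->{$query}{$family}\n";
    }
}
```

Calling count_domains without the second argument counts every reported domain, which is handy for comparing runs with and without thresholds.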

The final conclusion is that I'll use HMMER 3 with thresholds, because it runs 200 times faster than HMMER 2 (Figure 1) and both retrieve roughly the same number of domains for the main transcription factor families (Figure 2).

It's remarkable that HMMER 3 without thresholds is much more sensitive, detecting nearly twice as many domains as the other methods (Figure 1), but most of them are undesired domains that interfere with the identification of the important ones.

HMMER 2 results with and without thresholds are comparable (Figure 1): both detect most of the transcription factor domains (Figure 2), even slightly more than HMMER 3 with thresholds, and neither includes spurious domains, not even the run without thresholds (Figure 1).

PfamScan detects fewer domains than the other methods (Figure 1), even though it uses HMMER 3 internally. This is partly because it doesn't annotate overlapping domains, but also because its very strict thresholds fail in many cases to detect real transcription factor domains (Figure 2). We noticed this problem in a recent study of the transcription factor YY1 (1UBD chain C): if we search its sequence in the Pfam web server we obtain only 3 of the 4 real zinc finger domains, and the 4th has to be found among the 'insignificant Pfam-A Matches'.

I hope these results help people who, like me, are hesitating between moving to HMMER 3, using PfamScan or staying with HMMER 2.

Figure 1. Statistics of several parameters with the 5 calculation methods.

Figure 2. Number of retrieved domains for different transcription factor families with the different methods.

January 31, 2012

BIFI 2012 - V International Conference - Protein Targets: Discovery of Bioactive Compounds

The V International Conference of the Institute for Biocomputation and Physics of Complex Systems (BIFI) will be held in Zaragoza on February 1-4, 2012.

The meeting will be an international conference on Drug Discovery from a protein perspective, covering most of the initial steps in drug discovery, from early laboratory research to preclinical studies (new protein targets, protein target validation, new methodologies and tools for structural and functional characterization, experimental and computational high-throughput screening, etc.). We wish the conference to represent a venue for gathering active researchers on drug discovery, with strong roots in the scientific and academic communities, to discuss recent developments and future challenges in the field.

Our laboratory will participate in the event with a talk titled "Protein-DNA interface prediction techniques: performance and potential in protein engineering" and a poster titled "In vivo DNA binding pattern of Rex-1 in mouse embryonic stem cells", in collaboration with the Veterinary Department of the University of Zaragoza.