#!/perl/bioinfo

September 13, 2010

Jornadas Bioinformáticas JBI 2010 (10th edition): our lab will be there...

The Jornadas Bioinformáticas (Spanish Symposium on Bioinformatics) are the must-attend annual meeting for Spanish bioinformaticians. This year the tenth edition will be held on October 27-29 in Torremolinos (Málaga). It is organized by the University of Málaga, the National Institute of Bioinformatics and the Portuguese Bioinformatics Network. This year's central theme is "Bioinformatics applied to personalized medicine", with discussion of how biology, medicine and computer science can be integrated to develop more specific and effective therapies. This will not be the only topic, however; results and experiences will also be shared in other fields:
- Data analysis for high-throughput techniques such as next-generation sequencing
- Structural bioinformatics
- Algorithms for computational biology and high-performance computing techniques
- Sequence analysis, phylogenetics and evolution
- Databases, tools and technologies for computational biology
- Bioinformatics in transcriptomics and proteomics
- Synthetic and systems biology

IN ENGLISH:

The Xth Spanish Symposium on Bioinformatics (JBI2010) will take place on October 27-29, 2010 in Torremolinos-Málaga, Spain. It is co-organised by the National Institute of Bioinformatics-Spain and the Portuguese Bioinformatics Network and hosted by the University of Malaga (Spain).

This year, the reference topic is “Bioinformatics for personalized medicine” for which the conference will provide the opportunity to discuss the state of the art for the integration of the fields of biology, medicine and informatics. We invite you to submit your work and share your experiences in the following topics of interest including, but not limited to:
- Analysis of high throughput data (NGS)
- Structural Bioinformatics
- Algorithms for computational biology and HPC
- Sequence analysis, phylogenetics and evolution
- Databases, Tools and technologies for computational biology
- Bioinformatics in Transcriptomics and Proteomics
- Systems and Synthetic Biology

Our contributions

Our lab will take part in the Jornadas Bioinformáticas with three contributions, which I present below:

3D-footprint: a database for the structural analysis of protein–DNA complexes (paper)
The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs
101DNA: a set of tools for Protein-DNA interface analysis

3D-footprint: a database for the structural analysis of protein–DNA complexes
3D-footprint is a living database, updated and curated on a weekly basis, which provides estimates of binding specificity for all protein–DNA complexes available at the Protein Data Bank. The web interface allows the user to: (i) browse DNA-binding proteins by keyword; (ii) find proteins that recognize a similar DNA motif and (iii) BLAST similar DNA-binding proteins, highlighting interface residues in the resulting alignments. Each complex in the database is dissected to draw interface graphs and footprint logos, and two complementary algorithms are employed to characterize binding specificity. Moreover, oligonucleotide sequences extracted from literature abstracts are reported in order to show the range of variant sites bound by each protein and other related proteins. Benchmark experiments, including comparisons with expert-curated databases RegulonDB and TRANSFAC, support the quality of structure-based estimates of specificity. The relevant content of the database is available for download as flat files and it is also possible to use the 3D-footprint pipeline to analyze protein coordinates input by the user. 3D-footprint is available at http://floresta.eead.csic.es/3dfootprint with demo buttons and a comprehensive tutorial that illustrates the main uses of this resource.

The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs

Transcription Factors (TFs) play a key role in gene regulation by binding to DNA target sequences. While there is a vast literature describing computational methods to define patterns and match DNA regulatory motifs within genomic sequences, the prediction of DNA binding motifs (DBMs) that might be recognized by a particular TF is a relatively unexplored field. Numerous DNA-binding proteins are annotated as TFs in databases; however, for many of these orphan TFs the corresponding DBMs remain uncharacterized. Standard annotation practice transfers DBMs of well-known TFs to those orphan protein sequences which can be confidently aligned to them, usually by means of local alignment tools such as BLAST, but these predictions are known to be error-prone. With the aim of improving these predictions, we test whether the knowledge of protein-DNA interface architectures and existing TF-DNA binding experimental data can be used to generate family-wise interface substitution matrices (ISUMs). An experiment with 85 Drosophila melanogaster homeobox proteins demonstrates that ISUMs: i) capture information about the correlation between the substitution of a TF interface residue and the conservation of the DBM; ii) are valuable to evaluate TF alignments and iii) are better classifiers than generic amino-acid substitution matrices and than BLAST E-values when deciding whether two aligned homeobox proteins bind to the same DNA motif.

101DNA: a set of tools for Protein-DNA interface analysis

Analysis of protein-DNA interfaces has shown a great structural dependency. Despite the observation that related proteins tend to use the same pattern of amino acid and base contacting positions, no simple recognition code has been found. While protein contacts with the sugar-phosphate backbone of DNA provide stability and yield very little specificity information, contacts between amino acid side-chains and DNA bases (direct readout) apparently define specificity, in addition to some constraints defined by DNA sequence-dependent features, namely indirect readout.
Recent approaches have proposed bipartite graphs as a structural way of analysing interfaces from a protein-DNA-centric viewpoint. With this perspective in mind, we have developed a set of tools for the dissection and comparison of protein-DNA interfaces. Taking a protein-DNA complex file in PDB format as input, the software generates a 2D matrix that represents a bipartite graph of residue contacts obtained after applying a simple distance threshold that captures all non-covalent interactions. The generated 2D matrices allow a fast and simple visual inspection of the interface and have been successfully produced for the current non-redundant set of protein-DNA complexes in the 3D-footprint database.
As a second utility to compare two interfaces, the 101DNA software includes an alignment tool where a dynamic programming matrix is created with the Local Affine Gap algorithm and traced back as a finite-state automaton. The scores between pairs of interface amino acid residues are calculated as a function of the observed contacts with DNA nitrogen bases. This tool produces local interface alignments which are independent of the underlying protein sequence but faithfully represent the binding architecture. Preliminary tests show that these local alignments successfully identify binding interfaces that share striking similarity despite belonging to different protein superfamilies, and these observations support this graph-theory approach.

September 10, 2010

How to check whether a value is numeric with a simple regular expression

I am not feeling very inspired to write on the blog today, so I went through my Perl subroutine libraries to see whether I could find something curious to publish, and I came across this...
Who has never had to check whether an expression is numeric or not?
Perl has no function that does this automatically (at least none that I know of), but a single line of code is enough to settle the question:

 # Return TRUE if an expression is numerical
 sub is_numeric {
      my ($exp) = @_;
      # Original pattern, kept for reference:
      # /^(-?[\d.\-]*e[\d.\-\+]+|-?[\d.\-]\^[\d.\-]+|-?[\d.\-]+)$/
      # Corrected following Joaquín Ferrero's suggestions:
      if ($exp =~ /^-?(?:[\d.-]+e[\d.+-]+|\d[\d.-]*\^[\d.-]+|[\d.-]+)$/){
           return 1;
      } else {
           return 0;
      }
 }
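
As a complement, the core module Scalar::Util ships a looks_like_number function that applies Perl's own rules for what counts as a number. The following is only a sketch: the wrapper name is_numeric_core and the sample values are made up for illustration. Unlike the regular expression above, it rejects strings such as '1.2.3', but it does not understand the caret notation ('2^10') and it accepts special values such as 'Inf' and 'NaN'.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical wrapper (not part of the original library):
# returns 1 or 0, like is_numeric above
sub is_numeric_core {
    my ($exp) = @_;
    return (defined $exp && looks_like_number($exp)) ? 1 : 0;
}

# Illustrative sample values only
for my $value ('42', '-3.14', '1e-5', '1.2.3', 'abc', '') {
    printf "%-8s => %d\n", "'$value'", is_numeric_core($value);
}
```

Which of the two checks is preferable depends on the input: the regular expression is more permissive, while looks_like_number follows the way Perl itself would convert the string.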

September 1, 2010

Adapting old scripts for Blast+

Hi,
those of you who use programs of the BLAST family regularly will have noticed that the latest installable versions, which can be downloaded by FTP, belong to the new development branch, Blast+. It brings some very interesting new features, but it changes the names of the executables and the way of passing arguments that we were used to.

However, the NCBI developers had already anticipated this transition and ship a Perl script with the new binaries, called legacy_blast.pl, that can help us convert code that invokes old versions of BLAST. With this program we can, for example, translate this command for version 2.2.18, which has somewhat special filtering options:

$ blastall -F 'm S' -z 10000 -p blastp -i problema.faa -d bases_datos/seqs.faa

into this call to the version 2.2.24+ binary:

$ blastp -db bases_datos/seqs.faa -query problema.faa -dbsize 10000 -seg yes -soft_masking true

By the way, in my tests the new binary can use a sequence database formatted with the old formatdb, which is now called makeblastdb.

Regards,
Bruno

August 24, 2010

C code inside a Perl program

Hi, following up on the comments to Álvaro's previous post, today I show an example of how to include one or more subroutines written in C inside a Perl source file, using the Inline module:

 #!/usr/bin/perl -w 
 # program 7
 use Inline C; 
 use strict; 
 my $BASE = 0.987654321; 
 my $EXP = 1000; 
 print "# $BASE^$EXP = \n"; 
 print "# ".potencia_perl($BASE,$EXP)." (perl)\n"; 
 print "# ".potencia_C($BASE,$EXP)." (C)\n"; 
 sub potencia_perl 
 { 
    my ($base,$exp) = @_; 
    my $result = $base; 
    for(my $p=1; $p<$exp; $p++) { $result *= $base; } 
    return $result; 
 } 
 __END__  
 ### C alternative
 __C__ 
 #include <stdio.h> 
 float potencia_C( double base, double exp )  
 { 
    double result = base; 
    int p; 
    for(p=1; p<exp; p++) 
    {  
       result *= base; 
    }    
    return result; 
 }

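The output shows that the two results agree only in their first eight significant digits. Different languages often represent and operate on numbers with different precision; in this case the gap comes from potencia_C being declared to return a float (single precision), whereas Perl stores its numbers in double precision: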

$ perl prog7.pl 
# 0.987654321^1000 = 
#  4.02687472213071e-06 (perl)
#  4.02687464884366e-06 (C)
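
The effect of the float return type of potencia_C can be sketched in plain Perl, without Inline, by rounding the double-precision result to single precision with pack and unpack. This is an illustration under assumptions: the helper name to_single is made up here, and the sketch assumes that the C loop computes the same double value as the Perl loop before it is converted to float.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same loop as potencia_perl above
sub potencia_perl {
    my ($base, $exp) = @_;
    my $result = $base;
    for (my $p = 1; $p < $exp; $p++) { $result *= $base; }
    return $result;
}

# Hypothetical helper: round a Perl number to single precision by
# packing it as a native float ('f') and unpacking it again
sub to_single {
    my ($value) = @_;
    return unpack('f', pack('f', $value));
}

my $double = potencia_perl(0.987654321, 1000);
my $single = to_single($double);

print "# $double (double)\n";
print "# $single (rounded to float)\n";
printf "# relative difference: %.2e\n", abs($double - $single) / $double;
```

A single-precision float keeps roughly seven significant decimal digits, which matches the point at which the two printed results start to diverge.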

August 19, 2010

Generating all possible combinations of n nucleotides or amino acids

We often say, incorrectly, that we want to generate all possible 'combinations' of n letters, nucleotides or amino acids, when the correct term would be 'variations with repetition' (Permutations-Variations-Combinations). For example, the 16 possible variations with repetition of 2 nucleotides are: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT. The number of variations with repetition of n nucleotides is 4^n, and of n amino acids 20^n.

The following code is an adaptation of a PHP script that generates all the variations with repetition of n elements; in our case, 2 amino acids and 4 nucleotides.

 #!/usr/bin/perl -w  
 # Generate all the variations with repetition of n letters  
 my @aminoacids = ('A','R','N','D','C','Q','E','G','H','L','I','K','M','F','P','S','T','W','Y','V');  
 my @nucleotides = ('A','C','G','T');  
 # Generate all the variations of 2 aminoacids  
 my $aas = variations(\@aminoacids,2);  
 # Generate all the variations of 4 nucleotides  
 my $nts = variations(\@nucleotides,4);  
 # Print the results  
 print "\nAll the variations with repetition of 2 aminoacids: ";  
 foreach my $vari (@{$aas}){  
      foreach my $elem (@{$vari}){  
           print "$elem";  
      }  
      print " ";  
 }  
 print "\n";  
 print "\nAll the variations with repetition of 4 nucleotides: ";  
 foreach my $vari (@{$nts}){  
      foreach my $elem (@{$vari}){  
           print "$elem";  
      }  
      print " ";  
 }  
 print "\n\n";  
 sub variations {  
      my ($letters,$num) = @_;  
      my $last = [map $letters->[0] , 0 .. $num-1];  
      my $result;  
      while (join('',@{$last}) ne $letters->[$#{$letters}] x $num) {  
           push(@{$result},[@{$last}]);  
           $last = char_add($letters,$last,$num-1);  
           print '';  
      }  
      push(@{$result}, $last);  
      return $result;  
 }  
 sub char_add{  
      my ($digits,$string,$char) = @_;  
      if ($string->[$char] ne $digits->[$#{$digits}]){  
           my ($match) = grep { $digits->[$_] eq $string->[$char]} 0 .. $#{$digits};  
           $string->[$char] = $digits->[$match+1];  
           return $string;  
      } else {  
           $string = changeall($string,$digits->[0],$char);  
           return char_add($digits,$string,$char-1);  
      }  
 }  
 sub changeall {  
      my ($string,$char,$start,$end) = @_;  
      if (!defined($start)){$start=0;}  
      if (!defined($end)){$end=0;}  
      if ($end == 0) {$end = $#{$string};}  
      for(my $i=$start; $i<=$end; $i++){  
           $string->[$i] = $char;  
      }  
      return $string;  
 }
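
The same variations can also be built iteratively, by extending every partial string with each letter, once per position. The sketch below is an alternative for illustration, not the adapted PHP algorithm; the name variations_iter is made up here, and it returns plain strings instead of array references.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alternative to variations(): returns a reference to an
# array with all strings of length $n over the given letters
sub variations_iter {
    my ($letters, $n) = @_;
    my @result = ('');               # one empty string to start from
    for (1 .. $n) {
        my @longer;
        for my $prefix (@result) {
            # append every letter to the current prefix
            push @longer, map { $prefix . $_ } @{$letters};
        }
        @result = @longer;
    }
    return \@result;
}

my $nts = variations_iter(['A','C','G','T'], 2);
print scalar(@{$nts}), " variations: @{$nts}\n";
```

With 4 nucleotides and n = 2 this yields the 4^2 = 16 strings listed above, in the same order (AA, AC, ..., TT).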

To finish, a brief statistics lesson (source: Aulafacil.com)

a) Combinations:
Determines the number of subgroups of 1, 2, 3, etc. elements that can be formed from the "n" elements of a sample. Each subgroup differs from the rest in the elements it contains; the order does not matter.

For example, calculate the possible combinations of 2 elements that can be formed with the numbers 1, 2 and 3.
Three different pairs can be formed: (1,2), (1,3) and (2,3). When counting combinations, the pairs (1,2) and (2,1) are considered identical, so they are counted only once.

b) Variations:
Calculates the number of subgroups of 1, 2, 3, etc. elements that can be formed from the "n" elements of a sample. Each subgroup differs from the rest in the elements it contains or in the order of those elements (this is what distinguishes variations from combinations).

For example, calculate the possible variations of 2 elements that can be formed with the numbers 1, 2 and 3.
Now there are 6 possible pairs: (1,2), (1,3), (2,1), (2,3), (3,1) and (3,2). In this case the subgroups (1,2) and (2,1) are considered different.

c) Permutations:
Calculates the possible arrangements that can be formed with all the elements of a group; therefore, what distinguishes each arrangement from the rest is the order of the elements.

For example, calculate the possible ways in which the numbers 1, 2 and 3 can be ordered.
There are 6 possible arrangements: (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2) and (3, 2, 1)
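
The three counts above follow the standard formulas n!/(k!(n-k)!) for combinations, n!/(n-k)! for variations and n! for permutations, with n^k for variations with repetition. A minimal sketch to check the examples follows; the subroutine names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# n! computed iteratively (returns 1 for n = 0 or n = 1)
sub factorial {
    my ($n) = @_;
    my $f = 1;
    $f *= $_ for 2 .. $n;
    return $f;
}

# Subgroups of k elements out of n, order ignored
sub count_combinations {
    my ($n, $k) = @_;
    return factorial($n) / (factorial($k) * factorial($n - $k));
}

# Subgroups of k elements out of n, order taken into account
sub count_variations {
    my ($n, $k) = @_;
    return factorial($n) / factorial($n - $k);
}

# Orderings of all n elements
sub count_permutations {
    my ($n) = @_;
    return factorial($n);
}

# Strings of length k over n letters, letters may repeat
sub count_variations_rep {
    my ($n, $k) = @_;
    return $n ** $k;
}

print "Combinations of 2 out of 3: ", count_combinations(3, 2), "\n";    # 3
print "Variations of 2 out of 3:   ", count_variations(3, 2), "\n";      # 6
print "Permutations of 3:          ", count_permutations(3), "\n";       # 6
print "Variations with repetition of 2 nucleotides: ",
      count_variations_rep(4, 2), "\n";                                  # 16
```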