#!/perl/bioinfo

21 de diciembre de 2012

Bioinformatic Merry Christmas!!!

My boss is very nervous when I prepare this blog entry at the end of December, and this 'crisis' year I wanted to prepare the best bioinformatic inspirated Christmas card ever..

If you are not as geek as us, here is written in flat words... MERRY CHRISTMAS AND HAPPY NEW YEAR 2013!!!

If you want to replicate the Christmas greeting, go to this link, or to the 'retrieve' section in UniProt and introduce the following protein identifiers: P0A183, O02437, Q40309, Q6L5Z2, E0ZPQ6.

30 de noviembre de 2012

comparando secuencias de igual longitud

Hola,
hoy quería compartir recetas en Perl para la comparación eficiente de secuencias de igual longitud. Las dos funciones que vamos a ver devuelven como resultado listas con las posiciones (contando desde 0) en las que difieren dos secuencias.

Para qué sirve esto? En general, para calcular distancias de edición (edit distances); en bioinformática es una manera, por ejemplo, de comparar oligos de igual longitud o secuencias extraídas de un alineamiento múltilple, incluyendo gaps. El siguiente código fuente incluye una subrutina en Perl (con el operador binario ^ y la función pos) y otra en C embebido para hacer la misma operación por dos caminos:

 use strict;  
 use warnings;  
 my $seq1 = 'ACTGGA';  
 my $seq2 = 'AGTG-A';  
   
 my @Pdiffs = find_diffs($seq1,$seq2);  
 my @Cdiffs = find_diffs_C($seq1,$seq2);  
   
 printf("# version Perl\ndiferencias = %d\n%s\n",scalar(@Pdiffs),join(',',@Pdiffs));  
 printf("# version C\ndiferencias = %d\n%s\n",scalar(@Cdiffs),join(',',@Cdiffs));  
   
 sub find_diffs  
 {  
    my ($seqA,$seqB) = @_;  
    my $XOR = $seqA ^ $seqB; 
    my @diff;   
    while($XOR =~ /([^\0])/g)  
    {   
       push(@diff,pos($XOR)-1);  
    }  
    return @diff;  
 }  
   
 use Inline C => <<'END_C';  
 void find_diffs_C(char* x, char* y)   
 {                        
   int i;   
    Inline_Stack_Vars;                               
   Inline_Stack_Reset;                                                                
   for(i=0; x[i] && y[i]; ++i) {                          
    if(x[i] != y[i]) {                              
     Inline_Stack_Push(sv_2mortal(newSViv(i)));                 
    }                                       
   }   
    Inline_Stack_Done;                                                                  
 }     
 END_C

Podéis ver en stackoverflow otras versiones y su comparación en base a su tiempo de cálculo,
un saludo,
Bruno

21 de noviembre de 2012

Comprimiendo BLAST

Hola,
ya hemos hablado antes en este blog de las crecientes aplicaciones de los algoritmos de compresión en la bioinformática. Hoy precisamente quería enlazar a dos artículos recientes que los explotan para acelerar algo que a priori parece imposible, la herramienta más universal de la biología computacional, BLAST.

En un paper en Nature Biotech, Loh, Maym y Berger nos presentan el prototipo Compression-accelerated BLAST (fuente C++ aquí), que en pruebas empíricas acelera varios órdenes de magnitud las búsquedas de BLAST simplemente haciendo las operaciones en un espacio comprimido. Como ventaja añadida se generan archivos de resultados de tamaños significativamente menores. Obviamente pagamos una pequeña pérdida de sensibilidad, pero puede ser perfectamente asumible para tareas de mapeo de lecturas (reads) de secuenciación de última generación:


Original de http://www.nature.com/nbt/journal/v30/n7/full/nbt.2241.html.

En otro paper recientemente publicado en Bioinformatics, Koskinen y Holm, aplican el marco de los vectores de sufijos (ya discutidos aquí) a la búsqueda de proteínas homólogas con una identidad superior al 50%, acelerando de nuevo varias órdenes de magnitud por encima de BLAST.

Familias de algoritmos para la búsqueda inexacta de secuencias, según Koskinen y Hol (http://bioinformatics.oxfordjournals.org/content/28/18/i438.full).

El programa que implementa estas ideas se llama SANS, escrito en FORTRAN 90, está disponible en este enlace,
un saludo,
Bruno

31 de octubre de 2012

Tutorial de modelado comparativo

Hola,
en colaboración con mis colegas Daniela Medeot, Romina Rivero y Edgardo Jofré de la Universidad Nacional de Río Cuarto (Argentina) hemos organizado recientemente un curso de posgrado titulado "Herramientas de bioinformática aplicadas al análisis de ácidos nucleicos y proteínas", al que asistieron cerca de 20 alumnos.

Además de hacer una introducción práctica al uso de Perl en diferentes tareas cotidianas de la bioinformática, mi participación se centró en un taller práctico para el que preparé el tutorial "Modelado comparativo de proteínas" que espero pueda servir de ayuda a personas interesadas en aprender a modelar proteínas. El material puede utilzarse desde el navegador web o descargarse como un documento PDF,
un saludo, Bruno

18 de septiembre de 2012

TFcompare - a tool for structural alignment of DNA binding protein complexes

I want to introduce you the new bioinformatic contribution of our lab to the science world: TFcompare (http://floresta.eead.csic.es/tfcompare/)

TFcompare is a tool for structural alignment of DNA motifs and protein domains from DNA binding protein complexes in Protein Data Bank.

The TFcompare algorithm calculates structural alignments between three dimensional structures of two DNA-protein complexes. The most interesting feature of TFcompare when compared with other methods is that it extracts individual protein domains and their recognized DNA sequences, aligning them separately and returning not only the structure superposition but the DNA sequence superposition too. In this way we can compare single domain affinity for different DNA sequences in DNA-protein complexes, especially transcription factors and their recognized cis elements.

The working schema of TFcompare is the following:

TFcompare takes as input two PDB identifiers. Structures from PDB are retrieved automatically and Pfam domains contacting DNA are calculated and trimmed from the original structure. Then all the domains from the first structure are aligned to all the domains from the second in several steps:

The program MAMMOTH performs the structural alignment.
The produced transformation matrices are applied to the coordinates of the DNA binding sites in order to derive the equivalent cis element superpositions.
Root-mean-squared deviations of superposed coordinates are calculated with beta-carbon atoms (proteins) and with N9 (purines) and N1 (pyrimidines) atoms (DNA).
Structural alignments are scored in terms of i) the number of identical superposed nucleotides (DNA Score 1-0) and ii) the sum of N9 and N1 atom pairs within 3.5 Å (DNA Score).

We can take as example the alignment of 1D5Y and 1BL0 structures, both are bacterial proteins with Helix-turn-helix (HTH) protein domains binding DNA.

We obtain the following results:

Results are ordered by structural similarity (RMSD), from both protein domain and DNA. In green colour are showed the alignment of similar structures (protein RMSD <=5.0 Å and DNA RMSD <= 3.5 Å) and in red colour dissimilar ones. 1D5Y contains two protein chains with HTH domains contacting DNA (trimmed domain structures are 1d5y_A1 and 1d5y_C1). 1BL0 have two HTH domains in its unique chain A (1bl0_A1 and 1bl0_A2). Structural alignment results show how 1bl0_A1 superposes very well with 1d5y_A1 and 1d5y_C1 (green colour). When DNA sequences recognized by these domains are aligned, they show a DNA motif conserved with three common nucleotides ‘CAC’. However, 1bl0_A2 superpositions (red colour) are not as good as previous ones, and DNA motif ‘CAC’ is not preserved when checking the resulting DNA alignment.

Each row contains an alignment of a pair of DNA binding domains, showing a picture of their structures before and after superposition. DNA alignment is also shown.

PDB files with aligned structures can be downloaded by left-clicking on the domain names and DNA sequences. Opening them with PDB viewer software (Pymol for ex.) is possible to visualize the resulting superposition after structural alignment.

Results column headers and their meaning:
Pair: Pair number
Domain_Query: PDB name, chain and domain number of the Query
Domain_Sbjct: PDB name, chain and domain number of the Sbjct
DNA_Query: DNA site recognized by the Query domain
DNA_Sbjct: DNA site recognized by the Query domain
Similar: 1 if both protein domains and DNA sites are below RMSD thresholds, 5.0 A and 3.5 A
respectively
DNA_Alignment: DNA sites structurally aligned
DNA_Aligned: Number of aligned nucleotides
DNA_Score_1-0: Number of identical nucleotides
DNA_Score: Structural alignment score
DNA_RMSD: RMSD of the structurally aligned DNA sites
PROT_RMSD: RMSD of the structurally aligned protein domains
3D_Alignment: 3D Visualization of aligned structures