#!/perl/bioinfo

20 de abril de 2011

curso APIs de Ensembl

Hola,
en línea con la anterior entrada del blog os paso este anuncio de Miguel Pérez-Enciso (miguel.perez@uab.es ) :

Del 4 al 6 de octubre organizo el siguiente curso en el CRAG, más detalles + adelante. Lo darán dos personas del EBI. Se requiere conocimientos sólidos de perl. Máximo 20 personas.

"Ensembl uses MySQL relational databases to store its information. A comprehensive set of Application Programme Interfaces (APIs) serve as a middle-layer between underlying database schemes and more specific application programmes. The APIs aim to encapsulate the database layout by providing efficient high-level access to data tables and isolate applications from data layout changes.

This 2-day workshop is aimed at developers interested in exploring Ensembl beyond the website. Participants will be expected to have experience in writing Perl programs. A background in object oriented programming techniques and familiarity with databases (MySQL) would be an advantage.

The workshop covers the following four databases and APIs. For each of them the database schema and the API design as well as its most important objects and their methods will be presented. This will be followed bypractical sessions in which the participants can put the learned into practice by writing their own Perl scripts.

**Ensembl Core databases and API

The set of species-specific Ensembl Core databases stores genome sequences and most of the annotation information. This includes the gene, transcript and protein models annotated by the Ensembl automated genome analysis and annotation pipeline. Ensembl Core databases also store assembly information, cDNA and protein alignments, external references, markers and repeat regions data sets.

**Ensembl Variation databases and API

The large amount of genetic variation information is organised in a set of species-specific Ensembl Variation databases. This includes basic variation information (alleles, flanking sequences, genotypes), the functional consequence of a variation and more complex data such as LD values from genotype data and tagged SNPs in a population.

**Ensembl Compara database and API

The Ensembl Compara multi-species database stores the results of genome-wide species comparisons re-calculated for each release. The comparative genomics set includes pairwise whole genome alignments and synteny regions. The comparative proteomics data set contains orthologue predictions and protein family clusters. "

13 de abril de 2011

Bajar de ensembl todas las secuencias de proteína de los ortólogos de un gen (o de varios)

Dada una secuencia de proteína, puede interesarnos bajar de una sola vez todas las secuencias de proteínas ortólogas conocidas. Si aceptamos el criterio de ensembl para determinar qué es y qué no es ortólogo (una buena opción en una decisión a veces complicada), las siguientes instrucciones pueden servir de guía para esta tarea.

i) pasos previos

Tendréis que tener instalado el ensemble-api (ver instrucciones de instalación) y el módulo DBD::Mysql. En Mac OS X he conseguido instalar este módulo usando macports. En 5 líneas, sería:

~% sudo port selfupdate

~% sudo port install mysql5-server

~% sudo port install p5-dbd-mysql

~% sudo port install p5-dbi

~% sudo port -f activate p5-dbi

ii) bajar los ortólogos

En un fichero de texto, ponéis una lista con los identificadores de genes para los que queréis hacer la búsqueda.

(por ejemplo)

~% cat IDs.txt

ENSMUSG00000031628

ENSMUSG00000027997

ENSG00000003400

Ejecutáis get_orthologues.pl dándole este fichero como $ARGV[0]

~% perl get_orthologues.pl IDs.txt

Si veis algo así en pantalla, todo OK.

Obtaining 1-to-1 and 1-to-many orthologues for ENSMUSG00000031628

main IDs for each orthologue written to ENSMUSG00000031628.mainIDs.txt

all IDs for each orthologue written to ENSMUSG00000031628.allIDs.txt

sequeences for the main IDs dumped to ENSMUSG00000031628.ort.fa

Obtaining 1-to-1 and 1-to-many orthologues for ENSMUSG00000027997

main IDs for each orthologue written to ENSMUSG00000027997.mainIDs.txt

all IDs for each orthologue written to ENSMUSG00000027997.allIDs.txt

sequeences for the main IDs dumped to ENSMUSG00000027997.ort.fa

Obtaining 1-to-1 and 1-to-many orthologues for ENSG00000003400

main IDs for each orthologue written to ENSG00000003400.mainIDs.txt

all IDs for each orthologue written to ENSG00000003400.allIDs.txt

sequeences for the main IDs dumped to ENSG00000003400.ort.fa

(en mi ordenador, para estos 3 IDs tarda unos 5 minutos)

El resultado, para cada identificador ENSG.., son como véis 3 ficheros:

ENSG*.mainIDs.txt --> listado con los protein-IDs del transcript principal de cada ortólogo

ENSG*.allIDs.txt --> listado con los protein-IDs de todas las secuencias que ensembl guarda para cada ortólogo

ENSG*.ort.fa --> secuencias de proteina (formato fasta) correspondientes al transcript principal de cada ortólogo

Aclaración por si alguien se ha liado con esto de transcript_principal / todas_las_secuencias. Como sabéis, para cada gen (y para cada uno de sus ortólogos) hay descritos varios transcripts normalmente. ¿Cuál utiliza ensembl como transcript principal? El más largo. La secuencia de proteína de ese se graba en OUT3, su ID en OUT1 y las IDs de todos en OUT2.

El código de get_orthologues.pl es el siguiente:

#!/usr/bin/perl

# Script to retrieve protein sequences for 1-to-1(and 1-to-many)-orthologues for a given gene.
# (adaptado del que me pasó Bert Overduin de helpdesk@enseml.org)

# Definicion de librerias y modulos

use strict;
use warnings;

use lib '/Users/jramon/Soft/ensembl_API/ensembl/modules/';
use lib '/Users/jramon/Soft/ensembl_API/ensembl-compara/modules/';
use lib '/Users/jramon/Soft/ensembl_API/ensembl-variation/modules/';
use lib '/Users/jramon/Soft/ensembl_API/ensembl-functgenomics/modules/';
use lib '/Users/jramon/Soft/BioPerl-1.6.1/';
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_url('mysql://anonymous@ensembldb.ensembl.org');

my $member_adaptor = $reg->get_adaptor("Multi", "compara", "Member");
my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");


# Leer identificadores de genes y buscar ortologos para cada uno
# guardando IDs en OUT1,2 y secuencias en OUT3

open (INPUT_IDs,"$ARGV[0]");
while () {
my $gene_stable_id = $_;
chop $gene_stable_id;

my $member = $member_adaptor->fetch_by_source_stable_id("ENSEMBLGENE", $gene_stable_id);

open (OUT1,">$gene_stable_id.mainIDs.txt");
open (OUT2,">$gene_stable_id.allIDs.txt");
open (OUT3,">$gene_stable_id.ort.fa");
print "--\nObtaining 1-to-1 and 1-to-many orthologues for $gene_stable_id\n";
print "main IDs for each orthologue written to $gene_stable_id.mainIDs.txt\n";
print "all IDs for each orthologue written to $gene_stable_id.allIDs.txt\n";
print "sequences for the main IDs dumped to $gene_stable_id.ort.fa\n";

my $all_homologies = $homology_adaptor->fetch_all_by_Member($member);

foreach my $homology (@$all_homologies) {
if($homology->description eq (('ortholog_one2one')||('ortholog_one2many'))){
my $members = $homology->get_all_Members();
foreach my $member (@$members) {
 if($member->stable_id ne $gene_stable_id){
  my $canonical_peptide = $member->get_canonical_peptide_Member;
  print OUT1
   $canonical_peptide-> stable_id, "\n";

  my $gene_adaptor = $reg->get_adaptor($member->genome_db->name, "core", "Gene");
  my $gene = $gene_adaptor->fetch_by_stable_id($member->stable_id);
  my @transcripts = @{$gene->get_all_Transcripts};
  foreach my $transcript(@transcripts){
   if($transcript->translation){
     print OUT2 $transcript->translation->stable_id, "\n";
   }
  }

  print OUT3
   ">", $canonical_peptide->stable_id, "\n",
   $canonical_peptide->sequence, "\n";
  }
}
}
}
close OUT1;
close OUT2;
close OUT3;
}
close INPUT_IDs;

De todas formas, para el volumen de datos que ha de bajar, se me hace un poco lento el código. Se agradecen mejoras en este sentido.

5 de abril de 2011

Extraer de un texto líneas con el contenido deseado

A continuación os dejo un sencillo código para extraer las líneas deseadas de un texto dándole como parámetros el texto y un array con las palabras o patrones que deben estar presentes en las líneas a extraer.

Modificaciones de este código nos pueden ser muy útiles en diversos problemas bioinformáticos, cuando tenemos archivos con muchas líneas de datos y sólo queremos utilizar o visualizar unas pocas.

En próximas entradas utilizaremos este código para extraer por ejemplo los átomos deseados de un fichero de coordenadas atómicas en formato PDB.

 my $source_text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n  
 Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\n  
 Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.\n  
 Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";  
   
 my @patterns;  
 push(@patterns, 'adipisicing');  
 push(@patterns, '/dolor/');  
   
 my $extracted_text = join('',extract_lines_from_text($source_text,\@patterns));  
   
 print "$extracted_text\n";  
   
 # Extract lines from text with the desired patterns  
 sub extract_lines_from_text {  
   
      my ($text, $patterns) = @_;  
   
      my @data;  
      my @lines = split("\n",$text);  
   
      foreach my $line (@lines){  
           foreach my $pattern (@{$patterns}){  
                if ($pattern =~ /^\/(.+)\/$/){  
                     if ($line =~ /$1/){  
                          push(@data,$line);  
                          last;  
                     }  
                } else {  
                     if ($line =~ /\Q$pattern\E/){  
                          push(@data,$line);  
                          last;  
                     }  
                }  
           }  
      }  
   
      return @data;  
   
 }

31 de marzo de 2011

Curso 'Next Generation Sequence Analysis: Practice and Departure to New Frontiers'

Hola,
en el 11 GABI Status Meeting, que tuvo lugar recientemente en Alemania,
nos anunciaron el siguiente taller de verano, que tendrá lugar en el mes de Agosto:

'Next Generation Sequence Analysis: Practice and Departure to New Frontiers'

organized by: R. Fries (Animal Breeding, TUM), T. Meitinger (Human Genetics, TUM), K.F.X. Mayer (MIPS, Helmholtz Zentrum München), T. Strom (Human Genetics, Helmholtz Zentrum München)

-> OPEN FOR APPLICATION! (see link below)

Dates: 08.08. – 15.08.2011

Location: Herrsching am Ammersee, Germany

Description:
The Synbreed Summer School provides an introduction to next generation sequence analysis for PhD students and postdoctoral researchers in animal and plant breeding. The course consists of a practical part with hands-on exercises guided by experienced Synbreed scientists and guest lecturers covering the frontiers of next gen sequencing in animal and plant improvement.

Program Overview:

Practice the pipeline from next gen sequence reads to annotated variants (alignment, visualization, variant calling, imputing and annotation)
Visiting the sequencing facility of the Helmholtz Zentrum Munich
Guest lectures about advances in sequencing technology, genome assembly, genomic selection, next gen population genomics, efficient computation and agricultural genome projects
Download preliminary programme

Target Group and Requirements:

The course is directed towards PhD students and postdoctoral researchers in animal and plant breeding
Knowledge of Linux and a scripting language (e.g. Phyton)
Dual-core processor equipped laptop with a recent Linux distribution (e.g. Ubuntu 10.10) and at least 150 GB free storage. Linux setup assistance will be provided during the introductory session.

Course language: English

Costs:
Course fee 500 Euro
Participant in single bedroom* 826 Euro
Participant in double bedroom* 756 Euro
(*includes: full accommodation and meals)

Services at Course Location: please visit http://www.hdbl-herrsching.de.

-> For your application click here!

Application deadline: 01.06.2011

Number of participants: 20

Contact:
Project Coordination Synbreed
Natalie Ohl / Wolf-Christian Saul
Chair of Plant Breeding
Center of Life and Food Sciences Weihenstephan
Technische Universität München
Emil-Ramann-Str. 4
D-85354 Freising

Ph.: +49 (0)8161/71-5226
Fax: +49 (0)8161/71-4511
Email: synbreed[at]wzw.tum.de

21 de marzo de 2011

Vectores de sufijos para buscar patrones (suffix arrays)

Buenas,
hace poco tuve la suerte de cursar un taller práctico sobre problemas en el análisis, mapeo y ensamblado de secuencias genómicas, impartido por Paolo Ribeca (autor del software GEM). A parte de volver un poco abrumado por las dificultades que encierran los datos producidos por estas tecnologías de secuenciación de gran capacidad, sobre todo para bichos que no tienen todavía un genoma de referencia de buena calidad, he aprendido qué estructuras de datos soportan el manejo y análisis de todas estas secuencias. Además de tablas hash y grafos de De Bruijn, y la magia de Burrows-Wheeler, que se merecerá su propia entrada, hemos aprendido las ventajas de usar vectores de sufijos (suffix arrays) para el problema de buscar patrones o subcadenas exactas en textos muy largos, como son las secuencias genómicas. Pero, qué es un vector de sufijos? Trataré de explicarlo con el siguiente código en Perl:

 # calcula un vector de sufijos de manera que las diferentes subcadenas   
 # de una secuencia queden preordenadas lexicograficamente  
 sub make_suffix_array  
 {  
    my ($seq,$verbose) = @_;  
    my @suffix = (0 .. length($seq)-1); # en base 0  
   
    # ordena los length($seq) posibles sufijos lexicogr´aficamente 
    @suffix = sort {substr($seq,$a) cmp substr($seq,$b)} (@suffix);  
      
    if($verbose)  
    {  
       print "# suffix array for $seq :\n";  
       foreach my $suf (@suffix)  
       {   
          printf("%3d %s\n",$suf,substr($seq,$suf));   
       }  
       print "\n";  
    }  
    return @suffix;  
 }

Si llamamos a la subrutina

my @suffix = make_suffix_array('TTTTAGATCGATCGACTAGACTACGACTCGA',1);

obtendremos:

30 A
22 ACGACTCGA
19 ACTACGACTCGA
14 ACTAGACTACGACTCGA
25 ACTCGA
17 AGACTACGACTCGA
4  AGATCGATCGACTAGACTACGACTCGA
10 ATCGACTAGACTACGACTCGA
6  ATCGATCGACTAGACTACGACTCGA
28 CGA
12 CGACTAGACTACGACTCGA
23 CGACTCGA
8  CGATCGACTAGACTACGACTCGA
20 CTACGACTCGA
15 CTAGACTACGACTCGA
26 CTCGA
29 GA
18 GACTACGACTCGA
13 GACTAGACTACGACTCGA
24 GACTCGA
9  GATCGACTAGACTACGACTCGA
5  GATCGATCGACTAGACTACGACTCGA
21 TACGACTCGA
16 TAGACTACGACTCGA
3  TAGATCGATCGACTAGACTACGACTCGA
27 TCGA
11 TCGACTAGACTACGACTCGA
7  TCGATCGACTAGACTACGACTCGA
2  TTAGATCGATCGACTAGACTACGACTCGA
1  TTTAGATCGATCGACTAGACTACGACTCGA
0  TTTTAGATCGATCGACTAGACTACGACTCGA

que es una lista de los 31 posibles sufijos de la cadena original, en orden lexicográfico. De hecho, si te fijas en el código, el vector realmente contiene sólo las posiciones (en base 0) de los sufijos ordenados, no su secuencia. Obviamente la construcción de este vector es costosa al necesitar de una ordenación (en Perl es por defecto un mergesort con un caso peor O(NlogN) ), pero luego permite consultas O(logN), es decir, mucho más rápidas que simplemente recorrer la secuencia de principio a fin, al usar implícitamente un árbol binario.

Puedes probarlo mediante el siguiente ejemplo:

 match_pattern('TAG','TTTTAGATCGATCGACTAGACTACGACTCGA');

 # aprovecha el orden del vector de sufijos para buscar cualquier subcadena   
 # o patro´n (pattern) por medio de un árbol binario   
 sub match_pattern   
 {  
    my ($pattern,$seq) = @_;  
    print "# looking for pattern $pattern in sequence $seq (base 0)\n";  
    my @suffix = make_suffix_array($seq,1);  
    my $low = 0;  
    my $high = $#suffix;  
    my $patl = length($pattern);  
    my ($med,$submed);  
    while ($low <= $high)   
    {  
       my $med = int (($low+$high)/2); # punto medio de la búsqueda        
       # comparacion lexicográfica en punto medio, mira 'perldoc perlop'  
       my $comp = $pattern cmp substr($seq,$suffix[$med],length($pattern));  
       if($comp < 0){ $high = $med-1 }  # retrocedemos en @suffix   
       elsif($comp > 0){ $low = $med+1 } # avanzamos  
       else   
       {  
          my $submed = $med - 1; # sufijo inmediatamente anterior al punto medio  
          while($submed > 0 && $pattern eq substr($seq,$suffix[$submed],$patl))  
          {   
             print "# match at position $suffix[$submed]\n";  
             $submed--;   
          }  
          while ($med < $#suffix-1 && $pattern eq substr ($seq,$suffix[$med],$patl))  
          {  
             print "# match at position $suffix[$med]\n";  
             $med++;   
          }  
          last;  
       }  
    }  
 }

Hasta otra,
Bruno