21 de febrero de 2014

Naïve Bayes con pseudoconteos (Laplace)


Hola,
tras leer recientemente algunos capítulos del libro Doing Bayesian Data Analysis, que trae todo el código escrito en R, se me ocurrió explorar cómo construir en Perl un clasificador bayesiano ingenuo (Naïve Bayes), del que recuerdo haber leído su aplicación hace tiempo en un trabajo de Beer & Tavazoie para correlacionar secuencias reguladoras en cis y patrones de expresión de genes en levaduras.

Figura recortada de http://www.sciencedirect.com/science/article/pii/S0092867404003046



Enseguida me crucé con un módulo de CPAN (Algorithm::NaiveBayes) adecuado para entrenar un clasificador en basea  observaciones codificadas en forma de vectores de atributos y la clase a la que pertenecen:

observación = (atributo1, atributo2, atributo3 : clase)



Sin embargo,  mirando el código fuente no parece incluir la corrección de Laplace para frecuencias nulas. En esencia, esta corrección consiste en sumar a los datos empíricos, las observaciones, datos ficticios para segurarse que no sobreestimamos el modelo y para permitir clasificar muestras  con atributos no vistos en el entrenamiento, sin alejarse de lo que indican los datos reales. Si todo esto te suena a chino te redirijo a este vídeo y a este blog. Esto se ha hecho muchas veces en Bioinformática, por ejemplo para construir matrices a partir de secuencias de DNA, como en el algoritmo CONSENSUS de Hertz y Stormo.

Sigo buscando y me encuentro una versión muy compacta de clasificador Bayesiano basada en el módulo Moose, que podéis ver en http://asciirain.com/wordpress/2012/12/03/naive-bayes-in-42-lines . Tomo por tanto este código como punto de partida y lo modifico para permitir el uso de pseudoconteos de Laplace. Primero defino la clase myNaiveBayes:

 package myNaiveBayes;   
    
 # modification of http://asciirain.com/wordpress/2012/12/03/naive-bayes-in-42-lines  
 # with Laplace smoothing (K param)  
 # allows missing features, which should be encoded as 'ND'  
   
 use Moose; # turns on strict and warnings  
   
 has class_counts => (is => 'ro', isa => 'HashRef[Int]', default => sub {{}});  
 has class_feature_counts => (is => 'ro', isa => 'ArrayRef[HashRef[HashRef[Num]]]', default => sub {[]});  
 has feature_counts => (is => 'ro', isa => 'ArrayRef[HashRef[Num]]', default => sub {[]});  
 has total_observations => (is => 'rw', isa => 'Num', default => 0);  
 has K => (is => 'ro', isa => 'Num', writer => '_set_K', default => 0);  
   
 sub insert   
 {  
    # insert observation, a vector of 'features', with last element being 'class'  
    # example: ('chrome','yahoo','us','good') , where class = 'good'  
   
   my( $self, @data ) = @_;  
   my $class = pop( @data );  
   $self->class_counts->{$class}++;  
   $self->total_observations( $self->total_observations + 1 );  
       
   for( my $i = 0; $i < @data; $i++ )   
     {  
       next if($data[$i] eq 'ND');  
     $self->feature_counts->[$i]->{$data[$i]}++;  
     $self->class_feature_counts->[$i]->{$class}->{$data[$i]}++;  
   }  
     
     return $self;  
 }  
   
 sub classify   
 {  
    # takes a feature vector (a new observation) of unknown class and returns a hash reference with   
    # probabilities of being associated to all previously seen classes  
   
    my( $self, @data ) = @_;  
    my ($i,$class,%probabilities,$feature_count,$class_feature_count);  
    my ($feature_probability,$conditional_probability,$class_count,$class_probability );  
       
    printf("# classify: training data = %d , K = %d \n",$self->total_observations,$self->K);  
       
    for $class ( keys %{ $self->class_counts } ){  
       $class_count = $self->class_counts->{$class};   
       $class_probability = $class_count / $self->total_observations;   
          
       ($feature_probability, $conditional_probability) = (1.0,1.0);  
       for($i = 0; $i < @data; $i++){  
          $feature_count = $self->feature_counts->[$i]->{$data[$i]} + $self->K;  
          $class_feature_count = $self->class_feature_counts->[$i]->{$class}->{$data[$i]} + $self->K;  
         
          # if $self->K==0 (no pseudocounts) zero counts are omitted  
          next unless($feature_count && $class_feature_count);   
               
          $feature_probability *= $feature_count /   
             ($self->total_observations + (keys(%{$self->feature_counts->[$i]}) * $self->K));  
               
          $conditional_probability *= $class_feature_count /   
             ($class_count + (keys(%{$self->class_feature_counts->[$i]->{$class}}) * $self->K));   
       }  
       #                     p(class) * p(features|class)   
       # p(class|features) = ----------------------------   
       #                     p(features)  
       $probabilities{$class} = ($class_probability * $conditional_probability) / $feature_probability;  
    }  
    return \%probabilities;  
 }  
   
 no Moose;  
   
 __PACKAGE__->meta->make_immutable;  
   
 1; 

Y ahora lo probamos con los mismos ejemplos del código en que me basé, creando un programa de nombre nbayes.pl:
 #!/usr/bin/env perl  
   
 use myNaiveBayes;  
   
 use constant PSEUDOCOUNTS => 1; # Laplace smoothing  
   
 my $K = $ARGV[0] || PSEUDOCOUNTS;  
   
 my $nb = myNaiveBayes->new( K => $K);  
   
 $nb->insert('chrome' ,'yahoo'  ,'us', 'good');  
 $nb->insert('chrome' ,'slashdot','us', 'bad');  
 $nb->insert('chrome' ,'slashdot','uk', 'good');  
 $nb->insert('explorer','google' ,'us', 'good');  
 $nb->insert('explorer','slashdot','ca', 'good');  
 $nb->insert('firefox' ,'google' ,'ca', 'bad');  
 $nb->insert('firefox' ,'yahoo'  ,'uk', 'good');  
 $nb->insert('firefox' ,'slashdot','us', 'good');  
 $nb->insert('firefox' ,'slashdot','us', 'bad');  
 $nb->insert('firefox' ,'slashdot','uk', 'bad');  
 $nb->insert('opera'  ,'slashdot','ca', 'good');  
 $nb->insert('opera'  ,'yahoo'  ,'uk', 'bad');  
 $nb->insert('opera'  ,'yahoo'  ,'uk', 'bad');  
   
 my $ref_classes = $nb->classify('opera','slashdot', 'uk');  
 foreach my $class (sort { $ref_classes->{$a} <=> $ref_classes->{$b} } keys(%$ref_classes))  
 {  
   printf("%-20s : %5.5f\n", $class, $ref_classes->{$class} );  
 }  

Si lo ejecuto en el terminal obtengo:

$ perl nbayes.pl  

# classify: training data = 13 , K = 1
good                 : 0.33287
bad                  : 0.68883



Hasta luego,
Bruno


14 de febrero de 2014

trabajo en PDB Europa

The Protein Data Bank in Europe (PDBe) is seeking to recruit a 
structural analyst/programmer to join the team at the European 
Bioinformatics Institute located on the Wellcome Trust Genome Campus 
near Cambridge in the UK.

For further information and to apply, please see the full EMBL-EBI job 
listing at http://bit.ly/1jvNjIO
To discuss this post informally, please contact PDBe's Dr Sameer 
Velankar (sameer@ebi.ac.uk).

About the position:

Established in 1994, CATH and SCOP are the world?s most comprehensive 
resources classifying protein-domain structures into evolutionary 
superfamilies. They are currently being combined in a collaborative 
project - Genome3D - that aims to provide predicted 3D structures for 
sequences assigned to SCOP and CATH superfamilies. Combining SCOP and 
CATH based predictions allows us to identify more accurately regions 
that agree between the two methods. We aim to provide these 
Genome3D-predicted structures via the PDBe resource. To improve the 
assessment of the reliability of the predictions it is necessary to 
develop a mapping between SCOP and CATH and to remove any conflicts.

We are now seeking to recruit a structural analyst/programmer to assist 
with this task. The structural analyst/programmer will also build an 
automatic pipeline to generate putative domain assignments (from CATH) 
for new PDB structures, prior to classification in SCOP or CATH. The 
structural analyst/programmer will be based at PDBe, and will be jointly 
supervised by Prof. Christine Orengo (UCL, London) and Dr Alexey Muzin 
(MRC-LMB, Cambridge), together with Prof. Gerard Kleywegt and Dr Sameer 
Velankar at PDBe.

About PDBe:

The Protein Data Bank in Europe (PDBe; pdbe.org) is part of the 
Worldwide Protein Data Bank organisation (wwPDB; wwpdb.org), which 
maintains the global archive of 3D structural data on biomacromolecules. 
The PDBe team also maintains a number of databases that support 
deposition and advanced search services for structural biologists and 
the wider scientific community. The team consists of an international 
and inter-disciplinary mix of professionals (scientists and IT specialists).

Best regards,
Gary.

-- Gary Battle Protein Data Bank in Europe (PDBe) EMBL-EBI Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD http://www.facebook.com/proteindatabank http://twitter.com/PDBeurope

11 de febrero de 2014

Jugando a leer y extraer secuencias de un fichero FASTQ comprimido con GZIP (.fq.gz)

Un fichero con extensión '.fq.gz' es un fichero en formato FASTQ comprimido en formato GZIP. Los ficheros FASTQ con datos de los nuevos métodos de secuenciación (NGS) suelen ocupar decenas de Gigabytes de espacio de disco (una cantidad indecente para la mayoría de los mortales) y comprimidos con GZIP se reducen.

Para empezar intentaremos saber cuanto espacio ocupa realmente un archivo 'fq.gz' comprimido, para ello usaremos el mismo comando 'gzip':

> gzip --list reads.fq.gz
         compressed        uncompressed  ratio uncompressed_name
        18827926034          1431825024 -1215.0% reads.fq


Parece que GZIP en vez de comprimir expande, pero no es verdad, simplemente que la opción '--list' de 'gzip' no funciona correctamente para archivos mayores de 2GB. Así que hay que recurrir a un método más lento y esperar unos minutos:


> zcat reads.fq.gz | wc --bytes
61561367168


Si queremos echar un vistazo al contenido del archivo podemos usar el comando 'less' o 'zless':

> less reads.fq.gz
> zless reads.fq.gz

Y para saber el número de secuencias que contiene simplemente hay que contar el número de líneas y dividir por 4 (le costará unos minutos):

> zcat reads.fq.gz | echo $((`wc -l`/4))
256678360

Podemos buscar una determinada secuencia en el archivo y contar cuántas veces aparece, por ej. ATGATGATG:

> zcat reads.fq.gz | awk '/ATGATGATG/ {nlines = nlines + 1} END {print nlines}'
398065

A veces nos interesará tomar un trozo del fichero para hacer pruebas, por ejemplo las primeras 1000 secuencias (o 4000 líneas):
> zcat reads.fq.gz | head -4000 > test_reads.fq
> zcat reads.fq.gz | head -4000 | gzip > test_reads.fq.gz

O extraer un rango de líneas (1000001-1000004):

> zcat reads.fq.gz | sed -n '1000001,1000004p;1000005q' > lines.fq 

También nos puede interesar dividir el fichero en varios más pequeños de por ejemplo 1 miĺlón de secuencias (o 4 millones de líneas):

> zcat reads.fq.gz | split -d -l 4000000 - reads.split
> gzip reads.split*
> rename 's/\.split(\d+)/.$1.fq/' reads.split*

Y posteriormente reunificarlos:

> zcat reads.*.fq.gz | gzip > reads.fq.gz


Finalmente os recomiendo leer un post similar en inglés:
http://darrenjw.wordpress.com/2010/11/28/introduction-to-the-processing-of-short-read-next-generation-sequencing-data/

24 de enero de 2014

breeding informatics job Dupont-Pioneer (Seville, Spain)


Buenas, 
os paso una oferta que me ha llegado por si a alguien le interesa, el contacto es  Eva Guzmán (eva dot guzman at pioneer dot com), hasta luego,
Bruno


Senior Research Associate position

Description

Provide computer and software support for researchers in crop product development programs world-wide. The senior research associate will be an integral part of the crop development team located in, and supported by, the researchers in Seville, Spain. The successful candidate will provide support for IM tools and processes and may interact with project leaders, SCRUM team(s) and members of the non-corn hybrid crop breeding staff to assist in the development and implementation of software solutions to meet research and production needs.

Provide training and client support for software applications. The successful candidate will troubleshoot and resolve technical problems primarily in support of sunflower and canola breeding in European countries. The successful applicant will work directly with breeders to become familiarized with research activities and processes relevant to the crop and region they support. They will interact directly with the app support representative in Johnston to implement software solutions, train clients worldwide on the use of software, and offer timely solutions to client IM needs. She/he will prepare and maintain training materials and provide training for new staff. 

Ensure that production bugs are routed through the recommended channels for prompt resolution and that clients’ needs are addressed in a timely and efficient manner. The successful candidate will work directly with clients and the development team(s) to promptly address urgent production issues. She/he may contribute to the establishment of a knowledge base to assist other client support representatives in addressing IM support needs, log production bugs, document bug fixes and ensure they are communicated to relevant RIM Support and Development teams.

Under the direction of the CPD RIM Support Manager, serve as the primary support contact on key initiatives and software development projects directly related to the crops and regions he/she supports. Responsibilities will include working with research clients to identify and document IM needs, working with software development staff to provide IM solutions when necessary, validating and testing newly developed software, and developing end-user documentation. Adapt and test compatibility of internal systems with locally available hardware.

Qualification

• B.S. degree in Computer Science, Bioinformatics, Information Systems, Biology, Agronomy, Molecular Genetics, Plant Breeding or related field with 2 – 4 years of industry related experience or M.Sc. degree in Computer Science, Agronomy or related field with 0 – 2 years of industry related experience or B.S.
o Science competencies;
 Experience in lab/field/greenhouse plant breeding research. Knowledge of experimental design and/or statistics. Understanding of molecular genetics and molecular breeding applications as it pertains to cultivar development.

o IM competencies;
 Excellent database, web, and/or PC skills, both functional and technical skills preferred. Experience supporting or strong familiarity with Windows Operating System. Experience with database and/or network support. Familiarity with personal computer software, client-server technology, and internet technology. Familiarity with software application to plant breeding preferred.  Demonstrated understanding of relational databases and ability to extract, summarize, and report large amounts of data.  Strong desire to contribute to the discovery and implementation of IM solutions in support of breeding activities.

o General competencies;
 Excellent written and oral communication skills and demonstrated ability to communicate complex information in a clear and concise manner.  Experience coordinating technical training and support programs to clients. Experience with customer-service preferred. Demonstrated ability to effectively manage priorities in a high-demand, time-sensitive work environment.  Demonstrated ability to work cooperatively in a diverse team environment.  Strong interpersonal and communication skills.

17 de enero de 2014

Congreso BIFI2014 y taller internacional de bioinformática

Hola,

la próxima semana el laboratorio estará en dos eventos simultáneos:

http://bifi.es/events/bifi2014
http://congresos.nnb.unam.mx/TIB2014


1) los Talleres Internacionales de Bioinformática TIB2014, donde Carlos Pérez Cantalapiedra participa como profesor en el "Taller 2: Bioinformática para análisis de datos de secuenciación masiva (NGS)". Este evento tiene lugar en el campus UNAM en Cuernavaca, Morelos, México. Aprovecho para señalar que, entre otras muchas herramientas y utilidades para el manejo de datos NGS, Carlos empleará dos programas desarrollados en el laboratorio, uno escrito en C++ y otro en Perl:
  •  split_pairs: efficient kseq-based program to sort and find paired reads within FASTQ/FASTA files, with the ability to edit headers with the power of Perl-style regular expressions
  •  split_blast: Perl script to take advantage of multi-core CPUs for doing BLAST searches that fit in RAM, del que ya hablamos en este blog

2) la IV International Conference BIFI 2014, en Zaragoza, España, donde el laboratorio presenta varios pósters y una charla sobre trabajo reciente.

Nos vemos allí, un saludo,
Bruno