#!/perl/bioinfo

December 15, 2016

Python 3 cheat sheet for beginners

Today I want to share with you a Python 3 cheat sheet that I made for my students. I tried to fit on the two sides of an A4 sheet the data types, operators, methods, functions and other useful content that you need to have at hand when you start programming in Python 3 (and even months later).

You can download the cheat sheet as a PDF from my English-language blog or use it online.

November 7, 2016

DIAMOND as a replacement for BLASTX

Hi,
about a year ago my colleague Javier Tamames told me about a piece of software called DIAMOND, which can be thought of as a replacement for some BLAST tools, particularly BLASTP and BLASTX.

Authors' benchmark of DIAMOND, taken from http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html

We recently tested the BLASTX alternative in-house. Here I summarize the results of finding protein-coding segments in transcripts from Arabidopsis thaliana (n=67,259) and barley (Hordeum vulgare, n=76,362) by comparing them to Swissprot, downloaded from ftp.uniprot.org.

You can get or build DIAMOND from https://github.com/bbuchfink/diamond.

First, we had to format Swissprot to support searches:

$ makeblastdb -in uniprot_sprot.fasta -dbtype prot                   # produces 3 files
$ diamond makedb --in uniprot_sprot.fasta --db uniprot_sprot.fasta   # produces 1 file

Then we ran the actual nucleotide-to-peptide sequence searches on 30 CPU cores. Note that files bur-0.fasta and SBCC073_fLF.fasta contain the A. thaliana and barley transcripts, respectively:

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.tsv \
  --max-target-seqs 1 --evalue 0.00001 \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp sseq 
#Total time = 36.2444s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.sensitive.tsv \
  --sensitive ...

#Total time = 144.636s

$ diamond blastx -p 30 -d swissprot -q bur-0.fasta -o bur-0.diamond.more.tsv \
  --more-sensitive ...

#Total time = 194.35s

$ time ncbi-blast-2.2.30+/bin/blastx -num_threads 30 -db ~/db/swissprot \
  -query bur-0.fasta -out bur-0.blastx.tsv \
  -max_target_seqs 1 -evalue 0.00001 -outfmt '6 std qcovhsp sseq'
#real    351m30.313s
#user    8294m28.727s
#sys    16m26.575s
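
To put these timings side by side, the wall-clock times reported above can be converted to seconds and divided. This is only a back-of-the-envelope sketch in Python; the helper name to_seconds is made up for illustration:

```python
# Wall-clock times copied from the runs above (A. thaliana transcripts, 30 cores).
def to_seconds(minutes, seconds):
    """Convert a 'real' time such as 351m30.313s to seconds."""
    return minutes * 60 + seconds

blastx_time = to_seconds(351, 30.313)

diamond_times = {
    "default": 36.2444,
    "--sensitive": 144.636,
    "--more-sensitive": 194.35,
}

# Ratio of BLASTX time to DIAMOND time for each mode.
speedups = {mode: blastx_time / t for mode, t in diamond_times.items()}

for mode, ratio in speedups.items():
    print(f"{mode:18s} {ratio:6.0f}x faster than BLASTX")
```

With these figures the ratios come out at roughly 582x (default), 146x (--sensitive) and 109x (--more-sensitive), that is, about two orders of magnitude.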

Similar searches were performed with the barley sequences, and the results were then compared with the help of a Perl script, which required alignments to be 5% longer or shorter to be counted as significantly longer or shorter, respectively. The total number of BLASTX alignments was 44,970 for A. thaliana and 32,887 for barley:

diamond_strategy                 matched same_hit  same_length longer  shorter
bur-0.diamond.tsv                  0.935    0.895        0.921  0.031    0.047
bur-0.diamond.sensitive.tsv        0.973    0.902        0.919  0.037    0.044
bur-0.diamond.more.tsv             0.973    0.902        0.918  0.037    0.045

SBCC073_fLF.diamond.tsv            0.889    0.807        0.852  0.071    0.077
SBCC073_fLF.diamond.sensitive.tsv  0.960    0.831        0.856  0.076    0.067
SBCC073_fLF.diamond.more.tsv       0.961    0.831        0.856  0.076    0.067

All in all, our tests suggest a reduction in computing time of 2 orders of magnitude with a 4% loss of sensitivity and alignments to protein sequences of the same length in most cases.
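
The Perl script used for the comparison is not shown here. As a rough illustration of the 5% length criterion only, the following Python sketch parses one line of tabular output using the column names passed to --outfmt 6 in the DIAMOND runs, and labels a DIAMOND alignment relative to the BLASTX one. The function names, the example lines and the exact form of the threshold are assumptions made for this sketch, not the original script:

```python
# Column names as requested with --outfmt 6 in the DIAMOND runs above.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen qstart qend "
           "sstart send evalue bitscore qcovhsp sseq").split()

def parse_hit(line):
    """Turn one tab-separated output line into a dict keyed by column name."""
    hit = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    hit["length"] = int(hit["length"])
    return hit

def compare_length(diamond_len, blastx_len, tolerance=0.05):
    """Label the DIAMOND alignment as longer, shorter or same_length.

    Assumption: 'longer' means at least 5% longer than the BLASTX
    alignment, and 'shorter' means at least 5% shorter.
    """
    if diamond_len >= blastx_len * (1 + tolerance):
        return "longer"
    if diamond_len <= blastx_len * (1 - tolerance):
        return "shorter"
    return "same_length"

# Made-up example lines, for illustration only (not real results).
blastx_line = "tr1\tsp|X|Y\t90.0\t200\t20\t0\t1\t600\t1\t200\t1e-50\t300\t95\tMKV"
diamond_line = "tr1\tsp|X|Y\t91.0\t180\t16\t0\t31\t570\t11\t190\t1e-45\t280\t86\tMKV"

blastx_hit = parse_hit(blastx_line)
diamond_hit = parse_hit(diamond_line)

# Same best hit in both searches? And how do the alignment lengths compare?
same_hit = blastx_hit["sseqid"] == diamond_hit["sseqid"]
label = compare_length(diamond_hit["length"], blastx_hit["length"])
print(same_hit, label)
```

The same parser works for the BLASTX output because BLAST's std keyword expands to the first twelve of these columns, followed here by qcovhsp and sseq.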

See you,
Bruno

October 31, 2016

Correlations and regressions in R

Hi,
when analyzing data we often need to compute correlations, which establish whether two variables covary, and regressions, which model how one variable can be estimated from another. In fact, we had already touched on this topic in this blog.

Today I would like to share the excellent material on these subjects prepared by our colleague Pablo Vinuesa, which you can explore online and, for example, with the help of RStudio to try out all the code in the examples:

correlation: [http://www.ccg.unam.mx/~vinuesa/R4biosciences/docs/Tema8_correlacion_presentacionR.html#/]

regression: [http://www.ccg.unam.mx/~vinuesa/R4biosciences/docs/Tema9_regresion_presentacionR.html#/]
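
As a tiny illustration of the two ideas (not taken from Pablo's material, and written in Python rather than R), the following sketch computes a Pearson correlation coefficient and a least-squares regression line by hand. The data points are made up for the example:

```python
from math import sqrt

# Made-up paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation: do x and y covary?
r = sxy / sqrt(sxx * syy)

# Least-squares regression line: y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"r = {r:.3f}, y = {intercept:.2f} + {slope:.2f} * x")
```

The correlation only says how strongly the two variables move together, while the slope and intercept give a model that can be used to estimate y from a new value of x.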

Best regards,
Bruno

October 5, 2016

Extracting TSV files from all the sheets of an Excel workbook

Hi,
when we exchange data in the lab we often use Excel workbooks in XLSX format with several sheets. However, if we later want to convert them to files with tab-separated (TSV) or comma-separated (CSV) values, MS Excel only lets you do it one sheet at a time, which is rather tedious for large workbooks. Here is a Perl script that does this job from the terminal. It requires the module Spreadsheet::ParseXLSX, which can be installed, for example, with:

$ sudo cpan -i Spreadsheet::ParseXLSX

The code is:


#!/usr/bin/perl
# get separate tab-separated (TSV) files from sheets in Excel .xlsx file
use strict;
use warnings;
use Spreadsheet::ParseXLSX;

die "# usage: $0 <infile.xlsx>\n" if(!$ARGV[0]);

my $parser = Spreadsheet::ParseXLSX->new();
my $book = $parser->parse($ARGV[0]) ||
  die "# cannot parse $ARGV[0]\n";

# write one TSV file per worksheet, named after the sheet
foreach my $sheet ($book->worksheets())
{
  my $tsvfile = $sheet->get_name().'.tsv';
  open(my $tsv,'>',$tsvfile) ||
    die "# cannot create $tsvfile: $!\n";

  my ($minrow,$maxrow) = $sheet->row_range();
  my ($mincol,$maxcol) = $sheet->col_range();
  foreach my $row ($minrow .. $maxrow)
  {
    my @values;
    foreach my $col ($mincol .. $maxcol)
    {
      # empty cells are undefined, print them as empty fields
      my $cell = $sheet->get_cell($row,$col);
      push(@values, defined($cell) ? $cell->unformatted() : '');
    }
    print $tsv join("\t",@values),"\n";
  }

  close($tsv);
}

If we save the code in a file named xlsx2multitab.pl, we can run it as follows:

$ perl xlsx2multitab.pl libro.xlsx

One .tsv file per sheet in the workbook will be saved in the current directory.
See you later,
Bruno

September 29, 2016

Introductory seminar on metagenomics

I would like to announce that on October 19 I will teach the workshop entitled "Introduction to Bioinformatics applied to Metagenomics and Community Ecology" as part of the conference Community Ecology for the 21st Century (Évora, Portugal).

If you are interested, you can contact the conference organizers at the following link; there are still places available in the workshop.

During the course I will present the new tool AmpliTAXO for simple, online analysis of NGS data from ribosomal RNA and other markers.

The course will consist of two parts. The first is theoretical and will cover the challenges of metagenomics, the possibilities offered by new sequencing techniques and how the most common analysis tools work (UPARSE, QIIME, MOTHUR). The second part will be practical and will consist of analyzing real metagenomic data obtained by NGS.

You can find more information in English on my new blog and, soon, on the conference page (pending update).

Here is a small preview of the contents...