#!/perl/bioinfo

25 de septiembre de 2018

SequenceServer: nice local blast

Hi,
today I wanted to let you know about a tool that we discovered recently that has been very useful for us. Its name is SequenceServer (http://www.sequenceserver.com). It is simply a wrapper to let your collaborators run NCBI BLAST searches on your local sequence databases.
All you need is a copy of the NCBI folder with some BLAST+ release and a Linux distribution. Here we had ncbi-blast-2.7.1+ already in place but had to install the application, which is a ruby application, as recommended by the authors:

$ sudo gem install sequenceserver

Due to dependencies during installation I could not manage to install it in Centos5, but instead it was easy in CentOS release 7.5 (sudo yum install ruby-devel). Once this is done, and the appropriate port is open in the host, all that remains is to let the application know where the sequence databases are. You can do that with these commands:

$ sequenceserver -d /path/to/dbs  # add new databases

$ sequenceserver -l               # list installed dbs

$ sequenceserver &                # launch web application

Now you are ready to go. Your users need to type the URL:port of your host in their browser and they can now run their searches. This is the way it looks in our server:

Cheers, Bruno

PS I will be moving to the EMBL-EBI so there might be a break in this blog, but please keep in touch

10 de agosto de 2018

minimap2 vs BLASTN

Hola,
recientemente Heng Li publicó un trabajo (https://doi.org/10.1093/bioinformatics/bty191) describiendo un nuevo alineador genérico de nucleótidos que se llama minimap2, que podéis descargar en https://github.com/lh3/minimap2.

Figura tomada de https://doi.org/10.1093/bioinformatics/bty191

En el artículo se compara minimap2 en diferentes escenarios contra otros softwares alternativos, incluyendo su antecesor BWA mem y se destaca su velocidad y su versatilidad, ya que es capaz de alinear lecturas cortas, secuencias largas e incluso también puede alinear saltando intrones.

Yo lo que he hecho ha sido una prueba rápida para compararlo con BLASTN en el escenario habitual de GET_HOMOLOGUES-EST, donde se comparan por ejemplo todos los genes de una planta (Brachypodium distachyon) contra todos los genes de otra especie cercana (Oryza sativa). Esto es lo que he hecho:

# how many sequences
$ grep -c "^>" *fna
Bdistachyon.fna:36647
Osativa.fna:42189

# index and BLASTN search
$ ncbi-blast-2.6.0+/bin/makeblastdb -in Osativa.fna -dbtype nucl
$ ncbi-blast-2.6.0+/bin/blastn -query Bdistachyon.fna -db Osativa.fna \
   -out Bdistachyon.Osativa.blastn.tsv -dbsize 100000000 -evalue 1e-5 -outfmt 6
real	0m40.937s
user	0m40.280s
sys	0m0.636s

# index [assuming up 80% sequence identity] and minimap2 search
$ minimap2/minimap2 -x asm20 -d Oryza.mmi Osativa.fna
$ time minimap2/minimap2 Oryza.mmi Bdistachyon.fna > Bdistachyon.Osativa.minimap.paf
real	0m2.084s
user	0m3.360s
sys	0m0.300s

Ahora echemos un ojo a los alineamientos resultantes. Selecciono un par de secuencias, primero de BLASTN:

BdiBd21-3.2G0760100.1   LOC_Os01g70090.1        87.839  847     95      5       31      876     37      876     0.0     987
BdiBd21-3.2G0521100.1   LOC_Os01g37510.1        85.652  683     92      3       91      773     103     779     0.0     713

y ahora de minimap2, en formato PAF:

BdiBd21-3.2G0760100.1	876	155	776	+	LOC_Os01g70090.1	876	161	776	181	621	60	tp:A:P	cm:i:16	s
1:i:179	s2:i:0	dv:f:0.0980
BdiBd21-3.2G0521100.1	777	110	653	+	LOC_Os01g37510.1	783	122	659	87	543	60	tp:A:P	cm:i:10	s
1:i:85	s2:i:0	dv:f:0.1196

Al maneo para estos dos ejemplos podemos observar que:
i) el mejor hit de BLASTN y minimap coinciden
ii) los alineamiento de BLASTN son más largos

Hasta pronto, buenas vacaciones,
Bruno

23 de julio de 2018

conjunto diferencia entre listas con Perl

Hola de nuevo,
sirva esta entrada para compartir una receta eficiente para calcular el conjunto diferencia entre dos listas o arrays en lenguaje Perl5.

Conjunto diferencia, tomado de https://es.wikipedia.org/wiki/Diferencia_de_conjuntos.

Para ello podemos definar la siguiente subrutina, tomada del módulo Array::Utils:

my @a = 0..10000; 
my @b = 5000..10000;

array_minus(@a, @b);

sub array_minus(\@\@) {
    my %e = map{ $_ => undef } @{$_[1]};
    return grep( ! exists( $e{$_} ), @{$_[0]} ); 
}

Podéis ver otras alternativas en reddit, incluyendo soluciones en Perl6 y python,
un saludo,
Bruno

estructuras del PDB parecidas en secuencia (REST)

Hola,
hace unos días un usuario preguntaba en la lista de usuarios del Protein Data Bank (PDB) cómo usar la interfaz de servicios REST, documentada en https://www.rcsb.org/pdb/software/rest.do, para hacer búsquedas BLAST.

Mientras otro usuario compartía un script escrito en Python3, llamado sequenceSimilarity.py, que requiere mmtf-pyspark para hacer consultas PSI-BLAST en tiempo real contra el PDB, a mi me llamó la atención la simplicidad del servicio sequenceCluster, que para cualquier cadena de una estructura depositada en el PDB permite obtener el clúster de secuencias con un porcentaje de identidad controlado por el usuario:

https://www.rcsb.org/pdb/rest/sequenceCluster?cluster=70&structureId=9ant.A

Esto devuelve una lista de cadenas de estructuras PDB en formato XML que podemos procesar por ejemplo como se explica en

https://developer.atlassian.com/server/fisheye-crucible/writing-a-rest-client-in-perl

Hasta luego, buen verano,
Bruno

14 de junio de 2018

updated footprintDB motifs

Hi,
I have just updated the format of footprintDB motifs used in RSAT (www.rsat.eu), which can de downloaded from http://floresta.eead.csic.es/footprintdb/download. This was done to ensure compatibility with compare-matrices in RSAT. Now, motifs look this:

AC  AY750993/VRN1/EEADannot
XX
ID  AY750993:VRN1:EEADannot
XX
NA  AY750993
XX
DE  AY750993
XX
OS  Hordeum vulgare cv Strider
XX
BF  10127;
XX
P0      A      C      G      T
01     19     72      7      2      C
02     10     71      7     12      C
03     57     17     14     12      a
04     56      6     32      6      r
05     94      2      4      0      A
06     84      1      3     12      A
07     79      7      7      7      A
08     57      4      9     30      w
09     24      0     76      0      G
10     15      0     85      0      G
XX
LN  http://floresta.eead.csic.es/footprintdb/index.php?motif=AY750993:VRN1:EEADannot
XX
RN  [1];
RL  Deng W, Casao Mc, Wang P, Sato K, Hayes PM, Jean Finnegan E, Trvaskis B (2015) Direct links between the vernalization response and other key 
traits of cereal crops. Nat Comm 6:5882
RN  [2];
RX  PUBMED: 24234003
RL  Sebastian, A., Contreras-Moreira, B. footprintDB: a database of transcription factors with annotated cis elements and binding interfaces. Bio
informatics 30, 258-65 (2014).
XX

Note that the ID field was added and that a URL pointing to the relevant database entry identified with it is attached. The accession code replaces ':' with '/' to ensure that downstream RSAT analyses can be carried out just fine,
cheers,
Bruno