25 October 2018

Plant Genomes in a Changing Environment (I)

Hi,  
the first meeting on "Plant Genomes in a Changing Environment" kicked off today at the Wellcome Genome Campus in Hinxton, UK. It is exciting to be here and find out that this is probably the first ever plant genome meeting in an otherwise world-famous genomics venue.

 
I will post here my notes on the talks I attend.


Caroline Dean, John Innes Centre, UK
She presents the different flowering habits of Arabidopsis thaliana accessions (rapid cycling, winter facultative and obligate winter-annual) and takes us through the current knowledge on how winter is quantitatively recorded at the FLC locus, which encodes a MADS-box repressor of flowering and is the target of a Polycomb-mediated epigenetic switch. In addition, she summarizes the mutually exclusive non-coding FLC transcripts found to be cold induced, such as COOLAIR (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4234544, https://www.nature.com/articles/ncomms13031). After flowering, the epigenetic state of FLC is restored by a demethylase. COOLAIR is actually an RNA molecule with a secondary structure conserved across Brassicaceae, which can be substantially affected by a single SNP that alters splicing. She says that this ncRNA folds and stays in place, physically blocking access to the locus. She adds that this mechanism is conserved in humans and Brassicaceae, and that she would expect the same in monocots.
By the way, COOLAIR non-coding transcripts seem to be annotated in Ensembl Plants: https://plants.ensembl.org/Arabidopsis_thaliana/Gene/Summary?g=AT5G10140;r=5:3173382-3179448;t=AT5G10140.2;db=core

The FLC locus accumulates H3K27me3 histone marks with exposure to cold, setting up a bistable state of inducing/repressing chromatin modifications. This balance spreads across tissues and cell populations, including the root tip. This memory is sustained in cis by the chromatin itself (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450441).
She then presents the RY cis elements in intron 1 of the FLC locus, which are recognized by the repressor VAL1 (https://www.ncbi.nlm.nih.gov/pubmed/27471304) to trigger Polycomb nucleation (http://floresta.eead.csic.es/footprintdb/index.php?tf=ea4a1835a3360403cd07b75528829572).
When they looked at 80 world-wide populations they found distinct FLC haplotypes which, when compared to each other in a common background, explain a linear vernalization requirement.
She claims that in A. thaliana vernal days are actually afternoons with temperatures < 15 °C (https://www.nature.com/articles/s41467-018-03065-7).  


Doreen Ware, USDA and Cold Spring Harbor, USA
She talks about a maize pangenome browser currently under development. She explains that growers require a platform that allows easy knowledge transfer from some plants to others, so that it can be used in breeding. She talks about CNV genes with agronomic impact, such as transporters providing Al tolerance (http://www.pnas.org/content/110/13/5241). She shows Gramene neighborhood conservation display modes based on Ensembl Compara data.




Then she describes their current efforts to assemble 26 maize NAM parents from PacBio reads, with SMRT Link assembly performed in the cloud (DNAnexus) and sped up 360x. The resulting assemblies are robust, with N50 > 34 Mb.
She finishes with a quick overview of transcriptome profiling for heterosis-inspired work, with the aim of phasing isoforms, which is important for reconstructing heterozygous loci (https://www.nature.com/articles/ncomms11708).
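As a reminder of what the N50 figure quoted above means, here is a minimal Python sketch, not taken from the talk: N50 is the length of the shortest contig in the set of longest contigs that together cover at least half of the assembly. The function name and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # walk the contigs from longest to shortest until half the assembly is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# toy contig lengths, not real data
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, as 80 + 70 = 150 is half of the 300 total
```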

Eric Schranz, Wageningen University, The Netherlands
He talks about conservation and divergence in the relative gene order of plant and animal genomes using network-based synteny analysis. He explains genome territories and why gene context matters, with multiple examples of Hox genes and body layout plans. He claims that we have a genomic hairball problem when looking at synteny, and that networks whose edges represent synteny can simplify the problem, allowing PAV and homeologues to be integrated easily (https://www.sciencedirect.com/science/article/pii/S1369526616302230).
He also explains phylogenetic profiling and how they used it to find MADS-box genes which are syntenic in all angiosperms but not in particular groups such as crucifers or monocots (http://www.plantcell.org/content/early/2017/06/05/tpc.17.00312).
He also explains that they are doing a mammal vs plant synteny analysis. Overall, mammal genomes are syntenic, while plant genomes are not. This work is under review at PNAS. They do find family-specific conserved syntenic blocks and a few angiosperm-conserved genes related to photosynthesis and the clock.

John Vogel, University of California, Berkeley, USA
John talks about the pan-genome of Brachypodium distachyon and its implications for polyploid genome evolution. He describes the main findings of the Gordon et al. paper (https://www.nature.com/articles/s41467-017-02292-8). He mentions that there is currently no way of displaying the pangenome efficiently in Phytozome, and he looks forward to the new developments of Gramene.
He then introduces B. stacei and the resulting B. hybridum. He shows the high synteny between the B. hybridum subgenomes and the diploid parental species, as well as the SNP-based tree suggesting at least two hybridization events. Then he shows k-mer plots suggesting that D-cytotype B. hybridum (older) lines have a unique k-mer composition.
He then moves to the analysis of founder effects in the hybrids, but shows that the hybridum + parental pangenome is not significantly different from the individual parental pangenomes. Finally, he shows dN/dS plots to show that both subgenomes are still under selection.
M Morgante comments that these data are probably not compatible with an epigenetic shock post-hybridization.

Jae Young Choi, New York University, USA
Jae could not attend and was replaced by an unnamed researcher from the group. She starts by explaining that, besides transposable elements (https://www.ncbi.nlm.nih.gov/pubmed/25917896), tandem repeats are important drivers and markers of plant diversity. The talk is actually about natural variation in telomere repeats, which essentially are a major plant satellite, and their correlation with flowering time. They work with 100-mers of Oryza species, which include telomeres. In fact they see that O. sativa ssp. indica has significantly larger telomeres than ssp. japonica, and that telomere length correlates negatively with days to flowering.

Gabriele Magris, University of Udine, Italy
Gabriele gave a very nice and comprehensive talk on the characterisation of the pan-genome of Vitis vinifera using NGS, with a special focus on collinear genes that have gained or lost a neighboring transposable element (TE) affecting their expression. My battery died and, unfortunately, I could not take proper notes. However, I recall that he showed nice results on the methylation state of the regions where TEs insert and on the preference of TE families for particular genomic territories, such as LINE elements for introns. I asked him how to efficiently annotate TEs in genomes and he referred me to the work of Wicker (https://www.nature.com/articles/nrg2165-c2).

 

19 October 2018

"Modern Statistics for Modern Biology" (book)

Hi,
this is my first post written from the EMBL-EBI, and in it I just want to share an open-access book called Modern Statistics for Modern Biology, written by Susan Holmes and Wolfgang Huber, which can be read at https://www.huber.embl.de/msmb


It is written in plain prose and describes approaches to tackle real problems in biology in general, including those we regularly discuss in this blog. Besides explaining the fundamentals, the text has many examples and complete solutions in the R language. In fact, the source code of all chapters can be downloaded from http://web.stanford.edu/class/bios221/book/Rfiles

Regards,
Bruno

25 September 2018

SequenceServer: nice local blast

Hi,
today I wanted to let you know about a tool that we discovered recently that has been very useful for us. Its name is SequenceServer (http://www.sequenceserver.com). It is simply a wrapper to let your collaborators run NCBI BLAST searches on your local sequence databases.
All you need is a copy of the NCBI folder with some BLAST+ release and a Linux distribution. Here we had ncbi-blast-2.7.1+ already in place but had to install the application, which is written in Ruby, as recommended by the authors:

$ sudo gem install sequenceserver

Due to dependency problems during installation I could not manage to install it in CentOS 5, but it was easy in CentOS release 7.5 (sudo yum install ruby-devel). Once this is done, and the appropriate port is open in the host, all that remains is to let the application know where the sequence databases are. You can do that with these commands:

$ sequenceserver -d /path/to/dbs  # add new databases
$ sequenceserver -l               # list installed dbs
$ sequenceserver &                # launch web application 
 

Now you are ready to go. Your users only need to type the URL:port of your host in their browser to run their searches. This is the way it looks on our server:


Cheers, Bruno

PS I will be moving to the EMBL-EBI so there might be a break in this blog, but please keep in touch

10 August 2018

minimap2 vs BLASTN

Hi,
Heng Li recently published a paper (https://doi.org/10.1093/bioinformatics/bty191) describing a new general-purpose nucleotide aligner called minimap2, which you can download from https://github.com/lh3/minimap2.

Figure taken from https://doi.org/10.1093/bioinformatics/bty191

In the paper minimap2 is compared in different scenarios against alternative software, including its predecessor BWA-MEM, and its speed and versatility are highlighted, as it can align short reads and long sequences, and can even align across introns.

What I did was a quick test to compare it with BLASTN in the usual GET_HOMOLOGUES-EST scenario, where, for example, all the genes of one plant (Brachypodium distachyon) are compared against all the genes of a closely related species (Oryza sativa). This is what I did:

# how many sequences
$ grep -c "^>" *fna
Bdistachyon.fna:36647
Osativa.fna:42189

# index and BLASTN search
$ ncbi-blast-2.6.0+/bin/makeblastdb -in Osativa.fna -dbtype nucl
$ time ncbi-blast-2.6.0+/bin/blastn -query Bdistachyon.fna -db Osativa.fna \
   -out Bdistachyon.Osativa.blastn.tsv -dbsize 100000000 -evalue 1e-5 -outfmt 6
real	0m40.937s
user	0m40.280s
sys	0m0.636s

# index [assuming sequence identity down to 80%] and minimap2 search
$ minimap2/minimap2 -x asm20 -d Oryza.mmi Osativa.fna
$ time minimap2/minimap2 Oryza.mmi Bdistachyon.fna > Bdistachyon.Osativa.minimap.paf
real	0m2.084s
user	0m3.360s
sys	0m0.300s

Now let's take a look at the resulting alignments. I pick a couple of sequences, first from BLASTN:

BdiBd21-3.2G0760100.1   LOC_Os01g70090.1        87.839  847     95      5       31      876     37      876     0.0     987
BdiBd21-3.2G0521100.1   LOC_Os01g37510.1        85.652  683     92      3       91      773     103     779     0.0     713

and now from minimap2, in PAF format:

BdiBd21-3.2G0760100.1	876	155	776	+	LOC_Os01g70090.1	876	161	776	181	621	60	tp:A:P	cm:i:16	s1:i:179	s2:i:0	dv:f:0.0980
BdiBd21-3.2G0521100.1	777	110	653	+	LOC_Os01g37510.1	783	122	659	87	543	60	tp:A:P	cm:i:10	s1:i:85	s2:i:0	dv:f:0.1196

At least for these two examples we can observe that:
i) the best hits of BLASTN and minimap2 agree
ii) the BLASTN alignments are longer
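The two observations above can be checked programmatically. This is only a minimal Python sketch, assuming the default 12 columns of BLAST -outfmt 6 and the 12 mandatory PAF columns; the function names parse_blast6 and parse_paf are made up for illustration, and the input lines are the first example pair shown above.

```python
def parse_blast6(line):
    """Parse one line of BLAST tabular output (-outfmt 6, default columns)."""
    f = line.split()
    return {
        "query": f[0], "target": f[1],
        "identity": float(f[2]), "aln_len": int(f[3]),
        "qstart": int(f[6]), "qend": int(f[7]),  # 1-based, inclusive
    }


def parse_paf(line):
    """Parse the 12 mandatory columns of a PAF line; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0], "target": f[5],
        "qstart": int(f[2]), "qend": int(f[3]),  # 0-based, end exclusive
        "matches": int(f[9]), "aln_len": int(f[10]),
    }


blast_line = ("BdiBd21-3.2G0760100.1   LOC_Os01g70090.1        87.839  847     "
              "95      5       31      876     37      876     0.0     987")
paf_line = "\t".join(["BdiBd21-3.2G0760100.1", "876", "155", "776", "+",
                      "LOC_Os01g70090.1", "876", "161", "776", "181", "621", "60"])

b = parse_blast6(blast_line)
p = parse_paf(paf_line)

# i) do both tools report the same query/target pair?
same_hit = (b["query"], b["target"]) == (p["query"], p["target"])
# ii) compare alignment lengths
print(same_hit, b["aln_len"], p["aln_len"])  # True 847 621
```

Note that the two formats use different coordinate conventions (1-based inclusive in BLAST, 0-based with exclusive end in PAF), which matters when comparing start and end positions.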


See you soon, enjoy the holidays,
Bruno

23 July 2018

set difference between lists with Perl

Hi again,
this post shares an efficient recipe to compute the set difference between two lists or arrays in Perl5.
Set difference, taken from https://es.wikipedia.org/wiki/Diferencia_de_conjuntos.
To do so we can define the following subroutine, taken from the Array::Utils module:

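use strict;
use warnings;

# define the sub before calling it, so that the (\@\@) prototype is applied
# and both arrays are passed by reference instead of being flattened
sub array_minus(\@\@) {
    # hash with the elements of the second array, for constant-time lookups
    my %e = map { $_ => undef } @{$_[1]};
    # keep the elements of the first array that are absent from the second
    return grep { !exists $e{$_} } @{$_[0]};
}

my @a = 0..10000;
my @b = 5000..10000;

my @diff = array_minus(@a, @b);   # 0..4999
print scalar(@diff), "\n";        # 5000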
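For comparison, the same lookup-table idea can be sketched in Python. This is a minimal illustration and not part of Array::Utils; the function name array_minus is reused here only to mirror the Perl recipe.

```python
def array_minus(a, b):
    """Return the elements of a that are not in b, keeping the order of a."""
    exclude = set(b)  # set membership tests take constant time on average
    return [x for x in a if x not in exclude]


a = list(range(0, 10001))     # 0..10000
b = list(range(5000, 10001))  # 5000..10000
diff = array_minus(a, b)
print(len(diff), diff[0], diff[-1])  # 5000 0 4999
```

As in the Perl version, the second list is turned into a lookup table once, so the cost grows with the sum of both list lengths rather than with their product.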

You can see other alternatives on reddit, including solutions in Perl6 and Python,
regards,
Bruno