#!/perl/bioinfo

19 March 2018

notes on EUCARPIA Cereals meeting 2018 (I)

Monday, 19th March 2018 (program at https://symposium.inra.fr/eucarpia-cereal2018)

Intro: Gilles Charmet (INRA-UCA), remembers Patrick Schweizer

Intro Eucarpia: Andreas Börner (IPK), EUCARPIA Cereals section conference

Raphaël Dumoin (Bayer Crop Science) Wheat Innovation Strategy at Bayer

There is a need to breed wheat for both high and low productivity areas around the world. In each area, there is a gap between current productivity and potential yield. They expect that the wheat seed market will soon be as large as corn’s [due to the correlation between acreage and seed value for corn, soybean, cotton, canola]. BCS now has breeding stations in North America, EU and Australia and they are developing pure lines and hybrids, as well as looking for yield-improving traits. The elements that explain higher yield in wheat are yield stability and abiotic stress tolerance, while maintaining quality. They use heterotic pools for breeding hybrids. They work with both spring and winter wheats and take 7 yr to develop a new variety with marker-assisted breeding. They also work with targeted genome optimization/Cas9 editing, which can be done in 1-2 yr but faces regulatory hurdles in the EU. They actively engage in collaborations with public R&D organizations and private companies around the world.

Andreas Graner (IPK) Ex situ germplasm collections

There is an increased demand for crops and a need for sustainability (Steffen Science 2015). The quest for innovation in plant breeding needs the interface between genomics, metabolomics and phenomics. The breeding methodology is now genomic selection, which increases explained variability by adding minor QTLs. Both doubled haploids and transformation/Cas9 are key enabling technologies. He emphasizes the importance of surveying and exploiting the available genetic resources. He mentions that the German federal ex-situ genebank currently contains 27K wheat and 23K Hordeum accessions, with seed multiplication done on average every 20-30 yr (https://www.nature.com/articles/srep05231). These experiments have allowed estimating heritabilities of 0.89-0.95 and are now the basis for GWAS analyses with very large populations after careful curation of data. At IPK they are taking advantage of a large phenomics facility put together recently to quantitatively characterize traits such as lipid content at large scale (see the 2014 paper on Avena lipids). They have sequenced 23K barleys with GBS, observing that genetic diversity mimics geographic origin. He mentions the FAIR principles of data management and APIs. They have recently released the BRIDGE barley IPK DB (https://t.co/fLPAkkF7nY). He argues that the Nagoya Protocol on Access and Benefit Sharing (https://www.cbd.int/abs) is against Open Access, as it will restrict, for instance, dissemination of phenotypic data from collections.

Davide Guerra (CREA, Italy)

Presents the WHEALBI collection with 512 barley accessions from 73 countries, including both cultivars and landraces. These were exon-captured and sequenced to yield 403 validated samples with 64M called variants, which they used to allocate barleys to 6 geography-based subpopulations. A series of common garden experiments were carried out at several latitudes and irrigation regimes. He shows preliminary results on multi-environment GWAS experiments and discusses a few confirmed candidate genes they have found, including VRNH1, PpdH1 and HvCEN. He then goes into some depth to show his results on Copy Number Variation (CNV) at the CBF locus, the frost tolerance experiments carried out to characterize the alleles discovered and the PCR experiments ahead to survey that particular genomic locus.

Ernesto Igartua (EEAD-CSIC, Spain)

Presents the Spanish Barley Core Collection (SBCC, http://www.eead.csic.es/barley/index.php) and explains that Spanish landraces actually comprise 4 subpopulations. These SBCC barleys have been used in the CLIMBAR FACCEJGI project to analyze their association with agro-climatic variables. He first presents the genetic differentiation of the 4 subpopulations (XtX, diversity). Then a table is shown with linkage disequilibrium. First, it is found that cold tolerance and water balance are the main variables explaining the genetic diversity. Second, GWAS experiments with both Bayenv2 and LFMM confirm the CBF locus (+ control) and unveil a candidate amino-oxidase associated with cold/heat responses.

Marco Maccaferri (U Bologna, Italy); Luigi CATTIVELLI (CREA, Italy)

Marco presents the genome assembly of durum wheat cv Svevo (http://www.tasaco.com/Seed.aspx?cesit=44) and then a tetraploid diversity panel of 1.9K lines. He estimates that average LD drops below 0.2 at SNP distances between 400Kb and 1.9Mb, depending on the population.

Luigi talks more about the genome project (https://www.interomics.eu/durum-wheat-genome), assembled with NRGene software. 90% of the genome is in 2K scaffolds, and 95% of the scaffolds are mapped and anchored. The same protocol was used by another team to sequence wild emmer cv Zavitan, parent of wheat tetraploids, which had already been published (http://science.sciencemag.org/content/357/6346/93). The comparison suggests that there is a lot of CNV, concentrated at the end of chromosome arms. In addition, they found 600 loss-of-function genes in durum compared to Zavitan, due to gained stop codons or frameshifts caused by indels whose length is not a multiple of 3 (length % 3 > 0). These must have occurred in less than 10K yr.
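The frameshift criterion in the notes above (indel length modulo 3 greater than 0) can be sketched in Perl as follows. This is only an illustration: the sub name is_frameshift and the example alleles are hypothetical and not taken from the durum genome project.

```perl
use strict;
use warnings;

# Returns 1 if replacing allele $ref by $alt inside a coding sequence
# shifts the reading frame, i.e. if the length difference between both
# alleles is not a multiple of 3; returns 0 otherwise.
# Hypothetical helper, alleles are plain nucleotide strings.
sub is_frameshift {
  my ($ref, $alt) = @_;
  my $diff = abs(length($ref) - length($alt));
  return ($diff % 3 > 0) ? 1 : 0;
}

# made-up examples: 1 bp deletion, 3 bp insertion, 2 bp insertion
foreach my $pair (['AT','A'], ['A','ATTG'], ['G','GCA']) {
  my ($ref, $alt) = @$pair;
  printf("%s > %s : %s\n", $ref, $alt,
    is_frameshift($ref, $alt) ? 'frameshift' : 'in-frame');
}
```

Note that this only checks the length criterion; calling a real loss-of-function variant would also require knowing that the indel falls within a coding exon.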

Helmy M YOUSSEF (IPK, Germany)

Talks about natural diversity of inflorescence in Hordeum vulgare, reporting results published in https://www.nature.com/articles/ng.3717. He explains what two-rowed and six-rowed barleys are and describes labile and intermedium spikes as well. They discover and describe the gene Vrs2, which affects spike architecture.

Constance LAVERGNE (U Nottingham, UK)

Talks about introgressing Aegilops sharonensis cytoplasm into common wheat and the production of addition/translocation lines, which are often male-sterile. She shows seed pictures of different generations, as well as GISH preparations of introgressed and translocation lines.

Scott Allen JACKSON (U Georgia, USA)

Talks about legume genomes (10 references available currently). While annual soybeans are Chinese, there are a few perennials in Australia. Phaseolus is more ancestral and is used to root trees. Breeding is just a series of bottlenecks, and domestication is likely the most important one. However, improvement requires genetic variation. He discusses that reference genomes, while allowing many types of diversity studies, have limitations, as they are just genomic snapshots. He argues that pan-genomes are better tools and shows the wild Glycine pan-genome, reported at https://www.nature.com/articles/nbt.2979. He mentions that having it allowed them to test for genes under selection in G. max, and they found just under seven hundred.

He then talks about transposable elements (TEs) and their role in genome evolution as sources of novel diversity. TEs have a half-life of about 2Myr in a typical plant. There is no subgenome dominance effect in soybean, and there is large PAV. He also talks about DNA methylation (CG, CHG, CHH, 3 different plant methylases) and how it changes TEs (he cites https://www.nature.com/articles/nrg.2016.139). He says methylation is the preferred mechanism to silence inserted TEs in plant genomes, and explains how differentially methylated regions (DMRs) arise in a pan-genome, usually because TEs move. Most DMRs are inherited stably and behave like SNPs. He also cites a recent paper showing that post-duplication methylation diminishes as evolutionary time passes (https://onlinelibrary.wiley.com/doi/abs/10.1111/pce.13127). Non-syntenic genes tend to be C-methylated. His last statement is that a third of pan-genome genes are in low recombinogenic regions, including TE non-colinear genes.

Caroline JUERY (INRA GDEC, France)

Explains histone marks of euchromatin and heterochromatin and then explains she wants to check whether the wheat epigenome is partitioned according to H3K27me3, H3K36me3, H3K9ac, H3K4me3 marks (or lack of), ascertained by ChIP-seq. She concludes there are clearly epigenetic territories and then looks at triads of homeologous genes to measure the effect of epigenome marks (upstream, ATG, stop, downstream, as in figure 3 of http://www.plantcell.org/content/21/4/1053) on gene expression, not protein expression yet.

Cécile MONAT (IPK, Germany)

She starts by defining the basics of pan-genomes and presents the http://www.10wheatgenomes.com project, which is starting to produce reference-quality assemblies of 10 wheat cultivars combining NRGene assemblies, linked 10x reads (https://community.10xgenomics.com/t5/10x-Blog/A-basic-introduction-to-linked-reads/ba-p/95), POPSEQ and Hi-C data. Cécile has a preprint describing the pan-genome of two African rice species at https://www.biorxiv.org/content/early/2018/01/09/245431.

Maria BUERSTMAYR (BOKU, Austria)

Talks about high-resolution mapping of the pericentromeric region on wheat chromosome arm 5AS harboring the Fusarium head blight resistance QTL Qfhs.ifa-5A. She used gamma-radiation to promote double-strand breaks in DNA and overcome the lack of recombination around the centromere, which limits mapping even with large populations, by building a radiation hybrid map with markers in cR units.

Romain DE OLIVEIRA (INRA GDEC, France)

He defines CNV and then Presence Absence Variation (PAV). He explains his reference-mapping pipeline to identify TE-related CNV in wheat. He shows that wheat accessions can be clustered in terms of PAV of TEs. At least 15% of genes show PAV among accessions.

9 March 2018

growth of protein-DNA complexes in the Protein Data Bank

Hi,
while checking the update logs of our good old 3D-footprint, a database of DNA-binding protein structures updated weekly from the Protein Data Bank, I found a folder with logs starting in February 2009. The plot below shows how the number of non-redundant complexes, filtered in terms of protein sequence identity, has doubled in just a decade:

The nr95 bundle can be downloaded in PDB format at
http://maya.ccg.unam.mx/tfmodeller/get_library.cgi

Other related files are available at:
http://floresta.eead.csic.es/3dfootprint/download.html

cheers,
Bruno

1 March 2018

replacing the smartmatch operator in Perl5

Hi,
after the recent announcement that version 5.28 of Perl5 would remove the smartmatch operator ~~ (see here) I came across an old program that used it, even though it has been experimental for a long time. With the help of

$ perldoc perlop

I post here an example of how to replace this operator with standard code:

use strict;
use warnings;

my @array = qw( JASPAR footprintDB UNIPROBE );
my %hash  = ( JASPAR => 1, footprintDB => 2, UNIPROBE => 3 );

my $element = 'footprintDB';

# array context
if ($element ~~ @array){
  print "\@array contains element '$element' (smartmatch)\n";
}

if (grep { $element eq $_ } @array){
  print "\@array contains element '$element' (core Perl5)\n";
}

# hash context
if(/$element/ ~~ %hash){
  print "\%hash contains a key matching regex /$element/ (smartmatch)\n";
}

if(grep { /$element/ } keys(%hash)){
  print "\%hash contains a key matching regex /$element/ (core Perl5)\n";
}
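As a complement to the grep-based replacements above, here is a hedged sketch of two other standard alternatives: exists for exact hash key lookups and any from List::Util, which stops at the first match. It assumes List::Util 1.33 or later (to my knowledge bundled with Perl 5.20 and newer); the variable names $in_array, $is_key and $key_matches are mine.

```perl
use strict;
use warnings;
use List::Util qw(any);

my @array = qw( JASPAR footprintDB UNIPROBE );
my %hash  = ( JASPAR => 1, footprintDB => 2, UNIPROBE => 3 );

my $element = 'footprintDB';

# any() returns as soon as one element matches,
# whereas grep always walks the whole list
my $in_array = any { $_ eq $element } @array;

# exact key lookup, no regex involved
my $is_key = exists $hash{$element};

# regex match against keys; \Q..\E escapes metacharacters in $element
my $key_matches = any { /\Q$element\E/ } keys %hash;

print "\@array contains element '$element'\n" if $in_array;
print "\%hash has key '$element'\n" if $is_key;
print "\%hash contains a key matching /$element/\n" if $key_matches;
```

For a single lookup the difference between grep and any is negligible; it only matters with long lists or when the test is run many times.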

Regards,
Bruno

8 February 2018

Modelling transcription factor complexes in the terminal

Hi,
I just updated our good old server TFmodeller, available at http://www.ccg.unam.mx/tfmodeller,
so that it uses the current collection of 95% non-redundant protein-DNA complexes extracted from the Protein Data Bank. As of Feb 7, 2018, there are 977 such complexes, which can be downloaded.
In addition, I just wrote a Perl client so that predictions can be requested from the terminal via a SOAP interface, producing XML output which should be easy to parse. The PDB-format coordinates of the resulting model are marked up with tags. The input is a peptide FASTA file. This is the code:

#!/usr/bin/perl -w
use strict;
use SOAP::Lite;

my $URL = 'http://maya.ccg.unam.mx:8080/axis';
my $WSDL = "$URL/TFmodellerService.jws?WSDL";

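# the only argument is the path to a peptide FASTA file
my $infile = $ARGV[0] || die "# usage: $0 <peptide FASTA file>\n";
my ($inFASTA,$result);
open(my $fh,'<',$infile) || die "# cannot read $infile: $!\n";
{
  local $/ = undef; # slurp the whole file into one string
  $inFASTA = <$fh>;
}
close($fh);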

my $soap = SOAP::Lite->uri($URL)
                     ->proxy($URL, timeout => 300 )
                     ->service($WSDL);

eval { $result = $soap->TFmodeller($inFASTA) };
if($@){ die $@ }
else{ print $result }

The original Java client can still be found here. Note that the output includes a sequence alignment of query and template with residues contacting DNA nitrogen bases highlighted:

HEADER model 1zrf_A 203 DNACOMPLEX resol=2.10 21 8e-46
REMARK query    MILLLSKKNAEERLAAFIYNLSRRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLG
REMARK template KVGNLAFLDVTGRIAQTLLNLAKQ-PDAMTHPDGMQIKITRQEIGQIVGCSRETVGRILK
REMARK contacts ........................ ................*........***...*...
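
The REMARK alignment above can be parsed with a few lines of Perl to list the aligned columns where the template contacts DNA bases. This is a minimal sketch, not part of the TFmodeller client: the sub name contact_columns is hypothetical, the toy alignment is made up, and it assumes the layout of the sample above, with the label padded to 8 characters and followed by one space.

```perl
use strict;
use warnings;

# Takes the text of a model and returns a list of [column, query residue,
# template residue] for every aligned column marked with '*' in the
# 'REMARK contacts' line. Assumes 'REMARK ' + 8-char label + space + string.
sub contact_columns {
  my ($model_text) = @_;
  my %aln;
  foreach my $line (split(/\n/, $model_text)) {
    next unless $line =~ /^REMARK (.{8}) (.*)$/;
    my ($label, $string) = ($1, $2);
    $label =~ s/\s+$//; # trim padding
    next unless $label eq 'query' || $label eq 'template' || $label eq 'contacts';
    $aln{$label} .= $string; # concatenate in case the alignment is wrapped
  }
  foreach my $label (qw( query template contacts )) {
    return () unless defined $aln{$label};
  }

  my @columns;
  foreach my $i (0 .. length($aln{contacts}) - 1) {
    next unless substr($aln{contacts}, $i, 1) eq '*';
    push(@columns, [ $i + 1, substr($aln{query}, $i, 1), substr($aln{template}, $i, 1) ]);
  }
  return @columns;
}

# made-up toy alignment in the same layout as the sample output
my $toy = <<'EOT';
HEADER model toy
REMARK query    MKRLT
REMARK template MRK-T
REMARK contacts .*. *
EOT

foreach my $col (contact_columns($toy)) {
  print join(' ', @$col), "\n";
}
```

Fixed-width parsing is used instead of splitting on whitespace because the contacts string itself contains blanks at gap positions.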

Bruno

6 February 2018

Structural Bioinformatics 2018 at LCG-UNAM

Hi,
from today, Tuesday 6, until Friday 9 February we will spend the mornings at the Licenciatura de Ciencias Genómicas (UNAM) learning to model DNA and protein sequences as molecules that fold and carry out their function in 3D. To do so we will use algorithms and software described in this material, updated in January 2018:

http://eead-csic-compbio.github.io/bioinformatica_estructural

Domain composition of Cas9 in complex with crRNAs, taken from https://www.ncbi.nlm.nih.gov/pubmed/29035385.

It can also be downloaded as PDF,
see you,
Bruno