January 2, 2019

BLAST+ updated to version 2.8.1

Hi, I hope you are all well.
In this first post of the year I just wanted to point out that BLAST+ was updated to version 2.8.1+ a couple of weeks ago because of a bug found when using the -max_target_seqs option, as published in https://doi.org/10.1093/bioinformatics/bty833 and discussed at https://www.biostars.org/p/340129 .

In response to this bug, three NCBI authors (Madden, Busby and Ye) wrote a letter explaining that the detected bug has less impact than expected, because it only affects alignments with a "very large" number of indels. However, they do acknowledge that using the -max_target_seqs parameter with small values of M can be confusing, because sequences with the same score would be selected according to their position in the input FASTA file. To address this, the updated version warns the user when M < 5 is used.
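A common way to stay clear of this issue is to ask BLAST for more hits than you need and pick the best ones afterwards. The sketch below illustrates that idea. The helper names `build_blastn_command` and `best_hits`, the file names and the default values are my own assumptions, and the warning only mimics the M < 5 threshold mentioned above; it is not the message printed by BLAST+. The parser assumes the default 12-column tabular output (`-outfmt 6`), where the bit score is the last column.

```python
import warnings


def build_blastn_command(query, db, max_target_seqs=500, evalue=1e-5):
    """Build a blastn argument list (hypothetical helper, nothing is executed here)."""
    if max_target_seqs < 1:
        raise ValueError("max_target_seqs must be a positive integer")
    if max_target_seqs < 5:
        # Mirrors the M < 5 threshold described in the post, not BLAST's own message
        warnings.warn(
            "small -max_target_seqs values may not return the best-scoring hits; "
            "consider a larger value and filtering afterwards"
        )
    return [
        "blastn",
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-max_target_seqs", str(max_target_seqs),
    ]


def best_hits(tabular_lines, n=1):
    """Keep the n highest bit score subjects per query from -outfmt 6 lines."""
    hits = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 12:
            continue  # skip comments and malformed lines
        query_id, subject_id, bitscore = fields[0], fields[1], float(fields[11])
        hits.setdefault(query_id, []).append((bitscore, subject_id))
    # sorted() is stable, so ties keep the order in which BLAST reported them
    return {
        q: [s for _, s in sorted(h, key=lambda hit: -hit[0])[:n]]
        for q, h in hits.items()
    }
```

The list returned by `build_blastn_command` could be passed to `subprocess.run`, and the resulting lines to `best_hits`, so that the final selection of top hits happens in your own code rather than inside the search.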

The detailed explanation by the BLAST authors and the changes introduced in the current version are described at https://www.ncbi.nlm.nih.gov/books/NBK131777 and https://doi.org/10.1093/bioinformatics/bty1026 .

Best regards,
Bruno

December 17, 2018

we don't know how to fold proteins (CASP13)

Hi,
in the last post of this year, written from Hinxton, UK, I would like to talk about CASP13, the most recent edition of the community experiment on blind prediction of protein structures (which we had already mentioned here).

Between the jump in predictive capacity seen this time and machine learning being in the news, this year CASP has been featured everywhere: in Science, in The Guardian and even in El País.

Here I will focus on the opinions of expert CASP participants. But first, so that you know what I am talking about, you can see the official results at predictioncenter.org/casp13

I will start with this figure by Torsten Schwede, which shows the jump in quality of the best predictions throughout the history of CASP. The fit between a model and its experimental structure is computed with the GDT_TS function:

Source: https://www.sib.swiss/about-sib/news/10307-deep-learning-a-leap-forward-for-protein-structure-prediction
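To give an idea of what GDT_TS measures, here is a minimal sketch. It assumes you already have the distance in Å between each model Cα atom and its equivalent in the experimental structure after a single superposition. The real score searches for the best superposition at each cutoff, so this simplified version can only underestimate it. The function name `gdt_ts` and the example distances are made up for illustration.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: mean percentage of residues within each distance cutoff.

    distances: per-residue model-to-experiment distances in Angstrom, taken
    from one fixed superposition (an assumption of this sketch).
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    # Count, over all cutoffs, how many residues fall within the cutoff
    within = sum(1 for cutoff in cutoffs for d in distances if d <= cutoff)
    return 100.0 * within / (len(distances) * len(cutoffs))


# Five residues: 1, 2, 3 and 4 of them fall within 1, 2, 4 and 8 Angstrom
print(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]))  # prints 50.0
```

A perfect model would score 100, and a model where no residue lies within 8 Å of its true position would score 0.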

Another view of the same results is given by Mohammed AlQuraishi, showing the gap between the best groups/predictors across CASP editions:


Source: https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
In both cases we can see an upward trend; it remains to be seen whether it holds over time or whether, instead, it is due to the CASP13 target sequences being easier than in other editions.

What has happened in recent years? Probably the sum of many things. For example, the arrival of the DeepMind team in this golden age of machine learning. It is curious, because neural networks have been applied in CASP since the nineties for secondary structure prediction; however, since 2011 we know that for many protein families we have so many different sequences that we can predict the contacts that occur between the folded parts of the protein.

Source: https://doi.org/10.1371/journal.pone.0028766
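The intuition behind contact prediction is that positions touching each other in the folded protein tend to mutate together across a family. The toy sketch below scores that covariation with plain mutual information between alignment columns. Note that this is a much simpler measure than the global statistical models used in the work cited above, which are designed to separate direct from indirect couplings; I use it here only to illustrate the idea. The four-sequence alignment and the function names are invented.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (bits) between columns i and j of an aligned set of sequences."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


def rank_column_pairs(msa):
    """Return (MI, i, j) for all column pairs, highest covariation first."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences in the alignment must have the same length")
    scores = [
        (column_mutual_information(msa, i, j), i, j)
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, reverse=True)


# Toy alignment: columns 0 and 3 change together, the other pairs do not
toy_msa = ["ACDA", "ACEA", "GCDT", "GCET"]
print(rank_column_pairs(toy_msa)[0])  # prints (1.0, 0, 3)
```

With real families you would need thousands of diverse sequences, sequence weighting and corrections for phylogenetic bias before the top-ranked pairs start to look like true contacts.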

So, we still don't know how proteins fold, but some research groups have managed to exploit the evolutionary information implicit in multiple protein alignments to work out what kind of fold they finally adopt. Many of those groups share their source code (for example http://evfold.org/evfold-web/evfold.do); let's see whether DeepMind does so soon,

see you next year!

Bruno

October 26, 2018

Plant Genomes in a Changing Environment (III)

Hi, this is my account of the first few talks from the last day of the meeting.


Claudia Köhler, Swedish University of Agricultural Sciences, Sweden
She talks about imprinted genes which are flanked by transposable elements (TE) in Arabidopsis thaliana. They find that RNApolIV mutants suppress triploid seed abortions. RNApolIV is known to be involved in RNA-guided methylation. They found that RNApolIV is behind the biogenesis of easiRNAs from TEs, and that this correlates with decreased CHH methylation in the endosperm of triploid seeds (https://www.ncbi.nlm.nih.gov/pubmed/29335544). So they propose that pollen-derived easiRNAs are functional after fertilization and have a transgenerational role in assessing gamete compatibility, similar to animal piRNAs. The relevance of the results is that these mechanisms allow rapid evolution of hybridization barriers and ultimately speciation.

Isabel Bäurle, University of Potsdam, Germany
She talks about how Arabidopsis thaliana plants remember past stress events, particularly heat, which is one of the most fluctuating stress sources in nature. She describes Heat Shock Factor 2 (HSFA2) and how it associates transiently with genes conferring heat memory. Target genes were observed to accumulate H3K4me3, making chromatin accessible for at least 5 days (https://www.ncbi.nlm.nih.gov/pubmed/26657708, http://www.plantcell.org/content/early/2014/04/25/tpc.114.123851). Then she moves on to describing BRU1/TSK/MGO3, which is orthologous to animal TSL, has an epigenetic role during DNA replication and is also required for heat memory, ensuring that chromatin marks are inherited during cell division (https://onlinelibrary.wiley.com/doi/abs/10.1111/pce.13365). Their long-term goal is to provide stress memory to crops at the right moment so that yield is not too affected.

Manu Dubin, CNRS / Université de Lille, France
He explains he is back in academia from industry and that he is studying how both climate of origin and breeding efforts influence DNA methylation in barley (Hordeum vulgare) and how that is linked to adaptation, inspired by previous work on climate clines in A. thaliana. They used the USDA barley core collection (inbred seeds from Mexico), with both landraces and cultivars from Europe and North America; it does not include any Iberian or North African barleys, which are known to contribute to the genetic diversity of the species (see for instance https://link.springer.com/article/10.1007/s11032-018-0816-z). They observe that winter barleys have slightly higher CG methylation than springs and show GWAS results on TE methylation. They find that for most TE families winter lines are more methylated than springs. He focuses a little on BARE1 copia-like elements, associated with drought and ABA responses, with higher CNV in equatorial / short-term temperature-fluctuating regions. He shows a negative correlation between BARE1 CNV and yield. He shows nice boxplot-like plots showing individual data. He is asked to what extent the reference genome (Morex) affects his conclusions. He is also asked whether the seed source would affect his results, and to what extent his yield measurements are affected by the fact that he is planting barleys from other regions in Northern Europe.

Sorry, I missed the talks by Martin Groth (Helmholtz Zentrum München, Germany), Nick Loman (U. Birmingham, UK) and Tetsuya Higashiyama (Nagoya University, Japan).

October 25, 2018

Plant Genomes in a Changing Environment (II)

Now for the second day.



Etienne Bucher, INRA, France
I miss the beginning of the talk but still get the main message: you can control the efficiency of retrotransposon mobilization in plants by exposing plants to heat (stress) and drug-inhibiting RNA pol II, which has a key role in transposon defense (RNA-directed methylation). The key paper is https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1265-4. They are using controlled [drugs: α-amanitin and zebularine] to create new variants and to select them in the field with rice and soybean. He has set up a company called epibreed to carry out this kind of experiment, but he insisted the approach can be used for free for research purposes.

Holger Puchta, Karlsruhe Institute of Technology, Germany
He takes us through a nice overview of double-strand breaks in plant genomes, and then moves to CRISPR-Cas9 systems, where they initially got 15% (heritable mutation) efficiencies in Arabidopsis thaliana. Now, using S. aureus Cas9, they achieve 90% efficiencies. They have tried several approaches for in planta gene targeting (initial idea summarized in http://www.pnas.org/content/early/2012/04/19/1202191109) and are improving their efficiency so that they can use it to routinely knock out genes in A. thaliana (http://www.pnas.org/content/113/26/7266.short). He discusses that by combining double-strand breaks it is possible to induce recombination in centromeric regions, where meiotic recombination is extremely unlikely. In A. thaliana, out of 200 ds-breaks you get about 10 cross-over events. He is funded by ERC.

Sophie Harrington, John Innes Centre, UK
She talks about TILLING to study wheat senescence. They do EMS TILLING populations and sequence captured exons. She shows a nice figure of Ensembl Plants where this kind of data is readily available for users. She then introduces NAC transcription factors and in particular the NAM factors related to senescence. They use tetraploid wheat to study NAM-A1, because it's single-copy there. By phenotyping EMS populations they see that a particular amino acid substitution induces a significant delay in senescence in the field in two environments. Using yeast two-hybrids (Y2H) they believe these mutations impair NAC dimerization. She mentions a paper describing the NAC family in wheat (https://www.ncbi.nlm.nih.gov/pubmed/28698232). They used chromosome sorting to isolate a chromosome harboring a region with a clear allele frequency shift linked to senescence, and they are working on sequencing that region. She gets several questions regarding dominant mutants in wheat, and how the dominant nature relates to the number of copies of the mutated regions.

Youssef Belkhadir, GMI Vienna, Austria
He talks about the molecular logic and emergent properties in receptor-receptor interaction networks around plant signaling. There are 400 receptor kinases (RKs) in Arabidopsis thaliana. They have diverse extracellular domains (ECDs). He shows nice cartoons of large & short Leu-rich ECDs docked together with a ligand and triggering intracellular phosphorylation and presents their approach to high-throughput screen LRR domains, as published in https://www.nature.com/articles/nature25184. They did confirmation Y2H experiments and found an agreement of 57% for high-confidence short-to-long LRR interaction predictions. By using network dissection, including PageRank, they find that short LRR proteins are more frequently central nodes than long LRR proteins.
He also shows data from an A. thaliana diversity panel (about 600 lines) used for large-scale root phenotyping assays of plants treated with brassinosteroids. Subsequent GWAS analyses suggest several LRR genes to explain the differences observed.
He mentions that BAK1 receptor is 100% conserved at the amino acid level in over 1K A. thaliana lines. He mentions that absence genotypes of particular LRR genes were confirmed by PCR against the suspected genome. They didn't do the actual annotation; instead this was done at the group of Magnus Nordborg.
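For readers not familiar with PageRank as a centrality measure, this is a minimal power-iteration sketch on an undirected interaction network. It is not the analysis from the talk or the paper: the edge list is a made-up star-shaped toy network, the node names are hypothetical, and the damping factor of 0.85 is just the conventional default.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration on an undirected network given as (a, b) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor shares its current rank equally among its own neighbors
            received = sum(rank[m] / len(neighbors[m]) for m in neighbors[node])
            updated[node] = (1.0 - damping) / len(nodes) + damping * received
        rank = updated
    return rank


# Hypothetical toy network: one short LRR protein contacting four long ones
toy_edges = [
    ("short_LRR_1", "long_LRR_1"),
    ("short_LRR_1", "long_LRR_2"),
    ("short_LRR_1", "long_LRR_3"),
    ("short_LRR_1", "long_LRR_4"),
]
ranks = pagerank(toy_edges)
print(max(ranks, key=ranks.get))  # prints short_LRR_1
```

In a real interactome the interesting part is comparing the rank distributions of the two classes of proteins, rather than looking at a single hub.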

Anne Osbourn, John Innes Centre, UK
She talks about antimicrobial compounds (such as avenacin) synthesized in the roots of Avena plants. The responsible pathway is actually composed of several neighboring genes which are all under concerted expression, with a root-specific promoter (http://www.pnas.org/content/111/23/8679). They have a contig of this 720Kb region of the genome and they believe this cluster is not conserved in Brachypodium or in wheat.
She mentions that many metabolic gene clusters have been reported in both monocots and dicots, that no horizontal gene transfer from microbes has been demonstrated and that their genomic co-localization is probably linked to their regulation and epigenomics (https://www.ncbi.nlm.nih.gov/pubmed/26895889). They have developed transient expression systems to test these metabolic clusters, both natural and synthetic, in Nicotiana leaves and obtained in some cases gram-scale triterpene production (https://www.ncbi.nlm.nih.gov/pubmed/28687337).
She then describes the thalianol pathway in A. thaliana, which was the first operon-like cluster they ever predicted, and other later examples, such as http://www.pnas.org/content/114/29/E6005. She also shows data on rhizosphere composition changes in mutants of these pathways. They have developed a tool for predicting metabolic clusters: http://plantismash.secondarymetabolites.org

Matteo Dell'Acqua, Scuola Superiore Sant'Anna, Italy
He talks about the identification of candidate genes for maize leaf development using tools such as GWAS, eQTL and precision phenotyping. He emphasizes the need to integrate approaches due to the observation that most alleles have small effects, with only a few major effect genes whatever the complex trait under study. He shows correlations among gene expression values and leaf traits, as well as GWAS-derived SNPs associated to the same traits.
He also shows that for eQTLs, the majority of expression levels analyzed are associated with remote cis & trans locations (matrix of expressed gene position vs eQTL position, cis are on the diagonal). They focus on cis SNPs found for several traits, and find several genes encoding vacuole pumps. He mentions the challenge of pericentromeric regions that have high linkage disequilibrium and produce artificial segments with consecutive eQTLs. They also use WGCNA and compute correlations between modules and phenotypes, finding that some have positive correlations while others are actually negative.
He concludes by summarizing that RNAseq data are very valuable to do eQTL analyses and to produce markers.

Ming-Jung Liu, Academia Sinica, Taiwan
She starts by saying that Academia Sinica is currently recruiting and moves on to talk about regulatory divergence in wound-responsive gene expression between domesticated (lycopersicum) and wild (pennellii) Solanum species. She spends some time discussing the tradeoff between growth and wound stress tolerance in wild species. They identified putative cis regulatory elements enriched in clusters of genes related to wound responses, which correspond to G-box and W-box elements, and are enriched in upstream regions immediately before TSS positions. They then check whether these cis elements are conserved between both species and find that most are conserved but a good fraction are actually not conserved, being unique to each species (http://www.plantcell.org/content/early/2018/05/09/tpc.18.00194).

Sally Aitken, University of British Columbia, Canada
She talks about climate adaptation in conifers, which are currently experiencing drought and massive death in British Columbia. She talks about the increasing frequency of extreme climate events, added to the warming trends. (Tree) seed and breeding zones based on local populations no longer match genotypes with climates. Mutation rates in trees are low per year but high per generation. They have estimated that climate is changing at a speed of 70km/yr, while paleobiology evidence suggests trees have historically travelled at 0.1km/yr. She describes their AdapTree project, which is designed to manage this issue in W Canada with assisted gene flow. They have not seen population variability in drought/heat response, only in cold hardiness. As they don't have access to good assemblies they used exome capture and SNP arrays to do Genome to Environment Association with bayenv2 and standard GWAS. She explains that the population structure of conifers actually correlates with climate gradients, so that by removing pop structure you actually miss potentially bona fide adaptation loci. So they decided not to remove pop structure and instead took only SNPs in excess of the background distribution of SNPs per gene (http://science.sciencemag.org/content/353/6306/1431). They found 47 candidate genes common to pine and spruce populations and later work was done to find correlating haplotypes, instead of individual SNPs, to be used as markers (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1545-7).

I missed “Reinforcing plant evolutionary genomics using ancient DNA” by Hernan Burbano (MPI Tübingen, Germany) and “A major QTL for grain weight in wheat is associated with increased grain length and cell size” by Jemima Brinton (John Innes Centre, UK).

Esther van der Knaap, University of Georgia, USA
She talks about their work on the mechanisms underlying morphological diversity in tomato, which is largely explained by four gene families, including Ovate and the OFP family members. OFP have been shown to interact with TFs, to act as repressors and to affect cellular localization of other proteins. They observed that OFP20 interacts with a series of proteins in Y2H assays and further refined the list by doing Cas9 knockout mutants and found that the pear/round shape is related to patterns of cell division in the fruit. She mentions a collaboration with Toni Monforte (UPV, Spain) where they found another OFP family member responsible for melon fruit shape.

Benjamin Brachi, INRA, France
He talks about natural variation of leaf secondary metabolites, and the underlying genetics, in European white oaks (Quercus robur). They have a reference genome and a genetic map made from trees planted in 1999. They do mass spectrometry on leaf extracts, cluster the compounds/pseudomolecules observed and estimate their replicability and heritability. He then explains a study of 9 populations of Quercus petraea from around France, where they see that population provenance explains only a very small part of the metabolites analyzed, and a fraction of those actually have bimodal/binary/PAV patterns: they are either produced or not at all. I think he believes the latter have a genetic explanation, while the rest probably respond largely to the environment.

Andrew Gloss, University of Chicago, USA
Andy talks about plant genotype × herbivore genotype interactions using 288 ecotypes of Arabidopsis thaliana, with the goal of discovering the genetic architecture of resistance to herbivory. The chosen herbivore is a fly related to Drosophila. They measure leaf damage and perform multi-trait GWAS, classifying SNPs as common genetic SNPs and SNPs with effects that depend on the plant population studied. He then focuses on gene PBSL, which underlies clinal variation in size from N to S Europe.

Sarah Schiessl Weidenweber, Justus Liebig University Giessen, Germany
She talks about miRNA signaling under drought stress in winter lines of allopolyploid Brassica napus. How does drought affect flowering? It delays flowering and reduces yield. Their hypothesis is that the flowering network senses drought stress by means of RNAi. They put their plants in containers to get realistic soil drying compared to pots, sampled tissue and finally did WGCNA analyses, first with RNAseq to define modules and then with small RNAs, looking for those correlated with the modules defined earlier. Now they are studying the expression of the candidate small RNAs in PCR experiments and they have observed high variation across genotypes.

Adrien Sicard, SLU Uppsala, Sweden
He talks about the convergent evolution of flower morphology after the transition to selfing in the genus Capsella. He introduces the selfing syndrome of repeated morphological evolution in plants, which tend to reduce petal size by reducing the number of petal cells, something they also see after Principal Component Analysis of transcriptomes of selfing and non-selfing species. They have a strong QTL for petal size in a population of two selfing species. When the candidate gene is mutated, probably in the promoter, they see pleiotropic effects.