16 October 2019

Plant Genomes in a Changing Environment 2019 (I)

Hi, these are my notes on the talks I attended on the first day of the Plant Genomes in a Changing Environment 2019 conference.

Jump to day 2 or day 3.


Unlocking the polyploidy potential of wheat through genomics (Cristobal Uauy, John Innes Centre, UK)
Phenotypes of agricultural importance are complex, with continuous gradients instead of discrete states, and differences are often difficult to tell from noise. This only gets worse in polyploids, where QTL effects are subtler than in diploids. He explains that Arabidopsis thaliana is as far from wheat as a platypus is from a human. He then talks about the rich wheat genomic resources, which are bringing people to work on this species, perhaps for the first time. All these resources are documented at http://www.wheat-training.com

He then describes a particular example of combining these tools, using https://bioconductor.org/packages/release/bioc/html/GENIE3.html to predict target genes of wheat TFs from around 900 RNAseq experiments. They see no evidence of TFs preferring targets from the same (A, B, D) subgenome, even though the D subgenome evolved independently of A & B for thousands of years. Asked how difficult it is to map & genotype specific subgenomes, he says that with Polymarker you can enrich on subgenome-specific regions. They use https://pachterlab.github.io/kallisto to quantify transcripts and have validated in chr-deletion lines that the transcripts from the missing chromosomes are not expressed.
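To make the subgenome question concrete, here is a minimal sketch of how one could check whether TFs prefer targets in their own subgenome, given a list of predicted TF-to-target edges. It is not the analysis shown in the talk: the edge list and gene IDs are made up, and I am assuming IDs follow the TraesCS3D02G273600 pattern, where the character after the chromosome number is the subgenome.

```python
from collections import Counter

def subgenome(gene_id):
    """Return 'A', 'B' or 'D' from an ID such as TraesCS3D02G273600 (assumed format)."""
    return gene_id[8]

def same_subgenome_fraction(edges):
    """Observed fraction of TF -> target edges that stay within one subgenome."""
    same = sum(1 for tf, target in edges if subgenome(tf) == subgenome(target))
    return same / len(edges)

def expected_fraction(edges):
    """Fraction expected if TFs and targets were paired at random."""
    n = len(edges)
    tf_counts = Counter(subgenome(tf) for tf, _ in edges)
    target_counts = Counter(subgenome(target) for _, target in edges)
    return sum((tf_counts[s] / n) * (target_counts[s] / n) for s in "ABD")

# Hypothetical edges, for illustration only
edges = [
    ("TraesCS1A02G000100", "TraesCS2B02G000200"),
    ("TraesCS1A02G000100", "TraesCS3A02G000300"),
    ("TraesCS4B02G000400", "TraesCS5D02G000500"),
    ("TraesCS4B02G000400", "TraesCS6B02G000600"),
    ("TraesCS7D02G000700", "TraesCS1A02G000800"),
    ("TraesCS7D02G000700", "TraesCS2D02G000900"),
]
observed = same_subgenome_fraction(edges)
expected = expected_fraction(edges)
print(f"observed={observed:.2f} expected={expected:.2f}")
```

With a real network you would compare the observed fraction against the random expectation, ideally with a permutation test, before claiming any preference.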

ENSEMBL plants – Visualizing the Wheat Genome in Ensembl Plants (Guy Naamati, EMBL-EBI, UK)
He starts with a quick tour of http://plants.ensembl.org/Triticum_aestivum. He then explains the gene trees for wheat produced with https://www.ensembl.org/info/genome/compara . Then he moves on to the wheat variant collections, the TILLING mutants and the KASP markers. Finally he also mentions our preparatory work with a dozen wheat cultivar assemblies from http://www.10wheatgenomes.com . He concludes by showing off the Ensembl Outreach team, who run Ensembl training around the world. He is asked why gene models keep changing across releases and whether it is possible to know the most abundant isoform (canonical?). He's also asked how the 10+ varieties are going to be loaded. Another question is how orthologues with low sequence identity are handled.

Expression atlas - Submission, archival and visualisation of plant sequencing data (Nancy George, EMBL-EBI, UK)
She guides us through a submission from start to end: i) annotate metadata with https://www.ebi.ac.uk/fg/annotare, ii) import expression data with https://www.ebi.ac.uk/arrayexpress and https://www.ebi.ac.uk/gxa (min 3 replicates, biological question, reference in Ensembl). These steps eventually result in baseline expression reports such as http://plants.ensembl.org/Triticum_aestivum/Gene/ExpressionAtlas?g=TraesCS3D02G273600;r=3D:379535906-379539827
She then notes that the plant community has not yet embraced single-cell sequencing, due to technical challenges.

Benchmarking and development of an ensemble motif mapping approach to improve gene regulatory network inference (Marc Jones, VIB / Ghent University, Belgium)
He introduces TF binding motifs and how they can be used to scan genome sequences to predict binding sites. They compared different motif aligners, including MOODS, cluster-buster, FIMO and matrix-scan. They observe that FIMO sites tend to match those from other tools more often. They then compared site predictions to ChIPseq read depth in order to compute precision and recall. FIMO comes out best in terms of precision and worst on recall. When they look at the first 7000 sites, the 4 tested aligners behave similarly. Eventually they combined FIMO and cluster-buster, as they report many sites missed by the others. The full set of results is described at http://www.plantphysiol.org/content/181/2/412
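As a reminder of what motif scanning and the precision/recall benchmark boil down to, here is a small sketch. It is a generic log-odds scan, not the scoring used by FIMO, MOODS, cluster-buster or matrix-scan; the motif counts, sequence, thresholds and "ChIP-supported" positions are all invented, and sequences are assumed to contain only A, C, G and T.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return start positions of windows scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append(start)
    return hits

def precision_recall(predicted, reference):
    """Precision and recall of predicted sites against a reference set of sites."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical motif with consensus ACGT, built from 10 aligned sites
counts = [{"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}]
matrix = log_odds_matrix(counts)
sequence = "TTACGTTTACGAAA"
strict_hits = scan(sequence, matrix, threshold=6.0)  # perfect matches only
loose_hits = scan(sequence, matrix, threshold=3.0)   # tolerates one mismatch
chip_sites = [2]  # made-up stand-in for sites supported by ChIPseq
print(strict_hits, precision_recall(strict_hits, chip_sites))
print(loose_hits, precision_recall(loose_hits, chip_sites))
```

Lowering the threshold adds the one-mismatch site at position 8, which is the kind of precision versus recall trade-off the benchmark measures across tools.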

No genome required: Finding genetic variants associated with plant phenotypes without complete genome information (Yoav Voichek, Max Planck Institute for Developmental Biology, Germany)
He talks about doing GWAS analysis with k-mer distributions instead of mapping to a reference genome. They start with a PAV (presence/absence) table of 31-mers across genotypes. That table can be used to characterize a pan-genome after removing low-depth k-mers, as they did with 1000 A. thaliana genome sequence sets. From that they have developed a GWAS pipeline for k-mers which accounts for population structure. They assign genomic context to k-mers by i) mapping to the ref genome, ii) LD and iii) assembling reads containing the k-mers and then mapping. The code will be released soon at https://github.com/voichek/kmers-gwas
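The PAV table idea can be sketched in a few lines. This is only a toy illustration and not the kmers-gwas code: the reads and accession names are invented, k is 4 instead of 31 to keep it readable, and I am assuming that k-mers are stored in canonical form (the smaller of a k-mer and its reverse complement) and that low-depth k-mers are dropped with a simple count cutoff.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_counts(reads, k):
    """Count canonical k-mers over all reads of one genotype."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

def pav_table(reads_by_genotype, k, min_count=2):
    """Build a k-mer presence/absence table, ignoring k-mers seen fewer than min_count times."""
    present = {
        genotype: {kmer for kmer, n in kmer_counts(reads, k).items() if n >= min_count}
        for genotype, reads in reads_by_genotype.items()
    }
    genotypes = sorted(present)
    all_kmers = sorted(set().union(*present.values()))
    table = {kmer: [int(kmer in present[g]) for g in genotypes] for kmer in all_kmers}
    return genotypes, table

# Hypothetical reads; GGGGC is seen once, so its k-mers are filtered as low depth
reads_by_genotype = {
    "acc1": ["ACGTA", "ACGTA", "GGGGC"],
    "acc2": ["ACGGA", "ACGGA"],
}
genotypes, table = pav_table(reads_by_genotype, k=4)
for kmer, row in table.items():
    print(kmer, row)
```

Each row of the table is then a binary genotype vector that can be tested against the phenotype, which is where the population structure correction comes in.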

The 4th dimension of Gene Regulatory Networks: TIME (Gloria M Coruzzi, New York University, USA) 
She talks about the time dimension in regulatory networks, illustrated with a diagram from https://europepmc.org/articles/PMC4558309. She proposes that TF binding to DNA should be treated like enzyme catalysis, with enzyme kinetics. She tells 3 stories on A. thaliana.


The Just-in-TIME approach allowed them to study genes expressed in response to N as a function of time, revealing enriched cis elements and GO terms that would have been missed in a bulk analysis https://www.pnas.org/content/115/25/6494.short. They apply ML to time-series gene expression to identify the TFs binding to those cis elements, and they validated the predictions with 7 TF perturbations that affect 2K targets.

Hit-and-Run is another approach to study transient TF binding, controlled by adding dexamethasone (developed by José Álvarez et al, soon in Nat Comms). She stresses that binding is a poor predictor of regulation, as most binding does not affect expression, and conversely, in many cases ChIPseq fails to catch binding events that they know happen. She also shows results of TFs binding to the 3'UTR. In order to catch those transient-binding TFs they used a new protocol called DamID. It turns out that most transient events occur very early in the N response, while the stable binders tend to respond late. She does not know whether transient sites are bound with less affinity, but she notes they are enriched in neighboring sites of other TFs.

Finally, they performed network walking to connect transient TFs to their targets in A. thaliana, which they published at https://www.nature.com/articles/s41467-019-09522-1. It is called net walking because they walk from primary TFs, then to secondary regulated TFs and finally to indirect targets. They are now developing a method called OutPredict to introduce priors in their network inference.
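My understanding of the walking idea, as a minimal sketch: start from a primary TF, collect its direct targets, keep those that are themselves TFs, and follow them to reach indirect targets. This is only my reading of the description in the talk, not the published method, and all gene names and edges below are hypothetical.

```python
def network_walk(primary_tf, direct_edges, tfs):
    """Walk two steps out from primary_tf.

    direct_edges maps a TF to the genes it regulates directly;
    tfs is the set of genes known to be transcription factors.
    """
    direct = set(direct_edges.get(primary_tf, ()))
    # Direct targets that are TFs themselves act as the second layer
    secondary_tfs = (direct & set(tfs)) - {primary_tf}
    indirect = set()
    for tf in secondary_tfs:
        indirect |= set(direct_edges.get(tf, ()))
    # Genes already explained by direct regulation are not counted as indirect
    indirect -= direct | {primary_tf}
    return {"direct": direct, "secondary_tfs": secondary_tfs, "indirect": indirect}

# Hypothetical network
direct_edges = {
    "TF1": ["TF2", "geneA"],
    "TF2": ["geneA", "geneB"],
    "TF3": ["geneC"],
}
walk = network_walk("TF1", direct_edges, tfs={"TF1", "TF2", "TF3"})
print(walk)
```

In this toy network geneB is only reachable from TF1 through TF2, so it ends up as an indirect target, while geneC is never reached.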
  
Genetic and genomic studies of climate adaptation and genotype-by-environment interaction in switchgrass, Panicum virgatum (Tom Juenger, University of Texas at Austin, USA)
He talks about the evolutionary genetics of plant adaptation, citing https://www.ncbi.nlm.nih.gov/pubmed/21550682 . His system is the C4, perennial, polyploid, wind-pollinated P. virgatum, related to http://plants.ensembl.org/Panicum_hallii_fil2.

They have resequenced 950 individuals at 45x to map against a V5 PacBio assembly, yielding 46M SNPs. The individuals belong to 4 populations. Their experimental sites span 24.3 degrees of latitude across 16 locations. They have published several articles, such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5100855 . They have been able to assign a % of genetic variance to climate (such as mean temp of driest quarter) and to geography, and to find SNPs associated with them. They conclude that climate has been a stronger driver of adaptation than genetic isolation, and they observe widespread QTL x E interactions for local adaptation.

9 September 2019

protein models from multiple alignments

Hi,
for a few months now I have been writing here (1, 2, 3) about the new protein structure prediction methods based on estimating distances between residues from multiple alignments of their sequences (MSA). Today I bring one of those methods which, unlike AlphaFold, you can try on your own computer: DMPfold. This algorithm comes from the group of David T. Jones, well known for very popular tools such as PSIPRED, and uses the evolutionary information captured in an MSA to compute distances between C-betas, backbone hydrogen bonds and dihedral angles (read here and here).
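To show what these methods are trying to predict, here is a minimal sketch that computes a C-beta distance map and the derived contacts from known coordinates. It has nothing to do with how DMPfold predicts them from an MSA; the coordinates are invented, and the 8 Å cutoff is the usual convention for defining a contact, used here as an assumption.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between C-beta coordinates, one (x, y, z) per residue."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def contacts(matrix, cutoff=8.0, min_separation=1):
    """Residue pairs closer than cutoff and at least min_separation apart in sequence."""
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if matrix[i][j] < cutoff
    ]

# Hypothetical C-beta coordinates, in angstroms
cb_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0), (30.0, 0.0, 0.0)]
distances = distance_matrix(cb_coords)
pairs = contacts(distances)
print(distances[0][1], pairs)
```

A contact map keeps only the yes/no answer per residue pair, while the distance map keeps the actual values, which is the extra information that distance-based predictors exploit.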


DMPfold flowchart, taken from https://www.nature.com/articles/s41467-019-11994-0


The list of dependencies is long, as explained in their repository https://github.com/psipred/DMPfold, but it will let you model your own sequences, including membrane proteins, and keep control over the process,
see you soon,
Bruno

3 September 2019

how to build phylogenies of thousands of genomes

Hi,
the accumulation of complete human genomes, currently on the order of tens of thousands, poses problems when computing phylogenies with traditional data structures and algorithms. For that reason some groups are developing new strategies that will also benefit those of us who work on plants, once we reach those numbers.

Today I comment very briefly on two methods I have just seen published in Nature Genetics. The first is called tsinfer and uses a compressed tree to store genomic variants in much less space than a VCF matrix:

Size of the data structures tested by the authors of tsinfer, taken from https://www.nature.com/articles/s41588-019-0480-1.
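A toy example helps to see why a tree can take less space than a variant matrix: each variant is stored once, on the node where it arose, and the genotype of any sample is recovered by walking up to the root. This is only an illustration of the idea, not the tsinfer data structure or its API; the tree and mutations are invented, and it assumes each site mutates only once.

```python
def genotypes_from_tree(parent, mutations, samples):
    """Rebuild a 0/1 genotype row per sample from a tree with mutations placed on nodes.

    parent maps each node to its parent (the root maps to None);
    mutations maps each site to the node where the derived allele arose.
    """
    sites = sorted(mutations)
    rows = {}
    for sample in samples:
        ancestors = set()
        node = sample
        while node is not None:
            ancestors.add(node)
            node = parent[node]
        # A sample carries the derived allele if the mutated node is on its path to the root
        rows[sample] = [int(mutations[site] in ancestors) for site in sites]
    return rows

# Hypothetical tree: root 6 has children 4 and 5; 4 -> samples 0, 1; 5 -> samples 2, 3
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
mutations = {"site1": 4, "site2": 2, "site3": 5}
rows = genotypes_from_tree(parent, mutations, samples=[0, 1, 2, 3])
for sample, row in rows.items():
    print(sample, row)
```

Here 3 mutation records replace a matrix of 4 samples by 3 sites, and the saving grows with the number of samples that share each variant through common ancestry.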

The second method is called relate and is based on reconstructing the recombination events of ancestral chromosomes that explain the observed haplotypes. This method computes branch lengths:


Summary of the relate algorithm, taken from https://www.nature.com/articles/s41588-019-0484-x.

Best regards,
Bruno

15 August 2019

progress in protein structure prediction

Hi,
a few months ago I described here the AlphaFold algorithm for folding proteins by predicting distances between residues, which I had heard about from one of its creators. Today I came across the official tertiary structure assessment of the CASP13 experiment, where AlphaFold stood out as the best predictor group. The conclusion is summarized in this figure:

Source: https://onlinelibrary.wiley.com/doi/10.1002/prot.25787
It seems safe to say that CASP13 saw a leap in prediction quality over previous editions, even though the difficulty of this edition is comparable to the previous one (Table 1 of the Abriata et al article the figure comes from). The assessors attribute the improvement precisely to the fact that, beyond predicting contacts, some of the best predictors, such as A7D (AlphaFold), MULTICOM or RaptorX, have started to predict distances between residues directly, something that requires very deep multiple sequence alignments. See you later,
Bruno


14 August 2019

comparative modelling of multidomain proteins

Hi,
homology or comparative modelling is often the only way we have to work with the structure of a protein that is not yet in the Protein Data Bank. In fact, many articles have been published with figures built on this kind of model, because they help to understand the results and put them in three-dimensional context.

Interface between two monomers modelled by homology, taken from https://science.sciencemag.org/content/364/6445/1095.

However, almost all existing protein modelling tools have historically focused on modelling protein domains one at a time, when in reality many proteins contain several domains. https://zhanglab.ccmb.med.umich.edu/DEMO has recently been published precisely to model the conformations of this kind of protein.

DEMO flowchart, taken from https://www.pnas.org/content/116/32/15930.

With the help of DEMO you can assemble previously modelled domains two at a time. The algorithm queries a non-redundant collection of multidomain structures and optimizes the orientations between domains, and it can also use experimental data (cross-linking and cryo-EM) to guide the process.

Best regards, Bruno