17 May 2019

Clustering protein sequences with Linclust

Hi,
every now and then I have to revise an old script that updates my local copy of the Protein Data Bank (PDB). The program uses rsync to download only the structures that have changed, and fetches other files from an FTP server.
However, the paths to the relevant folders keep changing and I have to update them. In particular, today the lists of non-redundant sequences had moved; they can now be found at ftp://resources.rcsb.org/sequence/clusters

Reading on, I discovered that the PDB now clusters its sequences with MMseqs2 / Linclust, two related methods that very efficiently compute the similarity between sequences from their k-mer composition over a reduced alphabet, topics we have already discussed, for example, here and here. I will focus on Linclust.

Linear-cost clustering algorithm. Source: https://www.nature.com/articles/s41467-018-04964-5

According to the benchmark published by its authors, and unlike other alternatives, the Linclust algorithm has linear cost yet performs similarly, with controllable losses of sensitivity. It consists of several stages:
  1. Translation of the original sequences into a reduced alphabet of 13 letters. They obtain optimal results by merging the following groups: (L, M), (I, V), (K, R), (E, Q), (A, S, T), (N, D), (F, Y)
  2. Generation of a k-mer table with k between 10 and 14. Only 20 k-mers are kept per sequence, selected with a hash function.
  3. Search for sequences sharing identical k-mers
  4. Pre-clustering in several steps, from most to least efficient: Hamming distance over the full alphabet, then local alignments without and with gaps.
  5. Greedy clustering with the sequences sorted by length
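To make stages 1 to 3 more concrete, here is a minimal Python sketch. It is not the Linclust code: the function names, the use of MD5 as hash function and the rule of keeping the k-mers with the lowest hash values (min-hash style) are my own assumptions for illustration.

```python
import hashlib
from collections import defaultdict

# Merged residue groups as listed in stage 1; any other letter maps to itself.
GROUPS = ["LM", "IV", "KR", "EQ", "AST", "ND", "FY"]
REDUCED = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCED.get(aa, aa) for aa in seq.upper())


def kmer_hash(kmer):
    """Deterministic hash of a k-mer (assumption: MD5 read as an integer)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def select_kmers(seq, k=10, m=20):
    """Return up to m distinct reduced k-mers with the lowest hash values."""
    reduced = reduce_sequence(seq)
    kmers = {reduced[i:i + k] for i in range(len(reduced) - k + 1)}
    return sorted(kmers, key=kmer_hash)[:m]


def group_by_shared_kmer(seqs, k=10, m=20):
    """Map each selected k-mer to the names of the sequences that selected it."""
    table = defaultdict(set)
    for name, seq in seqs.items():
        for kmer in select_kmers(seq, k, m):
            table[kmer].add(name)
    # Only k-mers selected in two or more sequences propose candidate pairs.
    return {kmer: names for kmer, names in table.items() if len(names) > 1}
```

Because every sequence contributes at most m k-mers, the table grows linearly with the number of sequences, which is where the linear cost comes from. Note that these seven merges leave 12 classes for the 20 standard residues; the 13th letter is presumably the unknown residue X, but that is my guess.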
In their tests Linclust, being linear, scales much better than alternatives such as CD-HIT or UCLUST, and gives good results for identity cut-offs between 90% and 50%. This is ideal for exploring metagenomes, for example.
See you soon,
Bruno




2 May 2019

#monogram2019 (2)

Below and Above Ground Processes
Silvio Salvi, U Bologna
He talks about root ideotypes in cereals. Root traits include biomass, lodging, soil penetration, architecture, hairs and exudates, including mucilages. He shows pictures of root angles of barley cultivars from WHEALBI and now wants to confirm the phenotypes using a Morex TILLING collection (http://www.distagenomics.unibo.it/TILLMore) and mapping-by-sequencing approaches.
Vera Hecht, Julich
She talks about root traits in barley as a function of sowing density. She shows field experiments (5 replicates, Barke, Scarlett) that she published at https://www.frontiersin.org/articles/10.3389/fpls.2016.00944/full
She observed effects of density as early as 4 weeks after sowing. That should be considered when planning lab experiments.
Tom Bennet, U Leeds
Works on Arabidopsis thaliana roots, finding that root system size at the end of winter determines yield.

Phenotyping
Richard Whalley, Rothamsted
Invited talk on “Soil structure, root growth and yield in wheat”. Like V Hecht, they also take 1 m cores. His message: we can’t predict root growth in the field.
Tony Pridmore, U Nottingham
Talks about UK and international plant phenotyping initiatives from the perspective of https://www.phenomuk.net. Phenotyping requires bringing together people with different backgrounds, and it is growing rapidly in the literature. He mentions the EU program https://emphasis.plant-phenotyping.eu, which has regular calls and has driven the development of https://www.miappe.org, and is about to be followed by https://eppn2020.plant-phenotyping.eu
Simon Orford, Germplasm Resource Unit, JIC
Simon presents the GRU (John Innes Centre Germplasm Resources Unit, which manages https://www.seedstor.ac.uk) and its role producing breeder toolkits for Designing Future Wheat (DFW, http://wisplandracepillar.jic.ac.uk/toolkit.htm). In DFW they have regular meetings with breeders from several companies and visit their plots in July; it looks like a very fruitful, long-term collaboration.
Aleksander Ligeza, NIAB
Talks about designing root architecture to improve Nitrogen Use Efficiency. He shows nice figures and plots showing that N content in the soil (measured in their rhizo-tubes) affects root architecture.


Abiotic and Biotic Stress
Gustavo Slafer, Centre for Research in Agrotechnology, Lleida, Spain
After an introduction on how the literature describes plant physiology in controlled conditions, as opposed to field experiments, he talks about heat x Nitrogen penalties on yield. His team has measured that penalty interaction in several experiments with farmers with barley, maize and wheat. In all cases they observe that heat causes more damage as more N is added as fertilizer. This happens with heat being only a few more degrees of max T for a few days.
Paul Nicholson, JIC
He talks about the wheat diseases Fusarium head blight (FHB) and blast (which appeared in 1985 in Brazil). The main problem with FHB is that it accumulates toxins in the grain. He talks about alleles of semi-dwarf wheats which are linked to a glucosidase that, when silenced on a single subgenome, increases FHB resistance. More generally, it seems Fusarium takes advantage of several phytohormones and pathways which we can attenuate to increase resistance.
Blast does not affect wheat because it has Rwt3 and Rwt4 in chr 1D (Aegilops tauschii). One of them is absent in the Brazilian wheat cv. Rwt4 is a full-length NBS-ARC LRR.
Jonathan Cope, JHI
He talks about “Introgressing resilience and resource use efficiency traits from scots bere to elite barley lines (rebel)”. Bere are old Scandinavian 6row landraces thought to have been brought by Danish invaders a few centuries ago into Scotland. They perform well in Mn-poor soils and are less affected than elite lines by diseases.
Samer Amer, U Reading
His research is about the “Genetic underpinnings of wheat responses to drought”. He shows 18 years of data with average rainfall in the UK over 1000 l/m2. However, he insists that lack of rain at critical moments can have large undesired effects. He measured water availability in the field with tubes in the soil in “Kielder plots”.
Amma Simon, Rothamsted
Talks about her work on elucidating the mechanisms of aphid (R. padi) resistance in ancestral wheats, particularly in T. monococcum. Their field trials last year, which was very warm, were not successful as it seems drought culls the aphid populations.


Reproduction and Grain Development
Alison Bentley, NIAB
She talks about their work on wheat flowering time variation in the NIAB Elite MAGIC population, with 8 founders (https://www.niab.com/pages/id/402/NIAB_MAGIC_population_resources), which produced 643 lines. This is winter wheat (https://www.ncbi.nlm.nih.gov/pubmed/25237112). They have essentially measured days to particular GS developmental stages, and they validated the observations with two extra sites in Germany (long season) and Croatia (short). It is a heterogeneous dataset as it was scored over several years by several people, so they had to tailor their QTL detection methods, using both SNPs and Identity By Descent. Ppd1 is the largest effect observed, which serves as a positive control.
She also talks about a project with French partners on predicting the ideal flowering ideotypes based on GS30/GS55 for N and S France. They use genomic prediction with panels of over 400 cultivars from France. Out of 7369 MAGIC markers, they have short-listed a set of 19 markers (GIEC) for flowering time for the predictions. By using these 19 they achieve accuracies near 0.6 when predicting flowering time with cross-validation. Warning: when populations differ in allele frequencies, this must be accounted for while predicting.
Stuart Desjardins, U Leicester
He talks about “Releasing natural variation in bread wheat by modulating meiotic crossovers” and the recombination-cold regions which take up over 60% of the chromosomes in wheat and barley. They want to engineer that, so that crossovers (CO) can occur elsewhere. He mentions that diploids cannot be compared to hexaploid wheat in this respect. COs are more frequent near telomeres, presumably because that’s where chromatids start pairing in meiosis.
Adam Gauley, JIC
Talks about the flowering pathways in wheat by comparing plants grown in the UK and in Spain, which has a shorter season. They are looking at the usual suspects FT and Ppd1.
Miaoyuan Hua, U Nottingham
He talks about “Gene Networks controlling anther tapetum development in barley during early anther development”. This involves a single-copy TF in barley which has several copies in other species, presumably due to having a few extra protein domains. They confirm this by rescuing mutants in Arabidopsis thaliana.
Jemima Brinton, JIC
She talks about “A major QTL for grain weight in wheat is associated with increased grain length and cell size”. They see progressively larger weight increases as knockouts affect more subgenomes. The increase is reduced without irrigation in a dry season. This has been published at https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.14624 and https://bmcplantbiol.biomedcentral.com/articles/10.1186/s12870-018-1241-5.

Genomics and Technologies for Crop Improvement
Wilma van Esse, Wageningen
She talks about transcription factor networks that control yield in barley. She mentions Vrs genes (http://www.plantphysiol.org/content/174/4/2397) and explains that 6row barleys have about 60 grains per spike, compared to 20 for 2rows. How does this correlate with yield? To investigate that they look into the molecular and regulatory interactions between Vrs1 and Vrs5: is Vrs5 regulating Vrs1 by binding to its promoter? They have seen putative sites just by doing sequence analysis, and this happens in Arabidopsis thaliana (TB1, HOX1). Based on their Y2H data they hypothesize that Vrs1 forms heterodimers with other HOX TFs.
Gustavo asks how this would change in the field, where plants cannot tiller freely.


Laura Gardiner, IBM Research UK
She introduces us to “Using AI combined with genomics to improve crops: techniques and applications”, covering AI, machine learning and deep learning. She walks through an example where they classify circadian genes in wheat by looking at their expression values in time series with MetaCycle (https://www.ncbi.nlm.nih.gov/pubmed/27378304). First they do circadian vs non-circadian, then they increase the classes to morning, day, evening, night and non-circadian. They support proof-of-concept translatable projects to help UK industry; get in touch.
Riccardo Fusi, U Nottingham
Talks about “Deepgenes: functional characterization of genes enhancing soil exploration in maize”. He uses rice and brachy as model organisms.
Kumar Gaurav, JIC
He presents “A comprehensive and unbiased peek at the genetic diversity of wild wheat”, a quest for finding resistance genes which has been published at http://dx.doi.org/10.1038/s41587-018-0007-9. This is a reference-free GWAS analysis using presence/absence of k-mers as biallelic SNPs. They do take into account population structure. They are now upscaling their method with a bias towards resistance genes and mapping the k-mers to a reference genome. The bait sequences are available at https://github.com/steuernb/AgRenSeq
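As a rough illustration of what a k-mer presence/absence matrix looks like, here is a small Python sketch. It is not the authors' pipeline: the function names, the use of canonical k-mers and the removal of k-mers shared by all accessions are my own assumptions, and the population-structure correction they mention is not attempted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def sample_kmers(sequences, k):
    """Collect the canonical k-mers found in the reads or contigs of one accession."""
    found = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            found.add(canonical(seq[i:i + k]))
    return found


def presence_absence(samples, k):
    """Build a 0/1 matrix of variable k-mers; samples must not be empty.

    samples maps an accession name to a list of sequences. K-mers present in
    every accession are dropped, as they cannot be associated with a phenotype.
    """
    per_sample = {name: sample_kmers(seqs, k) for name, seqs in samples.items()}
    all_kmers = set().union(*per_sample.values())
    shared = set.intersection(*per_sample.values())
    variable = sorted(all_kmers - shared)
    matrix = {
        name: [int(kmer in found) for kmer in variable]
        for name, found in per_sample.items()
    }
    return variable, matrix
```

Each column of the matrix can then be tested against the phenotype like a biallelic marker, with no reference genome needed until the associated k-mers are mapped.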
Michael Hammond-Kosack, Rothamsted
Talks about the wheat promoteome, as part of a DEFRA-funded network (http://www.wgin.org.uk). They want to identify SNPs, indels and CREs linked to phenotypes among wheat cultivars. They use RNA microarrays as these bind DNA more strongly than DNA chips. They have 1.6 kb promoters for 95 wheat cultivars and related species, from diploid to hexaploid. After mapping probes they reconstruct promoter haplotypes and find fewer than expected. After mapping they confidently discover indels (4-1kb) in some lines. They also annotate TF binding sites by looking for exact matches from a library.
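Annotating binding sites by exact matching can be sketched in a few lines of Python. The motif library and function names below are made up for illustration and are not the ones used in the talk; I assume both strands are scanned and that palindromic motifs are reported only once.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(promoter, library):
    """Return sorted (position, strand, name) tuples for every exact motif match.

    Positions are 0-based and refer to the forward strand of the promoter.
    """
    promoter = promoter.upper()
    hits = []
    for name, motif in library.items():
        patterns = [("+", motif)]
        if reverse_complement(motif) != motif:
            # Palindromic motifs would match twice at the same place; skip those.
            patterns.append(("-", reverse_complement(motif)))
        for strand, pattern in patterns:
            start = promoter.find(pattern)
            while start != -1:
                hits.append((start, strand, name))
                start = promoter.find(pattern, start + 1)
    return sorted(hits)


# Illustrative library only; a real one would hold many more motifs.
EXAMPLE_LIBRARY = {"G-box": "CACGTG", "W-box": "TTGACC"}
```

Exact matching is fast and easy to interpret, but it misses degenerate sites that position weight matrices would find.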
Debbie Harding, BBSRC
She talks about BBSRC Strategy and Funding Opportunities.
Sophie Harrington, JIC
She delivers a talk as the winner of the student contest. She talks about the role of NAC TFs in wheat senescence, a process in which the leaves’ nutrients are consumed by the spikes to fill the grains. This work follows early findings by C Uauy (https://www.ncbi.nlm.nih.gov/pubmed/17124321). She has found a syntenic cluster on chr2, conserved in barley, with several dimeric NAC3 factors, some of which are highly expressed in the senescent tissues. Double Kronos EMS mutants of A & B subgenomes delay senescence. She uses https://bioconductor.org/packages/release/bioc/html/GENIE3.html to infer regulatory networks. Her preprints are https://www.biorxiv.org/content/10.1101/456749v1.abstract and https://www.biorxiv.org/content/10.1101/573881v1.abstract
Ricardo Ramírez-González, JIC
Summarizes his career from a stock analyst in México to wheat postdoctoral researcher at the JIC. He then explains how his http://wheat-expression.com site manages triad expression values (triangles) from the 3 subgenomes. He shows that most balanced homeologous genes are expressed in most tissues (stable), with tissue-specific genes being more unbalanced (dynamic).
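To see how a triad ends up as a point in a triangle, here is a minimal Python sketch. It is only my reading of the idea, not the code behind the site: the function names and the 0.1 distance cutoff are hypothetical.

```python
import math

BALANCED = (1 / 3, 1 / 3, 1 / 3)  # centre of the triangle


def triad_proportions(a, b, d):
    """Scale the expression of the A, B and D homeologs so that they sum to 1."""
    total = a + b + d
    if total == 0:
        return None  # triad not expressed, no position in the triangle
    return (a / total, b / total, d / total)


def distance_to_balanced(proportions):
    """Euclidean distance from a triad to the perfectly balanced point."""
    return math.dist(proportions, BALANCED)


def is_balanced(a, b, d, max_distance=0.1):
    """Call a triad balanced if it lies close to the centre (hypothetical cutoff)."""
    proportions = triad_proportions(a, b, d)
    if proportions is None:
        return False
    return distance_to_balanced(proportions) <= max_distance
```

The three proportions are the coordinates of the triad in a ternary plot: the corners correspond to a single dominant subgenome and the centre to equal contributions from all three.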




Quality and Nutrition
Mark Loosely, JHI
Talks about “Improved malting quality in winter barley”. Starts by explaining the 3 steps of malting: degradation of cell walls, protein modification, and activation of enzymes that convert starch into sugars. They carry out GWAS with both spring and winter barleys treated with fungicides. They find association peaks for alleles which have already been fixed in the elite barleys, but also some still at low frequencies. They find 3 promising QTLs, apparently with clusters of genes with similar functions, and later created a mapping population to fine-map one QTL in 3H. Finally, they are doing RNAseq of a few NIL recombinant lines to double-check the expression of genes in the QTL.
Katherine Steele, Bangor U
She talks about improving phosphorus efficiency in barley. Currently P is mined to produce fertilizers, but that supply might not last more than 100 years, and this is the P source for plants. P in the seed, mostly in the form of phytic acid (PA) in the aleurone layer, is the only source for germination, for the first 4 weeks. PA has high affinity for dietary minerals and thus it would be desirable to reduce its content in the grain without affecting germination and seedling vigour. Thus, they are looking at barley mutants with low tissue-specific PA content (lpa).
Anna Gordon, NIAB
She talks about ergot alkaloids, which contaminate grain and induce convulsions, hallucinations, vasoconstriction and sickness. These are produced by fungi and get into the grain during filling in barley, wheat and rye. There are EC limits for these compounds in cereals.


Rice and other Grasses
Adam Price, U Aberdeen
He talks about their long-term work on QTL analysis with rice (see https://www.sciencedirect.com/science/article/pii/S1360138506000793 or https://www.nature.com/articles/ncomms1467 for an example with GxE, with association peaks appearing in different loci depending on the environments and populations). Recently they have been working with the Bengal and Assam aus panel (BAAP), which has been a donor for many relevant alleles to elite cultivars.
Luke Dunning, U Sheffield
Talks about lateral gene transfer (LGT) throughout the grass family, which he claims is recurrent and adds functional genes but might also have a cost, due to TEs being acquired too. He has published this recently at https://onlinelibrary.wiley.com/doi/full/10.1111/evo.13250 and also at https://www.pnas.org/content/116/10/4416.abstract and starts with an example of a gene transferred with help from an aphid. He then walks us through a few LGT genes in Alloteropsis among geographically close species. When they look at Au and Africa genomes altogether they see a clear accumulation of LGT genes over time, which is not syntenic (non-homologous recombination and transposition involved perhaps). Overall they have evidence of this happening in many but not all species and crops, such as barley. They report 5 in Brachypodium distachyon. In other species they have seen LGT fragments harboring up to 10 genes. He has not looked at codon usage biases.
Mario Caccamo, NIAB
Talks about an analysis of a large collection of Vietnamese native rice lines, which revealed novel genomic variants. This project also entailed creating the relevant bioinformatics pipelines and training local researchers. He explains in some detail their variant calling using FreeBayes and LD thinning to obtain just under a million variants.
Mark Quinton-Tulloch, U Bangor
He talks about his genome-wide identification of resistance genes and their variation in indica rice. He uses KASP-driven genomic selection. He explains NBS-LRR genes and the TIR and CC families. He used https://github.com/steuernb/NLR-Parser to obtain motifs and call NBLRs. He acknowledges that you get both genes and pseudogenes.