30 April 2019

#monogram2019


Hi, these are my notes on the Cereal Bioinformatics Session, plus the keynote by Keith Edwards, at Monogram 2019.

The rest of the notes are at https://bioinfoperl.blogspot.com/2019/05/monogram19-2.html?m=1


Cristobal Uauy, JIC  
Speaks about http://wheat-expression.com and explains the different references, from TGAC to RefSeq v1.0 (with 01 in the middle of gene names) and v1.1 (02 instead, as in TraesCS3D02G273600, used in http://plants.ensembl.org/Triticum_aestivum/Info/Index). He asks users to cite the papers, not just the website. He also mentions the gene expression browser http://bar.utoronto.ca/efp_wheat/cgi-bin/efpWeb.cgi , http://www.polymarker.info to design polyploid-aware primers, and the in silico wheat TILLING resource integrated in Ensembl Plants (http://www.wheat-tilling.com is a legacy site based on previous gene models, but still useful in some cases). He wraps up by describing http://www.wheat-training.com , which links out to all resources and to wheat populations as well.
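As a quick illustration of the naming convention, the two-digit field after the chromosome name can tell you which annotation a gene identifier comes from. This is only a sketch: it assumes chromosome-anchored identifiers shaped like the example in the notes (TraesCS, a two-character chromosome name, 01 or 02, G, a number), and the variable names are made up.

```shell
# Example identifier taken from the notes
gene_id="TraesCS3D02G273600"

# Assumption: IDs look like TraesCS<chr><01|02>G<number>,
# where <chr> is a two-character chromosome name such as 3D
case "$gene_id" in
  TraesCS??01G*) annotation="RefSeq v1.0" ;;
  TraesCS??02G*) annotation="RefSeq v1.1" ;;
  *)             annotation="unknown" ;;
esac

echo "$gene_id -> $annotation"
```

Identifiers that do not follow this shape simply fall through to "unknown".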

Guy Naamati, EMBL-EBI
Describes the RefSeq v1.0 assembly with the v1.1 gene annotation in Ensembl Plants, the updated marker display (http://plants.ensembl.org/Triticum_aestivum/Variation/Explore?r=4A:714193214-714194214;v=BA00249348;vdb=variation;vf=194242) and the linked SIFT predictions. He summarizes the outcome of the ensembl4breeders event (see table in poster: https://twitter.com/ensemblgenomes/status/1098902364998782976), and singles out pangenomes, with the wheat test case as a prototype for developing them within Ensembl. He finishes by advertising the upcoming Plant Genomes in a Changing Environment conference in October 2019 (https://coursesandconferences.wellcomegenomecampus.org/our-events/plant-genomes-2019).




Leif Skot, IBERS
He talks about breeding targets in perennial ryegrass (Lolium perenne) and genomic predictions based on a breeding experiment that has been running for 50 years, with linear gains in biomass and yield and no signs of inbreeding depression yet. A physically anchored genome assembly seems to be under way (https://gtr.ukri.org/projects?ref=BB%2FG012342%2F1) but is not ready yet; synteny-based (https://www.ncbi.nlm.nih.gov/pubmed/26408275) and de novo (https://link.springer.com/chapter/10.1007/978-3-319-28932-8_19) assemblies are available, though.

Craig Simpson, JHI
They are using Salmon/kallisto to quantify barley transcriptomes, knowing that current barley gene models are still poor. Their aim is also to build a reference transcriptome (BarRTv1) with https://ccb.jhu.edu/software/stringtie guided by the Morex assembly. They analyze 11 RNA-seq datasets, with over 800 Illumina samples. They filter out low-expression transcripts (less than 0.3 TPM) and use gmap to map back to the Morex reference. They try to validate their expression values with RT-PCR and realize how difficult it is to map multiple isoforms to a single PCR read. By correlating with RT-PCR results they defined optimal StringTie params: -c 2.5 - 50 -f 0, yielding over 60K genes and 177K transcripts, fewer than the original, which have been imported into a database by Linda Milne. They plan to do BarRTv2 with PacBio Iso-seq. He says there are many genes with a single dominant isoform, but also many others with 2-4 dominant isoforms, which would be nice to annotate in resources such as Ensembl. These data are still unpublished.
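The low-expression filter can be sketched with awk on a Salmon-style quant.sf table (columns Name, Length, EffectiveLength, TPM, NumReads). The transcript names and values below are invented, and applying the 0.3 TPM cutoff to a single sample is an assumption: the talk did not detail how the threshold was applied across samples.

```shell
# Toy quant.sf-like table; transcript names and values are hypothetical
quant=$(printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
        printf 'tx1\t1500\t1350.0\t12.4\t310\n'
        printf 'tx2\t900\t750.0\t0.1\t2\n'
        printf 'tx3\t2100\t1950.0\t0.3\t11\n')

# Keep the header plus transcripts with TPM (column 4) of at least 0.3
kept=$(printf '%s\n' "$quant" | awk -F'\t' 'NR == 1 || $4 >= 0.3')
printf '%s\n' "$kept"
```

Here tx2 (0.1 TPM) is dropped, while tx1 and tx3 are kept.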

Kumar Gaurav, JIC
He talks about wild parents of wheat and their recent R-gene enrichment sequencing work to show they contain useful disease resistance genes. They belong to the Open Wild Wheat consortium, and have sequenced 260 Aegilops tauschii individuals with 10-30x coverage (10Tb, available under the Toronto agreement, seeds from JRU, JIC). They are performing diversity studies and mention wheat lineages 1 and 2.

Anthony Hall, JIC
He talks about a pan-genome of wheat elite cultivars as a way to gain access to hidden variability (SV, TE, promoters). This is the 10+ project, with NRGene RefSeq and W2RAP (https://github.com/bioinfologics/w2rap) assemblies. They know they are not covering all the wheat variability out there. The assemblies are ready and they are now finishing the annotations, with both de novo and validated gene models, using a pan-transcriptome. A BLAST server is already available at https://webblast.ipk-gatersleben.de/wheat_ten_genomes .

Micha Bayer, JHI
He talks about the barley variome sampled from exome capture of 823 barley genotypes, covering mainly SNPs and small indels. He discusses the depth vs breadth dilemma when managing diversity in germplasm. Their cultivars come from WHEALBI, EXCAP, B1K Israel, WBDC and other projects. Less than 5% of their final variants come from exons, with most coming from introns and UTRs. 96% are off-target variants with low read depth but sufficient calling quality. Population-level analyses distinguish wild and cultivated barleys, with low recombination around centromeres. They use SnpEff and are looking at fixed loss-of-function alleles in domesticated barleys. He mentions that 20-30% of reads do not map to the reference with max 4% mismatches.

After his talk, there’s a discussion on how to name genes in the context of pan-genomes. Cristobal says the role of Ensembl will be critical in this context.

Sebastian Raubach, JHI
Talks about Germinate v3 (https://ics.hutton.ac.uk/get-germinate), a one-stop database schema for plant genetic resources, with powerful visualizations. It supports BrAPI, Multi-Crop Passport Descriptors (MCPD) and Dublin Core Metadata Initiative (DCMI). It comprises 3 modules: Scan (bar codes), Data Import and Germinate. It is used by 100+ groups working on different crops around the world, including wheat and maize at CIMMYT. Data can be exported to Helium, Flapjack, R, Excel, BrAPI and Google Maps. It supports custom, restricted data access.

Keywan Hassani-Pak, Rothamsted
He talks about KnetMiner3.0 (http://knetminer.rothamsted.ac.uk) and does a quick 5-minute off-line demo. He showcases the evidence view, which is an enrichment analysis, and the keyword search to get more specific search results.
He then gives a DFW progress report on behalf of Rob Davey (EI), including https://grassroots.tools, which is about making data publicly available, and http://cyverseuk.org.

Paul Wilkinson, U Bristol
He talks about http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/indexNEW.php, built with Perl and PHP on top of a database. He focuses specifically on the most recently added features, including a QTL database made in collaboration with the JIC and EI (which links out to Ensembl Plants), online dendrograms (http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/35K_dendrogram.php) and an introgression plotter. The latter will become available soon and allows visualizing genomic regions introgressed in crosses. It produces nice circular plots and heatmaps.

Mario Caccamo, NIAB
He starts by talking about http://wheatis.org, which is part of the Wheat Initiative launched in 2011. There are 5 nodes across the EU and US, including Ensembl Plants and Gramene. He then moves on to the recent work of a group of experts on wheat gene nomenclature with the Wheat Gene Catalogue https://shigen.nig.ac.jp/wheat/komugi/genes/symbolClassList.jsp
Roughly 10-15% of the loci in the catalogue correspond to current gene models, not always in a 1-to-1 relation.

Kim Hammond-Kosack, Rothamsted
She talks about PHI-base (http://www.phi-base.org) on pathogen-host interactions. Hosts are plants half of the time, not only cereals. Its main use is to look up mutant-phenotype relationships. They use a scale of 9 phenotypes, including negative results. They have a tool (PHI-canto) to allow users to annotate their own results with controlled vocabularies. It complies with FAIR Data principles.

Keith Edwards, U Bristol (plenary talk)
He talks about the genomic challenges in wheat and how we are discovering the actual diversity of wheats thanks to marker technologies. This is in contrast to what was thought earlier, that they lacked variability. Today they can scan 98K KASP markers in 1 day and we now know that this species, despite being only 10K yr old and having gone through 1-2 hybridizations, has massive diversity. This is probably due to hybrid swarms, populations of hybrids that interbreed and backcross with their parents (diploids & tetraploids). He shows two examples of extensive introgressions in chromosomes of elite cultivar Cadenza and two ancient wheats: Watkins 199 and Chaff 1790 from Rothamsted. He concludes that the variation was already there 10K yr ago, is not new, and that there is forced gene flow between wheat and its parents and close species, mostly the tetraploids. These introgressed regions do not usually recombine, as they are too divergent (over 0.5%), and impose a LEGO-like genome, with recombination restricted to certain windows.

Source: http://www.earlham.ac.uk/articles/earlham-institute-lego-sequencer

16 April 2019

mapping coordinates between assemblies

Hi,
a task you often run into when working in genomics is translating coordinates from one version of a genome to the equivalent coordinates in another version.

Example of mapping between two versions or assemblies of the same genome, taken from http://ensemblgenomes.org/info/data/assembly_mapping.

The figure summarizes the problem. It is important to realize that some elements in version n may be absent from version n+1, or vice versa. With that caveat, coordinates can easily be translated in Ensembl using the REST interface: https://rest.ensembl.org/documentation/info/assembly_map

For example, to convert the fragment 1-10 of barley chromosome 1 from the 2012 version (ASM32608v1) to the 2017 one (IBSC_v2), we could do:

$ wget -q --header='Content-type:application/json' \
  'https://rest.ensembl.org/map/hordeum_vulgare/ASM32608v1/1:1..10/IBSC_v2?'  -O -

We get the results in JSON format:

{"mappings":[{"original":{"end":10,"seq_region_name":"1","assembly":"ASM32608v1","start":1,
"strand":1,"coord_system":"chromosome"},"mapped":{"start":268488,"assembly":"IBSC_v2",
"seq_region_name":"chr1H","end":268497,"coord_system":"chromosome","strand":1}}]}

You can find the names of the supported assembly versions in the Tools->AssemblyConverter tool (for example in human, plants),

see you soon,
Bruno



22 March 2019

one-liner in utf8

Hi,
today I only want to share a one-liner, or perlito as my colleague Pablo Vinuesa calls them, which prints a file in UTF-8 encoding, not ASCII, as required by pandoc when compiling the BibTeX bibliography of a markdown document:

$ perl -lne 'BEGIN{binmode(STDOUT, ":utf8")} next if(/^%/); s/\\htmladdnormallink\{\S+?\}\{(\S+)?\}/$1/; print' bib_myarticles.bib > bib_myarticles_md.bib

See you later,
Bruno

5 March 2019

Merging several sorted files (sort merge)

Hi,
in past posts, such as this one from 2010, we already discussed how to sort tabular BLAST results with GNU sort. Meanwhile sort has kept evolving, and current versions, for instance the one that ships with Ubuntu 18, can now distribute the work across several threads:

--parallel=N    change the number of sorts run concurrently to N

However, when we want to globally sort numeric tables spread over several internally sorted files, we still use an option that was already available in 2010:

-m, --merge     merge already sorted files; do not sort

Even though it skips sorting each file internally, this task becomes very slow when the number of files is large. In this example we sort 100 tab-separated (TSV) files by numeric columns 1 and 11:

time sort -S500M -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 20m15.793s
user 19m46.045s
sys 0m18.724s

The key to speeding it up is to avoid temporary files by reading from all 100 files at once (by default GNU sort merges at most 16 inputs at a time and writes intermediate results to temporary files):

time sort --batch-size=100 -S500M -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 13m49.672s
user 13m29.370s
sys 0m9.972s
 
This takes roughly a third less time. In my tests it was not worth increasing the number of concurrent threads and, as the next example shows, raising the process memory from 500 MB to 5 GB does not help either:

time sort --batch-size=100 -S5G -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 14m9.339s
user 13m49.315s
sys 0m10.705s
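
For anyone who wants to try the merge on a small scale, here is a tiny reproducible version. It assumes GNU sort; the file names and contents are toy data, and only one numeric key is used instead of the two in the timings above.

```shell
tmpdir=$(mktemp -d)

# Two toy TSV files, each already sorted numerically by column 1
printf '1\tqueryA\n3\tqueryC\n' > "$tmpdir/part1.sorted"
printf '2\tqueryB\n4\tqueryD\n' > "$tmpdir/part2.sorted"

# -m merges without re-sorting; --batch-size sets how many inputs are merged at once
merged=$(sort --batch-size=100 -s -k1,1g -m "$tmpdir"/part*.sorted)
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

With only two inputs the batch size makes no difference; it starts to matter once there are more input files than sort merges at once by default.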

See you later,
Bruno

26 February 2019

StructMAn: functional impact of non-synonymous mutations based on 3D structure

Hi,
I have just listened to Olga Kalinina at the Sanger Institute talking about how to analyze the potential impact of non-synonymous mutations in proteins using
https://structman.mpi-inf.mpg.de

Source: https://academic.oup.com/nar/article/44/W1/W463/2499349
Another interesting paper is https://www.nature.com/articles/oncsis201779

It is a "simple predictor", in her own words, that classifies each position in the sequence as a molecular interaction site (with other proteins, ligands or DNA) or as a core site (as opposed to a surface site, according to its solvent-exposed area). To do so it maps the sequence onto PDB structures, or onto all possible homology models with sequence identity >= 35%, and then computes the ΔΔG of the mutation with FoldX (on the order of seconds per mutation). Finally, a random forest predictor combines structure and sequence features to predict whether or not there is a functional impact.

They trained their predictors with data from ClinVar (mostly cancer-related) and the human proteins in UniProt, and obtain accuracies of around 80%. Interestingly, one of the features that correlates negatively with functional impact is residue disorder.
When I ask her about this, she tells me they are currently looking at mutants that affect splicing and are finding that these tend to lie in disordered regions,
see you soon,
Bruno