January 15, 2023

Notes on Plant and Animal Genomes conference #PAG30 (I)

Over the next few days I will be sharing my notes on some of the PAG talks I attend.


PAG 30 - January 13-18, 2023 - Plant & Animal Genome Conference

Saturday, January 14, 2023

This was my first day; you can tell how jet-lagged I was, as I had to skip the last session.

Mohamed Zouine, Institut National Polytechnique de Toulouse: when reporting on the quality of a genome assembly it might be useful to use the Assembly Standards of the Earth BioGenome Project, which can be found at https://www.earthbiogenome.org/assembly-standards

Adam William Schoen, University of Maryland: his group uses MutMap (https://pubmed.ncbi.nlm.nih.gov/22267009/) to find candidate genes for mutant phenotypes. This requires sequencing a single sample of pooled mutant DNA and computing a SNP index; candidate regions have index values close to 1. For instance, the mutant Tin3 has only one tiller in T. monococcum, but longer spikes, and the candidate gene has one SNP in the first position of intron 1, which is then retained.
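To make the SNP index idea concrete, here is a minimal sketch in Python, assuming the index at each site is simply the fraction of pooled reads carrying the mutant allele. The `Site` record, the chromosome names, the read counts and both thresholds are hypothetical values I made up for illustration; they are not taken from the talk or from the MutMap paper.

```python
from typing import NamedTuple


class Site(NamedTuple):
    chrom: str      # hypothetical chromosome name
    pos: int        # 1-based position
    ref_reads: int  # reads supporting the reference (wild-type) allele
    alt_reads: int  # reads supporting the mutant allele


def snp_index(site: Site) -> float:
    """Fraction of reads at this site that carry the mutant allele."""
    depth = site.ref_reads + site.alt_reads
    if depth == 0:
        raise ValueError("site has no coverage")
    return site.alt_reads / depth


def candidate_sites(sites, min_index=0.9, min_depth=8):
    """Keep well-covered sites whose SNP index is close to 1.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return [
        s for s in sites
        if s.ref_reads + s.alt_reads >= min_depth and snp_index(s) >= min_index
    ]


# Toy pooled-read counts (made up)
sites = [
    Site("chr3A", 1000, 10, 9),   # index ~0.47, unlinked
    Site("chr3A", 2000, 1, 19),   # index 0.95, candidate
    Site("chr3A", 3000, 0, 20),   # index 1.0, candidate
    Site("chr5A", 4000, 2, 2),    # too shallow to call
]

for s in candidate_sites(sites):
    print(s.chrom, s.pos, round(snp_index(s), 2))
```

In a real analysis the counts would come from a variant caller, and the index is usually smoothed over sliding windows before looking for peaks.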

Matthias Heuberger, University of Zurich: they have a list of genes that seem functional and are expressed in Triticum monococcum centromeres.

Maria Alejandra Alvarez, University of California, Davis & HHMI: she presents data on ELF3 mutants to show that the photoperiodic response in wheat does not depend on Ppd1: https://www.biorxiv.org/content/10.1101/2022.10.11.511815v1

Martin Mascher, IPK: Genus-wide pan-genomics. Hordeum bulbosum is ~Myr of evolution away from H. vulgare and can be crossed with it. It is a perennial outcrosser and the closest wild relative. They are using Hi-C data to assemble separate haplotypes; they typically use Hi-C to check the quality of assemblies. They are now trying pollen sequencing to separate the parental genomes of a heterozygous individual. There are larger non-recombining pericentromeric regions in bulbosum than in vulgare. They used a modified, as yet unpublished, TRITEX protocol, which has been used for other species as well. Worth saying that these results are actually from a PhD student who could not attend the meeting because his visa took too long; unbelievable that this could happen. [HiFi sequencing of barley about 10K per genotype, plus 5K for Hi-C].

Einar Baldvin Haraldsson, Heinrich Heine Universität Düsseldorf. He is comparing annual barley to the perennial Hordeum erectifolium, native to Argentina, which rolls its leaves to keep them erect under drought. IsoSeq with 22 different samples/tissues -> 38K loci, 27K protein coding. He collaborates with M. Mascher on the Pan-Hordeum project (24 species). According to his crosses, perenniality seems to be dominant over annuality.

Eduard Akhunov, Kansas State Univ. He talks about multiplex CRISPR-editing strategies to engineer gene regulation in wheat. Precision is key in multigene families, in order to target the desired copy. He is currently working on deleting regions within the promoter of the Q gene (an AP2 TF) and also on silencing clusters of gliadins. He mentions that they needed to resequence the target promoter in the target cultivar, as the CS (Chinese Spring) reference was too divergent to guide their CRISPR constructs. They do both multiplex amplicon sequencing and WGS to validate their editing experiments, and they have seen no off-targets yet. For the gliadins they used the long-read assembly of cv. Fielder to design gRNAs. They confirmed deletions by WGS and also scored the protein quality of the edited wheats, with negligible impact on bread making. He says that crosses and field trials are the only way to evaluate the impact of edits in the long term.

William Marande, INRAE – CNRGV: presents the 4 Gbp (2 Gbp haploid) Vanilla planifolia genome project. Vanilla tissues undergo strict partial endoreplication, which means that the C content of cells varies, mixing 2C nuclei with fractions of cells with higher (4, 8, 16) C contents. So they had to optimize and choose the best starting tissue, which turns out to be axillary buds, rich in 2C nuclei. When they do K-mer analysis they find an unusual pattern with 3 peaks, as a result of non- and endoreplicated cell populations [as opposed to the 2 peaks expected for a heterozygous diploid]. See https://pubmed.ncbi.nlm.nih.gov/35617961 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5546068.
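For readers who have not run a K-mer analysis before, this is roughly what sits behind those peaks: count how often each K-mer occurs in the reads, then histogram the counts. The sketch below is a toy Python version under simplifying assumptions (single strand only, no canonical K-mers, everything in memory); real projects use dedicated K-mer counters, and the reads here are made up.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}.

    Simplified on purpose: counts k-mers on the given strand only and
    skips any k-mer containing an ambiguous base (N).
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return Counter(counts.values())


# Toy reads (made up); a real spectrum needs whole-genome coverage
reads = ["ACGTAC", "ACGTAC", "CGTACG"]
spectrum = kmer_spectrum(reads, k=4)

for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Plotting multiplicity against the number of distinct K-mers gives the profile in which peaks are counted.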

Sandra Smit, Bioinformatics Group, Wageningen University and Research. Talks about pangenomes (lossless compressed, integrated, interrogable, scalable, a moving target), which she models with the software PanTools, now at version 4.1 and more scalable (hundreds of genomes). It can handle, for instance, all Solanaceae. The latest paper is from 2022 and can be found at https://academic.oup.com/bioinformatics/article/38/18/4403/6647839. PanTools is now ploidy-aware and can work at the sequence level as well. They have produced a nice pangenome browser called PanVA: https://www.techrxiv.org/articles/preprint/PanVA_Variant_Analysis_within_Pangenomes/21572433

Wanfang Fu, Department of Plant and Environmental Sciences, Clemson University. She teaches us about extrachromosomal circular DNAs (eccDNAs) in plant genomes. They have been proposed to transmit glyphosate resistance in Amaranthus and behave like independent replicons: https://pubmed.ncbi.nlm.nih.gov/36103503, https://academic.oup.com/plcell/article/32/7/2132/6115633. In her work they use the CIDER-Seq protocol to identify them (https://www.nature.com/articles/s41596-020-0301-0), and they found a variable number of eccDNAs (in the thousands) in different blackgrass (Alopecurus myosuroides, a weed of wheat) accessions.

Justin N Vaughn, USDA-ARS/University of Georgia, explains how to genotype complex loci at low cost with variation/pangenome graphs. Mentions phantom SNPs. They are using https://github.com/USDA-ARS-GBRU/PanPipes to build low-cost graphs with Skim-seq (https://www.nature.com/articles/s41598-022-19858-2). He asks whether we need cyclic graphs to capture recombination events. These ideas were applied to melon fungal resistance in https://pubmed.ncbi.nlm.nih.gov/36550124

Isabelle M. Henry, UC Davis. Clonally propagated species, such as spearmint (Mentha spicata, allotetraploid), are challenging for genomics. She shows nice plots of coverage vs chromosome position to highlight regions where there are several haplotypes (higher coverage).
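This kind of plot boils down to averaging read depth in windows along each chromosome and looking for windows that stand out. A minimal Python sketch follows, assuming you already have per-base depths for one chromosome (for instance exported from a tool such as samtools depth). The window size, the 1.5x cutoff and the depth values are all assumptions of mine for illustration, not what was shown in the talk.

```python
from statistics import median


def window_means(depths, window):
    """Mean depth in consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def high_coverage_windows(means, factor=1.5):
    """Indices of windows whose mean depth is >= factor x the median window.

    The factor is an illustrative assumption, not a calibrated threshold.
    """
    baseline = median(means)
    return [i for i, m in enumerate(means) if m >= factor * baseline]


# Toy per-base depths (made up): a stretch with doubled coverage
depths = [10] * 6 + [20] * 3 + [10] * 3
means = window_means(depths, window=3)

print(means)
print(high_coverage_windows(means))
```

Windows flagged this way are only candidates; repeats and mapping artefacts can inflate depth too.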

 

January 12, 2023

The Protein Data Bank reaches the 200K milestone

Hi,

with the PAG30 conference about to begin, we start the year with some good news.

Here is the official announcement from the Protein Data Bank:

Date: Wed, 11 Jan 2023 09:55:41 -0500
Subject: pdb-l: PDB Reaches a New Milestone: 200,000+ Entries

With this week's update, the PDB archive contains a record 200,069 
entries. The archive passed 150,000 structures in 2019 and 
100,000 structures in 2014. 

Established in 1971, this central, public archive has reached this 
critical milestone thanks to the efforts of structural biologists 
throughout the world who contribute their experimentally-determined 
protein and nucleic acid structure data.

wwPDB data centers support online access to three-dimensional structures 
of biological macromolecules that help researchers understand many 
facets of biomedicine, agriculture, and ecology, from protein synthesis 
to health and disease to biological energy. Many milestones have been 
reached since the archive released the 100,000th structure in 2014. PDB 
data have been seminal in understanding SARS-CoV-2, and provided the 
foundation for the development of AI/ML techniques for predicting 
protein structure. The 50th anniversary of the PDB was celebrated 
throughout 2021 <https://www.wwpdb.org/pdb50>.

Today, the archive is quite large, containing more than 3,000,000 files 
related to these PDB entries that require more than 1086 Gbytes of 
storage. PDB structures contain more than 1.8 billion non-hydrogen atoms.


Function follows form
In the 1950s, scientists had their first direct look at the structures 
of proteins and DNA at the atomic level. Determination of these early 
three-dimensional structures by X-ray crystallography ushered in a new 
era in biology, one driven by the intimate link between form and 
biological function. As the value of archiving and sharing these data 
were quickly recognized by the scientific community, the Protein Data 
Bank (PDB) was established as the first open access digital resource in 
all of biology by an international collaboration in 1971 with data 
centers located in the US and the UK.

Among the first structures deposited in the PDB were those of myoglobin 
and hemoglobin, two oxygen-binding molecules whose structures were 
elucidated by Chemistry Nobel Laureates John Kendrew and Max Perutz. 
With this week's regular update, the PDB welcomes 266 new structures 
into the archive. These structures join others vital to drug discovery, 
bioinformatics and education.

The PDB is growing rapidly, increasing in size ~13% since 2011. In 2022, 
an average of 275 new structures were released to the scientific 
community each week. The resource is accessed hundreds of millions of 
times annually by researchers, students, and educators intent on 
exploring how different proteins are related to one another, to clarify 
fundamental biological mechanisms and discover new medicines.

Twenty Years of Collaboration
Since its inception, the PDB has been a community-driven enterprise, 
evolving into a mission critical international resource for biological 
research. The wwPDB partnership was established in July 2003 with PDBe, 
PDBj, and RCSB PDB. Today, the collaboration includes partners BMRB 
(joined in 2006) and EMDB (2021).

The wwPDB ensures that these valuable PDB data are securely stored, 
expertly managed, and made freely available for the benefit of 
scientists and educators around the globe. wwPDB data centers work 
closely with community experts to define deposition and annotation 
policies, resolve data representation issues, and implement community 
validation standards. In addition, the wwPDB works to raise the profile 
of structural biology with increasingly broad audiences.

Each structure submitted to the archive is carefully curated by wwPDB 
staff before release. New depositions are checked and enhanced with 
value-added annotations and linked with other important biological data 
to ensure that PDB structures are discoverable and interpretable by 
users with a wide range of backgrounds and interests.

wwPDB eagerly awaits the next 100,000 structures and the invaluable 
knowledge these new data will bring.

See you soon,

Bruno

December 23, 2022

RSAT::Plants updated (Dec2022)

Hi, 

if you use the Plants server of the Regulatory Sequence Analysis Tools (RSAT), you might want to know that it has just been updated. Here's a short summary of the changes:

  • The updated URL is https://rsat.eead.csic.es/plants
  • It now supports HTTPS connections powered by certbot
  • It now uses the source code at https://github.com/rsa-tools/rsat-code (I have updated some documentation along the way)
  • Nine new species have been imported from Ensembl Plants: Lolium perenne, Brassica juncea, Echinochloa crusgalli, Digitaria exilis, Vigna unguiculata, Brassica rapa ro18, Corylus avellana, Ficus carica, Lactuca sativa
  • One species renamed: Physcomitrium patens
  • Three species updated with a new assembly: Vitis vinifera, Triticum urartu, sunflower
  • This brings the total number of supported assemblies to 100; you can see their stats at https://rsat.eead.csic.es/plants/data/stats
  • Most species now correspond to release 55 of Ensembl Plants, but note that the sequence data are unchanged in many cases. For instance, Hordeum_vulgare.MorexV3_pseudomolecules_assembly.52 becomes Hordeum_vulgare.MorexV3_pseudomolecules_assembly.55, but the sequence is exactly the same.

  


Have a nice break,

Bruno


December 21, 2022

Mapping onto a rice pangenome with minigraph

Hi,

a few months ago I built a pangenome from 15 rice genomes obtained from Ensembl Plants. For this I tried the software minigraph, described in https://doi.org/10.1186/s13059-020-02168-z, which is one of the tools available for building a genome graph, in this case by adding insertions and deletions to the reference genome (orange in the figure).

 

Figure. Mapping reads to a pangenome graph; figure taken from https://doi.org/10.1186/s13059-020-02168-z

Today I wanted to summarise here how it is done, in case it helps someone.

The first step is to build the graph from several individual genomes, of rice in this case. You need to start from FASTA files in which every chromosome has a unique name. This can be achieved, for instance, by appending the identifier or accession of each genome to the original chromosome name:

# 1) prepare genome FASTA files, making sure chr names are unique
mkdir -p fasta
while read -r core; do
	echo "$core"
	# the accession ($acc) is parsed from the input file name and
	# appended to every sequence name in the FASTA headers
	perl -lne 'BEGIN{ if($ARGV[0] =~ /sativa_([^_]+)/){ $acc=$1 }} if(/^>(\S+)/){ print ">$1_$acc" } else {print}' "${core}.fna" > "${core}.uniq.fna"
done < ../liftover/list_cores.txt

# build the graph
bsub -M 40G -n 10 -cwd soft/minigraph/minigraph -xggs -t 10 oryza_sativa_core_48_101_7.fna fasta/*.fna -o oryza_sativa.gfa
This process produces the following output:
[M::main::0.702*0.84] loaded the graph from "oryza_sativa_core_48_101_7.fna"
[M::mg_index::9.913*1.50] indexed the graph
[M::mg_opt_update::10.576*1.47] occ_weight=20, occ_max1=178; 95 percentile: 2
[M::ggen_map::11.491*1.42] loaded file "fasta/oryza_sativa_Azucena_core_48_101_1.fna"
[M::ggen_map::168.948*6.05] mapped 37 sequence(s) to the graph
[M::mg_ggsimple::170.379*6.01] inserted 15028 events, including 39 inversions
[M::mg_index::180.913*5.75] indexed the graph
...
[M::main] Real time: 13152.427 sec; CPU: 82007.142 sec; Peak RSS: 47.687 GB
Now we can try mapping cDNA sequences to the graph:
# background reading:
# https://twitter.com/zhigui_bao/status/1417028758725222400
# https://github.com/lh3/minigraph/issues/37

# Note in rice -N 0 /-N 100 made no difference!
soft/minigraph/minigraph -t 4 -j 0.02 oryza_sativa.gfa -N 100 \
	oryza_nivara.cdna.fna | sort -k1,1 -k10,10nr > Onivara.cdna.graph.sort.gaf
soft/minigraph/minigraph -t 4 -j 0.02 oryza_sativa.gfa -N 100 \
	oryza_sativa.cdna.fna | sort -k1,1 -k10,10nr > Osativa.cdna.graph.sort.gaf
soft/minigraph/minigraph -t 4 -j 0.02 oryza_sativa.gfa -N 100 \
	oryza_indica.cdna.fna | sort -k1,1 -k10,10nr > Oindica.cdna.graph.sort.gaf
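The sort in the commands above orders the GAF output by query name and then by column 10 (number of matching residues) in descending order, so the first line for each cDNA is its best hit. If you prefer to pick the best hit in a script, a minimal Python sketch could look like this. It assumes the standard 12 mandatory tab-separated GAF columns; the transcript names and graph paths in the toy lines are made up.

```python
def best_hits(gaf_lines):
    """Keep, for every query, the GAF record with most matching residues.

    Uses the mandatory GAF columns only: 1 query name, 2 query length,
    3 query start, 4 query end, 6 path, 10 number of residue matches.
    """
    best = {}
    for line in gaf_lines:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        query, qlen = f[0], int(f[1])
        qstart, qend = int(f[2]), int(f[3])
        path, matches = f[5], int(f[9])
        if query not in best or matches > best[query]["matches"]:
            best[query] = {
                "path": path,
                "matches": matches,
                # fraction of the query covered by the alignment
                "query_cov": (qend - qstart) / qlen,
            }
    return best


# Toy GAF records (made up)
toy = [
    "tx1\t1000\t0\t1000\t+\t>s1>s2\t5000\t100\t1100\t990\t1000\t60",
    "tx1\t1000\t0\t500\t+\t>s7\t3000\t10\t510\t480\t500\t0",
    "tx2\t800\t100\t800\t-\t<s3\t2000\t0\t700\t700\t700\t60",
]

hits = best_hits(toy)
for query, hit in hits.items():
    print(query, hit["path"], hit["matches"], round(hit["query_cov"], 3))
```

To run it on a real file, pass an open file handle instead of the toy list, for example `best_hits(open("Onivara.cdna.graph.sort.gaf"))`.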
See you soon, Bruno

November 21, 2022

In memory of Javier Abadía

Hi,

we start the week in sadness because Javier Abadía, a dear colleague at the Estación Experimental de Aula Dei (EEAD-CSIC), passed away suddenly on Friday. Javier was a wonderful colleague, manager, and mentor to a long list of researchers and professionals in different corners of the world, and an example for many of us who knew him at EEAD. You can see his career on Google Scholar, for example.

With the emotions of these days, many situations I lived through with him come to mind, but I would like to highlight just a couple here, the ones I am most grateful for, in case they inspire other colleagues.

Since my arrival at EEAD in late 2007, as a rookie researcher in the ARAID programme, I had the chance to talk with Javier about the many possibilities that open up when metabolomic and proteomic approaches, in which his group is expert, are combined with genomics and bioinformatics. These informal discussions led him to invite me to collaborate on several projects that crystallised into several papers over the years. Besides the pleasure of interacting with his group, that work allowed me to get to know better what was being done in other departments and enriched my CV, which is everything in this profession. I have no doubt that Javier knew what he was doing. It is a good lesson for those of us in this line of work: help junior colleagues and give them a push at the start of their careers.

When I had already spent more than a decade at EEAD, I had the opportunity to go and work for three years at the European Bioinformatics Institute (EMBL-EBI). Such moves are not that common in Spanish academia because they are complicated. Javier was one of the colleagues who supported me most and advised me on how to take the steps so that it would not harm my possible return to CSIC. In fact, when I passed the civil-service exam (oposición) and got permission to rejoin in 2021, Javier himself wrote to me by hand:

Congratulations, it has been like climbing Everest..

J

I am sure that many of the colleagues sitting the tenured scientist (científico titular) exams these days would subscribe to these words.

I finish with this mountaineering reference, because Javier was also a guide to the Pyrenees for us and would gift us photos like the one below in his Christmas greetings. Farewell.

PS: obituary from EEAD-CSIC