January 16, 2023

Notes on Plant and Animal Genomes conference #PAG30 (II)

Sunday 15012023

Mario Stanke, Institute for Mathematics and Computer Science, University of Greifswald. He reviews existing tools to identify coding and non-coding genes based on dN/dS and sequence composition. These help discriminate coding from non-coding sequences. In contrast, the recent ClaMSA is a differentiable TensorFlow model which can be trained on any objective criterion, as opposed to PhyloCSF (likelihood) or codeml (omega). It was a student assignment project. Code and documentation can be found at https://github.com/Gaius-Augustus/clamsa. In their benchmark with vertebrate and fly exon codon alignments it makes fewer errors than PhyloCSF and codeml. It can be used to scan genomic regions to discover protein-coding frames.

Mihaela Pertea, Department of Biomedical Engineering, Johns Hopkins University. She talks about recent work that evolves StringTie (https://ccb.jhu.edu/software/stringtie) and integrates both short RNA-seq reads (which rarely span more than 1 exon) and long ones (high error rate) to assemble transcripts. Long reads on their own can create very complex splice graphs, which impaired StringTie v1. She mentions the work of Lima et al 2020 to justify why simply correcting errors in long reads is not a good idea, as many isoforms are lost (https://academic.oup.com/bib/article/21/4/1164/5512144). StringTie2 can successfully use both types of reads and can handle noisy long reads. In her tests hybrid data produce much better transcripts than either input alone, and correcting the long reads is not needed (not worth it).

Tomáš Brůna, DOE Joint Genome Institute. Presents GeneMark-ETP, software for protein-coding gene annotation. He shows benchmarks with C. elegans, A. thaliana and D. melanogaster and then with more complex genomes. The latest version performs better than BRAKER, particularly in genomes with heterogeneous GC regions, such as mouse. It takes 1-3 days to run.

Lars Gabriel, Institute for Mathematics and Computer Science, University of Greifswald. Presents BRAKER3 (https://github.com/Gaius-Augustus/BRAKER) for the annotation of eukaryotic genomes from short RNA-seq reads and protein sequences. This latest version uses HISAT2, StringTie, GeneMark-ETP and AUGUSTUS among other tools. The only plant tested is A. thaliana. They also use TSEBRA to select plausible isoforms. In their benchmarks they report results at the exon, gene and transcript level. It takes almost 2x the time of BRAKER2 to run. They haven’t used it with long reads yet. Related note: in questions somebody says that their IsoSeq libraries contain a lot of retained introns.

I missed Roderic Guigo's talk, but there's this tweet: https://twitter.com/Campbell_JD_PhD/status/1614682387648221184

Stephane Rombauts, VIB-UGent Center for Plant Systems Biology. Talks about their pangenome implementation for genomes of the same species or genus. Reviews gene-centric vs sequence-based graphs vs object-based sequence feature pangenomes. They go for the last option, resembling Sandra Smit’s approach but using vaticle (https://vaticle.com) instead of Neo4j as DB engine.

Robert J. Henry, Queensland Alliance for Agriculture and Food Innovation, University of Queensland. Talks about their current program on sequencing wild relatives of crops such as macadamias, bananas, pigeon pea, coffee, mango, etc, and cereals (wild rice, the sorghum genus). These are mostly untapped species that could eventually be domesticated if needed. He mentions that rice domestication involved only two mothers (plastids) but happened many times. In questions he says that it takes only 1 yr to domesticate these plants. The main problem is seed shattering. Many of the cereal species have larger grains than currently cultivated plants.

Ilga Mercedes Porth, Laval University. She talks about poplar improvement in Canada. They have a panel of 1K genotyped individuals. 2/3 of the variants have MAF<0.05. They have found 8K gained stop codons in 6K genes. Heterosis is used in breeding as well (to accelerate growth for instance), and is suspected to be related to structural variants. They have looked at local adaptation using SNPs, SVs and more recently ploidy. It seems that in polyploids certain transposon families are over-expressed compared to diploids (https://nph.onlinelibrary.wiley.com/doi/full/10.1002/ppp3.10297). As happens in other species, triploids seem to accumulate in areas prone to drought. They have also found that certain multigene families are partially responsible for adaptation (https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14836).

Min Zhang, University of California Irvine. She talks about network reconstruction out of cis-eQTLs. She explains this has been done in the literature by building linear models, but these can be unfeasible at the genome scale because of the large number of parameters required. I was too tired to follow the algebra. She shows examples on simulated and yeast data.

Nils Stein, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Talks about accessing the secondary genepool of barley, which includes Hordeum bulbosum, which is not a host for most barley pathogens. Complements a talk by Martin Mascher yesterday. Shows previous results of exome capture of introgression lines, where regions that accumulate polymorphic read mappings delineate introgressed H. bulbosum segments, usually telomeric. Now they are using PacBio HiFi reads to resolve introgressions by assembling phased haplotypes and building pangenome graphs of particular loci, such as a locus for virus resistance.

 

January 15, 2023

Notes on Plant and Animal Genomes conference #PAG30 (I)

Over the next few days I will be sharing my notes on some of the PAG talks I attend.


PAG 30 - January 13-18, 2023 - Plant & Animal Genome Conference

Saturday 14012023

This was my first day; you can tell how jet-lagged I was, as I had to pass on the last session.

Mohamed Zouine, Institut National Polytechnique de Toulouse: when reporting on the quality of a genome assembly it might be useful to use the Assembly Standards of the Earth BioGenome Project, which can be found at https://www.earthbiogenome.org/assembly-standards

Adam William Schoen, University of Maryland: in his group they use MutMap (https://pubmed.ncbi.nlm.nih.gov/22267009/) to find candidate genes for mutant phenotypes. That requires sequencing a single sample of pooled mutant DNA and computing a SNP index; candidate regions have index values close to 1. For instance, the mutant Tin3 has only one tiller in T. monococcum but longer spikes, and the candidate gene has 1 SNP in the first position of intron 1, which is then retained.
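As a rough illustration of the SNP-index idea, here is a minimal Python sketch. It assumes that per-site read counts for the reference and mutant alleles are already available from the pooled sample. The function names, the 3-SNP window and the 0.9 threshold are hypothetical choices for this example, not part of MutMap itself.

```python
from statistics import mean

def snp_index(ref_reads, alt_reads):
    """Fraction of reads at a site that carry the mutant (alt) allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("site has no coverage")
    return alt_reads / depth

def candidate_windows(sites, window=3, threshold=0.9):
    """Slide over `window` consecutive SNPs and keep windows whose mean
    SNP index is at least `threshold` (both values are assumptions).

    sites: list of (position, ref_reads, alt_reads), sorted by position.
    Returns a list of (start_position, end_position, mean_index).
    """
    indexed = [(pos, snp_index(ref, alt)) for pos, ref, alt in sites]
    hits = []
    for i in range(len(indexed) - window + 1):
        chunk = indexed[i:i + window]
        avg = mean(value for _, value in chunk)
        if avg >= threshold:
            hits.append((chunk[0][0], chunk[-1][0], avg))
    return hits

# Toy counts: only the SNPs at 300-500 are nearly fixed for the mutant allele.
sites = [
    (100, 10, 10),
    (200, 12, 9),
    (300, 1, 19),
    (400, 0, 20),
    (500, 0, 22),
    (600, 9, 11),
]
for start, end, avg in candidate_windows(sites):
    print(start, end, round(avg, 2))
```

With these toy counts only the window spanning positions 300 to 500 is reported, since unlinked SNPs segregate at an index near 0.5 in the pool.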

Matthias Heuberger, University of Zurich: they have a list of genes that seem functional and are expressed in Triticum monococcum centromeres.

Maria Alejandra Alvarez, University of California, Davis & HHMI: she presents data on ELF3 mutants to show that photoperiodic response in wheat does not depend on Ppd1: https://www.biorxiv.org/content/10.1101/2022.10.11.511815v1

Martin Mascher, IPK: Genus-wide pan-genomics. Hordeum bulbosum is ~Myr of evolution away from H. vulgare and can be crossed with it. It is a perennial outcrosser and the closest wild relative. They are using Hi-C data to assemble separate haplotypes. They typically use Hi-C to check the quality of assemblies. They are now trying pollen sequencing to separate the parental genomes of a heterozygous individual. There are larger non-recombining pericentromeric regions in bulbosum than in vulgare. They used a modified, as yet unpublished TRITEX protocol, which has been used for other species as well. Worth saying these results are actually from a PhD student who could not attend the meeting because his visa took too long; unbelievable this could happen. [HiFi sequencing of barley about 10K per genotype, plus 5K for Hi-C].

Einar Baldvin Haraldsson, Heinrich Heine Universität Düsseldorf. He’s comparing annual barley to perennial Hordeum erectifolium, native to Argentina, which rolls its leaves to keep them erect under drought. IsoSeq with 22 different samples/tissues -> 38K loci, 27K protein coding. Collaborates with M. Mascher on the Pan-Hordeum project (24 species). According to his crosses, perenniality seems to be dominant over annuality.

Eduard Akhunov, Kansas State Univ. He talks about multiplex CRISPR-editing strategies to engineer gene regulation in wheat. Precision is key in multigene families to target the desired copy. He is currently working on deleting regions within promoters of the Q gene (an AP2 TF) and also on silencing clusters of gliadins. He mentions that they needed to resequence the target promoter in the target cultivar, as the CS reference was too divergent to guide their CRISPR constructs. They do both multiplex amplicon sequencing and WGS to validate their editing experiments, and they have seen no off-targets yet. For the gliadins they used the long-read assembly of cv. Fielder to design gRNAs. They confirmed deletions by WGS and also scored protein quality of the edited wheats, with negligible impact on bread making. He says that crosses and field trials are the only way to evaluate the impact of edits in the long term.

William Marande, INRAE – CNRGV: presents the 4Gbp (2Gbp haploid) Vanilla planifolia genome project. Vanilla tissues undergo strict partial endoreplication, which means that the C content of cells varies, mixing 2C nuclei with fractions of cells with higher (4, 8, 16) C contents. So they had to optimize and choose the best starting tissue, which turns out to be axillary buds, rich in 2C nuclei. When they do k-mer analysis they find an unusual pattern with 3 peaks, as a result of non- and endoreplicated cell populations [as opposed to the 2 peaks expected for a heterozygous diploid]. See https://pubmed.ncbi.nlm.nih.gov/35617961 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5546068.
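For readers unfamiliar with k-mer spectra, the following Python sketch shows how such a histogram is built and how peaks could be located. It is a toy under stated assumptions: the function names are hypothetical, k-mers are not collapsed with their reverse complements, and peaks are plain local maxima without the smoothing that real, noisy read data would need. Dedicated k-mer counters are used in practice.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer in a collection of reads, skipping k-mers with N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

def spectrum_peaks(spectrum):
    """Multiplicities that are strict local maxima of the spectrum."""
    peaks = []
    for m in sorted(spectrum):
        here = spectrum[m]
        if here > spectrum.get(m - 1, 0) and here > spectrum.get(m + 1, 0):
            peaks.append(m)
    return peaks

# Tiny example: every 3-mer below occurs exactly twice.
counts = kmer_counts(["ACGTAC", "ACGTAC", "TTTT"], k=3)
print(kmer_spectrum(counts))

# Hypothetical spectrum with three modes, standing in for real data.
toy_spectrum = Counter({1: 5, 2: 9, 3: 4, 4: 7, 5: 2, 6: 1, 7: 3})
print(spectrum_peaks(toy_spectrum))
```

In a real spectrum the position of each peak along the multiplicity axis reflects how many copies of the underlying sequence were sampled, which is why cell populations with different replication levels can add peaks.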

Sandra Smit, Bioinformatics Group, Wageningen University and Research. Talks about pangenomes (lossless compressed, integrated, interrogable, scalable, a moving target), which she models with the software PanTools, which is now in version 4.1 and more scalable (hundreds of genomes). It can take for instance all Solanaceae. The latest paper, from 2022, can be found at https://academic.oup.com/bioinformatics/article/38/18/4403/6647839. PanTools is now ploidy-aware and can work at the sequence level as well. They have produced a nice pangenome browser called PanVA: https://www.techrxiv.org/articles/preprint/PanVA_Variant_Analysis_within_Pangenomes/21572433

Wanfang Fu, Department of Plant and Environmental Sciences, Clemson University. She teaches us about extrachromosomal circular DNAs (eccDNAs) in plant genomes. They have been proposed to transmit glyphosate resistance in Amaranthus and behave like independent replicons: https://pubmed.ncbi.nlm.nih.gov/36103503, https://academic.oup.com/plcell/article/32/7/2132/6115633. In her work they use the CIDER-Seq protocol to identify them (https://www.nature.com/articles/s41596-020-0301-0), and they found a variable number of eccDNAs (in the thousands) in different blackgrass (Alopecurus myosuroides, a weed of wheat) accessions.

Justin N Vaughn, USDA-ARS/University of Georgia, explains how to genotype complex loci at low cost with variation/pangenome graphs. Mentions phantom SNPs. They are using https://github.com/USDA-ARS-GBRU/PanPipes to build low-cost graphs with Skim-seq (https://www.nature.com/articles/s41598-022-19858-2). He asks whether we need cyclic graphs to capture recombination events. These ideas were applied to melon fungal resistance in https://pubmed.ncbi.nlm.nih.gov/36550124

Isabelle M. Henry, UC Davis. Clonally propagated species, such as spearmint (Mentha spicata, allotetraploid), are challenging for genomics. She shows nice plots of coverage vs chr position to highlight regions where there are several haplotypes (higher coverage).
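The kind of coverage scan behind such plots can be sketched in a few lines of Python. This is only an illustration under assumptions: per-base depths are taken as already computed, and the function names, the 4-base window and the 1.5x cutoff relative to the median window are hypothetical, not the speaker's method.

```python
from statistics import median

def window_coverage(depths, window):
    """Mean depth in consecutive non-overlapping windows along a chromosome."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

def elevated_windows(depths, window=4, ratio=1.5):
    """Return (window_start, mean_depth) for windows whose mean depth is at
    least `ratio` times the median window depth (an assumed cutoff)."""
    means = window_coverage(depths, window)
    baseline = median(means)
    return [
        (i * window, m)
        for i, m in enumerate(means)
        if m >= ratio * baseline
    ]

# Toy per-base depths: the third and fifth windows have extra coverage.
depths = [10, 11, 9, 10,
          10, 10, 12, 8,
          20, 22, 19, 21,
          10, 9, 11, 10,
          30, 31, 29, 30]
print(elevated_windows(depths))
```

Plotting the window means against their start positions would give a simplified version of a coverage vs chr position plot.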