Sunday 15012023
Mario Stanke, Institute for Mathematics and Computer Science, University of Greifswald. He reviews existing tools that identify coding and non-coding genes based on dN/dS and sequence composition; these help discriminate non-coding sequences. In contrast, the recent ClaMSA is a differentiable TensorFlow model that can be trained on any objective criterion, as opposed to PhyloCSF (likelihood) or codeml (omega). It started as a student assignment project. Code and documentation can be found at https://github.com/Gaius-Augustus/clamsa. In their benchmark with vertebrate and fly exon codon alignments it makes fewer errors than PhyloCSF and codeml. It can be used to scan genomic regions to discover protein-coding frames.
Mihaela Pertea, Department of Biomedical Engineering, Johns Hopkins University. She talks about recent work that evolves StringTie (https://ccb.jhu.edu/software/stringtie) and integrates both short RNA-seq reads (which rarely span more than one exon) and long reads (which have a high error rate) to assemble transcripts. Long reads on their own can create very complex splice graphs, which impaired StringTie v1. She mentions the work of Lima et al. 2020 to justify why simply correcting errors in long reads is not a good idea, as many isoforms are lost (https://academic.oup.com/bib/article/21/4/1164/5512144). StringTie2 can successfully use both types of reads and can handle noisy long reads. In her tests hybrid data produce much better transcripts than either input alone, and correcting the long reads beforehand is not needed (not worth it).
Tomáš Brůna, DOE Joint Genome Institute. Presents GeneMark-ETP, software for protein-coding gene annotation. He shows benchmarks with C. elegans, A. thaliana and D. melanogaster and then with more complex genomes. The latest version performs better than BRAKER, particularly in genomes with heterogeneous GC regions, such as mouse. It takes 1-3 days to run.
Lars Gabriel, Institute for Mathematics and Computer Science, University of Greifswald. Presents BRAKER3 (https://github.com/Gaius-Augustus/BRAKER) for the annotation of eukaryotic genomes from short RNA-seq reads and protein sequences. This latest version uses HISAT2, StringTie, GeneMark-ETP and AUGUSTUS among other tools. The only plant tested is A. thaliana. They also use TSEBRA to select plausible isoforms. In their benchmarks they report results at the exon, gene and transcript level. It takes almost 2x the time of BRAKER2 to run. They haven’t used it with long reads yet. Related note: in questions somebody says that their IsoSeq libraries contain a lot of retained introns.
I missed Roderic Guigo's talk, but there's this tweet: https://twitter.com/Campbell_JD_PhD/status/1614682387648221184
Stephane Rombauts, VIB-UGent Center for Plant Systems Biology. Talks about their pangenome implementation for genomes of the same species or genus. Reviews gene-centric vs sequence-based graphs vs object-based sequence feature pangenomes. They go for the last option, resembling Sandra Smit’s approach but using vaticle (https://vaticle.com) instead of Neo4j as DB engine.
Robert J. Henry, Queensland Alliance for Agriculture and Food Innovation, University of Queensland. Talks about their current program on sequencing wild relatives of crops such as macadamias, bananas, pigeon pea, coffee, mango, etc., and cereals (wild rice, the sorghum genus). These are mostly untapped species that could eventually be domesticated if needed. He mentions that rice domestication involved only two mothers (plastids) but happened many times. In questions he says that it takes only 1 yr to domesticate these plants. The main problem is seed shattering. Many of the cereal species have larger grains than currently cultivated plants.
Ilga Mercedes Porth, Laval University. She talks about poplar improvement in Canada. They have a panel of 1K genotyped individuals. 2/3 of the variants have MAF<0.05. They have found 8K gained stop codons in 6K genes. Heterosis is used in breeding as well (to accelerate growth for instance), and is suspected to be related to structural variants. They have looked at local adaptation using SNPs, SVs and more recently ploidy. It seems that in polyploids certain transposon families are over-expressed compared to diploids (https://nph.onlinelibrary.wiley.com/doi/full/10.1002/ppp3.10297). As happens in other species, triploids seem to accumulate in areas prone to drought. They have also found that certain multigene families are partially responsible for adaptation (https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14836).
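As a side note on the MAF<0.05 figure, here is a minimal sketch of how minor allele frequency is usually computed for a biallelic site in diploid samples. This is not code from the talk; the function names and the genotype encoding (two-letter strings such as 'AG', with None for missing calls) are assumptions made for illustration.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return the minor allele frequency at one site.

    genotypes: iterable of diploid calls such as 'AA' or 'AG';
    None marks a missing call (hypothetical encoding for this sketch).
    """
    # Count every allele observed across all non-missing genotypes.
    counts = Counter(allele for g in genotypes if g for allele in g)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called genotypes at this site")
    if len(counts) == 1:
        return 0.0  # monomorphic site: no minor allele
    # Frequency of the second most common allele (the minor allele at a biallelic site).
    return counts.most_common()[1][1] / total

def is_rare(genotypes, threshold=0.05):
    """True if the site falls below the MAF threshold mentioned in the talk."""
    return minor_allele_frequency(genotypes) < threshold
```

For example, four called samples with genotypes AA, AG, AA, AA give one G among eight alleles, so the MAF is 0.125 and the site would not count as rare under the 0.05 threshold.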
Min Zhang, University of California Irvine. She talks about network reconstruction out of cis-eQTLs. She explains this has been done in the literature by building linear models, but these can be unfeasible at the genome scale because of the large number of parameters required. I was too tired to follow the algebra; she shows examples on simulated and yeast data.
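To get a feeling for why the number of parameters becomes a problem, here is a back-of-the-envelope sketch. It assumes, purely for illustration, a linear model in which the expression of each gene is regressed on all other genes plus its own cis-eQTLs; the exact model form used in the talk is not recorded in these notes, and the function name is made up.

```python
def linear_network_parameter_count(n_genes, cis_eqtls_per_gene=1):
    """Count coefficients in an assumed full linear gene network model.

    Each gene gets one regulatory coefficient per other gene, plus one
    coefficient per cis-eQTL (assumption: the same number for every gene).
    """
    regulatory = n_genes * (n_genes - 1)   # possible directed gene-to-gene edges
    cis_effects = n_genes * cis_eqtls_per_gene
    return regulatory + cis_effects

# With 20,000 genes and one cis-eQTL each this is 400 million coefficients.
print(linear_network_parameter_count(20_000))
```

Under these assumptions the count grows quadratically with the number of genes, which is why some form of sparsity or stepwise estimation is typically needed at genome scale.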
Nils Stein, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Talks about accessing the secondary genepool of barley, which includes Hordeum bulbosum, a species that is not a host for most barley pathogens. Complements a talk by Martin Mascher yesterday. Shows previous results of exome capture of introgression lines, where regions that accumulate polymorphic read mappings delineate introgressed H. bulbosum segments, usually telomeric. Now they are using PacBio HiFi reads to resolve introgressions by assembling phased haplotypes and building pangenome graphs of particular loci, such as a locus for virus resistance.