The Spanish Society for Bioinformatics and Computational Biology (Sociedad Española de Bioinformática y Biología Computacional, SEBiBC) has just held its first biennial congress in Valencia, and it was a great success. Over 3 days, 350 students and professionals in computational biology discussed recent results and were enlightened by keynote talks from Christine Orengo, Luz García Alonso, Doreen Ware (a pioneer of plant genomics and Gramene) and Jaime Huerta Cepas. Below I share my notes on some of the congress activities.
https://x.com/SEBiBC_es/status/1847254469093867696
María Barranco works with DL models to extract information about properties of chemical compounds represented as SMILES, such as binding affinity. They use the SMILE-to-BERT architecture and test two different tokenizers. Code and Jupyter notebooks at https://github.com/m-baralt/smile-to-bert
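To illustrate why the choice of tokenizer matters for SMILES, here is a minimal sketch comparing two common strategies: character-level and atom-level (regex-based). This is an assumption on my side, as my notes do not record which two tokenizers were actually tested; the regular expression and the function names below are illustrative and are not taken from the smile-to-bert repository.

```python
import re

# Atom-level pattern (illustrative assumption, not the smile-to-bert one):
# bracket atoms, two-letter halogens, two-digit ring closures, single atoms,
# bond/branch symbols and single-digit ring closures.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-\+\\/\(\)\.:~@\?\*\$]|\d)"
)


def char_tokenize(smiles):
    """Character-level tokenizer: every character is a token."""
    return list(smiles)


def atom_tokenize(smiles):
    """Atom-level tokenizer: keeps 'Cl', 'Br' and bracket atoms whole."""
    tokens = ATOM_PATTERN.findall(smiles)
    # findall silently skips unmatched characters, so check nothing was lost
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    for smiles in ("CC(=O)Cl", "[NH4+]", "CC(=O)Oc1ccccc1C(=O)O"):
        print(smiles, char_tokenize(smiles), atom_tokenize(smiles))
```

With the character-level tokenizer chlorine is split into `C` and `l`, so the model has to learn that the pair is a single atom, whereas the atom-level tokenizer emits `Cl` as one token.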
Diego Herráez explains how his experimental approach, based on thermodynamic mapping and confocal microscopy, enables the study of individual cells in living tissues by measuring fluctuations (tracking- or flow-based). Very good figures, but the explanations were rather superficial, as he did not go into the mathematical details of how cell dynamics is done in 2D (they do not reach 3D yet, which is a limitation). He shows µm-scale videos of cells and their membranes. They apply network analysis to model how information is transmitted within a tissue. They have studied the effect of aneuploidy on tissues and have run cell migration experiments.
Alejandro Orozco Valero, PhD student from UGranada, uses timeseries lib Catch22 (https://github.com/DynamicsAndNeuralSystems/catch22) to extract features from electrode-based data from two databases. Then they use DL to infer imbalances in brain activity. He does discuss limitations and future work.
Marta Camarena talks about cancer vaccines and work recently published at https://www.science.org/doi/10.1126/sciadv.adn3628. They want to design vaccines targeting cancer-specific antigens encoded by non-canonical ORFs, mostly < 100 aa, in immune-privileged tissues (testis). Her work combines transcriptomics and immunopeptidomics. She is collaborating with clinicians at UNavarra to confirm those that elicit T cell responses.
Christine Orengo "AlphaFold predicted structures expand our understanding of functional divergence in protein families".
They define domains on atomic structures as a consensus among 3 methods:
Chainsaw, to cut domains, including discontinuous ones, from contact matrices: https://academic.oup.com/bioinformatics/article/40/5/btae296/7667299
Merizo: https://www.nature.com/articles/s41467-023-43934-4
UniDoc: https://academic.oup.com/bioinformatics/article/39/2/btad070/7025502
The encyclopedia of domains (https://zenodo.org/records/10848710), 25% are discontinuous, after filtering out poor AF models (<10%).
She shows PCA results from CATHe (https://pubmed.ncbi.nlm.nih.gov/36648327), confirming that its embeddings closely resemble the HMMs of CATH families. Based on that, they are using https://huggingface.co/Rostlab/prot_bert to replace HHalign, Prost-T5 in particular.
Aureliano Bombarely from IBMCP starts by explaining plant features, biomass and diversity (300k species). 80% of food comes from 17 families, and 28k species are recorded as medicinal. Currently 2225 at NCBI Genome. Annotation tags genomic elements. Results: BRAKER > Helixer > Maker seems to incorporate better transcript (k-mer) diversity and also improves over StringTie (i.e. BUSCO), although it produces a lot of FPs (mostly TE-related). They use OrthoVenn to compare alternative annotations. DL-based Helixer is good anyway for its speed, which is linear with respect to assembly size, although it performs worse on species not used in the model.
Rubén Cañas from Global Omnium explains his work on water QC using macro-invertebrates as bioindicators, as they are responsive to environmental changes. He explains the Iberian Biological Monitoring Working Party (IBMWP) index and the methodological limitations of scaling it up. As a solution they try metagenomic identification of communities. They use 2 undisclosed gene markers, combined with morpho-taxonomic identification (MI), with GM2 capturing the most diversity (shows alpha and beta diversity results). They aim for family-level identification. The Pearson correlation between metagenomic identification and MI is currently 0.28, but they are working on routine standardization to improve results. He discusses that NCBI Taxonomy was used for being more complete, even though other specialized resources are better for particular taxa of interest.
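For readers less familiar with the two statistics mentioned in this talk, the sketch below computes a Shannon alpha diversity index from family counts and a Pearson correlation between two series of per-site scores. The choice of the Shannon index, the idea of correlating per-site scores and all the numbers are assumptions for illustration only; the talk did not detail which diversity measure or which values were compared.

```python
import math


def shannon(counts):
    """Shannon alpha diversity (natural log) from per-family counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical family counts at one sampling site
    print("Shannon:", shannon([12, 7, 3, 1]))
    # Hypothetical per-site scores from the two identification methods
    metagenomic = [55, 80, 42, 61, 90]
    morphological = [70, 65, 50, 85, 60]
    print("Pearson:", pearson(metagenomic, morphological))
```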
Michael Tress summarizes results from recently published work on novel ORFs https://academic.oup.com/nar/article/52/14/8112/7702505 . 28 out of 32 proteomics-supported examples sit in the 5'UTR, and sometimes overlap with the canonical ORF. 64% of human novel ORFs are not conserved beyond monkeys. Most new start codons are non-canonical, the most common being CTG. These ORFs are shifted towards higher %GC and are more expressed in cancer. In some cases they complete protein domains in AF models. He concludes by discussing that in most cases they add disordered regions and seem to be biological noise, except the one case that disrupts signal peptides and ships proteins to different compartments.
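As a reminder of what allowing non-canonical start codons means in practice, here is a minimal forward-strand ORF scanner where the set of accepted start codons is a parameter. It is a toy sketch under my own assumptions (first start codon in frame wins, forward strand only, made-up sequence) and is not the pipeline used in the paper.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, starts=("ATG", "CTG"), min_codons=2):
    """Return (start, end, sequence) for ORFs in the 3 forward frames.

    An ORF opens at the first accepted start codon seen in a frame and
    closes at the next in-frame stop codon; `end` is exclusive and the
    stop codon is included in the returned sequence.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        open_start = None  # position of the start codon of the open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if open_start is None and codon in starts:
                open_start = i
            elif open_start is not None and codon in STOPS:
                n_codons = (i - open_start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((open_start, i + 3, seq[open_start:i + 3]))
                open_start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "AACTGGCCAAATAGCATGCCCTGA"  # made-up sequence
    print("ATG only:   ", find_orfs(toy, starts=("ATG",)))
    print("ATG or CTG: ", find_orfs(toy, starts=("ATG", "CTG")))
```

Accepting CTG as a start reveals an extra ORF in another frame of the toy sequence, which is the kind of candidate that would go unnoticed in an ATG-only scan.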
Arnau Montagud, currently at I2SysBio - CSIC, presents work from a postdoc at BSC. He presents their modelling software. As a disclaimer he takes some time to remind us that all models are wrong and biased, including sophisticated digital twins (DT). Despite these limitations, multi-scale models are increasingly used in healthcare to carry out animal-free pre-clinical tests. He has been using Boolean models for over a decade now (MaBoSS), for instance to simulate signalling pathways. Finally he also mentions PhysiBoSS and some current results that are being experimentally tested. In the Q&A he responds that quantum computing might be useful in the future.
Workshop Bioinformatics 2030: Innovation and Professional Challenges
Diana de la Iglesia: Fujitsu has a bioinfo unit, with 3-4 year projects that are sustained over time, more client-oriented, and with better salaries
Laureano Carpio (Protoqsar) warns about the AI soufflé (hype) using the analogy of QSAR in chemistry, reminds us of the industrial PhD, and expects that with time and better tools there will be people working with lower qualifications
Sheila Zúñiga (INCLIVA) talks about her experience in clinical bioinfo; at SEBiBC they are working with the Ministry to give visibility to bioinfo professionals
Pedro Carmona (Centro de Investigaciones Genómicas e Oncológicas, GENYO) talks about the transformation of the field over 15 years into a data science; they currently have space problems for bioinformaticians. Increasingly, bioinformaticians will have to lead projects.
The role of "técnico SO" in Bioinformatics, for working without a PhD, is missing, although it already exists in industry. Where does bioinfo fall within ANECA: there is no specific committee -> committee 0, multidisciplinary
Workshop Generative Artificial Intelligence: The Good, the Ugly and the Bad
Original material: https://drive.google.com/drive/folders/1LcRG9Pi9696njoEOwpDMhJxgfkqTVJR2
Resources: https://www.nomic.ai/gpt4all , https://asciinema.org
Ana Hernández presents her work on the Orth Group Delineation algorithm, which is able to detect duplication events. She ran 3 benchmarks: i) Hox genes (ANTP, PRD, TALE) from 28 species, ii) Quest for Orthologs data and iii) eggNOG 6 data. eggNOG starts by doing an all-vs-all comparison of sequences, which is expensive.
Alex Ascensión presents a way to detect bacterial reads, after filtering out human ones, combining kraken2, krakenUniq, centrifuge and kaiju; more details at https://www.biorxiv.org/content/10.1101/2024.04.23.590754v1
Paula Ruiz presents a nextflow, dockerized pipeline for the uniform annotation of FASTA files of Mycobacterium genome assemblies with https://github.com/oschwengers/bakta and miniprot, which are then fixed using the ref annotation as guide. These improve pangenome downstream analysis.
Miguel Fernández starts by presenting ANI and its limitations, which prompted development of BACTAX-ID, code not available yet (https://github.com/irycisBioinfo/BacTaxID), which uses MASH distance and a fixed set of cutoffs to group genomes.
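Since the BacTaxID code was not available when I wrote these notes, the following is only my own sketch of the general idea of grouping genomes by Mash distance with fixed cutoffs. It uses the Mash distance formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the k-mer sets; here j is computed exactly on toy sequences, whereas Mash estimates it from MinHash sketches. The single-linkage grouping, the cutoff values and the sequences are assumptions for illustration.

```python
import math
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a, b, k):
    """Mash distance from the exact Jaccard index of two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        raise ValueError("Sequences must be at least k bases long")
    j = len(ka & kb) / len(ka | kb)
    if j == 0:
        return 1.0  # no shared k-mers: maximum distance by convention
    if j == 1:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k


def group_genomes(genomes, k, cutoff):
    """Single-linkage groups of genomes with pairwise distance <= cutoff."""
    parent = {name: name for name in genomes}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(genomes, 2):
        if mash_distance(genomes[a], genomes[b], k) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in genomes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    toy = {  # made-up sequences standing in for genomes
        "a": "ACGTACGTTGCA",
        "b": "ACGTACGTTGCC",
        "c": "GGGGGGGGCCCC",
    }
    for cutoff in (0.01, 0.05):  # hypothetical fixed cutoffs
        print(cutoff, group_genomes(toy, k=4, cutoff=cutoff))
```

Applying a decreasing series of cutoffs to the same set of genomes yields nested groups, which is one way a fixed set of thresholds can produce a hierarchical classification.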
Taller BCBHub: Strategies for International Funding of Computational Biology and Bioinformatics
First page says it all, even 1st sentence
Repeat main idea
European leadership
Gender balance
CZF biomed, no databases, impact, essential tools, CV weights less than tool, not for incipient tools, with community
Jaime Huerta Cepas "Functional, evolutionary and ecological significance of unknown genes in the global microbiome" summarizes ten fantastic years of work, with a special focus on https://www.nature.com/articles/s41586-023-06955-z, which helped alumnus Álvaro Rodríguez del Río win the Oswaldo Trelles award for the best PhD thesis in 2024. We are proud of him and Carlos Cantalapiedra, also mentioned in the keynote.
Our poster: https://digital.csic.es/handle/10261/368616
Some resources I saw on posters
Caastools, to find missense mutations in (tree, CDS MSA) associated with traits, with an example on the poster using primates: https://github.com/linudz/caastools
Protocol for the identification and annotation of lncRNAs in plants: https://github.com/ncRNA-lab/Cucurbit_lncRNAs_landscape
BUGSI, like BUSCO but including only housekeeping genes with only one isoform. Should be useful to estimate quality of (human?) transcriptomes. Not sure whether this would be available for plants.
Molecular dynamics to predict missense mutations that affect folding: https://github.com/elhectro2/reMoDA
Hierarchical deep learning for predicting GO annotations: https://academic.oup.com/bioinformatics/article/38/19/4488/6656346