June 11, 2024

AllHands 2024 in Uppsala (I)

This week I attended the annual AllHands meeting of ELIXIR, the European ESFRI for life science data, which recently turned 10. The meeting is usually held at a different node each year; this year it was in Sweden, in the city of Uppsala.

Although I have been taking part in ELIXIR activities for a while, and for the last year as a member of the node https://inb-elixir.es, this was my first AllHands, and it helped me learn ELIXIR's own vocabulary and get a sense of how it works. I will put my notes here.

https://www.proteinatlas.org is a core resource, from tissues to cell lines and single-cell (nTPM for expression). 0% human proteins are housekeeping, 15% tissue specific. Mentions https://olink.com to detect 5k proteins in human samples. Case-control is not a good approach to find markers; you need to consider a wide disease panel.

Vocabulary:

    Commissioned Services (CoS) are funded from the ELIXIR budget.
    Communities (capital C) are funded for 2 years.
    Focus Groups?
    Platforms are operational infrastructures, not computing resources.
    Services (~500) are provided by nodes.

 RDM = Research Data Management

Workshop "Demystifying ELIXIR: Everything you ever wanted to know and more" included questions and answers worked out in groups, for example: the ELIXIR name was proposed by Janet Thornton and does not stand for anything.

Workshop "Training MiniSymposium All Hands 2024" https://github.com/elixir-europe-training . A learning path will shortly be in place for all communities. FAIR training handbook. Course design: considerations for trainers https://f1000research.com/documents/9-1377.

Workshop "Insights into ELIXIR's Biodiversity and Plant Science Collaborations: Fostering Cross-Disciplinary Dialogue". Cyril presents the ideas of the plant community. Robert Waterhouse (CH) talks about the Biodiversity Community, which T Gabaldón is also part of; mentions https://rdmkit.elixir-europe.org/plant_sciences and is asked about digital twins for ecological niche modelling. S Beier talks about the new work plan for the new roadmap (until 2028) of the Plant Sciences Community, which follows on from https://f1000research.com/documents/10-145 . Mentions https://framework.frictionlessdata.io/docs/guides/validating-data.html to validate tabular data files. Phenotypic data does not fit well at EBI; the alternatives are Zenodo, GnpIS or https://recherche.data.gouv.fr/fr, using MIAPPE as a common language. Yvan Le Bras talks about the French biodiversity initiative PMDB. Cites https://onlinelibrary.wiley.com/doi/10.1002/ece3.9961 . GBIF + MetaShARK. Galaxy, conda, BioContainers. K Gruden: holobiont, a similar initiative at EPSO, no effective communication with ELIXIR. NFDI4Biodiversity in .de, in 10 years. PHENET, Daniel Wibberg. EU, coordinated from INRAE Montpellier. https://www.phenet.eu/en/about-phenet/phenet_partners


Main auditorium of Uppsala University

 


May 13, 2024

Sequence logo out of protein-DNA complex made with AlphaFold3

Hi, last week the AlphaFold3 paper came out in the journal Nature amid concerns about the code and data not being available, some raised by sources as close as referee #3 of the manuscript (read more here and here).

I couldn't find the time to test it with some plant transcription factors (TF) we have worked with over the years. Then I found this thread and could wait no more.

So I took barley VRN1, which according to our own footprintDB database binds to the consensus CCarAAAwGG, as determined by ChIPseq by Deng et al:

https://footprintdb.eead.csic.es/weblogos/54ee1632ee6b68508cdc131f69093cea.png
Sequence logo for VRN1

As this is an SRF-type/MADS-box TF, a family known to bind DNA as dimers (see for instance complex 1hbx_AB), I logged in to the AlphaFold3 beta server at https://golgi.sandbox.google.com/about and pasted the following input. Note that I padded the DNA duplex with ACGT at both sides and replaced degenerate positions with adenines (having several targets on the same DNA duplex might confuse AF3, see here):
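The exact input pasted into the server is not reproduced here. As a rough sketch of the padding rule just described, the snippet below takes the consensus literally, turns every degenerate symbol into A and adds ACGT on both sides. The function names `pad_site` and `revcomp` are made up for this illustration, and the real input may have differed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any AF3 or dnaprot tooling.
# Assumptions: degenerate IUPAC codes become A; flanks are ACGT on both sides.

# pad_site CONSENSUS [FLANK] -> flank + cleaned core + flank
pad_site() (
    flank=${2:-ACGT}
    # uppercase first, then map anything that is not A/C/G/T to A
    core=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -c 'ACGT' 'A')
    printf '%s%s%s\n' "$flank" "$core" "$flank"
)

# revcomp SEQ -> reverse complement, i.e. the bottom strand of the duplex
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""
               for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
               print out }' |
        tr 'ACGT' 'TGCA'
}

top=$(pad_site CCarAAAwGG)
printf 'top strand:    %s\n' "$top"
printf 'bottom strand: %s\n' "$(revcomp "$top")"
```

For CCarAAAwGG this prints ACGTCCAAAAAAGGACGT and ACGTCCTTTTTTGGACGT; the second line is only needed if both strands have to be supplied as separate sequences.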

Once the job finished, I downloaded and uncompressed the 2.9 MB results file, converted the 5 models from CIF to PDB format, and computed DNA motifs and sequence logos for all of them using the Docker container at https://hub.docker.com/r/eeadcsiccompbio/dnaprot:

# convert CIF to PDB files
conda install -c conda-forge gemmi
gemmi convert fold_2024_05_10_16_53_model_0.cif vrn1_0.pdb
...
gemmi convert fold_2024_05_10_16_53_model_4.cif vrn1_4.pdb

# actually get DNA motifs and sequence logos
docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS \
    eeadcsiccompbio/dnaprot dnaprot.pl -P ./VRN1_0 -i vrn1_0.pdb
...
docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS \
    eeadcsiccompbio/dnaprot dnaprot.pl -P ./VRN1_4 -i vrn1_4.pdb
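The same two steps can also be written as a loop over the 5 models. This is only a sketch: the `DRY_RUN` switch and the `run` wrapper are my own additions for illustration (by default the commands are printed, not executed), and `$(id -u):$(id -g)` is used as a portable stand-in for `$UID:$GROUPS`.

```shell
#!/bin/sh
# Sketch: convert and analyse models 0..4 in one go.
# DRY_RUN is an illustrative switch, not an option of gemmi or dnaprot;
# it defaults to 1 (print only). Set DRY_RUN=0 to actually run the commands.
DRY_RUN=${DRY_RUN:-1}

prefix=fold_2024_05_10_16_53_model   # file stem from the AF3 results

# print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

process_models() {
    for i in 0 1 2 3 4; do
        # convert CIF to PDB
        run gemmi convert "${prefix}_${i}.cif" "vrn1_${i}.pdb"
        # get DNA motif and sequence logo for this model
        run docker run --rm -v "$PWD:$PWD" -w "$PWD" -u "$(id -u):$(id -g)" \
            eeadcsiccompbio/dnaprot dnaprot.pl -P "./VRN1_${i}" -i "vrn1_${i}.pdb"
    done
}

process_models
```

Printing the commands first makes it easy to check file names before launching five container runs.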

 

I got 5 DNA sequence motifs, one for each AF3 model:

# IC=8.461 IC/col=0.769
A |   0   0   0  24  57  96  57  24   0   0  53
C |   1  96  96  24  12   0  12  24   0   0  12
G |   6   0   0  24  12   0  12  24  96  96  12
T |  89   0   0  24  15   0  15  24   0   0  19

...

# IC=5.996 IC/col=0.545
A |   4   0   4  24  24  24  84  24   0  16  48
C |   4  96  92  24  24  24   4  24   4  20  16
G |   0   0   0  24  24  24   4  24  92  56  12
T |  88   0   0  24  24  24   4  24   0   4  20

 

Which correspond to the following sequence logos:


In summary, it seems that the AF3 models fed with a cognate DNA sequence can be used to produce reasonable DNA motifs (compared to the consensus CCarAAAwGG). However, note that the nucleotides immediately before and after the consensus seem to be also moderately conserved in the motifs, and in this case they come from the padding sequences. While this indicates that padding can affect the obtained logos, the logo published by Deng et al did include one of those bases.
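To make the comparison with the consensus a bit less visual, a count matrix in the format shown above can be collapsed into a crude consensus string. The sketch below is my own simplification and not how dnaprot or footprintDB derive consensus sequences: it takes the most frequent base per column and writes n when the top count is tied. The function name `matrix_consensus` is hypothetical, and it expects a single matrix on standard input.

```shell
#!/bin/sh
# Hypothetical helper: majority base per column, 'n' when the top count is tied.
# Input rows look like "A |  0  0 24 ..."; other lines (e.g. "# IC=...") are skipped.
matrix_consensus() {
    awk '
        $2 == "|" {
            for (c = 3; c <= NF; c++) {
                n = c - 2                       # column index within the motif
                if (!(n in best) || $c + 0 > best[n]) {
                    best[n] = $c + 0; call[n] = $1
                } else if ($c + 0 == best[n]) {
                    call[n] = "n"               # tie with the current best count
                }
            }
            if (NF - 2 > ncol) ncol = NF - 2
        }
        END {
            for (n = 1; n <= ncol; n++) printf "%s", call[n]
            printf "\n"
        }
    '
}

# first motif from the post
matrix_consensus <<'EOF'
# IC=8.461 IC/col=0.769
A |   0   0   0  24  57  96  57  24   0   0  53
C |   1  96  96  24  12   0  12  24   0   0  12
G |   6   0   0  24  12   0  12  24  96  96  12
T |  89   0   0  24  15   0  15  24   0   0  19
EOF
```

For the first motif this prints TCCnAAAnGGA, where the leading T and trailing A are the columns flanking the CC...GG core.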


PS See also this thread for membrane proteins: https://twitter.com/jankosinski/status/1789062090205921768