#!/perl/bioinfo: protocols

30 September 2024

Protocol for modelling protein pairs with AlphaFold

A year ago Homma, Huang and van der Hoorn published in Nature Communications their experiments modelling hybrid plant:pathogen protein complexes with AlphaFold-Multimer (AFM). Specifically, they devised a way to identify SSPs, small proteins secreted by plant-pathogenic microorganisms, that bind specifically to proteins of the target plant. In total, their AFM screen considered all combinations of 1879 SSPs from bacterial and fungal pathogens of tomato and 6 endogenous hydrolases (chitinases and proteases) that take part in the defence against infection:

Figure: plant:pathogen protein pair models predicted with AFM that exceed the 0.75 threshold, taken from https://doi.org/10.1038/s41467-023-41721-9.

Out of 376 promising protein:protein complexes, selected by their ipTM+pTM scores, they focused on 15 complexes in which unannotated SSPs blocked the active sites of tomato chitinases and proteases. Of those, they obtained experimental confirmation for 4.
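To make the score-based selection concrete, here is a minimal Python sketch. It assumes that the ipTM+pTM score is AlphaFold-Multimer's usual weighted model confidence, 0.8·ipTM + 0.2·pTM, and that a hit is any pair scoring above 0.75. The pair names and score values are made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

# Assumption: ipTM+pTM is the standard AlphaFold-Multimer model confidence.
IPTM_WEIGHT = 0.8
PTM_WEIGHT = 0.2
THRESHOLD = 0.75  # threshold mentioned in the figure caption


@dataclass
class PairModel:
    """Scores of one predicted SSP:target complex (hypothetical container)."""
    name: str
    iptm: float
    ptm: float

    @property
    def confidence(self) -> float:
        # Weighted combination of interface (ipTM) and global (pTM) scores
        return IPTM_WEIGHT * self.iptm + PTM_WEIGHT * self.ptm


def select_hits(models, threshold=THRESHOLD):
    """Return models scoring above the threshold, best first."""
    hits = [m for m in models if m.confidence > threshold]
    return sorted(hits, key=lambda m: m.confidence, reverse=True)


# Toy scores, invented for this example
models = [
    PairModel("SSP_a:protease_1", iptm=0.90, ptm=0.80),
    PairModel("SSP_b:protease_1", iptm=0.40, ptm=0.70),
    PairModel("SSP_c:chitinase_1", iptm=0.78, ptm=0.85),
]

hits = select_hits(models)
for m in hits:
    print(f"{m.name}\t{m.confidence:.3f}")
```

With these toy values only SSP_a and SSP_c pass the filter. In a real screen the scores would be read from the prediction output files instead of being typed in.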

Given the interest these results attracted, the same authors have now published a protocol (https://doi.org/10.1111/tpj.16969) for making this kind of prediction with ColabFold, both on the Web and locally (read more on the blog).

The protocol has the following steps:

Start with ColabFold online
Use a computing cluster for screens
Small sequences model faster
Curate the input sequences
Remove irrelevant domains
Include positive controls
Include negative controls
Recycle multiple sequence alignments (MSAs)
Control data storage
Separate CPU from GPU-intense steps
Try to get MSA >100
Evaluate the predicted scores
Beware of typical AFM errors
Beware of false negatives
Beware of false positives
Explore hits manually
Categorise hits in classes
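As an illustration of how the input for such a screen could be prepared, the following Python sketch builds every SSP x target combination as a single FASTA record. It assumes the ColabFold convention of joining the chains of a complex with a colon; the sequences and identifiers are invented toy data, and the helper names are hypothetical. Positive and negative controls would simply be added as extra records.

```python
from itertools import product


def read_fasta(text):
    """Parse FASTA text into a dict {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word as id
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def pair_records(ssps, targets):
    """Yield (pair_id, 'seqA:seqB') for every SSP x target combination."""
    for (ssp_id, ssp_seq), (tgt_id, tgt_seq) in product(ssps.items(), targets.items()):
        yield f"{ssp_id}__{tgt_id}", f"{ssp_seq}:{tgt_seq}"


def to_fasta(pairs):
    """Format the pairs as FASTA text, one complex per record."""
    return "".join(f">{pair_id}\n{seq}\n" for pair_id, seq in pairs)


# Toy sequences, invented for this example
ssp_fasta = """>SSP1
MKLLAVAT
>SSP2
MRFSTLLA
>SSP3
MQVTSAAL
"""
target_fasta = """>target1
MSSPTKAAG
>target2
MGHHLLPAA
"""

ssps = read_fasta(ssp_fasta)
targets = read_fasta(target_fasta)
pairs = list(pair_records(ssps, targets))
print(to_fasta(pairs))
print(f"{len(pairs)} pairs")  # 3 SSPs x 2 targets
```

The same loop scales to the screen described above, where 1879 SSPs and 6 targets give 11274 pairs, which is why the protocol recommends a computing cluster for screens.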

These steps are summarised in the following flowchart:

[Figure: flowchart of the protocol steps]
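One of the steps listed above, "Try to get MSA >100", can be checked with a few lines of code. The sketch below assumes that the step refers to alignment depth, that is, the number of sequences in each MSA, and that the alignments are stored as A3M text where every entry starts with a ">" line. The function names and the toy alignments are hypothetical.

```python
def msa_depth(a3m_text):
    """Count sequences in A3M text: each line starting with '>' is one entry."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))


def shallow_msas(msas, min_depth=100):
    """Return names of alignments that do not exceed min_depth sequences."""
    return sorted(name for name, text in msas.items() if msa_depth(text) <= min_depth)


def make_toy_a3m(n_seqs):
    """Build a fake alignment with n_seqs identical entries (illustration only)."""
    lines = []
    for i in range(n_seqs):
        lines.append(f">seq{i}")
        lines.append("MKTAYIAKQR")
    return "\n".join(lines) + "\n"


# Toy alignments, invented for this example
msas = {
    "pair_deep": make_toy_a3m(150),
    "pair_shallow": make_toy_a3m(12),
}

for name, text in msas.items():
    print(f"{name}\t{msa_depth(text)}")

print("shallow:", shallow_msas(msas))
```

Pairs flagged as shallow are candidates for closer inspection, since a low score might reflect a poor alignment rather than a lack of interaction.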

See you soon,

Bruno

17 January 2022

Get data from Ensembl Plants with scripts

Hi, I hope you are all well and healthy amidst the current COVID outbreak.

Now that we are back to business, I would like to talk today about a recipe book for extracting data from the genomic database Ensembl Plants with your own scripts. While Ensembl is primarily an interactive browser, some users might want to extract data from different species in one go, or perhaps some specific annotations for a complete chromosome. These recipes allow you to do just that, making good use of the variety of entry points available at Ensembl: API (A), BioMart (B), FTP (F), SQL (S), REST (R), and Ensembl VEP (V).

An Open Access chapter covering this topic has just been published at https://link.springer.com/protocol/10.1007%2F978-1-0716-2067-0_2

The current recipes include:

## A1) Load the Registry object with details of genomes available
## A2) Check which analyses are available for a species
## A3) Get soft masked sequences from Arabidopsis thaliana
## A4) Get BED file with repeats in chr4
## A5) Find the DEAR3 gene
## A6) Get the transcript used in Compara analyses
## A7) Find all orthologues of a gene
## A8) Get markers mapped on chr1D of bread wheat
## A9) Find all syntelogues among rices
## A10) Print all translations for otherfeatures genes

## B1) Check plant marts and select dataset
## B2) Check available filters and attributes
## B3) Download GO terms associated to genes
## B4) Get Pfam domains annotated in genes
## B5) Get SNP consequences from a selected variation source

## C1) Find RNA-seq CRAM files for a genome assembly

## F1) Download peptide sequences in FASTA format
## F2) Download CDS nucleotide sequences in FASTA format
## F3) Download transcripts (cDNA) in FASTA format
## F4) Download soft-masked genomic sequences
## F5) Upstream/downstream sequences
## F6) Get mappings to UniProt proteins
## F7) Get indexed, bgzipped VCF file with variants mapped
## F8) Get precomputed VEP cache files
## F9) Download all homologies in a single TSV file, several GBs
## F10) Download UniProt report of Ensembl Plants
## F11) Retrieve list of new species in current release
## F12) Get current plant species tree (cladogram)

## S1) Check currently supported Ensembl Genomes (EG) core schemas
## S2) Count protein-coding genes of a particular species
## S3) Get stable_ids of transcripts used in Compara analyses 
## S4) Get variants significantly associated to phenotypes
## S5) Get Triticum aestivum homeologous genes across A,B & D subgenomes
## S6) Count the number of whole-genome alignments of all genomes 
## S7) Extract all the mutations and consequences for a selected wheat line
## S8) Get FASTA of repeated sequences from selected species
## S9) Get GFF of repeated sequences from selected species

## R1) Create an HTTP client and helper functions
## R2) Get metadata for all plant species 
## R3) Find features overlapping genomic region
## R4) Fetch phenotypes overlapping genomic region
## R5) Find homologues of selected gene
## R6) Get annotation of orthologous genes/proteins
## R7) Fetch variant consequences for multiple variant ids
## R8) Check consequences of SNP within CDS sequence
## R9) Retrieve variation sources of a species
## R10) Get soft-masked upstream sequence of gene in otherfeatures track
## R11) Get all species under a given taxonomy clade
## R12) Transfer coordinates across genome alignments between species

## V1) Download, install and update VEP
## V2) Unpack downloaded cache file & check SIFT support 
## V3) Predict effect of variants 
## V4) Predict effect of variants for species not in Ensembl

The recipes are written in different scripting languages (Python, R, Perl, Bash) and can be cloned from https://github.com/Ensembl/plant-scripts . Moreover, you can fork the repo and suggest new recipes with a pull request.
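To give a flavour of the REST recipes, here is a minimal Python sketch in the spirit of R1 and R3. It is not copied from the repository: the helper names are hypothetical, and the species and region are arbitrary examples. It relies on the public Ensembl REST server and its overlap/region endpoint, and only uses the standard library.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"


def overlap_url(species, region, feature="gene"):
    """Build the URL of an overlap/region query for one feature type."""
    path = f"/overlap/region/{quote(species)}/{quote(region, safe=':')}"
    return f"{REST_SERVER}{path}?{urlencode({'feature': feature})}"


def fetch_json(url, timeout=30):
    """Send a GET request asking for JSON and return the decoded response."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example query: genes overlapping a 10 kb window of A. thaliana chromosome 1
url = overlap_url("arabidopsis_thaliana", "1:8001-18000")
print(url)
```

Calling fetch_json(url) should then return a list with one dictionary per overlapping gene. The network call is left out of the example so that it can be run offline; the recipes in the repository show complete, tested versions.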

Have a great week,

Bruno