#!/perl/bioinfo: annotation

24 August 2021

gene IDs in RSAT::Plants

Plant genomes in plants.rsat.eu are imported from different sources, such as Ensembl Plants, NCBI or JGI Phytozome. You can check the source of your genome of interest on the left menu: find 'Genomes and genes' and click on the supported organisms table. As each database follows its own conventions, the format of gene IDs varies across genomes. A genome might also have several annotations with different gene names, so it's important to know which IDs are actually available at RSAT. Here you'll learn two ways to work out the correct gene IDs for your genome.

1. Sequence tools -> retrieve sequence

On the left menu, find 'Sequence tools' and then click on retrieve sequence. Under 'Mandatory inputs', type or select the appropriate genome and click on 'all genes of this organism'. See the figure for an example:

You can then click on 'Run analysis'. If you select 'display', you'll get FASTA output with the gene IDs in the headers.
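If you save that FASTA output to a file, you can pull the IDs out of the headers in the terminal. This is only a minimal sketch: it assumes the gene ID is the first whitespace-delimited token of each header line, which is the usual FASTA convention, so check it against your own output. The function name fasta_ids and the file name in the usage comment are made up for illustration.

```shell
# Print the first token of every FASTA header line read from stdin.
# Assumption: that first token is the gene ID.
fasta_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Usage (hypothetical file name):
#   fasta_ids < retrieve_seq_output.fasta | sort -u | head
```

Piping through sort -u removes duplicates in case the same gene appears in more than one header.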

2. Data -> Linux terminal

On the left menu, find 'Help & Contact' and then click on data. Find the 'genomes/' folder, then your genome, and inside it the 'genome/' folder. There should be a file named 'gene.tab'. Copy the URL of that file and fetch it from the terminal with wget or curl:

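# base URL of the RSAT::Plants data folder (no trailing slash)
dataurl=http://rsat.eead.csic.es/plants/data

genefile=$dataurl/genomes/Cannabis_sativa.cs10.GCF_900626175.2.NCBI/genome/gene.tab

# with wget: print the first column (gene IDs), skipping ';' comment lines
wget -O - -o /dev/null "$genefile" | cut -f 1 | grep -v "^;" | head

# or, equivalently, with curl
curl -s "$genefile" | cut -f 1 | grep -v "^;" | head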

You should obtain a single column with the gene IDs supported for that genome, which you can copy and paste directly into retrieve sequence. Note that head only shows the first ten lines; drop it to list them all.
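If you'd rather keep the full list and check your own IDs against it, something along these lines might help. It is a sketch, not an RSAT tool: gene_ids and has_gene_id are hypothetical helper names, MY_GENE_ID is a placeholder, and the filter is the same one used above (first column, lines starting with ';' skipped).

```shell
# Print the gene IDs (column 1) of a gene.tab-style table read from stdin,
# skipping comment lines that start with ';'.
gene_ids() {
  grep -v '^;' | cut -f 1
}

# Succeed if the ID given as $1 matches a whole line of the list on stdin.
has_gene_id() {
  grep -qxF -- "$1"
}

# Usage (needs network access; MY_GENE_ID is a placeholder):
#   curl -s "$genefile" | gene_ids > gene_ids.txt
#   has_gene_id MY_GENE_ID < gene_ids.txt && echo "supported"
```

The whole-line match (-x) avoids false hits when one ID is a prefix of another.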

Hope this helps,

Bruno