26 de febrero de 2024

Cómo modelar proteínas con colabfold en tu GPU local

Hola,

hoy explicaré cómo he configurado ColabFold para ejecutarlo en hardware local, en concreto en una máquina con Ubuntu 20.04 que tiene una CPU Xeon CascadeLake Silver 4210R y una tarjeta gráfica NVIDIA RTX 3090. Puedes leer más sobre AlphaFold y ColabFold aquí o en este vídeo.

1) Necesité actualizar cuda, en concreto con la versión 11.8, algo que hice como se explica aquí:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt update
sudo apt install cuda-toolkit-11-8

2) Tras reinicar, actualicé la variable de ambiente $PATH añadiendo estas líneas a mi fichero .bashrc:

export PATH=/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

3) Seguí las instrucciones para Linux en https://github.com/YoshitakaMo/localcolabfold?tab=readme-ov-file#for-linux . En mi caso tardó unos pocos minutos y sumó 15G al disco duro. 

4) Probé que todo funciona con un fichero FASTA qee contiene varias secuencias, guardando los resultados en la carpeta multi/ :

colabfold_batch test.multi.faa multi/


Ahora resumo los resultados que obtuve:

  • Por defecto colabfold_batch se conecta a https://api.colabfold.com para hacer búsquedas de secuencias similares y construir alineamientos múltiples (MSA) en un formato similar a FASTA que se llama a3m. Por tanto esa parte del trabajo no se hace localmente y tendrás que usarla con medida. Si quieres saber qué versión de las bases de datos de secuencias de ColabFold estás usando puedes consultar https://github.com/sokrypton/ColabFold/wiki/MSA-Server-Database-History
  • Las primeras secuencias que usé para construir modelos en formato PDB tenían entre 114 y 162 resíduos y tardaban un par de minutos, pego aquí el log: 
  • 2024-02-26 13:05:56,639 Running colabfold 1.5.5 (d36504fad856a0e1df511c5b0434957707030319)
    2024-02-26 13:05:56,862 Running on GPU
    2024-02-26 13:05:57,354 Found 5 citations for tools or databases
    2024-02-26 13:05:57,355 Query 1/29: test1 (length 114)
    2024-02-26 13:05:58,348 Sleeping for 6s. Reason: PENDING
    2024-02-26 13:06:05,308 Sleeping for 10s. Reason: RUNNING
    2024-02-26 13:06:30,822 Padding length to 124
    2024-02-26 13:06:58,791 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=67.9 pTM=0.31
    2024-02-26 13:07:00,321 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=68.8 pTM=0.329 tol=9.09
    2024-02-26 13:07:01,845 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=69.7 pTM=0.358 tol=2.28
    2024-02-26 13:07:03,373 alphafold2_ptm_model_1_seed_000 recycle=3 pLDDT=69.8 pTM=0.367 tol=3.04
    2024-02-26 13:07:03,374 alphafold2_ptm_model_1_seed_000 took 32.6s (3 recycles)
    2024-02-26 13:07:04,871 alphafold2_ptm_model_2_seed_000 recycle=0 pLDDT=71.2 pTM=0.308
    2024-02-26 13:07:06,323 alphafold2_ptm_model_2_seed_000 recycle=1 pLDDT=71.6 pTM=0.346 tol=2.14
    2024-02-26 13:07:07,848 alphafold2_ptm_model_2_seed_000 recycle=2 pLDDT=71.7 pTM=0.358 tol=2.38
    2024-02-26 13:07:09,345 alphafold2_ptm_model_2_seed_000 recycle=3 pLDDT=71.8 pTM=0.365 tol=1.31
    2024-02-26 13:07:09,346 alphafold2_ptm_model_2_seed_000 took 5.9s (3 recycles)
    2024-02-26 13:07:10,984 alphafold2_ptm_model_3_seed_000 recycle=0 pLDDT=68.1 pTM=0.298
    2024-02-26 13:07:12,529 alphafold2_ptm_model_3_seed_000 recycle=1 pLDDT=68.6 pTM=0.34 tol=4.11
    2024-02-26 13:07:13,992 alphafold2_ptm_model_3_seed_000 recycle=2 pLDDT=69.2 pTM=0.36 tol=2.49
    2024-02-26 13:07:15,484 alphafold2_ptm_model_3_seed_000 recycle=3 pLDDT=68.8 pTM=0.367 tol=1.67
    2024-02-26 13:07:15,485 alphafold2_ptm_model_3_seed_000 took 6.1s (3 recycles)
    2024-02-26 13:07:16,987 alphafold2_ptm_model_4_seed_000 recycle=0 pLDDT=66.1 pTM=0.289
    2024-02-26 13:07:18,435 alphafold2_ptm_model_4_seed_000 recycle=1 pLDDT=66.8 pTM=0.283 tol=5.61
    2024-02-26 13:07:19,933 alphafold2_ptm_model_4_seed_000 recycle=2 pLDDT=67.7 pTM=0.298 tol=1.03
    2024-02-26 13:07:21,444 alphafold2_ptm_model_4_seed_000 recycle=3 pLDDT=67.9 pTM=0.318 tol=2.04
    2024-02-26 13:07:21,445 alphafold2_ptm_model_4_seed_000 took 5.9s (3 recycles)
    2024-02-26 13:07:22,931 alphafold2_ptm_model_5_seed_000 recycle=0 pLDDT=66.8 pTM=0.322
    2024-02-26 13:07:24,403 alphafold2_ptm_model_5_seed_000 recycle=1 pLDDT=68.2 pTM=0.345 tol=9.46
    2024-02-26 13:07:25,860 alphafold2_ptm_model_5_seed_000 recycle=2 pLDDT=68.8 pTM=0.354 tol=2.3
    2024-02-26 13:07:27,342 alphafold2_ptm_model_5_seed_000 recycle=3 pLDDT=69.4 pTM=0.358 tol=1.58
    2024-02-26 13:07:27,342 alphafold2_ptm_model_5_seed_000 took 5.9s (3 recycles)
    2024-02-26 13:07:27,369 reranking models by 'plddt' metric
    2024-02-26 13:07:27,369 rank_001_alphafold2_ptm_model_2_seed_000 pLDDT=71.8 pTM=0.365
    2024-02-26 13:07:27,369 rank_002_alphafold2_ptm_model_1_seed_000 pLDDT=69.8 pTM=0.367
    2024-02-26 13:07:27,370 rank_003_alphafold2_ptm_model_5_seed_000 pLDDT=69.4 pTM=0.358
    2024-02-26 13:07:27,370 rank_004_alphafold2_ptm_model_3_seed_000 pLDDT=68.8 pTM=0.367
    2024-02-26 13:07:27,370 rank_005_alphafold2_ptm_model_4_seed_000 pLDDT=67.9 pTM=0.318
    2024-02-26 13:07:28,679 Query 2/29: test2 (length 120)
    2024-02-26 13:07:29,695 Sleeping for 9s. Reason: PENDING
    2024-02-26 13:07:39,667 Sleeping for 9s. Reason: PENDING
    2024-02-26 13:07:49,628 Sleeping for 6s. Reason: PENDING
    2024-02-26 13:07:56,610 Sleeping for 6s. Reason: PENDING
    2024-02-26 13:08:03,608 Sleeping for 5s. Reason: PENDING
    2024-02-26 13:08:09,564 Sleeping for 6s. Reason: PENDING
    2024-02-26 13:08:16,534 Sleeping for 7s. Reason: PENDING
    2024-02-26 13:08:24,518 Sleeping for 5s. Reason: PENDING
    2024-02-26 13:08:30,471 Sleeping for 7s. Reason: PENDING
    2024-02-26 13:08:38,498 Sleeping for 5s. Reason: PENDING
    2024-02-26 13:08:44,459 Sleeping for 6s. Reason: PENDING
    2024-02-26 13:08:51,412 Sleeping for 9s. Reason: PENDING
    2024-02-26 13:09:01,412 Sleeping for 9s. Reason: PENDING
    2024-02-26 13:09:11,370 Sleeping for 8s. Reason: PENDING
    2024-02-26 13:09:20,337 Sleeping for 8s. Reason: PENDING
    2024-02-26 13:09:29,316 Sleeping for 6s. Reason: RUNNING
    2024-02-26 13:09:39,703 Padding length to 124
    2024-02-26 13:09:41,194 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=73.9 pTM=0.55
    2024-02-26 13:09:42,664 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=73.8 pTM=0.549 tol=3.08
    2024-02-26 13:09:44,110 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=73.6 pTM=0.549 tol=1.59
    2024-02-26 13:09:45,593 alphafold2_ptm_model_1_seed_000 recycle=3 pLDDT=74.4 pTM=0.555 tol=1.67
    2024-02-26 13:09:45,593 alphafold2_ptm_model_1_seed_000 took 5.9s (3 recycles)
    2024-02-26 13:09:47,073 alphafold2_ptm_model_2_seed_000 recycle=0 pLDDT=76.7 pTM=0.565
    2024-02-26 13:09:48,523 alphafold2_ptm_model_2_seed_000 recycle=1 pLDDT=77.1 pTM=0.57 tol=0.571
    2024-02-26 13:09:49,977 alphafold2_ptm_model_2_seed_000 recycle=2 pLDDT=76.7 pTM=0.569 tol=0.958
    2024-02-26 13:09:51,421 alphafold2_ptm_model_2_seed_000 recycle=3 pLDDT=76.9 pTM=0.572 tol=0.881
    2024-02-26 13:09:51,421 alphafold2_ptm_model_2_seed_000 took 5.8s (3 recycles)
    2024-02-26 13:09:52,877 alphafold2_ptm_model_3_seed_000 recycle=0 pLDDT=75.6 pTM=0.542
    2024-02-26 13:09:54,315 alphafold2_ptm_model_3_seed_000 recycle=1 pLDDT=75.9 pTM=0.548 tol=1.52
    2024-02-26 13:09:55,763 alphafold2_ptm_model_3_seed_000 recycle=2 pLDDT=75.9 pTM=0.552 tol=1.69
    2024-02-26 13:09:57,218 alphafold2_ptm_model_3_seed_000 recycle=3 pLDDT=75.8 pTM=0.555 tol=0.883
    2024-02-26 13:09:57,219 alphafold2_ptm_model_3_seed_000 took 5.8s (3 recycles)
    2024-02-26 13:09:58,705 alphafold2_ptm_model_4_seed_000 recycle=0 pLDDT=73.9 pTM=0.56
    2024-02-26 13:10:00,177 alphafold2_ptm_model_4_seed_000 recycle=1 pLDDT=75.1 pTM=0.57 tol=2.2
    2024-02-26 13:10:01,620 alphafold2_ptm_model_4_seed_000 recycle=2 pLDDT=75.4 pTM=0.571 tol=1.78
    2024-02-26 13:10:03,076 alphafold2_ptm_model_4_seed_000 recycle=3 pLDDT=75.7 pTM=0.575 tol=2.04
    2024-02-26 13:10:03,077 alphafold2_ptm_model_4_seed_000 took 5.8s (3 recycles)
    2024-02-26 13:10:04,572 alphafold2_ptm_model_5_seed_000 recycle=0 pLDDT=75.2 pTM=0.573
    2024-02-26 13:10:06,026 alphafold2_ptm_model_5_seed_000 recycle=1 pLDDT=76.2 pTM=0.585 tol=2.12
    2024-02-26 13:10:07,498 alphafold2_ptm_model_5_seed_000 recycle=2 pLDDT=76.2 pTM=0.587 tol=1.44
    2024-02-26 13:10:08,958 alphafold2_ptm_model_5_seed_000 recycle=3 pLDDT=76.6 pTM=0.589 tol=1.21
    2024-02-26 13:10:08,959 alphafold2_ptm_model_5_seed_000 took 5.9s (3 recycles)
    2024-02-26 13:10:08,986 reranking models by 'plddt' metric
    2024-02-26 13:10:08,987 rank_001_alphafold2_ptm_model_2_seed_000 pLDDT=76.9 pTM=0.572
    2024-02-26 13:10:08,987 rank_002_alphafold2_ptm_model_5_seed_000 pLDDT=76.6 pTM=0.589
    2024-02-26 13:10:08,987 rank_003_alphafold2_ptm_model_3_seed_000 pLDDT=75.8 pTM=0.555
    2024-02-26 13:10:08,987 rank_004_alphafold2_ptm_model_4_seed_000 pLDDT=75.7 pTM=0.575
    2024-02-26 13:10:08,987 rank_005_alphafold2_ptm_model_1_seed_000 pLDDT=74.4 pTM=0.555
    2024-02-26 13:10:10,274 Query 3/29: test3 (length 162)
    2024-02-26 13:10:11,241 Sleeping for 8s. Reason: PENDING
    2024-02-26 13:10:20,230 Sleeping for 10s. Reason: PENDING
    2024-02-26 13:10:31,195 Sleeping for 5s. Reason: RUNNING
    2024-02-26 13:10:37,194 Sleeping for 6s. Reason: RUNNING
    2024-02-26 13:10:44,153 Sleeping for 9s. Reason: RUNNING
    2024-02-26 13:10:54,142 Sleeping for 10s. Reason: RUNNING
    2024-02-26 13:11:05,109 Sleeping for 8s. Reason: RUNNING
    2024-02-26 13:11:14,082 Sleeping for 6s. Reason: RUNNING
    2024-02-26 13:11:21,030 Sleeping for 8s. Reason: RUNNING
    2024-02-26 13:11:30,005 Sleeping for 9s. Reason: RUNNING
    2024-02-26 13:11:39,984 Sleeping for 7s. Reason: RUNNING
    2024-02-26 13:11:47,941 Sleeping for 10s. Reason: RUNNING
    2024-02-26 13:11:58,903 Sleeping for 9s. Reason: RUNNING
    2024-02-26 13:12:08,881 Sleeping for 5s. Reason: RUNNING
    2024-02-26 13:12:14,891 Sleeping for 9s. Reason: RUNNING
    2024-02-26 13:12:32,470 Padding length to 172
    2024-02-26 13:13:00,100 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=62.9 pTM=0.433
    2024-02-26 13:13:02,186 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=63.4 pTM=0.433 tol=8.27
    2024-02-26 13:13:04,282 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=64.1 pTM=0.431 tol=8.02
    2024-02-26 13:13:06,403 alphafold2_ptm_model_1_seed_000 recycle=3 pLDDT=63.8 pTM=0.427 tol=8.51
    2024-02-26 13:13:06,404 alphafold2_ptm_model_1_seed_000 took 33.9s (3 recycles)
    2024-02-26 13:13:08,535 alphafold2_ptm_model_2_seed_000 recycle=0 pLDDT=60.2 pTM=0.417
    2024-02-26 13:13:10,637 alphafold2_ptm_model_2_seed_000 recycle=1 pLDDT=61 pTM=0.423 tol=6.09
    2024-02-26 13:13:12,742 alphafold2_ptm_model_2_seed_000 recycle=2 pLDDT=61.4 pTM=0.428 tol=3.33
    2024-02-26 13:13:14,846 alphafold2_ptm_model_2_seed_000 recycle=3 pLDDT=61.2 pTM=0.425 tol=1.8
    2024-02-26 13:13:14,846 alphafold2_ptm_model_2_seed_000 took 8.4s (3 recycles)
    2024-02-26 13:13:16,979 alphafold2_ptm_model_3_seed_000 recycle=0 pLDDT=62 pTM=0.425
    2024-02-26 13:13:19,099 alphafold2_ptm_model_3_seed_000 recycle=1 pLDDT=62.3 pTM=0.43 tol=7.21
    2024-02-26 13:13:21,197 alphafold2_ptm_model_3_seed_000 recycle=2 pLDDT=61.9 pTM=0.426 tol=4.32
    2024-02-26 13:13:23,303 alphafold2_ptm_model_3_seed_000 recycle=3 pLDDT=62.1 pTM=0.427 tol=5.17
    2024-02-26 13:13:23,304 alphafold2_ptm_model_3_seed_000 took 8.4s (3 recycles)
    2024-02-26 13:13:25,461 alphafold2_ptm_model_4_seed_000 recycle=0 pLDDT=60.5 pTM=0.418
    2024-02-26 13:13:27,552 alphafold2_ptm_model_4_seed_000 recycle=1 pLDDT=60.8 pTM=0.417 tol=9.52
    2024-02-26 13:13:29,658 alphafold2_ptm_model_4_seed_000 recycle=2 pLDDT=60.3 pTM=0.41 tol=9.23
    2024-02-26 13:13:31,749 alphafold2_ptm_model_4_seed_000 recycle=3 pLDDT=60.5 pTM=0.411 tol=6.08
    2024-02-26 13:13:31,750 alphafold2_ptm_model_4_seed_000 took 8.4s (3 recycles)
    2024-02-26 13:13:33,905 alphafold2_ptm_model_5_seed_000 recycle=0 pLDDT=59.9 pTM=0.416
    2024-02-26 13:13:36,038 alphafold2_ptm_model_5_seed_000 recycle=1 pLDDT=60.1 pTM=0.415 tol=9.96
    2024-02-26 13:13:38,154 alphafold2_ptm_model_5_seed_000 recycle=2 pLDDT=59.7 pTM=0.409 tol=3.89
    2024-02-26 13:13:40,252 alphafold2_ptm_model_5_seed_000 recycle=3 pLDDT=59.4 pTM=0.415 tol=11.4
    2024-02-26 13:13:40,253 alphafold2_ptm_model_5_seed_000 took 8.5s (3 recycles)
    2024-02-26 13:13:40,294 reranking models by 'plddt' metric
    2024-02-26 13:13:40,294 rank_001_alphafold2_ptm_model_1_seed_000 pLDDT=63.8 pTM=0.427
    2024-02-26 13:13:40,294 rank_002_alphafold2_ptm_model_3_seed_000 pLDDT=62.1 pTM=0.427
    2024-02-26 13:13:40,294 rank_003_alphafold2_ptm_model_2_seed_000 pLDDT=61.2 pTM=0.425
    2024-02-26 13:13:40,295 rank_004_alphafold2_ptm_model_4_seed_000 pLDDT=60.5 pTM=0.411
    2024-02-26 13:13:40,295 rank_005_alphafold2_ptm_model_5_seed_000 pLDDT=59.4 pTM=0.415
  • Como ves el propio script espera cuando el servidor remoto está ocupado.
  • Para cada secuencia problema obtienes figuras como éstas:




Hasta pronto,

Bruno

PD Cuando acabes de instalar deberías tener algo similar en tu fichero $HOME/.bashrc:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH="$HOME/colabfold/colabfold-conda/bin:$PATH"

Cuando no vayas a usar colabfold comenta estas líneas para usar perl y python del sistema

PD2 Me comentan colegas de ULiverpool que haciendo 800-900 MSAs al día en https://api.colabfold.com no han tenido problemas

23 de febrero de 2024

lectura eficiente de ficheros comprimidos en el terminal

Al analizar datos genómicos a menudo tenemos que manejar ficheros de texto enormes, del orden de decenas de GB, que suelen estar comprimidos para no llenar el disco.

Imagen de https://the-chestnut.com

Por ejemplo:

-rw-r--r--  1 contrera contrera 2.2G Feb 23 12:43  mergedpairs.tsv

comprimido con gzip ocupa casi la décima parte:

-rw-r--r--  1 contrera contrera 255M Feb 23 12:43  mergedpairs.tsv.gz

A veces es necesario operar sobre este tipo de ficheros desde scripts o el terminal sin descomprimirlos en el disco. Por ejemplo, podemos contar el número de líneas con zcat:

time zcat mergedpairs.tsv.gz | wc -l
10205212

real    0m10.793s
user    0m10.689s
sys    0m1.726s

Otra opción un poco más rápida, qué descubrí en dos foros (1,2), es unpigz, que hay que instalar primero en Ubuntu:

sudo apt install pigz

time unpigz -dc mergedpairs.tsv.gz | wc -l
10205212

real    0m6.503s
user    0m10.742s
sys    0m3.391s
 

Otra opción que he encontrado es https://github.com/mxmlnkn/rapidgzip, que además tiene una interfaz python que permite indexar ficheros .gz, pero se puede usar en la línea de comando también:

python3 -m pip install rapidgzip

time rapidgzip -d -c -P 0 mergedpairs.tsv.gz | wc -l
10205212

real    0m2.833s
user    0m10.895s
sys    0m5.020s


Finalmente, he probado con https://github.com/facebook/zstd, que usa un formato de compresión menos estándar (.zst):

sudo apt install zstd
pzstd mergedpairs.tsv

time pzstd -d -c mergedpairs.tsv.zst | wc -l
mergedpairs.tsv.zst : 2273983394 bytes                                         
10205212

real    0m1.600s
user    0m5.973s
sys    0m3.348s

 

 

Hasta pronto,

Bruno

7 de febrero de 2024

Browsing barley pangenes

Hi,

late last year we published a paper describing GET_PANGENES, a protocol to all pangenes, which are clusters of gene models/alleles found in genomic assemblies in a similar location. You can read all about it at https://doi.org/10.1186/s13059-023-03071-z . Using this approach you can produce figures like this, where you can see the pangene of interest in green:

Genomic context of barly pangene cluster HORVU.MOREX.r3.3HG0311160 (green arrows), which corresponds to barley locus HvOS2. Figure from https://doi.org/10.1186/s13059-023-03071-z
 

As we do research on barley breeding and adaptation, we thought it would be useful for us and other out there to have way of inspecting barley pangenes, for instance to check whether a gene of interest is conserved or polymorphic across the barleys sampled by in the pangenome (n=20) put together by Jayakodi et al.

This exactly what you can do, at the protein sequence level, at https://eead-csic-compbio.github.io/barley_pangenes , where you can scroll pangenes along chromosomes, with MorexV3 positions; genes not found in MorexV3 lack a position therefore and are shown with a hash (#):


You will notice that pangenes with occupancy > 1, ie containing gene models found it at least two barleys, can be clicked to display a multiple protein alignment with help from the NCBI msaviewer:


There you can easily zoom in to regions of interest and print or export the alignment in FASTA, PDF or SVG format (high quality).

Hope this can be useful to the barley genomics community,

Bruno

 

30 de enero de 2024

HOWTO run primers4clades with Docker container

One of the curses of developing bioinformatics software is the long-term sustainability of the code. Inevitably some tools reach dead ends and are abandoned. This can happen for several reason, notably lack of funding and hardware failures. 

One such example is our primers4clades, which used to live at http://floresta.eead.csic.es/primers4clades and http://maya.ccg.unam.mx/primers4clades, but is not available as a Web server anymore. However, Pablo Vinuesa and me thought it would be great to still provide it as a legacy tool and eventually packaged it as a Docker image, which is freely available at https://hub.docker.com/r/csicunam/primers4clades . This is easy to run, as explained below, and avoids installing the pipeline locally, which is challenging due to a number of dependencies which are no longer easy to find (read more at https://github.com/eead-csic-compbio/primers4clades).

1) Docker installation

You can find instructions for installing the Docker (Server) engine on Linux at https://docs.docker.com/engine/install/#server .

On Windows, our recommended procedure is to i) install the Windows Subsystem for Linux (WSL) and then ii) the Docker engine:

Please let us know if you manage to run the examples below with Docker Desktop for MacOS.


2) Running primers4clades on Docker

The following command-line examples show how to run primers4clades using Docker with an input FASTA file named 'bla1.fna' containing nucleotide CDS sequences. You can check the format required here:

# 2024-01-20. Running primers4clades using a Docker container

#>>> STEP1) produce clusters and display NJ-tree
$ docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS -it csicunam/primers4clades Fasta2clusters.pl -i bla1.fna 


# /primers4clades/CVS_code/marfil/Fasta2clusters.pl -i bla1.fna -c universal -d 2.21 -f 0 -m 0 -s 0 -M  -b  -e  -D 0 -S 0

# read_FASTA_sequence : skipped sequence identical to 013:
[Stenotrophomonas_maltophilia]_TEXM3A_NACIP1_4


# read_FASTA_sequence : skipped sequence identical to 013:
[Stenotrophomonas_maltophilia]_TEXM3D_NACIP1_4

# number of sequences read = 14
# number of recognised taxa = 9
# aligning translated sequences...
# computing distance matrix...
Using SPRNG -- Scalable Parallel Random Number Generator
# run_PUZZLE_DIST alpha = 0.92
# alignment stats: length = 321 %gaps = 9.81 %constant = 26.5
# distance matrix = bla1_aln.phy.dist

# mean distance = 0.34813
# max distance = 1.91051 ( 001 <=> 006 )
# min distance = 0.00000 ( 003 <=> 005 )

# NJ tree with cluster labels :


    +-004__0 [Stenotrophomonas_maltophilia_D457]_D457_...
  +-3
  ! +--007__0 [Stenotrophomonas_maltophilia_JV3]_JV3_...
  !
  !   +-009__0 [Stenotrophomonas_maltophilia_R551_3]_R551_3_...
  ! +-4
  ! ! ! +-011__0 [Stenotrophomonas_maltophilia]_TEXM2S_MKIMI4_21_...
  ! ! +-2
  ! !   +-015__0 [Stenotrophomonas_maltophilia]_TEXM3S_MKIMI4_5_...
  5-6
  ! !   +006__0 [Stenotrophomonas_maltophilia]_ESTM1A_MKCAZ1_5_...
  ! ! +-1
  ! ! ! +010__0 [Stenotrophomonas_maltophilia]_TEXM2D_MKIMI4_3_...
  ! +-7
  !   ! +013__0 [Stenotrophomonas_maltophilia]_TEXM3D_MKCAZ8_7_...
  !   ! !
  !   +-8  +003__0 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555_...
  !     !  !
  !     +-10  +005__0 [Stenotrophomonas_maltophilia_EPM1]_EPM1_...
  !        !  !
  !        +-11    +002__0 [Stenotrophomonas_maltophilia]_13637_...
  !           !  +-9
  !           +-12 +008__0 [Stenotrophomonas_maltophilia_K279a]_K279a_...
  !              !
  !              +016__0 [Stenotrophomonas_sp_]_YAU14D1_LEIMI4_7_...
  !
  +-----------------------------------------------------001__0 [Stenotrophomonas_acidaminiphila]_TEXM2D_MKIMI4_3_...

# labelled tree : bla1_aln_nj.ph
# fully labelled tree : bla1_aln_nj_labelled.ph
# text formatted tree : bla1_aln_nj_txt.graph
# fully labelled text formatted tree : bla1_aln_nj_txt_labelled.graph
# CLUSTER 0 members = 14 : .//bla1_cluster_0.fna .//bla1_cluster_0.faa




#>>> STEP2). Select a specific cluster, e.g. 003__0,016__0

$ docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS -it csicunam/primers4clades Fasta2clusters.pl -i bla1.fna -c 11 -M bla1_aln.phy.dist -b 003__0,016__0 


# /primers4clades/CVS_code/marfil/Fasta2clusters.pl -i bla1.fna -c 11 -d 2.21 -f 0 -m 0 -s 0 -M bla1_aln.phy.dist -b 003__0,016__0 -e  -D 0 -S 0

# read_FASTA_sequence : skipped sequence identical to 014:
[Stenotrophomonas_maltophilia]_TEXM3A_NACIP1_4

# read_FASTA_sequence : skipped sequence identical to 014:[Stenotrophomonas_maltophilia]_TEXM3D_MKCAZ8_7

# number of sequences read = 14
# number of recognised taxa = 9
# number of valid cluster boundaries = 2
# size of user-selected cluster = 5
# distance matrix = bla1_aln.phy.dist


# max distance = 1.91051 ( 001 <=> 006 )
# min distance = 0.00000 ( 003 <=> 005 )

# NJ tree with cluster labels :


    +-004
  +-3
  ! +--007
  !
  !   +-009
  ! +-4
  ! ! ! +-011
  ! ! +-2
  ! !   +-015
  5-6
  ! !   +006
  ! ! +-1
  ! ! ! +010
  ! +-7
  !   ! +013
  !   ! !
  !   +-8  +003__0 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555_...
  !     !  !
  !     +-10  +005__0 [Stenotrophomonas_maltophilia_EPM1]_EPM1_...
  !        !  !
  !        +-11    +002__0 [Stenotrophomonas_maltophilia]_13637_...
  !           !  +-9
  !           +-12 +008__0 [Stenotrophomonas_maltophilia_K279a]_K279a_...
  !              !
  !              +016__0 [Stenotrophomonas_sp_]_YAU14D1_LEIMI4_7_...
  !
  +-----------------------------------------------------001

# labelled tree : ./bla1_aln_nj.ph
# fully labelled tree : ./bla1_aln_nj_labelled.ph
# text formatted tree : ./bla1_aln_nj_txt.graph
# fully labelled text formatted tree : ./bla1_aln_nj_txt_labelled.graph
# CLUSTER 0 members = 5 : .//bla1_cluster_0.fna .//bla1_cluster_0.faa

#>>> STEP3) Search for primers on selected cluster bla1_cluster_0 and evaluate  amplicon with nucleotide substitution models, min quality 50, compute codon frequencies from input (-C)
$ docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS -it csicunam/primers4clades Fasta2primers.pl -n bla1_cluster_0.fna -p bla1_cluster_0.faa -C -P 50


# /primers4clades/CVS_code/marfil/Fasta2primers.pl -n bla1_cluster_0.fna -p bla1_cluster_0.faa -e  -P 50 -M 0 -k 0 -r  -S 0 -u 0 -T 55 -o c -s 0 -B 0 -L 0 -c  -C 1 -f 0 -m  -l 0 -R 0 -D

# /primers4clades/CVS_code/marfil/Fasta2primers.pl : cwd = /home/vinuesa/tmp/test_P4C_docker

# cutting amplicons...
## table redundancy vs amplicons stats:
# Stenotrophomonas_sp_TA57 redundancy = 0.78 amplicons = 1
# input_derived_005a0a18 redundancy = 0.86 amplicons = 1
# Stenotrophomonas_maltophilia redundancy = 0.87 amplicons = 0

# evaluating primers...
# primer file.tab = bla1_cluster_0.fna_primers.list.oligo_eval.tab
# ranking pairs of primers...

# mapfile : bla1_cluster_0.fna_primers.list.oligo_eval.tab.png


## Amplicon 1 codon_usage_table = Stenotrophomonas_sp_TA57 :
CCGGCCATCTTCTTGATaayatgaagyt 5'->3' N 100 294 (aligned residues)
ccggtcacctgctggacaacatgaagct >001 002 [Stenotrophomonas_maltophilia]_13637
ccggtcacctgctggacaacatgaagct >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
ccggtcacctgctggacaacatgaagct >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
ccggtcacctgctggacaacatgaagct >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
ccggtcacctgctggacaacatgaaact >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
....!..!..!..!..!..!.....?!.
CCGGTCACCTGCTGGACaacatgaarct codeh_corr bla1_cluster_0_amp1_N100
ccggtcacctgctggacaacatgaarct relax_corr bla1_cluster_0_amp1_N100
ccggtcacctgctggacaacatgaarct degen_corr bla1_cluster_0_amp1_N100

GGGCAAGTTGGGCATcraacttcttyt 5'->3' C 100 294 (aligned residues)
tggccaactgcgcgtcgaacttcttct >001 002 [Stenotrophomonas_maltophilia]_13637
tggccaactgcgcgtcgaacttcttct >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
tggccaactgcgcgtcgaacttcttct >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
tggccaactgcgcgtcgaacttcttct >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
tggccaactgcgcgtcgaacttcttct >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
!...!.!!..!..!..!........!.
TGGCCAACTGCGCGTcgaacttcttct codeh_corr bla1_cluster_0_amp1_C294
tggccaactgcgcgtcgaacttcttct relax_corr bla1_cluster_0_amp1_C294
tggccaactgcgcgtcgaacttcttct degen_corr bla1_cluster_0_amp1_C294

# primer pair quality = 100%
# expected PCR product length (nt) = 585
# fwd: minTm = 65.6 maxTm = 67.3
# rev: minTm = 66.7 maxTm = 66.7
#--
# phylogenetic amplicon evaluation:
# subs.model = TrNI
# n_of_seqs = 5 n_alignments_sites = 613
# n_of_seqs_with_composition_bias = 0
# %fully resolved quartets = 60 %partially resolved quartets = 0 %non resolved quartets = 40.0
# alpha = 0
# mean aLRT = 0.46 median aLRT = 0.46
# compressed outfile = my_bla1_cluster_0_aln_1__Stenotrophomonas_sp_TA57.tgz
# end_of_amplicon

#>>> STEP 3.2) As in STEP 3, get primers on selected cluster bla1_cluster_0, but evaluate amplicon with amino acid replacement models
$ docker run --rm -v "$PWD:$PWD" -w "$PWD" -u $UID:$GROUPS -it csicunam/primers4clades Fasta2primers.pl \
  -n bla1_cluster_0.fna -p bla1_cluster_0.faa -C -P 50 -M 


# /primers4clades/CVS_code/marfil/Fasta2primers.pl -n bla1_cluster_0.fna -p bla1_cluster_0.faa -e  -P 50 -M 1 -k 0 -r  -S 0 -u 0 -T 55 -o c -s 0 -B 0 -L 0 -c  -C 1 -f 0 -m  -l 0 -R 0 -D

# /primers4clades/CVS_code/marfil/Fasta2primers.pl : cwd = /home/vinuesa/tmp/test_P4C_docker

# cutting amplicons...
## table redundancy vs amplicons stats:
# Stenotrophomonas_sp_TA57 redundancy = 0.78 amplicons = 1
# input_derived_182c455f redundancy = 0.86 amplicons = 1
# Stenotrophomonas_maltophilia redundancy = 0.87 amplicons = 0

# evaluating primers...
# primer file.tab = bla1_cluster_0.fna_primers.list.oligo_eval.tab
# ranking pairs of primers...

# mapfile : bla1_cluster_0.fna_primers.list.oligo_eval.tab.png


## Amplicon 1 codon_usage_table = Stenotrophomonas_sp_TA57 :
CCGGCCATCTTCTTGATaayatgaagyt 5'->3' N 100 294 (aligned residues)
ccggtcacctgctggacaacatgaagct >001 002 [Stenotrophomonas_maltophilia]_13637
ccggtcacctgctggacaacatgaagct >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
ccggtcacctgctggacaacatgaagct >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
ccggtcacctgctggacaacatgaagct >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
ccggtcacctgctggacaacatgaaact >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
....!..!..!..!..!..!.....?!.
CCGGTCACCTGCTGGACaacatgaarct codeh_corr bla1_cluster_0_amp1_N100
ccggtcacctgctggacaacatgaarct relax_corr bla1_cluster_0_amp1_N100
ccggtcacctgctggacaacatgaarct degen_corr bla1_cluster_0_amp1_N100

GGGCAAGTTGGGCATcraacttcttyt 5'->3' C 100 294 (aligned residues)
tggccaactgcgcgtcgaacttcttct >001 002 [Stenotrophomonas_maltophilia]_13637
tggccaactgcgcgtcgaacttcttct >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
tggccaactgcgcgtcgaacttcttct >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
tggccaactgcgcgtcgaacttcttct >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
tggccaactgcgcgtcgaacttcttct >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
!...!.!!..!..!..!........!.
TGGCCAACTGCGCGTcgaacttcttct codeh_corr bla1_cluster_0_amp1_C294
tggccaactgcgcgtcgaacttcttct relax_corr bla1_cluster_0_amp1_C294
tggccaactgcgcgtcgaacttcttct degen_corr bla1_cluster_0_amp1_C294

# primer pair quality = 100%
# expected PCR product length (nt) = 585
# fwd: minTm = 65.6 maxTm = 67.3
# rev: minTm = 66.7 maxTm = 66.7
#--

ProtTest launched!

Please, watch out the content of file "my_bla1_cluster_0_aln_1__Stenotrophomonas_sp_TA57_aln.phy.prottest" for information on the
progress of the analysis and for possible error messages
Using SPRNG -- Scalable Parallel Random Number Generator
# phylogenetic amplicon evaluation:
# subs.model = WAG
# n_of_seqs = 5 n_alignments_sites = 205
# n_of_seqs_with_composition_bias = 0
# %fully resolved quartets = 60 %partially resolved quartets = 0 %non resolved quartets = 40.0
# alpha = 0
# mean aLRT = 0.46 median aLRT = 0.46
# compressed outfile = my_bla1_cluster_0_aln_1__Stenotrophomonas_sp_TA57.tgz
# end_of_amplicon



## Amplicon 2 codon_usage_table = input_derived_182c455f :
GCCAGCTGGCTGcarccvatggc 5'->3' N 55 243 (aligned residues)
gcgtcctggctgcagccgatggc >001 002 [Stenotrophomonas_maltophilia]_13637
gcgtcctggctgcagccgatggc >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
gcgtcctggctgcagccgatggc >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
gcgtcctggctgcagccgatggc >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
gcgtcctggctgcagccgatggc >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
..!!!.........!..!.....
GCGTCCTGGCTGcagccgatggc codeh_corr bla1_cluster_0_amp1_N55
gcgtcctggctgcagccgatggc relax_corr bla1_cluster_0_amp1_N55
gcgtcctggctgcagccgatggc degen_corr bla1_cluster_0_amp1_N55

GCGAAGCTGCGCTtrtartcytcga 5'->3' C 55 243 (aligned residues)
gcgaagctgcgcttgtagtcctcga >001 002 [Stenotrophomonas_maltophilia]_13637
gcgaagctgcgcttgtagtcctcga >002 003 [Stenotrophomonas_maltophilia_Ab55555]_Ab55555
gcaaagctgcgcttgtagtcctcga >003 005 [Stenotrophomonas_maltophilia_EPM1]_EPM1
gcgaagctgcgcttgtagtcctcga >004 008 [Stenotrophomonas_maltophilia_K279a]_K279a
gcgaagctgcgcttgtagtcctcga >005 016 [Stenotrophomonas_sp.]_YAU14D1_LEIMI4_7
..?...........!..!..!....
GCGAAGCTGCGCTtgtagtcctcga codeh_corr bla1_cluster_0_amp1_C243
gcraagctgcgcttgtagtcctcga relax_corr bla1_cluster_0_amp1_C243
gcraagctgcgcttgtagtcctcga degen_corr bla1_cluster_0_amp1_C243

# primer pair quality = 100%
# expected PCR product length (nt) = 567
# fwd: minTm = 74.4 maxTm = 74.4
# rev: minTm = 67.5 maxTm = 67.5
#--

ProtTest launched!

Please, watch out the content of file "my_bla1_cluster_0_aln_1__input_derived_182c455f_aln.phy.prottest" for information on the
progress of the analysis and for possible error messages
Using SPRNG -- Scalable Parallel Random Number Generator
# phylogenetic amplicon evaluation:
# subs.model = WAG
# n_of_seqs = 5 n_alignments_sites = 198
# n_of_seqs_with_composition_bias = 0
# %fully resolved quartets = 0 %partially resolved quartets = 0 %non resolved quartets = 100.0
# alpha = 0
# mean aLRT = 0.00 median aLRT = 0.00
# compressed outfile = my_bla1_cluster_0_aln_1__input_derived_182c455f.tgz
# end_of_amplicon




#################################################################
## overall quality stats:
# good/bad pairs ratio = 2/2
# individual quality checks stats:


## abbreviations:
# quality     = quality estimation [best = 100, worst = 0] %
# crosspot    = potential of cross-hybridization [0-1]
# relaxdeg    = relaxed (3' extended segment) degeneracy
# fulldeg     = full primer degeneracy
# minTm       = minimum Tm for the pool of relaxed primers
# maxTm       = maximum Tm for the pool of relaxed primers
# hpinpot     = potential of primer hairpin [0-1]
# selfpot     = potential of primer self-priming [0-1]
# aLRT        = approximate likelihood-ratio test [0-1],
#               overall confidence on trees built using this amplicon

#>>> STEP4) Analyze output files, including TAR.GZ bundles and PNG figures
$ ls

bla1_aln.faa my_bla1_cluster_0_aln_1__input_derived_005a0a18.tgz
bla1_aln_nj_labelled.ph my_bla1_cluster_0_aln_1__input_derived_182c455f_aln.aa_amp
bla1_aln_nj.ph my_bla1_cluster_0_aln_1__input_derived_182c455f.tgz
bla1_aln_nj_txt.graph my_bla1_cluster_0_aln_1__Stenotrophomonas_sp_TA57_aln.aa_amp
bla1_aln_nj_txt_labelled.graph my_bla1_cluster_0_aln_1__Stenotrophomonas_sp_TA57.tgz
bla1_aln.phy my_bla1_cluster_0_aln.faa
bla1_aln.phy.dist my_bla1_cluster_0_aln.fna
bla1_cluster_0.faa my_bla1_cluster_0_aln__input_derived_005a0a18.codon.use__codehops.out
bla1_cluster_0.fna my_bla1_cluster_0_aln__input_derived_182c455f.codon.use__codehops.out
bla1_cluster_0.fna_primers.list.oligo_eval.tab my_bla1_cluster_0_aln__Stenotrophomonas_maltophilia.codon.use__codehops.out
bla1_cluster_0.fna_primers.list.oligo_eval.tab.png my_bla1_cluster_0_aln__Stenotrophomonas_sp_TA57.codon.use__codehops.out
bla1.faa                                            my_bla1_cluster_0.faa
bla1.fna                                            my_bla1_cluster_0.f