2 May 2019

#monogram2019 (2)

Below and Above Ground Processes
Silvio Salvi, U Bologna
He talks about root ideotypes in cereals. Root traits include biomass, lodging, soil penetration, architecture, hairs and exudates, including mucilages. He shows pictures of root angles in barley cultivars from WHEALBI and now wants to confirm the phenotypes using a Morex TILLING collection (http://www.distagenomics.unibo.it/TILLMore) and mapping-by-sequencing approaches.
Vera Hecht, Julich
She talks about root traits in barley as a function of sowing density. She shows field experiments (5 replicates, Barke, Scarlett), which she published at https://www.frontiersin.org/articles/10.3389/fpls.2016.00944/full
She observed density effects as early as 4 weeks after sowing. That should be considered when planning lab experiments.
Tom Bennet, U Leeds
Works on Arabidopsis thaliana roots, finding that root system size at the end of winter determines yield.

Phenotyping
Richard Whalley, Rothamsted
Invited talk on “Soil structure, root growth and yield in wheat”. Like V Hecht, they also take 1 m cores. His message: we can’t predict root growth in the field.
Tony Pridmore, U Nottingham
Talks about UK and international plant phenotyping initiatives from the perspective of https://www.phenomuk.net. Phenotyping requires bringing together people with different backgrounds, and it is growing rapidly in the literature. He mentions the EU program https://emphasis.plant-phenotyping.eu, which has regular calls and has driven the development of https://www.miappe.org, and is about to be followed by https://eppn2020.plant-phenotyping.eu
Simon Orford, Germplasm Resource Unit, JIC
Simon presents the GRU (John Innes Centre Germplasm Resources Unit, which manages https://www.seedstor.ac.uk) and its role producing breeder tool kits for Designing Future Wheat (DFW, http://wisplandracepillar.jic.ac.uk/toolkit.htm). In DFW they have regular meetings with breeders from several companies and visit their plots in July. It looks like a very fruitful, long-term collaboration.
Aleksander Ligeza, NIAB
Talks about designing root architecture to improve Nitrogen Use Efficiency. He shows nice figures and plots showing that N content in the soil (measured in their rhizo-tubes) affects root architecture.


Abiotic and Biotic Stress
Gustavo Slafer, Centre for Research in Agrotechnology, Lleida, Spain
After an introduction on how the literature describes plant physiology in controlled conditions, as opposed to field experiments, he talks about heat x Nitrogen penalties on yield. His team has measured that penalty interaction in several experiments with farmers with barley, maize and wheat. In all cases they observe that heat causes more damage as more N is added as fertilizer. This happens with heat being only a few more degrees of max T for a few days.
Paul Nicholson, JIC
He talks about the wheat diseases Fusarium head blight (FHB) and blast (which appeared in 1985 in Brazil). The main problem with FHB is that it accumulates toxins in the grain. He talks about alleles of semi-dwarf wheats which are linked to a glucosidase that, when silenced on a single subgenome, increases FHB resistance. More generally, it seems Fusarium takes advantage of several phytohormones and pathways which we can attenuate to increase resistance.
Blast does not affect wheat because it has Rwt3 and 4 in chr 1D (Aegilops tauschii). One of them is absent in the Brazilian wheat cv. Rwt4 is a full length NBS-ARC LRR.
Jonathan Cope, JHI
He talks about “Introgressing resilience and resource use efficiency traits from Scots Bere to elite barley lines (REBEL)”. Bere are old Scandinavian 6-row landraces thought to have been brought by Danish invaders a few centuries ago into Scotland. They perform well in Mn-poor soils and are less affected by diseases than elite lines.
Samer Amer, U Reading
His research is about the “Genetic underpinnings of wheat responses to drought”. He shows 18 years of data with average rainfall in the UK over 1000 l/m2. However, he insists that lack of rain at critical moments can have large undesired effects. He measured water availability in the field with tubes in the soil in “Kielder plots”.
Amma Simon, Rothamsted
Talks about her work on elucidating the mechanisms of aphid (R. padi) resistance in ancestral wheats, particularly in T. monococcum. Their field trials last year, which was very warm, were not successful as it seems drought culls the aphid populations.


Reproduction and Grain Development
Alison Bentley, NIAB
She talks about their work on wheat flowering time variation in the NIAB Elite MAGIC population, with 8 founders (https://www.niab.com/pages/id/402/NIAB_MAGIC_population_resources), which produced 643 lines. This is winter wheat (https://www.ncbi.nlm.nih.gov/pubmed/25237112). They have essentially measured days to particular GS developmental stages, and they validated the observations with two extra sites in Germany (long season) and Croatia (short). It is a heterogeneous dataset, as it was scored in several years by several people, so they had to tailor their QTL detection methods, using both SNPs and Identity By Descent. Ppd1 is the largest effect observed, which serves as a positive control.
She also talks about a project with French partners on predicting the ideal flowering ideotypes based on GS30/GS55 for N and S France. They are using genomic prediction for this, with panels of over 400 cultivars from France. Out of 7369 MAGIC markers, they have short-listed a set of 19 markers (GIEC) for flowering time for the predictions. By using these 19 they achieve accuracies near 0.6 when predicting flowering time with cross-validation. Warning: differing allele frequencies between populations must be accounted for when predicting.
Stuart Desjardins, U Leicester
He talks about “Releasing natural variation in bread wheat by modulating meiotic crossovers” and the recombination-cold regions which take up over 60% of the chromosomes in wheat and barley. They want to engineer that, so that cross-overs (CO) can occur elsewhere. He mentions that diploids cannot be compared to hexaploid wheat in this respect. CO are more frequent near telomeres, presumably because that’s where chromatids start pairing in meiosis.
Adam Gauley, JIC
Talks about the flowering pathways in wheat by comparing growth in the UK and in Spain, which has a shorter season. They are looking at the usual suspects FT and Ppd1.
Miaoyuan Hua, U Nottingham
He talks about “Gene Networks controlling anther tapetum development in barley during early anther development”. This involves a single-copy TF in barley which has several copies in other species, presumably due to having a few extra protein domains. They confirm this by rescuing mutants in Arabidopsis thaliana.
Jemima Brinton, JIC
She talks about “A major QTL for grain weight in wheat is associated with increased grain length and cell size”. They see growing increases in weight as knockouts affect more subgenomes. The increase is smaller without irrigation in dry seasons. This has been published at https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.14624 and https://bmcplantbiol.biomedcentral.com/articles/10.1186/s12870-018-1241-5.

Genomics and Technologies for Crop Improvement
Wilma van Esse, Wageningen
She talks about transcription factor networks that control yield in barley. She mentions Vrs genes (http://www.plantphysiol.org/content/174/4/2397) and explains that 6-row barleys have about 60 grains per spike, compared to 20 for 2-rows. How does this correlate with yield? To investigate that they look into the molecular and regulatory interactions between Vrs1 and Vrs5: is 5 regulating 1 by binding to its promoter? They have seen putative sites just by doing sequence analysis, and this happens in Arabidopsis thaliana (TB1, HOX1). Based on their Y2H data they hypothesize that Vrs1 forms heterodimers with other HOX TFs.
Gustavo asks how this would change in the field, where plants cannot tiller freely.


Laura Gardiner, IBM Research UK
She introduces us to “Using AI combined with genomics to improve crops: techniques and applications”, covering AI, machine learning and deep learning. She walks through an example where they classify circadian genes in wheat by looking at their expression values in time series with MetaCycle (https://www.ncbi.nlm.nih.gov/pubmed/27378304). First they do circadian vs non-circadian, then they increase the classes to morning, day, evening, night and non-circadian. They support proof-of-concept translatable projects to help UK industry; get in touch.
Riccardo Fusi, U Nottingham
Talks about “Deepgenes: functional characterization of genes enhancing soil exploration in maize”. He uses rice and brachy as model organisms.
Kumar Gaurav, JIC
He presents “A comprehensive and unbiased peek at the genetic diversity of wild wheat”, a quest for finding resistance genes which has been published at http://dx.doi.org/10.1038/s41587-018-0007-9. This is a reference-free GWAS analysis using presence/absence of k-mers as biallelic SNPs. They do take into account population structure. They are now upscaling their method with a bias towards resistance genes and mapping the k-mers to a reference genome. The bait sequences are available at https://github.com/steuernb/AgRenSeq
Michael Hammond-Kosack, Rothamsted
Talks about the wheat promoteome, as part of a DEFRA-funded network (http://www.wgin.org.uk). They want to identify SNPs, indels and CREs linked to phenotypes among wheat cultivars. They use RNA-microarrays as they have stronger binding to DNA than DNA chips. They have 1.6kb promoters for 95 wheat cultivars and related species, from diploid to hexaploid. After mapping probes they reconstruct promoter haplotypes and find fewer than expected. After mapping they confidently discover indels (4-1kb) in some lines. They also annotate TF binding sites by looking for exact matches from a library.
Debbie Harding, BBSRC
She talks about BBSRC Strategy and Funding Opportunities.
Sophie Harrington, JIC
She delivers a talk as the winner of the student contest. She talks about the role of NAC TFs in wheat senescence, a process in which the leaves’ nutrients are consumed by the spikes to fill the grains. This work follows early findings by C Uauy (https://www.ncbi.nlm.nih.gov/pubmed/17124321). She has found a syntenic cluster on chr2, conserved in barley, with several dimeric NAC3 factors, some of which are highly expressed in the senescent tissues. Double Kronos EMS mutants of A & B subgenomes delay senescence. She uses https://bioconductor.org/packages/release/bioc/html/GENIE3.html to infer regulatory networks. Her preprints are https://www.biorxiv.org/content/10.1101/456749v1.abstract and https://www.biorxiv.org/content/10.1101/573881v1.abstract
Ricardo Ramírez-González, JIC
Summarizes his career from a stock analyst in México to wheat postdoctoral researcher at the JIC. He then explains how his http://wheat-expression.com site manages triad expression values (triangles) from the 3 subgenomes. He shows that most balanced homeologous genes are expressed in most tissues (stable), with tissue-specific genes being more unbalanced (dynamic).




Quality and Nutrition
Mark Loosely, JHI
Talks about “Improved malting quality in winter barley”. Starts by explaining the 3 steps of malting: degradation of cell walls, protein modification, activation of enzymes that convert starch into sugars. They carry out GWAS with both spring and winter barleys treated with fungicides. They find association peaks for alleles which are already fixed in the elite barleys, but also some with still low frequencies. They find 3 promising QTLs, apparently with gene clusters of similar functions, and later created a mapping population to fine-map one QTL in 3H. Finally, they are doing RNAseq of a few NIL recombinant lines to double-check the expression of genes in the QTL.
Katherine Steele, Bangor U
She talks about improving phosphorus efficiency in barley. Currently P is mined to produce fertilizers, but that supply might not last more than 100 years, and this is the P source for plants. P in the seed, mostly in the form of phytic acid (PA) in the aleurone layer, is the only source for germination, for the first 4 weeks. PA has high affinity for dietary minerals and thus it would be desirable to reduce its content in the grain without affecting germination and seedling vigour. Thus, they are looking at barley mutants with low tissue-specific PA content (lpa).
Anna Gordon, NIAB
She talks about ergot alkaloids, which contaminate grain and induce convulsions, hallucinations, vasoconstriction and sickness. These are produced by fungi and get into the grain during filling in barley, wheat and rye. There are EC limits for these compounds in cereals.


Rice and other Grasses
Adam Price, U Aberdeen
He talks about their long-term work on QTL analysis with rice (see https://www.sciencedirect.com/science/article/pii/S1360138506000793 or https://www.nature.com/articles/ncomms1467 for an example with GxE, with association peaks appearing in different loci depending on the environments and populations). Recently they have been working with the Bengal and Assam aus panel (BAAP), which has been a donor of many relevant alleles to elite cultivars.
Luke Dunning, U Sheffield
Talks about lateral gene transfer (LGT) throughout the grass family, which he claims is recurrent and adds functional genes, but might also have a cost, as TEs are acquired too. He has published this recently at https://onlinelibrary.wiley.com/doi/full/10.1111/evo.13250 and also at https://www.pnas.org/content/116/10/4416.abstract and starts with an example of a gene transferred with help from an aphid. He then follows a few LGT genes in Alloteropsis among geographically close species. When they look at Au and Africa genomes together they see a clear accumulation of LGT genes over time, which are not syntenic (perhaps non-homologous recombination and transposition are involved). Overall they have evidence of this happening in many but not all species and crops, such as barley. They report 5 in Brachypodium distachyon. In other species they have seen LGT fragments harboring up to 10 genes. He has not looked at codon usage biases.
Mario Caccamo, NIAB
Talks about an analysis of a large collection of Vietnamese native rice lines, which revealed novel genomic variants. This project also entailed creating the relevant bioinformatics pipelines and training local researchers. He explains in some detail their variant calling using FreeBayes and LD thinning to obtain just under a million variants.
Mark Quinton-Tulloch, U Bangor
He talks about his genome-wide identification of resistance genes and their variation in indica rice. He uses KASP-driven genomic selection. He explains NBS-LRR genes and the TIR and CC families. He used https://github.com/steuernb/NLR-Parser to obtain motifs and call NBLRs. He acknowledges that you get both genes and pseudogenes.

30 April 2019

#monogram2019


Hi, these are my notes on the Cereal Bioinformatics Session, plus the keynote by Keith Edwards, at Monogram 2019.

The rest of the notes are at https://bioinfoperl.blogspot.com/2019/05/monogram19-2.html?m=1


Cristobal Uauy, JIC  
Speaks about http://wheat-expression.com and explains the different references, from TGAC to RefSeq v1.0 (with 01 in the middle of gene names) and v1.1 (02 instead, as in TraesCS3D02G273600, used in http://plants.ensembl.org/Triticum_aestivum/Info/Index). He asks users to cite the papers not just the Web site. He mentions also the gene expression browser http://bar.utoronto.ca/efp_wheat/cgi-bin/efpWeb.cgi , http://www.polymarker.info to design polyploid-aware primers and the in silico wheat TILLING integrated in Ensembl Plants (http://www.wheat-tilling.com is legacy on previous gene models, but still useful in some cases). He wraps up by describing http://www.wheat-training.com , which links out to all resources and wheat populations as well.
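
The naming convention described above can be checked mechanically when mixing gene lists from different releases. This is only a minimal sketch, assuming identifiers follow the layout of the example (two characters for the chromosome after "TraesCS", then 01 or 02); the function name annotation_version and the second identifier are hypothetical:

```shell
# Guess the RefSeq annotation release from a wheat gene identifier.
# Assumption: the two characters after "TraesCS" name the chromosome (e.g. 3D)
# and the next two digits are 01 (RefSeq v1.0) or 02 (v1.1), as in the talk.
annotation_version() {
  case "$1" in
    TraesCS??01G*) echo "v1.0" ;;
    TraesCS??02G*) echo "v1.1" ;;
    *)             echo "unknown" ;;
  esac
}

annotation_version TraesCS3D02G273600   # prints v1.1
annotation_version TraesCS3D01G273600   # prints v1.0 (made-up identifier)
```

Anything that does not match either pattern, such as older TGAC-style names, is reported as unknown rather than guessed.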

Guy Naamati, EMBL-EBI
Describes the RefSeq v1.0 assembly with the v1.1 gene annotation in Ensembl Plants, the updated marker display (http://plants.ensembl.org/Triticum_aestivum/Variation/Explore?r=4A:714193214-714194214;v=BA00249348;vdb=variation;vf=194242) and their linked SIFT predictions. He summarizes the outcome of the ensembl4breeders event (see table in poster below, https://twitter.com/ensemblgenomes/status/1098902364998782976), and singles out pangenomes and the wheat test case as a prototype to develop that within Ensembl. He finishes by advertising the upcoming Plant Genomes in a Changing Environment conference in October 2019 (https://coursesandconferences.wellcomegenomecampus.org/our-events/plant-genomes-2019)




Leif Skot, IBERS
He talks about breeding targets in perennial ryegrass (Lolium perenne) and genomic predictions based on a breeding experiment running for 50 years, with linear gains in biomass and yield and no signs of inbreeding depression yet. There seems to be a physically-anchored genome assembly under way (https://gtr.ukri.org/projects?ref=BB%2FG012342%2F1), but it is not ready yet; there are, though, synteny-based (https://www.ncbi.nlm.nih.gov/pubmed/26408275) and de novo (https://link.springer.com/chapter/10.1007/978-3-319-28932-8_19) assemblies.

Craig Simpson, JHI
They are using Salmon/kallisto to quantify barley transcriptomes, knowing that current barley gene models are still poor. Their aim is also to build a reference transcriptome (BarRTv1) with https://ccb.jhu.edu/software/stringtie guided by the Morex assembly. They analyze 11 RNAseq datasets, with over 800 Illumina samples. They filter out low-expression transcripts (less than 0.3 TPM) and use gmap to map back to the Morex reference. They try to validate their expression values with RT-PCR and realize how difficult it is to map multiple isoforms to a single PCR read. By correlating with RT-PCR results they defined optimal StringTie params: -c 2.5 - 50 -f 0, yielding over 60K genes and 177K transcripts, fewer than the original, which have been imported into a database by Linda Milne. They plan to do BarRTv2 with PacBio Iso-seq. He says there are many genes with a single dominant isoform but also many others with 2-4 dominant isoforms, which would be nice to annotate in resources such as Ensembl. This data is still unpublished.
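
The low-expression filter mentioned above could look roughly like the sketch below. It assumes the tab-separated layout of a Salmon quant.sf file (Name, Length, EffectiveLength, TPM, NumReads) and a single sample; the talk did not say how the 0.3 TPM cutoff was applied across the 800+ samples, and the function name filter_tpm and the toy values are made up:

```shell
# Keep the header plus transcripts whose TPM (4th column) reaches a threshold.
# Assumption: Salmon-like quant.sf columns; the 0.3 default comes from the talk.
filter_tpm() {
  awk -F'\t' -v min="${1:-0.3}" 'NR == 1 || $4 + 0 >= min + 0'
}

# Toy table with made-up transcript names and values
printf '%s\t%s\t%s\t%s\t%s\n' \
  Name Length EffectiveLength TPM NumReads \
  tx1  1500   1300            12.5 240 \
  tx2  900    700             0.1  2 \
  tx3  2100   1900            0.3  9 | filter_tpm 0.3
# keeps the header, tx1 and tx3; tx2 (0.1 TPM) is dropped
```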

Kumar Gaurav, JIC
He talks about wild parents of wheat and their recent R-gene enrichment sequencing work to show they contain useful disease resistance genes. They belong to the Open Wild Wheat consortium, and have sequenced 260 Aegilops tauschii individuals with 10-30x coverage (10Tb, available under the Toronto agreement, seeds from the GRU, JIC). They are performing diversity studies and mention wheat lineages 1 and 2.

Anthony Hall, JIC
He talks about a pan-genome of wheat elite cultivars as a way to gain access to hidden variability (SV, TE, promoters). This is the 10+ project, with NRGene RefSeq and W2RAP (https://github.com/bioinfologics/w2rap) assemblies. They know they are not covering all wheat variability out there. The assemblies are ready, they are now finishing the annotations with both de novo and validated gene models using a pan-transcriptome. A BLAST server is already available at https://webblast.ipk-gatersleben.de/wheat_ten_genomes .

Micha Bayer, JHI
He talks about the barley variome sampled from exome capture of 823 barley genotypes, covering mainly SNPs and small indels. He discusses the depth vs breadth dilemma when managing diversity in germplasm. Their cultivars come from WHEALBI, EXCAP, B1K Israel, WBDC and other projects. Less than 5% of their final variants come from exons, with most coming from introns and UTRs. 96% are off-target variants with low read depth, but with sufficient calling quality. Population-level analyses distinguish wild and cultivated barleys, with low recombination around centromeres. They use SnpEff and are looking at fixed loss-of-function alleles in domesticated barleys. He mentions that 20-30% of reads do not map to the reference with max 4% mismatches.

After his talk, there’s a discussion on how to name genes in the context of pan-genomes. Cristobal says the role of Ensembl will be critical in this context.
Sebastian Raubach, JHI
Talks about Germinate v3 https://ics.hutton.ac.uk/get-germinate, a one-stop database schema for plant genetic resources, with powerful visualizations. It supports BrAPI, Multi-Crop Passport Descriptors (MCPD) and Dublin Core Metadata Initiative (DCMI). It comprises 3 modules: Scan (bar codes), Data Import and Germinate. It is used by 100+ groups working on different crops around the world, including wheat and maize at CIMMYT. Data can be exported to Helium, Flapjack, R, Excel, BrAPI and Google Maps. It supports custom, restricted data access.

Keywan Hassani-Pak, Rothamsted
He talks about KnetMiner3.0 (http://knetminer.rothamsted.ac.uk) and does a quick 5-minute off-line demo. He showcases the evidence view, which is an enrichment analysis, and the keyword search to get more specific search results.
He then makes a DFW progress report on behalf of Rob Davey (EI), including https://grassroots.tools, which is about making data publicly available, and http://cyverseuk.org

Paul Wilkinson, U Bristol
He talks about http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/indexNEW.php, built with Perl and PHP on top of a database. He focuses specifically on the most recently added features, including a QTL database made in collaboration with the JIC and EI (which links out to Ensembl Plants), online dendrograms (http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/35K_dendrogram.php) and an introgression plotter. The latter will become available soon and allows visualizing genomic regions introgressed in crosses. It produces nice circular plots and heatmaps.

Mario Caccamo, NIAB
He starts by talking about http://wheatis.org, which is part of the Wheat Initiative, launched in 2011. There are 5 nodes across the EU and US, including Ensembl Plants and Gramene. He then moves on to the recent work of a group of experts on wheat gene nomenclature with the Wheat Gene Catalogue https://shigen.nig.ac.jp/wheat/komugi/genes/symbolClassList.jsp
Roughly 10-15% of the loci in the catalogue correspond to current gene models, not always in a 1-to-1 relation.

Kim Hammond-Kosack, Rothamsted
She talks about PHI-base (http://www.phi-base.org) on pathogen-host interactions. Hosts are plants half of the time, not only cereals. Its main use is to look up mutant-phenotype relationships. They use a scale of 9 phenotypes, including negative results. They have a tool (PHI-canto) that allows users to annotate their own results with controlled vocabularies. It complies with FAIR Data principles.

Keith Edwards, U Bristol (plenary talk)
He talks about the genomic challenges in wheat and how we are discovering the actual diversity of wheats thanks to marker technologies. This is in contrast to what was thought earlier, that they lacked variability. Today they can scan 98K KASP markers in 1 day and we now know that this species, despite being only 10K years old and having gone through 1-2 hybridizations, has a massive diversity. This is probably due to hybrid swarms, populations of hybrids that interbreed and backcross with their parents (diploids & tetraploids). He shows two examples of extensive introgressions in chromosomes of elite cultivar Cadenza and two ancient wheats: Watkins 199 and Chaff 1790 from Rothamsted. He concludes that the variation was already there 10K years ago, is not new, and that there is forced gene flow between wheat and its parents and close species, mostly the tetraploids. These introgressed regions do not usually recombine, as they are too divergent (over 0.5%), and impose a LEGO-like genome, with recombination restricted to certain windows.

Source: http://www.earlham.ac.uk/articles/earlham-institute-lego-sequencer

16 April 2019

mapping coordinates between assemblies

Hi,
a task you often run into when working in genomics is translating coordinates from one version of a genome to the equivalent coordinates in another version.

Example of mapping between two versions or assemblies of the same genome, taken from http://ensemblgenomes.org/info/data/assembly_mapping.

The figure summarizes the problem. It is important to realize that there may be elements in version n that are absent from n+1, or vice versa. With that caveat, we can easily translate coordinates in Ensembl using the REST interface: https://rest.ensembl.org/documentation/info/assembly_map

For example, to convert the fragment 1-10 of barley chromosome 1 from the 2012 version (ASM32608v1) to the 2017 one (IBSC_v2), we could do:

$ wget -q --header='Content-type:application/json' \
  'https://rest.ensembl.org/map/hordeum_vulgare/ASM32608v1/1:1..10/IBSC_v2?'  -O -

We get the results in JSON format:

{"mappings":[{"original":{"end":10,"seq_region_name":"1","assembly":"ASM32608v1","start":1,
"strand":1,"coord_system":"chromosome"},"mapped":{"start":268488,"assembly":"IBSC_v2",
"seq_region_name":"chr1H","end":268497,"coord_system":"chromosome","strand":1}}]}
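
The mapped coordinates can be pulled out of that JSON without extra dependencies. This is a minimal sketch, assuming the response has the same structure as the one shown above and using JSON::PP, which ships with Perl; the function name map_to_region is hypothetical and the response is pasted in as a string so the example runs offline:

```shell
# Saved copy of the REST response shown above
json='{"mappings":[{"original":{"end":10,"seq_region_name":"1","assembly":"ASM32608v1","start":1,"strand":1,"coord_system":"chromosome"},"mapped":{"start":268488,"assembly":"IBSC_v2","seq_region_name":"chr1H","end":268497,"coord_system":"chromosome","strand":1}}]}'

# Print one "region:start-end" line per mapping found in the JSON read from stdin
map_to_region() {
  perl -MJSON::PP -0777 -ne '
    my $data = decode_json($_);
    for my $m (@{ $data->{mappings} }) {
      my $t = $m->{mapped};
      print "$t->{seq_region_name}:$t->{start}-$t->{end}\n";
    }'
}

printf '%s' "$json" | map_to_region   # prints chr1H:268488-268497
```

A region that maps to several blocks in the new assembly would yield one line per block, and a region absent from the new assembly would yield no lines at all.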

You can find the names of the supported assembly versions in the Tools->AssemblyConverter tool (for example in human, plants).

See you soon,
Bruno



22 March 2019

one-liner in utf8

Hi,
today I just want to share a one-liner, or perlito as my colleague Pablo Vinuesa calls them, which prints a file with UTF-8 encoding, not ASCII, as required by pandoc when compiling the BibTeX bibliography of a markdown document:

$ perl -lne 'BEGIN{binmode(STDOUT, ":utf8")} next if(/^%/); s/\\htmladdnormallink\{\S+?\}\{(\S+)?\}/$1/; print' bib_myarticles.bib > bib_myarticles_md.bib

See you later,
Bruno

5 March 2019

Merging several sorted files (sort merge)

Hi,
in past posts, such as this one from 2010, we already discussed how to sort tabular BLAST results with GNU sort. In the meantime sort has kept evolving, and current versions, for example the one installed in Ubuntu 18, can now distribute the work across several threads:

--parallel=N    change the number of sorts run concurrently to N

However, when we want to globally sort numeric tables split across several internally sorted files, we still use an option that was already available in 2010:

-m, --merge     merge already sorted files; do not sort

Although it skips sorting the files internally, this task becomes very slow when the number of files is large. In this example we merge 100 tab-separated (TSV) files by the numeric columns 1 and 11:

time sort -S500M -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 20m15.793s
user 19m46.045s
sys 0m18.724s

The key to speeding this up is to avoid temporary files by reading from all 100 files at once (by default GNU sort merges at most 16 inputs at a time and uses temporary files beyond that):

time sort --batch-size=100 -S500M -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 13m49.672s
user 13m29.370s
sys 0m9.972s
 
This way it takes about a third less time (13m50s versus 20m16s) and, as the last example shows, increasing the process memory from 500 MB to 5 GB does not help. Likewise, in my tests it was not worth increasing the number of concurrent threads:

time sort --batch-size=100 -S5G -s -k1g -k11g -m ficheros*.sorted > all.sorted

real 14m9.339s
user 13m49.315s
sys 0m10.705s
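
The merge step can be tried on a toy dataset before committing to a long run. A minimal sketch, assuming GNU sort and two made-up two-column files (the real example sorts on columns 1 and 11); `sort -c` first checks that each input is already ordered, which `-m` takes for granted:

```shell
# Toy demonstration of merging pre-sorted TSV files with GNU sort.
# Assumptions: GNU coreutils sort; file names and values are made up.
tmpdir=$(mktemp -d)
tab=$(printf '\t')

printf '1\t0.5\n3\t0.1\n'  > "$tmpdir/a.sorted"
printf '2\t0.2\n3\t0.05\n' > "$tmpdir/b.sorted"

# -c only checks the order; it exits non-zero if a file is not sorted
for f in "$tmpdir"/*.sorted; do
  LC_ALL=C sort -c -t "$tab" -k1,1g -k2,2g "$f" || echo "not sorted: $f" >&2
done

# --batch-size must be at least the number of inputs to avoid temporary files
LC_ALL=C sort --batch-size=2 -s -t "$tab" -k1,1g -k2,2g -m \
  "$tmpdir"/*.sorted > "$tmpdir/merged.tsv"

cat "$tmpdir/merged.tsv"
```

Setting LC_ALL=C keeps the decimal point and the comparison rules independent of the local language settings, which matters for general-numeric (-g) keys.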

See you later,
Bruno