This article was co-written with Pan Liu, a postdoctoral researcher at UCLA and Fred Hutchinson Cancer Center. Pan is the first author of the mcRigor Nature Communications article.
Single-cell sequencing technologies have advanced rapidly in recent years, providing unprecedented opportunities to uncover cellular diversity, dynamic changes in cell states, and underlying gene regulatory mechanisms. In addition to the widely used single-cell RNA sequencing (scRNA-seq) 1,2, new modalities such as single-cell chromatin accessibility sequencing (scATAC-seq) 3,4 and joint profiling of transcriptome and chromatin accessibility (scMultiome) 5 have enabled the dissection of cellular heterogeneity at single-cell resolution across multiple omics layers. However, the data generated by these technologies are typically highly sparse, primarily due to limited sequencing depth per cell, as well as imperfect reverse transcription and nonlinear amplification, which cause highly expressed genes to dominate sequencing capacity and make lowly expressed genes difficult to detect 6.

To alleviate data sparsity and noise, researchers have proposed the “metacell” concept, in which cells with similar expression profiles are aggregated into a single representative unit, a metacell, whose expression is defined as the mean expression of its constituent cells. Averaging in this way enhances signal and reduces noise. Yet existing metacell construction methods often yield substantially different metacell partitions and are highly sensitive to hyperparameter settings, particularly the average metacell size 7. This lack of consistency makes it difficult for users to determine which metacell partition is more trustworthy and to what extent the resulting metacell profiles preserve true biological signals. Consequently, the robustness of downstream analyses is compromised, and the potential of metacells as a general data preprocessing framework across diverse tasks and omics modalities remains limited.
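The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any metacell package: the function name `aggregate_metacells`, the toy count matrix, and the labels are all made up for the example, and the partition itself is assumed to be given.

```python
import numpy as np

def aggregate_metacells(counts, labels):
    """Average the expression of all cells that share a metacell label.

    counts: (n_cells, n_genes) array of measured expression
    labels: length-n_cells array of metacell ids (the partition is assumed given)
    Returns (ids, profiles), where profiles[i] is the mean profile of metacell ids[i].
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    profiles = np.vstack([counts[labels == k].mean(axis=0) for k in ids])
    return ids, profiles

# Toy data (hypothetical): 4 cells x 3 genes, grouped into 2 metacells.
counts = np.array([[0, 2, 0],
                   [1, 0, 0],
                   [0, 4, 3],
                   [0, 0, 5]])
labels = np.array([0, 0, 1, 1])

ids, profiles = aggregate_metacells(counts, labels)

# Fraction of zero entries before and after aggregation.
zero_frac_cells = (counts == 0).mean()
zero_frac_meta = (profiles == 0).mean()
print(profiles)
print(zero_frac_cells, zero_frac_meta)
```

In this toy case the metacell profiles contain a smaller fraction of zeros than the single-cell matrix, which is the sparsity reduction that motivates metacells. The sketch says nothing about whether the two groups are actually homogeneous, which is the question mcRigor addresses.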
Our Nature Communications paper 8 provides a rigorous statistical definition of a metacell based on a two-layer model of single-cell sequencing data: the upper layer captures the biological variation in true expression, while the lower layer models the sequencing process that generates measured expression from the true expression. Building on this definition, we develop mcRigor, a statistical framework for detecting dubious metacells within a given partition and selecting the optimal metacell partitioning method and hyperparameter across candidate method-hyperparameter configurations.
mcRigor detects and removes dubious metacells, thereby improving the reliability of downstream analyses such as gene co-expression and enhancer–gene regulation. Its extended version, mcRigor two-step, further disassembles dubious metacells into single cells and re-assembles them into smaller, more reliable metacells. mcRigor also enables data-driven selection of the most suitable metacell partitioning strategy for each dataset. It can be readily applied to single-cell transcriptomic, chromatin accessibility, and multi-omic data (Fig. 2). In addition, mcRigor provides a unified evaluation criterion for benchmarking different metacell construction methods, offering reliable guidance for researchers in method selection.
In the first part of our paper 8, we introduce mcRigor’s methodology for detecting dubious metacells. Specifically, mcRigor quantifies the internal heterogeneity of each metacell using a feature-correlation-based statistic, mcDiv, which measures the deviation of feature–feature correlations from independence. The rationale is that if all member cells share the same true expression levels and the observed variation among them arises purely from the measurement process, the features should be approximately independent. mcRigor then constructs a null distribution for mcDiv using a novel double permutation procedure and identifies metacells that significantly deviate from this null as dubious (Fig. 2a).
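The rationale above can be illustrated with a simplified Python sketch. It is not mcRigor's implementation: `corr_divergence` is a hypothetical stand-in for mcDiv (the mean squared off-diagonal feature–feature correlation), and the null distribution comes from a plain single permutation that shuffles each feature independently across cells, swapped in for the paper's double permutation procedure. All names, parameters, and simulated data are assumptions made for illustration.

```python
import numpy as np

def corr_divergence(x):
    """Stand-in heterogeneity statistic (NOT the paper's mcDiv):
    mean squared off-diagonal feature-feature correlation.
    Values near 0 indicate approximately independent features."""
    x = np.asarray(x, dtype=float)
    x = x[:, x.std(axis=0) > 0]          # drop constant features
    p = x.shape[1]
    if p < 2:
        return 0.0
    r = np.corrcoef(x, rowvar=False)     # p x p correlation matrix
    off_diag = r[~np.eye(p, dtype=bool)]
    return float(np.mean(off_diag ** 2))

def permutation_pvalue(x, n_perm=200, seed=0):
    """Single-permutation null (a simplification): shuffling each feature
    independently across cells breaks any feature-feature dependence."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = corr_divergence(x)
    null = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = np.column_stack(
            [rng.permutation(x[:, j]) for j in range(x.shape[1])]
        )
        null[b] = corr_divergence(shuffled)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value)

# Simulated metacells (hypothetical data): 60 cells x 20 genes each.
rng = np.random.default_rng(1)
# Homogeneous: every cell has the same true expression; variation is Poisson noise only.
homogeneous = rng.poisson(5.0, size=(60, 20))
# Mixed: two cell populations with different true expression merged into one metacell.
mean_a = np.r_[np.full(10, 2.0), np.full(10, 10.0)]
mean_b = mean_a[::-1]
mixed = np.vstack([rng.poisson(mean_a, size=(30, 20)),
                   rng.poisson(mean_b, size=(30, 20))])

stat_h, p_h = permutation_pvalue(homogeneous)
stat_m, p_m = permutation_pvalue(mixed)
print(f"homogeneous: stat={stat_h:.3f}, p={p_h:.3f}")
print(f"mixed:       stat={stat_m:.3f}, p={p_m:.3f}")
```

Merging two populations induces strong correlations among the genes that differ between them, so the mixed metacell's statistic falls far outside the permutation null, while the homogeneous metacell's statistic is on the same scale as its null. The paper's mcDiv and double permutation procedure are designed more carefully than this sketch.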
In both semi-simulated and real PBMC datasets, mcRigor accurately distinguishes trustworthy metacells from dubious ones (Fig. 2b–c). We further demonstrate mcRigor’s effectiveness in improving the reliability of multiple downstream analyses. In cell-line data analyses, removing dubious metacells markedly increases the signal-to-noise ratio of cell-cycle marker genes (Fig. 2d). In COVID-19 versus healthy control data analyses, mcRigor eliminates spurious gene correlations caused by dubious metacells and reveals stronger co-expression within adaptive immune response modules (Fig. 2e). In scMultiome data analyses, mcRigor enhances the detectability of enhancer–gene associations, filtering out weakly supported false positives while preserving signals consistent with those observed at the single-cell level (Fig. 2f).


In the second part of our paper 8, we present mcRigor’s methodology for evaluating metacell partitions and optimizing hyperparameters. By balancing metacell trustworthiness against data sparsity, mcRigor assigns an overall evaluation score to each candidate partition and automatically selects the optimal method–parameter configuration among all candidates, thereby transforming the empirical process of method and parameter tuning into data-driven automated decision-making (Fig. 3a).
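The selection step can be sketched as a small scoring loop. The scoring rule below is a hypothetical trade-off chosen only to illustrate the idea of balancing trustworthiness against sparsity; it is not the evaluation score defined in the paper. The method names, candidate sizes, rates, and the `weight` parameter are made-up inputs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    method: str          # metacell construction method (hypothetical label)
    target_size: int     # average metacell size hyperparameter
    dubious_rate: float  # fraction of cells that fall in dubious metacells
    zero_rate: float     # zero proportion of the metacell expression profiles

def score(c, weight=1.0):
    """Hypothetical score (an assumption, not mcRigor's formula):
    reward trustworthy metacells and penalize remaining sparsity."""
    return (1.0 - c.dubious_rate) - weight * c.zero_rate

def select_best(candidates, weight=1.0):
    """Return the candidate configuration with the highest score."""
    return max(candidates, key=lambda c: score(c, weight))

# Illustrative numbers only: small metacells are trustworthy but sparse,
# large metacells are dense but more often dubious.
candidates = [
    Candidate("methodA", 5, 0.02, 0.60),
    Candidate("methodA", 20, 0.10, 0.35),
    Candidate("methodA", 80, 0.55, 0.15),
    Candidate("methodB", 20, 0.20, 0.30),
]

best = select_best(candidates)
print(best.method, best.target_size, round(score(best), 2))
```

With these inputs the intermediate size wins, because very small metacells stay too sparse and very large ones absorb too many heterogeneous cells. In practice the dubious rate would come from a test like the one mcRigor applies to each metacell, rather than being supplied by hand.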
We illustrate the utility of this optimization functionality across diverse downstream tasks. For instance, the zero proportion of mcRigor-optimized metacells closely matches the gold-standard zero proportion measured by smRNA-FISH, demonstrating its ability to distinguish technical zeros from biological zeros (Fig. 3b). In differential expression analysis, results based on mcRigor-optimized metacells align more closely with those obtained from bulk RNA-seq data, indicating improved reliability (Fig. 3c). In time-course data, mcRigor-optimized metacells enhance trajectory resolution and reveal clearer gene-expression dynamics consistent with experimental evidence (Fig. 3d).
The mcRigor R package and online tutorials are available at https://jsb-ucla.github.io/mcRigor/
Full paper available at https://www.nature.com/articles/s41467-025-63626-5
References:
1. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
2. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
3. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
4. Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
5. Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
6. Jiang, R., Sun, T., Song, D. & Li, J. J. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 23, 31 (2022).
7. Bilous, M., Hérault, L., Gabriel, A. A., Teleman, M. & Gfeller, D. Building and analyzing metacells in single-cell genomics data. Mol. Syst. Biol. 20, 744–766 (2024).
8. Liu, P. & Li, J. J. mcRigor: a statistical method to enhance the rigor of metacell partitioning in single-cell data analysis. Nat. Commun. (2025) doi:10.1038/s41467-025-63626-5. Preprint: bioRxiv (2024) doi:10.1101/2024.10.30.621093.
9. Kirschenbaum, D. et al. Time-resolved single-cell transcriptomics defines immune trajectories in glioblastoma. Cell 187, 149–165.e23 (2024).



