pyPAGE

pyPAGE is a Python implementation of the conditional-information PAGE framework for gene-set enrichment analysis.

It is designed to infer differential activity of pathways and regulons while accounting for annotation and membership biases using information-theoretic methods.

Approach

Bulk PAGE

Standard gene-set enrichment methods test whether pathway members are non-randomly distributed across a ranked gene list. pyPAGE frames this as an information-theoretic question: how much does knowing a gene's pathway membership tell you about its expression bin?

  1. Discretize continuous expression scores (e.g. log2 fold-change) into equal-frequency bins
  2. Compute mutual information (MI) between expression bins and pathway membership, or conditional MI (CMI), which conditions on the number of pathways each gene belongs to. Conditioning corrects for the bias whereby heavily annotated genes drive spurious enrichment
  3. Run a permutation test to assess significance, with early stopping
  4. Filter redundant pathways: remove those whose signal is explained by an already-accepted pathway (via CMI between memberships)
  5. Compute hypergeometric enrichment per bin to produce the iPAGE-style heatmap showing which expression bins drive each pathway's signal
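
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pyPAGE's implementation: every function name is hypothetical, the permutation test omits early stopping, redundancy filtering is left out, and `z` in `conditional_mi` stands for whatever per-gene annotation count (possibly binned) is conditioned on.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(scores, n_bins):
    """Assign each score a bin index 0..n_bins-1 with near-equal bin sizes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores)
    return bins


def mutual_information(x, y):
    """MI in bits between two equal-length discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def conditional_mi(x, y, z):
    """I(X;Y|Z): MI within each stratum of z, weighted by stratum size."""
    n = len(z)
    total = 0.0
    for level, count in Counter(z).items():
        idx = [i for i in range(n) if z[i] == level]
        total += (count / n) * mutual_information(
            [x[i] for i in idx], [y[i] for i in idx]
        )
    return total


def permutation_p_value(bins, membership, n_shuffle=200, seed=0):
    """Empirical p-value: how often shuffled bins match or beat the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(bins, membership)
    shuffled = list(bins)
    hits = 0
    for _ in range(n_shuffle):
        rng.shuffle(shuffled)
        if mutual_information(shuffled, membership) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffle + 1)


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n are pathway members."""
    upper = min(n, N)
    favourable = sum(
        math.comb(n, i) * math.comb(M - n, N - i) for i in range(k, upper + 1)
    )
    return favourable / math.comb(M, N)


# Synthetic demo data (not from pyPAGE): 300 genes, 3 bins, and a made-up
# pathway whose members all fall in the top-scoring bin.
rng = random.Random(1)
scores = [rng.gauss(0, 1) for _ in range(300)]
bins = equal_frequency_bins(scores, n_bins=3)
membership = [1 if b == 2 and rng.random() < 0.6 else 0 for b in bins]

observed, p = permutation_p_value(bins, membership)
k_top = sum(m for b, m in zip(bins, membership) if b == 2)
top_bin_p = hypergeom_sf(k_top, len(bins), sum(membership), bins.count(2))
print(f"MI={observed:.3f} bits, permutation p={p:.4f}, top-bin p={top_bin_p:.2e}")
```

Stratifying by `z` is one simple way to express conditioning: a pathway only scores well if membership is informative about the expression bin among genes with a similar annotation load.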

Single-Cell PAGE

For single-cell data, the question becomes: are pathway scores spatially coherent across the cell manifold? A pathway whose activity varies smoothly across cell states, rather than randomly, is more likely to reflect real biology than noise.

  1. Per-cell scoring: for each cell, compute MI or CMI between gene expression bins and pathway membership across all genes. This produces an (n_cells x n_pathways) score matrix
  2. KNN graph: build a cell-cell k-nearest-neighbor graph from expression (or use a precomputed one from scanpy)
  3. Geary's C: measure spatial autocorrelation of each pathway's scores on the KNN graph (C near 1 indicates no autocorrelation, C near 0 strong positive autocorrelation). Report C' = 1 - C, where higher values mean the pathway varies coherently across the manifold rather than randomly
  4. Permutation test: generate size-matched random gene sets, compute their C', and derive empirical p-values with Benjamini-Hochberg (BH) FDR correction
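
Steps 2 to 4 can be illustrated with a small pure-Python sketch. It is not pyPAGE's code: the function names are hypothetical, the per-cell scores are synthetic, and the null here shuffles scores across cells as a stand-in, because the null described above (re-scoring size-matched random gene sets) needs the full scoring pipeline.

```python
import math
import random


def knn_edges(points, k):
    """Directed k-nearest-neighbour edges (i, j) by Euclidean distance."""
    edges = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges


def gearys_c(values, edges):
    """Geary's C with unit edge weights; values must not all be equal."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[i] - values[j]) ** 2 for i, j in edges)
    return (n - 1) * numer / (2 * len(edges) * denom)


def empirical_p(observed, null_values):
    """Fraction of null values at least as large as observed (add-one smoothed)."""
    hits = sum(v >= observed for v in null_values)
    return (1 + hits) / (1 + len(null_values))


def benjamini_hochberg(p_values):
    """BH-adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Synthetic manifold: 20 cells on a line, with a score that rises smoothly.
points = [(float(i), 0.0) for i in range(20)]
edges = knn_edges(points, k=2)
smooth = [float(i) for i in range(20)]
c_prime = 1 - gearys_c(smooth, edges)

# Stand-in null: the same scores shuffled across cells.
rng = random.Random(0)
null_c_prime = []
for _ in range(199):
    shuffled = smooth[:]
    rng.shuffle(shuffled)
    null_c_prime.append(1 - gearys_c(shuffled, edges))

p = empirical_p(c_prime, null_c_prime)
print(f"C'={c_prime:.3f}, empirical p={p:.3f}")
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A smoothly varying score gives C' close to 1, while shuffled scores scatter around 0, which is the contrast the permutation test measures.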

Installation

Install from PyPI:

pip install bio-pypage

Or install from source:

git clone https://github.com/goodarzilab/pyPAGE
cd pyPAGE
pip install -e .
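
To confirm the installation from Python, a small check like the following may help. Note that the PyPI distribution is named bio-pypage while the import name used in the Quick Start is pypage; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="bio-pypage"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())
```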

Quick Start

import pandas as pd
from pypage import PAGE, ExpressionProfile, GeneSets

# 1) Load expression profile (gene, score)
expr = pd.read_csv(
    "example_data/AP2S1.tab.gz",
    sep="\t",
    header=None,
    names=["gene", "score"],
)
exp = ExpressionProfile(expr["gene"], expr["score"], is_bin=True)

# 2) Load annotation (gene, pathway)
ann = pd.read_csv(
    "example_data/GO_BP_2021_index.txt.gz",
    sep="\t",
    header=None,
    names=["gene", "pathway"],
)
gs = GeneSets(ann["gene"], ann["pathway"])

# 3) Run pyPAGE
p = PAGE(exp, gs, n_shuffle=100, k=7, filter_redundant=True)
results, heatmap = p.run()

print(results.head())
heatmap.show()

results contains:

  • pathway: pathway name
  • CMI: conditional mutual information score
  • z-score: z-score of the observed CMI against the permutation null distribution
  • p-value: empirical p-value from the permutation test
  • Regulation pattern: 1 for up, -1 for down
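
As a rough sketch of working with this table, the snippet below filters significant pathways and splits them by direction. The DataFrame is a hand-made stand-in with invented values, and the column names are assumed to match the list above exactly; check `results.columns` on real output before relying on them.

```python
import pandas as pd

# Hypothetical stand-in for the table returned by p.run(); values are invented
# and column names are assumed from the list above.
results = pd.DataFrame(
    {
        "pathway": ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"],
        "CMI": [0.012, 0.004, 0.009],
        "z-score": [8.1, 1.2, 6.3],
        "p-value": [0.001, 0.210, 0.004],
        "Regulation pattern": [1, -1, -1],
    }
)

# Keep pathways below a chosen p-value cutoff, strongest signal first.
significant = results[results["p-value"] < 0.05].sort_values("CMI", ascending=False)

up = significant.loc[significant["Regulation pattern"] == 1, "pathway"].tolist()
down = significant.loc[significant["Regulation pattern"] == -1, "pathway"].tolist()
print("up:", up)
print("down:", down)
```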

Examples

Run these canonical examples on the bundled example_data/ inputs; the demo outputs they produce are also bundled there.

Bulk, continuous scores:

pypage -e example_data/test_DESeq_logFC.txt \
    --gmt example_data/c2.all.v2026.1.Hs.symbols.gmt \
    --type continuous --n-bins 9 \
    --cols GENE,log2FoldChange \
    --seed 42 \
    --outdir example_data/test_DESeq_logFC_cont_PAGE

Bulk, discrete scores:

pypage -e example_data/test_DESeq_logFC.txt \
    --gmt example_data/c2.all.v2026.1.Hs.symbols.gmt \
    --type discrete \
    --cols GENE,log2FoldChange_bin9 \
    --seed 42 \
    --outdir example_data/test_DESeq_logFC_disc_PAGE

Single-cell:

pypage-sc --adata example_data/CRC.h5ad \
    --gene-column gene \
    --gmt example_data/c2.all.v2026.1.Hs.symbols.gmt \
    --groupby PhenoGraph_clusters --n-jobs 0 --fast-mode
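
The `--gmt` inputs above are in GMT format, where each tab-separated line holds a set name, a description, and then the member genes. As a sketch of what such a file contains, the hypothetical helper below (not part of pyPAGE) flattens GMT lines into (gene, pathway) pairs, the same shape the Quick Start passes to GeneSets.

```python
import io


def read_gmt(handle):
    """Yield (gene, pathway) pairs from a GMT stream.

    Each GMT line is tab-separated: set name, description, then member genes.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pathway, _description, *genes = fields
        for gene in genes:
            if gene:
                yield gene, pathway


# Tiny in-memory example with made-up set names.
demo = io.StringIO("SET_A\tdemo\tTP53\tMYC\nSET_B\tdemo\tMYC\n")
pairs = list(read_gmt(demo))
print(pairs)
```

For a real file, pass an open text handle instead of the `io.StringIO` object.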

Expected Outputs (Demo Artifacts)

Bulk continuous (example_data/test_DESeq_logFC_cont_PAGE/):

Bulk discrete (example_data/test_DESeq_logFC_disc_PAGE/):

Single-cell (example_data/CRC_scPAGE/):

Preview Graphics

  • Bulk continuous heatmap (PDF | HTML)
  • Bulk discrete heatmap (PDF | HTML)
  • Single-cell consistency ranking (PDF | Interactive ranking | SC report)
  • Single-cell UMAP pathway example (PDF)
  • Single-cell group-enrichment example (PDF | Stats TSV)

Documentation

The detailed user and API documentation now lives in MANUAL.md.

Updated notebooks:

Citation

Bakulin A, Teyssier NB, Kampmann M, Khoroshkin M, Goodarzi H (2024) pyPAGE: A framework for Addressing biases in gene-set enrichment analysis—A case study on Alzheimer's disease. PLoS Computational Biology 20(9): e1012346. https://doi.org/10.1371/journal.pcbi.1012346

License

MIT

About

pyPAGE was developed in the Goodarzi Lab at UCSF by Artemy Bakulin, Noam B. Teyssier, and Hani Goodarzi.
