A pipeline for curating ChEMBL into a virtual screening dataset for deep-learning model training/validation.
The pipeline filters ChEMBL activity data and selects artificial decoys using compound- and target-level criteria. It also provides leakage-resistant train/test splitting by receptor similarity. PDBbind+BioLip is the default training set and can be replaced with a user-supplied bulk FASTA file.
With the default settings, you can create the 'ChEMBL-LR' benchmark to test DL models trained on the PDBbind and/or BioLip2 datasets without having to worry about target leakage.
| Stage | Command | Output |
|---|---|---|
| 1. Compound filtering | curate | {target}/actives.tsv, {target}/comps/smiles/*.smi |
| 2. Protein filtering | filter-proteins | aligned/*.pdb, pocket_info.csv, sequences.fasta, best_structure.tsv |
| 3. Active clustering | cluster-actives | {target}/actives_clustered.tsv |
| 4. Compound pool | build-pool | compound_pool.pkl |
| 5. Receptor similarity | receptor-sim | pairwise_seqid.tsv, pairwise_pocket_rmsd.tsv |
| 6. Decoy selection | select-decoys | {target}/decoys.tsv |
| 7. Train/test split | split | train.txt, test.txt |
All outputs go under a single data directory (e.g. curated_data_filtered/).
conda is the recommended approach - RDKit and nurikit are C++ extension packages that conda resolves cleanly.
conda create -n chemblq python=3.10
conda activate chemblq
# RDKit via conda-forge (simpler than pip for RDKit)
conda install -c conda-forge rdkit
# Remaining Python dependencies
pip install -e . # installs click, numpy, pandas, requests, tqdm, nurikit
nuri vs nurikit: the Python import is `import nuri`, but the package name on PyPI is `nurikit`. The `pip install -e .` above handles this via the dependency in pyproject.toml.
These are not pip/conda packages and must be installed separately:
| Tool | Required for | Install |
|---|---|---|
| MMseqs2 | Stages 5, 7 (sequence search/clustering) | github.com/soedinglab/MMseqs2 - must be in PATH |
| wget | Stage 2 (AlphaFold download) | usually pre-installed |
| pdb_get | Stage 2 (optional) | local PDB mirror; falls back to RCSB web download |
Note: Structure alignment (Stages 2, 5) uses nurikit (Python TMAlign bindings), which is installed automatically via `pip install -e .`. No separate TMalign binary is needed.
# Verify MMseqs2 is in PATH
mmseqs --help

DATA=curated_data_filtered
# Stage 1: curate compounds from ChEMBL
chembl-curator curate --download --output $DATA
# Stage 2: validate protein structures and binding sites
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
# Stage 3: cluster actives per target (Butina, Tanimoto ≥ 0.7)
chembl-curator cluster-actives --data-dir $DATA --workers 8
# Stage 4: build global compound pool
chembl-curator build-pool --data-dir $DATA
# Stage 5: compute pairwise receptor similarity
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
# Stage 6: select property-matched, receptor-aware decoys
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
# Stage 7: train/test split by sequence identity clustering
chembl-curator split --data-dir $DATA --valid-frac 0.1

Extracts and filters ligand-target pairs from ChEMBL.
chembl-curator curate --download --output curated_data_filtered
chembl-curator curate --database /path/to/chembl.db --config config.json --output curated_data_filtered
chembl-curator curate --create-config config.json # generate example config

Default filters:
- Activity types: Ki, Kd, IC50, EC50
- Relations: =, <=
- Units: nM, uM (≤ 10 µM)
- pChEMBL ≥ 5.0, confidence score ≥ 6
- Heavy atoms: 5-80, valid SMILES required
- Binding assays only (B), single protein targets
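
The default filters above can be read as a single predicate over one activity record. The following is a minimal sketch of that logic, not the package's implementation: the record layout and every field name (`type`, `relation`, `units`, `value`, `pchembl`, `confidence`, `assay_type`, `heavy_atoms`) are hypothetical, and SMILES validation is assumed to have happened upstream.

```python
# Illustrative only: field names and function name are assumptions,
# not the chembl_curator schema or API.
ACTIVITY_TYPES = {"Ki", "Kd", "IC50", "EC50"}
RELATIONS = {"=", "<="}
MAX_VALUE = {"nM": 10000.0, "uM": 10.0}  # both correspond to 10 µM
MIN_PCHEMBL = 5.0
MIN_CONFIDENCE = 6
ASSAY_TYPES = {"B"}  # binding assays only
HEAVY_ATOM_RANGE = (5, 80)


def passes_default_filters(rec: dict) -> bool:
    """Return True if a single activity record satisfies every default filter."""
    if rec["type"] not in ACTIVITY_TYPES:
        return False
    if rec["relation"] not in RELATIONS:
        return False
    if rec["units"] not in MAX_VALUE:
        return False
    if rec["value"] > MAX_VALUE[rec["units"]]:
        return False
    if rec["pchembl"] is None or rec["pchembl"] < MIN_PCHEMBL:
        return False
    if rec["confidence"] < MIN_CONFIDENCE:
        return False
    if rec["assay_type"] not in ASSAY_TYPES:
        return False
    low, high = HEAVY_ATOM_RANGE
    return low <= rec["heavy_atoms"] <= high
```

For example, a 500 nM IC50 from a binding assay with confidence score 8 passes, while the same record reported with relation `>` is rejected.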
Configuration (JSON):
{
"activity_thresholds": {"nM": 10000.0, "uM": 10.0},
"activity_types": ["Kd", "Ki", "IC50", "EC50"],
"relations": ["=", "<="],
"units": ["nM", "uM"],
"min_pchembl_value": 5.0,
"min_confidence_score": 6,
"assay_types": ["B"]
}

Fetches PDB structures, downloads AlphaFold models, aligns structures, and keeps only targets with a single binding site.
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 8

Steps: fetch UniProt PDB list -> download PDB/AlphaFold -> detect ligand-bound structures -> align to AlphaFold (TMalign) -> cluster pockets -> filter single-site targets.
Also writes sequences.fasta (canonical UniProt sequences) and best_structure.tsv (best-resolution ligand-bound structure per target) needed by later stages.
Butina clustering of actives per target. Picks the highest-pChEMBL representative per cluster.
chembl-curator cluster-actives --data-dir curated_data_filtered --dist-thresh 0.3 --workers 8

| Option | Default | Description |
|---|---|---|
| --dist-thresh | 0.3 | Tanimoto distance threshold (0.3 -> similarity ≥ 0.7) |
| --workers | 1 | Parallel worker processes |
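
To make the clustering step concrete, here is a simplified pure-Python sketch of Butina-style clustering followed by picking the highest-pChEMBL member of each cluster. It is an illustration under stated assumptions, not the code in active_clusterer.py: fingerprints are modelled as sets of on-bit indices, and the function names are hypothetical. RDKit ships its own Butina implementation, which a real pipeline would more likely use.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0


def butina_clusters(fps, dist_thresh=0.3):
    """Cluster fingerprints; returns lists of indices, centroid first."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Two compounds are neighbors if their Tanimoto distance is small enough.
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_thresh:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit compounds with the most neighbors first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def pick_representatives(clusters, pchembl):
    """For each cluster, keep the index of the member with the highest pChEMBL."""
    return [max(members, key=lambda i: pchembl[i]) for members in clusters]
```

With three compounds where the first two have Tanimoto similarity 0.8 and the third shares no bits, the first two form one cluster and the third becomes a singleton.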
Builds a global compound pool from all clustered actives. Deduplicates by ChEMBL ID, computes molecular properties and Morgan fingerprints.
chembl-curator build-pool --data-dir curated_data_filtered

Pool contains: MW, cLogP, TPSA, HBD, HBA, aromatic rings, 2048-bit Morgan FP (radius 2), target membership set.
Computes pairwise receptor similarity via two independent methods.
# Both (recommended)
chembl-curator receptor-sim --data-dir curated_data_filtered --mode both --workers 8
# Sequence identity only (faster, no nuri required)
chembl-curator receptor-sim --data-dir curated_data_filtered --mode seqid

| Option | Default | Description |
|---|---|---|
| --mode | both | seqid, pocket, or both |
| --seqid-threads | 4 | MMseqs2 thread count |
| --workers | 4 | Processes for pocket RMSD |
| --pocket-radius | 10.0 | Pocket radius in Å |
Sequence identity: MMseqs2 all-vs-all -> pairwise_seqid.tsv (query, target, seqid 0-1)
Pocket RMSD: TM-align full chain -> filter to pocket residues within radius -> RMSD on ≥3 matched pairs -> pairwise_pocket_rmsd.tsv
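
The pocket RMSD step can be sketched as follows. This is a hedged illustration, not the code in receptor_similarity.py: it assumes the two chains have already been superposed (for example by TM-align), that `aligned_pairs` holds matched residue coordinates from the two structures, and that the pocket is defined by distance from the first structure's ligand atoms. The function name and argument layout are hypothetical.

```python
import math


def pocket_rmsd(aligned_pairs, ligand_atoms, radius=10.0, min_pairs=3):
    """RMSD over aligned residue pairs whose first coordinate lies in the pocket.

    aligned_pairs: list of ((x, y, z), (x, y, z)) tuples after superposition.
    ligand_atoms:  list of (x, y, z) ligand coordinates in the first structure.
    Returns None when fewer than `min_pairs` pairs fall inside the pocket.
    """
    def in_pocket(point):
        return any(math.dist(point, atom) <= radius for atom in ligand_atoms)

    pocket = [(p, q) for p, q in aligned_pairs if in_pocket(p)]
    if len(pocket) < min_pairs:
        return None
    squared = sum(math.dist(p, q) ** 2 for p, q in pocket)
    return math.sqrt(squared / len(pocket))
```

Returning `None` for pockets with fewer than three matched pairs keeps unreliable values out of the similarity table rather than reporting a misleadingly small RMSD.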
Selects property-matched, chemically dissimilar decoys per active. Excludes compounds active against receptors similar to the query target.
chembl-curator select-decoys --data-dir curated_data_filtered --max-decoys 30

| Option | Default | Description |
|---|---|---|
| --max-decoys | 30 | Decoys per active |
| --seqid-thresh | 0.6 | Seqid threshold for receptor exclusion |
| --pocket-rmsd-thresh | 3.0 | Pocket RMSD threshold (Å) for receptor exclusion |
| --exclusion-mode | or | or = exclude if seqid OR pocket matches |
| --tanimoto-thresh | 0.3 | Max Tanimoto between active and decoy |
Property matching windows: ±50 Da MW, ±2 cLogP, ±50 Ų TPSA, ±2 HBD, ±2 HBA, ±1 aromatic ring.
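
The three decoy criteria (property windows, chemical dissimilarity, receptor-aware exclusion) can be combined into one check. The sketch below is illustrative rather than the logic in decoy_selector.py. It assumes that "similar receptor" means sequence identity at or above the threshold or pocket RMSD at or below it, covers only the `or` exclusion mode, and uses hypothetical names for the functions, dictionary keys, and the `similarity` lookup.

```python
# Half-widths of the property matching windows listed above.
WINDOWS = {"mw": 50.0, "clogp": 2.0, "tpsa": 50.0, "hbd": 2, "hba": 2, "arom_rings": 1}


def tanimoto(a: frozenset, b: frozenset) -> float:
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0


def property_matched(active: dict, cand: dict) -> bool:
    """True if every property of the candidate lies within its window."""
    return all(abs(active[k] - cand[k]) <= w for k, w in WINDOWS.items())


def similar_receptor(seqid, rmsd, seqid_thresh=0.6, rmsd_thresh=3.0) -> bool:
    """'or' mode: either measure alone marks two receptors as similar."""
    seq_hit = seqid is not None and seqid >= seqid_thresh
    pocket_hit = rmsd is not None and rmsd <= rmsd_thresh
    return seq_hit or pocket_hit


def is_decoy_candidate(active, cand, query_target, similarity, tanimoto_thresh=0.3):
    """similarity maps (query_target, other_target) -> (seqid, pocket_rmsd)."""
    if not property_matched(active, cand):
        return False
    if tanimoto(active["fp"], cand["fp"]) > tanimoto_thresh:
        return False
    for target in cand["targets"]:
        if target == query_target:
            return False  # known active on the query target itself
        seqid, rmsd = similarity.get((query_target, target), (None, None))
        if similar_receptor(seqid, rmsd):
            return False  # active on a receptor too close to the query
    return True
```

A pair missing from the lookup is treated here as dissimilar, which is a design choice of this sketch; a stricter variant could reject candidates whenever similarity data is unavailable.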
Train/test split by sequence-identity clustering (MMseqs2). Greedy assignment balances per-source ratios. Per-entry sampling weight = 1 / log2(cluster_size + 1).
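
The per-entry sampling weight can be computed directly from cluster membership. A minimal sketch, assuming a hypothetical mapping from entry ID to cluster ID (the name `cluster_of` and the function name are illustrative):

```python
import math
from collections import Counter


def sampling_weights(cluster_of: dict) -> dict:
    """Weight each entry by 1 / log2(cluster_size + 1)."""
    sizes = Counter(cluster_of.values())
    return {
        entry: 1.0 / math.log2(sizes[cluster] + 1)
        for entry, cluster in cluster_of.items()
    }
```

A singleton cluster gets weight 1.0 and a three-member cluster gets 0.5 per entry, so large sequence families are down-weighted without being discarded.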
A bundled PDBbind+BioLip FASTA (chembl_curator/assets/external_targets.fasta) is included by default.
# Default: includes bundled PDBbind+BioLip sequences
chembl-curator split --data-dir curated_data_filtered --valid-frac 0.1
# ChEMBL-only split (no external sequences)
chembl-curator split --data-dir curated_data_filtered --no-external
# Use your own external FASTA
chembl-curator split --data-dir curated_data_filtered \
    --external-fasta /path/to/your.fasta

External FASTA ID format: IDs must have the form >{source}.{entry_id}, where source is any label (e.g. pdbbind, biolip, myscreendb) and entry_id is any identifier without spaces. For example:
>pdbbind.1a4k
MPPYTVVY...
>biolip.10gs_VWW_A_1
PYTVVYFP...
>myscreendb.custom_entry_42
MKWVTFIS...
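
Before passing a custom FASTA to the split stage, it may help to check the headers against the format above. The helper below is a hypothetical sketch, not part of the package; it assumes the source label ends at the first dot, so entry IDs may themselves contain dots.

```python
def parse_external_id(header: str) -> tuple:
    """Split a FASTA header of the form '>{source}.{entry_id}' into its parts."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    ident = header[1:].strip()
    source, sep, entry_id = ident.partition(".")
    if not sep or not source or not entry_id:
        raise ValueError(f"expected '>source.entry_id', got {header!r}")
    if any(ch.isspace() for ch in ident):
        raise ValueError(f"ID must not contain spaces: {header!r}")
    return source, entry_id
```

For instance, `parse_external_id(">biolip.10gs_VWW_A_1")` yields the source `biolip` and the entry ID `10gs_VWW_A_1`, while a header without a dot raises `ValueError`.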
Output: train.txt / test.txt (tab-separated, with header):
source entry_id compound weight
chembl A0A0H2UPP7 CHEMBL405346 0.17
chembl A0A0H2UPP7 CHEMBL407216 0.17
biolip 10gs_VWW_A_1 - 0.19
pdbbind 1a4k - 0.26
Output: chembl_targets.tsv (tab-separated, with header):
uniprot split n_actives n_decoys
A0A0H2UPP7 train 2 60
Q9NR56 test 9 270
curated_data_filtered/
├── sequences.fasta # Canonical sequences, all passed targets
├── best_structure.tsv # uniprot -> best PDB chain + resolution
├── compound_pool.pkl # Global compound pool (pickle)
├── pairwise_seqid.tsv # All-vs-all sequence identity
├── pairwise_pocket_rmsd.tsv # All-vs-all pocket RMSD
├── passed_targets.txt # UniProt IDs that passed protein filtering
├── train.txt # Train split entries with weights
├── test.txt # Test split entries with weights
├── chembl_targets.tsv # Per-target summary (split, n_actives, n_decoys)
│
└── {UniProt}/ # Per-target directory
    ├── actives.tsv # chembl_id, pchembl, smiles
    ├── actives_clustered.tsv # + cluster_size column
    ├── decoys.tsv # active_chembl_id -> decoy_ids (;-sep)
    ├── comps/smiles/*.smi # Raw SMILES files from Stage 1
    ├── pdb/ # Downloaded PDB + AlphaFold structures
    ├── aligned/ # Structures aligned to AlphaFold model
    ├── pdbid.list # PDB metadata (method, resolution, chains)
    ├── pocket_info.csv # Ligand pocket coordinates
    └── sequence.fasta # Per-target canonical sequence
ChEMBL-Q/
├── chembl_curator/
│   ├── __init__.py
│   ├── cli.py # CLI entry points
│   ├── config.py # CurationConfig
│   ├── curator.py # Stage 1
│   ├── protein_filter.py # Stage 2
│   ├── active_clusterer.py # Stage 3
│   ├── compound_pool.py # Stage 4
│   ├── receptor_similarity.py # Stage 5
│   ├── decoy_selector.py # Stage 6
│   ├── splitter.py # Stage 7
│   ├── downloader.py
│   ├── filters.py
│   └── assets/
│       ├── excluded_ligands.txt
│       └── external_targets.fasta
├── docs/
│   └── index.html # Interactive pipeline overview
├── pyproject.toml
└── README.md
- ChEMBL - bioactivity database
- AlphaFold DB - predicted protein structures
- RCSB PDB - experimental protein structures
- MMseqs2 - fast sequence search/clustering
- nurikit - Python TMAlign bindings (used for structure alignment)
This project is provided as-is for research purposes.