VHS: Verifier on Hidden States
VHS is a latent verifier framework for best-of-N text-to-image generation. It scores candidate images — or their intermediate latent representations — against a text prompt using a lightweight multimodal language model (Qwen2.5-0.5B + LLaVA), enabling efficient selection of the best generation without running a full evaluator on every sample.
This release provides the inference pipeline and two pre-trained verifiers for evaluating Sana-Sprint (1-step diffusion) on the GeneVal compositional benchmark.
The pipeline works as follows:
- Generate N=32 candidate images from a text prompt with Sana-Sprint.
- Score each candidate with a latent verifier — either at the hidden-layer level (fast) or using CLIP embeddings.
- Select the highest-scoring image.
- Evaluate the selected image on GeneVal (object detection + color + spatial relations via Mask2Former).
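The selection loop above can be sketched in a few lines. This is a simplified illustration, not the release's implementation: `generate` and `verifier` are hypothetical stand-ins for Sana-Sprint sampling and the VHS verifier score.

```python
import torch


def best_of_n(prompt, generate, verifier, n=32):
    """Generate n candidates and return the one the verifier scores highest.

    `generate(prompt)` yields one candidate (image or latent);
    `verifier(prompt, candidate)` returns a scalar alignment score.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = torch.tensor([float(verifier(prompt, c)) for c in candidates])
    return candidates[int(scores.argmax())]
```

In the hidden-layer configuration the `verifier` call operates on intermediate activations, so only the winning candidate needs to be decoded to pixels.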
Two verifier variants are provided:
| Verifier | Vision input | HF checkpoint | Speed |
|---|---|---|---|
| Hidden-layer | Sana transformer block 7 activations | `aimagelab/vhs-hidden-verifier` | Fast (evaluates before full decode) |
| CLIP | CLIP ViT-L/14@336 embeddings | `aimagelab/vhs-mllm-clip-verifier` | Slow (evaluates on completed images) |
```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Note: `requirements.txt` pins `transformers` and `diffusers` to specific commits for reproducibility.
Latent verifier weights are downloaded automatically from the Hugging Face Hub on first use.
Download the Mask2Former detector and normalization statistics required for GeneVal evaluation:
```bash
# Mask2Former detector (required for GeneVal evaluation)
huggingface-cli download aimagelab/vhs-checkpoints \
    mask2former_swin-s-p4-w7-224_8xb2-lsj-50e_coco_20220504_001756-c9d0c4f2.pth \
    --repo-type dataset --local-dir ckpts/geneval-det/

# Hidden-layer activation normalization statistics (required for the hidden-layer verifier)
huggingface-cli download aimagelab/vhs-checkpoints \
    block_7_mean_bf16.pt block_7_variance_bf16.pt \
    --repo-type dataset --local-dir ckpts/normalization/
```

All commands should be run from the `vhs-release/` root.
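The mean/variance tensors standardize block-7 activations before they are fed to the verifier. A minimal sketch of that step (the function name, default paths, and epsilon are illustrative assumptions, not the release's code):

```python
import torch


def normalize_activations(hidden, mean_path, var_path, eps=1e-6):
    """Standardize hidden activations with precomputed per-channel statistics.

    `mean_path` / `var_path` would point at the downloaded
    block_7_mean_bf16.pt / block_7_variance_bf16.pt files.
    """
    mean = torch.load(mean_path)
    var = torch.load(var_path)
    return (hidden - mean) / torch.sqrt(var + eps)
```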
**Hidden-layer verifier (fastest)**

Uses Sana hidden activations at transformer block 7. Generation is interrupted early, scored, and only the winner is decoded at full quality.
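The activation capture behind this mode (implemented in `sana_activation_catcher.py`) boils down to a PyTorch forward hook on one transformer block. A simplified stand-in, using a toy module list rather than the actual Sana transformer:

```python
import torch
import torch.nn as nn


class ActivationCatcher:
    """Capture the output of one transformer block via a forward hook."""

    def __init__(self, block):
        self.activation = None
        self.handle = block.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Store a detached copy so the verifier can score it later.
        self.activation = output.detach()

    def remove(self):
        self.handle.remove()


# Toy stand-in for the diffusion transformer; the release hooks block 7.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
catcher = ActivationCatcher(blocks[7])
x = torch.randn(1, 8)
for blk in blocks:
    x = blk(x)
catcher.remove()
```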
```bash
python inference_scripts/sana_sprint_best_of_n_comple.py \
    --config configs/config_hidden-7_Qwen2.5-0.5B_mlp_train-val-best-CEfocal-loss_alpha-0.37_gamma-0.0_step_1_BO32_step_1_sana_sprint.yaml
```

**CLIP verifier**

Uses CLIP ViT-L/14@336 embeddings on completed images.

```bash
python inference_scripts/sana_sprint_best_of_n_comple.py \
    --config configs/config_clip_orig_Qwen2.5-0.5B_mlp_train-val-best-CE-loss_step_1_BO32_sana_sprint.yaml
```

Configs are YAML files in `configs/`. Key parameters:
| Parameter | Description |
|---|---|
| `latent_verifier_name` | Which verifier to use (must match a key in `inference_scripts/latent_verifier_dict.py`) |
| `vision_tower` | Vision encoder: `hidden_7` for the latent verifier, or a CLIP model ID |
| `num_inference_steps` | Diffusion steps (`1` for Sana-Sprint) |
| `image_width` / `image_height` | Output resolution (default 1024×1024) |
| `generation_mode` | `prod` enables production optimizations (early stopping + winner-only decode for hidden mode) |
| `num_samples` | Number of candidates per prompt (N in best-of-N, default 32) |
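Putting the keys from the table together, a config might look like the following. This is an illustrative fragment, not a shipped config: the actual verifier keys live in `inference_scripts/latent_verifier_dict.py` and the real files are in `configs/`.

```yaml
# Illustrative example only; see configs/ for the real files.
latent_verifier_name: my-verifier-name   # key in latent_verifier_dict.py
vision_tower: hidden_7                   # or a CLIP model ID
num_inference_steps: 1                   # Sana-Sprint is 1-step
image_width: 1024
image_height: 1024
generation_mode: prod                    # early stop + winner-only decode
num_samples: 32                          # N in best-of-N
```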
To add a new verifier, register it in inference_scripts/latent_verifier_dict.py:
```python
LATENT_VERIFIER = {
    "my-verifier-name": {
        "latent_verifier_path": "org/my-hf-model",
        "vision_tower": "hidden_7",  # or a CLIP model ID
    },
}
```

Results are written to `outputs/<run_name>/gen_eval_<run_name>/<prompt_id>/`:
```
<prompt_id>/
├── samples/
│   ├── img_00000.jpg … img_00031.jpg  # all N candidates
│   ├── best_XXXXX.jpg                 # selected best image
│   └── XXXXX_metadata.json            # per-sample scores and feedback
```

Aggregated result files per process:

- `all_annotations_proc_<verifier>_id_proc_<N>.json`: scores for all candidates
- `best_annotations_proc_<verifier>_id_proc_<N>.json`: GeneVal metrics for best-selected images
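Since results are sharded per process, collecting them means merging the per-process JSON files. A small sketch, assuming each file holds a JSON list of annotation records (the exact record schema is defined by the inference script):

```python
import glob
import json


def merge_process_results(pattern):
    """Merge per-process annotation JSON files matching a glob pattern.

    Example pattern:
        outputs/<run_name>/best_annotations_proc_*_id_proc_*.json
    Assumes each file contains a JSON list of records.
    """
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            merged.extend(json.load(f))
    return merged
```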
```
vhs-release/
├── configs/                              # YAML inference configs
├── data/
│   └── evaluation_metadata.json          # GeneVal prompts (100+ structured text-image pairs)
├── inference_scripts/
│   ├── sana_sprint_best_of_n_comple.py   # Main inference + evaluation script
│   ├── sana_activation_catcher.py        # PyTorch hook for capturing hidden activations
│   ├── latent_verifier_dict.py           # Registry of available verifier checkpoints
│   └── object_names.txt                  # COCO-80 class names for GeneVal
├── verifier_scripts/
│   ├── latent_verifier.py                # LatentGemmaVerifier / LatentGemmaFeedback classes
│   └── geneval_utils.py                  # GeneVal evaluator (Mask2Former + CLIP)
├── vhs/                                  # Core VHS package
│   ├── model/
│   │   ├── llava_arch.py                 # LLaVA base architecture
│   │   ├── language_model/
│   │   │   └── llava_qwen.py             # Qwen2/Qwen3 multimodal LLM
│   │   ├── multimodal_encoder/           # Vision towers (CLIP, VAE, hidden-layer)
│   │   └── multimodal_projector/         # Vision-to-LLM projection heads
│   ├── model_loader.py                   # Model instantiation utilities
│   ├── conversation.py                   # Conversation formatting helpers
│   └── mm_utils.py                       # Multimodal utilities
├── requirements.txt
└── README.md
```
This codebase builds on:
- LLaVA — multimodal language model architecture (Apache 2.0)
- Sana-Sprint — 1-step text-to-image diffusion model
- GeneVal — compositional text-to-image evaluation benchmark
- Mask2Former — instance segmentation for object detection
- OpenCLIP — CLIP vision encoders
- Reflect-DiT
The VHS package (vhs/) is released under the Apache License 2.0, consistent with its upstream dependencies (LLaVA, Transformers). See individual file headers for attribution.