Transformer Efficiency Lab
⚠️ Warning: UNDER CONSTRUCTION ⚠️
This is a project to systematically manage, use, and compare the different subsystems that go into the modern transformer. The experiments are organized into four distinct phases:
- Phase 0 : Ground Work, reading the literature and building the mind, muscle, and intuition for training LLMs from the ground up. ✅
- Phase 1 : Dense Decoder Lab, focused on pre-training and architecture.
- Phase 2 : Inference Efficiency Lab: KV cache, inference, and serving.
- Phase 3 : Sparse Pre-Training Lab: everything MoE.
There are several other topics I wish to explore that are adjacent but currently out of scope: parallelism (DDP, TP, EP; maybe with JAX, or along the way as I train bigger models), state space models (Mamba), etc.
The first order of business is to establish a solid baseline for Phase 1, i.e. a model_spec and eval metrics; let's call this Phase 0. Once this is fortified, we can step into ablation studies in Phase 1: encoding, attention, KV cache, etc. After we have experimented through Phase 1, the best version of the model will be used as the baseline for Phase 2.
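To make the baseline concrete, a model_spec can simply be a frozen dataclass that every experiment consumes. This is only a minimal sketch; the class name, fields, and default values are illustrative assumptions, not the project's actual spec:

```python
from dataclasses import dataclass, asdict

# Hypothetical Phase-0 model spec; field names and defaults are
# illustrative assumptions, not the project's real configuration.
@dataclass(frozen=True)
class ModelSpec:
    vocab_size: int = 32_000
    d_model: int = 512
    n_layers: int = 8
    n_heads: int = 8
    seq_len: int = 1024

    def n_params_approx(self) -> int:
        # Rough count: token embeddings plus ~12 * d_model^2 weights
        # per transformer block (attention + MLP).
        return self.vocab_size * self.d_model + 12 * self.n_layers * self.d_model ** 2

spec = ModelSpec()
print(asdict(spec))          # serializable, so it can be logged alongside every run
print(spec.n_params_approx())  # ~41.5M params for these defaults
```

Freezing the dataclass keeps the baseline immutable; each ablation then derives a new spec via `dataclasses.replace`, so every run's configuration is explicit and loggable.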
Some pointers to self:
- Avoid one-off notebooks and scripts; build up a run-and-eval pipeline and use that for experiments. Develop with config-driven design in mind.
- Have compulsory reporting “If it’s not logged, it didn’t happen!”
- Define data constraints: a fixed token budget shared between Phase 1 and Phase 2.
- Be hypothesis driven and document everything.
Some similar projects out there that enable rapid experimentation:
- https://github.com/pytorch/torchtitan
- https://github.com/google-deepmind/simply
- https://github.com/karpathy/autoresearch (senpai dropped this yesterday!)