count_tokens

Count tokens in The Pile training data.

Usage

Output CSV file format:

<token_index>,<quantity>
<token_index>,<quantity>

All zero-quantity tokens will be at the end.
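
For anyone consuming the output, here is a minimal sketch of parsing one line of the format above. The helper name `parse_count_line` and the integer widths are assumptions made for illustration; they are not taken from count_tokens.c.

```c
#include <stdio.h>

/* Hypothetical helper, not part of this repository.
 * Parses one line of the form "<token_index>,<quantity>".
 * Returns 1 on success and 0 if the line is malformed. */
int parse_count_line(const char *line, unsigned long *token_index,
                     unsigned long long *quantity)
{
    char trailing;
    /* The space before %c skips a trailing newline; if %c then matches
     * anything, the line has extra characters and is rejected. */
    int matched = sscanf(line, "%lu,%llu %c", token_index, quantity, &trailing);
    return matched == 2;
}
```

Each line read from the CSV (for example with `fgets`) can be passed to this function. Since zero-quantity tokens are grouped at the end, a reader that only needs tokens that actually occur could stop at the first line whose quantity is 0.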

On Linux, run the run.sh script; it does all the work automatically. About 60 GB of free disk space is required.
