API Reference: tokenizers extension v0.1.0
For guides and tutorials, see the guides and Getting Started.
Table of Contents
- Constants
- \Tokenizers\Bpe
  - Static factory methods
  - Instance methods
- \Tokenizers\WordPiece
  - Static factory method
  - Instance methods
  - $opts reference
- \Tokenizers\Unigram
  - Static factory method
  - Instance methods
  - $opts reference
- \Tokenizers\Encoding
  - Cache directory resolution
- Procedural functions
- Remote companion: \Tokenizers\Remote
  - Transport interface
  - Anthropic
  - Gemini
  - Environment variable reference
- \Tokenizers\TokenCounter
- \Tokenizers\TokenizerException
- See also
1. Constants
\Tokenizers\VERSION
The extension version string. Also available via the procedural helper tokenizers_version().
2. \Tokenizers\Bpe
Native C class. Implements byte-level BPE (byte-pair encoding), compatible with OpenAI's tiktoken library. Suitable for cl100k_base (GPT-4, GPT-3.5 Turbo), o200k_base (GPT-4o, o1, o3), and open-weight models loaded from a HuggingFace tokenizer.json (GPT-2, RoBERTa, Llama 3, Mistral, Qwen, DeepSeek).
Merge complexity is O(n log n) — heap-based, so adversarial inputs do not trigger quadratic blowup. The vocabulary is loaded once per worker process into a process-global cache (ZTS-safe via a TSRM mutex).
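To make the merge step concrete, here is a minimal pure-PHP sketch of rank-ordered byte-pair merging on one pre-split piece. It is the naive form that rescans every adjacent pair after each merge, shown only for illustration; it is not the extension's heap-based implementation. The function name `bpe_merge_naive` and the toy `$ranks` table are hypothetical.

```php
<?php
/**
 * Naive byte-pair merge (illustrative only, quadratic in the worst case).
 * Repeatedly merges the adjacent pair whose concatenation has the lowest
 * rank, until no adjacent pair is present in $ranks.
 *
 * @param string            $piece one pre-split chunk of bytes
 * @param array<string,int> $ranks token bytes => rank (lower merges first)
 * @return string[] the resulting token byte strings
 */
function bpe_merge_naive(string $piece, array $ranks): array
{
    $parts = str_split($piece); // start from single bytes
    while (count($parts) > 1) {
        $bestIdx = -1;
        $bestRank = PHP_INT_MAX;
        // Scan all adjacent pairs and remember the lowest-ranked one.
        for ($i = 0; $i < count($parts) - 1; $i++) {
            $pair = $parts[$i] . $parts[$i + 1];
            if (isset($ranks[$pair]) && $ranks[$pair] < $bestRank) {
                $bestRank = $ranks[$pair];
                $bestIdx = $i;
            }
        }
        if ($bestIdx < 0) {
            break; // nothing left to merge
        }
        // Replace the two parts with their concatenation.
        array_splice($parts, $bestIdx, 2, [$parts[$bestIdx] . $parts[$bestIdx + 1]]);
    }
    return $parts;
}

// Hypothetical toy vocabulary.
$ranks = ['a' => 0, 'b' => 1, 'c' => 2, 'ab' => 3, 'abc' => 4];
$tokens = bpe_merge_naive('abcab', $ranks);
// ['abc', 'ab']
```

A heap keyed on rank avoids the full rescan after each merge, which is where the O(n log n) bound quoted above comes from.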
Static factory methods
Bpe::fromTiktokenFile()
```php
public static function fromTiktokenFile(
    string $path,
    string $pattern,
    array $specialTokens = []
): Bpe;
```
Loads a vocabulary from a tiktoken .tiktoken file on disk. $path is the absolute path to the file. $pattern is the splitting regex (e.g. the cl100k_base pattern). $specialTokens is a map of special-token string to integer id.
```php
$bpe = \Tokenizers\Bpe::fromTiktokenFile(
    '/path/to/cl100k_base.tiktoken',
    '/(?i:\'s|\'t|\'re|\'ve|\'m|\'ll|\'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+/',
    ['<|endoftext|>' => 100257]
);
```
Bpe::fromVocab()
```php
public static function fromVocab(
    array $tokenBytesToId,
    array $merges,
    string $pattern,
    array $specialTokens = []
): Bpe;
```
Constructs a Bpe instance directly from in-memory data. $tokenBytesToId maps base64-encoded byte sequences to integer ids. $merges is an ordered list of merge pairs. $pattern is the splitting regex. $specialTokens maps special-token strings to ids.
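As an illustration of the base64-keyed map described for $tokenBytesToId, the sketch below parses `.tiktoken`-style text, where each line holds base64 token bytes, a space, and an integer rank. The helper name `parse_tiktoken_lines` is hypothetical and is not part of the extension. How `$merges` should be derived is not covered here.

```php
<?php
/**
 * Parse .tiktoken-style text into a base64 => id map.
 * Each non-empty line is "<base64 token bytes> <rank>".
 * Hypothetical helper for illustration; not part of the extension.
 *
 * @return array<string,int>
 */
function parse_tiktoken_lines(string $contents): array
{
    $map = [];
    foreach (preg_split('/\R/', $contents) as $lineNo => $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines, including the trailing one
        }
        $fields = explode(' ', $line);
        if (count($fields) !== 2
            || !ctype_digit($fields[1])
            || base64_decode($fields[0], true) === false) {
            throw new InvalidArgumentException('malformed line ' . ($lineNo + 1));
        }
        // Note: PHP casts purely numeric keys such as "1234" to int.
        $map[$fields[0]] = (int) $fields[1];
    }
    return $map;
}

$sample = "IQ== 0\nIg== 1\nSGVsbG8= 2\n";
$map = parse_tiktoken_lines($sample);
// ['IQ==' => 0, 'Ig==' => 1, 'SGVsbG8=' => 2]; base64_decode('SGVsbG8=') is 'Hello'
```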
Instance methods
encode()
```php
public function encode(
    string $text,
    array|string $allowedSpecial = [],
    array|string $disallowedSpecial = "all"
): array; // int[]
```
Encodes $text into an array of integer token ids.
Special-token handling: By default $disallowedSpecial = "all" means every special token in the vocabulary is treated as plain text (i.e. it will not be encoded as a single special-token id). Pass "all" to $allowedSpecial to allow all special tokens to be encoded as their ids, or pass an explicit list such as ['<|endoftext|>'] to allow only those tokens.
| Parameter | Type | Default | Description |
|---|---|---|---|
| $text | string | (required) | Input text to encode |
| $allowedSpecial | array\|string | [] | Tokens that may be encoded as special ids. Pass "all" or a list. |
| $disallowedSpecial | array\|string | "all" | Tokens that must NOT appear; causes an error if encountered. |
```php
$ids = $bpe->encode('Hello world');
// [9906, 1917]

// Allow the end-of-text marker to pass through as a special id:
$ids = $bpe->encode('<|endoftext|>text', allowedSpecial: ['<|endoftext|>']);

// Allow every special token:
$ids = $bpe->encode($text, allowedSpecial: 'all');
```
countTokens()
Returns the number of tokens $text encodes to, without allocating the full id array. Faster than count($this->encode($text)) for large inputs.
decode()
Decodes an array of integer token ids back to a UTF-8 string.
decodeSingle()
Decodes a single token id to its byte representation. Useful for inspecting individual tokens.
vocabSize()
Returns the total vocabulary size (base tokens + special tokens). For cl100k_base this is 100277.
name()
Returns the encoding name, or null. In v0.1 this always returns null — name tracking is not yet implemented.
3. \Tokenizers\WordPiece
Native C class. Implements greedy longest-match WordPiece tokenization (BERT family). Uses a ## continuation prefix for subword pieces and falls back to [UNK] for out-of-vocabulary words.
Normalization scope (v0.1): Latin-1 and CJK-spacing only. Non-Latin scripts that require full Unicode NFD decomposition are out of scope for v0.1 and may produce different results from the reference.
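The greedy longest-match rule can be sketched in pure PHP for a single, already normalised word. This is an illustrative sketch under the usual BERT convention that a word with any unmatched remainder maps entirely to [UNK]; it is not the extension's C code. The function name `wordpiece_word` and the toy vocabulary are hypothetical.

```php
<?php
/**
 * Greedy longest-match WordPiece for one whitespace-free word.
 * Illustrative sketch; assumes the whole word becomes [UNK] when any
 * remainder cannot be matched.
 *
 * @param array<string,int> $vocab token => id
 * @return int[]
 */
function wordpiece_word(
    string $word,
    array $vocab,
    string $unk = '[UNK]',
    string $prefix = '##',
    int $maxChars = 100
): array {
    // Split into Unicode characters without requiring ext-mbstring.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    if ($n > $maxChars) {
        return [$vocab[$unk]];
    }
    $ids = [];
    $start = 0;
    while ($start < $n) {
        $end = $n;
        $match = null;
        // Try the longest candidate first, then shrink from the right.
        while ($end > $start) {
            $sub = implode('', array_slice($chars, $start, $end - $start));
            if ($start > 0) {
                $sub = $prefix . $sub; // continuation pieces carry the prefix
            }
            if (isset($vocab[$sub])) {
                $match = $vocab[$sub];
                break;
            }
            $end--;
        }
        if ($match === null) {
            return [$vocab[$unk]];
        }
        $ids[] = $match;
        $start = $end;
    }
    return $ids;
}

// Hypothetical toy vocabulary.
$vocab = ['[UNK]' => 0, 'un' => 1, '##aff' => 2, '##able' => 3, 'play' => 4, '##ing' => 5];
$a = wordpiece_word('unaffable', $vocab); // [1, 2, 3]
$b = wordpiece_word('playing', $vocab);   // [4, 5]
$c = wordpiece_word('xyz', $vocab);       // [0]
```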
Static factory method
WordPiece::fromVocab()
Builds a WordPiece tokenizer from a vocabulary map. $tokenToId maps token strings (e.g. "hello", "##ing") to their integer ids. See $opts reference below.
To load directly from a BERT-style vocab.txt file, use \Tokenizers\Encoding::wordPieceFromVocabFile().
Instance methods
encode()
Encodes $text and returns an array of integer token ids.
countTokens()
Returns the token count without materialising the full id array.
decode()
Reconstructs text from a list of token ids. Strips the ## continuation prefix when reassembling subword pieces.
vocabSize()
Returns the size of the vocabulary.
$opts reference
All keys are optional. Unspecified keys use the listed defaults.
| Key | Type | Default | Description |
|---|---|---|---|
| unkToken | string | "[UNK]" | Token used for out-of-vocabulary words |
| continuingSubwordPrefix | string | "##" | Prefix prepended to continuation subword pieces |
| maxInputCharsPerWord | int | 100 | Words longer than this character limit are mapped to [UNK] |
| lowercase | bool | true | Lowercases input before tokenization |
| stripAccents | bool | true | Strips diacritic accents (Latin-1 scope) |
| handleChineseChars | bool | true | Pads CJK codepoints with spaces before tokenization |
4. \Tokenizers\Unigram
Native C class. Implements SentencePiece Unigram tokenization (T5, ALBERT). Uses a Metaspace (▁, U+2581) to encode leading spaces, and Viterbi best-path decoding over piece log-probability scores (f64).
Normalization scope (v0.1): Metaspace and identity-on-ASCII. Inputs requiring NFKC normalization, and some whitespace-edge cases (leading/trailing/multiple spaces), may differ from the reference tokenizer.
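The Viterbi best-path idea can be sketched as a small dynamic program over character positions. This is an illustrative sketch, not the extension's implementation: the function name `unigram_viterbi`, the toy scores, and the fixed penalty for an unknown single character are all assumptions.

```php
<?php
/**
 * Viterbi best-path segmentation over piece log-probabilities.
 * Illustrative sketch; the single-character fallback penalty is an assumption.
 *
 * @param array<string,float> $scores piece => log-probability
 * @return string[] the highest-scoring sequence of pieces
 */
function unigram_viterbi(string $text, array $scores, float $unkScore = -100.0): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $best = array_fill(0, $n + 1, -INF); // best score of any path ending at position i
    $back = array_fill(0, $n + 1, 0);    // start of the last piece on that best path
    $best[0] = 0.0;
    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            $piece = implode('', array_slice($chars, $start, $end - $start));
            if (isset($scores[$piece])) {
                $score = $scores[$piece];
            } elseif ($end - $start === 1) {
                $score = $unkScore; // unknown single character, heavily penalised
            } else {
                continue;
            }
            if ($best[$start] + $score > $best[$end]) {
                $best[$end] = $best[$start] + $score;
                $back[$end] = $start;
            }
        }
    }
    // Walk the back-pointers from the end to recover the pieces.
    $pieces = [];
    for ($pos = $n; $pos > 0; $pos = $back[$pos]) {
        $len = $pos - $back[$pos];
        array_unshift($pieces, implode('', array_slice($chars, $back[$pos], $len)));
    }
    return $pieces;
}

// Metaspace step: add a prefix marker and replace spaces with U+2581.
$input = "\u{2581}" . str_replace(' ', "\u{2581}", 'hello world');
$scores = [
    "\u{2581}hello" => -2.0,
    "\u{2581}world" => -2.5,
    "\u{2581}" => -1.0,
    'hello' => -3.0,
    'world' => -3.0,
    'he' => -1.5,
    'llo' => -1.5,
];
$pieces = unigram_viterbi($input, $scores);
// ["▁hello", "▁world"]
```

With these toy scores the whole-word pieces win because their summed log-probability (-4.5) is higher than any split into smaller pieces.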
Static factory method
Unigram::fromVocab()
Constructs a Unigram tokenizer from a list of (piece, score) pairs. $pieces is an array of [string, float] entries as found in a SentencePiece tokenizer.json. See $opts reference below.
To load from a HuggingFace tokenizer.json automatically, use \Tokenizers\Encoding::fromHuggingFace().
Instance methods
encode()
Encodes $text and returns an array of integer token ids via Viterbi best-path search.
countTokens()
Returns the token count without materialising the full id array.
decode()
Reconstructs text from a list of token ids, replacing leading ▁ characters with spaces.
vocabSize()
Returns the size of the vocabulary.
$opts reference
All keys are optional. Unspecified keys use the listed defaults.
| Key | Type | Default | Description |
|---|---|---|---|
| unkId | int | id of <unk> if present, else 0 | Token id to emit for unknown pieces |
| addPrefixSpace | bool | true | Prepend a Metaspace (▁) to the input before encoding |
5. \Tokenizers\Encoding
Pure-PHP shim (php/Tokenizers/Encoding.php). Provides high-level loaders that download, checksum-verify, and cache vocabulary files, then return the appropriate native tokenizer instance. Also dispatches HuggingFace tokenizer JSON files to the correct C class.
Encoding::load()
Loads a named built-in encoding. Downloads and checksum-verifies the vocabulary on first use; subsequent calls return from the process-global cache. Currently known encodings: 'cl100k_base' and 'o200k_base'.
Throws \Tokenizers\TokenizerException("unknown encoding: <name>") for unrecognised names.
```php
use Tokenizers\Encoding;

$enc = Encoding::load('cl100k_base');
echo $enc->countTokens('Hello world'); // 2
echo $enc->vocabSize(); // 100277
```
Encoding::fromHuggingFace()
Reads a HuggingFace tokenizer.json file and auto-dispatches by the model.type field:
| model.type value | Returned class |
|---|---|
| "BPE" | \Tokenizers\Bpe |
| "WordPiece" | \Tokenizers\WordPiece |
| "Unigram" | \Tokenizers\Unigram |
```php
$bpe = Encoding::fromHuggingFace('path/to/llama3/tokenizer.json'); // Bpe
$wp = Encoding::fromHuggingFace('path/to/bert/tokenizer.json'); // WordPiece
$ug = Encoding::fromHuggingFace('path/to/t5/tokenizer.json'); // Unigram
```
Encoding::wordPieceFromVocabFile()
Loads a plain BERT-style vocab.txt file (one token per line, line number = id) into a WordPiece tokenizer. $opts accepts the same keys as WordPiece::fromVocab().
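For illustration, the "line number = id" layout can be turned into the token => id map that WordPiece::fromVocab() is described as taking. The sketch assumes ids are zero-based (the first line gets id 0), which matches published BERT vocab files but is an assumption about this extension. The helper name `vocab_txt_to_map` is hypothetical.

```php
<?php
/**
 * Turn vocab.txt contents into a token => id map.
 * Assumption: id is the zero-based line index.
 * Hypothetical helper for illustration; not part of the extension.
 *
 * @return array<string,int>
 */
function vocab_txt_to_map(string $contents): array
{
    // Drop the trailing newline so it does not create an empty last token.
    $lines = preg_split('/\r\n|\n|\r/', rtrim($contents, "\r\n"));
    $map = [];
    foreach ($lines as $id => $token) {
        // Note: PHP stores purely numeric tokens such as "1" under int keys.
        $map[$token] = $id;
    }
    return $map;
}

$map = vocab_txt_to_map("[PAD]\n[UNK]\nhello\n##ing\n");
// ['[PAD]' => 0, '[UNK]' => 1, 'hello' => 2, '##ing' => 3]
```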
Encoding::cacheDir()
Returns the resolved path to the directory where downloaded vocabulary files are stored. Useful for debugging cache location.
Encoding::download()
Low-level utility. Downloads $url to $dest, optionally verifying the file against $sha256. Used internally by Encoding::load(); you generally do not need to call this directly.
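The checksum step can be reproduced with PHP's standard hashing functions, for example to re-verify a cached file by hand. This is a minimal sketch using `hash_file()` and `hash_equals()`; the helper name `file_matches_sha256` is hypothetical and does not describe how Encoding::download() is implemented internally.

```php
<?php
/**
 * Return true if the file at $path hashes to $expectedSha256 (hex string).
 * Hypothetical helper for illustration; not part of the extension.
 */
function file_matches_sha256(string $path, string $expectedSha256): bool
{
    $actual = hash_file('sha256', $path); // lowercase hex, or false on failure
    return $actual !== false && hash_equals(strtolower($expectedSha256), $actual);
}

// Demonstration on a temporary file containing "abc".
$tmp = tempnam(sys_get_temp_dir(), 'tok');
file_put_contents($tmp, 'abc');
$ok = file_matches_sha256(
    $tmp,
    'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
);
$bad = file_matches_sha256($tmp, str_repeat('0', 64));
unlink($tmp);
// $ok is true, $bad is false
```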
Cache directory resolution
Encoding::load() stores downloaded vocabulary files in the first directory that resolves from the following ordered list:
1. $TOKENIZERS_CACHE_DIR/tokenizers
2. $XDG_CACHE_HOME/tokenizers
3. $HOME/.cache/tokenizers
4. sys_get_temp_dir()/tokenizers
Built-in vocabulary files are downloaded from OpenAI's public CDN on first use, checksum-verified, and are never redistributed with the extension.
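The lookup order can be sketched as a pure function over an environment map, which is handy for predicting where files will land in a given deployment. The sketch assumes a variable "resolves" when it is set to a non-empty string; the extension may apply further checks (such as writability) that are not shown. The function name `resolve_cache_dir` is hypothetical.

```php
<?php
/**
 * Resolve the cache directory from an environment map, following the
 * documented order. Assumption: a variable resolves when it is set to a
 * non-empty string. Hypothetical helper; not part of the extension.
 *
 * @param array<string,string> $env
 */
function resolve_cache_dir(array $env, string $tmpDir): string
{
    $isSet = fn(string $key): bool => ($env[$key] ?? '') !== '';

    if ($isSet('TOKENIZERS_CACHE_DIR')) {
        $base = $env['TOKENIZERS_CACHE_DIR'];
    } elseif ($isSet('XDG_CACHE_HOME')) {
        $base = $env['XDG_CACHE_HOME'];
    } elseif ($isSet('HOME')) {
        $base = $env['HOME'] . '/.cache';
    } else {
        $base = $tmpDir; // stands in for sys_get_temp_dir()
    }
    return rtrim($base, '/') . '/tokenizers';
}

$a = resolve_cache_dir(['XDG_CACHE_HOME' => '/var/cache', 'HOME' => '/home/app'], '/tmp');
// '/var/cache/tokenizers'
$b = resolve_cache_dir(['HOME' => '/home/app'], '/tmp');
// '/home/app/.cache/tokenizers'
$c = resolve_cache_dir([], '/tmp');
// '/tmp/tokenizers'
```

In real code the map would typically come from `getenv()` and the fallback from `sys_get_temp_dir()`; Encoding::cacheDir() reports the path the extension actually resolved.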
6. Procedural functions
Global-namespace procedural functions. Every function that operates on a tokenizer accepts only a \Tokenizers\Bpe instance; WordPiece and Unigram instances are not supported.
tokenizers_version()
Returns the extension version string "0.1.0". Equivalent to \Tokenizers\VERSION.
tokenizers_cache_count()
Returns the number of tokenizer models currently held in the process-global vocabulary cache. Useful for diagnostics.
tokenizers_encode()
```php
function tokenizers_encode(
    \Tokenizers\Bpe $t,
    string $text,
    array $allowedSpecial = [],
    array|string $disallowedSpecial = "all"
): array; // int[]
```
Procedural wrapper for Bpe::encode(). See encode() for $allowedSpecial and $disallowedSpecial semantics.
tokenizers_decode()
Procedural wrapper for Bpe::decode().
tokenizers_count()
Procedural wrapper for Bpe::countTokens().
7. Remote companion: \Tokenizers\Remote
Pure-PHP classes that count tokens via the official provider APIs. They work without the C extension loaded: a TokenizerException polyfill is bootstrapped via require_once of php/Tokenizers/TokenizerException.php. They require ext-curl and ext-json and use raw curl, with no dependency on anthropic-ai/sdk, Guzzle, or any other HTTP library.
Note: Claude 3+ and Gemini models have no local tokenizer. Exact token counts for those models require a network call and a valid API key. Providers may change their tokenizer at any time.
Transport interface
```php
interface Transport {
    public function post(
        string $url,
        array $headers,
        string $body,
        int $timeout
    ): array; // ['status' => int, 'body' => string]
}
```
The HTTP abstraction used by Anthropic and Gemini. The default implementation is CurlTransport. Inject a custom implementation for offline testing.
```php
class FakeTransport implements \Tokenizers\Remote\Transport {
    public function post(string $url, array $headers, string $body, int $timeout): array {
        return ['status' => 200, 'body' => '{"input_tokens":5}'];
    }
}
```
Anthropic
```php
final class Anthropic {
    public function __construct(
        ?string $apiKey = null,
        ?Transport $transport = null,
        string $version = '2023-06-01',
        int $timeout = 30
    );
    public function countTokens(
        string $model,
        string|array $messages,
        ?string $system = null
    ): int;
}
```
Counts tokens for an Anthropic (Claude) model via the official API.
countTokens() details:
- Endpoint: POST https://api.anthropic.com/v1/messages/count_tokens
- Headers: x-api-key: <key>, anthropic-version: 2023-06-01, content-type: application/json
- Body: {"model": <model>, "messages": [...], "system": <system?>}
- A plain string for $messages becomes a single {"role":"user","content": <text>} turn.
- An array is sent as-is (for multi-turn conversations).
- Response field parsed: input_tokens
- API key resolution: $apiKey constructor argument, otherwise ANTHROPIC_API_KEY environment variable. A missing key throws TokenizerException at call time.
- Errors: Non-2xx HTTP status or a malformed/missing input_tokens field throws TokenizerException.
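The request body rules above can be sketched as a small builder, which may help when writing a custom Transport for tests. It assumes the system key is omitted entirely when no system prompt is given, which is one reading of the `<system?>` notation. The function name `anthropic_count_body` is hypothetical and is not part of the library.

```php
<?php
/**
 * Build the JSON request body described in the list above.
 * Assumption: "system" is left out when $system is null.
 * Hypothetical helper for illustration; not part of the library.
 *
 * @param string|array $messages plain text, or a list of role/content turns
 */
function anthropic_count_body(string $model, string|array $messages, ?string $system = null): string
{
    if (is_string($messages)) {
        // A plain string becomes a single user turn.
        $messages = [['role' => 'user', 'content' => $messages]];
    }
    $payload = ['model' => $model, 'messages' => $messages];
    if ($system !== null) {
        $payload['system'] = $system;
    }
    return json_encode($payload, JSON_THROW_ON_ERROR);
}

$body = anthropic_count_body('claude-opus-4-8', 'Hello', 'Be brief.');
// {"model":"claude-opus-4-8","messages":[{"role":"user","content":"Hello"}],"system":"Be brief."}

// Reading the documented response field from a response body:
$response = json_decode('{"input_tokens":5}', true, 512, JSON_THROW_ON_ERROR);
$count = $response['input_tokens']; // 5
```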
```php
use Tokenizers\Remote\Anthropic;

$anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello, world!');

// Multi-turn:
$n = $anthropic->countTokens('claude-opus-4-8', [
    ['role' => 'user', 'content' => 'Hi'],
    ['role' => 'assistant', 'content' => 'Hello!'],
    ['role' => 'user', 'content' => 'How are you?'],
]);

// With a system prompt:
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello', system: 'You are a helpful assistant.');
```
Gemini
```php
final class Gemini {
    public function __construct(
        ?string $apiKey = null,
        ?Transport $transport = null,
        int $timeout = 30
    );
    public function countTokens(string $model, string $text): int;
}
```
Counts tokens for a Google Gemini model via the official API.
countTokens() details:
- Endpoint: POST https://generativelanguage.googleapis.com/v1beta/models/{model}:countTokens
  The {model} segment is normalised: both "gemini-1.5-flash" and "models/gemini-1.5-flash" are accepted; the leading models/ is added or preserved exactly once, never double-prefixed.
- Header: x-goog-api-key: <key>
- Body: {"contents":[{"parts":[{"text": <text>}]}]}
- Response field parsed: totalTokens
- API key resolution: $apiKey constructor argument, then GEMINI_API_KEY env var, then GOOGLE_API_KEY env var. A missing key throws TokenizerException at call time.
- Errors: Non-2xx HTTP status or a malformed/missing totalTokens field throws TokenizerException.
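The model-name normalisation and request shape can be sketched with three small helpers. This is an illustrative sketch built from the list above, not the library's code; the function names `gemini_model_segment`, `gemini_count_url`, and `gemini_count_body` are hypothetical.

```php
<?php
/**
 * Ensure the model segment carries exactly one leading "models/".
 * Hypothetical helper for illustration; not part of the library.
 */
function gemini_model_segment(string $model): string
{
    return str_starts_with($model, 'models/') ? $model : 'models/' . $model;
}

/** Build the countTokens endpoint URL for a model name in either form. */
function gemini_count_url(string $model): string
{
    return 'https://generativelanguage.googleapis.com/v1beta/'
        . gemini_model_segment($model) . ':countTokens';
}

/** Build the JSON body that wraps the text in contents/parts. */
function gemini_count_body(string $text): string
{
    return json_encode(
        ['contents' => [['parts' => [['text' => $text]]]]],
        JSON_THROW_ON_ERROR
    );
}

$url1 = gemini_count_url('gemini-1.5-flash');
$url2 = gemini_count_url('models/gemini-1.5-flash');
// Both: https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:countTokens
$body = gemini_count_body('Hi');
// {"contents":[{"parts":[{"text":"Hi"}]}]}
```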
```php
use Tokenizers\Remote\Gemini;

$gemini = new Gemini(); // reads GEMINI_API_KEY or GOOGLE_API_KEY from env
$n = $gemini->countTokens('gemini-1.5-flash', 'Hello, world!');

// Both forms are equivalent:
$n = $gemini->countTokens('models/gemini-1.5-flash', 'Hello, world!');
```
Environment variable reference
| Provider | Environment variable(s) | Constructor override |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | new Anthropic(apiKey: '...') |
| Gemini | GEMINI_API_KEY (checked first), then GOOGLE_API_KEY | new Gemini(apiKey: '...') |
8. \Tokenizers\TokenCounter
```php
final class TokenCounter {
    public function __construct(
        ?Anthropic $anthropic = null,
        ?Gemini $gemini = null
    );
    public static function route(string $model): string; // 'anthropic' | 'gemini' | 'local'
    public function count(
        string $model,
        string $text,
        ?string $provider = null
    ): int;
}
```
High-level facade that dispatches token counting to the right backend based on the model name. Defined in php/Tokenizers/TokenCounter.php.
route() — routing rules (no network call):
| Model prefix | Returns |
|---|---|
| claude or anthropic | 'anthropic' |
| gemini or models/gemini | 'gemini' |
| anything else | 'local' |
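The prefix table can be sketched as a plain function, for example to unit-test routing expectations without loading the library. The sketch assumes a case-sensitive prefix match, which the table does not state. The function name `route_model` is hypothetical.

```php
<?php
/**
 * Map a model name to a backend using the prefix table above.
 * Assumption: prefixes are matched case-sensitively.
 * Hypothetical helper for illustration; not part of the library.
 */
function route_model(string $model): string
{
    $table = [
        'anthropic' => ['claude', 'anthropic'],
        'gemini'    => ['gemini', 'models/gemini'],
    ];
    foreach ($table as $backend => $prefixes) {
        foreach ($prefixes as $prefix) {
            if (str_starts_with($model, $prefix)) {
                return $backend;
            }
        }
    }
    return 'local'; // anything else is treated as a local encoding name
}

$r1 = route_model('claude-sonnet-4');         // 'anthropic'
$r2 = route_model('models/gemini-1.5-flash'); // 'gemini'
$r3 = route_model('cl100k_base');             // 'local'
```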
count() — dispatch logic:
- 'anthropic' → calls Anthropic->countTokens($model, $text)
- 'gemini' → calls Gemini->countTokens($model, $text)
- 'local' → calls Encoding::load($model)->countTokens($text)
- Passing an explicit $provider that is not one of the three recognised values throws TokenizerException("unknown provider '<p>' for model: <model>").
```php
use Tokenizers\TokenCounter;

$tc = new TokenCounter();

// Local (no network, no key needed):
$n = $tc->count('cl100k_base', $text);

// Remote Anthropic (needs ANTHROPIC_API_KEY):
$n = $tc->count('claude-opus-4-8', $text);

// Remote Gemini (needs GEMINI_API_KEY or GOOGLE_API_KEY):
$n = $tc->count('gemini-1.5-flash', $text);

// Inject pre-configured clients:
$tc = new TokenCounter(
    anthropic: new \Tokenizers\Remote\Anthropic(apiKey: 'ant-...'),
    gemini: new \Tokenizers\Remote\Gemini(apiKey: 'AIza...')
);

// Force a specific provider ('local' still passes $model to Encoding::load(),
// so it must be a known encoding name):
$n = $tc->count('o200k_base', $text, provider: 'local');

// Inspect routing without counting:
echo TokenCounter::route('claude-sonnet-4'); // 'anthropic'
echo TokenCounter::route('gemini-pro'); // 'gemini'
echo TokenCounter::route('cl100k_base'); // 'local'
```
9. \Tokenizers\TokenizerException
Thrown by all classes in this extension for error conditions. Extends the standard \RuntimeException, so it can be caught as either \Tokenizers\TokenizerException or \RuntimeException.
Thrown in the following situations:
- Encoding::load() with an unknown encoding name: "unknown encoding: <name>"
- Anthropic::countTokens() / Gemini::countTokens() with a missing API key (at call time)
- Anthropic::countTokens() / Gemini::countTokens() on a non-2xx HTTP response
- Anthropic::countTokens() / Gemini::countTokens() on a malformed or missing response field
- TokenCounter::count() with an unknown explicit $provider: "unknown provider '<p>' for model: <model>"
```php
use Tokenizers\Encoding;
use Tokenizers\TokenizerException;

try {
    $enc = Encoding::load('p50k_base'); // not bundled in v0.1
} catch (TokenizerException $e) {
    echo $e->getMessage(); // "unknown encoding: p50k_base"
}
```
See also
- Getting Started — installation, enabling the extension, first tokenization
- Status & Limitations — conformance results, known limitations, roadmap
- Guide: Estimating Costs — budget LLM API costs before calling
- Guide: Loading Models — load OpenAI/HF BPE, WordPiece, Unigram, and the cache
- Guide: Remote Providers — Claude/Gemini API companion, key setup, honest boundaries