API Reference — `tokenizers` extension v0.1.0¶

For guides and tutorials, see the Guides and Getting Started pages.

Table of Contents¶

Constants
\Tokenizers\Bpe
Static factory methods
Instance methods
\Tokenizers\WordPiece
Static factory method
Instance methods
$opts reference
\Tokenizers\Unigram
Static factory method
Instance methods
$opts reference
\Tokenizers\Encoding
Cache directory resolution
Procedural functions
Remote companion — \Tokenizers\Remote
Transport interface
Anthropic
Gemini
Environment variable reference
\Tokenizers\TokenCounter
\Tokenizers\TokenizerException
See also

1. Constants¶

`\Tokenizers\VERSION`¶

const \Tokenizers\VERSION = "0.1.0";

The extension version string. Also available via the procedural helper tokenizers_version().

echo \Tokenizers\VERSION; // "0.1.0"

2. `\Tokenizers\Bpe`¶

Native C class. Implements byte-level BPE (byte-pair encoding), compatible with OpenAI's tiktoken library. Suitable for cl100k_base (GPT-4, GPT-3.5 Turbo), o200k_base (GPT-4o, o1, o3), and open-weight models loaded from a HuggingFace tokenizer.json (GPT-2, RoBERTa, Llama 3, Mistral, Qwen, DeepSeek).

Merge complexity is O(n log n) — heap-based, so adversarial inputs do not trigger quadratic blowup. The vocabulary is loaded once per worker process into a process-global cache (ZTS-safe via a TSRM mutex).

final class Bpe { /* ... */ }
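
To make the merge step concrete, here is a deliberately naive sketch of rank-ordered pair merging in plain PHP. It is not the extension's implementation: the native class is described above as heap-based, whereas this loop rescans every adjacent pair on each pass and is quadratic in the worst case. The function name bpeMergeNaive and the toy rank table are assumptions made purely for illustration.

```php
<?php
// Naive illustration of BPE merging (hypothetical helper, not part of the extension).
// $parts: the current symbols of one pre-split chunk.
// $ranks: merged-string => rank; a lower rank merges earlier.
function bpeMergeNaive(array $parts, array $ranks): array
{
    while (count($parts) > 1) {
        $bestRank = PHP_INT_MAX;
        $bestIdx  = -1;
        // Find the adjacent pair whose concatenation has the lowest rank.
        for ($i = 0; $i < count($parts) - 1; $i++) {
            $pair = $parts[$i] . $parts[$i + 1];
            if (isset($ranks[$pair]) && $ranks[$pair] < $bestRank) {
                $bestRank = $ranks[$pair];
                $bestIdx  = $i;
            }
        }
        if ($bestIdx === -1) {
            break; // no mergeable pair left
        }
        $merged = $parts[$bestIdx] . $parts[$bestIdx + 1];
        array_splice($parts, $bestIdx, 2, [$merged]);
    }
    return $parts;
}

$ranks = ['lo' => 0, 'low' => 1, 'er' => 2];
print_r(bpeMergeNaive(['l', 'o', 'w', 'e', 'r'], $ranks)); // pieces: "low", "er"
```

A heap keyed by rank avoids the repeated rescans, which is presumably where the O(n log n) bound quoted above comes from.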

Static factory methods¶

`Bpe::fromTiktokenFile()`¶

public static function fromTiktokenFile(
    string $path,
    string $pattern,
    array  $specialTokens = []
): Bpe;

Loads a vocabulary from a tiktoken .tiktoken file on disk. $path is the absolute path to the file. $pattern is the splitting regex (e.g. the cl100k_base pattern). $specialTokens is a map of special-token string to integer id.

$bpe = \Tokenizers\Bpe::fromTiktokenFile(
    '/path/to/cl100k_base.tiktoken',
    '/(?i:\'s|\'t|\'re|\'ve|\'m|\'ll|\'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+/',
    ['<|endoftext|>' => 100257]
);

`Bpe::fromVocab()`¶

public static function fromVocab(
    array  $tokenBytesToId,
    array  $merges,
    string $pattern,
    array  $specialTokens = []
): Bpe;

Constructs a Bpe instance directly from in-memory data. $tokenBytesToId maps base64-encoded byte sequences to integer ids. $merges is an ordered list of merge pairs. $pattern is the splitting regex. $specialTokens maps special-token strings to ids.

$bpe = \Tokenizers\Bpe::fromVocab($tokenBytesToId, $merges, $pattern);
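
As a rough sketch of where a $tokenBytesToId map might come from, the snippet below parses the text layout used by tiktoken vocabulary files, in which each line holds a base64-encoded token followed by its integer rank. The helper name parseTiktokenLines is hypothetical, and it is an assumption that the resulting map has exactly the shape fromVocab() expects.

```php
<?php
// Hypothetical helper: turn .tiktoken-style text into a base64 => id map.
function parseTiktokenLines(string $contents): array
{
    $map = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        $fields = explode(' ', trim($line), 2);
        if (count($fields) !== 2) {
            continue; // skip blank or malformed lines
        }
        $map[$fields[0]] = (int) $fields[1];
    }
    return $map;
}

$sample = "SGVsbG8= 9906\nIHdvcmxk 1917\n";
$tokenBytesToId = parseTiktokenLines($sample);

foreach ($tokenBytesToId as $b64 => $id) {
    // json_encode makes the leading space in " world" visible.
    echo $id, ' => ', json_encode(base64_decode((string) $b64)), "\n";
}
```

The keys stay base64-encoded because raw token bytes are not always valid UTF-8 and are awkward to carry around as PHP array keys.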

Instance methods¶

`encode()`¶

public function encode(
    string       $text,
    array|string $allowedSpecial    = [],
    array|string $disallowedSpecial = "all"
): array; // int[]

Encodes $text into an array of integer token ids.

Special-token handling: By default $disallowedSpecial = "all" means every special token in the vocabulary is treated as plain text (i.e. it will not be encoded as a single special-token id). Pass "all" to $allowedSpecial to allow all special tokens to be encoded as their ids, or pass an explicit list such as ['<|endoftext|>'] to allow only those tokens.

Parameter	Type	Default	Description
`$text`	`string`	—	Input text to encode
`$allowedSpecial`	`array|string`	`[]`	Tokens that may be encoded as special ids. Pass `"all"` or a list.
`$disallowedSpecial`	`array|string`	`"all"`	Tokens that must NOT appear; causes an error if encountered.

$ids = $bpe->encode('Hello world');
// [9906, 1917]

// Allow the end-of-text marker to pass through as a special id:
$ids = $bpe->encode('<|endoftext|>text', allowedSpecial: ['<|endoftext|>']);

// Allow every special token:
$ids = $bpe->encode($text, allowedSpecial: 'all');

`countTokens()`¶

public function countTokens(string $text): int;

Returns the number of tokens $text encodes to, without allocating the full id array. Faster than count($bpe->encode($text)) for large inputs.

$n = $bpe->countTokens('Hola mundo!'); // 4

`decode()`¶

public function decode(array $ids): string;

Decodes an array of integer token ids back to a UTF-8 string.

echo $bpe->decode([9906, 1917]); // "Hello world"

`decodeSingle()`¶

public function decodeSingle(int $id): string;

Decodes a single token id to its byte representation. Useful for inspecting individual tokens.

$bytes = $bpe->decodeSingle(9906); // "Hello"

`vocabSize()`¶

public function vocabSize(): int;

Returns the total vocabulary size (base tokens + special tokens). For cl100k_base this is 100277.

echo $bpe->vocabSize(); // 100277 for cl100k_base

`name()`¶

public function name(): ?string;

Returns the encoding name, or null. In v0.1 this always returns null — name tracking is not yet implemented.

var_dump($bpe->name()); // NULL

3. `\Tokenizers\WordPiece`¶

Native C class. Implements greedy longest-match WordPiece tokenization (BERT family). Uses a ## continuation prefix for subword pieces and falls back to [UNK] for out-of-vocabulary words.

Normalization scope (v0.1): Latin-1 and CJK-spacing only. Non-Latin scripts that require full Unicode NFD decomposition are out of scope for v0.1 and may produce different results from the reference.

final class WordPiece { /* ... */ }
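
The greedy longest-match rule described above can be sketched for a single word as follows. This is an illustration in plain PHP, not the native implementation: the function name wordPieceEncodeWord and the toy vocabulary are hypothetical, normalization is skipped entirely, and byte-based substr() means the sketch assumes ASCII input.

```php
<?php
// Sketch of greedy longest-match-first WordPiece for one word (ASCII only).
function wordPieceEncodeWord(
    string $word,
    array $vocab,
    string $unkToken = '[UNK]',
    string $prefix = '##',
    int $maxChars = 100
): array {
    $len = strlen($word);
    if ($len > $maxChars) {
        return [$vocab[$unkToken]]; // overly long words map to [UNK]
    }
    $ids   = [];
    $start = 0;
    while ($start < $len) {
        $match = null;
        // Try the longest remaining substring first, then shrink from the right.
        for ($end = $len; $end > $start; $end--) {
            $piece = substr($word, $start, $end - $start);
            if ($start > 0) {
                $piece = $prefix . $piece; // continuation pieces carry the prefix
            }
            if (isset($vocab[$piece])) {
                $match = $piece;
                break;
            }
        }
        if ($match === null) {
            return [$vocab[$unkToken]]; // the whole word becomes [UNK]
        }
        $ids[] = $vocab[$match];
        $start = $end; // continue after the matched piece
    }
    return $ids;
}

$vocab = ['[UNK]' => 0, 'un' => 1, '##believ' => 2, '##able' => 3];
print_r(wordPieceEncodeWord('unbelievable', $vocab)); // ids 1, 2, 3
print_r(wordPieceEncodeWord('xyz', $vocab));          // id 0 ([UNK])
```

Note that a single unmatched position discards the pieces found so far, so the word as a whole maps to the unknown token.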

Static factory method¶

`WordPiece::fromVocab()`¶

public static function fromVocab(
    array $tokenToId,
    array $opts = []
): WordPiece;

Builds a WordPiece tokenizer from a vocabulary map. $tokenToId maps token strings (e.g. "hello", "##ing") to their integer ids. See $opts reference below.

$wp = \Tokenizers\WordPiece::fromVocab($vocab, [
    'lowercase'  => true,
    'unkToken'   => '[UNK]',
]);

To load directly from a BERT-style vocab.txt file, use \Tokenizers\Encoding::wordPieceFromVocabFile().

Instance methods¶

`encode()`¶

public function encode(string $text): array; // int[]

Encodes $text and returns an array of integer token ids.

$ids = $wp->encode('unbelievable tokenization'); // [23653, 19204, 3989]

`countTokens()`¶

public function countTokens(string $text): int;

Returns the token count without materialising the full id array.

$n = $wp->countTokens('Hello world');

`decode()`¶

public function decode(array $ids): string;

Reconstructs text from a list of token ids. Strips the ## continuation prefix when reassembling subword pieces.

echo $wp->decode([23653, 19204, 3989]);

`vocabSize()`¶

public function vocabSize(): int;

Returns the size of the vocabulary.

echo $wp->vocabSize();

`$opts` reference¶

All keys are optional. Unspecified keys use the listed defaults.

Key	Type	Default	Description
`unkToken`	`string`	`"[UNK]"`	Token used for out-of-vocabulary words
`continuingSubwordPrefix`	`string`	`"##"`	Prefix prepended to continuation subword pieces
`maxInputCharsPerWord`	`int`	`100`	Words longer than this character limit are mapped to `[UNK]`
`lowercase`	`bool`	`true`	Lowercases input before tokenization
`stripAccents`	`bool`	`true`	Strips diacritic accents (Latin-1 scope)
`handleChineseChars`	`bool`	`true`	Pads CJK codepoints with spaces before tokenization

4. `\Tokenizers\Unigram`¶

Native C class. Implements SentencePiece Unigram tokenization (T5, ALBERT). Uses a Metaspace (▁, U+2581) to encode leading spaces, and Viterbi best-path decoding over piece log-probability scores (f64).

Normalization scope (v0.1): Metaspace and identity-on-ASCII. Inputs requiring NFKC normalization, and some whitespace-edge cases (leading/trailing/multiple spaces), may differ from the reference tokenizer.

final class Unigram { /* ... */ }
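
For readers unfamiliar with Viterbi best-path segmentation, the following plain-PHP sketch picks the split of a string that maximises the sum of piece log-probabilities. It is only an illustration under simplifying assumptions: the helper viterbiSegment and the toy scores are hypothetical, it returns piece strings rather than ids, it ignores Metaspace handling and unknown-piece fallback, and it tries every substring instead of using a bounded piece length or a trie.

```php
<?php
// Sketch: best-scoring segmentation of $text given piece => log-probability scores.
function viterbiSegment(string $text, array $scores): array
{
    $n    = strlen($text);
    $best = array_fill(0, $n + 1, -INF); // best score of any segmentation ending at i
    $back = array_fill(0, $n + 1, -1);   // start offset of the last piece in that path
    $best[0] = 0.0;

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            if ($best[$start] === -INF) {
                continue; // prefix is not reachable
            }
            $piece = substr($text, $start, $end - $start);
            if (!isset($scores[$piece])) {
                continue;
            }
            $candidate = $best[$start] + $scores[$piece];
            if ($candidate > $best[$end]) {
                $best[$end] = $candidate;
                $back[$end] = $start;
            }
        }
    }
    if ($n === 0 || $back[$n] === -1) {
        return []; // empty input, or no segmentation covers the text
    }
    // Walk the back-pointers from the end to recover the pieces in order.
    $pieces = [];
    for ($pos = $n; $pos > 0; $pos = $back[$pos]) {
        array_unshift($pieces, substr($text, $back[$pos], $pos - $back[$pos]));
    }
    return $pieces;
}

$scores = ['hello' => -1.5, 'he' => -1.0, 'llo' => -1.0, 'h' => -3.0, 'e' => -3.0, 'l' => -3.0, 'o' => -3.0];
print_r(viterbiSegment('hello', $scores)); // single piece: "hello" (-1.5 beats -2.0)
```

Removing the 'hello' entry from the toy scores makes the same call return the two pieces "he" and "llo", since that path then has the highest total.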

Static factory method¶

`Unigram::fromVocab()`¶

public static function fromVocab(
    array $pieces,
    array $opts = []
): Unigram;

Constructs a Unigram tokenizer from a list of (piece, score) pairs. $pieces is an array of [string, float] entries as found in a SentencePiece tokenizer.json. See $opts reference below.

$ug = \Tokenizers\Unigram::fromVocab($pieces, ['addPrefixSpace' => true]);

To load from a HuggingFace tokenizer.json automatically, use \Tokenizers\Encoding::fromHuggingFace().

Instance methods¶

`encode()`¶

public function encode(string $text): array; // int[]

Encodes $text and returns an array of integer token ids via Viterbi best-path search.

$ids = $ug->encode('Hello world'); // [8774, 296]

`countTokens()`¶

public function countTokens(string $text): int;

Returns the token count without materialising the full id array.

$n = $ug->countTokens('Hello world');

`decode()`¶

public function decode(array $ids): string;

Reconstructs text from a list of token ids, replacing leading ▁ characters with spaces.

echo $ug->decode([8774, 296]); // "Hello world"

`vocabSize()`¶

public function vocabSize(): int;

Returns the size of the vocabulary.

echo $ug->vocabSize();

`$opts` reference¶

All keys are optional. Unspecified keys use the listed defaults.

Key	Type	Default	Description
`unkId`	`int`	id of `<unk>` if present, else `0`	Token id to emit for unknown pieces
`addPrefixSpace`	`bool`	`true`	Prepend a Metaspace (`▁`) to the input before encoding

5. `\Tokenizers\Encoding`¶

Pure-PHP shim (php/Tokenizers/Encoding.php). Provides high-level loaders that download, checksum-verify, and cache vocabulary files, then return the appropriate native tokenizer instance. Also dispatches HuggingFace tokenizer JSON files to the correct C class.

final class Encoding { /* ... */ }

`Encoding::load()`¶

public static function load(string $name): Bpe;

Loads a named built-in encoding. Downloads and checksum-verifies the vocabulary on first use; subsequent calls return from the process-global cache. Currently known encodings: 'cl100k_base' and 'o200k_base'.

Throws \Tokenizers\TokenizerException("unknown encoding: <name>") for unrecognised names.

use Tokenizers\Encoding;

$enc = Encoding::load('cl100k_base');
echo $enc->countTokens('Hello world'); // 2
echo $enc->vocabSize();                // 100277

`Encoding::fromHuggingFace()`¶

public static function fromHuggingFace(string $jsonPath): Bpe|WordPiece|Unigram;

Reads a HuggingFace tokenizer.json file and auto-dispatches by the model.type field:

`model.type` value	Returned class
`"BPE"`	`\Tokenizers\Bpe`
`"WordPiece"`	`\Tokenizers\WordPiece`
`"Unigram"`	`\Tokenizers\Unigram`

$bpe  = Encoding::fromHuggingFace('path/to/llama3/tokenizer.json'); // Bpe
$wp   = Encoding::fromHuggingFace('path/to/bert/tokenizer.json');   // WordPiece
$ug   = Encoding::fromHuggingFace('path/to/t5/tokenizer.json');     // Unigram
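
The dispatch step can be pictured with a small sketch that reads only the model.type field. The helper name detectModelType is hypothetical, it works on a JSON string rather than a file path, and it throws a plain RuntimeException as a stand-in; how the real loader reports an unsupported type is not specified here.

```php
<?php
// Hypothetical helper: report which tokenizer family a tokenizer.json describes.
function detectModelType(string $json): string
{
    $data  = json_decode($json, true);
    $type  = is_array($data) ? ($data['model']['type'] ?? null) : null;
    $known = ['BPE', 'WordPiece', 'Unigram'];
    if (!is_string($type) || !in_array($type, $known, true)) {
        throw new RuntimeException('unsupported or missing model.type');
    }
    return $type;
}

echo detectModelType('{"model": {"type": "WordPiece", "vocab": {}}}'), "\n"; // WordPiece
```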

`Encoding::wordPieceFromVocabFile()`¶

public static function wordPieceFromVocabFile(
    string $path,
    array  $opts = []
): WordPiece;

Loads a plain BERT-style vocab.txt file (one token per line, zero-based line index = id) into a WordPiece tokenizer. $opts accepts the same keys as WordPiece::fromVocab().

$wp = Encoding::wordPieceFromVocabFile('/path/to/vocab.txt', ['lowercase' => true]);
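
If you want to inspect or pre-process a vocabulary before handing it to WordPiece::fromVocab(), a map of the same shape can be built by hand. The sketch below assumes ids are zero-based line indexes, as in BERT's reference vocab files; the helper name parseVocabTxt is hypothetical and it operates on a string rather than a file.

```php
<?php
// Hypothetical helper: map each line of vocab.txt-style text to its zero-based index.
function parseVocabTxt(string $contents): array
{
    $lines     = preg_split('/\R/', rtrim($contents, "\r\n"));
    $tokenToId = [];
    foreach ($lines as $index => $token) {
        $tokenToId[$token] = $index;
    }
    return $tokenToId;
}

$vocab = parseVocabTxt("[PAD]\n[UNK]\nhello\n##ing\n");
print_r($vocab); // [PAD] => 0, [UNK] => 1, hello => 2, ##ing => 3
```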

`Encoding::cacheDir()`¶

public static function cacheDir(): string;

Returns the resolved path to the directory where downloaded vocabulary files are stored. Useful for debugging cache location.

echo Encoding::cacheDir();
// e.g. /home/user/.cache/tokenizers

`Encoding::download()`¶

public static function download(string $url, ?string $sha256, string $dest): void;

Low-level utility. Downloads $url to $dest, optionally verifying the file against $sha256. Used internally by Encoding::load(); you generally do not need to call this directly.

Encoding::download(
    'https://example.com/vocab.tiktoken',
    'abc123...',
    '/tmp/vocab.tiktoken'
);
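
The optional checksum step can be approximated with PHP's built-in hashing functions. This is a sketch of the general technique only; the helper name sha256Matches is hypothetical, and the real method's behaviour on a mismatch is not shown.

```php
<?php
// Hypothetical helper: true when no checksum is given or the file's SHA-256 matches.
function sha256Matches(string $path, ?string $expectedHex): bool
{
    if ($expectedHex === null) {
        return true; // verification is optional
    }
    $actual = hash_file('sha256', $path);
    // hash_equals() compares in constant time; hex case is normalised first.
    return $actual !== false && hash_equals(strtolower($expectedHex), $actual);
}

$tmp = tempnam(sys_get_temp_dir(), 'tok');
file_put_contents($tmp, 'abc');
var_dump(sha256Matches($tmp, hash('sha256', 'abc'))); // bool(true)
var_dump(sha256Matches($tmp, str_repeat('0', 64)));   // bool(false)
unlink($tmp);
```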

Cache directory resolution¶

Encoding::load() stores downloaded vocabulary files in the first directory that resolves from the following ordered list:

$TOKENIZERS_CACHE_DIR/tokenizers
$XDG_CACHE_HOME/tokenizers
$HOME/.cache/tokenizers
sys_get_temp_dir()/tokenizers
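
The lookup order above might be sketched like this. It assumes that a candidate "resolves" when its environment variable is set and non-empty, which is an interpretation rather than documented behaviour; the helper name resolveCacheDir is hypothetical, and the environment is passed in as an array so the logic can be exercised without touching the real process environment.

```php
<?php
// Sketch of the lookup order; $env stands in for the process environment.
function resolveCacheDir(array $env, string $tempDir): string
{
    $home = $env['HOME'] ?? '';
    $candidates = [
        $env['TOKENIZERS_CACHE_DIR'] ?? '',
        $env['XDG_CACHE_HOME'] ?? '',
        $home !== '' ? $home . '/.cache' : '',
    ];
    foreach ($candidates as $base) {
        if ($base !== '') {
            return rtrim($base, '/') . '/tokenizers';
        }
    }
    return rtrim($tempDir, '/') . '/tokenizers'; // last resort
}

echo resolveCacheDir(getenv(), sys_get_temp_dir()), "\n";
```

To check the actual location on a given machine, prefer Encoding::cacheDir(), which reports what the extension itself resolved.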

Built-in vocabulary files are downloaded from OpenAI's public CDN on first use, checksum-verified, and are never redistributed with the extension.

6. Procedural functions¶

Global-namespace procedural functions. All functions that operate on a tokenizer accept a \Tokenizers\Bpe instance only — WordPiece and Unigram are not accepted by these functions.

`tokenizers_version()`¶

function tokenizers_version(): string;

Returns the extension version string "0.1.0". Equivalent to \Tokenizers\VERSION.

echo tokenizers_version(); // "0.1.0"

`tokenizers_cache_count()`¶

function tokenizers_cache_count(): int;

Returns the number of tokenizer models currently held in the process-global vocabulary cache. Useful for diagnostics.

echo tokenizers_cache_count(); // e.g. 1 after loading cl100k_base

`tokenizers_encode()`¶

function tokenizers_encode(
    \Tokenizers\Bpe $t,
    string          $text,
    array           $allowedSpecial    = [],
    array|string    $disallowedSpecial = "all"
): array; // int[]

Procedural wrapper for Bpe::encode(). See encode() for $allowedSpecial and $disallowedSpecial semantics.

$ids = tokenizers_encode($bpe, 'Hello world'); // [9906, 1917]

`tokenizers_decode()`¶

function tokenizers_decode(\Tokenizers\Bpe $t, array $ids): string;

Procedural wrapper for Bpe::decode().

echo tokenizers_decode($bpe, [9906, 1917]); // "Hello world"

`tokenizers_count()`¶

function tokenizers_count(\Tokenizers\Bpe $t, string $text): int;

Procedural wrapper for Bpe::countTokens().

$n = tokenizers_count($bpe, 'Hello world'); // 2

7. Remote companion — `\Tokenizers\Remote`¶

Pure-PHP classes that count tokens via the official provider APIs. They work without the C extension loaded (they bootstrap a TokenizerException polyfill via require_once of php/Tokenizers/TokenizerException.php). They require ext-curl and ext-json. They use raw curl — they do not depend on anthropic-ai/sdk, Guzzle, or any other HTTP library.

Note: Claude 3+ and Gemini models have no local tokenizer. Exact token counts for those models require a network call and a valid API key. Providers may change their tokenizer at any time.

`Transport` interface¶

interface Transport {
    public function post(
        string $url,
        array  $headers,
        string $body,
        int    $timeout
    ): array; // ['status' => int, 'body' => string]
}

The HTTP abstraction used by Anthropic and Gemini. The default implementation is CurlTransport. Inject a custom implementation for offline testing.

class FakeTransport implements \Tokenizers\Remote\Transport {
    public function post(string $url, array $headers, string $body, int $timeout): array {
        return ['status' => 200, 'body' => '{"input_tokens":5}'];
    }
}
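
To show how a client might consume the ['status' => int, 'body' => string] array that a transport returns, here is a self-contained sketch. The helper name extractCount is hypothetical, and a plain RuntimeException stands in for TokenizerException so the snippet runs without the extension.

```php
<?php
// Hypothetical helper: pull an integer field out of a Transport-style response array.
function extractCount(array $response, string $field): int
{
    $status = $response['status'] ?? 0;
    if ($status < 200 || $status >= 300) {
        throw new RuntimeException("HTTP status $status");
    }
    $decoded = json_decode($response['body'] ?? '', true);
    if (!is_array($decoded) || !isset($decoded[$field]) || !is_int($decoded[$field])) {
        throw new RuntimeException("missing or malformed field: $field");
    }
    return $decoded[$field];
}

echo extractCount(['status' => 200, 'body' => '{"input_tokens":5}'], 'input_tokens'), "\n"; // 5
```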

`Anthropic`¶

final class Anthropic {
    public function __construct(
        ?string    $apiKey    = null,
        ?Transport $transport = null,
        string     $version   = '2023-06-01',
        int        $timeout   = 30
    );

    public function countTokens(
        string       $model,
        string|array $messages,
        ?string      $system = null
    ): int;
}

Counts tokens for an Anthropic (Claude) model via the official API.

countTokens() details:

Endpoint: POST https://api.anthropic.com/v1/messages/count_tokens
Headers: x-api-key: <key>, anthropic-version: 2023-06-01, content-type: application/json
Body: {"model": <model>, "messages": [...], "system": <system?>}
A plain string for $messages becomes a single {"role":"user","content": <text>} turn.
An array is sent as-is (for multi-turn conversations).
Response field parsed: input_tokens
API key resolution: $apiKey constructor argument, otherwise ANTHROPIC_API_KEY environment variable. A missing key throws TokenizerException at call time.
Errors: Non-2xx HTTP status or a malformed/missing input_tokens field throws TokenizerException.
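
The request body rules listed above can be sketched as a small pure function. The helper name buildCountTokensBody is hypothetical, 'example-model' is a placeholder rather than a real model id, and omitting the system key when no system prompt is given is an assumption based on the optional marker in the body description.

```php
<?php
// Hypothetical helper mirroring the documented request body shape.
function buildCountTokensBody(string $model, string|array $messages, ?string $system = null): string
{
    if (is_string($messages)) {
        // A plain string becomes a single user turn.
        $messages = [['role' => 'user', 'content' => $messages]];
    }
    $payload = ['model' => $model, 'messages' => $messages];
    if ($system !== null) {
        $payload['system'] = $system; // left out entirely when not provided (assumption)
    }
    return json_encode($payload, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}

echo buildCountTokensBody('example-model', 'Hi'), "\n";
// {"model":"example-model","messages":[{"role":"user","content":"Hi"}]}
```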

use Tokenizers\Remote\Anthropic;

$anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello, world!');

// Multi-turn:
$n = $anthropic->countTokens('claude-opus-4-8', [
    ['role' => 'user',      'content' => 'Hi'],
    ['role' => 'assistant', 'content' => 'Hello!'],
    ['role' => 'user',      'content' => 'How are you?'],
]);

// With a system prompt:
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello', system: 'You are a helpful assistant.');

`Gemini`¶

final class Gemini {
    public function __construct(
        ?string    $apiKey    = null,
        ?Transport $transport = null,
        int        $timeout   = 30
    );

    public function countTokens(string $model, string $text): int;
}

Counts tokens for a Google Gemini model via the official API.

countTokens() details:

Endpoint: POST https://generativelanguage.googleapis.com/v1beta/models/{model}:countTokens
Model normalisation: both "gemini-1.5-flash" and "models/gemini-1.5-flash" are accepted for {model}; the leading models/ is added or preserved exactly once, never double-prefixed.
Header: x-goog-api-key: <key>
Body: {"contents":[{"parts":[{"text": <text>}]}]}
Response field parsed: totalTokens
API key resolution: $apiKey constructor argument, then GEMINI_API_KEY env var, then GOOGLE_API_KEY env var. A missing key throws TokenizerException at call time.
Errors: Non-2xx HTTP status or a malformed/missing totalTokens field throws TokenizerException.
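
One way to get the "exactly once" prefix behaviour is to strip an existing models/ prefix before building the URL. The helper name geminiCountTokensUrl is hypothetical and this is only a sketch of the idea, not the class's actual code.

```php
<?php
// Hypothetical helper: build the countTokens URL with a single models/ prefix.
function geminiCountTokensUrl(string $model): string
{
    $prefix = 'models/';
    if (strncmp($model, $prefix, strlen($prefix)) === 0) {
        $model = substr($model, strlen($prefix)); // drop the caller-supplied prefix
    }
    return 'https://generativelanguage.googleapis.com/v1beta/models/' . $model . ':countTokens';
}

echo geminiCountTokensUrl('gemini-1.5-flash'), "\n";
echo geminiCountTokensUrl('models/gemini-1.5-flash'), "\n"; // same URL as the line above
```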

use Tokenizers\Remote\Gemini;

$gemini = new Gemini(); // reads GEMINI_API_KEY or GOOGLE_API_KEY from env
$n = $gemini->countTokens('gemini-1.5-flash', 'Hello, world!');
// Both forms are equivalent:
$n = $gemini->countTokens('models/gemini-1.5-flash', 'Hello, world!');

Environment variable reference¶

Provider	Environment variable(s)	Constructor override
Anthropic	`ANTHROPIC_API_KEY`	`new Anthropic(apiKey: '...')`
Gemini	`GEMINI_API_KEY` (checked first), then `GOOGLE_API_KEY`	`new Gemini(apiKey: '...')`

8. `\Tokenizers\TokenCounter`¶

final class TokenCounter {
    public function __construct(
        ?Anthropic $anthropic = null,
        ?Gemini    $gemini    = null
    );

    public static function route(string $model): string; // 'anthropic' | 'gemini' | 'local'

    public function count(
        string  $model,
        string  $text,
        ?string $provider = null
    ): int;
}

High-level facade that dispatches token counting to the right backend based on the model name. Defined in php/Tokenizers/TokenCounter.php.

route() — routing rules (no network call):

Model prefix	Returns
`claude` or `anthropic`	`'anthropic'`
`gemini` or `models/gemini`	`'gemini'`
anything else	`'local'`
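
The routing table reduces to a few prefix checks, sketched below. The function name routeModel is hypothetical, and treating the prefixes as case-sensitive is an assumption; the real route() may differ in that detail.

```php
<?php
// Sketch of prefix-based routing (case-sensitive matching is an assumption).
function routeModel(string $model): string
{
    foreach (['claude', 'anthropic'] as $prefix) {
        if (str_starts_with($model, $prefix)) {
            return 'anthropic';
        }
    }
    foreach (['gemini', 'models/gemini'] as $prefix) {
        if (str_starts_with($model, $prefix)) {
            return 'gemini';
        }
    }
    return 'local'; // everything else is handled by a local encoding
}

echo routeModel('claude-sonnet-4'), "\n";         // anthropic
echo routeModel('models/gemini-1.5-flash'), "\n"; // gemini
echo routeModel('cl100k_base'), "\n";             // local
```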

count() — dispatch logic:

'anthropic' → calls Anthropic->countTokens($model, $text)
'gemini' → calls Gemini->countTokens($model, $text)
'local' → calls Encoding::load($model)->countTokens($text)
Passing an explicit $provider that is not one of the three recognised values throws TokenizerException("unknown provider '<p>' for model: <model>").

use Tokenizers\TokenCounter;

$tc = new TokenCounter();

// Local (no network, no key needed):
$n = $tc->count('cl100k_base', $text);

// Remote Anthropic (needs ANTHROPIC_API_KEY):
$n = $tc->count('claude-opus-4-8', $text);

// Remote Gemini (needs GEMINI_API_KEY or GOOGLE_API_KEY):
$n = $tc->count('gemini-1.5-flash', $text);

// Inject pre-configured clients:
$tc = new TokenCounter(
    anthropic: new \Tokenizers\Remote\Anthropic(apiKey: 'ant-...'),
    gemini:    new \Tokenizers\Remote\Gemini(apiKey: 'AIza...')
);

// Force a specific provider:
$n = $tc->count('my-model', $text, provider: 'local');

// Inspect routing without counting:
echo TokenCounter::route('claude-sonnet-4'); // 'anthropic'
echo TokenCounter::route('gemini-pro');      // 'gemini'
echo TokenCounter::route('cl100k_base');     // 'local'

9. `\Tokenizers\TokenizerException`¶

class TokenizerException extends \RuntimeException {}

Thrown by all classes in this extension for error conditions. Extends the standard \RuntimeException, so it can be caught as either \Tokenizers\TokenizerException or \RuntimeException.

Thrown in the following situations:

Encoding::load() — unknown encoding name: "unknown encoding: <name>"
Anthropic::countTokens() / Gemini::countTokens() — missing API key (at call time)
Anthropic::countTokens() / Gemini::countTokens() — non-2xx HTTP response
Anthropic::countTokens() / Gemini::countTokens() — malformed or missing response field
TokenCounter::count() — unknown explicit $provider: "unknown provider '<p>' for model: <model>"

use Tokenizers\Encoding;
use Tokenizers\TokenizerException;

try {
    $enc = Encoding::load('p50k_base'); // not a known encoding in v0.1
} catch (TokenizerException $e) {
    echo $e->getMessage(); // "unknown encoding: p50k_base"
}
