
API Reference — tokenizers extension v0.1.0

For guides and tutorials, see the Guides and Getting Started pages.


Table of Contents

  1. Constants
  2. \Tokenizers\Bpe
     • Static factory methods
     • Instance methods
  3. \Tokenizers\WordPiece
     • Static factory method
     • Instance methods
     • $opts reference
  4. \Tokenizers\Unigram
     • Static factory method
     • Instance methods
     • $opts reference
  5. \Tokenizers\Encoding
     • Cache directory resolution
  6. Procedural functions
  7. Remote companion — \Tokenizers\Remote
     • Transport interface
     • Anthropic
     • Gemini
     • Environment variable reference
  8. \Tokenizers\TokenCounter
  9. \Tokenizers\TokenizerException
  See also

1. Constants

\Tokenizers\VERSION

const \Tokenizers\VERSION = "0.1.0";

The extension version string. Also available via the procedural helper tokenizers_version().

echo \Tokenizers\VERSION; // "0.1.0"

2. \Tokenizers\Bpe

Native C class. Implements byte-level BPE (byte-pair encoding), compatible with OpenAI's tiktoken library. Suitable for cl100k_base (GPT-4, GPT-3.5 Turbo), o200k_base (GPT-4o, o1, o3), and open-weight models loaded from a HuggingFace tokenizer.json (GPT-2, RoBERTa, Llama 3, Mistral, Qwen, DeepSeek).

Merge complexity is O(n log n) — heap-based, so adversarial inputs do not trigger quadratic blowup. The vocabulary is loaded once per worker process into a process-global cache (ZTS-safe via a TSRM mutex).

final class Bpe { /* ... */ }
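
As a rough illustration of what "merging" means here, the sketch below repeatedly joins the adjacent pair with the lowest rank, in the style of tiktoken's rank-based merge. It is deliberately naive (quadratic in the chunk length) and is not the heap-based implementation described above. The helper name bpeMergeNaive() and the toy $ranks table are assumptions made for this example, and it assumes each token's id doubles as its merge rank.

```php
<?php
// Naive illustration only. bpeMergeNaive() is a hypothetical helper; the
// extension's heap-based C implementation is not shown here.

/**
 * @param array<string,int> $ranks token bytes => id (lower id = merged earlier)
 * @return int[] token ids for one pre-split chunk
 */
function bpeMergeNaive(string $chunk, array $ranks): array
{
    if ($chunk === '') {
        return [];
    }
    $parts = str_split($chunk); // start from single bytes

    while (count($parts) > 1) {
        $bestRank = PHP_INT_MAX;
        $bestIdx = -1;
        // Find the adjacent pair whose concatenation has the lowest rank.
        for ($i = 0; $i < count($parts) - 1; $i++) {
            $pair = $parts[$i] . $parts[$i + 1];
            if (isset($ranks[$pair]) && $ranks[$pair] < $bestRank) {
                $bestRank = $ranks[$pair];
                $bestIdx = $i;
            }
        }
        if ($bestIdx < 0) {
            break; // no mergeable pair left
        }
        // Replace the two parts with their concatenation.
        array_splice($parts, $bestIdx, 2, [$parts[$bestIdx] . $parts[$bestIdx + 1]]);
    }

    // Every remaining part is assumed to be present in $ranks.
    return array_map(fn (string $p): int => $ranks[$p], $parts);
}

// Toy vocabulary, invented for this example.
$ranks = ['a' => 0, 'b' => 1, 'c' => 2, 'ab' => 3, 'abc' => 4, 'bc' => 5];
echo json_encode(bpeMergeNaive('abc', $ranks)), "\n"; // [4]
echo json_encode(bpeMergeNaive('bca', $ranks)), "\n"; // [5,0]
```

A heap keyed on rank avoids rescanning every pair after each merge, which is where the O(n log n) bound comes from.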

Static factory methods

Bpe::fromTiktokenFile()

public static function fromTiktokenFile(
    string $path,
    string $pattern,
    array  $specialTokens = []
): Bpe;

Loads a vocabulary from a tiktoken .tiktoken file on disk. $path is the absolute path to the file. $pattern is the splitting regex (e.g. the cl100k_base pattern). $specialTokens is a map of special-token string to integer id.

$bpe = \Tokenizers\Bpe::fromTiktokenFile(
    '/path/to/cl100k_base.tiktoken',
    '/(?i:\'s|\'t|\'re|\'ve|\'m|\'ll|\'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+/',
    ['<|endoftext|>' => 100257]
);

Bpe::fromVocab()

public static function fromVocab(
    array  $tokenBytesToId,
    array  $merges,
    string $pattern,
    array  $specialTokens = []
): Bpe;

Constructs a Bpe instance directly from in-memory data. $tokenBytesToId maps base64-encoded byte sequences to integer ids. $merges is an ordered list of merge pairs. $pattern is the splitting regex. $specialTokens maps special-token strings to ids.

$bpe = \Tokenizers\Bpe::fromVocab($tokenBytesToId, $merges, $pattern);

Instance methods

encode()

public function encode(
    string       $text,
    array|string $allowedSpecial    = [],
    array|string $disallowedSpecial = "all"
): array; // int[]

Encodes $text into an array of integer token ids.

Special-token handling: By default $disallowedSpecial = "all" means every special token in the vocabulary is treated as plain text (i.e. it will not be encoded as a single special-token id). Pass "all" to $allowedSpecial to allow all special tokens to be encoded as their ids, or pass an explicit list such as ['<|endoftext|>'] to allow only those tokens.

Parameter           Type          Default     Description
$text               string        (required)  Input text to encode
$allowedSpecial     array|string  []          Tokens that may be encoded as special ids. Pass "all" or a list.
$disallowedSpecial  array|string  "all"       Tokens that must NOT appear; causes an error if encountered.

$ids = $bpe->encode('Hello world');
// [9906, 1917]

// Allow the end-of-text marker to pass through as a special id:
$ids = $bpe->encode('<|endoftext|>text', allowedSpecial: ['<|endoftext|>']);

// Allow every special token:
$ids = $bpe->encode($text, allowedSpecial: 'all');

countTokens()

public function countTokens(string $text): int;

Returns the number of tokens $text encodes to, without allocating the full id array. Faster than count($this->encode($text)) for large inputs.

$n = $bpe->countTokens('Hola mundo!'); // 4

decode()

public function decode(array $ids): string;

Decodes an array of integer token ids back to a UTF-8 string.

echo $bpe->decode([9906, 1917]); // "Hello world"

decodeSingle()

public function decodeSingle(int $id): string;

Decodes a single token id to its byte representation. Useful for inspecting individual tokens.

$bytes = $bpe->decodeSingle(9906); // "Hello"

vocabSize()

public function vocabSize(): int;

Returns the total vocabulary size (base tokens + special tokens). For cl100k_base this is 100277.

echo $bpe->vocabSize(); // 100277 for cl100k_base

name()

public function name(): ?string;

Returns the encoding name, or null. In v0.1 this always returns null — name tracking is not yet implemented.

var_dump($bpe->name()); // NULL

3. \Tokenizers\WordPiece

Native C class. Implements greedy longest-match WordPiece tokenization (BERT family). Uses a ## continuation prefix for subword pieces and falls back to [UNK] for out-of-vocabulary words.

Normalization scope (v0.1): Latin-1 and CJK-spacing only. Non-Latin scripts that require full Unicode NFD decomposition are out of scope for v0.1 and may produce different results from the reference.

final class WordPiece { /* ... */ }
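
The greedy longest-match rule can be sketched in plain PHP as follows. This is an illustration of the general WordPiece technique under simplifying assumptions (one pre-split word, no normalization), not the extension's C code. wordPieceSplit() and the toy vocabulary are hypothetical names introduced for this example.

```php
<?php
// Illustrative sketch only: wordPieceSplit() is a hypothetical helper,
// not part of the extension.

/**
 * Splits one whitespace-free word into WordPiece token ids.
 *
 * @param array<string,int> $vocab token string => id
 * @return int[]
 */
function wordPieceSplit(
    string $word,
    array $vocab,
    string $unk = '[UNK]',
    string $prefix = '##',
    int $maxChars = 100
): array {
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    if ($n > $maxChars) {
        return [$vocab[$unk]]; // over-long words map to [UNK]
    }

    $ids = [];
    $start = 0;
    while ($start < $n) {
        $end = $n;
        $found = null;
        // Try the longest remaining substring first, then shrink from the right.
        while ($end > $start) {
            $piece = implode('', array_slice($chars, $start, $end - $start));
            if ($start > 0) {
                $piece = $prefix . $piece; // continuation pieces carry the prefix
            }
            if (isset($vocab[$piece])) {
                $found = $vocab[$piece];
                break;
            }
            $end--;
        }
        if ($found === null) {
            return [$vocab[$unk]]; // no piece fits here: the whole word is [UNK]
        }
        $ids[] = $found;
        $start = $end;
    }
    return $ids;
}

// Toy vocabulary, invented for this example.
$vocab = ['[UNK]' => 0, 'un' => 1, '##believ' => 2, '##able' => 3, 'token' => 4];
echo json_encode(wordPieceSplit('unbelievable', $vocab)), "\n"; // [1,2,3]
echo json_encode(wordPieceSplit('tokens', $vocab)), "\n";       // [0]
```

Note that a single unmatched tail ("##s" above) turns the whole word into [UNK], which is the usual BERT behaviour.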

Static factory method

WordPiece::fromVocab()

public static function fromVocab(
    array $tokenToId,
    array $opts = []
): WordPiece;

Builds a WordPiece tokenizer from a vocabulary map. $tokenToId maps token strings (e.g. "hello", "##ing") to their integer ids. See $opts reference below.

$wp = \Tokenizers\WordPiece::fromVocab($vocab, [
    'lowercase'  => true,
    'unkToken'   => '[UNK]',
]);

To load directly from a BERT-style vocab.txt file, use \Tokenizers\Encoding::wordPieceFromVocabFile().

Instance methods

encode()

public function encode(string $text): array; // int[]

Encodes $text and returns an array of integer token ids.

$ids = $wp->encode('unbelievable tokenization'); // [23653, 19204, 3989]

countTokens()

public function countTokens(string $text): int;

Returns the token count without materialising the full id array.

$n = $wp->countTokens('Hello world');

decode()

public function decode(array $ids): string;

Reconstructs text from a list of token ids. Strips the ## continuation prefix when reassembling subword pieces.

echo $wp->decode([23653, 19204, 3989]);

vocabSize()

public function vocabSize(): int;

Returns the size of the vocabulary.

echo $wp->vocabSize();

$opts reference

All keys are optional. Unspecified keys use the listed defaults.

Key                      Type    Default  Description
unkToken                 string  "[UNK]"  Token used for out-of-vocabulary words
continuingSubwordPrefix  string  "##"     Prefix prepended to continuation subword pieces
maxInputCharsPerWord     int     100      Words longer than this character limit are mapped to [UNK]
lowercase                bool    true     Lowercases input before tokenization
stripAccents             bool    true     Strips diacritic accents (Latin-1 scope)
handleChineseChars       bool    true     Pads CJK codepoints with spaces before tokenization

4. \Tokenizers\Unigram

Native C class. Implements SentencePiece Unigram tokenization (T5, ALBERT). Uses a Metaspace character (▁, U+2581) to encode leading spaces, and Viterbi best-path decoding over piece log-probability scores (f64).

Normalization scope (v0.1): Metaspace and identity-on-ASCII. Inputs requiring NFKC normalization, and some whitespace-edge cases (leading/trailing/multiple spaces), may differ from the reference tokenizer.

final class Unigram { /* ... */ }
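
A minimal sketch of Viterbi best-path segmentation is shown below, to make the scoring model concrete. It is not the extension's implementation: viterbiSegment() is a hypothetical helper, the scores are invented, and unknown input simply yields an empty result here instead of an unknown-piece id.

```php
<?php
// Illustrative sketch only: viterbiSegment() is a hypothetical helper,
// not the extension's C code.

/**
 * Finds the highest-scoring segmentation of $text into known pieces.
 *
 * @param array<string,float> $scores piece => log-probability
 * @return string[] best-path pieces, or [] if no segmentation exists
 */
function viterbiSegment(string $text, array $scores): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);

    // $best[$i]: best total score for the first $i characters.
    // $back[$i]: start offset of the last piece on that best path.
    $best = array_fill(0, $n + 1, -INF);
    $back = array_fill(0, $n + 1, -1);
    $best[0] = 0.0;

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            if ($best[$start] === -INF) {
                continue; // prefix is not reachable
            }
            $piece = implode('', array_slice($chars, $start, $end - $start));
            if (!isset($scores[$piece])) {
                continue;
            }
            $score = $best[$start] + $scores[$piece];
            if ($score > $best[$end]) {
                $best[$end] = $score;
                $back[$end] = $start;
            }
        }
    }

    if ($n > 0 && $back[$n] === -1) {
        return []; // no path covers the whole input
    }

    // Walk the back-pointers from the end to recover the pieces.
    $pieces = [];
    for ($i = $n; $i > 0; $i = $back[$i]) {
        array_unshift($pieces, implode('', array_slice($chars, $back[$i], $i - $back[$i])));
    }
    return $pieces;
}

$m = "\u{2581}"; // Metaspace marker
$text = $m . str_replace(' ', $m, 'Hello world'); // prefix space + spaces replaced
// Invented scores for this example.
$scores = [
    $m => -2.0, 'Hello' => -5.0, 'world' => -5.0,
    $m . 'Hello' => -3.0, $m . 'world' => -4.0,
];
echo implode(' | ', viterbiSegment($text, $scores)), "\n"; // ▁Hello | ▁world
```

The two-piece path scores -7.0, beating the four-piece path at -14.0, so the longer pieces win.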

Static factory method

Unigram::fromVocab()

public static function fromVocab(
    array $pieces,
    array $opts = []
): Unigram;

Constructs a Unigram tokenizer from a list of (piece, score) pairs. $pieces is an array of [string, float] entries as found in a SentencePiece tokenizer.json. See $opts reference below.

$ug = \Tokenizers\Unigram::fromVocab($pieces, ['addPrefixSpace' => true]);

To load from a HuggingFace tokenizer.json automatically, use \Tokenizers\Encoding::fromHuggingFace().

Instance methods

encode()

public function encode(string $text): array; // int[]

Encodes $text and returns an array of integer token ids via Viterbi best-path search.

$ids = $ug->encode('Hello world'); // [8774, 296]

countTokens()

public function countTokens(string $text): int;

Returns the token count without materialising the full id array.

$n = $ug->countTokens('Hello world');

decode()

public function decode(array $ids): string;

Reconstructs text from a list of token ids, replacing leading ▁ characters with spaces.

echo $ug->decode([8774, 296]); // "Hello world"

vocabSize()

public function vocabSize(): int;

Returns the size of the vocabulary.

echo $ug->vocabSize();

$opts reference

All keys are optional. Unspecified keys use the listed defaults.

Key             Type  Default                         Description
unkId           int   id of <unk> if present, else 0  Token id to emit for unknown pieces
addPrefixSpace  bool  true                            Prepend a Metaspace (▁) to the input before encoding

5. \Tokenizers\Encoding

Pure-PHP shim (php/Tokenizers/Encoding.php). Provides high-level loaders that download, checksum-verify, and cache vocabulary files, then return the appropriate native tokenizer instance. Also dispatches HuggingFace tokenizer JSON files to the correct C class.

final class Encoding { /* ... */ }

Encoding::load()

public static function load(string $name): Bpe;

Loads a named built-in encoding. Downloads and checksum-verifies the vocabulary on first use; subsequent calls return from the process-global cache. Currently known encodings: 'cl100k_base' and 'o200k_base'.

Throws \Tokenizers\TokenizerException("unknown encoding: <name>") for unrecognised names.

use Tokenizers\Encoding;

$enc = Encoding::load('cl100k_base');
echo $enc->countTokens('Hello world'); // 2
echo $enc->vocabSize();                // 100277

Encoding::fromHuggingFace()

public static function fromHuggingFace(string $jsonPath): Bpe|WordPiece|Unigram;

Reads a HuggingFace tokenizer.json file and auto-dispatches by the model.type field:

model.type value  Returned class
"BPE"             \Tokenizers\Bpe
"WordPiece"       \Tokenizers\WordPiece
"Unigram"         \Tokenizers\Unigram

$bpe  = Encoding::fromHuggingFace('path/to/llama3/tokenizer.json'); // Bpe
$wp   = Encoding::fromHuggingFace('path/to/bert/tokenizer.json');   // WordPiece
$ug   = Encoding::fromHuggingFace('path/to/t5/tokenizer.json');     // Unigram

Encoding::wordPieceFromVocabFile()

public static function wordPieceFromVocabFile(
    string $path,
    array  $opts = []
): WordPiece;

Loads a plain BERT-style vocab.txt file (one token per line, line number = id) into a WordPiece tokenizer. $opts accepts the same keys as WordPiece::fromVocab().

$wp = Encoding::wordPieceFromVocabFile('/path/to/vocab.txt', ['lowercase' => true]);
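
To make the "line number = id" convention concrete, here is a small sketch that builds the same kind of map by hand. loadVocabTxt() is a hypothetical helper written for this example, and it assumes ids are zero-based, as in standard BERT vocab.txt files.

```php
<?php
// Sketch: loadVocabTxt() is a hypothetical helper illustrating the
// "line number = id" mapping. It is not the extension's loader.

/** @return array<string,int> token => id */
function loadVocabTxt(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new \RuntimeException("cannot read vocab file: $path");
    }
    $tokenToId = [];
    foreach ($lines as $id => $token) {
        // Assumption: the zero-based line index is the token id.
        $tokenToId[$token] = $id;
    }
    return $tokenToId;
}

// Write a tiny vocab file to a temporary location and load it back.
$tmp = tempnam(sys_get_temp_dir(), 'vocab');
file_put_contents($tmp, "[PAD]\n[UNK]\nhello\n##ing\n");
$vocab = loadVocabTxt($tmp);
unlink($tmp);

echo json_encode($vocab), "\n"; // {"[PAD]":0,"[UNK]":1,"hello":2,"##ing":3}
```

A map of this shape is what WordPiece::fromVocab() expects as $tokenToId.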

Encoding::cacheDir()

public static function cacheDir(): string;

Returns the resolved path to the directory where downloaded vocabulary files are stored. Useful for debugging cache location.

echo Encoding::cacheDir();
// e.g. /home/user/.cache/tokenizers

Encoding::download()

public static function download(string $url, ?string $sha256, string $dest): void;

Low-level utility. Downloads $url to $dest, optionally verifying the file against $sha256. Used internally by Encoding::load(); you generally do not need to call this directly.

Encoding::download(
    'https://example.com/vocab.tiktoken',
    'abc123...',
    '/tmp/vocab.tiktoken'
);
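
The optional checksum step can be pictured with the sketch below, which uses PHP's built-in hash_file() and hash_equals(). verifySha256() is a hypothetical helper for illustration. How Encoding::download() performs verification internally is not documented here.

```php
<?php
// Sketch: verifySha256() is a hypothetical helper, not the extension's code.

function verifySha256(string $path, ?string $expected): bool
{
    if ($expected === null) {
        return true; // the checksum is optional
    }
    $actual = hash_file('sha256', $path); // lowercase hex digest
    return $actual !== false && hash_equals(strtolower($expected), $actual);
}

$tmp = tempnam(sys_get_temp_dir(), 'dl');
file_put_contents($tmp, 'hello');

var_dump(verifySha256($tmp, hash('sha256', 'hello'))); // bool(true)
var_dump(verifySha256($tmp, str_repeat('0', 64)));     // bool(false)
var_dump(verifySha256($tmp, null));                    // bool(true)
```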

Cache directory resolution

Encoding::load() stores downloaded vocabulary files in the first directory that resolves from the following ordered list:

  1. $TOKENIZERS_CACHE_DIR/tokenizers
  2. $XDG_CACHE_HOME/tokenizers
  3. $HOME/.cache/tokenizers
  4. sys_get_temp_dir()/tokenizers
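
The lookup order above can be sketched as a pure function. resolveCacheDir() is a hypothetical helper, and it assumes that a candidate "resolves" when its environment variable is set and non-empty; the actual rule in Encoding::cacheDir() may differ.

```php
<?php
// Sketch: resolveCacheDir() is a hypothetical helper. It assumes a candidate
// "resolves" when its environment variable is set and non-empty.

/** @param array<string,string> $env environment variables */
function resolveCacheDir(array $env, string $tmpDir): string
{
    foreach (['TOKENIZERS_CACHE_DIR', 'XDG_CACHE_HOME'] as $var) {
        if (!empty($env[$var])) {
            return rtrim($env[$var], '/') . '/tokenizers';
        }
    }
    if (!empty($env['HOME'])) {
        return rtrim($env['HOME'], '/') . '/.cache/tokenizers';
    }
    return rtrim($tmpDir, '/') . '/tokenizers';
}

echo resolveCacheDir(['HOME' => '/home/user'], '/tmp'), "\n";
// /home/user/.cache/tokenizers
echo resolveCacheDir(['XDG_CACHE_HOME' => '/var/cache', 'HOME' => '/home/user'], '/tmp'), "\n";
// /var/cache/tokenizers
```

In real code the first argument would be getenv() and the second sys_get_temp_dir().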

Built-in vocabulary files are downloaded from OpenAI's public CDN on first use, checksum-verified, and are never redistributed with the extension.


6. Procedural functions

Global-namespace procedural functions. All functions that operate on a tokenizer accept a \Tokenizers\Bpe instance only — WordPiece and Unigram are not accepted by these functions.

tokenizers_version()

function tokenizers_version(): string;

Returns the extension version string "0.1.0". Equivalent to \Tokenizers\VERSION.

echo tokenizers_version(); // "0.1.0"

tokenizers_cache_count()

function tokenizers_cache_count(): int;

Returns the number of tokenizer models currently held in the process-global vocabulary cache. Useful for diagnostics.

echo tokenizers_cache_count(); // e.g. 1 after loading cl100k_base

tokenizers_encode()

function tokenizers_encode(
    \Tokenizers\Bpe $t,
    string          $text,
    array           $allowedSpecial    = [],
    array|string    $disallowedSpecial = "all"
): array; // int[]

Procedural wrapper for Bpe::encode(). See encode() for $allowedSpecial and $disallowedSpecial semantics.

$ids = tokenizers_encode($bpe, 'Hello world'); // [9906, 1917]

tokenizers_decode()

function tokenizers_decode(\Tokenizers\Bpe $t, array $ids): string;

Procedural wrapper for Bpe::decode().

echo tokenizers_decode($bpe, [9906, 1917]); // "Hello world"

tokenizers_count()

function tokenizers_count(\Tokenizers\Bpe $t, string $text): int;

Procedural wrapper for Bpe::countTokens().

$n = tokenizers_count($bpe, 'Hello world'); // 2

7. Remote companion — \Tokenizers\Remote

Pure-PHP classes that count tokens via the official provider APIs. They work without the C extension loaded: a TokenizerException polyfill is bootstrapped via require_once of php/Tokenizers/TokenizerException.php. They require ext-curl and ext-json and use raw curl, with no dependency on anthropic-ai/sdk, Guzzle, or any other HTTP library.

Note: Claude 3+ and Gemini models have no local tokenizer. Exact token counts for those models require a network call and a valid API key. Providers may change their tokenizer at any time.

Transport interface

interface Transport {
    public function post(
        string $url,
        array  $headers,
        string $body,
        int    $timeout
    ): array; // ['status' => int, 'body' => string]
}

The HTTP abstraction used by Anthropic and Gemini. The default implementation is CurlTransport. Inject a custom implementation for offline testing.

class FakeTransport implements \Tokenizers\Remote\Transport {
    public function post(string $url, array $headers, string $body, int $timeout): array {
        return ['status' => 200, 'body' => '{"input_tokens":5}'];
    }
}

Anthropic

final class Anthropic {
    public function __construct(
        ?string    $apiKey    = null,
        ?Transport $transport = null,
        string     $version   = '2023-06-01',
        int        $timeout   = 30
    );

    public function countTokens(
        string       $model,
        string|array $messages,
        ?string      $system = null
    ): int;
}

Counts tokens for an Anthropic (Claude) model via the official API.

countTokens() details:

  • Endpoint: POST https://api.anthropic.com/v1/messages/count_tokens
  • Headers: x-api-key: <key>, anthropic-version: 2023-06-01, content-type: application/json
  • Body: {"model": <model>, "messages": [...], "system": <system?>}
  • A plain string for $messages becomes a single {"role":"user","content": <text>} turn.
  • An array is sent as-is (for multi-turn conversations).
  • Response field parsed: input_tokens
  • API key resolution: $apiKey constructor argument, otherwise ANTHROPIC_API_KEY environment variable. A missing key throws TokenizerException at call time.
  • Errors: Non-2xx HTTP status or a malformed/missing input_tokens field throws TokenizerException.

use Tokenizers\Remote\Anthropic;

$anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello, world!');

// Multi-turn:
$n = $anthropic->countTokens('claude-opus-4-8', [
    ['role' => 'user',      'content' => 'Hi'],
    ['role' => 'assistant', 'content' => 'Hello!'],
    ['role' => 'user',      'content' => 'How are you?'],
]);

// With a system prompt:
$n = $anthropic->countTokens('claude-opus-4-8', 'Hello', system: 'You are a helpful assistant.');
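
For readers writing a custom Transport or test double, the request and response shapes listed under "countTokens() details" can be sketched as two small functions. Both helpers are hypothetical and exist only for this example; \RuntimeException stands in for TokenizerException so the snippet runs without the library.

```php
<?php
// Sketch: both helpers are hypothetical illustrations of the documented
// request/response shapes, not the library's own code.

function buildCountTokensBody(string $model, string|array $messages, ?string $system = null): string
{
    if (is_string($messages)) {
        // A plain string becomes a single user turn.
        $messages = [['role' => 'user', 'content' => $messages]];
    }
    $body = ['model' => $model, 'messages' => $messages];
    if ($system !== null) {
        $body['system'] = $system; // only sent when provided
    }
    return json_encode($body, JSON_THROW_ON_ERROR);
}

function parseInputTokens(int $status, string $body): int
{
    if ($status < 200 || $status >= 300) {
        throw new \RuntimeException("unexpected HTTP status: $status");
    }
    $data = json_decode($body, true);
    if (!is_array($data) || !isset($data['input_tokens']) || !is_int($data['input_tokens'])) {
        throw new \RuntimeException('missing or malformed input_tokens');
    }
    return $data['input_tokens'];
}

echo buildCountTokensBody('claude-opus-4-8', 'Hi'), "\n";
// {"model":"claude-opus-4-8","messages":[{"role":"user","content":"Hi"}]}
echo parseInputTokens(200, '{"input_tokens":5}'), "\n"; // 5
```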

Gemini

final class Gemini {
    public function __construct(
        ?string    $apiKey    = null,
        ?Transport $transport = null,
        int        $timeout   = 30
    );

    public function countTokens(string $model, string $text): int;
}

Counts tokens for a Google Gemini model via the official API.

countTokens() details:

  • Endpoint: POST https://generativelanguage.googleapis.com/v1beta/models/{model}:countTokens
    The {model} segment is normalised: both "gemini-1.5-flash" and "models/gemini-1.5-flash" are accepted, and the models/ prefix always appears exactly once in the final URL.
  • Header: x-goog-api-key: <key>
  • Body: {"contents":[{"parts":[{"text": <text>}]}]}
  • Response field parsed: totalTokens
  • API key resolution: $apiKey constructor argument, then GEMINI_API_KEY env var, then GOOGLE_API_KEY env var. A missing key throws TokenizerException at call time.
  • Errors: Non-2xx HTTP status or a malformed/missing totalTokens field throws TokenizerException.

use Tokenizers\Remote\Gemini;

$gemini = new Gemini(); // reads GEMINI_API_KEY or GOOGLE_API_KEY from env
$n = $gemini->countTokens('gemini-1.5-flash', 'Hello, world!');
// Both forms are equivalent:
$n = $gemini->countTokens('models/gemini-1.5-flash', 'Hello, world!');
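
The model-name normalisation, request body, and key fallback described under "countTokens() details" can be sketched like this. All three helpers are hypothetical names used for illustration and are not the library's internals.

```php
<?php
// Sketch: hypothetical helpers illustrating the documented behaviour.

function geminiCountTokensUrl(string $model): string
{
    // Strip an existing "models/" prefix so it is never doubled.
    $name = str_starts_with($model, 'models/') ? substr($model, strlen('models/')) : $model;
    return 'https://generativelanguage.googleapis.com/v1beta/models/' . $name . ':countTokens';
}

function geminiCountTokensBody(string $text): string
{
    return json_encode(['contents' => [['parts' => [['text' => $text]]]]], JSON_THROW_ON_ERROR);
}

/** @param array<string,string> $env environment variables */
function resolveGeminiKey(?string $apiKey, array $env): ?string
{
    $candidates = [$apiKey, $env['GEMINI_API_KEY'] ?? null, $env['GOOGLE_API_KEY'] ?? null];
    foreach ($candidates as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return $candidate; // first non-empty candidate wins
        }
    }
    return null; // the caller would raise TokenizerException at call time
}

echo geminiCountTokensUrl('gemini-1.5-flash'), "\n";
echo geminiCountTokensUrl('models/gemini-1.5-flash'), "\n"; // same URL as the line above
echo geminiCountTokensBody('Hi'), "\n"; // {"contents":[{"parts":[{"text":"Hi"}]}]}
```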

Environment variable reference

Provider   Environment variable(s)                              Constructor override
Anthropic  ANTHROPIC_API_KEY                                    new Anthropic(apiKey: '...')
Gemini     GEMINI_API_KEY (checked first), then GOOGLE_API_KEY  new Gemini(apiKey: '...')

8. \Tokenizers\TokenCounter

final class TokenCounter {
    public function __construct(
        ?Anthropic $anthropic = null,
        ?Gemini    $gemini    = null
    );

    public static function route(string $model): string; // 'anthropic' | 'gemini' | 'local'

    public function count(
        string  $model,
        string  $text,
        ?string $provider = null
    ): int;
}

High-level facade that dispatches token counting to the right backend based on the model name. Defined in php/Tokenizers/TokenCounter.php.

route() — routing rules (no network call):

Model prefix             Returns
claude or anthropic      'anthropic'
gemini or models/gemini  'gemini'
anything else            'local'

count() — dispatch logic:

  • 'anthropic' → calls Anthropic->countTokens($model, $text)
  • 'gemini' → calls Gemini->countTokens($model, $text)
  • 'local' → calls Encoding::load($model)->countTokens($text)
  • Passing an explicit $provider that is not one of the three recognised values throws TokenizerException("unknown provider '<p>' for model: <model>").

use Tokenizers\TokenCounter;

$tc = new TokenCounter();

// Local (no network, no key needed):
$n = $tc->count('cl100k_base', $text);

// Remote Anthropic (needs ANTHROPIC_API_KEY):
$n = $tc->count('claude-opus-4-8', $text);

// Remote Gemini (needs GEMINI_API_KEY or GOOGLE_API_KEY):
$n = $tc->count('gemini-1.5-flash', $text);

// Inject pre-configured clients:
$tc = new TokenCounter(
    anthropic: new \Tokenizers\Remote\Anthropic(apiKey: 'ant-...'),
    gemini:    new \Tokenizers\Remote\Gemini(apiKey: 'AIza...')
);

// Force a specific provider:
$n = $tc->count('o200k_base', $text, provider: 'local'); // the model must be a known encoding name

// Inspect routing without counting:
echo TokenCounter::route('claude-sonnet-4'); // 'anthropic'
echo TokenCounter::route('gemini-pro');      // 'gemini'
echo TokenCounter::route('cl100k_base');     // 'local'
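
The prefix rules in the routing table amount to a few string checks. The sketch below is a stand-alone approximation: routeModel() is a hypothetical function, and it assumes case-sensitive prefix matching, which the reference does not specify.

```php
<?php
// Sketch: routeModel() is a hypothetical stand-in for TokenCounter::route().
// Case-sensitive matching is an assumption.

function routeModel(string $model): string
{
    foreach (['claude', 'anthropic'] as $prefix) {
        if (str_starts_with($model, $prefix)) {
            return 'anthropic';
        }
    }
    foreach (['gemini', 'models/gemini'] as $prefix) {
        if (str_starts_with($model, $prefix)) {
            return 'gemini';
        }
    }
    return 'local'; // everything else is treated as a local encoding name
}

echo routeModel('claude-sonnet-4'), "\n";         // anthropic
echo routeModel('models/gemini-1.5-flash'), "\n"; // gemini
echo routeModel('cl100k_base'), "\n";             // local
```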

9. \Tokenizers\TokenizerException

class TokenizerException extends \RuntimeException {}

Thrown by all classes in this extension for error conditions. Extends the standard \RuntimeException, so it can be caught as either \Tokenizers\TokenizerException or \RuntimeException.

Thrown in the following situations:

  • Encoding::load() — unknown encoding name: "unknown encoding: <name>"
  • Anthropic::countTokens() / Gemini::countTokens() — missing API key (at call time)
  • Anthropic::countTokens() / Gemini::countTokens() — non-2xx HTTP response
  • Anthropic::countTokens() / Gemini::countTokens() — malformed or missing response field
  • TokenCounter::count() — unknown explicit $provider: "unknown provider '<p>' for model: <model>"

use Tokenizers\Encoding;
use Tokenizers\TokenizerException;

try {
    $enc = Encoding::load('p50k_base'); // not a known encoding in v0.1
} catch (TokenizerException $e) {
    echo $e->getMessage(); // "unknown encoding: p50k_base"
}

See also