Language models cost much more in some languages than others
It is strange that the median token length for Armenian is so large. The language belongs to the Indo-European family and has much in common with its modern relatives, apart from the alphabet.
Great project. Would love to see a linguist's take on it.
These tokenizers are simple compression algorithms. Instead of, say, consuming pairs of characters as base-65536 directly, they do a base-(65536-256) dictionary encoding over bytes, add the 256 raw bytes as an escape hatch, and use that.
The purpose of this is twofold:
1. This allows the implementer to choose an arbitrarily sized dictionary, rather than being restricted to powers of 2. This is a relatively minor consideration, however, compared to...
2. This compresses the input data compared to a naive base-65536 input.
The current tokenizers are heavily tuned for English, because that's the vast majority of their training data, and where the vast majority of the training cost goes. A decent rule of thumb is that a single token is ~4 chars of English text. This saves approximately a factor of two in training costs _and time_ compared to a naive base-65536 encoding, which is huge.
Viewed through this lens, it should be obvious why tokenizers for multilingual networks are tuned more toward other languages at the expense of English.
The current tokenizers are very simple and suboptimal. Expect some improvements here. That being said... at the Pareto frontier you will once again have a tradeoff. An optimal compression algorithm tuned for a more restricted dataset will perform better on that dataset than an optimal compression algorithm tuned for a less restricted superset of that dataset.
I just wanted to flag that the Tamil rendition of "Hey" in the top right quarter is incorrect. It should be ஹேய் - the first two letters are swapped.
Great article. But I am curious how the number of tokens for Amharic came to 69; I count only 5 words!
Great article. I must note that the word cloud doesn't represent right-to-left languages accurately: the text is displayed as if it were written left-to-right.