15 Comments

Hey, I enjoyed reading this and did some exploration on my own here: https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances. I was hoping I could merge some parts of my dashboard with yours!

Hi Shreeya, I loved your article! Yes, I would love to see how we can collaborate on merging the dashboards.

It is strange that the median token length of Armenian is so large. The language belongs to the Indo-European family and has much in common with the modern languages of that family, except for the alphabet.

I agree, but I think that because the Armenian alphabet and language do not surface very often on the Internet (relative to English/French/Spanish/....), words in Armenian cannot be tokenized/compressed as efficiently as English.
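
A rough way to check this against a real tokenizer (a small sketch that assumes the third-party tiktoken package; Բարև, "hello", is just an example Armenian word):

```python
import tiktoken  # third-party tokenizer library used by the OpenAI models

enc = tiktoken.get_encoding("cl100k_base")
for word in ["hello", "Բարև"]:  # "Բարև" is "hello" in Armenian
    tokens = enc.encode(word)
    print(f"{word!r}: {len(word)} characters -> {len(tokens)} tokens")
```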

Great project. Would love to see a linguist's take on it.

These tokenizers are simple compression algorithms. Instead of, say, consuming pairs of characters directly as base-65536, do a base-(65536-256) dictionary encoding over bytes, add the 256 raw bytes as an escape hatch, and use that.

The purpose of this is twofold:

1. This allows the implementer to choose an arbitrarily sized dictionary, as opposed to being restricted to powers of 2. This is a minor consideration, however, compared to...

2. This compresses the input data compared to a naive base-65536 input.

The current tokenizers are heavily tuned for English, because that's the vast majority of their training data, and where the vast majority of the training cost goes. A decent rule of thumb is that a single token is ~4 characters of English text. This saves approximately a factor of two in training costs _and time_ compared to a naive base-65536 encoding, which is huge.
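
For the curious, here is a minimal sketch of that kind of dictionary encoding (a toy byte-pair-style scheme, not the actual GPT tokenizer): ids 0-255 are the raw-byte escape hatch, and the remaining dictionary slots are filled by greedily merging the most frequent adjacent pair.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_merges(text, vocab_size=300):
    """Greedily add merged pairs to the dictionary until vocab_size is reached."""
    ids = list(text.encode("utf-8"))   # start from the 256 raw bytes
    merges = {}                        # (id, id) -> new token id
    next_id = 256                      # ids 0..255 are reserved as the escape hatch
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        ids = merge(ids, (a, b), next_id)
        next_id += 1
    return merges

def encode(text, merges):
    """Apply the learned merges in training order; unknown bytes stay as raw-byte tokens."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion (training) order
        ids = merge(ids, pair, new_id)
    return ids

merges = train_merges("the quick brown fox jumps over the lazy dog " * 100)
print(len("the lazy dog".encode("utf-8")), "bytes ->",
      len(encode("the lazy dog", merges)), "tokens")
```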

Viewed through this lens, it should be obvious why tokenizers for multilingual networks are tuned more for other languages at the expense of English.

The current tokenizers are very simple and suboptimal. Expect some improvements here. That being said... at the Pareto frontier you will once again have a tradeoff. An optimal compression algorithm tuned for a more restricted dataset will perform better on that dataset than an optimal compression algorithm tuned for a less restricted superset of that dataset.

I just wanted to flag that the Tamil rendition of "Hey" in the top right quarter is incorrect. It should be ஹேய் - the first two letters are swapped.

Thank you for bringing this up!! I am looking into fixing it. However, it seems like there is a matplotlib plotting issue where it randomly swaps the Tamil letters. (lol, this goes towards my article's thesis that not all languages are treated equally, even by these common Python libraries...) In my dataset, the Tamil letters are in the correct order, but when I try to display them in the plot, the letters get swapped. I'll let you know once I have fixed it :)

Update: I have tried several different Tamil fonts, and it seems that it's an issue with matplotlib and not with the fonts. I'm not sure how I can get around displaying Tamil text correctly... Really speaks to the disparity in language representation...

Thanks for trying! Very strange behavior from matplotlib. I bet the challenge is that the first glyph is actually applied as a modification to the second glyph, so the plotting lib thinks it should come after. Kinda like Aˆ being displayed instead of Â.
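
For anyone who wants to poke at it, here is a small sketch using only the standard-library unicodedata module: the codepoints of ஹேய் are stored consonant-first, but the vowel sign ே is drawn to the left of its consonant, so a renderer that skips complex text shaping can easily put it on the wrong side, much like the A + combining circumflex vs. Â example.

```python
import unicodedata

# "Hey" in Tamil, stored in logical order: consonant HA, then the vowel sign EE
# (which is *rendered* to the left of HA), then YA + virama.
for ch in "ஹேய்":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The Latin analogue: "A" + a combining circumflex composes to the single glyph Â.
print(unicodedata.normalize("NFC", "A\u0302") == "\u00c2")  # True
```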

Great article. But I am curious how the number of tokens for Amharic came out to 69? I count only 5 words!

I know! I don't know Amharic, but my guess is that the script is so "unusual" with respect to all of the other scripts that, in order for the computer to tokenize it properly, it needs to break each word into multiple "subwords".
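
One concrete way to see why (a small sketch; ሰላም, "hello", is just an example Amharic word): each Ge'ez character takes 3 bytes in UTF-8, so a tokenizer that falls back to raw bytes for unfamiliar scripts can spend several tokens per character.

```python
word = "ሰላም"  # "hello" in Amharic (Ge'ez script): 3 characters
print(len(word), "characters ->", len(word.encode("utf-8")), "UTF-8 bytes")
# 3 characters -> 9 bytes; a byte-fallback tokenizer may need up to 9 tokens for this one word
```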

Great article. I must note that the word cloud doesn't represent right-to-left languages accurately - the text is displayed as if they were written left-to-right.

Thank you for catching this! I have attempted to remedy this -- please let me know if it looks correct now!

The direction is right. I mean correct.

However, Arabic script should use connected letters (in font parlance, "shaping"), especially for the Lam-Alef ligature. Sorry. BiDi is as much of a mess as CJK, just a lesser-known one.
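
One common workaround (a sketch that assumes the third-party arabic_reshaper and python-bidi packages, which many matplotlib/wordcloud users reach for): shape the letters first, then reorder them into visual order for renderers that don't handle BiDi themselves.

```python
import arabic_reshaper                   # joins letters and forms ligatures such as Lam-Alef
from bidi.algorithm import get_display   # reorders logical RTL text into visual order

text = "السلام عليكم"                    # stored in logical order, letters unconnected
shaped = arabic_reshaper.reshape(text)   # apply contextual letter forms / ligatures
visual = get_display(shaped)             # flip into the visual order a naive LTR renderer expects
print(visual)
```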
