Hey, I enjoyed reading this and did some exploration on my own here: https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances. I was hoping I could merge some parts of my dashboard with yours!
Hi Shreeya, I loved your article! Yes, I'd love to see how we can collaborate on merging the dashboards.
It is strange that the median token length for Armenian is so large. The language belongs to the Indo-European family and has much in common with the modern languages of that family, except for its alphabet.
I agree, but I think that because the Armenian alphabet and language do not surface very often on the Internet (relative to English/French/Spanish/...), Armenian words cannot be tokenized/compressed as efficiently as English words.
Great project. Would love to see a linguist's take on it.
These tokenizers are simple compression algorithms. Instead of, say, consuming pairs of characters directly as base-65536, they do a base-(65536-256) dictionary encoding over bytes and add the 256 raw bytes as an escape hatch (see the sketch after the list below).
The purpose of this is twofold:
1. This allows the implementer to choose an arbitrary-sized dictionary, as opposed to being restricted to powers of 2. This is almost a minor consideration, however, compared to...
2. This compresses the input data compared to a naive base-65536 input.
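To make the escape-hatch idea concrete, here is a minimal sketch of a greedy dictionary encoder with a raw-byte fallback. The vocabulary entries and example words are made up for illustration; real tokenizers (e.g. BPE) learn the dictionary from training data rather than hard-coding it.

```python
# Toy dictionary encoding with a raw-byte escape hatch (illustration only).
# IDs 0-255 are reserved for raw bytes; "learned" entries start at 256.
VOCAB = {b"the": 256, b" token": 257, b"izer": 258}
MAX_LEN = max(len(k) for k in VOCAB)

def encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        # Greedy longest match against the learned dictionary.
        for n in range(min(MAX_LEN, len(data) - i), 1, -1):
            if data[i:i + n] in VOCAB:
                ids.append(VOCAB[data[i:i + n]])
                i += n
                break
        else:
            # Escape hatch: no dictionary hit, emit the raw byte (ID 0-255).
            ids.append(data[i])
            i += 1
    return ids

print(encode("the tokenizer"))  # dictionary hits: [256, 257, 258]
print(encode("բառեր"))          # no hits: ten raw UTF-8 bytes, one token each
```

English text keeps hitting the dictionary, while a lower-resource script like Armenian mostly falls through to raw bytes, which is exactly the many-tokens-per-word effect discussed in this thread.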
The current tokenizers are heavily tuned for English, because that's the vast majority of their training data and where the vast majority of the training cost goes. A decent rule of thumb is that a single token covers ~4 characters of English text. This saves approximately a factor of two in training costs _and time_ compared to a naive base-65536 encoding, which is huge.
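You can check that rule of thumb directly. A quick sketch, assuming the third-party tiktoken package and its cl100k_base vocabulary (the sample strings are just for illustration):

```python
# Compare chars-per-token for a high-resource vs. a low-resource script,
# assuming the tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello world", "Բարև աշխարհ"]:  # English vs. Armenian
    n_tokens = len(enc.encode(text))
    print(f"{text!r}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.1f} chars per token")
```

English should land near the ~4 chars/token figure, while the Armenian string falls back to far more tokens per character.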
Viewed through this lens, it should be obvious why tokenizers for multilingual networks are tuned more toward other languages at the expense of English.
The current tokenizers are very simple and suboptimal. Expect some improvements here. That being said... at the Pareto frontier you will once again face a tradeoff: an optimal compression algorithm tuned for a restricted dataset will perform better on that dataset than one tuned for a less restricted superset of it.
I just wanted to flag that the Tamil rendition of "Hey" in the top right quarter is incorrect. It should be ஹேய் - the first two letters are swapped.
Thank you for bringing this up!! I am looking into fixing it. However, it seems like there is a matplotlib plotting issue where it randomly swaps the Tamil letters. (lol, this goes toward my article's thesis that not all languages are treated equally, even by these common Python libraries...) In my dataset the Tamil letters are in the correct order, but when I try to display them in the plot, the letters get swapped. I'll let you know once I have fixed it :)
Update: I have tried several different Tamil fonts, and it seems it's an issue with matplotlib and not with the fonts. I'm not sure how to work around it to display Tamil text correctly... Really speaks to the disparity in language representation...
Thanks for trying! Very strange behavior from matplotlib. I bet the challenge is that the first glyph is actually applied as a modification to the second glyph, so the plotting lib thinks it should come after. Kinda like Aˆ being displayed instead of Â.
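That is essentially what happens here. In Unicode's stored (logical) order, the Tamil vowel sign comes after the consonant it attaches to, and a shaping engine is expected to reorder it to the left at render time; matplotlib's default text layout skips that step. A quick way to see the logical order in Python:

```python
import unicodedata

# Logical (stored) code point order of the Tamil word for "Hey".
# A shaping engine renders the vowel sign EE to the LEFT of HA,
# even though it is stored AFTER it.
for ch in "ஹேய்":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+0BB9  TAMIL LETTER HA
# U+0BC7  TAMIL VOWEL SIGN EE
# U+0BAF  TAMIL LETTER YA
# U+0BCD  TAMIL SIGN VIRAMA
```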
Great article. But I am curious how the number of tokens for the Amharic language came out to 69? I count only 5 words!
I know! I don't know Amharic, but my guess is that the script is so "unusual" relative to all of the other scripts that, in order to tokenize it at all, the tokenizer has to break each word into many "subwords".
Great article. I must note that the word cloud doesn't represent right-to-left languages accurately - the text is displayed as if they were written left-to-right.
Thank you for catching this! I have attempted to remedy this -- please let me know if it looks correct now!
The direction is right. I mean correct.
However, Arabic script should use connected letters (in font parlance, "shaping"), especially for the Lam-Alef ligature. Sorry. BiDi is as much of a mess as CJK, just a less well-known one.
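For anyone hitting the same wall: a commonly used workaround for matplotlib is to apply shaping and bidi reordering yourself before plotting. A sketch, assuming the third-party arabic-reshaper and python-bidi packages (the sample string is just an illustration):

```python
# Arabic text in matplotlib: shape the letters, then reorder for display.
# Assumes pip install arabic-reshaper python-bidi, plus an Arabic-capable font.
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

text = "السلام عليكم"
shaped = arabic_reshaper.reshape(text)  # pick contextual (connected) letter forms
visual = get_display(shaped)            # reorder the RTL run into visual order

fig, ax = plt.subplots()
ax.text(0.5, 0.5, visual, fontsize=24, ha="center", va="center")
ax.set_axis_off()
plt.show()
```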