The problem is that these models have no idea about the internal structure of the tokens they use, except for what they can pick up from the training data. The model sees “Kenya” as something like 8473 299 = “Ken” + “ya”, so how is it supposed to know that token 8473, often used for the name of Barbie’s boyfriend, starts with K?
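To make that concrete, here is a rough sketch with a toy tokenizer. The vocabulary, the IDs, and the `encode` helper are all made up for illustration (the 8473/299 split is the "or something" guess from above, not real tokenizer output), and real tokenizers use learned merge rules rather than this simple greedy matching.

```python
# Toy vocabulary: every ID here is hypothetical, chosen only for illustration.
TOY_VOCAB = {"Ken": 8473, "ya": 299, "K": 42, "e": 17, "n": 23, "y": 31, "a": 5}


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding of text into a list of token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry covers the character at position i.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


# This list of integers is all the model receives for the word "Kenya".
print(encode("Kenya"))  # [8473, 299]

# The ID for "Ken" and the ID for the letter "K" are unrelated numbers.
print(encode("Ken"), encode("K"))  # [8473] [42]
```

Nothing in the integer 8473 says "starts with K". Any link between that ID and the letter has to be learned from text where the spelling happens to be discussed.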
More and more articles will be written by language models and that really sucks 😔
ChatGPT literally makes exactly that mistake (with the initial letters)
Also they love to make up Fun Facts.