Multi-language support #12

DadiBit · 2024-04-20T21:01:37Z

While checking out PR #11 I realized that supporting multiple languages should be fairly easy to implement. However, I would add an optional langs parameter that takes a list of languages (e.g. eng,deu,ita) and split each language into its own training data.

If no langs param is passed, then all are checked.

Why? Well:

we don't want to update/rebuild the whole vector/dataset if we add a new language or update an existing one
we don't want to overload the server if we know that a certain site is going to use mostly one or two languages (e.g. German and English)
it makes the code more sustainable (not just a single huge .csv file with a bunch of commits)
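
A minimal sketch of how the optional langs parameter could be parsed, assuming a comma-separated string and falling back to every language when it is missing. `SUPPORTED_LANGS` and `parseLangs` are hypothetical names for illustration, not part of the existing code.

```typescript
// Hypothetical list of languages the service has training data for.
const SUPPORTED_LANGS: readonly string[] = ["eng", "deu", "ita"];

// Parse an optional comma-separated `langs` value (e.g. "eng,deu").
// A missing or blank value means "check every supported language".
function parseLangs(raw: string | null | undefined): string[] {
  if (raw === null || raw === undefined || raw.trim() === "") {
    return SUPPORTED_LANGS.slice();
  }
  return raw
    .split(",")
    .map((code) => code.trim().toLowerCase())
    // Drop empty entries and codes we have no data for.
    .filter((code) => code.length > 0 && SUPPORTED_LANGS.indexOf(code) !== -1)
    // Drop duplicates while keeping the caller's order.
    .filter((code, i, all) => all.indexOf(code) === i);
}
```

Here unknown codes are silently dropped, so a request for only unsupported languages yields an empty list; returning an error instead would be an equally reasonable choice.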


BKSchatzki · 2024-04-20T21:57:49Z

Was going to drop some Chinese and Thai in and submit a PR, but saw this issue first and I couldn't agree more.

Having a langs param would be really important for getting rid of false positives, especially regarding languages written using something other than the Latin alphabet when they are romanized or abbreviated.

NatanaelBorges · 2024-04-21T03:42:57Z

When reviewing the code to adapt it for implementation in Portuguese, I realized that opting for integration in the .csv file would not make much sense. In my opinion, exploring an alternative implementation method has numerous potential advantages and benefits. Here's a breakdown of the key points:

Benefits:

Efficiency: By splitting the training data by language, the code avoids retraining the entire model when a new language is added or an existing one is updated. This saves time and computational resources.
Scalability: The optional langs parameter allows specifying which languages to check for profanity. This can improve server performance by avoiding unnecessary checks for languages not in use.
Maintainability: Separating the training data by language makes the codebase more manageable. It simplifies adding new languages and avoids cluttering the code with a massive combined dataset.

DadiBit · 2024-04-21T11:20:30Z

I have little to no experience with language-based content, but I've seen that ISO 639 is the go-to web/API standard. Some resources are linked in the footnotes.

Here are two sub-issues:

do profanity words change across individual 3-letter languages (e.g. Southern Thai sou and Northern Thai nod)? ¹
do profanity words change across local variations (e.g. US en-US vs UK en-GB, or Portugal vs Brazil...)? ²

Although I'm pretty sure the answer is no for both questions in most cases, I would like to implement the ability to use both 3 letter codes and localization too.

If we were to implement 3 letter codes and localization:

We could have a training_data folder with an en subfolder. It would contain en.csv as the main "index" for the macro language, eng.csv as an extension of that macro language (i.e. en), and then en-US.csv for any local extension of the 2-letter code, or eng-US for even finer localisation (this last option is probably never going to be used, tbh).

Params
Here is my idea:
If lang=en is passed then only en is used
If lang=eng is passed then both en and eng are used
If lang=en-US is passed then both en and en-US are used
Since all CSVs are in the 2-letter-code folder, we have a super easy way to get the 2-letter code from the 3-letter one.
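
The three rules above could be sketched as a lookup over the folder layout. This assumes the proposed training_data/<2-letter code>/<code>.csv structure; `TRAINING_DATA` and `resolveLangFiles` are hypothetical names, and the file list is hard-coded here instead of being read from disk.

```typescript
// Hypothetical in-memory view of the proposed layout:
// training_data/<2-letter code>/<code>.csv
const TRAINING_DATA: { folder: string; files: string[] }[] = [
  { folder: "en", files: ["en", "eng", "en-US"] },
  { folder: "it", files: ["it", "ita"] },
];

// Return the CSV paths to load for one `lang` value, base file first.
function resolveLangFiles(lang: string): string[] {
  for (const entry of TRAINING_DATA) {
    if (entry.files.indexOf(lang) === -1) continue;
    // The folder name is the 2-letter code, so the base file is always known.
    const stems = lang === entry.folder ? [entry.folder] : [entry.folder, lang];
    return stems.map((stem) => `training_data/${entry.folder}/${stem}.csv`);
  }
  return []; // unknown code: nothing to load
}
```

With this layout, `resolveLangFiles("eng")` yields the en.csv and eng.csv paths, while `resolveLangFiles("en")` yields only en.csv.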

Langs endpoint
Also as a bonus, a simple /api/langs/ could return the supported language list.
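
As a rough idea of what /api/langs/ might return, here is a sketch that builds a sorted, de-duplicated list of codes. The `LangsResponse` shape and `buildLangsResponse` name are assumptions, and the routing/HTTP layer is left out on purpose.

```typescript
// Hypothetical response shape for GET /api/langs/.
interface LangsResponse {
  langs: string[];
}

// Build the response body from whatever codes are available,
// e.g. the file names found under training_data.
function buildLangsResponse(codes: string[]): LangsResponse {
  // filter() returns a new array, so the caller's list is not mutated by sort().
  const unique = codes.filter((code, i, all) => all.indexOf(code) === i);
  return { langs: unique.sort() };
}
```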

¹ Difference between 2-letter and 3-letter codes: https://en.wikipedia.org/wiki/ISO_639-3
² Country codes: https://www.iso.org/obp/ui/#search

joschan21 · 2024-04-22T06:29:57Z

Great discussion. From a user perspective I would find it confusing if there were multiple variations of en and eng, because intuitively there might not be a difference. By the way, great question whether this should be in the same database or namespace. The benefit would be that it's much simpler to set up.

On the other hand, the risk might be that a word in Spanish that looks similar to an English swear word gets flagged without meaning anything bad in Spanish. Then again, some profanities remain the same, i.e. the most popular English profanities also work in Spanish I suppose - we'd have to re-index all of them for the Spanish, Portuguese, etc. versions. So I propose just adding them to the same database and seeing how that goes. The lang parameters make sense.

DadiBit · 2024-04-22T07:31:30Z

> On the other hand, the risk might be that a word in Spanish that looks similar to an English swear word gets flagged without meaning anything bad in Spanish. Then again, some profanities remain the same, i.e. the most popular English profanities also work in Spanish I suppose - we'd have to re-index all of them for the Spanish, Portuguese, etc. versions. So I propose just adding them to the same database and seeing how that goes. The lang parameters make sense.

An example: the word "negro" in Italian is literally the n-word; in Spanish, however, it's just the color black, without any negative connotation whatsoever (as far as I know).
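
To illustrate why scoping by language avoids this kind of false positive, here is a simplified sketch. It uses exact word matching rather than vector similarity, and "wordx" is a placeholder for a term that is offensive in one language only; `FLAGGED` and `isFlagged` are hypothetical names.

```typescript
// Hypothetical per-language word lists; "wordx" stands in for a term that is
// offensive in Italian but neutral in Spanish and Portuguese.
const FLAGGED: { lang: string; words: string[] }[] = [
  { lang: "ita", words: ["wordx"] },
  { lang: "spa", words: [] },
  { lang: "por", words: [] },
];

// Flag a word only if it is listed for one of the requested languages.
function isFlagged(word: string, langs: string[]): boolean {
  const needle = word.toLowerCase();
  return FLAGGED.some(
    (entry) =>
      langs.indexOf(entry.lang) !== -1 && entry.words.indexOf(needle) !== -1
  );
}
```

A site that only passes spa and por would never see the Italian entry, while a site that passes ita would.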

t-var-s · 2024-04-22T10:26:19Z

> in Spanish, however, it's just the color black, without any negative connotation whatsoever (as far as I know)

Yes, same thing in Portuguese.

DadiBit · 2024-04-22T12:30:55Z

> From a user perspective I would find it confusing if there were multiple variations of en and eng, because intuitively there might not be a difference. By the way, great question whether this should be in the same database or namespace. The benefit would be that it's much simpler to set up.

It would be nice to have some input from the community on Asian and Arabic languages (on Wikipedia I saw multiple 3-letter languages under the common "ar" macro one).

> Some profanities remain the same, i.e. the most popular English profanities also work in Spanish I suppose - we'd have to re-index all of them for the Spanish, Portuguese, etc. versions.

I think an easier approach would be to just let the dev pass the eng parameter if they're concerned with it.

For languages related to English/German roots there are certain shared words (for example English and German share "shit", and probably more that you, @joschan21, know better), but at least I can tell you that most profanities in Italian are... in Italian. You might hear someone say the n-word in English, but it's not extremely common.
Some English words are used as neologisms, profanity or not, but I wouldn't put them in other databases.

With this said, Italian is full of swear words, so it might also just be we don't "need" English for profanities.

Also, I'm sorry to ask, but: what's the difference between "namespace" and "database"? Is the database the single training csv data?

dzakyabdurhmn · 2024-04-26T11:41:12Z

I tried writing harsh words in Indonesian, but they weren't detected as toxic.

DadiBit · 2024-05-14T19:56:20Z

Quick update: namespace support was added last week to upstash js api: upstash/vector-js#25
I'm testing some stuff with it :)

Edit: answering my old question: namespaces are a way to group data under a single index in a similar fashion to metadata, however, contrary to metadata, it is selectable on query. In other words: one database with groups.
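
To make the "one database with groups" idea concrete, here is a toy in-memory model. It is deliberately NOT the Upstash API: `ToyIndex`, `upsert` and `query` are hypothetical names, and exact text matching stands in for a vector search.

```typescript
interface Entry {
  id: string;
  text: string;
}

// Toy stand-in for a single index that groups its data by namespace.
class ToyIndex {
  private namespaces: { [name: string]: Entry[] } = {};

  // Store an entry in the given namespace, creating the group if needed.
  upsert(namespace: string, entry: Entry): void {
    const group = this.namespaces[namespace] || [];
    group.push(entry);
    this.namespaces[namespace] = group;
  }

  // The namespace is chosen at query time, so only that group is searched.
  query(namespace: string, text: string): Entry[] {
    const group = this.namespaces[namespace] || [];
    return group.filter((entry) => entry.text === text);
  }
}
```

With one namespace per language, a query against "spa" never touches entries stored under "ita", even though both live in the same index.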

DadiBit · 2024-05-15T05:46:01Z

Also @joschan21 which model do you recommend to convert the raw text to vector data? I'm writing a guide in a README.md to keep track of what I did in order to get started and would like to know if you have a recommendation.

NatanaelBorges mentioned this issue Apr 22, 2024

feat: add portuguese language support #22

Closed

DadiBit mentioned this issue Apr 23, 2024

Support for text symbol #20

Open

DadiBit mentioned this issue May 15, 2024

Multi language support #44

Open

7 tasks
