Russian pos tagging/lemmatization/morphological analysis fails with diacritics #12530

Open
mtak- opened this issue Apr 16, 2023 · 5 comments · May be fixed by #12554
Labels
lang / ru Russian language data and models lang / uk Ukrainian language data and models

Comments

@mtak-

mtak- commented Apr 16, 2023

It seems that while there is support for tokenization with diacritics in spaCy, the project doesn't lemmatize/morph/pos tag correctly when they are used.

How to reproduce the behaviour

import ru_core_news_lg
nlp = ru_core_news_lg.load()
doc = nlp('Я ви́жу му́жа и жену́')
print(doc[-1].pos_) # PROPN (incorrect. just a noun)
print(doc[-1].lemma_) # жену́ (incorrect. should be жена)
print(doc[-1].morph) # nothing is printed which is obviously incorrect

If the diacritics are removed before the text is processed, all is well:

import re
from spacy.lang.char_classes import COMBINING_DIACRITICS
# Character class that matches any combining diacritic (e.g. the stress mark)
diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
doc = nlp(diacritics_re.sub('', 'Я ви́жу му́жа и жену́'))

print(doc[-1].pos_) # NOUN
print(doc[-1].lemma_) # жена
print(doc[-1].morph) # Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

pymorphy3/pymorphy2 doesn't handle diacritics

It seems pymorphy3/pymorphy2 doesn't handle diacritics, so perhaps the diacritics should be removed from the token text before parse is called:

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
text = diacritics_re.sub('', token.text)
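As a standalone illustration of the stripping step above, here is a minimal sketch that does not depend on spaCy. The helper name strip_diacritics is hypothetical, and the range U+0300 to U+036F (the Combining Diacritical Marks block) is assumed to cover the same characters as COMBINING_DIACRITICS. It also shows one pitfall worth keeping in mind: the text should not be NFD-normalized first, because letters such as й would then lose their breve.

```python
import re
import unicodedata

# Assumption: U+0300-U+036F (Combining Diacritical Marks) is the range to strip.
# The acute stress mark used in the examples is U+0301.
COMBINING_MARKS_RE = re.compile("[\u0300-\u036f]")


def strip_diacritics(text: str) -> str:
    """Remove combining marks but leave precomposed letters (й, ё) untouched."""
    return COMBINING_MARKS_RE.sub("", text)


stressed = "жену\u0301"  # "жену" followed by a combining acute accent
print(strip_diacritics(stressed))  # жену

# Pitfall: NFD splits й into и + U+0306, and U+0306 is inside the stripped range.
decomposed = unicodedata.normalize("NFD", "мой")
print(strip_diacritics(decomposed))  # мои
```

Stripping only the combining range on the text as written keeps й and ё intact, as long as the input uses the precomposed forms.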
mtak- changed the title from "Russian lemmatization/morphological analysis fails with diacritics" to "Russian pos tagging/lemmatization/morphological analysis fails with diacritics" Apr 16, 2023
@adrianeboyd adrianeboyd added lang / ru Russian language data and models lang / uk Ukrainian language data and models labels Apr 17, 2023
@adrianeboyd
Contributor

Thanks for the note, we'll take a look!

@adrianeboyd adrianeboyd linked a pull request Apr 20, 2023 that will close this issue
@adrianeboyd
Contributor

The suggestion for the lemmatizer is included in #12554.

For the poor tagging, etc. from the statistical models on tokens with diacritics, I think the best option would be to configure custom NORM, PREFIX, and SUFFIX features for ru and uk that strip diacritics. If you want to try this out with the current spaCy release (v3.5), you could use a custom language to customize these methods, which are called lex_attr_getters in the language defaults:

https://spacy.io/usage/linguistic-features#language-subclass

The defaults would be extended similar to this:

class RussianDefaults(BaseDefaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS
    suffixes = COMBINING_DIACRITICS_TOKENIZER_SUFFIXES
    infixes = COMBINING_DIACRITICS_TOKENIZER_INFIXES

class Russian(Language):
    lang = "ru"
    Defaults = RussianDefaults

@mtak-
Author

mtak- commented Apr 20, 2023

Wonderful! Thank you for the quick PR and suggestions.

I'm a noob when it comes to spaCy. I'm using it to generate tags on Anki flashcards to study Russian. If I understand you correctly, the model I use would need to be trained on text with diacritics. Is that correct (i.e. ru_core_news_lg will not work)?

I ask because I tried making a custom language and the results were still unsatisfactory (even with a patch similar to #12554).

import re
import spacy
import ru_core_news_lg
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS
from spacy.lang.ru import Russian

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

# Lexical attribute getters that strip diacritics before computing the feature
def norm(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())
def prefix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[0]
def suffix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[-3:]

# Note: this is the same dict object as spacy.lang.ru.LEX_ATTRS, so update() modifies it in place
ATTR_GETTERS = spacy.lang.ru.LEX_ATTRS
ATTR_GETTERS.update({
    attrs.NORM: norm,
    attrs.PREFIX: prefix,
    attrs.SUFFIX: suffix,
})

class CustomRussianDefaults(Russian.Defaults):
    lex_attr_getters = ATTR_GETTERS

@spacy.registry.languages("custom_ru")
class CustomRussian(Russian):
    lang = "custom_ru"
    Defaults = CustomRussianDefaults

nlp = ru_core_news_lg.load()
# omitted the patching of _pymorphy_lemmatize
nlp.lang = 'custom_ru'

Test

>>> nlp('Я ви́жу му́жа и жену́')[-1].morph
Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
>>> nlp('Я вижу мужа и жену')[-1].morph
Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

The Animacy for жену́ is inanimate with diacritics, which is incorrect.

@adrianeboyd
Contributor

The language and language defaults really need to be set before the pipeline is loaded at all, but you can test this a bit by modifying the pipeline on the fly instead. (A few things may already be cached, so it might not work 100%.)

nlp = spacy.load("ru_core_news_lg")
nlp.vocab.lex_attr_getters.update(...)

A cleaner version would basically make a copy of ru_core_news_lg where [nlp.lang] is edited to custom_ru. But with the above you should be able to test most things out. And keep in mind that the statistical models will still make mistakes, especially for ambiguous cases.

@Vuizur

Vuizur commented Feb 23, 2024

I had the same problem and discovered at least a workaround:
One can create two docs, one with the original stressed text, and one with the text with diacritics removed.
That way you can iterate through the docs in parallel, getting the correct (stressed) text from doc 1 while getting the grammatical information from doc 2.

It's half as fast, but it does work.
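A rough sketch of a variant of this workaround follows. Instead of building two docs and assuming they tokenize identically, it parses only the stripped text and maps each token back to the stressed original through character offsets. The helper names (strip_with_offsets, original_span, tag_stressed) are hypothetical, and the U+0300 to U+036F range is an assumption about which marks to strip. tag_stressed expects a loaded spaCy pipeline and relies only on token.idx, len(token), pos_, lemma_ and morph.

```python
import re

# Assumption: combining marks to strip are U+0300-U+036F (stress mark is U+0301).
COMBINING_MARK_RE = re.compile("[\u0300-\u036f]")


def strip_with_offsets(text):
    """Return (stripped, offsets); offsets[i] is the index in text of stripped[i]."""
    kept = [(i, ch) for i, ch in enumerate(text) if not COMBINING_MARK_RE.match(ch)]
    stripped = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    offsets.append(len(text))  # sentinel so a span may end at the end of the text
    return stripped, offsets


def original_span(text, offsets, start, end):
    """Map stripped[start:end] back to the original text, keeping its marks."""
    return text[offsets[start]:offsets[end]]


def tag_stressed(nlp, text):
    """Analyse the stripped text, but report each token with its stressed form.

    nlp is assumed to be a loaded spaCy pipeline such as ru_core_news_lg.
    """
    stripped, offsets = strip_with_offsets(text)
    doc = nlp(stripped)
    return [
        (
            original_span(text, offsets, tok.idx, tok.idx + len(tok)),
            tok.pos_,
            tok.lemma_,
            str(tok.morph),
        )
        for tok in doc
    ]
```

With a loaded pipeline, tag_stressed(nlp, 'Я ви́жу му́жа и жену́') would return one tuple per token, pairing the stressed surface form with the analysis of its unstressed counterpart. Because only one doc is parsed, this avoids the doubled processing time, at the cost of the offset bookkeeping.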
