fix overwrite bug when adding symbol to dictionary #5329

lydianish · 2023-09-15T21:14:23Z

Before submitting

Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
Did you read the contributor guideline?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #3064.
Fixes #3705.
Fixes #1309.

TL;DR: This PR fixes a bug that duplicates symbols that were meant to be overwritten when loading a vocabulary file. See the detailed explanation in this blog post.

Expected behavior:

A Dictionary object has an indices dict and two lists (symbols and counts). By default, when loading a vocabulary from a file, a Dictionary instance is first created by adding 4 special tokens (<s>, <pad>, </s> and <unk>, in that order). Then, all the entries from the file are appended to the Dictionary. If the vocabulary file already contains some of the special tokens, their file entries should carry the #fairseq:overwrite flag; otherwise, a "duplicate" error is raised at runtime. Furthermore, during preprocessing, the saved dictionary should not contain any of the special symbols.
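To make the file format concrete, here is a minimal sketch of how a single vocabulary line could be split into its parts. It assumes each line looks like `<symbol> <count>` with an optional trailing `#fairseq:overwrite` flag; the helper name `parse_entry` is hypothetical and is not part of fairseq.

```python
from typing import Tuple

OVERWRITE_FLAG = "#fairseq:overwrite"


def parse_entry(line: str) -> Tuple[str, int, bool]:
    """Split one vocabulary line into (symbol, count, overwrite).

    Assumed format: "<symbol> <count>" or "<symbol> <count> #fairseq:overwrite".
    """
    body, field = line.rstrip().rsplit(" ", 1)
    overwrite = field == OVERWRITE_FLAG
    if overwrite:
        # The flag was the last field, so the count is the field before it.
        body, field = body.rsplit(" ", 1)
    return body, int(field), overwrite


print(parse_entry("<pad> 1 #fairseq:overwrite"))  # ('<pad>', 1, True)
print(parse_entry("hello 42"))                    # ('hello', 42, False)
```

Under the expected behavior, an entry parsed with `overwrite=True` for an already-known special token should reuse the existing slot rather than create a new one.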

Current behavior:

The add_symbol function is responsible for adding symbols to the Dictionary. It has an overwrite argument that is set to True when the corresponding line in the file has #fairseq:overwrite. Rather than testing `if word in self.indices and overwrite`, it currently tests `if word in self.indices and not overwrite`, so the case where the symbol should actually be overwritten is ignored. Hence, the symbol is appended to the symbols list, and its index is changed in the indices dict. This results in duplicate symbols and incorrect indices. Generally, only the special symbols are affected. However, because the number of special tokens is set during initialization, that number remains correct.
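The effect of the two conditions can be reproduced with a small stand-in class. This is a simplified sketch, not fairseq's actual Dictionary: the class `ToyDictionary`, the `buggy` switch, the helper `load`, and the way counts are accumulated are assumptions made for illustration.

```python
class ToyDictionary:
    """Simplified stand-in: an indices dict plus symbols and counts lists."""

    SPECIALS = ("<s>", "<pad>", "</s>", "<unk>")

    def __init__(self, buggy: bool):
        self.buggy = buggy
        self.indices = {}
        self.symbols = []
        self.counts = []
        for tok in self.SPECIALS:
            self.add_symbol(tok)
        # Number of special tokens, fixed once at initialization.
        self.nspecial = len(self.symbols)

    def add_symbol(self, word, n=1, overwrite=False):
        if self.buggy:
            # Condition described as the current behavior.
            reuse = word in self.indices and not overwrite
        else:
            # Condition described as the intended behavior.
            reuse = word in self.indices and overwrite
        if reuse:
            idx = self.indices[word]
            self.counts[idx] += n  # assumption: counts are accumulated
            return idx
        # Append branch: a new slot is created and the index is re-pointed.
        idx = len(self.symbols)
        self.indices[word] = idx
        self.symbols.append(word)
        self.counts.append(n)
        return idx


def load(buggy: bool) -> ToyDictionary:
    d = ToyDictionary(buggy)
    for tok in ToyDictionary.SPECIALS:
        d.add_symbol(tok, overwrite=True)  # entries tagged #fairseq:overwrite
    d.add_symbol("hello")
    d.add_symbol("world")
    return d


buggy, fixed = load(True), load(False)
print(len(buggy.symbols), buggy.indices["<s>"], buggy.nspecial)  # 10 4 4
print(len(fixed.symbols), fixed.indices["<s>"], fixed.nspecial)  # 6 0 4
```

With the buggy condition, the four special tokens appear twice and `<s>` is re-pointed to index 4, while the number of special tokens stays at 4 in both cases.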

For example, a dictionary with 50K tokens that already has <s>, <pad>, </s> and <unk> with the #fairseq:overwrite tag will end up having 50004 tokens when loaded. This also propagates to the subsequent model, which will have an embedding dimension of 50004 instead of 50K. Also, with fairseq-preprocess, the resulting dictionary will skip the first 4 special symbols but will still contain the duplicate ones.

Domino effects and backward compatibility:

By fixing this bug, dictionary files will be loaded properly. However, this fix might cause problems in pipelines that use existing architectures and pretrained models because of the mismatch in sentencepiece encoding and/or embedding dimension.

For the sake of backward compatibility, a #fairseq:duplicate flag is introduced to ensure that duplicates are kept in the dictionary, just as they were under the bug. When used with fairseq-preprocess, the produced dict.txt file will also carry #fairseq:duplicate next to the same symbols.
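The three cases a loader has to distinguish (overwrite, duplicate, or neither) can be sketched as follows. This is only an illustration of the described semantics, not the code in this PR: the function `load_symbols` is hypothetical, counts are left out for brevity, and treating an unflagged repeat as an error follows the "duplicate" error mentioned under the expected behavior.

```python
OVERWRITE = "#fairseq:overwrite"
DUPLICATE = "#fairseq:duplicate"
SPECIALS = ("<s>", "<pad>", "</s>", "<unk>")


def load_symbols(lines):
    """Return the symbol list obtained by loading `lines` after the 4 specials."""
    symbols = list(SPECIALS)
    indices = {tok: i for i, tok in enumerate(symbols)}
    for line in lines:
        fields = line.split()
        word = fields[0]
        flag = fields[-1] if fields[-1] in (OVERWRITE, DUPLICATE) else None
        if word in indices:
            if flag == OVERWRITE:
                continue  # reuse the existing slot, no new entry
            if flag != DUPLICATE:
                raise RuntimeError(f"Duplicate word found: {word!r}")
            # DUPLICATE: fall through and append, keeping the pre-fix layout.
        indices[word] = len(symbols)
        symbols.append(word)
    return symbols


print(len(load_symbols(["<s> 1 #fairseq:overwrite", "hello 5"])))  # 5
print(len(load_symbols(["<s> 1 #fairseq:duplicate", "hello 5"])))  # 6
```

A dictionary file written before the fix could then be tagged with `#fairseq:duplicate` to keep the vocabulary size that existing pretrained models expect.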

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Yes, I did 🙃

This bug ignored the tokens that were meant to be overwritten and appended them to the end of the dictionary symbols. For example, a dictionary with 50K tokens that already has `<s>`, `</s>`, `<pad>` and `<unk>` with the #fairseq:overwrite tag will end up having 50004 tokens when loaded.

This ensures compatibility with all the calls to add_symbol across the repo (which overwrite by default, as in the original implementation). The only place where the value is explicitly changed is when loading the dictionary from file (which was the source of the bug). In a file, you have to explicitly say whether the tokens should be overwritten or duplicated.

facebook-github-bot added the CLA Signed label Sep 15, 2023

Merge branch 'main' into main

2332e83

lydianish mentioned this pull request Sep 20, 2023

Dictionary entries with #fairseq:overwrite are not preserved in dict.txt output from fairseq-preprocess #3705

Open

lydianish marked this pull request as draft September 20, 2023 14:37

lydianish marked this pull request as ready for review September 20, 2023 14:39

lydianish marked this pull request as draft September 21, 2023 09:42

Fix test_overwrite in test_dictionary.py

576602d

Assert that overwrite works as expected (i.e. ignoring the duplicates)

lydianish marked this pull request as ready for review September 21, 2023 16:20

lydianish marked this pull request as draft September 21, 2023 17:03

lydianish added 2 commits September 21, 2023 22:01

Add support for fairseq:duplicate flag in dictionary

c7535b0

For backward compatibility with the existing models/pipelines that use a flawed dictionary loaded from file (before the bug fix)

Write unit tests for overwrite and duplicate in dictionary

8987896

lydianish marked this pull request as ready for review September 21, 2023 20:33

Update dictionary.py load function documentation

0968083

lydianish marked this pull request as draft September 21, 2023 21:40

Adding symbols with overwrite=True in encode_line and add_file_to_dictionary

eed21c0

After fixing the behaviour of add_symbol, two of the unit tests were failing because they called the function with the default value of overwrite (False).

lydianish marked this pull request as ready for review September 21, 2023 21:47

lydianish added 2 commits September 22, 2023 09:59

rename test_no_overwrite to test_no_overwrite_nor_duplicate

b291c8d

lydianish marked this pull request as draft March 8, 2024 12:48

remove redundant duplicate variable when loading dictionary from file

5c40fd3

lydianish marked this pull request as ready for review March 8, 2024 12:59

Merge branch 'main' into main

78b904b
