Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

fingoldo · 2023-05-12T00:45:47Z

Describe the problem you met

Hi, thanks for this useful lib, I think it can be improved even further! ;-)

I was disappointed by the lib output for this case:

doc = nlp(
    "Visa to dubia ",
)

print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
print(doc._.score_spellCheck)

True
Visa to.
{dubia: [('.', 0.72154), (';', 0.07971), ('?', 0.04526), ('Norway', 0.00669), ('Estonia', 0.00593), ('Taiwan', 0.00555), ('Bermuda', 0.00531), ('Iceland', 0.00479), ('Italy', 0.00466), ('Germany', 0.00437)]}

Describe the solution you'd like
Obviously, top_n parameter which is internal to the candidate_ranking method should be made a part of constructor.
I made that adjustment locally, and unexpectedly, Cuba became a winner instead of Dubai. 2 reasons were immediately evident:

candidates to misspell comparisons are case-sensitive, which is not good for accuracy
symbols swaps are considered of the same cost as 1 addition and one deletion, which is clearly a drawback.

Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired.
Once implemented locally, I arrived at the desired outcome (Dubai).

However, it felt as a lucky coincidence, as the probabilities coming from underlying Bert were not accounted for when making a decision. I thought it would be very natural to be able to handle them in conjunction with the textual similarities in a Bayesian fashion, where Bert probs could be our prior, cand to misspell textual sim would be B/A conditional probs of B being the correct guess given that A was typed in. To soften the impact of possible Bert miscalibration, absolute probs conversion to ranks would be desired.

Summary of my suggestions:
feature request could be implemented by adding a few parameters to the class constructor:

top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10.
lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True.
damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True.
bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True.
ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.

Calling code would become:
contextualSpellCheck.add_to_pipe(nlp,config=dict(top_n=500,lowercased_distance =True,damerau_distance =True,bayes_selection =True,ranked_bert_probs =False))

After the implementation, I arrive at Dubai instead of the dot, which makes me happy )

Describe alternatives you've considered
I was not able to find any alternative solution. Tried a few Spacy models, to no avail.

Additional context
I used

            Doc.set_extension("bayes_probs", default=None)
            Doc.set_extension("bayes_details", default=None)

to provide additional details for the occasional exigent user:

screenshot detailing resulting candidates rating and bayesian details

If you think the community can benefit from this new functionality, I'm happy to merge my PR.

The text was updated successfully, but these errors were encountered:

fingoldo · 2023-06-10T19:13:58Z

Created a fork with aforementioned functionality, as this repo seems to be dead.

stale · 2023-07-15T00:05:43Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

R1j1t · 2023-07-25T00:24:31Z

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

fingoldo · 2023-07-25T08:31:23Z

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

Thanks for responding! Submitted the PR. It worked for me but not 100% sure it will work on every edge case. I hope you have some tests to run.

R1j1t · 2023-07-27T08:40:19Z

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

stale · 2023-09-16T21:37:51Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

fingoldo · 2023-09-17T01:22:46Z

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

Hi, is there any progress with merging? Any way I can help?

R1j1t · 2023-09-18T11:06:22Z

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

robinsonkwame · 2023-10-21T22:24:42Z

Hi, is there any progress with merging?

fingoldo · 2023-10-21T23:58:41Z

Hi, is there any progress with merging?

Thanks for asking, I have adjusted the PR according to the instructions of Rajat, let's wait for his checking and hopefully approving now )

fingoldo · 2023-11-23T00:39:18Z

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

Can we make the last effort please and finish the undertaking? )

fingoldo added the enhancement New feature or request label May 12, 2023

stale bot added the wontfix This will not be worked on label Jul 15, 2023

stale bot removed the wontfix This will not be worked on label Jul 25, 2023

stale bot added the wontfix This will not be worked on label Sep 16, 2023

stale bot removed the wontfix This will not be worked on label Sep 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

fingoldo commented May 12, 2023

fingoldo commented Jun 10, 2023

stale bot commented Jul 15, 2023

R1j1t commented Jul 25, 2023

fingoldo commented Jul 25, 2023

R1j1t commented Jul 27, 2023

stale bot commented Sep 16, 2023

fingoldo commented Sep 17, 2023

R1j1t commented Sep 18, 2023

robinsonkwame commented Oct 21, 2023

fingoldo commented Oct 21, 2023

fingoldo commented Nov 23, 2023

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

Comments

fingoldo commented May 12, 2023

fingoldo commented Jun 10, 2023

stale bot commented Jul 15, 2023

R1j1t commented Jul 25, 2023

fingoldo commented Jul 25, 2023

R1j1t commented Jul 27, 2023

stale bot commented Sep 16, 2023

fingoldo commented Sep 17, 2023

R1j1t commented Sep 18, 2023

robinsonkwame commented Oct 21, 2023

fingoldo commented Oct 21, 2023

fingoldo commented Nov 23, 2023