Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

Open
fingoldo opened this issue May 12, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@fingoldo
Copy link

Describe the problem you met

Hi, thanks for this useful lib, I think it can be improved even further! ;-)

I was disappointed by the lib output for this case:

doc = nlp(
    "Visa to dubia ",
)

print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
print(doc._.score_spellCheck)

True
Visa to.
{dubia: [('.', 0.72154), (';', 0.07971), ('?', 0.04526), ('Norway', 0.00669), ('Estonia', 0.00593), ('Taiwan', 0.00555), ('Bermuda', 0.00531), ('Iceland', 0.00479), ('Italy', 0.00466), ('Germany', 0.00437)]}

Describe the solution you'd like
Obviously, top_n parameter which is internal to the candidate_ranking method should be made a part of constructor.
I made that adjustment locally, and unexpectedly, Cuba became a winner instead of Dubai. 2 reasons were immediately evident:

  1. candidates to misspell comparisons are case-sensitive, which is not good for accuracy
  2. symbols swaps are considered of the same cost as 1 addition and one deletion, which is clearly a drawback.

Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired.
Once implemented locally, I arrived at the desired outcome (Dubai).

However, it felt as a lucky coincidence, as the probabilities coming from underlying Bert were not accounted for when making a decision. I thought it would be very natural to be able to handle them in conjunction with the textual similarities in a Bayesian fashion, where Bert probs could be our prior, cand to misspell textual sim would be B/A conditional probs of B being the correct guess given that A was typed in. To soften the impact of possible Bert miscalibration, absolute probs conversion to ranks would be desired.

Summary of my suggestions:
feature request could be implemented by adding a few parameters to the class constructor:

top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10.
lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True.
damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True.
bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True.
ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.

Calling code would become:
contextualSpellCheck.add_to_pipe(nlp,config=dict(top_n=500,lowercased_distance =True,damerau_distance =True,bayes_selection =True,ranked_bert_probs =False))

After the implementation, I arrive at Dubai instead of the dot, which makes me happy )

Describe alternatives you've considered
I was not able to find any alternative solution. Tried a few Spacy models, to no avail.

Additional context
I used

            Doc.set_extension("bayes_probs", default=None)
            Doc.set_extension("bayes_details", default=None)

to provide additional details for the occasional exigent user:

screenshot detailing resulting candidates rating and bayesian details

If you think the community can benefit from this new functionality, I'm happy to merge my PR.

@fingoldo fingoldo added the enhancement New feature or request label May 12, 2023
@fingoldo
Copy link
Author

Created a fork with aforementioned functionality, as this repo seems to be dead.

@stale
Copy link

stale bot commented Jul 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Jul 15, 2023
@R1j1t
Copy link
Owner

R1j1t commented Jul 25, 2023

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

@stale stale bot removed the wontfix This will not be worked on label Jul 25, 2023
@fingoldo
Copy link
Author

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

Thanks for responding! Submitted the PR. It worked for me but not 100% sure it will work on every edge case. I hope you have some tests to run.

@R1j1t
Copy link
Owner

R1j1t commented Jul 27, 2023

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

@stale
Copy link

stale bot commented Sep 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Sep 16, 2023
@fingoldo
Copy link
Author

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

Hi, is there any progress with merging? Any way I can help?

@stale stale bot removed the wontfix This will not be worked on label Sep 17, 2023
@R1j1t
Copy link
Owner

R1j1t commented Sep 18, 2023

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

@robinsonkwame
Copy link

Hi, is there any progress with merging?

@fingoldo
Copy link
Author

Hi, is there any progress with merging?

Thanks for asking, I have adjusted the PR according to the instructions of Rajat, let's wait for his checking and hopefully approving now )

@fingoldo
Copy link
Author

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

Can we make the last effort please and finish the undertaking? )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants