Online GapEncoder #1439

Open
MaxHalford opened this issue Nov 3, 2023 · 1 comment

@MaxHalford
Member

skrub is a wonderful new project related to scikit-learn. You can see Gaël Varoquaux present it here. They have a transformer called GapEncoder: it's a way to embed fuzzy strings. This could be really powerful online, say for classifying Tweets or Twitch messages, where typos are aplenty.

We already have a way to do online TF-IDF/count vectorization. But we don't have Gamma-Poisson matrix factorization. It is doable online though. Once we have it, we could assemble the two into a nice GapEncoder class. See paper here.

This is related to #1412. Indeed, maybe this works well without Gamma-Poisson matrix factorization. For instance, we could use decomposition.LDA, which we already have.
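
To make the vectorization half of the idea concrete, here is a rough, standard-library-only sketch of streaming character n-gram TF-IDF. It is not the skrub implementation and not an existing class in this project: `StreamingNgramCounter`, `char_ngrams`, `cosine`, and the `learn_one`/`transform_one` method names are hypothetical, chosen only for illustration. The Gamma-Poisson factorization step is deliberately left out.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


class StreamingNgramCounter:
    """Hypothetical helper: updates n-gram document frequencies one string at a time."""

    def __init__(self, n=3):
        self.n = n
        self.n_seen = 0
        self.doc_freq = Counter()

    def learn_one(self, text):
        self.n_seen += 1
        # Each distinct n-gram counts once per string (document frequency).
        self.doc_freq.update(set(char_ngrams(text, self.n)))
        return self

    def transform_one(self, text):
        # Sparse TF-IDF-style weights with a smoothed IDF;
        # n-grams never seen before get the largest weight.
        counts = char_ngrams(text, self.n)
        return {
            gram: tf * (math.log((1 + self.n_seen) / (1 + self.doc_freq[gram])) + 1)
            for gram, tf in counts.items()
        }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


enc = StreamingNgramCounter()
for msg in ["great stream", "graet stream", "hello chat"]:
    enc.learn_one(msg)

# A typo'd message shares most of its trigrams with the original,
# while an unrelated message shares almost none.
typo = cosine(enc.transform_one("great stream"), enc.transform_one("graet stream"))
other = cosine(enc.transform_one("great stream"), enc.transform_one("hello chat"))
```

The sparse dictionaries returned by `transform_one` are the kind of input a factorization step (Gamma-Poisson, or LDA as suggested above) would then compress into a small number of topic activations.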

@sky-2002

@MaxHalford I can take this up. I'd need some getting-started material for doing this on streams. I'll go through the paper and skrub. Open to discussion.
