Allow Neighbors to accept sparse data #6749

Open
wvdvegte opened this issue Feb 29, 2024 · 3 comments
Labels
snack This will take an hour or two

Comments

@wvdvegte

What's your use case?
I want to use Neighbors to search a corpus of documents for items similar to one or more reference documents. Since Neighbors requires Reference and Data to have the same features, I have to apply Text Embedding, Similarity Hashing, or Topic Modeling to represent the corpora quantitatively. But for most ML tasks with text, I find that Bag of Words usually produces more convincing results.

What's your proposed solution?
Allow Neighbors to accept datasets with different features, at least for sparse data from Bag of Words. Before computing distances, the words that are in Reference but not in Data would be added to Data with value 0, and vice versa.
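To make the proposal concrete, here is a minimal sketch of the zero-filling step in plain Python. It is not Orange code: the dict-per-document representation and the helper names `align_to_union` and `cosine_distance` are assumptions made purely for illustration.

```python
import math

def align_to_union(reference, data):
    """Pad two bag-of-words tables to a shared vocabulary, filling gaps with 0.

    Each table is a list of {word: count} dicts (a hypothetical stand-in
    for a sparse corpus).
    """
    vocab = sorted({word for doc in reference + data for word in doc})

    def to_rows(docs):
        return [[doc.get(word, 0) for word in vocab] for doc in docs]

    return vocab, to_rows(reference), to_rows(data)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

# Reference and Data were vectorized separately, so their words differ.
reference = [{"sparse": 2, "matrix": 1}]
data = [{"sparse": 1, "data": 1}, {"neural": 3, "network": 1}]

vocab, ref_rows, data_rows = align_to_union(reference, data)
distances = [cosine_distance(ref_rows[0], row) for row in data_rows]
nearest = min(range(len(data_rows)), key=distances.__getitem__)
```

After alignment both tables have the same columns, so any distance measure that expects matching features can be applied.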

Are there any alternative solutions?
Not that I'm aware of.

@wvdvegte
Author

wvdvegte commented Mar 1, 2024

There is an alternative solution, though it is a bit cumbersome: concatenate Reference and Data before Bag of Words (this requires that they have more or less the same variables), then separate them again after Bag of Words with Select Rows, using some criterion that distinguishes Reference from Data. Finally, connect Matching Data to the Reference input of Neighbors and Non-matching Data to the Data input. As I said, rather cumbersome, but it works.
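The same workaround, sketched outside the widget canvas in plain Python. The `bag_of_words` helper and the `"reference"`/`"data"` source tag are hypothetical stand-ins for the Bag of Words widget and the Select Rows criterion, not Orange APIs.

```python
def bag_of_words(texts):
    """Count whitespace-separated words over one shared vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = [[tokens.count(word) for word in vocab] for tokens in tokenized]
    return vocab, rows

reference_texts = ["sparse matrix"]
data_texts = ["sparse data", "neural network"]

# Step 1: concatenate, tagging each document with its origin.
combined = [(text, "reference") for text in reference_texts]
combined += [(text, "data") for text in data_texts]

# Step 2: build a single bag of words over the combined corpus.
vocab, rows = bag_of_words([text for text, _ in combined])

# Step 3: split the rows again using the origin tag.
ref_rows = [row for row, (_, source) in zip(rows, combined) if source == "reference"]
data_rows = [row for row, (_, source) in zip(rows, combined) if source == "data"]
```

Because the vocabulary is built once over the combined corpus, both halves end up with identical columns without any padding step.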

@markotoplak added the "snack" (This will take an hour or two) label on Mar 1, 2024
@markotoplak
Member

@wvdvegte, you could probably also use the Apply Domain widget.

But I agree, this should be done automatically. We discussed this, and internally the widget should apply the domain of the data onto the reference when comparing.
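As a rough illustration of what "applying the domain of the data onto the reference" means, here is a plain-Python sketch. It assumes the domain is simply the list of Data's feature names and that absent sparse features become 0; `project_onto_domain` is a hypothetical helper, not the Orange implementation.

```python
def project_onto_domain(rows, domain):
    """Re-express {feature: value} rows using exactly the features in `domain`.

    Features missing from a row are filled with 0 (the sparse default);
    features not present in `domain` are dropped.
    """
    return [[row.get(feature, 0) for feature in domain] for row in rows]

# Assumed domain: the features of the Data table.
data_domain = ["data", "network", "neural", "sparse"]
reference = [{"sparse": 2, "matrix": 1}]

projected = project_onto_domain(reference, data_domain)
# "matrix" is not in the data domain, so it is dropped from the reference.
```

Unlike padding both tables to the union of their words, this keeps only Data's features. Words that occur only in Reference contribute nothing to a dot product with Data rows anyway, since those rows would hold 0 there.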

@wvdvegte
Author

wvdvegte commented Mar 4, 2024

Indeed, in my use case Apply Domain produces processable inputs for Neighbors, too.
However, although it keeps the text in the corpus, for every row it sets all variables that are not sparse to either '?' or 'nan'. Is this intended behavior? If I'm correct, 'nan' means 'not a number', which doesn't make sense for variables that were never defined as numeric.


2 participants