Allow Neighbors to accept sparse data #6749

wvdvegte · 2024-02-29T16:52:07Z

What's your use case?
I want to use Neighbors to search a corpus of documents for items similar to one or more reference documents. Since Neighbors requires that Reference and Data have the same features, I have to apply Text Embedding, Similarity Hashing, or Topic Modeling to represent the corpora quantitatively. But for most ML tasks with text, I find that Bag of Words usually produces more convincing results.

What's your proposed solution?
Allow Neighbors to accept datasets with different features, at least when it comes to sparse data from Bag of Words. Before computing distances, words that occur in Reference but not in Data would be added to Data with value 0, and vice versa.
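
To make the proposal concrete, here is a minimal sketch in plain Python. It is not Orange code: the names `align_to_union` and `cosine_distance` are hypothetical, and each document is assumed to be a dict of word counts. It only illustrates the idea of zero-filling words missing from either side before computing distances.

```python
from math import sqrt


def align_to_union(reference, data):
    """Pad two bag-of-words tables so both use the union vocabulary.

    Hypothetical helper: each row is a dict mapping word -> count.
    Words absent from a row get the value 0.
    """
    vocab = sorted({word for row in reference + data for word in row})

    def to_dense(rows):
        return [[row.get(word, 0) for word in vocab] for row in rows]

    return vocab, to_dense(reference), to_dense(data)


def cosine_distance(a, b):
    """Cosine distance between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1 - dot / (norm_a * norm_b)


# Reference and Data come from separate Bag of Words runs,
# so their vocabularies only partly overlap.
reference = [{"sparse": 2, "matrix": 1}]
data = [{"sparse": 1, "matrix": 1, "widget": 3}, {"banana": 4}]

vocab, ref_rows, data_rows = align_to_union(reference, data)
distances = [cosine_distance(ref_rows[0], row) for row in data_rows]
```

After alignment both tables have the same columns, so any distance that expects equal-length vectors can be applied; a real implementation would presumably keep the matrices sparse rather than expanding them to dense lists.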

Are there any alternative solutions?
Not that I'm aware of.

wvdvegte · 2024-03-01T10:24:33Z

There is an alternative solution, which is a bit cumbersome: concatenate Reference and Data before Bag of Words (this requires that they have more or less the same variables), separate them again after Bag of Words with Select Rows, using some criterion that distinguishes Reference from Data, then connect Matching Data to the Reference input of Neighbors and Non-matching Data to the Data input. As I said, rather cumbersome, but it works.

markotoplak · 2024-03-01T10:47:29Z

@wvdvegte, you could probably also use the Apply Domain widget.

But I agree, this should have been done automatically. We discussed this, and internally we should have applied the domain of the data onto the reference when comparing.
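
As a rough illustration of what applying the domain of the data onto the reference means, the sketch below projects reference rows onto the data's word list. This is plain Python, not Orange's actual implementation; `project_onto_domain` is a hypothetical name, and rows are assumed to be dicts of word counts.

```python
def project_onto_domain(rows, domain):
    """Re-express bag-of-words rows using only the columns in `domain`.

    Hypothetical helper: words in `domain` that a row lacks become 0,
    and words the row has that are not in `domain` are dropped.
    """
    return [[row.get(word, 0) for word in domain] for row in rows]


# Assumed vocabulary of the Data table after Bag of Words.
data_domain = ["matrix", "sparse", "widget"]

# The reference document contains a word ("kernel") that Data never saw.
reference = [{"sparse": 2, "kernel": 5}]

projected = project_onto_domain(reference, data_domain)
```

Under this approach the reference ends up with exactly the data's columns, so words that occur only in the reference (here "kernel") no longer contribute to the distance.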

wvdvegte · 2024-03-04T12:30:15Z

Indeed, in my use case Apply Domain also produces inputs that Neighbors can process.
However, although it keeps the text in the corpus, for every row it sets all variables that are not sparse to either '?' or 'nan'. Is this intended behavior? If I'm correct, 'nan' means 'not a number', which doesn't make sense for variables that were never defined as numeric.

markotoplak added the "snack" label (This will take an hour or two) · Mar 1, 2024
