Index.load () creates instance which does not find words present in the inverted index #503

Open
jmxti opened this issue Sep 28, 2021 · 1 comment


jmxti commented Sep 28, 2021

I have a generated index which, after loading, will not match certain keywords that are present in the inverted index. However, some tricks cause lunr to find matches for the exact same keywords.

I was not able to create a very minimal example, but the original generated index (and the reduced version) is manageable. I could not attach JSON files, so I have created a gist which includes both the full index and the reduced one referenced in the code below. There is also a reproduction script which illustrates the behaviour.

Loading the index and searching for the keyword will not work:

const fs = require ('fs');
const lunr = require ('lunr');

let filename = './index-data.compact.json';
let search = 'EIFUW001R00';

let data = JSON.parse (fs.readFileSync (filename, 'utf8'));
let index = lunr.Index.load (data);
console.log (index.search (search)); // no matches are returned

But the keyword is in the inverted index, as we can see by manually inspecting the JSON:

$ jq '.invertedIndex[] | select (.[0] == "eifuw001r00") | .[0]' index-data.compact.json
"eifuw001r00"

In the reduced example I have removed all the entries in the inverted index starting from that match. When I add back just the first entry (the one we try to match), we do find a match:

/* ... */
data.invertedIndex.push ([ "eifuw001r00", { 
    "_index": 1, "keyCode": {}, "e-IFU CODE": { "78": {} }, "REF": {}, "PRODUCT DESCRIPTION": {} } ]
);
let index = lunr.Index.load (data);
console.log (index.search (search));

However, when the entry after that is also added, no matches are found:

data.invertedIndex.push (/* ... */);
data.invertedIndex.push ([ "eifuw001r01", {
    "_index": 121, "keyCode": {}, "e-IFU CODE": { "140": {} }, "REF": {}, "PRODUCT DESCRIPTION": {} } ]
);
let index = lunr.Index.load (data);
console.log (index.search (search));

Even weirder, altering some lunr internals (the id counter of the TokenSet) makes everything work again:

data.invertedIndex.push (/* ... */);
data.invertedIndex.push (/* ... */);
lunr.TokenSet._nextId++;
let index = lunr.Index.load (data);
console.log (index.search (search));

Changing almost any of the entries in the inverted index before eifuw001r00 will make the search work too. I have found a few changes that still cause the search to fail, but nearly all of them make it work. This is also the reason why I have a rather long index in the reproduction example.

I get the impression that the behaviour is related to this code here. The result of TokenSet.toString (), which includes the id, is used as the key for a lookup in this.minimizedNodes. I'm guessing that it matches something that it should not and modifies a token set's edges, which causes it to lose information.

I have also tried to peek inside the generation of the token set inside the TokenSet.Builder (using the code in the gist). Everything seems to be going fine until the call to the builder's finish (). After that it seems like it only knows about eifuw0 and eifuw1 instead of eifuw001r00 and eifuw001r01.

Any idea what is causing this?
Is it expected behaviour, or is this a bug?
Is there a way to fix, or detect this?

Kind regards

@jmxti jmxti changed the title Index.load () createds instance which does not find words present in the inverted index Index.load () creates instance which does not find words present in the inverted index Sep 28, 2021

jmxti commented Sep 28, 2021

Some further debugging shows that at some point I have a token set with two edges, one for '0' and one for '1', both pointing to a single final token set which happens to have id 4. Computing its toString () (can be found here) yields '0' for not final, then '0' + '4' for the first edge and '1' + '4' for the second edge, giving '00414'.

I also happen to have a token set with a single edge for '0' which points to a token set with id 414. Computing its toString () yields '0' for not final, then '0' + '414' for the only edge, which is also '00414'.

Both of them have the same string representation but they don't represent the same thing. The first represents that your search term ends with either a '0' or a '1', while the second represents that you can have a '0' followed by some other characters.
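
The collision can be reproduced without lunr. The following is a minimal sketch, assuming plain objects that mimic the id, final and edges fields of a token set. plainKey is a hypothetical stand-in that concatenates the same pieces as described above, and the ids 500 and 501 are made up for the illustration.

```javascript
// Hypothetical stand-in for the key described above: the final flag, then
// label + child id for every edge, concatenated without separators.
function plainKey (node) {
  let str = node.final ? '1' : '0';
  for (const label of Object.keys (node.edges).sort ()) {
    str += label + node.edges[label].id;
  }
  return str;
}

const leaf = { id: 4, final: true, edges: {} };
const deeper = { id: 414, final: false, edges: {} }; // its own edges are omitted here

// "ends with 0 or 1": two edges pointing to the same final node with id 4
const eitherDigit = { id: 500, final: false, edges: { '0': leaf, '1': leaf } };
// "0 followed by more characters": one edge pointing to the node with id 414
const zeroThenMore = { id: 501, final: false, edges: { '0': deeper } };

// A lookup table keyed by that string cannot tell the two nodes apart.
const seen = {};
seen[plainKey (zeroThenMore)] = zeroThenMore;
const mistakenForDuplicate = plainKey (eitherDigit) in seen;

console.log (plainKey (eitherDigit), plainKey (zeroThenMore), mistakenForDuplicate);
// prints: 00414 00414 true
```

If the node with two edges is then replaced by the one found in the table, the branch for '1' is gone, which would be consistent with the lost terms.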

The reason I could not reduce the number of entries in the inverted index is that doing so would change that last id, 414. This is also why lunr.TokenSet._nextId++ seems to fix the problem, and probably why changes to some of the other terms seem to fix it too: if the number of characters changes, the ids shift before we get to the problem area.

If I change TokenSet.toString () so that it places markers in the generated string, the problem disappears. Something like:

lunr.TokenSet.prototype.toString = function () {
  if (this._str) { return this._str; }

  var str = this.final ? '1' : '0',
      labels = Object.keys(this.edges).sort(),
      len = labels.length

  for (var i = 0; i < len; i++) {
    var label = labels[i],
        node = this.edges[label]

    str = str + ', L(' + label + ')I(' + node.id + ')'
  }

  return str
}

I don't think this can cause the same collisions since a label is only a single character, and an id can only contain numbers.
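
As a quick check of that reasoning, here is a small sketch that applies the same marker format to plain objects shaped like the two token sets from the debugging notes. markedKey and the object literals are illustrative assumptions, not lunr code.

```javascript
// Same marker format as the modified toString () above, applied to plain objects.
function markedKey (node) {
  let str = node.final ? '1' : '0';
  for (const label of Object.keys (node.edges).sort ()) {
    str += ', L(' + label + ')I(' + node.edges[label].id + ')';
  }
  return str;
}

// Two edges ('0' and '1') to the node with id 4
const twoEdges = { final: false, edges: { '0': { id: 4 }, '1': { id: 4 } } };
// One edge ('0') to the node with id 414
const oneEdge = { final: false, edges: { '0': { id: 414 } } };

console.log (markedKey (twoEdges)); // 0, L(0)I(4), L(1)I(4)
console.log (markedKey (oneEdge));  // 0, L(0)I(414)
```

With the markers in place, the two keys differ, so a lookup keyed by them keeps the nodes separate.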

But

  • I don't know what the performance impact could be.
  • I don't know if the string is used in other places.
  • it feels really hacky
  • ...
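
On detecting the problem without patching lunr: one option might be to check, right after loading, that every term of the serialized inverted index can still be found in the loaded token set. The sketch below is untested against the index from the gist. findLostTerms is a hypothetical helper, and it assumes that index.tokenSet, TokenSet.fromString (), intersect () and toArray () behave as they do when lunr expands an exact search term.

```javascript
// Hypothetical helper (not part of lunr): list every term of the serialized
// inverted index that the loaded index's token set no longer contains.
function findLostTerms (lunr, data) {
  const index = lunr.Index.load (data);

  return data.invertedIndex
    .map (function (entry) { return entry[0]; })
    .filter (function (term) {
      // Assumption: terms contain no '*', which fromString treats as a wildcard.
      const exact = lunr.TokenSet.fromString (term);
      return index.tokenSet.intersect (exact).toArray ().length === 0;
    });
}

// Usage (assumes lunr is installed and `data` is the parsed index JSON):
//   const lost = findLostTerms (require ('lunr'), data);
//   if (lost.length > 0) { console.warn ('terms missing after load:', lost); }
```

This does one intersection per term, so it is probably only suitable as a build-time or test-time check, not something to run on every page load.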
