Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Prefix match support fuzziness #30720

Closed
KazaBen opened this issue Mar 22, 2024 · 9 comments
Closed

[Feature request] Prefix match support fuzziness #30720

KazaBen opened this issue Mar 22, 2024 · 9 comments
Assignees
Milestone

Comments

@KazaBen
Copy link

KazaBen commented Mar 22, 2024

Is your feature request related to a problem? Please describe.

When implementing autocomplete application, most of the queries are incomplete words.
Sometimes misspells happen in incomplete queries too. Let's look at an example:

Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

It is a must do for autocomplete application to be still able to match Fjallraven with user query Fjalr.

Describe the solution you'd like

When using Prefix match add support for Fuzziness.

Describe alternatives you've considered

Combining prefix matching with word fuzzy matching: term (prefix) contains OR term contains fuzzy
However, it would not cover the previously mentioned common case:
Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

term (prefix) contains couldn't match as it query has a misspell and there is no fuzziness support
term contains fuzzy couldn't match as it is incomplete word

Additional context

This exact functionality exists in Elasticsearch’s Completion Suggester Fuzzy Queries

@geirst geirst added this to the soon milestone Apr 3, 2024
@KazaBen
Copy link
Author

KazaBen commented Apr 15, 2024

@vekterli @geirst Hi 👋

We have a major project to implement autocomplete on Vespa for Vinted, and we depend on fuzziness :o
Are there any estimates when this could be implemented, or should we look for other solutions?

Thank you ❤️

@tkaessmann
Copy link

Wouldn‘t it be possible to add a simple component that first run a internal fuzzy matching query and with the corrected result a usual prefix query?
Or did i understand something wrong?

Greetings,
Tobias

@vekterli
Copy link
Member

@KazaBen I've started working on this feature between a few other things. Implementation will be a two-step process:

  1. Add support for prefix matching to our core Levenshtein algorithm implementations (Deterministic Finite Automata used for max edits {1, 2} and a generalized Levenshtein matrix fallback for max edits > 2). Work in progress.
  2. Wire this new prefix matching mode into the query evaluation pipeline. Not yet started.

I should be able to give an estimate once I've started work on part 2 (hopefully within a few days, assuming nothing else pops up) and have gotten a gist of the complexity involved.

@vekterli
Copy link
Member

Update: part 2 has been completed and its PR is pending code review. Once it has been merged and a new Vespa version has been released containing the changes, fuzzy prefix matching can be used by adding a prefix:true annotation to your query term. Example YQL:

{maxEditDistance:1,prefix:true}fuzzy("Fjalr")

Assuming no other blockers, I'd expect it to be available as part of an Open-source release some time next week. I'll update this issue with a concrete version number once that happens.

A few notes:

  • Fuzzy prefix matching will often end up matching a lot more terms than non-prefix matching, so this should be taken into account when constructing queries—in particular when query strings are short. For instance, the query {maxEditDistance:2,prefix:true}fuzzy("Fj") will match all terms since any possible prefix can be trivially transformed to Fj with 2 edits. This generalizes to be the case for every query where maxEditDistance >= len(query).
  • Related to the above, prefix locking (prefixLength:n) can be used alongside prefix matching to constrain the candidate set to terms that have prefix that exactly matches n characters of the query term. This also greatly speeds up dictionary scans.
  • Although this implements fuzzy prefix matching, one piece of the puzzle that is still missing is a ranking feature that exposes the actual edit distance between the term and the query. This is nothing new, as the same applies to non-prefix fuzzy queries. We have an existing ticket Consider add ranking features for fuzzy query operator #24242 for adding this.

@vekterli
Copy link
Member

@KazaBen version 8.337.85 is now on Docker hub, which contains support for fuzzy prefix matching. I haven't added it to the official documentation just yet, but my previous comment should have the most important bits of information (and caveats...!). Would be great if you could test it out and let me know if it solves your use case.

@vekterli
Copy link
Member

Prefix matching with fuzzy semantics is fully supported and documented, so I'm marking this issue as closed.

I had a stab at implementing a fuzzy ranking feature, but It's Complicated™ due to the many different kinds of string index abstractions we have in the backend. It would likely have to be prioritized as its own, dedicated work package.

@vekterli
Copy link
Member

Wouldn‘t it be possible to add a simple component that first run a internal fuzzy matching query and with the corrected result a usual prefix query?
Or did i understand something wrong?

Greetings,
Tobias

@tkaessmann sorry, I just realized I had entirely forgotten to reply to your message.

Initially running a regular fuzzy query based on the prefix alone would generally not return the results required to correct the query since the prefix would be too far away in edit distance from the complete term(s) (which is what the dictionary stores). There's also an inherent ambiguity in what "correct" would entail for any given input—a fuzzy prefix query for "ban" (with max edit distance 1) could be a search intent for e.g. "bananas" (exact prefix, 0 edits) or "batteries" (1 edit), or even "tanning salon" (still 1 edit). This would require expanding the input query to a potentially very large number of exact prefix queries.

The new fuzzy prefix feature requires only at most a single scan of the dictionary to find all matches—usually skipping large parts of it along the way due to our DFA-based optimizations.

I'll also very conveniently cite the May 2024 newsletter—hot off the presses:

A common use case for interactive search experiences is type-ahead searching. Here results start appearing immediately as the user types, getting more and more refined as the user keeps typing. One way to solve this is by using prefix matching; a user searching for the prefix “Edvard Gr” will get back documents matching “Edvard Grieg”. However, prefix matching requires the query and the document prefixes to match exactly; searching for “Edward Gr” or “Eddvard Gr” will not return anything at all. This may fail to surface many potential completions.

Vespa has supported “typo-friendly” fuzzy matching of terms for years, where a string matches if it can be transformed to the query within a query-specified maximum number of edits (inserts, deletions, or substitutions), but this matches entire strings and therefore cannot be used for prefixes. E.g. “Edward Gr” will not fuzzy-match “Edvard Grieg” with max edits of 1 because matching would require both replacing “w” with “v” as well as inserting “ieg” to the end, which is 4 edits in total.

Vespa 8.337 and beyond add support for fuzzy prefix matching, which combines the best of both worlds. A string is considered a match if its prefix can be transformed to the query within a specified maximum number of edits. This means “Edward Gr” and “Eddvard Gr” will match “Edvard Grieg”, “Mettal” will match “Metallica” and so on.

@tkaessmann
Copy link

@vekterli you got me with the performance argument :) Thanks for your response!

@KazaBen
Copy link
Author

KazaBen commented Jun 3, 2024

@vekterli Hi, sorry for late response, but we have tested this in production 👍

Overall feature works well, matches what is expected! Well done, and thank you! 🙇

Our observed latency was ~16ms, however, we have decided to not use this, as we use a different indexing technique to achieve the same in ~6ms.

Simply indexing an additional field title_edge_ngram, both attributes with fast-search.
title: "zara shoes"
title_edge_ngram: [z, za, zar, zara, zara s, zara sh, zara sho, zara shoe, zara shoes]

prefix query (~220ms)
SELECT * FROM suggestions WHERE title CONTAINS{prefix:true} @user_query

dropping prefix queries and querying title_edge_ngram field (~25ms)
SELECT * FROM suggestions WHERE title_edge_ngram CONTAINS @user_query

Same for fuzzy prefix queries (~16ms):
SELECT * FROM suggestions WHERE title CONTAINS{prefixLength:2, maxEditDistance: 1, prefix: true} fuzzy(@user_query)

Dropping prefix queries (~5ms)
SELECT * FROM suggestions WHERE title_edge_ngram CONTAINS{prefixLength:2, maxEditDistance: 1} fuzzy(@user_query)

Our use case:

  • 60M documents
  • 3 very big nodes
  • 3 groups
  • 10k RPS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants