Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change in the default fast field tokenizer manager. #2128

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fulmicoton
Copy link
Collaborator

{ fast: true } now results in the use of a the default fast field tokenizer. (instead of no tokenizer) The default tokenizer lowercases.

Fast field gets a different default tokenizer manager than the normal tokenizer.

The serialization of the fast field options is unchanged.

`{ fast: true }` now results in the use of a the default fast field tokenizer. (instead of no tokenizer)
The default tokenizer lowercases.

Fast field gets a different default tokenizer manager than the normal tokenizer.

The serialization of the fast field options is unchanged.
@fulmicoton fulmicoton marked this pull request as ready for review July 18, 2023 10:21
@fulmicoton fulmicoton requested a review from fmassot July 18, 2023 10:21
@fulmicoton fulmicoton force-pushed the default_fast_field_tokenizer branch from a0037ac to 61422d7 Compare July 18, 2023 10:21
@codecov-commenter
Copy link

Codecov Report

Merging #2128 (61422d7) into main (5fafe4b) will decrease coverage by 0.01%.
The diff coverage is 98.35%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main    #2128      +/-   ##
==========================================
- Coverage   94.37%   94.37%   -0.01%     
==========================================
  Files         321      319       -2     
  Lines       60821    60808      -13     
==========================================
- Hits        57401    57386      -15     
- Misses       3420     3422       +2     
Impacted Files Coverage Δ
src/lib.rs 99.05% <ø> (ø)
src/schema/mod.rs 100.00% <ø> (ø)
src/schema/json_object_options.rs 91.66% <92.85%> (-1.00%) ⬇️
src/fastfield/writer.rs 94.10% <96.66%> (-0.75%) ⬇️
src/schema/text_options.rs 97.90% <98.14%> (+2.03%) ⬆️
src/aggregation/bucket/term_agg.rs 98.28% <100.00%> (ø)
src/aggregation/mod.rs 91.87% <100.00%> (ø)
src/core/index.rs 87.43% <100.00%> (ø)
src/core/index_meta.rs 96.14% <100.00%> (ø)
src/fastfield/mod.rs 100.00% <100.00%> (ø)
... and 7 more

... and 2 files with indirect coverage changes

} else {
self.fast = FastFieldTextOptions::IsEnabled(true);
}
pub fn set_fast(mut self, tokenizer_name: &str) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is it possible to set fast to Disabled?

FastFieldTextOptions::IsEnabled(false)
/// Enum used to control the way we serialize fast field text options.
///
/// For backward compatiblity reasons, we folow the format introduce in tantivy 0.19.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// For backward compatiblity reasons, we folow the format introduce in tantivy 0.19.
/// For backward compatibility reasons, we follow the format introduced in tantivy 0.19.

@@ -153,6 +153,8 @@ pub use self::term::{Term, ValueBytes, JSON_END_OF_PATH};
pub use self::text_options::{TextFieldIndexing, TextOptions, STRING, TEXT};
pub use self::value::Value;

pub(crate) const DEFAULT_FAST_FIELD_TOKENIZER: &str = "default";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: don't we want to call it by its real name lower?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TokenizerManager::default_for_fast_fields registers it both under lower and under default. If I understand things correctly, using default would mean the concrete choice is subject to change whereas lower should always yield the lower case even in future Tantivy versions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer default_fast_field, to avoid confusing it with default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmassot @adamreichold "default" is the one that will be used when a user is not specifying any tokenizer.
For isntance, when using the flags.

.add_text(my_field, STRING | FAST)

It is then possible in tantivy to pass your own tokenizer manager for indexing and for fast fields.
So not putting the specific tokenizer directly here makes it possible for users to pass a tokenizer manager with the default tokenizer they prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants