Change in the default fast field tokenizer manager. #2128

fulmicoton · 2023-07-18T10:21:11Z

{ fast: true } now results in the use of a the default fast field tokenizer. (instead of no tokenizer) The default tokenizer lowercases.

Fast field gets a different default tokenizer manager than the normal tokenizer.

The serialization of the fast field options is unchanged.

`{ fast: true }` now results in the use of a the default fast field tokenizer. (instead of no tokenizer) The default tokenizer lowercases. Fast field gets a different default tokenizer manager than the normal tokenizer. The serialization of the fast field options is unchanged.

codecov-commenter · 2023-07-18T11:27:20Z

Codecov Report

Merging #2128 (61422d7) into main (5fafe4b) will decrease coverage by 0.01%.
The diff coverage is 98.35%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main    #2128      +/-   ##
==========================================
- Coverage   94.37%   94.37%   -0.01%     
==========================================
  Files         321      319       -2     
  Lines       60821    60808      -13     
==========================================
- Hits        57401    57386      -15     
- Misses       3420     3422       +2

Impacted Files	Coverage Δ
src/lib.rs	`99.05% <ø> (ø)`
src/schema/mod.rs	`100.00% <ø> (ø)`
src/schema/json_object_options.rs	`91.66% <92.85%> (-1.00%)`	⬇️
src/fastfield/writer.rs	`94.10% <96.66%> (-0.75%)`	⬇️
src/schema/text_options.rs	`97.90% <98.14%> (+2.03%)`	⬆️
src/aggregation/bucket/term_agg.rs	`98.28% <100.00%> (ø)`
src/aggregation/mod.rs	`91.87% <100.00%> (ø)`
src/core/index.rs	`87.43% <100.00%> (ø)`
src/core/index_meta.rs	`96.14% <100.00%> (ø)`
src/fastfield/mod.rs	`100.00% <100.00%> (ø)`
... and 7 more

... and 2 files with indirect coverage changes

fmassot · 2023-07-18T13:38:10Z

src/schema/json_object_options.rs

-        } else {
-            self.fast = FastFieldTextOptions::IsEnabled(true);
-        }
+    pub fn set_fast(mut self, tokenizer_name: &str) -> Self {


question: is it possible to set fast to Disabled?

fmassot · 2023-07-18T13:40:42Z

src/schema/text_options.rs

-        FastFieldTextOptions::IsEnabled(false)
+/// Enum used to control the way we serialize fast field text options.
+///
+/// For backward compatiblity reasons, we folow the format introduce in tantivy 0.19.


Suggested change

/// For backward compatiblity reasons, we folow the format introduce in tantivy 0.19.

/// For backward compatibility reasons, we follow the format introduced in tantivy 0.19.

fmassot · 2023-07-18T13:51:06Z

src/schema/mod.rs

@@ -153,6 +153,8 @@ pub use self::term::{Term, ValueBytes, JSON_END_OF_PATH};
 pub use self::text_options::{TextFieldIndexing, TextOptions, STRING, TEXT};
 pub use self::value::Value;

+pub(crate) const DEFAULT_FAST_FIELD_TOKENIZER: &str = "default";


nitpick: don't we want to call it by its real name lower?

I think TokenizerManager::default_for_fast_fields registers it both under lower and under default. If I understand things correctly, using default would mean the concrete choice is subject to change whereas lower should always yield the lower case even in future Tantivy versions?

I would prefer default_fast_field, to avoid confusing it with default

@fmassot @adamreichold "default" is the one that will be used when a user is not specifying any tokenizer.
For isntance, when using the flags.

.add_text(my_field, STRING | FAST)

It is then possible in tantivy to pass your own tokenizer manager for indexing and for fast fields.
So not putting the specific tokenizer directly here makes it possible for users to pass a tokenizer manager with the default tokenizer they prefer.

fulmicoton marked this pull request as ready for review July 18, 2023 10:21

fulmicoton requested a review from fmassot July 18, 2023 10:21

fulmicoton force-pushed the default_fast_field_tokenizer branch from a0037ac to 61422d7 Compare July 18, 2023 10:21

fmassot reviewed Jul 18, 2023

View reviewed changes

fmassot approved these changes Jul 18, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change in the default fast field tokenizer manager. #2128

Change in the default fast field tokenizer manager. #2128

fulmicoton commented Jul 18, 2023

codecov-commenter commented Jul 18, 2023

fmassot Jul 18, 2023

fmassot Jul 18, 2023

fmassot Jul 18, 2023

adamreichold Jul 18, 2023

PSeitz Jul 18, 2023

fulmicoton Jul 19, 2023 •

edited

	/// For backward compatiblity reasons, we folow the format introduce in tantivy 0.19.
	/// For backward compatibility reasons, we follow the format introduced in tantivy 0.19.

Change in the default fast field tokenizer manager. #2128

Are you sure you want to change the base?

Change in the default fast field tokenizer manager. #2128

Conversation

fulmicoton commented Jul 18, 2023

codecov-commenter commented Jul 18, 2023

Codecov Report

fmassot Jul 18, 2023

Choose a reason for hiding this comment

fmassot Jul 18, 2023

Choose a reason for hiding this comment

fmassot Jul 18, 2023

Choose a reason for hiding this comment

adamreichold Jul 18, 2023

Choose a reason for hiding this comment

PSeitz Jul 18, 2023

Choose a reason for hiding this comment

fulmicoton Jul 19, 2023 • edited

Choose a reason for hiding this comment

fulmicoton Jul 19, 2023 •

edited