-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fingerprint ingest processor #13724
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for 50cdb84: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #13724 +/- ##
============================================
+ Coverage 71.42% 71.67% +0.25%
- Complexity 59978 61309 +1331
============================================
Files 4985 5062 +77
Lines 282275 287984 +5709
Branches 40946 41711 +765
============================================
+ Hits 201603 206417 +4814
- Misses 63999 64529 +530
- Partials 16673 17038 +365 ☔ View full report in Codecov by Sentry. |
❌ Gradle check result for 4a37513: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
Show resolved
Hide resolved
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for aae81a7: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
private static final Set<String> HASH_METHODS = Set.of("MD5", "SHA-1", "SHA-256", "SHA3-256"); | ||
|
||
// fields used to generate hash value | ||
private final List<String> fields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my last concern is the fields
& includeAllFields
, usage of both properties is conflicting. As far as I can tell, we use different approach in other processors by relying on field patterns and exclusion lists (see RemoveByPatternProcessor
fe):
private final List<String> fieldPatterns;
private final List<String> excludeFieldPatterns;
It is easier to reason about and configure, the includeAllFields
would correspond to "*"
pattern, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because field name can be a single star *
:
{
"test": {
"mappings": {
"properties": {
"*": {
"type": "text"
}
}
}
}
}
, so we may not use *
to represent all fields, I think that's why we implement a new remove_by_pattern
processor and introduce new parameters whose name contains pattern
, but not make the parameter field
in the remove
processor support *
, that may introduce confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we may not use * to represent all fields, I think that's why we implement a new remove_by_pattern processor and introduce new parameters whose name contains pattern
Oh I see, thanks for highlighting that, I didn't know to be fair we could use *
as a field name. How about falling back to inclusion / exclusion list:
private final Set<String> includeFields;
private final Set<String> excludeFields;
I think it should be very consistent to other processors: empty lists means all fields. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add both inclusion and exclusion list, but seems other processors don't guarantee that empty list means all fields, for example, if the field
or exclude_field
in the remove
processor is empty, the processor does nothing:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"remove": {
"field": []
}
}
]
},
"docs": [
{
"_source": {
"uri": [
"app","home","overview"
],
"a":1
}
}
]
}
,response:
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_source": {
"a": 1,
"uri": [
"app",
"home",
"overview"
]
},
"_ingest": {
"timestamp": "2024-05-24T06:01:44.1073448Z"
}
}
}
]
}
, so maybe we still need the flag include_all_fields
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add both inclusion and exclusion list, but seems other processors don't guarantee that empty list means all fields,
We have a mix, I agree. Documenting would help for sure (the semantic of RemoveProcessor
is somewhat different - it always need a field to be removed, I think empty field list should be rejected it looks like)
, so maybe we still need the flag include_all_fields.
The flag would work, the reason I am trying to get rid of it is to have non-conflicting and easy configuration. If use has no includes nor excludes (not provided or empty) - all fields, otherwise - pick and choose.
@msfroh do you have thoughts on the subject?
Description
Add a new ingest processor named
fingerprint
which generate a hash value for some specified fields or all fields in a document and write the hash value to thetarget_field
, the hash value can be used to deduplicate documents within a index and collapse search results.The usage of this processor is:
or
The main parameters in this processor are:
fields
: fields in the document used to generate hash value, field name and value are concatenated and separated by|
, like|field1|value1|field2|value2|
, for nested fields, the field name is flattened, like|root_field.sub_field1|value1|root_field.sub_field2|value2|
include_all_fields
: whether all fields are included to generate the hash value, eitherfields
orinclude_all_fields
can be set.hash_method
: MD5, SHA-1 or SHA-256, SHA-1 is the default hash method.target_field
: the field to store the hash valueignore_missing
: if one of the specified fields is missing, the processor will exit quietly and do nothing.Related Issues
#13612
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.