Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fingerprint ingest processor #13724

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

gaobinlong
Copy link
Contributor

@gaobinlong gaobinlong commented May 17, 2024

Description

Add a new ingest processor named fingerprint which generate a hash value for some specified fields or all fields in a document and write the hash value to the target_field, the hash value can be used to deduplicate documents within a index and collapse search results.

The usage of this processor is:

"processors": [
      {
        "fingerprint": {
          "fields": ["foo", "bar"],
          "target_field": "fingerprint",
          "hash_method": "SHA-256",
          "ignore_missing": true
        }
      }
    ]

or

"processors": [
      {
        "fingerprint": {
          "include_all": true,
          "target_field": "fingerprint"
        }
      }
    ]

The main parameters in this processor are:

  1. fields: fields in the document used to generate hash value, field name and value are concatenated and separated by |, like |field1|value1|field2|value2|, for nested fields, the field name is flattened, like |root_field.sub_field1|value1|root_field.sub_field2|value2|
  2. include_all_fields: whether all fields are included to generate the hash value, either fields or include_all_fields can be set.
  3. hash_method: MD5, SHA-1 or SHA-256, SHA-1 is the default hash method.
  4. target_field: the field to store the hash value
  5. ignore_missing: if one of the specified fields is missing, the processor will exit quietly and do nothing.

Related Issues

#13612

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Copy link
Contributor

❌ Gradle check result for 50cdb84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for e6b8851: SUCCESS

Copy link

codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 38 lines in your changes are missing coverage. Please review.

Project coverage is 71.67%. Comparing base (b15cb0c) to head (cd7fcdc).
Report is 299 commits behind head on main.

Files Patch % Lines
...opensearch/ingest/common/FingerprintProcessor.java 66.66% 30 Missing and 7 partials ⚠️
...search/ingest/common/IngestCommonModulePlugin.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13724      +/-   ##
============================================
+ Coverage     71.42%   71.67%   +0.25%     
- Complexity    59978    61309    +1331     
============================================
  Files          4985     5062      +77     
  Lines        282275   287984    +5709     
  Branches      40946    41711     +765     
============================================
+ Hits         201603   206417    +4814     
- Misses        63999    64529     +530     
- Partials      16673    17038     +365     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

❌ Gradle check result for 4a37513: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for b518133: SUCCESS

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Copy link
Contributor

❌ Gradle check result for aae81a7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Copy link
Contributor

✅ Gradle check result for cd7fcdc: SUCCESS

private static final Set<String> HASH_METHODS = Set.of("MD5", "SHA-1", "SHA-256", "SHA3-256");

// fields used to generate hash value
private final List<String> fields;
Copy link
Collaborator

@reta reta May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my last concern is the fields & includeAllFields, usage of both properties is conflicting. As far as I can tell, we use different approach in other processors by relying on field patterns and exclusion lists (see RemoveByPatternProcessor fe):

    private final List<String> fieldPatterns;
    private final List<String> excludeFieldPatterns;

It is easier to reason about and configure, the includeAllFields would correspond to "*" pattern, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because field name can be a single star *:

{
  "test": {
    "mappings": {
      "properties": {
        "*": {
          "type": "text"
        }
      }
    }
  }
}

, so we may not use * to represent all fields, I think that's why we implement a new remove_by_pattern processor and introduce new parameters whose name contains pattern, but not make the parameter field in the remove processor support *, that may introduce confusion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we may not use * to represent all fields, I think that's why we implement a new remove_by_pattern processor and introduce new parameters whose name contains pattern

Oh I see, thanks for highlighting that, I didn't know to be fair we could use * as a field name. How about falling back to inclusion / exclusion list:

    private final Set<String> includeFields;
    private final Set<String> excludeFields;

I think it should be very consistent to other processors: empty lists means all fields. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add both inclusion and exclusion list, but seems other processors don't guarantee that empty list means all fields, for example, if the field or exclude_field in the remove processor is empty, the processor does nothing:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "remove": {
          "field": []
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "uri": [
          "app","home","overview"
        ],
        "a":1
      }
    }
  ]
}

,response:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "a": 1,
          "uri": [
            "app",
            "home",
            "overview"
          ]
        },
        "_ingest": {
          "timestamp": "2024-05-24T06:01:44.1073448Z"
        }
      }
    }
  ]
}

, so maybe we still need the flag include_all_fields.

Copy link
Collaborator

@reta reta May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add both inclusion and exclusion list, but seems other processors don't guarantee that empty list means all fields,

We have a mix, I agree. Documenting would help for sure (the semantic of RemoveProcessor is somewhat different - it always need a field to be removed, I think empty field list should be rejected it looks like)

, so maybe we still need the flag include_all_fields.

The flag would work, the reason I am trying to get rid of it is to have non-conflicting and easy configuration. If use has no includes nor excludes (not provided or empty) - all fields, otherwise - pick and choose.

@msfroh do you have thoughts on the subject?

@reta reta added the backport 2.x Backport to 2.x branch label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants