Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Enable remote chunking via unstructured-ingest #2905

Merged
merged 9 commits into from
Apr 25, 2024

Conversation

Coniferish
Copy link
Collaborator

@Coniferish Coniferish commented Apr 18, 2024

Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it.

  • The first commit adds docstrings and fixes type hints.
  • The second commit reorganizes test_unstructured_ingest so it matches the structure of unstructured/ingest.
  • The third commit contains the primary changes for this PR.
    • The .chunk() method responsible for sending elements to the correct method is moved from ChunkingConfig to Chunker so that ChunkingConfig acts as a config object instead of containing implementation logic. Chunker.chunk() also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking.

Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, quite a few remarks.

I like what you did with the typing and docstrings but I wonder if you could pull that into a separate "prep" PR so the meaty changes are more focused.

I want to take another look in the morning but here are these initial comments in the meantime.

Btw, generally tests should appear in the same commit as the changes that make them pass. We should probably talk about this a little bit. It has to do with the test-driven approach, like first you write the test and then write the code to pass the test (and only the code it takes to pass the test). Let's chat about that next time we get together.

# -- "/.cache" of a user's profile.
# -- Define `work_dir` add the "/chunked" subdirectory to it:
chunker.pipeline_context.work_dir = tmpdir
os.makedirs(os.path.join(tmpdir, "chunked"), exist_ok=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have to create the chunked directory? Doesn't ingest do that for itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, without creating it FileNotFoundError: [Errno 2] No such file or directory is raised

os.makedirs(os.path.join(tmpdir, "chunked"), exist_ok=True)

filename = chunker.run(ELEMENTS_JSON_FILE)
head, tail = os.path.split(filename if filename else "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would filename be None? Seems like worth a separate assert, like:

assert filename is not None, "message explaining what this means and why it's a failure"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is returned by Chunker.run() if that is somehow called without defining chunking_strategy. A message is logged, so I'll add a test for that.

unstructured/ingest/pipeline/reformat/chunking.py Outdated Show resolved Hide resolved
unstructured/ingest/cli/cli.py Show resolved Hide resolved
@Coniferish Coniferish force-pushed the jj/feature-gapped-chunking-cli branch 2 times, most recently from 162a919 to e0da9c8 Compare April 19, 2024 14:36
@Coniferish Coniferish force-pushed the jj/feature-gapped-chunking-cli branch from fb599cb to 9453fd8 Compare April 19, 2024 20:29
@Coniferish Coniferish force-pushed the jj/feature-gapped-chunking-cli branch from 9453fd8 to 766f180 Compare April 19, 2024 20:34
@Coniferish Coniferish requested a review from scanny April 23, 2024 14:49
@Coniferish Coniferish marked this pull request as ready for review April 23, 2024 15:02
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this is almost there. A few questions too.

Let me know when it's ready and I'll take a quick look and approve.

CHANGELOG.md Outdated Show resolved Hide resolved
assert head.endswith("chunked")
assert tail.endswith(".json")

def it_logs_error_with_invalid_remote_chunking_args(

This comment was marked as resolved.


chunker.run(ELEMENTS_JSON_FILE)

assert "Input should be 'basic', 'by_page', 'by_similarity'" in caplog.text
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this the output and where does it come from? Only the remote chunker can tell what chunking strategies it has registered. Seems like 'by_title' should be in the list too.

And Input should be ... isn't very helpful as an error message goes. Is there more that was cut off? I'd be looking to see something saying "chunking_strategy" to give a hint where the problem is. No reason this string can't be as long as it needs to be to clearly represent the log entry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unstructured_client.models.errors.httpvalidationerror.HTTPValidationError raises the error and by_title is in the list. I was trying to keep it to one line while still conveying what the logged message entailed.

The full output:

'ERROR    unstructured.ingest:chunking.py:83 failed to run chunking on file /Users/johnjennings/src/unstructured/example-docs/test_evaluate_files/unstructured_output/Bank Good Credit Loan.pptx.json, {"detail":[{"loc":["body","chunking_strategy"],"msg":"Input should be \'basic\', \'by_page\', \'by_similarity\' or \'by_title\'","type":"literal_error"}]}
Traceback (most recent call last):
  File "/Users/johnjennings/src/unstructured/unstructured/ingest/pipeline/reformat/chunking.py", line 64, in run
   chunked_elements = self.chunk(elements_json)
  File "/Users/johnjennings/src/unstructured/unstructured/ingest/pipeline/reformat/chunking.py", line 109, in chunk
    return partition_via_api(
  File "/Users/johnjennings/src/unstructured/unstructured/partition/api.py", line 103, in partition_via_api\n    response = sdk.general.partition(req)
  File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/utils/_human_utils.py", line 86, in wrapper
    return func(*args, **kwargs)
  File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/utils/_human_split_pdf.py", line 40, in wrapper
    return func(*args, **kwargs)
  File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/general.py", line 90, in partition
    raise out
unstructured_client.models.errors.httpvalidationerror.HTTPValidationError: {"detail":[{"loc":["body","chunking_strategy"],"msg":"Input should be \'basic\', \'by_page\', \'by_similarity\' or \'by_title\'","type":"literal_error"}]}'

unstructured/ingest/pipeline/reformat/chunking.py Outdated Show resolved Hide resolved
Comment on lines +117 to +119
# -- These are not currently supported by the sdk and raise an error.
# -- They will need to be updated once we bump the client version.
# combine_text_under_n_chars=self.chunking_config.combine_text_under_n_chars,
# include_orig_elements=self.chunking_config.include_orig_elements,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does being support by unstructured-client have to do with calling partition_via_api() via the unstructured library? unstructured-client doesn't even need to be installed, does it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unstructured_via_api() uses unstructured-client to call the api.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened this as a follow-up issue:
#2924

# -- pinned to <=0.18.0
# overlap=self.chunking_config.overlap,
# overlap_all=self.chunking_config.overlap_all,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior when there is no chunking strategy registered with the requested name on the remote host?

Like how does the user get a message like: No such chunking strategy: 'foobar'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is when the SDK raises an error. Example test: it_logs_error_on_invalid_remote_chunking_strategy()

Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@Coniferish Coniferish force-pushed the jj/feature-gapped-chunking-cli branch from aeb7106 to d9bbfc2 Compare April 24, 2024 22:34
@scanny scanny added this pull request to the merge queue Apr 25, 2024
Merged via the queue into main with commit 3843af6 Apr 25, 2024
42 checks passed
@scanny scanny deleted the jj/feature-gapped-chunking-cli branch April 25, 2024 00:58
plutasnyy added a commit that referenced this pull request Apr 25, 2024
Update: The cli shell script works when sending documents to the free
api, but the paid api is down, so waiting to test against it.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants