-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Enable remote chunking via unstructured-ingest #2905
Conversation
61f25db
to
e012272
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, quite a few remarks.
I like what you did with the typing and docstrings but I wonder if you could pull that into a separate "prep" PR so the meaty changes are more focused.
I want to take another look in the morning but here are these initial comments in the meantime.
Btw, generally tests should appear in the same commit as the changes that make them pass. We should probably talk about this a little bit. It has to do with the test-driven approach, like first you write the test and then write the code to pass the test (and only the code it takes to pass the test). Let's chat about that next time we get together.
test_unstructured_ingest/unit/pipeline/reformat/test_chunking.py
Outdated
Show resolved
Hide resolved
test_unstructured_ingest/unit/pipeline/reformat/test_chunking.py
Outdated
Show resolved
Hide resolved
# -- "/.cache" of a user's profile. | ||
# -- Define `work_dir` add the "/chunked" subdirectory to it: | ||
chunker.pipeline_context.work_dir = tmpdir | ||
os.makedirs(os.path.join(tmpdir, "chunked"), exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have to create the chunked
directory? Doesn't ingest do that for itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, without creating it FileNotFoundError: [Errno 2] No such file or directory
is raised
test_unstructured_ingest/unit/pipeline/reformat/test_chunking.py
Outdated
Show resolved
Hide resolved
os.makedirs(os.path.join(tmpdir, "chunked"), exist_ok=True) | ||
|
||
filename = chunker.run(ELEMENTS_JSON_FILE) | ||
head, tail = os.path.split(filename if filename else "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances would filename
be None
? Seems like worth a separate assert, like:
assert filename is not None, "message explaining what this means and why it's a failure"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is returned by Chunker.run()
if that is somehow called without defining chunking_strategy
. A message is logged, so I'll add a test for that.
test_unstructured_ingest/unit/pipeline/reformat/test_chunking.py
Outdated
Show resolved
Hide resolved
162a919
to
e0da9c8
Compare
fb599cb
to
9453fd8
Compare
9453fd8
to
766f180
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think this is almost there. A few questions too.
Let me know when it's ready and I'll take a quick look and approve.
assert head.endswith("chunked") | ||
assert tail.endswith(".json") | ||
|
||
def it_logs_error_with_invalid_remote_chunking_args( |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
chunker.run(ELEMENTS_JSON_FILE) | ||
|
||
assert "Input should be 'basic', 'by_page', 'by_similarity'" in caplog.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this the output and where does it come from? Only the remote chunker can tell what chunking strategies it has registered. Seems like 'by_title'
should be in the list too.
And Input should be ...
isn't very helpful as an error message goes. Is there more that was cut off? I'd be looking to see something saying "chunking_strategy" to give a hint where the problem is. No reason this string can't be as long as it needs to be to clearly represent the log entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unstructured_client.models.errors.httpvalidationerror.HTTPValidationError
raises the error and by_title
is in the list. I was trying to keep it to one line while still conveying what the logged message entailed.
The full output:
'ERROR unstructured.ingest:chunking.py:83 failed to run chunking on file /Users/johnjennings/src/unstructured/example-docs/test_evaluate_files/unstructured_output/Bank Good Credit Loan.pptx.json, {"detail":[{"loc":["body","chunking_strategy"],"msg":"Input should be \'basic\', \'by_page\', \'by_similarity\' or \'by_title\'","type":"literal_error"}]}
Traceback (most recent call last):
File "/Users/johnjennings/src/unstructured/unstructured/ingest/pipeline/reformat/chunking.py", line 64, in run
chunked_elements = self.chunk(elements_json)
File "/Users/johnjennings/src/unstructured/unstructured/ingest/pipeline/reformat/chunking.py", line 109, in chunk
return partition_via_api(
File "/Users/johnjennings/src/unstructured/unstructured/partition/api.py", line 103, in partition_via_api\n response = sdk.general.partition(req)
File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/utils/_human_utils.py", line 86, in wrapper
return func(*args, **kwargs)
File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/utils/_human_split_pdf.py", line 40, in wrapper
return func(*args, **kwargs)
File "/Users/johnjennings/.pyenv/versions/unstructured/lib/python3.10/site-packages/unstructured_client/general.py", line 90, in partition
raise out
unstructured_client.models.errors.httpvalidationerror.HTTPValidationError: {"detail":[{"loc":["body","chunking_strategy"],"msg":"Input should be \'basic\', \'by_page\', \'by_similarity\' or \'by_title\'","type":"literal_error"}]}'
# -- These are not currently supported by the sdk and raise an error. | ||
# -- They will need to be updated once we bump the client version. | ||
# combine_text_under_n_chars=self.chunking_config.combine_text_under_n_chars, | ||
# include_orig_elements=self.chunking_config.include_orig_elements, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does being support by unstructured-client
have to do with calling partition_via_api()
via the unstructured
library? unstructured-client
doesn't even need to be installed, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unstructured_via_api()
uses unstructured-client
to call the api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened this as a follow-up issue:
#2924
# -- pinned to <=0.18.0 | ||
# overlap=self.chunking_config.overlap, | ||
# overlap_all=self.chunking_config.overlap_all, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the behavior when there is no chunking strategy registered with the requested name on the remote host?
Like how does the user get a message like: No such chunking strategy: 'foobar'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is when the SDK raises an error. Example test: it_logs_error_on_invalid_remote_chunking_strategy()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
aeb7106
to
d9bbfc2
Compare
Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it. - The first commit adds docstrings and fixes type hints. - The second commit reorganizes `test_unstructured_ingest` so it matches the structure of `unstructured/ingest`. - The third commit contains the primary changes for this PR. - The `.chunk()` method responsible for sending elements to the correct method is moved from `ChunkingConfig` to `Chunker` so that `ChunkingConfig` acts as a config object instead of containing implementation logic. `Chunker.chunk()` also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it.
test_unstructured_ingest
so it matches the structure ofunstructured/ingest
..chunk()
method responsible for sending elements to the correct method is moved fromChunkingConfig
toChunker
so thatChunkingConfig
acts as a config object instead of containing implementation logic.Chunker.chunk()
also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking.