bug/executing partition_doc using concurrent futures #2891

Closed
salahaz opened this issue Apr 15, 2024 · 7 comments
Labels
investigating Issues that require more information before they are actionable

Comments


salahaz commented Apr 15, 2024

When I run partition_doc on multiple documents concurrently to pre-process them, it fails with the following error:

PackageNotFoundError: Package not found at '/var/folders/p5/dljg1qv95y97dyq1c38xgb6r0000gn/T/tmp3nwg1qob/test.docx'

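Here is sample code that reproduces the issue:

import concurrent.futures
import os

from unstructured.partition.doc import partition_doc

# doc_paths is a list of paths to the .doc files (defined elsewhere)
results = []
max_workers = os.cpu_count()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Submit one partition_doc call per document
    futures = [executor.submit(partition_doc, doc_path) for doc_path in doc_paths]
    for future in concurrent.futures.as_completed(futures):
        result = future.result()  # re-raises any exception from the worker thread
        results.append(result)

results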

Environment Details

python = 3.11.1
unstructured = 0.13.2
MacOS 14.3

Any help with this? Or is there a recommended way to process documents in parallel with the library?
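
For anyone trying to reproduce this, the thread-pool pattern above can be sketched so that each failure stays tied to the path that caused it, rather than the first exception aborting the loop. This is only a sketch: `partition_all` and `fake_partition` are hypothetical names that are not part of unstructured, and `fake_partition` stands in for partition_doc so the snippet runs without the library installed.

```python
import concurrent.futures
import os


def partition_all(paths, partition_fn, max_workers=None):
    """Run partition_fn over paths in a thread pool.

    Returns (results, errors), two dicts keyed by path.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future belongs to.
        future_to_path = {executor.submit(partition_fn, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep going so every failing path is reported.
                errors[path] = exc
    return results, errors


def fake_partition(path):
    # Hypothetical stand-in for partition_doc (assumption, not the real API).
    if path.endswith("bad.doc"):
        raise FileNotFoundError(path)
    return [f"element from {os.path.basename(path)}"]


if __name__ == "__main__":
    results, errors = partition_all(["a.doc", "bad.doc", "c.doc"], fake_partition)
    print(sorted(results))
    print({p: type(e).__name__ for p, e in errors.items()})
```

Passing the real partition_doc as `partition_fn` should show whether every document fails under threads or only some of them.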

@salahaz salahaz added the bug Something isn't working label Apr 15, 2024
Collaborator

scanny commented Apr 15, 2024

@salahaz please provide the entire stack trace.

Collaborator

scanny commented Apr 15, 2024

Also, see if that particular file works when you are not using threads and just use partition_doc("path/to/test.doc") and let us know how that goes.
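
The single-threaded check suggested here could be scripted roughly as follows, so each file is tried on its own and the outcome recorded. This is a sketch under assumptions: `check_sequentially` and `stand_in_partition` are hypothetical helpers, and the stand-in replaces partition_doc only so the snippet is runnable by itself.

```python
def check_sequentially(paths, partition_fn):
    """Call partition_fn on each path in turn and report the outcome per path."""
    report = {}
    for path in paths:
        try:
            elements = partition_fn(path)
        except Exception as exc:
            report[path] = f"failed: {type(exc).__name__}"
        else:
            report[path] = f"ok: {len(elements)} elements"
    return report


def stand_in_partition(path):
    # Hypothetical stand-in for partition_doc (assumption, not the real API).
    if not path.endswith(".doc"):
        raise ValueError(f"not a .doc file: {path}")
    return ["Title", "NarrativeText"]


if __name__ == "__main__":
    for path, outcome in check_sequentially(["a.doc", "b.txt"], stand_in_partition).items():
        print(path, "->", outcome)
```

If every file reports "ok" here but fails in the thread pool, that points at the concurrency rather than at a particular document.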

Author

salahaz commented Apr 15, 2024

@scanny The files work without threading, and partitioning them sequentially in a for loop works too. However, when using concurrent.futures, this error is raised from partition_doc. Here is the entire stack trace:

PackageNotFoundError                      Traceback (most recent call last)
Cell In[85], line 9
      7         futures = [executor.submit(partition_doc, doc_path) for doc_path in doc_paths]
      8         for future in concurrent.futures.as_completed(futures):
----> 9             result = future.result()
     10             results.append(result)
     12 results

File ~/miniconda3/envs/testing/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
    447     raise CancelledError()
    448 elif self._state == FINISHED:
--> 449     return self.__get_result()
    451 self._condition.wait(timeout)
    453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File ~/miniconda3/envs/testing/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

File ~/miniconda3/envs/testing/lib/python3.11/concurrent/futures/thread.py:58, in _WorkItem.run(self)
     55     return
     57 try:
---> 58     result = self.fn(*self.args, **self.kwargs)
     59 except BaseException as exc:
     60     self.future.set_exception(exc)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/documents/elements.py:539, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    537 @functools.wraps(func)
    538 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> list[Element]:
--> 539     elements = func(*args, **kwargs)
    540     sig = inspect.signature(func)
    541     params: dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:622, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    620 @functools.wraps(func)
    621 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 622     elements = func(*args, **kwargs)
    623     sig = inspect.signature(func)
    624     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:582, in add_metadata.<locals>.wrapper(*args, **kwargs)
    580 @functools.wraps(func)
    581 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 582     elements = func(*args, **kwargs)
    583     sig = inspect.signature(func)
    584     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/chunking/dispatch.py:83, in add_chunking_strategy.<locals>.wrapper(*args, **kwargs)
     80     return call_args
     82 # -- call the partitioning function to get the elements --
---> 83 elements = func(*args, **kwargs)
     85 # -- look for a chunking-strategy argument --
     86 call_args = get_call_args_applying_defaults()

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/doc.py:92, in partition_doc(filename, file, include_page_breaks, include_metadata, metadata_filename, metadata_last_modified, libre_office_filter, chunking_strategy, languages, detect_language_per_element, date_from_file_object, **kwargs)
     85 convert_office_doc(
     86     filename,
     87     tmpdir,
     88     target_format="docx",
     89     target_filter=libre_office_filter,
     90 )
     91 docx_filename = os.path.join(tmpdir, f"{base_filename}.docx")
---> 92 elements = partition_docx(
     93     filename=docx_filename,
     94     metadata_filename=metadata_filename,
     95     include_page_breaks=include_page_breaks,
     96     include_metadata=include_metadata,
     97     metadata_last_modified=metadata_last_modified or last_modification_date,
     98     languages=languages,
     99     detect_language_per_element=detect_language_per_element,
    100 )
    101 # remove tmp.name from filename if parsing file
    102 if file:

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/documents/elements.py:539, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    537 @functools.wraps(func)
    538 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> list[Element]:
--> 539     elements = func(*args, **kwargs)
    540     sig = inspect.signature(func)
    541     params: dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:622, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    620 @functools.wraps(func)
    621 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 622     elements = func(*args, **kwargs)
    623     sig = inspect.signature(func)
    624     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:582, in add_metadata.<locals>.wrapper(*args, **kwargs)
    580 @functools.wraps(func)
    581 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 582     elements = func(*args, **kwargs)
    583     sig = inspect.signature(func)
    584     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/chunking/dispatch.py:83, in add_chunking_strategy.<locals>.wrapper(*args, **kwargs)
     80     return call_args
     82 # -- call the partitioning function to get the elements --
---> 83 elements = func(*args, **kwargs)
     85 # -- look for a chunking-strategy argument --
     86 call_args = get_call_args_applying_defaults()

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:219, in partition_docx(filename, file, metadata_filename, include_page_breaks, include_metadata, infer_table_structure, metadata_last_modified, chunking_strategy, languages, detect_language_per_element, date_from_file_object, **kwargs)
    216 # -- verify that only one file-specifier argument was provided --
    217 exactly_one(filename=filename, file=file)
--> 219 elements = _DocxPartitioner.iter_document_elements(
    220     filename,
    221     file,
    222     metadata_filename,
    223     include_page_breaks,
    224     infer_table_structure,
    225     metadata_last_modified,
    226     date_from_file_object,
    227 )
    228 elements = apply_lang_metadata(
    229     elements=elements,
    230     languages=languages,
    231     detect_language_per_element=detect_language_per_element,
    232 )
    233 return list(elements)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:289, in _DocxPartitioner.iter_document_elements(cls, filename, file, metadata_filename, include_page_breaks, infer_table_structure, metadata_last_modified, date_from_file_object)
    274 self = cls(
    275     filename=filename,
    276     file=file,
   (...)
    281     date_from_file_object=date_from_file_object,
    282 )
    283 # NOTE(scanny): It's possible for a Word document to have no sections. In particular, a
    284 # Microsoft Teams chat transcript exported to DOCX contains no sections. Such a
    285 # "section-less" document has to be interated differently and has no headers or footers and
    286 # therefore no page-size or margins.
    287 return (
    288     self._iter_document_elements()
--> 289     if self._document_contains_sections
    290     else self._iter_sectionless_document_elements()
    291 )

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:161, in lazyproperty.__get__(self, obj, type)
    156 value = obj.__dict__.get(self._name)
    157 if value is None:
    158     # --- on first access, the __dict__ item will be absent. Evaluate fget()
    159     # --- and store that value in the (otherwise unused) host-object
    160     # --- __dict__ value of same name ('fget' nominally)
--> 161     value = self._fget(obj)
    162     obj.__dict__[self._name] = value
    163 return cast(_T, value)

File ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:468, in _DocxPartitioner._document_contains_sections(self)
    460 @lazyproperty
    461 def _document_contains_sections(self) -> bool:
    [462](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:462)     """True when there is at least one section in the document.
    [463](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:463) 
    [464](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:464)     This is always true for a document produced by Word, but may not always be the case when the
    [465](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:465)     document results from conversion or export. In particular, a Microsoft Teams chat-transcript
    [466](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:466)     export will have no sections.
    [467](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:467)     """
--> [468](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:468)     return bool(self._document.sections)

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:161](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:161), in lazyproperty.__get__(self, obj, type)
    [156](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:156) value = obj.__dict__.get(self._name)
    [157](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:157) if value is None:
    [158](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:158)     # --- on first access, the __dict__ item will be absent. Evaluate fget()
    [159](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:159)     # --- and store that value in the (otherwise unused) host-object
    [160](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:160)     # --- __dict__ value of same name ('fget' nominally)
--> [161](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:161)     value = self._fget(obj)
    [162](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:162)     obj.__dict__[self._name] = value
    [163](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/utils.py:163) return cast(_T, value)

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:431](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:431), in _DocxPartitioner._document(self)
    [428](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:428) filename, file = self._filename, self._file
    [430](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:430) if filename is not None:
--> [431](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:431)     return docx.Document(filename)
    [433](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:433) assert file is not None
    [434](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/docx.py:434) if isinstance(file, SpooledTemporaryFile):

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:23](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:23), in Document(docx)
     [16](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:16) """Return a |Document| object loaded from `docx`, where `docx` can be either a path
     [17](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:17) to a ``.docx`` file (a string) or a file-like object.
     [18](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:18) 
     [19](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:19) If `docx` is missing or ``None``, the built-in default document "template" is
     [20](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:20) loaded.
     [21](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:21) """
     [22](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:22) docx = _default_docx_path() if docx is None else docx
---> [23](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:23) document_part = Package.open(docx).main_document_part
     [24](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:24) if document_part.content_type != CT.WML_DOCUMENT_MAIN:
     [25](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/api.py:25)     tmpl = "file '%s' is not a Word file, content type is '%s'"

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:116](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:116), in OpcPackage.open(cls, pkg_file)
    [113](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:113) @classmethod
    [114](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:114) def open(cls, pkg_file):
    [115](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:115)     """Return an |OpcPackage| instance loaded with the contents of `pkg_file`."""
--> [116](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:116)     pkg_reader = PackageReader.from_file(pkg_file)
    [117](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:117)     package = cls()
    [118](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/package.py:118)     Unmarshaller.unmarshal(pkg_reader, package, PartFactory)

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:22](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:22), in PackageReader.from_file(pkg_file)
     [19](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:19) @staticmethod
     [20](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:20) def from_file(pkg_file):
     [21](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:21)     """Return a |PackageReader| instance loaded with contents of `pkg_file`."""
---> [22](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:22)     phys_reader = PhysPkgReader(pkg_file)
     [23](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:23)     content_types = _ContentTypeMap.from_xml(phys_reader.content_types_xml)
     [24](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/pkgreader.py:24)     pkg_srels = PackageReader._srels_for(phys_reader, PACKAGE_URI)

File [~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:21](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:21), in PhysPkgReader.__new__(cls, pkg_file)
     [19](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:19)         reader_cls = _ZipPkgReader
     [20](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:20)     else:
---> [21](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:21)         raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
     [22](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:22) else:  # assume it's a stream and pass it to Zip reader to sort out
     [23](https://file+.vscode-resource.vscode-cdn.net/Users/user1/Documents/testing/testing-ai/~/miniconda3/envs/testing/lib/python3.11/site-packages/docx/opc/phys_pkg.py:23)     reader_cls = _ZipPkgReader

PackageNotFoundError: Package not found at '/var/folders/p5/dljg1qv95y97dyq1c38xgb6r0000gn/T/tmpssrx6v0w/test.docx'

@scanny
Collaborator

scanny commented Apr 15, 2024

Hmm, okay, so a little background to understand what might be happening here:

  • The files you're partitioning are .doc files.
  • The partition_doc() function uses LibreOffice to convert the .doc file to a .docx file and then partitions that resulting file with partition_docx().
  • The "Package not found ..." error is happening inside partition_docx() and means either there is no file at the specified path or that the file at that path is not a zip archive (and so cannot be a valid .docx file).
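The two causes named in the last bullet can be told apart with a small standard-library check. This is a minimal sketch, not python-docx's own code: the helper name `diagnose_docx_path` and its return labels are hypothetical, and it only assumes that a valid .docx file must be a zip archive, as stated above.

```python
import os
import zipfile


def diagnose_docx_path(path: str) -> str:
    """Classify what sits at `path` from the point of view of a .docx reader."""
    if not os.path.exists(path):
        # Nothing at that path, e.g. the temporary directory is already gone.
        return "missing"
    if os.path.isdir(path):
        return "directory"
    if not zipfile.is_zipfile(path):
        # A file exists but is not a zip archive, so it cannot be a valid .docx.
        return "not-a-zip"
    return "zip-archive"
```

Calling this on the path from the error message right after the failure would indicate whether the interim file was never there (or was removed) or was written but is not a usable archive.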

My working hypothesis is that the temporary directory used to hold the "interim" .docx file is being deleted by another thread before partition_docx() can open the .docx file, but that rests on scarce evidence. It's more a plausible explanation than a diagnosis.

One thing that might be worth trying would be to reduce the number of workers to something like 8 and see what happens. You could use:

max_workers = min(8, os.cpu_count() or 1)  # os.cpu_count() can return None

to avoid coupling max_workers to the CPUs available on your particular machine.

Otherwise I think we'll have to put this on the list to be investigated and see if we can reproduce it on our side.

If you're game for patching some of the library code in your ~/miniconda3/envs/testing/lib/python3.11/site-packages/unstructured/partition/doc.py, we can probably diagnose a bit further. For example, you could add the keyword argument delete=False to the TemporaryDirectory() call here to see if that gets past the problem (note that the delete parameter was only added in Python 3.12, so on Python 3.11 something like tempfile.mkdtemp() would be needed instead): https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/doc.py#L84

But doing that is unlikely to produce a working solution; it would just help us narrow down the problem.
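To make the temporary-directory behaviour concrete, here is a minimal sketch that uses only the standard library. The function `convert_in_temp_dir` and the placeholder file it writes are hypothetical stand-ins for the conversion step, not the library's actual code; it only contrasts a directory that is removed when its `with` block exits with one that is left on disk for inspection.

```python
import os
import tempfile


def convert_in_temp_dir(keep: bool) -> str:
    """Write a stand-in 'converted' file and return its path.

    With keep=False the directory is removed when the `with` block exits,
    so the returned path no longer exists. With keep=True the directory
    created by mkdtemp() stays on disk until the caller removes it.
    """
    if keep:
        target_dir = tempfile.mkdtemp()  # caller is responsible for cleanup
        path = os.path.join(target_dir, "test.docx")
        with open(path, "wb") as f:
            f.write(b"placeholder")
        return path
    with tempfile.TemporaryDirectory() as target_dir:
        path = os.path.join(target_dir, "test.docx")
        with open(path, "wb") as f:
            f.write(b"placeholder")
    # The directory and the file inside it have been deleted at this point.
    return path
```

If the error disappears once the directory is kept, that would support the early-deletion hypothesis; if it persists, the interim file was probably never written.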

@salahaz
Author

salahaz commented Apr 15, 2024

@scanny I tried your initial suggestion of capping the workers with min(), but it doesn't solve the issue. As you said, it seems to be related to how partition_doc creates temporary files and directories when it is combined with concurrent futures.

@scanny
Collaborator

scanny commented Apr 16, 2024

Hi @salahaz, we'll track this issue and see what we can discover.

In the meantime, I don't believe a multi-threading approach is viable for multiple .doc files.

A few things you can try:

Let us know how you go :)
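One conservative fallback while threads are off the table, offered only as a sketch and not taken from this thread, is to partition the files one at a time and record failures per file. The helper name `partition_sequentially` is hypothetical, and `partition_fn` stands in for whichever partitioner is being called (for example partition_doc).

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def partition_sequentially(
    paths: Iterable[str],
    partition_fn: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run `partition_fn` on each path in turn, without threads.

    Returns (results, failures), both keyed by path, so that one bad file
    does not abort the whole batch.
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for path in paths:
        try:
            results[path] = partition_fn(path)
        except Exception as exc:  # keep going and report the failure per file
            failures[path] = exc
    return results, failures
```

Usage would look like `results, failures = partition_sequentially(doc_paths, partition_doc)`; the `failures` mapping also shows whether a specific file fails without any concurrency involved.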

@scanny scanny added investigating Issues that require more information before they are actionable and removed bug Something isn't working labels Apr 23, 2024
@scanny
Collaborator

scanny commented May 6, 2024

Closing as inactive.

@salahaz feel free to reopen if you're still having trouble. And if you discovered a solution let us know so others can learn from your experience :)

@scanny scanny closed this as completed May 6, 2024