bug/partition_pdf import statement not completing execution #2847

Closed
viboognesh opened this issue Apr 4, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@viboognesh

Describe the bug
The import statement never finishes executing.
I tried to import

from unstructured.partition.pdf import partition_pdf

and the import statement is taking forever to execute.

I am using conda with Python 3.10.

To Reproduce
I am running this code in a conda environment. I created an unstructured environment using the environment.yml file in the GitHub repository. I installed Poppler from conda-forge and tesseract-ocr by downloading the .exe file. The first line of code is

from unstructured.partition.pdf import partition_pdf

and it never completes. There are no errors, but the statement does not finish executing.

Expected behaviour
I just expect the execution to be complete, but it is not completing even after 2 hours of running the code.

Screenshots
image

Environment Info
backoff==2.2.1
beautifulsoup4==4.12.3
Brotli @ file:///C:/Windows/Temp/abs_63l7912z0e/croots/recipe/brotli-split_1659616056886/work
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cryptography==42.0.5
dataclasses-json==0.6.4
dataclasses-json-speakeasy==0.5.11
emoji==2.10.1
filelock @ file:///C:/b/abs_f2gie28u58/croot/filelock_1700591233643/work
filetype==1.2.0
fsspec==2024.3.1
gmpy2 @ file:///C:/ci/gmpy2_1645438895476/work
huggingface-hub==0.22.2
idna==3.6
iopath==0.1.10
Jinja2 @ file:///C:/b/abs_f7x5a8op2h/croot/jinja2_1706733672594/work
joblib==1.3.2
jsonpath-python==1.0.6
langdetect==1.0.9
layoutparser==0.3.4
lxml==5.1.0
MarkupSafe @ file:///C:/b/abs_ecfdqh67b_/croot/markupsafe_1704206030535/work
marshmallow==3.20.2
mkl-fft @ file:///C:/b/abs_19i1y8ykas/croot/mkl_fft_1695058226480/work
mkl-random @ file:///C:/b/abs_edwkj1_o69/croot/mkl_random_1695059866750/work
mkl-service==2.4.0
mpmath @ file:///C:/b/abs_7833jrbiox/croot/mpmath_1690848321154/work
mypy-extensions==1.0.0
networkx @ file:///C:/b/abs_e6gi1go5op/croot/networkx_1690562046966/work
nltk==3.8.1
numpy @ file:///C:/b/abs_c1ywpu18ar/croot/numpy_and_numpy_base_1708638681471/work/dist/numpy-1.26.4-cp310-cp310-win_amd64.whl#sha256=ebb5aa2b36d8afa5ec3231c19e5a1fc75b6d85e7db483f0fb9e77dad58469977
opencv-python==4.9.0.80
packaging==23.2
pandas==2.2.1
pdf2image==1.17.0
pdfminer.six==20231228
pdfplumber==0.11.0
pillow @ file:///C:/b/abs_e22m71t0cb/croot/pillow_1707233126420/work
pillow_heif==0.16.0
portalocker==2.8.2
pycparser==2.22
pypdfium2==4.28.0
PySocks @ file:///C:/ci_310/pysocks_1642089375450/work
python-dateutil==2.8.2
python-iso639==2024.2.7
python-magic==0.4.27
pytz==2024.1
pywin32==305.1
PyYAML @ file:///C:/b/abs_782o3mbw7z/croot/pyyaml_1698096085010/work
rapidfuzz==3.6.1
regex==2023.12.25
requests @ file:///C:/b/abs_474vaa3x9e/croot/requests_1707355619957/work
scipy==1.13.0
six==1.16.0
soupsieve==2.5
sympy @ file:///C:/b/abs_82njkonm7f/croot/sympy_1701397685028/work
tabulate==0.9.0
torch==2.1.2
torchvision @ file:///C:/b/abs_61prww4bv9/croot/torchvision_1689079992237/work
tqdm==4.66.2
typing-inspect==0.9.0
typing_extensions @ file:///C:/b/abs_72cdotwc_6/croot/typing_extensions_1705599364138/work
tzdata==2024.1
unstructured==0.13.0
unstructured-client==0.18.0
urllib3==1.26.18
win-inet-pton @ file:///C:/ci_310/win_inet_pton_1642658466512/work
wrapt==1.16.0


@viboognesh viboognesh added the bug Something isn't working label Apr 4, 2024
@scanny
Collaborator

scanny commented Apr 4, 2024

@viboognesh does this also happen if you say from unstructured.partition.text import partition_text?

Trying to narrow down whether it is something specific to PDF.
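
One way to narrow this down without tying up the main session is to run each import in its own interpreter with a time limit. The following is only a sketch: `probe` and `STATEMENTS` are hypothetical names, the 30-second limit is an arbitrary choice, and `import json` is included purely as a known-good baseline.

```python
import subprocess
import sys

# Statements to probe. `import json` is a baseline that should always pass;
# the unstructured imports are the ones mentioned in this thread.
STATEMENTS = [
    "import json",
    "import unstructured",
    "from unstructured.partition.text import partition_text",
    "from unstructured.partition.pdf import partition_pdf",
]


def probe(statement: str, timeout: float = 30.0) -> str:
    """Run `statement` in a fresh interpreter; return 'ok', 'error', or 'timeout'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", statement],
            capture_output=True,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "error"


if __name__ == "__main__":
    for statement in STATEMENTS:
        print(f"{probe(statement):8} {statement}")
```

A `timeout` next to one statement and `ok` next to the ones above it would suggest which import is the first to hang. An `error` means the import failed outright rather than hanging.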

@viboognesh
Author

@scanny
This happens with most of the partition packages that I tried to import
image
image
image
image

I have no problem when I try to import partition_via_api
image

I have no problem with import unstructured
image

These import statements don't complete even if I leave the computer running for an hour.

@scanny
Collaborator

scanny commented Apr 5, 2024

Can you try interrupting the wait with ^C and check the stack trace that you get? Maybe do that a few times and see if the results are consistent then post it here. That should give some idea where things are hanging.

You shouldn't have to wait more than, say, 30 seconds before interrupting with Ctrl-C. Wherever it is hanging, it will likely get there within that time.
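
If Ctrl-C does not produce a usable stack trace, the standard-library `faulthandler` module may help as an alternative: it can print the stack of every thread after a timeout without any keyboard interrupt. This is a minimal sketch rather than a verified fix for this issue; `import_with_watchdog` is a hypothetical helper name and the 30-second interval simply mirrors the suggestion above.

```python
import faulthandler
import importlib


def import_with_watchdog(module_name: str, timeout: float = 30.0):
    """Import `module_name`; while it is still running, dump all thread
    stacks to stderr every `timeout` seconds."""
    faulthandler.dump_traceback_later(timeout, repeat=True)
    try:
        return importlib.import_module(module_name)
    finally:
        # Stop the periodic dumps once the import finishes or fails.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    try:
        import_with_watchdog("unstructured.partition.pdf")
        print("import completed")
    except ImportError as exc:
        print(f"import failed: {exc}")
```

If the same frames show up in several consecutive dumps, that is a reasonable hint about where the import is stuck, and that output is what would be useful to post here.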

@scanny
Collaborator

scanny commented May 6, 2024

Closing as incomplete.

@viboognesh if you have more information on this, please feel free to reopen :)

@scanny scanny closed this as completed May 6, 2024