[Bug]: In the pipelines semantic search system, uploading a scanned PDF after startup fails to parse #8418

Open
1 task done
morego123 opened this issue May 11, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@morego123

Software environment

paddle-pipelines               0.6.2
paddle2onnx                    1.2.1
paddlefsl                      1.1.0
paddlenlp                      2.8.0
paddleocr                      2.7.3
paddlepaddle-gpu               2.6.0.post117

Duplicate issues

  • I have searched the existing issues

Error description

INFO:     127.0.0.1:43132 - "POST /file-upload HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/pipelines/base.py", line 446, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 120, in _dispatch_run
    return self._dispatch_run_general(self.run, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 164, in _dispatch_run_general
    output, stream = run_method(**run_inputs, **run_params)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 144, in run
    output, stream = run_indexing(documents=documents, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 110, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 229, in run_indexing
    embeddings = self.embed_documents(document_objects, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 367, in embed_documents
    embeddings = self._get_predictions(passages, **kwargs)["passages"]
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 292, in _get_predictions
    if "passages" in dicts[0]:
IndexError: list index out of range
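
The last frame fails on `dicts[0]`, which suggests the list reaching `_get_predictions` was empty. A plausible cause (an assumption, not confirmed in this thread) is that no text was extracted from the scanned PDF, so there were no passages to embed. The sketch below only reproduces that failure mode in isolation and shows a guarded form of the same check; `first_has_passages` and `reproduce_failure` are hypothetical names and are not part of paddle-pipelines.

```python
from typing import Dict, List


def first_has_passages(dicts: List[Dict]) -> bool:
    """Guarded form of the check `"passages" in dicts[0]` (hypothetical helper)."""
    if not dicts:
        # Nothing was extracted, so there is no first element to inspect.
        return False
    return "passages" in dicts[0]


def reproduce_failure() -> str:
    """Trigger the same IndexError on an empty list and return its message."""
    dicts: List[Dict] = []  # assumed state when no text was extracted
    try:
        _ = "passages" in dicts[0]
    except IndexError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(reproduce_failure())
    print(first_has_passages([]))
```

A guard like this would turn the 500 error into an explicit "no text extracted" condition that the caller can report; it does not make scanned PDFs parseable.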

Steps to reproduce & code

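In the web UI, uploading a scanned PDF through the file upload module on the left fails to parse. Uploading a non-scanned PDF works normally.
Is it that this repo cannot parse scanned PDFs in the first place, or am I missing a component?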

@morego123 added the bug (Something isn't working) label on May 11, 2024
@w5688414
Contributor

Hello, scanned PDFs are not currently supported. Contributions from developers are welcome.
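
Until scanned PDFs are supported, one possible workaround is to check whether a file has any extractable text before uploading it. The following is a minimal sketch under stated assumptions: the per-page text is assumed to come from whatever PDF text extractor is already in use, and `has_text_layer` and its `min_chars` threshold are hypothetical, not part of paddle-pipelines.

```python
from typing import Iterable


def has_text_layer(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True if the extracted page texts contain enough real characters.

    page_texts: text extracted from each page by some PDF extractor (assumed input).
    min_chars:  hypothetical threshold; scanned pages usually yield little or no text.
    """
    # Count only non-whitespace characters across all pages.
    total = sum(1 for text in page_texts for ch in text if not ch.isspace())
    return total >= min_chars


if __name__ == "__main__":
    scanned_like = ["", "  \n"]
    digital_like = ["This page has a real text layer with words."]
    print(has_text_layer(scanned_like))
    print(has_text_layer(digital_like))
```

Files that fail this check could be routed to an OCR step or rejected with a clear message instead of being sent to indexing.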

3 participants