[Bug]: "Corrupt JPEG data: premature end of data segment" with some files #1269

Open
macdeport opened this issue Mar 2, 2024 · 0 comments

Describe the bug

The message "Corrupt JPEG data: premature end of data segment" is printed at the end of the run with some PDF files.
However, the files produced by OCRmyPDF are perfectly usable.

Steps to reproduce

Run `ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf`
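
To check several files for the same symptom, the command above could be wrapped in a small script that captures the combined output and looks for the warning text. This is only a rough sketch: the helper names `find_jpeg_warnings` and `run_and_check` are made up for illustration, and it assumes the warning reaches the process's stdout or stderr (which is what the log in this report suggests, but is not confirmed).

```python
import subprocess

# Text of the warning as it appears in the log of this report.
WARNING = "Corrupt JPEG data"


def find_jpeg_warnings(output: str) -> list[str]:
    """Return the lines of captured output that contain the warning text."""
    return [line.strip() for line in output.splitlines() if WARNING in line]


def run_and_check(cmd: list[str]) -> tuple[int, list[str]]:
    """Run cmd with stderr merged into stdout; return (exit code, warning lines)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: the warning may go to either stream
        text=True,
        check=False,  # a non-zero exit code is reported, not raised
    )
    return proc.returncode, find_jpeg_warnings(proc.stdout)


# Arguments copied from the command in this report.
REPORTED_CMD = [
    "ocrmypdf", "-l", "fra",
    "--user-words", "dictionary-fr.txt-alias",
    "--pdf-renderer", "hocr",
    "--output-type", "pdf",
    "-O2", "--jbig2-lossy", "--skip-text",
    "--sidecar", "bid.txt",
    "-v", "1",
    "bid$.pdf", "bid.pdf",
]
```

Calling `run_and_check(REPORTED_CMD)` in the directory containing `bid$.pdf` would return the exit code together with any matching lines. An exit code of 0 alongside a non-empty list would match the behaviour described here: usable output, but a warning at the end of the run.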

Files

bid$.pdf

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.1.1

Relevant log output

$ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf
ocrmypdf 16.1.1                                                                                                       __main__.py:59
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Found tesseract 5.3.3                                                                                                __init__.py:342
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Found pngquant 2.18.0                                                                                                __init__.py:342
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Found jbig2 0.28                                                                                                     __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Found gs 10.2.1                                                                                                      __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Running: ['tesseract', '--list-langs']                                                                               __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (3):                                      __init__.py:73
eng                                                                                                                                 
fra                                                                                                                                 
osd                                                                                                                                 
                                                                                                                                    
pikepdf mmap enabled                                                                                                  helpers.py:326
os.symlink(bid$.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin)                    helpers.py:179
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin,                              helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin.pdf)                                                   
Gathering info with 1 thread workers                                                                                     info.py:772
pikepdf mmap enabled                                                                                                  helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                           tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                  helpers.py:326
    1 skipping all processing on this page                                                                          _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                 _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                             _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/sidecar.txt -> bid.txt                       _pipeline.py:1051
Postprocessing...                                                                                                         ocr.py:146
Running: ['tesseract', '--version']                                                                                  __init__.py:133
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
XrefExt(xref=298, ext='.jpg')                                                                                        optimize.py:344
Optimizable images: JPEGs: 1 PNGs: 0                                                                                 optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: marking this JPEG as deflatable                                                                            optimize.py:544
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
PNGs                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                             optimize.py:97
Optimizable images: JBIG2 groups: 0                                                                                  optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.opt.pdf,                    helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf)                                                 
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Image optimization ratio: 1.24 savings: 19.4%                                                                       _pipeline.py:976
Total file size ratio: 1.67 savings: 40.1%                                                                          _pipeline.py:979
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf -> bid.pdf                      _pipeline.py:1051
Corrupt JPEG data: premature end of data segment
$