[Bug]: "Corrupt JPEG data: premature end of data segment" with some files #1269

macdeport · 2024-03-02T10:29:57Z

Describe the bug

With some PDF files, OCRmyPDF prints "Corrupt JPEG data: premature end of data segment" at the end of the run.
However, the files produced by OCRmyPDF are perfectly usable.

Steps to reproduce

Run ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf

Files

bid$.pdf

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.1.1

Relevant log output

$ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf
ocrmypdf 16.1.1                                                                                                       __main__.py:59
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Found tesseract 5.3.3                                                                                                __init__.py:342
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Found pngquant 2.18.0                                                                                                __init__.py:342
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Found jbig2 0.28                                                                                                     __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Found gs 10.2.1                                                                                                      __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Running: ['tesseract', '--list-langs']                                                                               __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (3):                                      __init__.py:73
eng                                                                                                                                 
fra                                                                                                                                 
osd                                                                                                                                 
                                                                                                                                    
pikepdf mmap enabled                                                                                                  helpers.py:326
os.symlink(bid$.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin)                    helpers.py:179
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin,                              helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin.pdf)                                                   
Gathering info with 1 thread workers                                                                                     info.py:772
pikepdf mmap enabled                                                                                                  helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                           tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                  helpers.py:326
    1 skipping all processing on this page                                                                          _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                 _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                             _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/sidecar.txt -> bid.txt                       _pipeline.py:1051
Postprocessing...                                                                                                         ocr.py:146
Running: ['tesseract', '--version']                                                                                  __init__.py:133
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
XrefExt(xref=298, ext='.jpg')                                                                                        optimize.py:344
Optimizable images: JPEGs: 1 PNGs: 0                                                                                 optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: marking this JPEG as deflatable                                                                            optimize.py:544
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
PNGs                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                             optimize.py:97
Optimizable images: JBIG2 groups: 0                                                                                  optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.opt.pdf,                    helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf)                                                 
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Image optimization ratio: 1.24 savings: 19.4%                                                                       _pipeline.py:976
Total file size ratio: 1.67 savings: 40.1%                                                                          _pipeline.py:979
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf -> bid.pdf                      _pipeline.py:1051
Corrupt JPEG data: premature end of data segment
$
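
The wording of the final message matches a warning emitted by libjpeg-based decoders, which would be consistent with the output file still being usable. As a rough way to narrow this down, the sketch below checks whether a JPEG byte string starts with the SOI marker and ends with the EOI marker. This is only a heuristic and an assumption about the cause: a stream missing its EOI marker is often truncated, but a stream that passes this check can still trigger the warning. The function name `check_jpeg_markers` is hypothetical, and extracting the image bytes from the PDF (for example the JPEG reported as xref 298 in the log) is not shown here.

```python
# Heuristic check of JPEG start/end markers (stdlib only).
# Assumption: `data` holds the fully decoded JPEG bytes of one image,
# already extracted from the PDF by some other means.

SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker


def check_jpeg_markers(data: bytes) -> dict:
    """Report whether the bytes begin with SOI and end with EOI."""
    return {
        "starts_with_soi": data.startswith(SOI),
        "ends_with_eoi": data.endswith(EOI),
        # Offset of the last EOI marker, or -1 if none is present.
        "last_eoi_offset": data.rfind(EOI),
    }


if __name__ == "__main__":
    # Synthetic byte strings, not real images: one complete, one cut short.
    complete = SOI + b"\x00" * 4 + EOI
    truncated = complete[:-2]
    print(check_jpeg_markers(complete))
    print(check_jpeg_markers(truncated))
```

If `ends_with_eoi` is false for the image in the input file, the source JPEG itself is probably incomplete and the warning may simply be passed through from the decoder rather than being introduced by the optimization step.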


macdeport added the bug label Mar 2, 2024

macdeport assigned jbarlow83 Mar 2, 2024
