The message "Corrupt JPEG data: premature end of data segment" appears at the end of the run with some PDF files. However, the files produced by OCRmyPDF are perfectly usable.
Corrupt JPEG data: premature end of data segment
Run ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf
bid$.pdf
PyPI (pip, poetry, pipx, etc.)
16.1.1
$ ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf
ocrmypdf 16.1.1 __main__.py:59
Running: ['tesseract', '--version'] __init__.py:133
Found tesseract 5.3.3 __init__.py:342
Running: ['tesseract', '--version'] __init__.py:133
Running: ['pngquant', '--version'] __init__.py:133
Found pngquant 2.18.0 __init__.py:342
Running: ['jbig2', '--version'] __init__.py:133
Found jbig2 0.28 __init__.py:342
Running: ['gs', '--version'] __init__.py:133
Found gs 10.2.1 __init__.py:342
Running: ['gs', '--version'] __init__.py:133
Running: ['tesseract', '--list-langs'] __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (3): __init__.py:73
eng
fra
osd
pikepdf mmap enabled helpers.py:326
os.symlink(bid$.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin) helpers.py:179
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin.pdf) helpers.py:179
Gathering info with 1 thread workers info.py:772
pikepdf mmap enabled helpers.py:326
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3 tesseract_ocr.py:183
pikepdf mmap enabled helpers.py:326
1 skipping all processing on this page _pipeline.py:319
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:140
1 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:165
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/sidecar.txt -> bid.txt _pipeline.py:1051
Postprocessing... ocr.py:146
Running: ['tesseract', '--version'] __init__.py:133
xref 491: skipping image because it is an SMask optimize.py:277
xref 297: treating as an optimization candidate optimize.py:279
xref 490: skipping image because it is an SMask optimize.py:277
xref 296: treating as an optimization candidate optimize.py:279
xref 492: skipping image because it is an SMask optimize.py:277
xref 298: treating as an optimization candidate optimize.py:279
xref 299: treating as an optimization candidate optimize.py:279
XrefExt(xref=298, ext='.jpg') optimize.py:344
Optimizable images: JPEGs: 1 PNGs: 0 optimize.py:349
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 491: skipping image because it is an SMask optimize.py:277
xref 297: treating as an optimization candidate optimize.py:279
xref 490: skipping image because it is an SMask optimize.py:277
xref 296: treating as an optimization candidate optimize.py:279
xref 492: skipping image because it is an SMask optimize.py:277
xref 298: treating as an optimization candidate optimize.py:279
xref 299: treating as an optimization candidate optimize.py:279
xref 298: marking this JPEG as deflatable optimize.py:544
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
PNGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
xref 491: skipping image because it is an SMask optimize.py:277
xref 297: treating as an optimization candidate optimize.py:279
xref 490: skipping image because it is an SMask optimize.py:277
xref 296: treating as an optimization candidate optimize.py:279
xref 492: skipping image because it is an SMask optimize.py:277
xref 298: treating as an optimization candidate optimize.py:279
xref 299: treating as an optimization candidate optimize.py:279
xref 298: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization optimize.py:97
Optimizable images: JBIG2 groups: 0 optimize.py:360
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.opt.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf) helpers.py:179
Running: ['jbig2', '--version'] __init__.py:133
Running: ['pngquant', '--version'] __init__.py:133
Image optimization ratio: 1.24 savings: 19.4% _pipeline.py:976
Total file size ratio: 1.67 savings: 40.1% _pipeline.py:979
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf -> bid.pdf _pipeline.py:1051
Corrupt JPEG data: premature end of data segment
$
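In case it helps anyone check which of their files trigger the message, here is a rough Python sketch that runs a command and looks for the warning text in its combined output. The helper names `has_jpeg_warning` and `run_and_check` are made up for illustration and are not part of OCRmyPDF. The sketch assumes the message reaches stdout or stderr of the process as plain text, as it does in the log above.

```python
import subprocess

# Warning text copied verbatim from the end of the log above.
WARNING = "Corrupt JPEG data: premature end of data segment"


def has_jpeg_warning(output: str) -> bool:
    """Return True if the warning text appears anywhere in the captured output."""
    return WARNING in output


def run_and_check(args: list[str]) -> tuple[int, bool]:
    """Run a command with stdout and stderr merged.

    Returns the exit code and whether the warning was printed.
    """
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so one search covers both
        text=True,
    )
    return proc.returncode, has_jpeg_warning(proc.stdout)
```

For the run reported here, the call would look like `run_and_check(["ocrmypdf", "-l", "fra", "--skip-text", "-O2", "bid$.pdf", "bid.pdf"])`, with the remaining options from the original command added as needed. An exit code of 0 together with `True` would match what is described in this report: the output file is written, but the warning is still printed.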
Describe the bug
Corrupt JPEG data: premature end of data segment
at the end of the run with some PDF files. However, the files produced by OCRmyPDF are perfectly usable.
Steps to reproduce
Files
bid$.pdf
How did you download and install the software?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.1.1
Relevant log output