PDF file is not closed correctly #15

Open
Markkuuss opened this issue Sep 1, 2020 · 6 comments

@Markkuuss

When I use the following command, the file is not closed correctly. For example, I cannot delete the file afterwards because the PDF file is still being used by a process.

doc = Document("samples/nonfree/mandarin.pdf")

If I write the code as follows instead, the PDF file will be closed correctly.

with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)
@ashutoshvarma
Owner

Can you describe the steps to reproduce the issue in detail? I was not able to reproduce it.
Are you on the latest pyxpdf version, v0.2.3?
Also, which OS are you using?

@Markkuuss
Author

Are you on the latest pyxpdf version, v0.2.3?

Yes, since yesterday.

Also, which OS are you using?

Windows 10 Pro 10.0.18362 Build 18362

You can use the example from the tutorial:
https://pyxpdf.readthedocs.io/en/latest/tutorial/extract_images.html

The following example throws a PermissionError [WinError 32] when deleting a file.

import os
from pyxpdf import Document
from pyxpdf.xpdf import PDFImageOutput, page_iterator

filename="test.pdf"

doc = Document(filename)
pdfimages_out = PDFImageOutput(doc)

for images in page_iterator(pdfimages_out):
    print(images)

os.remove(filename)

When the file is opened in a "with" block, it is deleted without errors.

import os
from pyxpdf import Document
from pyxpdf.xpdf import PDFImageOutput, page_iterator

filename="test.pdf"

with open(filename, 'rb') as fp:
    doc = Document(fp)

    pdfimages_out = PDFImageOutput(doc)

    for images in page_iterator(pdfimages_out):
        print(images)

os.remove(filename)

I didn't find in the documentation a way to close the file when it is opened with doc = Document("samples/nonfree/mandarin.pdf").

@ashutoshvarma ashutoshvarma added the bug Something isn't working label Sep 2, 2020
@ashutoshvarma
Owner

ashutoshvarma commented Sep 2, 2020

Thanks for reporting. It's a Windows-specific issue.
When a Document is created from a file path, opening and closing the file is handled by libxpdf (the C++ sources), and on Windows the file is opened with mode 'rbN', so the file descriptor is not inherited by child processes.

As you have found, for now, if you need to perform additional operations on the PDF file on Windows, create the Document from a file-like object.
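
The file-like-object workaround can be wrapped in a small helper. This is only a sketch: `process_then_delete` and its `process` callback are hypothetical names, not part of pyxpdf, and the demo uses a throwaway placeholder file so that it runs without pyxpdf installed.

```python
import os
import tempfile


def process_then_delete(path, process):
    """Open `path` ourselves, hand the file object to `process`, then delete the file.

    Hypothetical helper: because we own the file object, the `with` block
    closes the handle before os.remove() runs.
    """
    with open(path, "rb") as fp:
        result = process(fp)
    os.remove(path)
    return result


# Demo with a placeholder file. With pyxpdf, `process` would wrap the
# Document(fp) / PDFImageOutput calls shown earlier in this thread.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"%PDF-1.4 placeholder")

size = process_then_delete(demo_path, lambda fp: len(fp.read()))
print(size, os.path.exists(demo_path))
```

Whatever `process` returns should not keep the file object alive in a way that matters here; the `with` block closes the handle either way.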

I didn't find in the documentation a way to close the file when it is opened with doc = Document("samples/nonfree/mandarin.pdf")

A Document releases its resources when it is garbage collected:

del pdfimages_out 
del doc

@ashutoshvarma ashutoshvarma self-assigned this Sep 2, 2020
@Markkuuss
Author

A Document releases its resources when it is garbage collected

del pdfimages_out 
del doc

That was my thought as well. I tried it as follows, but unfortunately the same exception is thrown.

import os
from pyxpdf import Document
from pyxpdf.xpdf import PDFImageOutput, page_iterator

filename="test.pdf"

doc = Document(filename)
pdfimages_out = PDFImageOutput(doc)

for images in page_iterator(pdfimages_out):
    print(images)
    
del pdfimages_out 
del doc

os.remove(filename)

@ashutoshvarma
Owner

Try this,

import gc
import os
from pyxpdf import Document
from pyxpdf.xpdf import PDFImageOutput, page_iterator

filename = "test.pdf"

doc = Document(filename)
pdfimages_out = PDFImageOutput(doc)

for images in page_iterator(pdfimages_out):
    print(images)

# Drop our references, then force a collection so the reference cycle
# is broken and the file is released before we delete it.
del pdfimages_out
del doc
gc.collect()

os.remove(filename)

My bad. Actually, del only decreases the refcount by 1, but with Document we have a reference cycle.
Objects in a cycle are not deallocated immediately. The garbage collector runs periodically, notices the reference cycle (using the tp_traverse slot), and breaks it.

I think we should call gc.collect() inside the Document deallocator so that we don't have to wait for the gc to clear it. I will create a separate issue for this.
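
To illustrate the refcount-versus-cycle behaviour without pyxpdf, here is a minimal pure-Python sketch. `FakeDocument` is a hypothetical stand-in, not the real Document class, and the behaviour shown assumes CPython's reference-counting and cycle collector.

```python
import gc


class FakeDocument:
    """Hypothetical stand-in for an object caught in a reference cycle."""

    def __init__(self, release_log):
        self._release_log = release_log
        self._cycle = self  # the object references itself, forming a cycle

    def __del__(self):
        # Stands in for "close the underlying file".
        self._release_log.append("released")


release_log = []
gc.disable()  # keep the automatic collector from running mid-demo
try:
    doc = FakeDocument(release_log)
    del doc  # drops our reference, but the cycle keeps the object alive
    released_after_del = list(release_log)
    gc.collect()  # the cycle collector finds and frees the object
    released_after_collect = list(release_log)
finally:
    gc.enable()

print(released_after_del, released_after_collect)
```

The finalizer only runs once the collector has processed the cycle, which is why `del` alone was not enough in the earlier snippet.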

@Markkuuss
Author

I think we should call gc.collect() inside the Document deallocator so that we don't have to wait for the gc to clear it. I will create a separate issue for this.

Yeah, you're right. This works.
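
For anyone who wants to avoid an unconditional gc.collect(), one possible pattern is to collect only when the delete fails. This is a sketch, not something pyxpdf provides: `remove_with_gc_retry` and `flaky_remove` are hypothetical names, and the failure is simulated so the example runs on any OS. It assumes the caller has already dropped its own references (for example with del).

```python
import gc
import os


def remove_with_gc_retry(path, remove=os.remove):
    """Try to delete `path`; on PermissionError, run the collector and retry once.

    `remove` is injectable only so the behaviour can be simulated below.
    """
    try:
        remove(path)
    except PermissionError:
        gc.collect()  # break lingering reference cycles that may hold the file
        remove(path)  # a second failure propagates to the caller


# Simulated check: the first attempt fails as WinError 32 would, the second succeeds.
calls = []


def flaky_remove(path):
    calls.append(path)
    if len(calls) == 1:
        raise PermissionError(32, "file is in use (simulated)")


remove_with_gc_retry("test.pdf", remove=flaky_remove)
print(calls)
```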
