Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Don't load entire file into memory when passed file name #2520

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

mjsir911
Copy link

@mjsir911 mjsir911 commented Mar 15, 2024

This functionality originally added back in ced2890

Reduces memory usage by size of loaded file.

Benchmark script
from pypdf import *

filename = '/home/msirabella/tmp/100MB-TESTFILE.ORG.pdf'

writer = PdfWriter(clone_from=filename)

writer.write("out.pdf")
Before stats
πŸ“ Total allocations:
	109695

πŸ“¦ Total memory allocated:
	409.726MB

πŸ“Š Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 6.000B   : 40707 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 40.000B  :   229 β–‡
	< 256.000B :    33 β–‡
	< 1.590KB  : 67394 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 10.104KB :  1060 β–‡
	< 64.190KB :   141 β–‡
	< 407.789KB:    47 β–‡
	< 2.530MB  :    82 β–‡
	< 16.072MB :     0 
	<=102.099MB:     2 β–‡
	--------------------------------------------
	max: 102.099MB

πŸ“‚ Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1223
	 REALLOC: 865
	 MMAP: 20

πŸ₯‡ Top 5 largest allocating locations (by size):
	- __init__:./pypdf/_reader.py:315 -> 204.210MB
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 48.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 26.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.360MB

πŸ₯‡ Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81058
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23017
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2101
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 988
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
After stats
πŸ“ Total allocations:
	109687

πŸ“¦ Total memory allocated:
	205.521MB

πŸ“Š Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 4.000B   : 40707 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 18.000B  :     4 β–‡
	< 80.000B  :   227 β–‡
	< 348.000B :    39 β–‡
	< 1.468KB  : 67239 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 6.341KB  :   737 β–‡
	< 27.388KB :   563 β–‡
	< 118.297KB:    68 β–‡
	< 510.959KB:    21 β–‡
	<=2.155MB  :    82 β–‡
	--------------------------------------------
	max: 2.155MB

πŸ“‚ Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1218
	 REALLOC: 862
	 MMAP: 20

πŸ₯‡ Top 5 largest allocating locations (by size):
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 46.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 24.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.356MB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 4.844MB

πŸ₯‡ Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81056
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23015
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2095
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 989
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365

Copy link

codecov bot commented Mar 15, 2024

Codecov Report

All modified and coverable lines are covered by tests βœ…

Project coverage is 94.97%. Comparing base (c227b0c) to head (c3fe2e7).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2520   +/-   ##
=======================================
  Coverage   94.97%   94.97%           
=======================================
  Files          50       50           
  Lines        8331     8340    +9     
  Branches     1669     1669           
=======================================
+ Hits         7912     7921    +9     
  Misses        260      260           
  Partials      159      159           

β˜” View full report in Codecov by Sentry.
πŸ“’ Have feedback on the report? Share it here.

@mjsir911 mjsir911 changed the title Don't load entire file into memory when PdfReader passed file name PI: Don't load entire file into memory when passed file name Mar 15, 2024
@mjsir911 mjsir911 changed the title PI: Don't load entire file into memory when passed file name PI: Don't load entire file into memory when passed file name Mar 15, 2024
@stefan6419846
Copy link
Collaborator

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

@stefan6419846 stefan6419846 added the PdfReader The PdfReader component is affected label Mar 15, 2024
@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

Sorry, I got the before & after mixed up. fixed

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

Yeah, I can do. I'll have a bit more difficulty fixing the windows tests since I don't have a windows box to test on easily but I'll figure something out.

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Mar 15, 2024

AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases.

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

pypdf/_reader.py Outdated
@@ -314,6 +314,7 @@ def __init__(

if isinstance(stream, (str, Path)):
stream = open(stream, "rb") # noqa: SIM115
# Wish I could just close stream in __del__ but that fails a test very strangely
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: Do you have some details about the failure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure how much was relevant to drop in the commit but:

when adding a self.stream.close() in a __del__ function, that does work most of the time.

The one test failure I was seeing was in tests/test_reader.py, the failing test was test_get_page_of_encrypted_file but interestingly this would pass on it's own. I narrowed down the source of the issue to the previous test test_issue297's exception block where the PdfReader() initializer was failing (that's what the test is testing for) and the __del__ block wasn't being called due to the exception happening in the __init__.

It's very possible at some point the objects would be GCd but test failures were happening due to dangling file pointers at the following test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to add this to the commit

@mjsir911 mjsir911 force-pushed the memory branch 2 times, most recently from 1a4b1af to a0415db Compare March 15, 2024 14:53
@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worst than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «»
Β«cloned from reader's streamΒ», but del writer seems to unclaim the dangling file pointer.

If it's any consolation, the test failures are kind of an edge case where:

  • user is running on windows
  • filereader (or clone_from transitively) is instantiated via string/Path
  • file is acted upon outside of the pdfreader's context (opened/unlinked/whatever again by name) while the pdfreader is still in scope / not garbage collected

Sorry for jumping the gun on calling the tests solved! Still iterating on them.

@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worst than that, unfortunately! I'm not sure what the reference chain is from -> <cloned from reader's stream>, but del writer seems to unclaim the dangling file pointer.

I could potentially add a .close() or something to PdfReader which would at least make this process explicit. I would still be unsure how to propogate that to PdfWriter's API though.

Making it a context manager might work too and would mirror PdfWriter

@mjsir911 mjsir911 force-pushed the memory branch 2 times, most recently from 5c25bc8 to 0786520 Compare March 15, 2024 15:15
pypdf/_reader.py Outdated Show resolved Hide resolved
tests/test_page.py Show resolved Hide resolved
@mjsir911
Copy link
Author

I don't want this merged as it currently is, calling garbage collection manually in tests feels yucky.

@mjsir911 mjsir911 marked this pull request as draft March 15, 2024 18:16
@pubpub-zz
Copy link
Collaborator

It's even worst than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «» «cloned from reader's stream», but del writer seems to unclaim the dangling file pointer.

when you call .clone_document_from_reader() or append(pages), you clone all objects from PdfReader(). during this process we need to keep connection between the writer's objects and the reader's object in order to keep parents links for example.
When you have finished your work, or when you need to append a new set of pages detached from the PdfReader, you have to call
writer.reset_translation().

mjsir911 added a commit to terrapower/pypdf that referenced this pull request Mar 21, 2024
See py-pdf#2520, basically this was the last failing (only on windows) test
because if the pdfreaders are implicitly opening file streams that don't
get closed until they get garbage collected the .unlinks() create file
lock errors.
@mjsir911 mjsir911 force-pushed the memory branch 2 times, most recently from b105b76 to 5209fcd Compare March 21, 2024 00:48
pypdf/_reader.py Outdated Show resolved Hide resolved
pypdf/_reader.py Outdated Show resolved Hide resolved
@mjsir911
Copy link
Author

I should also add using PdfReader as a contextmanager in some documentation somewhere

Comment on lines -286 to -288
def __deepcopy__(self, memo: Any) -> "IndirectObject":
return IndirectObject(self.idnum, self.generation, self.pdf)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so found about removing deepcopy : some people may use it this could be considered as a regression. If we really want to remove it we shall use the depredication process

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pubpub-zz deprecating doesn't really make sense because with this change no objects will ever be deep-copyable, they will always have a reference to a file pointer that can't be pickleable.

The only reason deep copies work now is because the entire source PDF bytestring gets copied over with them, and that only happens when a filename is passed, deepcopying has never worked with a passed file pointer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way deprecation would work is if you deprecated it in lieu of this PR and then merged these changes in at a later date

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you leave __deepcopy__ is there an error ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I leave __deepcopy__ with the associated covered tests there is an error, yes.

TypeError: cannot pickle '_io.FileIO' object

    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/msirabella/fork/pypdf/venv/lib/python3.11/site-packages/_pytest/python.py", line 195, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/msirabella/fork/pypdf/tests/test_page.py", line 168, in test_transformation_equivalence
    page_box1 = deepcopy(page_box)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 161, in deepcopy
    rv = reductor(4)
         ^^^^^^^^^^^
TypeError: cannot pickle '_io.FileIO' object

@pubpub-zz
Copy link
Collaborator

I should also add using PdfReader as a contextmanager in some documentation somewhere

have you also been able to advance in your proposal ?

@mjsir911
Copy link
Author

have you also been able to advance in your proposal ?

Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now.

mjsir911 added a commit to terrapower/pypdf that referenced this pull request May 17, 2024
See py-pdf#2520, basically this was the last failing (only on windows) test
because if the pdfreaders are implicitly opening file streams that don't
get closed until they get garbage collected the .unlinks() create file
lock errors.
This breaks if PdfReader contains any un-pickleable attributes (such as
file pointers)
Was only ever being used unintentionally in the tests and doesn't really
make sense. Use .clone() instead
See py-pdf#2520, basically this was the last failing (only on windows) test
because if the pdfreaders are implicitly opening file streams that don't
get closed until they get garbage collected the .unlinks() create file
lock errors.
This halves allocated memory when doing a simple
PdfWriter(clone_from=Β«strΒ»)

I can't just close the self.stream in `__del__` because for some strange
reason the unit tests mark it as unflagged even after the test block
ends. Something about `__del__` finalizers being run on a second pass
while `weakref.finalize()` is run on the first pass.
To mirror PdfWriter, also hints towards file pointer management now that
we keep files open sometimes.
@mjsir911 mjsir911 marked this pull request as ready for review May 18, 2024 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PdfReader The PdfReader component is affected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants