PI: Don't load entire file into memory when passed file name #2520

mjsir911 · 2024-03-15T02:32:58Z

This functionality was originally added back in ced2890.

Reduces memory usage by the size of the loaded file.

Benchmark script

from pypdf import PdfWriter

filename = '/home/msirabella/tmp/100MB-TESTFILE.ORG.pdf'

writer = PdfWriter(clone_from=filename)

writer.write("out.pdf")

Before stats

📏 Total allocations:
	109695

📦 Total memory allocated:
	409.726MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 6.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 40.000B  :   229 ▇
	< 256.000B :    33 ▇
	< 1.590KB  : 67394 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 10.104KB :  1060 ▇
	< 64.190KB :   141 ▇
	< 407.789KB:    47 ▇
	< 2.530MB  :    82 ▇
	< 16.072MB :     0 
	<=102.099MB:     2 ▇
	--------------------------------------------
	max: 102.099MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1223
	 REALLOC: 865
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- __init__:./pypdf/_reader.py:315 -> 204.210MB
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 48.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 26.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.360MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81058
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23017
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2101
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 988
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365

After stats

📏 Total allocations:
	109687

📦 Total memory allocated:
	205.521MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 4.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 18.000B  :     4 ▇
	< 80.000B  :   227 ▇
	< 348.000B :    39 ▇
	< 1.468KB  : 67239 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 6.341KB  :   737 ▇
	< 27.388KB :   563 ▇
	< 118.297KB:    68 ▇
	< 510.959KB:    21 ▇
	<=2.155MB  :    82 ▇
	--------------------------------------------
	max: 2.155MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1218
	 REALLOC: 862
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 46.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 24.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.356MB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 4.844MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81056
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23015
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2095
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 989
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
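
A rough way to reproduce the effect of the two runs above without an external profiler is the standard library's tracemalloc. This is only a sketch: it assumes the old code path copied the whole file into an in-memory buffer while the new one keeps the file object open and reads on demand. The helper names (`open_eager`, `open_lazy`, `peak_bytes`) and the 4 MiB dummy file are hypothetical and not part of pypdf.

```python
import io
import os
import tempfile
import tracemalloc

SIZE = 4 * 1024 * 1024  # 4 MiB dummy payload standing in for a large PDF


def open_eager(path):
    """Assumed old behaviour: copy the whole file into memory."""
    with open(path, "rb") as fh:
        return io.BytesIO(fh.read())


def open_lazy(path):
    """Assumed new behaviour: keep the file open and read on demand."""
    return open(path, "rb")


def peak_bytes(opener, path):
    """Return the peak traced allocation while opening and reading a header."""
    tracemalloc.start()
    try:
        stream = opener(path)
        try:
            stream.read(1024)  # a parser only touches small pieces at a time
        finally:
            stream.close()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as fh:
        fh.write(b"\0" * SIZE)
    eager_peak = peak_bytes(open_eager, path)
    lazy_peak = peak_bytes(open_lazy, path)

print(f"eager peak: {eager_peak} B, lazy peak: {lazy_peak} B")
```

The eager variant should peak at roughly the file size or more, while the lazy variant should stay far below it. The absolute numbers will differ from the profiler output above, since tracemalloc only sees allocations made through Python's allocators.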

codecov · 2024-03-15T02:38:58Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.97%. Comparing base (c227b0c) to head (c3fe2e7).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2520   +/-   ##
=======================================
  Coverage   94.97%   94.97%           
=======================================
  Files          50       50           
  Lines        8331     8340    +9     
  Branches     1669     1669           
=======================================
+ Hits         7912     7921    +9     
  Misses        260      260           
  Partials      159      159

stefan6419846 · 2024-03-15T08:09:52Z

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

mjsir911 · 2024-03-15T08:20:31Z

> Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

Sorry, I got the before & after mixed up. Fixed.

> And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

Yeah, I can do that. I'll have a bit more difficulty fixing the Windows tests since I don't have a Windows box to test on easily, but I'll figure something out.

stefan6419846 · 2024-03-15T14:39:20Z

AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases.

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

stefan6419846 · 2024-03-15T14:39:44Z

pypdf/_reader.py

@@ -314,6 +314,7 @@ def __init__(

        if isinstance(stream, (str, Path)):
            stream = open(stream, "rb")  #  noqa: SIM115
+            # Wish I could just close stream in __del__ but that fails a test very strangely


Just out of curiosity: Do you have some details about the failure?

Yeah, I'm not sure how much was relevant to drop in the commit but:

Adding a self.stream.close() call in a __del__ method does work most of the time.

The one test failure I was seeing was in tests/test_reader.py: test_get_page_of_encrypted_file failed, but interestingly it would pass on its own. I narrowed the source of the issue down to the previous test, test_issue297, whose exception block exercises a failing PdfReader() initializer (that's what the test is testing for). The __del__ method wasn't being called because the exception happened in __init__.

It's very possible the objects would be GC'd at some point, but the test failures were happening due to dangling file pointers at the following test.

I'm going to add this to the commit
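
One plausible mechanism for this (not confirmed anywhere in this thread) is that a stored exception keeps its traceback alive, the traceback keeps the `__init__` frame alive, and that frame still references `self`, so `__del__` cannot run until the exception is dropped. A minimal sketch on CPython, using a hypothetical `LeakyReader` class rather than pypdf itself:

```python
import gc
import os
import tempfile

opened = []  # records every stream so the demo can inspect it later


class LeakyReader:
    """Hypothetical stand-in for a reader that opens a file in __init__."""

    def __init__(self, path):
        self.stream = open(path, "rb")
        opened.append(self.stream)
        raise ValueError("simulated parse failure")

    def __del__(self):
        self.stream.close()


def demo(path):
    try:
        LeakyReader(path)
    except ValueError as exc:
        saved = exc  # e.g. a test framework keeping the exception around
    open_while_exception_alive = not opened[-1].closed
    del saved  # drop the exception, its traceback and the frames it holds
    gc.collect()
    return open_while_exception_alive, opened[-1].closed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-")
    result = demo(path)

print(result)
```

As long as something holds on to the exception, the half-initialized object and its open file handle stay alive, which would be enough to break a following test that touches the same file on Windows.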

mjsir911 · 2024-03-15T15:01:49Z

> I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the stream of the reader it was cloned from, but del writer seems to release the dangling file pointer.

If it's any consolation, the test failures are kind of an edge case where:

- the user is running on Windows,
- the reader (or clone_from, transitively) is instantiated via a string/Path, and
- the file is acted upon outside of the PdfReader's context (opened/unlinked/whatever again by name) while the PdfReader is still in scope / not garbage collected.

Sorry for jumping the gun on calling the tests solved! Still iterating on them.

mjsir911 · 2024-03-15T15:04:58Z

> I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

> It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the stream of the reader it was cloned from, but del writer seems to release the dangling file pointer.

I could potentially add a .close() or something to PdfReader, which would at least make this process explicit. I would still be unsure how to propagate that to PdfWriter's API though.

Making it a context manager might work too and would mirror PdfWriter.
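
A minimal sketch of what an explicit `.close()` plus context-manager support could look like. The `FileBackedReader` class and its `_owns_stream` attribute are hypothetical illustrations, not pypdf's actual implementation. The assumption here is that the reader should only close streams it opened itself.

```python
import os
import tempfile


class FileBackedReader:
    """Hypothetical reader that owns the stream only if it opened it."""

    def __init__(self, stream):
        self._owns_stream = isinstance(stream, (str, os.PathLike))
        self.stream = open(stream, "rb") if self._owns_stream else stream

    def close(self):
        # Only close what we opened; caller-supplied streams stay untouched.
        if self._owns_stream and not self.stream.closed:
            self.stream.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.7\n")

    with FileBackedReader(path) as reader:
        header = reader.stream.read(5)
    closed_after_with = reader.stream.closed
    os.unlink(path)  # no open handle remains, so removing the file is fine
```

With this shape, a test that needs to unlink the file afterwards no longer depends on garbage collection timing.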

mjsir911 · 2024-03-15T18:16:09Z

I don't want this merged as it currently is; calling garbage collection manually in tests feels yucky.

pubpub-zz · 2024-03-15T20:11:40Z

> It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the stream of the reader it was cloned from, but del writer seems to release the dangling file pointer.

When you call .clone_document_from_reader() or append(pages), you clone all objects from the PdfReader. During this process we need to keep a connection between the writer's objects and the reader's objects, for example in order to preserve parent links.
When you have finished your work, or when you need to append a new set of pages detached from the PdfReader, you have to call writer.reset_translation().

mjsir911 · 2024-03-21T18:44:41Z

I should also document using PdfReader as a context manager somewhere.

pubpub-zz · 2024-04-13T14:53:08Z

pypdf/generic/_base.py

-    def __deepcopy__(self, memo: Any) -> "IndirectObject":
-        return IndirectObject(self.idnum, self.generation, self.pdf)
-


I'm not so fond of removing __deepcopy__: some people may use it, so this could be considered a regression. If we really want to remove it, we should use the deprecation process.

@pubpub-zz deprecating doesn't really make sense, because with this change no objects will ever be deep-copyable: they will always hold a reference to a file pointer, which can't be pickled.

The only reason deep copies work now is that the entire source PDF bytestring gets copied along with them, and that only happens when a filename is passed. Deepcopying has never worked with a passed file pointer.

The only way deprecation would work is if you deprecated it first, instead of merging this PR now, and then merged these changes in at a later date.

If you leave __deepcopy__ in, is there an error?

If I leave __deepcopy__ in, along with the tests that cover it, there is an error, yes.

TypeError: cannot pickle '_io.FileIO' object

    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/msirabella/fork/pypdf/venv/lib/python3.11/site-packages/_pytest/python.py", line 195, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/msirabella/fork/pypdf/tests/test_page.py", line 168, in test_transformation_equivalence
    page_box1 = deepcopy(page_box)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 161, in deepcopy
    rv = reductor(4)
         ^^^^^^^^^^^
TypeError: cannot pickle '_io.FileIO' object
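
The same class of failure can be reproduced with the standard library alone. In the sketch below, `Holder` is a hypothetical stand-in for any object that keeps a reference to its source stream. It shows why deep copies worked while the stream was an in-memory buffer and stop working once it is a real file object.

```python
import copy
import io
import os
import tempfile


class Holder:
    """Hypothetical object that keeps a reference to its source stream."""

    def __init__(self, stream):
        self.stream = stream


def can_deepcopy(obj):
    """Return True if copy.deepcopy succeeds, False on a TypeError."""
    try:
        copy.deepcopy(obj)
    except TypeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.7\n")

    # An in-memory buffer can be copied along with the object.
    buffer_ok = can_deepcopy(Holder(io.BytesIO(b"%PDF-1.7\n")))
    # An open file object refuses to be pickled, so deepcopy fails.
    with open(path, "rb") as fh:
        file_ok = can_deepcopy(Holder(fh))

print(buffer_ok, file_ok)
```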

pubpub-zz · 2024-04-13T14:54:04Z

> I should also add using PdfReader as a contextmanager in some documentation somewhere

Have you been able to make progress on your proposal?

mjsir911 · 2024-04-13T20:27:49Z

> have you also been able to advance in your proposal ?

Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now.

This halves allocated memory when doing a simple PdfWriter(clone_from=«str»). I can't just close self.stream in `__del__` because, for some strange reason, the unit tests mark it as unflagged even after the test block ends. It seems to be something about `__del__` finalizers being run on a second pass while `weakref.finalize()` callbacks are run on the first pass.
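
For illustration, registering the cleanup with `weakref.finalize()` instead of `__del__` could look roughly like the sketch below. `FinalizedReader` is a hypothetical class, and whether the PR wires the finalizer up in exactly this way is an assumption.

```python
import gc
import os
import tempfile
import weakref


class FinalizedReader:
    """Hypothetical reader that registers a finalizer instead of __del__."""

    def __init__(self, path):
        self.stream = open(path, "rb")
        # The callback is a bound method of the stream, not of self, so the
        # finalizer does not keep the reader itself alive.
        self._finalizer = weakref.finalize(self, self.stream.close)

    def close(self):
        # Explicit close; calling the finalizer more than once is a no-op.
        self._finalizer()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.7\n")

    reader = FinalizedReader(path)
    stream = reader.stream
    open_while_alive = not stream.closed
    del reader  # last reference gone, so the finalizer closes the stream
    gc.collect()
    closed_after_collection = stream.closed
```

A finalizer registered this way is also invoked at interpreter exit if it has not run before, which `__del__` does not guarantee.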

mjsir911 force-pushed the memory branch from 04fbcb3 to f66f49b Compare March 15, 2024 07:34

mjsir911 changed the title ~~Don't load entire file into memory when PdfReader passed file name~~ PI: Don't load entire file into memory when passed file name Mar 15, 2024

mjsir911 force-pushed the memory branch from f66f49b to 9ccce80 Compare March 15, 2024 07:41

mjsir911 changed the title ~~PI: Don't load entire file into memory when passed file name~~ PI: Don't load entire file into memory when passed file name Mar 15, 2024

stefan6419846 reviewed Mar 15, 2024

stefan6419846 added the PdfReader The PdfReader component is affected label Mar 15, 2024

stefan6419846 reviewed Mar 15, 2024

mjsir911 force-pushed the memory branch 2 times, most recently from 1a4b1af to a0415db Compare March 15, 2024 14:53

mjsir911 force-pushed the memory branch 2 times, most recently from 5c25bc8 to 0786520 Compare March 15, 2024 15:15

pubpub-zz reviewed Mar 15, 2024

mjsir911 marked this pull request as draft March 15, 2024 18:16

mjsir911 force-pushed the memory branch 2 times, most recently from b105b76 to 5209fcd Compare March 21, 2024 00:48

pubpub-zz reviewed Mar 21, 2024

pubpub-zz reviewed Apr 13, 2024

mjsir911 force-pushed the memory branch from 1f81b68 to 44a828c Compare May 17, 2024 19:06

mjsir911 added 6 commits May 17, 2024 20:59

TST: Don't deepcopy PdfReader objects

8f4ac76

This breaks if PdfReader contains any un-pickleable attributes (such as file pointers)

MAINT: remove deepcopy functionality

bea8d90

Was only ever being used unintentionally in the tests and doesn't really make sense. Use .clone() instead

TST: Use buffer instead of opening file many times

18bd9ec

See py-pdf#2520, basically this was the last failing (only on windows) test because if the pdfreaders are implicitly opening file streams that don't get closed until they get garbage collected the .unlinks() create file lock errors.

STY: fix typo

c8a77aa

MAINT: Allow opening PdfReader as contextmanager

3578bf7

To mirror PdfWriter, also hints towards file pointer management now that we keep files open sometimes.

mjsir911 force-pushed the memory branch from 44a828c to 3578bf7 Compare May 18, 2024 01:12

mjsir911 marked this pull request as ready for review May 18, 2024 01:13

fixup! MAINT: Allow opening PdfReader as contextmanager

c3fe2e7
