NumPy Optimizations and joblib Comparison #8
Sorry, I don't have a benchmark. I looked at joblib but it didn't quite fit my needs. From what I see in the source, joblib uses a handful of tricks and tweaks to improve its handling of numpy arrays. Could you describe your use case? I'm interested in constructing a benchmark if possible. DiskCache lacks a memoizing decorator like joblib has but it's easy enough to write. Here's the simplest benchmark I could think of (using IPython):
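The original benchmark code does not survive here; below is a minimal sketch of the kind of memoizing decorator described, usable with any mapping-like cache (the `memoize` helper is hypothetical, not part of DiskCache's API):

```python
import functools
import hashlib
import pickle

def memoize(cache):
    """Memoize a function in any cache supporting get() and item assignment."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key on the pickled call; hashing keeps keys short and uniform.
            key = hashlib.md5(
                pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            sentinel = object()
            value = cache.get(key, sentinel)
            if value is sentinel:
                value = func(*args, **kwargs)
                cache[key] = value
            return value
        return wrapper
    return decorator
```

With DiskCache one would pass `diskcache.Cache(directory)` as the cache, decorate an identity function, and time repeated lookups in IPython with `%timeit`.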
For the simple identity function above, DiskCache is about 50 times faster. Out of fifty iterations, the fastest lookup took 16.9 microseconds while joblib took 832 microseconds.
That is just great. Here is what I get on my laptop (SSD disk):
You are also faster at retrieving numpy arrays.
Note that the scales will tip back for sufficiently large numpy arrays:
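The measurements that followed are not preserved here; a sketch of how that tipping point might be probed (the array size is an arbitrary choice, and this times serialization only, not a full cache round trip):

```python
import io
import pickle
import timeit

import numpy as np

arr = np.random.rand(1_000_000)  # ~8 MB of float64

def via_pickle():
    # Generic pickling, the path DiskCache takes by default.
    return pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)

def via_npsave():
    # numpy's native format, the kind of path joblib takes for arrays.
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

print("pickle :", timeit.timeit(via_pickle, number=20))
print("np.save:", timeit.timeit(via_npsave, number=20))
```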
Now the numpy-aware joblib has an advantage. Making DiskCache faster is then a matter of using the optimized serialization routines that numpy provides. DiskCache has a separate serialization class that handles converting to/from the database and filesystem. You can read about it at:
I would be glad to accept pull requests that add a numpy-aware diskcache.Disk-like serializer.
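As a rough sketch of what such a serializer could build on, numpy's own save/load routines bypass the generic pickle path (the helper names below are made up for illustration; only `np.save`/`np.load` come from numpy):

```python
import io

import numpy as np

def dump_array(arr):
    """Serialize an array using numpy's native format instead of pickle."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def load_array(data):
    """Inverse of dump_array: reconstruct the array from its bytes."""
    return np.load(io.BytesIO(data), allow_pickle=False)
```

A numpy-aware `diskcache.Disk` subclass could call helpers like these when storing and fetching values, falling back to pickle for non-array objects.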
Just as a side note: the fact that joblib is only usable as a decorator is actually a serious limitation in some cases, IMHO.
I wonder if the performance difference for large numpy arrays is due to compression. Currently DiskCache does no compression of pickled objects. Depending on disk performance, I could imagine compression improving the performance of serializing large numpy arrays.
Nope, it is not compression. Local benchmarking showed compression was ten times slower. Now I think the difference is the cryptographic hashing of inputs; for the identity function benchmark above in particular, it has a significant impact.
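That extra hashing step can be measured in isolation. A minimal sketch, assuming joblib-style MD5 hashing of the pickled inputs (the repeat count is arbitrary):

```python
import hashlib
import pickle
import timeit

args = (1,)  # the identity-function call from the benchmark above

# Building a cache key from the pickled arguments alone...
plain = timeit.timeit(lambda: pickle.dumps(args), number=100_000)

# ...versus additionally running a cryptographic hash over them per call.
hashed = timeit.timeit(
    lambda: hashlib.md5(pickle.dumps(args)).hexdigest(), number=100_000
)

print(f"pickle only: {plain:.4f}s  pickle+md5: {hashed:.4f}s")
```

For tiny inputs like these, the hash dominates the key-building cost, which is consistent with the per-lookup gap seen in the identity-function benchmark.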
Hi,
I'm currently using joblib for caching numpy array objects. Is there any benchmark on these kinds of inputs for DiskCache?