NumPy Optimizations and joblib Comparison #8

Open

nournia opened this issue Mar 27, 2016 · 6 comments

@nournia commented Mar 27, 2016

Hi,
I'm currently using joblib for caching numpy array objects. Is there any benchmark on these kinds of inputs for DiskCache?

@grantjenks (Owner) commented

Sorry, I don't have a benchmark. I looked at joblib but it didn't quite fit my needs. From what I see in the source, joblib uses a handful of tricks and tweaks to improve its handling of numpy arrays.

Could you describe your use case? I'm interested in constructing a benchmark if possible.

DiskCache lacks a memoizing decorator like joblib's, but one is easy enough to write. Here's the simplest benchmark I could think of (using IPython):

In [6]: %paste
import diskcache

class Cache(diskcache.Cache):
    def memoize(self, func):
        # Minimal memoizing decorator: results are keyed by the
        # positional arguments.
        def wrapper(*args):
            try:
                return self[args]  # cache hit
            except KeyError:
                value = func(*args)  # cache miss: compute and store
                self[args] = value
                return value
        return wrapper

cache = Cache('/tmp/diskcache')

@cache.memoize
def identity1(value):
    print('identity1', value)
    return value

%timeit -n1 -r50 identity1(0)

import joblib

memory = joblib.Memory('/tmp/joblib')

@memory.cache
def identity2(value):
    print('identity2', value)
    return value

%timeit -n1 -r50 identity2(0)
## -- End pasted text --
1 loop, best of 50: 16.9 µs per loop
1 loop, best of 50: 832 µs per loop

For the simple identity function above, DiskCache is about 50 times faster. Out of fifty iterations, the fastest DiskCache lookup took 16.9 microseconds, while the fastest joblib lookup took 832 microseconds.

@nournia (Author) commented Mar 28, 2016

That is just great. Here is what I get on my laptop (SSD disk):

%timeit -n1 -r50 identity_diskcache(10)  # 1 loops, best of 50: 67 µs per loop
%timeit -n1 -r50 identity_joblib(10)  # 1 loops, best of 50: 217 µs per loop

import numpy as np
random = np.random.random(5000)
%timeit -n1 -r50 identity_diskcache(random)  # 1 loops, best of 50: 257 µs per loop
%timeit -n1 -r50 identity_joblib(random)  # 1 loops, best of 50: 770 µs per loop

DiskCache is also faster at retrieving numpy arrays.

nournia closed this as completed Mar 28, 2016
@grantjenks (Owner) commented

Note that the scales will tip back for sufficiently large numpy arrays:

In [1]: %paste
values = np.random.random(int(1e6))
%timeit -n1 -r50 identity1(values)
%timeit -n1 -r50 identity2(values)
## -- End pasted text --
1 loop, best of 50: 42.1 ms per loop
1 loop, best of 50: 12.9 ms per loop

Now the numpy-aware joblib has an advantage.

Making DiskCache faster is then a matter of using the optimized serialization routines that numpy provides. DiskCache has a separate serialization class, diskcache.Disk, that handles converting values to and from the database and filesystem; see the DiskCache documentation for details.

I would be glad to accept pull requests that create a numpy-aware diskcache.Disk-like serializer.
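For illustration, here is a minimal sketch of what such a serializer might look like. It assumes the store/fetch hooks that current DiskCache releases expose on diskcache.Disk, assumes every cached value is a numpy array, and uses numpy's native binary format (np.save/np.load) instead of pickle; NumpyDisk is a hypothetical name, not part of the library:

import io

import diskcache
import numpy as np


class NumpyDisk(diskcache.Disk):
    # Hypothetical Disk subclass: serialize values with numpy's native
    # binary format instead of pickle. Assumes every cached value is a
    # numpy array.

    def store(self, value, read, **kwargs):
        # Serialize the array into an in-memory buffer and hand the raw
        # bytes to the base class, which writes them to the database or
        # filesystem as usual.
        if not read:
            buffer = io.BytesIO()
            np.save(buffer, value, allow_pickle=False)
            value = buffer.getvalue()
        return super().store(value, read, **kwargs)

    def fetch(self, mode, filename, value, read):
        # Read the raw bytes back and rebuild the array with np.load.
        data = super().fetch(mode, filename, value, read)
        if not read:
            data = np.load(io.BytesIO(data), allow_pickle=False)
        return data


# Usage: current DiskCache releases accept a Disk class via the `disk`
# argument to Cache.
cache = diskcache.Cache('/tmp/diskcache-numpy', disk=NumpyDisk)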

grantjenks reopened this Mar 28, 2016
grantjenks changed the title from "joblib comparison" to "NumPy Support and joblib Comparison" on Mar 28, 2016
@pombredanne (Contributor) commented

Just as a side note: the fact that joblib is only usable as a decorator is, IMHO, a serious limitation in some cases.
Take for instance this use case:
I have a function that takes a file path and does some expensive computation on it. The file is large, so I read it in chunks inside my function. I could compute a cache key from that stream in my function and cache the results there. But with a decorator-only approach this is not possible: I cannot create a sub-function that takes the whole file content as an argument so that joblib can compute the cache key for me. @ogrisel @GaelVaroquaux Is this a fair statement with respect to joblib's capabilities?
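For reference, DiskCache's dict-style API supports that imperative pattern directly. A minimal sketch, where do_computation is a hypothetical stand-in for the expensive work:

import hashlib

import diskcache

cache = diskcache.Cache('/tmp/diskcache')


def expensive(path):
    # Build the cache key from the file content, read in chunks, so the
    # whole file never has to sit in memory at once.
    digest = hashlib.sha256()
    with open(path, 'rb') as reader:
        for chunk in iter(lambda: reader.read(1 << 20), b''):
            digest.update(chunk)
    key = ('expensive', digest.hexdigest())

    try:
        return cache[key]  # cache hit
    except KeyError:
        result = do_computation(path)  # hypothetical expensive computation
        cache[key] = result
        return result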

@grantjenks (Owner) commented

I wonder if the performance difference for large numpy arrays is due to compression. Currently DiskCache does no compression of pickled objects. Depending on disk performance, I could imagine compression improving the performance of serializing large numpy arrays.
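One rough way to test that hypothesis is to time compression of the pickled array by itself, along these lines (in IPython):

import pickle
import zlib

import numpy as np

values = np.random.random(int(1e6))
data = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

# If even the fastest zlib level takes longer than the entire 12.9 ms
# joblib round trip above, compression cannot explain the gap.
%timeit zlib.compress(data, 1)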

@grantjenks (Owner) commented

Nope, it is not compression. Local benchmarking showed compression was ten times slower.

Now I think the difference is the cryptographic hashing of inputs that joblib performs to build its cache keys. For the identity-function benchmark above in particular, that hashing has a significant impact.
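One way to gauge that overhead is to time joblib.hash in isolation; as far as I can tell it is the same md5-based routine joblib.Memory uses to derive cache keys from arguments. A sketch (in IPython):

import joblib
import numpy as np

# Time the argument-hashing step by itself, separate from any storage,
# for both a trivial input and a large array.
%timeit joblib.hash(0)

values = np.random.random(int(1e6))
%timeit joblib.hash(values)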

grantjenks changed the title from "NumPy Support and joblib Comparison" to "NumPy Optimizations and joblib Comparison" on Jun 13, 2019