NumPy Optimizations and joblib Comparison #8

Open

nournia opened this issue Mar 27, 2016 · 6 comments

@nournia commented Mar 27, 2016

Hi,
I'm currently using joblib for caching numpy array objects. Is there any benchmark on these kinds of inputs for DiskCache?

@grantjenks (Owner) commented

Sorry, I don't have a benchmark. I looked at joblib but it didn't quite fit my needs. From what I see in the source, joblib uses a handful of tricks and tweaks to improve its handling of numpy arrays.

Could you describe your use case? I'm interested in constructing a benchmark if possible.

DiskCache lacks a memoizing decorator like joblib's, but one is easy enough to write. Here's the simplest benchmark I could think of (using IPython):

In [6]: %paste
import diskcache

class Cache(diskcache.Cache):
    def memoize(self, func):
        # Minimal memoizing decorator: results are keyed by the
        # positional arguments.
        def wrapper(*args):
            try:
                return self[args]  # cache hit
            except KeyError:
                value = func(*args)  # cache miss: compute and store
                self[args] = value
                return value
        return wrapper

cache = Cache('/tmp/diskcache')

@cache.memoize
def identity1(value):
    print('identity1', value)
    return value

%timeit -n1 -r50 identity1(0)

import joblib

memory = joblib.Memory('/tmp/joblib')

@memory.cache
def identity2(value):
    print('identity2', value)
    return value

%timeit -n1 -r50 identity2(0)
## -- End pasted text --
1 loop, best of 50: 16.9 µs per loop
1 loop, best of 50: 832 µs per loop

For the simple identity function above, DiskCache is about 50 times faster. Out of fifty iterations, the fastest DiskCache lookup took 16.9 microseconds, while the fastest joblib lookup took 832 microseconds.

@nournia (Author) commented Mar 28, 2016

That is just great. Here is what I get on my laptop (SSD disk):

%timeit -n1 -r50 identity_diskcache(10)  # 1 loops, best of 50: 67 µs per loop
%timeit -n1 -r50 identity_joblib(10)  # 1 loops, best of 50: 217 µs per loop

import numpy as np
random = np.random.random(5000)
%timeit -n1 -r50 identity_diskcache(random)  # 1 loops, best of 50: 257 µs per loop
%timeit -n1 -r50 identity_joblib(random)  # 1 loops, best of 50: 770 µs per loop

DiskCache is also faster at retrieving numpy arrays.

nournia closed this as completed Mar 28, 2016
@grantjenks (Owner) commented

Note that the scales will tip back for sufficiently large numpy arrays:

In [1]: %paste
values = np.random.random(int(1e6))
%timeit -n1 -r50 identity1(values)
%timeit -n1 -r50 identity2(values)
## -- End pasted text --
1 loop, best of 50: 42.1 ms per loop
1 loop, best of 50: 12.9 ms per loop

Now the numpy-aware joblib has an advantage.

Making DiskCache faster is then a matter of using the optimized serialization routines that numpy provides. DiskCache has a separate serialization class, diskcache.Disk, that handles converting values to and from the database and filesystem; see the DiskCache documentation for details.

I would be glad to accept pull requests that create a numpy-aware diskcache.Disk-like serializer.
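For illustration, here is a minimal sketch of what such a serializer might look like. It assumes the store/fetch hooks that current DiskCache releases expose on diskcache.Disk, assumes every cached value is a numpy array, and uses numpy's native binary format (np.save/np.load) instead of pickle; NumpyDisk is a hypothetical name, not part of the library:

import io

import diskcache
import numpy as np


class NumpyDisk(diskcache.Disk):
    # Hypothetical Disk subclass: serialize values with numpy's native
    # binary format instead of pickle. Assumes every cached value is a
    # numpy array.

    def store(self, value, read, **kwargs):
        # Serialize the array into an in-memory buffer and hand the raw
        # bytes to the base class, which writes them to the database or
        # filesystem as usual.
        if not read:
            buffer = io.BytesIO()
            np.save(buffer, value, allow_pickle=False)
            value = buffer.getvalue()
        return super().store(value, read, **kwargs)

    def fetch(self, mode, filename, value, read):
        # Read the raw bytes back and rebuild the array with np.load.
        data = super().fetch(mode, filename, value, read)
        if not read:
            data = np.load(io.BytesIO(data), allow_pickle=False)
        return data


# Usage: current DiskCache releases accept a Disk class via the `disk`
# argument to Cache.
cache = diskcache.Cache('/tmp/diskcache-numpy', disk=NumpyDisk)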

grantjenks reopened this Mar 28, 2016
grantjenks changed the title from "joblib comparison" to "NumPy Support and joblib Comparison" on Mar 28, 2016
@pombredanne (Contributor) commented

Just as a side note: the fact that joblib is only usable as a decorator is, IMHO, a serious limitation in some cases.
Take for instance this use case:
I have a function that takes a file path and does some expensive computation on it. The file is large, so I read it in chunks inside my function. I could compute a cache key from that stream in my function and cache the results there. But with a decorator-only approach this is not possible: I cannot create a sub-function that takes the whole file content as an argument so that joblib can compute the cache key for me. @ogrisel @GaelVaroquaux Is this a fair statement with respect to joblib's capabilities?
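For reference, DiskCache's dict-style API supports that imperative pattern directly. A minimal sketch, where do_computation is a hypothetical stand-in for the expensive work:

import hashlib

import diskcache

cache = diskcache.Cache('/tmp/diskcache')


def expensive(path):
    # Build the cache key from the file content, read in chunks, so the
    # whole file never has to sit in memory at once.
    digest = hashlib.sha256()
    with open(path, 'rb') as reader:
        for chunk in iter(lambda: reader.read(1 << 20), b''):
            digest.update(chunk)
    key = ('expensive', digest.hexdigest())

    try:
        return cache[key]  # cache hit
    except KeyError:
        result = do_computation(path)  # hypothetical expensive computation
        cache[key] = result
        return result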

@grantjenks (Owner) commented

I wonder if the performance difference for large numpy arrays is due to compression. Currently DiskCache does no compression of pickled objects. Depending on disk performance, I could imagine compression improving the performance of serializing large numpy arrays.
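One rough way to test that hypothesis is to time compression of the pickled array by itself, along these lines (in IPython):

import pickle
import zlib

import numpy as np

values = np.random.random(int(1e6))
data = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

# If even the fastest zlib level takes longer than the entire 12.9 ms
# joblib round trip above, compression cannot explain the gap.
%timeit zlib.compress(data, 1)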

@grantjenks (Owner) commented

Nope, it is not compression. Local benchmarking showed compression was ten times slower.

Now I think the difference is the cryptographic hashing of inputs that joblib performs to build its cache keys. For the identity-function benchmark above in particular, that hashing has a significant impact.
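One way to gauge that overhead is to time joblib.hash in isolation; as far as I can tell it is the same md5-based routine joblib.Memory uses to derive cache keys from arguments. A sketch (in IPython):

import joblib
import numpy as np

# Time the argument-hashing step by itself, separate from any storage,
# for both a trivial input and a large array.
%timeit joblib.hash(0)

values = np.random.random(int(1e6))
%timeit joblib.hash(values)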

grantjenks changed the title from "NumPy Support and joblib Comparison" to "NumPy Optimizations and joblib Comparison" on Jun 13, 2019