Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: IDistributedCache - provision and questions #902

Open
mgravell opened this issue Mar 24, 2024 · 20 comments
Open

Discussion: IDistributedCache - provision and questions #902

mgravell opened this issue Mar 24, 2024 · 20 comments

Comments

@mgravell
Copy link

For context, I'm on the devdiv / aspnet team and am currently probably the main owner of the caching story.

IDistributedCache is (at least in .NET 8) the main caching backend abstraction, with local-memory, Redis, SQLite, SQL Server, Cosmos, etc backends. I propose that FASTER should have such, and have prototyped it here. The good news is: it seems to work great, and the performance is 🔥

IDistributedCache is getting some love in .NET 9, so my eyes are on this area.

I have some thoughts from this, though:

  1. should MSFT provide an official implementation? if so, where should it live? I'm happy for you to take mine
  2. request for eyeballs; am I doing anything wrong/incomplete? in particular, things like "session length", and also: since I implement expiration, do I need to explicitly remove something, or is returning false from the functions layer enough to cause that to happen?
  3. the serialization layer feels "old school"; there may be a lot of room for improvement there, to improve perf and reduce allocations and memory-copies; happy to have that discussion at length, but increasingly serializers are supporting IBufferWriter<byte> and ReadOnlySequence<byte> implementations; I'd love to have a conversation about that
  4. also, stateless serializers rather than serializer factories

Let me know if there is interest from this from the FASTER side; if not, I can try to do my best from the "other side"

@jodydonetti
Copy link

FWIW the current impl is already working well!

image

nb: I just needed to make a minor change, for which I made a PR.

@badrishc
Copy link
Contributor

badrishc commented Mar 24, 2024

Thanks @mgravell - this is a great start! With a starting point like this, it should be possible to look closely and optimize it much further. I already see opportunities for improvement of at least 2X if not more. Garnet skyrocketing is keeping us a bit busy right now, but we will check it out in the next week or two, and perhaps integrate into our NuGets (ofc, only if you think that makes sense). cc @TedHartMS

@Tornhoof
Copy link
Contributor

I'll compare it tomorrow against my own impl. If I remember correctly, the preferred method for session Handling is to cache and reuse the sessions instead of new() every time.

@badrishc
Copy link
Contributor

Yes, that is one important optimization. The other is to use VarLenBlittableAllocator if the keys and values are blittable -- instead of paying the cost of object serializer at all.

@Tornhoof
Copy link
Contributor

Tornhoof commented Mar 25, 2024

My own impl is not for a DistributedCache, but for a larger than memory local cache, using blittable keys and values (I write fixed size keys and var length binary serializable blobs only). I don't have sliding expiration, just a size check, which then cuts of segments after a certain size. So I have only one log device and not two.
My clientsession type alias looks like:

using ClientSession  = ClientSession<SpanByte, SpanByte, SpanByte, SpanByteAndMemory, Empty, SpanByteFunctions<Empty>;

That being said, these are the two important optimizations, as badrish already wrote, cached sessions and blittable keys/values.

Cache sessions

I used a ConcurrentBag, I'm aware that ConcurrentQueue and/or ObjectPool might be better, depending on the congestion of the data structure, but none of my benchmarks ever showed any conclusive difference.

Other Tidbits

  • I recommend adding an option to the options type to do DeleteOnClose (arg for CreateLogDevice), a lot of usages probably do not need to persist the cache on restart of the application.
  • My LogSettings are more complex and have extra ReadCacheSettings, according to https://microsoft.github.io/FASTER/docs/fasterkv-tuning/ it's useful for larger than memory caches, I followed that page for the other logsettings
  • My impl implements IEnumerable<...>, mostly for testing (the IDistributedCache interface does not include any enumeration support afaik)

Notes about SpanByte

If you go the SpanByte route (which is the blittable route) and want to prevent memory copies, the complexity rises quite fast, as you need to do the async/await dance without async methods. This is required as you need to use fixed to create SpanBytes from your keys/values.
This means you need to split into sync and async paths via ValueTask.IsCompletedSuccessfully.

At the end such a split method looks like this (for ReadAsync), it easily is 3 or 4 times as long and complex as doing it without the async/await dance and a lot more error prone, but Marc is probably more aware of async/await shenanigans than most of us, so if anyone would be safe doing that stuff, it is him.

This is only the ReadAsync implementation and related methods/types, I can't post the full implementation without asking legal (and not getting an answer in a timely fashion anyway).

public ValueTask<TValue?> ReadAsync<TKey, TValue>(TKey key, CancellationToken cancellationToken = default) where TKey : unmanaged where TValue : struct,  IByteWritable<TValue>
{
    var session = GetSession();
    return ReadAsyncInternal<TKey, TValue>(key, session, true, cancellationToken);
}

private unsafe ValueTask<TValue?> ReadAsyncInternal<TKey, TValue>(TKey key,
    ClientSession session, bool releaseSession, CancellationToken cancellationToken = default)
    where TKey : unmanaged where TValue : struct,  IByteWritable<TValue>
{
    try
    {
        Span<byte> keySpan = MemoryMarshal.AsBytes(new Span<TKey>(ref key));
        fixed (byte* _ = keySpan) // the fixed span needs only be fixed until the ReadAsync returns, for async completions it is copied internally
        {
            SpanByte keySpanByte = SpanByte.FromFixedSpan(keySpan);
            var task = session.ReadAsync(ref keySpanByte, token: cancellationToken);
            if (task.IsCompletedSuccessfully)
            {
                var (status, memory) = task.Result.Complete();
                var result = HandleReadStatus<TValue>(status, releaseSession, session, memory);
                return new ValueTask<TValue?>(result);
            }

            return CompleteReadAsync<TValue>(task, releaseSession, session);
        }
    }
    catch
    {
        session.Dispose();
        throw;
    }
}

private TValue? HandleReadStatus<TValue>(Status status, bool releaseSession, ClientSession session, SpanByteAndMemory memory) where TValue : struct,  IByteWritable<TValue>
{
    TValue? result = default;
    if (status.Found) // if found, we deserialize it
    {
        result = Deserialize<TValue>(memory);
    }
    else if (!status.NotFound) // everything except not found is an error
    {
        ThrowInvalidOperation(status);
    }

    // for single ops we release the session here, for multi ops we wait until they are all finished
    if (releaseSession)
    {
        ReleaseSession(session);
    }

    return result;
}

private async ValueTask<TValue?> CompleteReadAsync<TValue>(ValueTask<FasterKV<SpanByte, SpanByte>.ReadAsyncResult<SpanByte, SpanByteAndMemory, Empty>> valueTask, bool releaseSession,
    ClientSession session) where TValue : struct,  IByteWritable<TValue>
{
    try
    {
        var readAsyncResult = await valueTask.ConfigureAwait(false);
        var (status, memory) = readAsyncResult.Complete();
        return HandleReadStatus<TValue>(status, releaseSession, session, memory);
    }
    catch
    {
        session.Dispose();
        throw;
    }
}

private static TValue Deserialize<TValue>(SpanByteAndMemory spanByteAndMemory) where TValue : struct, IByteWritable<TValue>
{
    IMemoryOwner<byte> memory;
    int length;
    if (spanByteAndMemory.IsSpanByte)
    {
        // ReSharper disable once PossiblyImpureMethodCallOnReadonlyVariable
        (memory, length) = spanByteAndMemory.SpanByte.ToMemoryOwner(MemoryPool<byte>.Shared);
    }
    else
    {
        memory = spanByteAndMemory.Memory;
        length = spanByteAndMemory.Length;
    }

    return TValue.Deserialize(memory, length);
}


public interface IByteWritable<TSelf> where TSelf : struct
{
    /// <summary>
    /// Deserialize the content from memory, the memory is now owned by this instance, dispose when finished
    /// </summary>
    public static abstract TSelf Deserialize(IMemoryOwner<byte> memory, int length);

    /// <summary>
    /// Make sure TSelf is readonly struct, output is written data, if the data does not fit into the existing output
    /// buffer is populated and needs to be disposed by the caller
    /// </summary>
    public static abstract void Serialize(in TSelf self, ref Span<byte> output, out IMemoryOwner<byte>? buffer, out int written);
}

My tests indicated, that in-memory operations completed synchronously and disk-based operations asynchronously, so your tests need to be run with enough data, otherwise you might not hit the async code path.

@mgravell
Copy link
Author

mgravell commented Mar 25, 2024 via email

@mgravell
Copy link
Author

mgravell commented Mar 25, 2024

log options: done

SpanByte - I have an unsuccessful branch here; I'm using SpanByte keys and values using UTF8 and a buffer (stack or leased, depending on size); write says it works, but read says "not found" - would love any clues if anyone has a moment

specifically, see the BasicUsage test; debug output shows (currently, broken state)

  • Read: NotFound for the genuine "not there" fetch
  • Upsert: NotFound, CreatedRecord for the initial write
  • Read: NotFound for the failed second fetch when I expect a value

edit: it looks like the problem is that value is the right-size but all-zero in ConcurrentReader, where I'm checking the expiry

@Tornhoof
Copy link
Contributor

Tornhoof commented Mar 25, 2024

@mgravell Debugging through your code, in CacheFunctions the payload in the expiry check is empty.
I don't know anything about CacheFunctions sorry, I used Empty everywhere.

Ah, I missed your edit.

@mgravell
Copy link
Author

mgravell commented Mar 25, 2024

I fixed it by switching to Memory<byte> instead of SpanByte; that's enough for me to make some progress; however... it wasn't the kind of progress I'd hoped for; we now get multiple allocations on the "get" path (I expect 10k here, for the final byte[] - the rest of the 36KB is... not ideal):

| Method          | KeyLength | PayloadLength | Mean        | Error       | StdDev      | Median      | Gen0   | Gen1   | Allocated |
|---------------- |---------- |-------------- |------------:|------------:|------------:|------------:|-------:|-------:|----------:|
| FASTER_Get      | 128       | 10240         |  2,786.6 ns |    55.11 ns |   130.98 ns |  2,763.4 ns | 2.1858 | 0.0992 |   36625 B |
| FASTER_Set      | 128       | 10240         |  2,132.4 ns |    36.65 ns |    34.28 ns |  2,136.4 ns | 0.5951 | 0.0153 |   10000 B |
| FASTER_GetAsync | 128       | 10240         |  2,860.2 ns |    57.15 ns |    98.58 ns |  2,877.0 ns | 2.1896 | 0.1984 |   36721 B |
| FASTER_SetAsync | 128       | 10240         |  2,240.5 ns |    43.76 ns |   119.06 ns |  2,187.1 ns | 0.5951 | 0.0191 |   10000 B |

I think to get this truly blazing with low allocs is going to take some input from the FASTER team - which is really the key goal here; see if we can get that momentum :)


update: got it down a bit using session reuse, but I have no clue what I'm doing with sessions, so this may not be valid; current state:

| Method          | KeyLength | PayloadLength | Mean        | Error       | StdDev      | Gen0   | Gen1   | Allocated |
|---------------- |---------- |-------------- |------------:|------------:|------------:|-------:|-------:|----------:|
| FASTER_Get      | 128       | 10240         |    974.2 ns |    16.52 ns |    15.45 ns | 1.5955 | 0.0496 |   26696 B |
| FASTER_Set      | 128       | 10240         |    569.3 ns |     5.22 ns |     4.63 ns | 0.0038 |      - |      72 B |
| FASTER_GetAsync | 128       | 10240         |  1,051.4 ns |    19.92 ns |    18.64 ns | 1.5984 | 0.0992 |   26792 B |
| FASTER_SetAsync | 128       | 10240         |    780.6 ns |     1.57 ns |     1.31 ns | 0.0038 |      - |      72 B |
| SQLite_Get      | 128       | 10240         | 12,137.0 ns |   200.33 ns |   187.39 ns | 0.6866 |      - |   11616 B |
| SQLite_Set      | 128       | 10240         | 76,377.2 ns | 1,482.33 ns | 1,522.25 ns |      - |      - |    1360 B |
| SQLite_GetAsync | 128       | 10240         | 22,139.6 ns |   297.40 ns |   278.19 ns | 0.7019 |      - |   11912 B |
| SQLite_SetAsync | 128       | 10240         | 75,067.2 ns | 1,035.19 ns |   864.43 ns |      - |      - |    1280 B |
| Rocks_Get       | 128       | 10240         |    123.8 ns |     0.79 ns |     0.70 ns | 0.0091 |      - |     152 B |
| Rocks_Set       | 128       | 10240         | 12,619.8 ns |    68.27 ns |    53.30 ns |      - |      - |     224 B |
| Rocks_GetAsync  | 128       | 10240         |    763.4 ns |    18.70 ns |    55.15 ns | 0.0257 |      - |     440 B |
| Rocks_SetAsync  | 128       | 10240         | 21,776.1 ns |   228.28 ns |   213.54 ns | 0.0305 |      - |     576 B |

(note: Rocks is cheating on the "get")

@Tornhoof
Copy link
Contributor

but I have no clue what I'm doing with sessions

You didn't push your code, atleast I couldn't find it, but if it looks similar to this, then you should be good, I missed that part in the details section of my earlier post.

private readonly ConcurrentBag<ClientSession> _sessions = new();
private ClientSession GetSession()
        {
            if (_sessions.TryTake(out var result))
            {
                return result;
            }

            var session = _store.For(_simpleFunctions).NewSession<SpanByteFunctions<Empty>>();
            return session;
        }

        private void ReleaseSession(ClientSession session)
        {
            _sessions.Add(session);
        }

I dispose a session after Exceptions, because I simply don't know the state of them.

@mgravell
Copy link
Author

mgravell commented Mar 25, 2024 via email

@mgravell
Copy link
Author

mgravell commented Mar 26, 2024

Here we go; definitely has legs!

  • I switched to SpanByte (after figuring out what that means internally) - that gets rid of the excess allocations, leaving just the final byte[]
  • I also implemented a mockup of the auxiliary API I want to add in .NET 9, which gets rid of that buffer allocation
  • and in my testing, I discovered that the Rocks cache is not actually working correctly - we can ignore the cheap "get" values, since they don't return the correct data ("you had one job")

Current benchmarks (edit immediately below same comment - table is now out of date):

| Method                | KeyLength | PayloadLength | Mean        | Error       | StdDev      | Gen0   | Gen1   | Allocated |
|---------------------- |---------- |-------------- |------------:|------------:|------------:|-------:|-------:|----------:|
| FASTER_Get            | 128       | 10240         |    552.3 ns |    21.16 ns |    18.76 ns | 0.6123 |      - |   10264 B |
| FASTER_Set            | 128       | 10240         |    826.7 ns |    15.10 ns |    10.91 ns | 0.6123 |      - |   10264 B |
| FASTER_GetBuffer      | 128       | 10240         |    315.5 ns |     5.65 ns |     4.09 ns |      - |      - |         - |
| FASTER_SetBuffer      | 128       | 10240         |    480.6 ns |     7.08 ns |     4.68 ns |      - |      - |         - |
| FASTER_GetAsync       | 128       | 10240         |    625.5 ns |    16.34 ns |    14.48 ns | 0.6189 |      - |   10360 B |
| FASTER_GetAsyncBuffer | 128       | 10240         |    367.6 ns |     6.24 ns |     2.77 ns | 0.0014 |      - |      24 B |
| FASTER_SetAsync       | 128       | 10240         |    903.8 ns |    15.46 ns |    11.18 ns | 0.6123 |      - |   10264 B |
| FASTER_SetAsyncBuffer | 128       | 10240         |    585.0 ns |    10.59 ns |     3.78 ns |      - |      - |         - |
| SQLite_Get            | 128       | 10240         | 12,717.9 ns |   205.08 ns |    73.13 ns | 0.6866 |      - |   11616 B |
| SQLite_Set            | 128       | 10240         | 75,747.7 ns | 1,300.16 ns |   773.70 ns | 0.6104 |      - |   11576 B |
| SQLite_GetAsync       | 128       | 10240         | 21,364.4 ns |   403.96 ns |   179.36 ns | 0.7019 |      - |   11912 B |
| SQLite_SetAsync       | 128       | 10240         | 76,450.3 ns | 1,478.02 ns | 1,153.94 ns | 0.6104 |      - |   11496 B |
| Rocks_Get             | 128       | 10240         |    843.3 ns |    17.26 ns |    15.30 ns | 0.6218 |      - |   10416 B |
| Rocks_Set             | 128       | 10240         | 14,182.6 ns |   247.34 ns |   178.84 ns | 0.6104 |      - |   10416 B |
| Rocks_GetAsync        | 128       | 10240         |  1,876.0 ns |    88.41 ns |    82.69 ns | 0.7935 | 0.0267 |   10704 B |
| Rocks_SetAsync        | 128       | 10240         | 24,844.1 ns | 1,422.47 ns | 1,330.58 ns | 0.6409 |      - |   10768 B |

Outstanding things I could do with help from people with:

  • persistence - not sure anything is writing to disk currently; we should fix that
  • unconstrained growth; is the memory / log data self-managing?
  • expiration removal
    • in the observed case (is it enough to return false to indicate an expire? does that actually remove anything?)
    • in the unobserved case (some kind of crawl - LRU, chaos, LFU, whatever)
  • probably a bunch of things that I don't know that I don't know

But the numbers look nice, especially for the .NET 9 scenario (FASTER_GetAsyncBuffer). I can also possibly improve the "pinning" on the async path; right now I use a simple using (memory.Pin()) - however, we could do a fixed pin with a speculative "does this finish synchronously", and only take the non-stack pin if we need to transition to full async

/cc @jodydonetti @badrishc

update: now with the deferred pinning on the async get paths:

| Method                | KeyLength | PayloadLength | Mean        | Error       | StdDev      | Gen0   | Gen1   | Allocated |
|---------------------- |---------- |-------------- |------------:|------------:|------------:|-------:|-------:|----------:|
| FASTER_Get            | 128       | 10240         |    642.5 ns |    37.46 ns |    33.20 ns | 0.6123 |      - |   10264 B |
| FASTER_Set            | 128       | 10240         |  1,036.3 ns |    22.11 ns |    19.60 ns | 0.6123 |      - |   10264 B |
| FASTER_GetBuffer      | 128       | 10240         |    388.2 ns |    11.98 ns |    10.62 ns | 0.0019 |      - |      32 B |
| FASTER_SetBuffer      | 128       | 10240         |    601.8 ns |    16.66 ns |    15.59 ns |      - |      - |         - |
| FASTER_GetAsync       | 128       | 10240         |    684.4 ns |    41.90 ns |    34.99 ns | 0.6189 |      - |   10360 B |
| FASTER_GetAsyncBuffer | 128       | 10240         |    393.3 ns |     7.84 ns |     6.95 ns | 0.0033 |      - |      56 B |
| FASTER_SetAsync       | 128       | 10240         |  1,056.5 ns |    34.24 ns |    32.03 ns | 0.6123 |      - |   10264 B |
| FASTER_SetAsyncBuffer | 128       | 10240         |    663.0 ns |    16.03 ns |    14.21 ns |      - |      - |         - |
| SQLite_Get            | 128       | 10240         | 13,180.0 ns |   277.07 ns |   259.17 ns | 0.6866 |      - |   11616 B |
| SQLite_Set            | 128       | 10240         | 74,546.1 ns | 2,079.11 ns | 1,944.80 ns | 0.6104 |      - |   11576 B |
| SQLite_GetAsync       | 128       | 10240         | 27,643.9 ns |   429.91 ns |   190.88 ns | 0.7019 |      - |   11912 B |
| SQLite_SetAsync       | 128       | 10240         | 81,441.6 ns | 1,887.54 ns | 1,673.26 ns | 0.6104 |      - |   11496 B |
| Rocks_Get             | 128       | 10240         |    940.4 ns |    36.74 ns |    34.37 ns | 0.6218 |      - |   10416 B |
| Rocks_Set             | 128       | 10240         | 14,473.9 ns |   282.96 ns |   100.91 ns | 0.6104 |      - |   10416 B |
| Rocks_GetAsync        | 128       | 10240         |  1,887.8 ns |   108.86 ns |   101.82 ns | 0.7973 | 0.0267 |   10704 B |
| Rocks_SetAsync        | 128       | 10240         | 25,086.6 ns |   964.64 ns |   902.33 ns | 0.6409 |      - |   10768 B |

@jodydonetti
Copy link

This is really great, this evening I'll try to get the latest version from the repo and see if it works fine with FusionCache in the Simulator (I guess it will be flawless, like it was already before).

Will update, thanks!

@mgravell
Copy link
Author

mgravell commented Mar 26, 2024

@jodydonetti I've also thrown it onto NuGet, mostly to stop squatters, but: it is up-to-date

@Tornhoof
Copy link
Contributor

persistence - not sure anything is writing to disk currently; we should fix that

My dumb Test approach was simple, write a lot of data, look for the file on disk, then truncate the Log and wait for It to disappear

@Tornhoof
Copy link
Contributor

Tornhoof commented Mar 26, 2024

however, we could do a fixed pin with a speculative "does this finish synchronously", and only take the non-stack pin if we need to transition to full async

I ported my valuetask.iscompletedsuccessfully flow to your repo (only for getasync) and it's about 30-40ns faster than the original code, so less than 10%. That's, as far as I know the most sync it will ever be, for in-memory.
So any transition code will be probably slower.

Tornhoof/FASTERCache@136dbd1

@mgravell
Copy link
Author

mgravell commented Mar 27, 2024

@Tornhoof I got the increase to 20% with full pin defer and async transition ;p

@jodydonetti
Copy link

@jodydonetti I've also thrown it onto NuGet, mostly to stop squatters, but: it is up-to-date

Updated (v0.1.21), switched to using the new builder and tested again: everything seems to work well, even when mixing auto-recovery, soft/hard timeouts and other features!

@PaulusParssinen
Copy link

  1. the serialization layer feels "old school"; there may be a lot of room for improvement there, to improve perf and reduce allocations and memory-copies; happy to have that discussion at length, but increasingly serializers are supporting IBufferWriter<byte> and ReadOnlySequence<byte> implementations; I'd love to have a conversation about that
  2. also, stateless serializers rather than serializer factories

+1 for redesigning the the serialization layer. I was about to do that in Garnet (and Tsavorite) but I luckily noticed this issue before doing anything. Anything but streams and I'd be very happy to advocate for copying the design over to Tsavorite and Garnet😄

@mgravell
Copy link
Author

mgravell commented Mar 29, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants