Horizontal Scaling using Distributed Cache

To understand the need for horizontal scaling of the cache, it helps to first look at how GPTCache works by default with an in-memory cache. Below is a high-level breakdown of the steps involved; a more detailed flow is shown in the diagram below.

GPTCache In-Memory Search

GPTCache-Local-Search.png

The diagram above depicts how, for a given query, the search operation determines whether a cache entry exists. This happens in the following steps (a code sketch of this setup follows the list):

  1. Accept query from user.
  2. Embedding Conversion: The user query is converted into embeddings.
  3. Vector DB Search: For a given input embedding, the top-k most similar embeddings are searched in the Vector DB.
  4. Similarity Evaluation: For each of the top-k embeddings, similarity with the existing cache data is evaluated.
  5. Cache Data Search: After similarity evaluation, the cache entry for the vector embedding with the highest similarity score is looked up using its associated primary key. Cache information is stored in two ways:
    1. In-memory Eviction Manager: The in-memory eviction manager maintains the primary keys of the available cache entries and oversees the eviction of cache data.
    2. Scalar Database: The scalar DB stores information such as answers, dependencies, and other metadata.
  6. Cache Data Retrieval: Once the cache entry is found, it is retrieved from the scalar DB and returned to the user.
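
To make the flow above concrete, here is a minimal sketch of the default single-node setup. The specific components (Onnx embeddings, a SQLite scalar store, a Faiss vector store, distance-based similarity evaluation) are illustrative choices, not the only options:

from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()  # step 2: converts user queries into embeddings

# "sqlite,faiss" = SQLite as the scalar DB (step 5.2) and Faiss as the
# vector DB (step 3); eviction defaults to the in-memory manager (step 5.1).
data_manager = manager_factory("sqlite,faiss",
                               vector_params={"dimension": onnx.dimension})

cache = Cache()
cache.init(
    embedding_func=onnx.to_embeddings,                 # step 2
    data_manager=data_manager,                         # steps 3, 5 and 6
    similarity_evaluation=SearchDistanceEvaluation(),  # step 4
)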

GPTCache Distributed Cache Search

GPTCache-Distributed-Search.png

Although the in-memory eviction manager works great for a single-node deployment, it won't work in a multi-node deployment, since cache information is not shared across nodes.
In the diagrams above, you can observe that the only difference between the two flows is the eviction manager. The Distributed Eviction Manager uses a distributed cache database, such as Redis, to maintain cache information.

Now that the cache is maintained in a distributed store, the cache information is shared and available across all nodes. This allows a multi-node GPTCache deployment to scale horizontally.

Horizontal Scaling

The diagram below depicts how a multi-node GPTCache deployment can be configured to enable horizontal scaling.

GPT-Cache-Multinode.png
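
As a rough sketch of such a deployment (the hostnames and the choice of Milvus as a network-accessible vector store are illustrative assumptions, not prescribed by GPTCache), every node runs the same initialization against shared backend services:

from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory

onnx = Onnx()

# Shared services reachable from every GPTCache node; hostnames are
# placeholders for this sketch.
data_manager = manager_factory("redis,milvus",
                               eviction_manager="redis",
                               scalar_params={"url": "redis://shared-redis:6379"},
                               vector_params={"host": "shared-milvus",
                                              "port": "19530",
                                              "dimension": onnx.dimension},
                               eviction_params={"maxmemory": "100mb",
                                                "policy": "allkeys-lru"})

# Run this same initialization on every node behind the load balancer:
# scalar data and eviction state live in the shared Redis, and vectors in
# the shared vector store, so a cache entry created via one node can be
# served by any other.
cache = Cache()
cache.init(data_manager=data_manager)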

Example

GPTCache Usage Example

The following example shows how to use GPTCache with Redis as the eviction manager.

from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory

onnx = Onnx()

# "redis,faiss": Redis as the scalar store, Faiss as the vector store;
# eviction_manager="redis" keeps eviction state in Redis so it can be
# shared across nodes.
data_manager = manager_factory("redis,faiss",
                               eviction_manager="redis",
                               scalar_params={"url": "redis://localhost:6379"},
                               vector_params={"dimension": onnx.dimension},
                               eviction_params={"maxmemory": "100mb",     # Redis memory limit
                                                "policy": "allkeys-lru",  # Redis eviction policy
                                                "ttl": 1}                 # entry time-to-live (seconds)
                               )

cache = Cache()
cache.init(data_manager=data_manager)

question = "What is github?"
answer = "Online platform for version control and code collaboration."
embedding = onnx.to_embeddings(question)

# Store the question/answer pair and its embedding in the cache.
cache.import_data([question], [answer], [embedding])

GPTCache-Server can be configured in a similar way using a YAML configuration file.
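
For illustration, such a server cache configuration might look like the sketch below. The exact schema (key names such as storage_config and the supported fields) is an assumption here and should be verified against the GPTCache-Server documentation for your version:

# a hypothetical sketch of a GPTCache-Server cache config file;
# key names are assumptions and may differ between versions
embedding: onnx
storage_config:
  manager: redis,faiss
  eviction_manager: redis
  scalar_params:
    url: redis://localhost:6379
  vector_params:
    dimension: 768   # assumed Onnx embedding dimension
  eviction_params:
    maxmemory: 100mb
    policy: allkeys-lru
    ttl: 1
evaluation: distance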