A strategy for performance testing #368

Open
madolson opened this issue Apr 24, 2024 · 9 comments
Labels
major-decision-pending Needs decision by core team

Comments

@madolson
Member

madolson commented Apr 24, 2024

Opening this issue, as we no longer have a benchmarking framework. Performance is an essential part of Valkey, and we need to make it easier to evaluate whether something is degrading (or improving) performance. The previous framework was not open source, as it was maintained by filipe from Redis. The specifications are still available at https://github.com/redis/redis-benchmarks-specification, but they were never really reviewed, so I'm not convinced we want to reference them; I think we should start fresh.

My vision is that we implement performance tests that run on test runners on dedicated hardware (ideally bare metal, but that might get expensive), triggered when a specific tag gets added to a PR and also on a daily schedule. The daily runs can be used to generate historical graphs.

I would like to see at least the following sets of tests:

  1. Single instance running simple core tests for each data structure (SET, LPUSH/POP, SADD, HSET, XADD, ZADD).
  2. A cluster mode instance running basic string operations.

We should run each of them for at least a minute or so, ideally in parallel.
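For illustration, a minimal sketch of what such a runner could look like, assuming valkey-benchmark (the redis-benchmark fork that ships with Valkey) is on PATH and a server is already listening on localhost:6379. The request count, connection count, and value size below are placeholders, not a proposal:

```python
#!/usr/bin/env python3
"""Rough sketch of a fixed single-instance benchmark suite (illustrative only)."""
import subprocess

# Core single-instance tests from the proposal, one per data structure.
SINGLE_INSTANCE_TESTS = ["set", "lpush", "lpop", "sadd", "hset", "xadd", "zadd"]

def run_test(test: str, host: str = "127.0.0.1", port: int = 6379) -> str:
    """Run one benchmark test and return its CSV summary."""
    cmd = [
        "valkey-benchmark",
        "-h", host, "-p", str(port),
        "-t", test,
        "-n", "1000000",  # request count; a strictly time-based (~1 min) run
                          # would need a tool with a test-duration option
        "-c", "50",       # client connections (placeholder)
        "-d", "100",      # value size in bytes (placeholder)
        "--csv",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    for test in SINGLE_INSTANCE_TESTS:
        print(run_test(test))
```

The cluster-mode string tests would be a second, similar suite pointed at a cluster endpoint; whether that ends up using valkey-benchmark, memtier-benchmark, or something else is exactly what the working group should decide.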

Next steps

  1. Get consensus about the priority and the strategy.
  2. Follow up with a working group to determine the specific tests we would like to automate.
  3. Get the hardware and set up the infrastructure. (@madolson and @stockholmux are working on AWS hardware acquisition, but would appreciate others as well.)
  4. Profit!

Future work

I would also like to extend this to automatically generate perf profiles and flamegraphs for the above runs, and have them always available on the website. That gives folks a way to see where time is being spent and to investigate possible optimizations.
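As a rough illustration of that idea (not a commitment to specific tooling), a nightly job could capture a profile while a benchmark is running, assuming Linux perf and the FlameGraph scripts (https://github.com/brendangregg/FlameGraph) are available on the runner; the install path, sampling rate, and duration below are placeholders:

```python
import subprocess

def capture_flamegraph(server_pid: int, seconds: int = 60,
                       flamegraph_dir: str = "/opt/FlameGraph") -> None:
    """Sample the running server with perf and render a flamegraph SVG."""
    # Record call stacks of the server process for the benchmark duration.
    subprocess.run(
        ["perf", "record", "-F", "99", "-g", "-p", str(server_pid),
         "--", "sleep", str(seconds)],
        check=True,
    )
    # perf script -> folded stacks -> SVG, using Brendan Gregg's scripts.
    script = subprocess.run(["perf", "script"], capture_output=True, check=True)
    folded = subprocess.run(
        ["perl", f"{flamegraph_dir}/stackcollapse-perf.pl"],
        input=script.stdout, capture_output=True, check=True,
    )
    with open("valkey-flamegraph.svg", "wb") as out:
        subprocess.run(["perl", f"{flamegraph_dir}/flamegraph.pl"],
                       input=folded.stdout, stdout=out, check=True)
```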

@madolson madolson added the major-decision-pending Needs decision by core team label Apr 24, 2024
@madolson
Member Author

@valkey-io/core-team I would like thoughts on the above proposal, and, implicitly, would appreciate a vote.

@stockholmux
Contributor

re: hardware

I guess one precursor question: What hardware/architectures is Valkey planning on targeting?

@PingXie
Member

PingXie commented Apr 24, 2024

I wasn't aware that there were performance benchmarking tools, but I love this idea, so I'm adding my vote explicitly here.

@madolson
Member Author

I guess one precursor question: What hardware/architectures is Valkey planning on targeting?

Ideally one pair of ARM hosts and one pair of x86 hosts. Something like an m7i and an m7g is probably "broadly sufficient". If GCP would like to donate some hardware, we could run it on their infra as well. :)

@zuiderkwast
Contributor

zuiderkwast commented Apr 25, 2024

A few fixed jobs are good to have, but what I've felt the need for when doing certain optimizations is specific runs that indicate a performance improvement for specific scenarios/workloads. For example, I had a PR to avoid looking up the expire dict for keys that don't have a TTL. This is only slow if there are many keys in the expire dict and also many accesses to keys that don't have a TTL. I had convincing (to myself) results on my laptop by running several times with similar results, but that isn't something the automated Redis benchmark could have caught.
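To make the shape of such a scenario concrete, here is a throwaway sketch (not an existing tool) that populates a keyspace where only half of the keys have a TTL and then measures GET throughput on the non-TTL keys. It assumes a local server and a Python client such as redis-py, and uses a single unpipelined loop for the measurement, so it is only an illustration of the workload shape, not a rigorous benchmark:

```python
import time
import redis  # any RESP client works; redis-py is used here as an example

r = redis.Redis(host="127.0.0.1", port=6379)

NUM_TTL_KEYS = 1_000_000    # keys that land in the expire dict
NUM_PLAIN_KEYS = 1_000_000  # keys without a TTL
ACCESSES = 200_000

# Populate: half the keyspace has a TTL, half does not.
pipe = r.pipeline(transaction=False)
for i in range(NUM_TTL_KEYS):
    pipe.set(f"ttl:{i}", "x", ex=3600)
    if i % 10_000 == 0:
        pipe.execute()
for i in range(NUM_PLAIN_KEYS):
    pipe.set(f"plain:{i}", "x")
    if i % 10_000 == 0:
        pipe.execute()
pipe.execute()

# Measure GETs against keys that have no TTL; this is the code path the
# expire-dict-lookup optimization targets.
start = time.perf_counter()
for i in range(ACCESSES):
    r.get(f"plain:{i % NUM_PLAIN_KEYS}")
elapsed = time.perf_counter() - start
print(f"{ACCESSES / elapsed:.0f} GET/s on keys without a TTL")
```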

When we test a few fixed workloads, we will always miss other workloads and scenarios.

@hwware
Member

hwware commented Apr 25, 2024

I totally agree with this. Before designing the tests, I would like to raise several concerns (a rough parameter-matrix sketch follows the list):

  1. How to decide the number of client connections and threads, as in memtier-benchmark.
  2. How to decide the data size (i.e. the workload) for each kind of data type, such as values of 10 bytes, 10 KB, or 10 MB.
  3. If maxmemory is set, whether we should consider all key-eviction policies.
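One way to keep these choices explicit rather than baked into the scripts is to define a parameter matrix and run every combination. A hedged sketch follows; the values are placeholders and the memtier_benchmark flags are only examples of the kind of knobs involved (the eviction policy is server-side configuration and is noted but not applied here):

```python
import itertools
import subprocess

# Candidate dimensions raised above; the actual values are to be agreed on.
CLIENTS = [50, 200]                                # total client connections
THREADS = [4]                                      # benchmark worker threads
DATA_SIZES = [10, 10_240, 10_485_760]              # value sizes: 10 B, 10 KB, 10 MB
EVICTION_POLICIES = ["noeviction", "allkeys-lru"]  # server-side maxmemory-policy

def run_one(clients: int, threads: int, data_size: int) -> None:
    """One memtier_benchmark run against a local server (flags illustrative)."""
    subprocess.run(
        ["memtier_benchmark",
         "-s", "127.0.0.1", "-p", "6379",
         "--threads", str(threads),
         "--clients", str(clients // threads),  # memtier takes clients per thread
         "--data-size", str(data_size),
         "--test-time", "60"],
        check=True,
    )

for policy in EVICTION_POLICIES:
    # The eviction policy would be applied to the server before the run,
    # e.g. via `CONFIG SET maxmemory-policy <policy>` (not shown here).
    for clients, threads, data_size in itertools.product(CLIENTS, THREADS, DATA_SIZES):
        run_one(clients, threads, data_size)
```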

@artikell
Contributor

I have an idea about performance.

We can refer to the process used by the TPC (Transaction Processing Performance Council) and design server configurations suitable for the various workloads of a NoSQL database.

For example (a system similar to Quora):

  • Real-time application data caching: like counts, user information cache
  • Real-time session stores: user sessions
  • Real-time leaderboards: Quora article ranking

The data would be generated proportionally and discretely to cover scenarios of different sizes. At the same time, workloads that exercise key policies (such as eviction) could also be added.

This work would only involve defining workloads that are in line with actual production usage (covering both the client and the server).
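Purely to illustrate what a proportional workload definition might look like, here is a tiny sketch; the proportions and commands are invented for the Quora-like example above, not measured from any real system:

```python
import random

# Hypothetical traffic mix for the Quora-like example; weights are illustrative.
WORKLOAD_MIX = [
    # (weight, (command, key pattern))
    (0.50, ("GET", "cache:user:{id}")),           # application data caching
    (0.20, ("SET", "session:{id}")),              # session store writes
    (0.20, ("ZINCRBY", "leaderboard:articles")),  # leaderboard updates
    (0.10, ("ZRANGE", "leaderboard:articles")),   # leaderboard reads
]

def pick_operation() -> tuple:
    """Draw one operation according to the mix weights."""
    return random.choices(
        [op for _, op in WORKLOAD_MIX],
        weights=[w for w, _ in WORKLOAD_MIX],
    )[0]
```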

@zuiderkwast
Contributor

@artikell Do you mean we should have an advanced "traffic model" where we can define, with probabilities, how many of each kind of command to send and the size of the data?

I have heard about such benchmarks (for some commercial products) where statistics are collected from users and used to run benchmark tests with the user's traffic model. It can be very powerful. Maybe the first thing we need is a way to collect these statistics from a running node. (It should contain only statistics, no actual key names or value content.)
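As a sketch of that first step: aggregate per-command counters can already be read from a running node with INFO COMMANDSTATS, which contains no key names or values. The snippet below (using redis-py as an example client) turns those counters into a rough per-command traffic mix:

```python
import redis  # redis-py used as an example client; Valkey speaks the same protocol

def command_mix(host: str = "127.0.0.1", port: int = 6379) -> dict:
    """Derive a per-command traffic share from INFO COMMANDSTATS counters."""
    r = redis.Redis(host=host, port=port)
    stats = r.info("commandstats")  # e.g. {'cmdstat_get': {'calls': ..., ...}, ...}
    calls = {k.removeprefix("cmdstat_"): v["calls"] for k, v in stats.items()}
    total = sum(calls.values()) or 1
    return {cmd: count / total for cmd, count in calls.items()}

if __name__ == "__main__":
    for cmd, share in sorted(command_mix().items(), key=lambda kv: -kv[1]):
        print(f"{cmd:20s} {share:6.2%}")
```

This only gives command frequencies; key and value size distributions would need additional, equally anonymous counters.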

@artikell
Contributor

@zuiderkwast Yes, a traffic model.

However, it is difficult to achieve real-time access to user data (as it involves privacy and company assets), and there may also be differences among companies. So I think this model can be updated continuously, and the operational standards can be implemented first.

We need to control the scope of the discussion; we can continue it in https://github.com/orgs/valkey-io/discussions/398.

The current issue requires a performance benchmark standard. We cannot expect this method to detect all performance issues. It can cover:

  • Command performance testing: For example (SET, LPUSH/POP, SADD, HSET, XADD, ZADD)
  • Performance testing with fixed parameters, such as data size and number of clients

It cannot perform dynamic validation, such as expiration and eviction strategies; that needs to be designed separately.

A small idea: at least have a fixed performance report first.
