perf-simple-query: change stats metric to instructions/op #18628
base: master
Conversation
Currently, perf-simple-query uses the throughput metric to sort the perf results in order to determine the median result and other derived statistics, like the median absolute deviation (MAD), min, and max values. However, since throughput depends on the absolute performance of the test machine (CPU frequency, memory speed, etc.), it is inconvenient for comparing results measured on different machines. It also tends to be noisy, and other statistics, like instructions/op, may be more useful for comparison. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The "throughput" metric is noisy, as it affected by the environment, like cpu-frequency scaling due to thermal considerations, other processes that run on the system and may affect memory caching, for example, and so on. Therefore it is invonvinient for comparing run-to-run results both on the same test machine, and in particular, on different test machines. In contrast, instructions/op are a more reliable metric for detecting regressions or improvements in different versions of scylla. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Sample result:
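To make the statistics concrete, here is a minimal self-contained sketch (not the PR's actual code) of sorting samples by a selectable metric and deriving the median sample, MAD, min, and max from it; the perf_result fields here are illustrative assumptions:

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical per-sample record; field names are for illustration only.
struct perf_result {
    double throughput;          // ops/s in one 1s measurement window
    double instructions_per_op; // retired instructions per completed op
};

struct stats {
    perf_result median; // the full sample at the median of the chosen metric
    double mad, min, max;
};

stats compute_stats(std::vector<perf_result> samples,
                    std::function<double(const perf_result&)> metric) {
    // Sort by the chosen metric; median/min/max then fall out of the order.
    std::sort(samples.begin(), samples.end(),
              [&] (const perf_result& a, const perf_result& b) {
                  return metric(a) < metric(b);
              });
    const perf_result& med = samples[samples.size() / 2];
    // Median absolute deviation of the metric around the median sample.
    std::vector<double> dev;
    dev.reserve(samples.size());
    for (const auto& s : samples) {
        dev.push_back(std::fabs(metric(s) - metric(med)));
    }
    std::sort(dev.begin(), dev.end());
    return {med, dev[dev.size() / 2],
            metric(samples.front()), metric(samples.back())};
}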
It's more stable, but measures fewer things. For example, better or poorer icache placement will not be measured. How about cycles/op? It's less frequency-dependent than tps, but includes low-level details like cache utilization.
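For reference, retired instructions (or cycles, for a cycles/op metric) can be counted around the measured loop with the Linux perf_event_open(2) interface; the sketch below is a generic illustration of that API, not Scylla's actual counter plumbing:

#include <cstdint>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open a per-thread hardware counter, initially disabled.
// Pass PERF_COUNT_HW_INSTRUCTIONS or PERF_COUNT_HW_CPU_CYCLES as config.
static int open_counter(uint64_t config) {
    perf_event_attr attr{};
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    // args: attr, pid (0 = this thread), cpu (-1 = any), group fd, flags
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

static uint64_t read_counter(int fd) {
    uint64_t value = 0;
    read(fd, &value, sizeof(value));
    return value;
}

// Usage: enable with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0), run N ops,
// then instructions/op = double(read_counter(fd)) / N.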
🟢 CI State: SUCCESS ✅ - Build Details:
I added cycles/op locally (see attached patch).
Now, with CPU frequency set to 2.2GHz:
Now, with a noisy neighbor (a Scylla tree build on all CPUs):
That said, I think it is fine to demand that the benchmark machine be reasonably quiet during the test.
@avikivity please see above.
What's stddev(tps)/avg(tps) vs stddev(cycles)/avg(cycles)?
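The ratio asked for here is the coefficient of variation (stddev/mean), a unitless measure of relative noise that lets tps and cycles be compared directly; a minimal sketch of the computation, assuming the samples are in a plain vector:

#include <cmath>
#include <numeric>
#include <vector>

// Coefficient of variation: stddev / mean; lower means less noisy.
double coefficient_of_variation(const std::vector<double>& xs) {
    double mean = std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
    double var = 0;
    for (double x : xs) {
        var += (x - mean) * (x - mean);
    }
    var /= xs.size();
    return std::sqrt(var) / mean;
}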
@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).
The default (tps) is both noisy in each single run and noisy across runs (some possible reasons: different CPU frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Why do you say it doesn't matter? That said, Rafael Espindola used to send a plot in the past, superimposing the before and after results in different colors (magenta and light-blue, IIRC :)) so one could clearly see whether there's a clear shift in the graph as a result of the change. We can improve our tooling to produce such a graphical report given the full "before" and "after" results of perf-simple-query, each providing a number of (1s) samples.
But there's no "the median". There are two medians: the tps median and the insn/op median. Why do we have to choose to display either one or the other? They give you different information. It would be better to have both. But if we, for some reason, can only have one, then the tps median is better for practical reasons. Insn/op is very stable: even if you read the insn/op figure from the tps-based median printout, it should be very close to the actual insn/op median. But if you read the tps figure from the insn/op-based median printout, it can be arbitrarily far from the actual median of tps.
"High variance" != "not meaningful". Reducing a benchmark to a single number is comfortable, but not necessarily the most useful. It all depends on what you want to do with this number.
Because if anyone wants to use the benchmark for anything interesting, he should look at both statistics anyway.
When I wanted to use perf-simple-query to detect ~1% improvement, the tps numbers were just too noisy for that and only the insns/op could show that improvement reliably.
So maybe we should indeed print the stats for all measurements, like mean and standard deviation (rather than median and MAD). Then one could compare stats between runs in a better way (I expect the mean tps to be closer across runs than the median tps, for example, especially with a larger number of samples, which should reduce the mean's variance).
"Improvement" is contingent on some metric. We can't just look at a set of disagreeing metrics and cherry-pick the one that supports our assumption. If you want to show an improvement in insn/op, you should benchmark insn/op. If you want to show an improvement in tps, you should benchmark tps. If you want to prove an improvement in tps, but you can't do it with a tps benchmark because the improvement is smaller than the margin of error, so instead you "prove" an improvement in tps with an insn/op benchmark, then your proof is invalid. Often the two stats go hand in hand (especially since Scylla is icache-hungry), but it's not a given. That said, I think it's a terrible idea to evaluate small performance patches based on the results of perf-simple-query tps, because it is noisy. But if we are convinced that our patch is an incremental improvement, but smaller than perf-simple-query's margin of error, then we should simply override the benchmark's opinion with our own, not pick a different metric and pretend that it proves an improvement in both metrics. |
Nobody is pretending that an improvement in instructions per op will necessarily produce an improvement in tps. It is what it is, and all we can say about tps, as you correctly mentioned above, is that the improvement is below the tps level of noise, and hence it is inconclusive.
if (app.configuration().contains("stats-metric")) {
    stats_metric = app.configuration()["stats-metric"].as<std::string>();
}
std::function<double(const perf_result& r)> get_metric;
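A plausible continuation of this snippet (a sketch only; the perf_result member names are assumptions, not the PR's actual fields) would map the --stats-metric string to an extractor:

// Illustrative mapping from the option string to a metric extractor.
if (stats_metric == "instructions") {
    get_metric = [] (const perf_result& r) { return r.instructions_per_op; };
} else if (stats_metric == "cycles") {
    get_metric = [] (const perf_result& r) { return r.cycles_per_op; };
} else { // default: throughput
    get_metric = [] (const perf_result& r) { return r.throughput; };
}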
Could you do it in a way that perf_alternator also benefits? And maybe perf_commitlog?
It would be useful for me in perf_alternator at least.
BTW, maybe the switch makes sense with a longer help message explaining what each measurement type can be used for (its weak and strong spots). I'd also replace MAD with something like confidence intervals (e.g., ± 100 tps) so it's easier to reason about, but maybe that's controversial since it requires assuming some distribution.
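For the confidence-interval suggestion: assuming roughly normal samples (the distributional assumption flagged above), a 95% interval for the mean follows from the sample mean and standard error; a minimal sketch:

#include <cmath>
#include <numeric>
#include <vector>

struct interval { double lo, hi; };

// 95% confidence interval for the mean, using the z-approximation
// (mean ± 1.96 * standard error); for small n, a Student-t quantile
// would be more appropriate.
interval ci95(const std::vector<double>& xs) {
    double n = xs.size();
    double mean = std::accumulate(xs.begin(), xs.end(), 0.0) / n;
    double var = 0;
    for (double x : xs) {
        var += (x - mean) * (x - mean);
    }
    var /= (n - 1);                  // unbiased sample variance
    double se = std::sqrt(var / n);  // standard error of the mean
    return {mean - 1.96 * se, mean + 1.96 * se};
}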
The "throughput" metric is noisy, as it affected by the environment,
like cpu-frequency scaling due to thermal considerations, other
processes that run on the system and may affect memory caching,
for example, and so on. Therefore it is invonvinient for comparing
run-to-run results both on the same test machine, and in particular,
on different test machines.
In contrast, instructions/op are a more reliable metric
for detecting regressions or improvements in different versions
of scylla.