perf-simple-query: change stats metric to instructions/op #18628

Open
wants to merge 2 commits into base: master

Conversation

@bhalevy
Member

bhalevy commented May 12, 2024

The "throughput" metric is noisy, as it affected by the environment,
like cpu-frequency scaling due to thermal considerations, other
processes that run on the system and may affect memory caching,
for example, and so on. Therefore it is invonvinient for comparing
run-to-run results both on the same test machine, and in particular,
on different test machines.

In contrast, instructions/op is a more reliable metric
for detecting regressions or improvements in different versions
of Scylla.

  • No need to backport as this just improves a microbenchmark tool

Currently, perf-simple-query uses the throughput metric
to sort the perf results to determine the median result
and other derived statistics, like the median absolute deviation
and the min and max values.

However, since throughput depends on the absolute performance
of the test machine (cpu frequency, memory speed, etc.), it is
inconvenient for comparing results measured on different machines.
It also tends to be noisy, and other stats, like instructions/op,
may be more useful for comparison.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The "throughput" metric is noisy, as it affected by the environment,
like cpu-frequency scaling due to thermal considerations, other
processes that run on the system and may affect memory caching,
for example, and so on. Therefore it is invonvinient for comparing
run-to-run results both on the same test machine, and in particular,
on different test machines.

In contrast, instructions/op are a more reliable metric
for detecting regressions or improvements in different versions
of scylla.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
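
For illustration only, a minimal sketch of the kind of metric-keyed median/MAD/min/max reduction described in the commit messages above; the type and function names here are hypothetical, not the actual perf-simple-query code:

// Hypothetical sketch: summarize per-second samples by a selectable metric.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

struct perf_sample {
    double tps;
    double instructions_per_op;
};

struct summary {
    double median;
    double mad;   // median absolute deviation
    double min;
    double max;
};

static double median_of(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    auto n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

static summary summarize(const std::vector<perf_sample>& samples,
                         const std::function<double(const perf_sample&)>& metric) {
    std::vector<double> values;
    values.reserve(samples.size());
    for (const auto& s : samples) {
        values.push_back(metric(s));
    }
    double med = median_of(values);
    std::vector<double> deviations;
    deviations.reserve(values.size());
    for (double v : values) {
        deviations.push_back(std::fabs(v - med));
    }
    return {med, median_of(deviations),
            *std::min_element(values.begin(), values.end()),
            *std::max_element(values.begin(), values.end())};
}

int main() {
    // A few samples shaped like the per-second lines printed by the tool.
    std::vector<perf_sample> samples = {
        {95331.79, 42311}, {97176.74, 42263}, {95756.14, 42290},
        {96252.29, 42294}, {96698.87, 42295},
    };
    // Summarizing by instructions/op instead of throughput.
    auto s = summarize(samples, [](const perf_sample& r) { return r.instructions_per_op; });
    std::printf("median %.2f mad %.2f min %.2f max %.2f\n", s.median, s.mad, s.min, s.max);
}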
@bhalevy
Member Author

bhalevy commented May 12, 2024

Sample result:

bhalevy@[] scylla$ build/release/scylla perf-simple-query --default-log-level=error --random-seed=20240510 -c 1 --duration 10 
random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
95331.79 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42311 insns/op,        0 errors)
97176.74 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42263 insns/op,        0 errors)
95756.14 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42290 insns/op,        0 errors)
96252.29 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42294 insns/op,        0 errors)
96698.87 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42295 insns/op,        0 errors)
96416.99 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42288 insns/op,        0 errors)
96470.19 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42272 insns/op,        0 errors)
95294.13 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42290 insns/op,        0 errors)
96103.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42308 insns/op,        0 errors)
95685.55 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42325 insns/op,        0 errors)

stats metric: instructions
median 96252.29 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42294 insns/op,        0 errors)
median absolute deviation: 13.65
maximum: 42324.56
minimum: 42263.25

@avikivity
Member

It's more stable, but measures fewer things. For example, better or poorer icache placement will not be measured.

How about cycles/op? It's less frequency dependent than tps, but includes low-level details like cache utilization.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 47 min
  • Builder: i-0e8b99f18eaddafdb (m5ad.12xlarge)

@bhalevy
Member Author

bhalevy commented May 12, 2024

It's more stable, but measures fewer things. For example, better or poorer icache placement will not be measured.

How about cycles/op? It's less frequency dependent than tps, but includes low-level details like cache utilization.

I added cycles/op locally (see attached patch),
and although it is less affected by cpu frequency, it is still heavily affected by noisy neighbors.
For example, here's a quiet run with the ondemand cpu-frequency governor:

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
205047.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42386 insns/op,   22065 cycles/op,        0 errors)
211645.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42386 insns/op,   21877 cycles/op,        0 errors)
214177.74 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42349 insns/op,   21659 cycles/op,        0 errors)
213265.64 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42359 insns/op,   21752 cycles/op,        0 errors)
213157.34 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42370 insns/op,   21751 cycles/op,        0 errors)

stats metric: cycles
median 213265.64 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42359 insns/op,   21752 cycles/op,        0 errors)
median absolute deviation: 93.19
maximum: 22065.09
minimum: 21658.97

Now, with cpu frequency set to 2.2GHz:

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
95239.37 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42329 insns/op,   22427 cycles/op,        0 errors)
98041.75 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42299 insns/op,   22013 cycles/op,        0 errors)
98559.92 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42300 insns/op,   21898 cycles/op,        0 errors)
98501.02 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42316 insns/op,   21913 cycles/op,        0 errors)
96703.86 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42305 insns/op,   22313 cycles/op,        0 errors)

stats metric: cycles
median 98041.75 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42299 insns/op,   22013 cycles/op,        0 errors)
median absolute deviation: 115.47
maximum: 22426.86
minimum: 21897.61

Now, with a noisy neighbor (a scylla tree build on all cpus):

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
43701.19 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42138 insns/op,   47439 cycles/op,        0 errors)
40661.29 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42113 insns/op,   52565 cycles/op,        0 errors)
39845.14 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42137 insns/op,   53558 cycles/op,        0 errors)
41020.09 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42118 insns/op,   52096 cycles/op,        0 errors)
41908.10 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42139 insns/op,   51007 cycles/op,        0 errors)

stats metric: cycles
median 41020.09 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42118 insns/op,   52096 cycles/op,        0 errors)
median absolute deviation: 1089.49
maximum: 53558.16
minimum: 47439.46

That said, I think it is fine to require the benchmark machine to be reasonably quiet during the test.

@bhalevy
Member Author

bhalevy commented May 21, 2024

@avikivity please see above.
cycles/op is still noisy.
I'm not sure it is much better than tps.

@avikivity
Member

What's stddev(tps)/avg(tps) vs stddev(cycles)/avg(cycles)?

@avikivity
Member

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

@michoecho
Contributor

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

  1. Printing cycles/op: very good.
  2. Adding the "stats-metric" option: okay. But why? Does this help you with something?
  3. Switching the default "stats-metric" to instructions: bad. Why care about the distribution of insn/op? It's always very narrow and repeatable anyway. The variability of tps/op or cyc/op is more interesting. But it doesn't matter anyway.

@bhalevy
Member Author

bhalevy commented May 21, 2024

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

  1. Printing cycles/op: very good.
  2. Adding the "stats-metric" option: okay. But why? Does this help you with something?

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.
Since it is noisy, and in practice we focus more on insns/op than on tps,
I thought that the stats-metric should reflect that (and be allowed to be set differently for research purposes).

  3. Switching the default "stats-metric" to instructions: bad. Why care about the distribution of insn/op? It's always very narrow and repeatable anyway. The variability of tps/op or cyc/op is more interesting. But it doesn't matter anyway.

Why do you say it doesn't matter?
When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

That said, Rafael Espindola used to send a plot in the past, superimposing the before and after results in different colors (magenta and light-blue IIRC :)) so one could clearly see whether there's a clear shift in the graph as a result of the change. We can improve our tooling to produce such a graphical report given the full "before" and "after" results of perf-simple-query, each providing a number of (1s) samples.

@michoecho
Contributor

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.

But there's no "the median". There are two medians: the tps median and the insn/op median. Why do we have to choose to display either one or the other? They give you different information. It would be better to have both.

But if we, for some reason, can only have one, then the tps median is better for practical reasons. Insn/op is very stable. Even if you read the insn/op figure from the tps-based median printout, it should be very close to the actual insn/op median. But if you read the tps figure from the insn/op-based median printout, it can be arbitrarily far from the actual median of tps.
So switching the median type reduces the amount of available information.

When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

"High variance" != "not meaningful".

Reducing a benchmark to a single number is comfortable, but not necessarily the most useful.

It all depends on what you want to do with this number.

Why do you say it doesn't matter?

Because if anyone wants to use the benchmark for anything interesting, he should look at both statistics anyway.

@bhalevy
Member Author

bhalevy commented May 21, 2024

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.

But there's no "the median". There are two medians: the tps median and the insn/op median. Why do we have to choose to display either one or the other? They give you different information. It would be better to have both.

But if we, for some reason, can only have one, then the tps median is better for practical reasons. Insn/op is very stable. Even if you read the insn/op figure from the tps-based median printout, it should be very close to the actual insn/op median.

When I wanted to use perf-simple-query to detect ~1% improvement, the tps numbers were just too noisy for that and only the insns/op could show that improvement reliably.

But if you read the tps figure from the insn/op-based median printout, it can be arbitrarily far from the actual median of tps. So switching the median type reduces the amount of available information.

When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

"High variance" != "not meaningful".

Reducing a benchmark to a single number is comfortable, but not necessarily the most useful.

It all depends on what you want to do with this number.

Why do you say it doesn't matter?

Because if anyone wants to use the benchmark for anything interesting, he should look at both statistics anyway.

So maybe we should indeed print the stats for all measurements, like the mean and standard deviation (rather than the median and MAD).

Then one could compare stats between runs in a better way (I expect the mean tps to be closer across runs than the median tps, for example, especially with a larger number of samples, which should reduce the mean's variance).
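
For reference, the mean and sample standard deviation could be computed along these lines; a minimal sketch with hypothetical names, not existing perf-simple-query code:

#include <cmath>
#include <utility>
#include <vector>

// Sketch: mean and sample (n-1) standard deviation of a metric's samples.
// Assumes values.size() > 1.
static std::pair<double, double> mean_and_stddev(const std::vector<double>& values) {
    double mean = 0.0;
    for (double v : values) {
        mean += v;
    }
    mean /= values.size();
    double sum_sq = 0.0;
    for (double v : values) {
        sum_sq += (v - mean) * (v - mean);
    }
    return {mean, std::sqrt(sum_sq / (values.size() - 1))};
}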

@michoecho
Contributor

When I wanted to use perf-simple-query to detect ~1% improvement, the tps numbers were just too noisy for that and only the insns/op could show that improvement reliably.

"Improvement" is contingent on some metric. We can't just look at a set of disagreeing metrics and cherry-pick the one that supports our assumption.

If you want to show an improvement in insn/op, you should benchmark insn/op. If you want to show an improvement in tps, you should benchmark tps.

If you want to prove an improvement in tps, but you can't do it with a tps benchmark because the improvement is smaller than the margin of error, and you instead "prove" an improvement in tps with an insn/op benchmark, then your proof is invalid.

Often the two stats go hand in hand (especially since Scylla is icache-hungry), but it's not a given.

That said, I think it's a terrible idea to evaluate small performance patches based on the results of perf-simple-query tps, because it is noisy. But if we are convinced that our patch is an incremental improvement, but smaller than perf-simple-query's margin of error, then we should simply override the benchmark's opinion with our own, not pick a different metric and pretend that it proves an improvement in both metrics.

@bhalevy
Member Author

bhalevy commented May 22, 2024

But if we are convinced that our patch is an incremental improvement, but smaller than perf-simple-query's margin of error, then we should simply override the benchmark's opinion with our own, not pick a different metric and pretend that it proves an improvement in both metrics.

Nobody is pretending that an improvement in instructions per op will necessarily produce an improvement in tps. It is what it is, and all we can say about tps, as you correctly mentioned above, is that the improvement is below the tps level of noise, and hence it is inconclusive.

if (app.configuration().contains("stats-metric")) {
    stats_metric = app.configuration()["stats-metric"].as<std::string>();
}
std::function<double(const perf_result& r)> get_metric;
Member

Could you do it in a way that perf_alternator also benefits? And maybe perf_commitlog?

It would be useful for me in perf_alternator at least.
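
For context, one way the snippet above could be wired up and reused across perf tools; a sketch only, and the perf_result field names below are assumptions, not the actual patch:

// Sketch: pick the metric selector from the "stats-metric" option and sort by it.
// Field names (throughput, instructions_per_op, cycles_per_op) are assumed here.
std::function<double(const perf_result& r)> get_metric;
if (stats_metric == "instructions") {
    get_metric = [](const perf_result& r) { return r.instructions_per_op; };
} else if (stats_metric == "cycles") {
    get_metric = [](const perf_result& r) { return r.cycles_per_op; };
} else {
    get_metric = [](const perf_result& r) { return r.throughput; };
}
// Sort the collected results by the selected metric before taking the median.
std::sort(results.begin(), results.end(), [&](const perf_result& a, const perf_result& b) {
    return get_metric(a) < get_metric(b);
});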

@nuivall
Member

nuivall commented May 27, 2024

"Improvement" is contingent on some metric. We can't just look at a set of disagreeing metrics and cherry-pick the one that supports our assumption.

Maybe the switch makes sense with a longer help message explaining what each measurement type can be used for (its weak and strong spots).

Btw, I'd also replace MAD with something like confidence intervals (i.e. ± 100 tps) so it's easier to reason about, but maybe that's controversial and requires assuming some distribution.
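
As a rough illustration of the confidence-interval idea; a sketch that assumes roughly normal samples and uses the z value 1.96 for a 95% interval (with only ~5 samples a Student's t critical value would be more accurate), not existing tooling:

#include <cmath>
#include <utility>
#include <vector>

// Sketch: 95% confidence interval for the mean, assuming roughly normal samples.
static std::pair<double, double> mean_ci95(const std::vector<double>& values) {
    double mean = 0.0;
    for (double v : values) {
        mean += v;
    }
    mean /= values.size();
    double sum_sq = 0.0;
    for (double v : values) {
        sum_sq += (v - mean) * (v - mean);
    }
    double sem = std::sqrt(sum_sq / (values.size() - 1)) / std::sqrt(values.size());
    double half_width = 1.96 * sem;   // e.g. reported as "mean ± half_width tps"
    return {mean - half_width, mean + half_width};
}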
