perf-simple-query: change stats metric to instructions/op #18628

Open
wants to merge 2 commits into base: master

Conversation

@bhalevy
Member

bhalevy commented May 12, 2024

The "throughput" metric is noisy, as it affected by the environment,
like cpu-frequency scaling due to thermal considerations, other
processes that run on the system and may affect memory caching,
for example, and so on. Therefore it is invonvinient for comparing
run-to-run results both on the same test machine, and in particular,
on different test machines.

In contrast, instructions/op is a more reliable metric
for detecting regressions or improvements in different versions
of Scylla.

  • No need to backport as this just improves a microbenchmark tool

Currently, perf-simple-query uses the throughput metric
to sort the perf results to determine the median result
and other derived statistics, like the median absolute deviation
and the min and max values.

However, since throughput depends on the absolute performance
of the test machine (cpu frequency, memory speed, etc.), it is
inconvenient for comparing results measured on different machines.
It also tends to be noisy, and other stats, like instructions/op,
may be more useful for comparison.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The "throughput" metric is noisy, as it affected by the environment,
like cpu-frequency scaling due to thermal considerations, other
processes that run on the system and may affect memory caching,
for example, and so on. Therefore it is invonvinient for comparing
run-to-run results both on the same test machine, and in particular,
on different test machines.

In contrast, instructions/op are a more reliable metric
for detecting regressions or improvements in different versions
of scylla.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
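
For illustration only, a minimal sketch of the kind of metric-keyed median/MAD/min/max reduction described in the commit messages above; the type and function names here are hypothetical, not the actual perf-simple-query code:

// Hypothetical sketch: summarize per-second samples by a selectable metric.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

struct perf_sample {
    double tps;
    double instructions_per_op;
};

struct summary {
    double median;
    double mad;   // median absolute deviation
    double min;
    double max;
};

static double median_of(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    auto n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

static summary summarize(const std::vector<perf_sample>& samples,
                         const std::function<double(const perf_sample&)>& metric) {
    std::vector<double> values;
    values.reserve(samples.size());
    for (const auto& s : samples) {
        values.push_back(metric(s));
    }
    double med = median_of(values);
    std::vector<double> deviations;
    deviations.reserve(values.size());
    for (double v : values) {
        deviations.push_back(std::fabs(v - med));
    }
    return {med, median_of(deviations),
            *std::min_element(values.begin(), values.end()),
            *std::max_element(values.begin(), values.end())};
}

int main() {
    // A few samples shaped like the per-second lines printed by the tool.
    std::vector<perf_sample> samples = {
        {95331.79, 42311}, {97176.74, 42263}, {95756.14, 42290},
        {96252.29, 42294}, {96698.87, 42295},
    };
    // Summarizing by instructions/op instead of throughput.
    auto s = summarize(samples, [](const perf_sample& r) { return r.instructions_per_op; });
    std::printf("median %.2f mad %.2f min %.2f max %.2f\n", s.median, s.mad, s.min, s.max);
}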
@bhalevy
Member Author

bhalevy commented May 12, 2024

Sample result:

bhalevy@[] scylla$ build/release/scylla perf-simple-query --default-log-level=error --random-seed=20240510 -c 1 --duration 10 
random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
95331.79 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42311 insns/op,        0 errors)
97176.74 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42263 insns/op,        0 errors)
95756.14 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42290 insns/op,        0 errors)
96252.29 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42294 insns/op,        0 errors)
96698.87 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42295 insns/op,        0 errors)
96416.99 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42288 insns/op,        0 errors)
96470.19 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42272 insns/op,        0 errors)
95294.13 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42290 insns/op,        0 errors)
96103.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42308 insns/op,        0 errors)
95685.55 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42325 insns/op,        0 errors)

stats metric: instructions
median 96252.29 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42294 insns/op,        0 errors)
median absolute deviation: 13.65
maximum: 42324.56
minimum: 42263.25

@avikivity
Member

It's more stable, but measures fewer things. For example, better or poorer icache placement will not be measured.

How about cycles/op? It's less frequency dependent than tps, but includes low-level details like cache utilization.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 47 min
  • Builder: i-0e8b99f18eaddafdb (m5ad.12xlarge)

@bhalevy
Member Author

bhalevy commented May 12, 2024

It's more stable, but measures fewer things. For example, better or poorer icache placement will not be measured.

How about cycles/op? It's less frequency dependent than tps, but includes low-level details like cache utilization.

I added cycles/op locally (see attached patch),
and although it is less affected by cpu frequency, it is still heavily affected by noisy neighbors.
For example, here's a quiet run with the ondemand cpu-frequency governor:

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
205047.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42386 insns/op,   22065 cycles/op,        0 errors)
211645.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42386 insns/op,   21877 cycles/op,        0 errors)
214177.74 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42349 insns/op,   21659 cycles/op,        0 errors)
213265.64 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42359 insns/op,   21752 cycles/op,        0 errors)
213157.34 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42370 insns/op,   21751 cycles/op,        0 errors)

stats metric: cycles
median 213265.64 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42359 insns/op,   21752 cycles/op,        0 errors)
median absolute deviation: 93.19
maximum: 22065.09
minimum: 21658.97

Now, with cpu frequency set to 2.2GHz:

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
95239.37 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42329 insns/op,   22427 cycles/op,        0 errors)
98041.75 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42299 insns/op,   22013 cycles/op,        0 errors)
98559.92 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42300 insns/op,   21898 cycles/op,        0 errors)
98501.02 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42316 insns/op,   21913 cycles/op,        0 errors)
96703.86 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42305 insns/op,   22313 cycles/op,        0 errors)

stats metric: cycles
median 98041.75 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42299 insns/op,   22013 cycles/op,        0 errors)
median absolute deviation: 115.47
maximum: 22426.86
minimum: 21897.61

Now, with a noisy neighbor (a scylla tree build on all cpus):

random-seed=20240510
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
43701.19 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42138 insns/op,   47439 cycles/op,        0 errors)
40661.29 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42113 insns/op,   52565 cycles/op,        0 errors)
39845.14 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42137 insns/op,   53558 cycles/op,        0 errors)
41020.09 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42118 insns/op,   52096 cycles/op,        0 errors)
41908.10 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42139 insns/op,   51007 cycles/op,        0 errors)

stats metric: cycles
median 41020.09 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.4 tasks/op,   42118 insns/op,   52096 cycles/op,        0 errors)
median absolute deviation: 1089.49
maximum: 53558.16
minimum: 47439.46

That said, I think it is fine to require the benchmark machine to be reasonably quiet during the test.

@bhalevy
Member Author

bhalevy commented May 21, 2024

@avikivity please see above.
cycles/op is still noisy.
I'm not sure it is much better than tps.

@avikivity
Member

What's stddev(tps)/avg(tps) vs stddev(cycles)/avg(cycles)?

@avikivity
Member

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

@michoecho
Contributor

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

  1. Printing cycles/op: very good.
  2. Adding the "stats-metric" option: okay. But why? Does this help you with something?
  3. Switching the default "stats-metric" to instructions: bad. Why care about the distribution of insn/op? It's always very narrow and repeatable anyway. The variability of tps/op or cyc/op is more interesting. But it doesn't matter anyway.

@bhalevy
Member Author

bhalevy commented May 21, 2024

@michoecho what's your opinion here? insn/cycle is much more stable but less realistic. We did see a case where a few hundred instructions caused a huge fluctuation in tps (#18570).

  1. Printing cycles/op: very good.
  2. Adding the "stats-metric" option: okay. But why? Does this help you with something?

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.
Since it is noisy, and in practice we focus more on insns/op than on tps,
I thought that the stats-metric should reflect that (and be allowed to be set differently for research purposes).

  3. Switching the default "stats-metric" to instructions: bad. Why care about the distribution of insn/op? It's always very narrow and repeatable anyway. The variability of tps/op or cyc/op is more interesting. But it doesn't matter anyway.

Why do you say it doesn't matter?
When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

That said, Rafael Espindola used to send a plot in the past, superimposing the before and after results in different colors (magenta and light-blue IIRC :)) so one could clearly see whether there's a clear shift in the graph as a result of the change. We can improve our tooling to produce such a graphical report given the full "before" and "after" results of perf-simple-query, each providing a number of (1s) samples.

@michoecho
Contributor

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.

But there's no "the median". There are two medians: the tps median and the insn/op median. Why do we have to choose to display either one or the other? They give you different information. It would be better to have both.

But if we, for some reason, can only have one, then the tps median is better for practical reasons. Insn/op is very stable. Even if you read the insn/op figure from the tps-based median printout, it should be very close to the actual insn/op median. But if you read the tps figure from the insn/op-based median printout, it can be arbitrarily far from the actual median of tps.
So switching the median type reduces the amount of available information.

When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

"High variance" != "not meaningful".

Reducing a benchmark to a single number is comfortable, but not necessarily the most useful.

It all depends on what you want to do with this number.

Why do you say it doesn't matter?

Because if anyone wants to use the benchmark for anything interesting, he should look at both statistics anyway.

@bhalevy
Member Author

bhalevy commented May 21, 2024

The default (tps) is both noisy in each single run, and noisy across runs (some possible reasons for that are: different cpu frequency, L[1-3] cache speed and size, thermal effects, other processes running on the machine).
Therefore, determining the median result (out of 5 by default) based on this statistic is unreliable.

But there's no "the median". There are two medians: the tps median and the insn/op median. Why do we have to choose to display either one or the other? They give you different information. It would be better to have both.

But if we, for some reason, can only have one, then the tps median is better for practical reasons. Insn/op is very stable. Even if you read the insn/op figure from the tps-based median printout, it should be very close to the actual insn/op median.

When I wanted to use perf-simple-query to detect ~1% improvement, the tps numbers were just too noisy for that and only the insns/op could show that improvement reliably.

But if you read the tps figure from the insn/op-based median printout, it can be arbitrarily far from the actual median of tps. So switching the median type reduces the amount of available information.

When we want to compare the net effect of some change on performance using perf-simple-query,
we want to compare "before" vs. "after" results.
We typically compare the "median" value before/after, not the whole distribution, so we had better determine the median in a meaningful way.

"High variance" != "not meaningful".

Reducing a benchmark to a single number is comfortable, but not necessarily the most useful.

It all depends on what you want to do with this number.

Why do you say it doesn't matter?

Because if anyone wants to use the benchmark for anything interesting, he should look at both statistics anyway.

So maybe we should indeed print the stats for all measurements, like the mean and standard deviation (rather than the median and MAD).

Then one could compare stats between runs in a better way (I expect the mean tps to be closer across runs than the median tps, for example, especially with a larger number of samples, which should reduce the mean's variance).
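
For reference, the mean and sample standard deviation could be computed along these lines; a minimal sketch with hypothetical names, not existing perf-simple-query code:

#include <cmath>
#include <utility>
#include <vector>

// Sketch: mean and sample (n-1) standard deviation of a metric's samples.
// Assumes values.size() > 1.
static std::pair<double, double> mean_and_stddev(const std::vector<double>& values) {
    double mean = 0.0;
    for (double v : values) {
        mean += v;
    }
    mean /= values.size();
    double sum_sq = 0.0;
    for (double v : values) {
        sum_sq += (v - mean) * (v - mean);
    }
    return {mean, std::sqrt(sum_sq / (values.size() - 1))};
}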

@michoecho
Contributor

When I wanted to use perf-simple-query to detect ~1% improvement, the tps numbers were just too noisy for that and only the insns/op could show that improvement reliably.

"Improvement" is contingent on some metric. We can't just look at a set of disagreeing metrics and cherry-pick the one that supports our assumption.

If you want to show an improvement in insn/op, you should benchmark insn/op. If you want to show an improvement in tps, you should benchmark tps.

If you want to prove an improvement in tps, but you can't do it with a tps benchmark because the improvement is smaller than the margin of error, and you instead "prove" an improvement in tps with an insn/op benchmark, then your proof is invalid.

Often the two stats go hand in hand (especially since Scylla is icache-hungry), but it's not a given.

That said, I think it's a terrible idea to evaluate small performance patches based on the results of perf-simple-query tps, because it is noisy. But if we are convinced that our patch is an incremental improvement, but smaller than perf-simple-query's margin of error, then we should simply override the benchmark's opinion with our own, not pick a different metric and pretend that it proves an improvement in both metrics.

@bhalevy
Member Author

bhalevy commented May 22, 2024

But if we are convinced that our patch is an incremental improvement, but smaller than perf-simple-query's margin of error, then we should simply override the benchmark's opinion with our own, not pick a different metric and pretend that it proves an improvement in both metrics.

Nobody is pretending that an improvement in instructions per op will necessarily produce an improvement in tps. It is what it is, and all we can say about tps, as you correctly mentioned above, is that the improvement is below the tps level of noise, and hence it is inconclusive.

if (app.configuration().contains("stats-metric")) {
    stats_metric = app.configuration()["stats-metric"].as<std::string>();
}
std::function<double(const perf_result& r)> get_metric;
Member

Could you do it in a way that perf_alternator also benefits? And maybe perf_commitlog?

It would be useful for me in perf_alternator at least.
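
For context, one way the snippet above could be wired up and reused across perf tools; a sketch only, and the perf_result field names below are assumptions, not the actual patch:

// Sketch: pick the metric selector from the "stats-metric" option and sort by it.
// Field names (throughput, instructions_per_op, cycles_per_op) are assumed here.
std::function<double(const perf_result& r)> get_metric;
if (stats_metric == "instructions") {
    get_metric = [](const perf_result& r) { return r.instructions_per_op; };
} else if (stats_metric == "cycles") {
    get_metric = [](const perf_result& r) { return r.cycles_per_op; };
} else {
    get_metric = [](const perf_result& r) { return r.throughput; };
}
// Sort the collected results by the selected metric before taking the median.
std::sort(results.begin(), results.end(), [&](const perf_result& a, const perf_result& b) {
    return get_metric(a) < get_metric(b);
});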

@nuivall
Member

nuivall commented May 27, 2024

"Improvement" is contingent on some metric. We can't just look at a set of disagreeing metrics and cherry-pick the one that supports our assumption.

Maybe the switch makes sense with a longer help message explaining what each measurement type can be used for (its weak and strong spots).

Btw, I'd also replace MAD with something like confidence intervals (i.e. ± 100 tps) so it's easier to reason about, but maybe that's controversial and requires assuming some distribution.
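
As a rough illustration of the confidence-interval idea; a sketch that assumes roughly normal samples and uses the z value 1.96 for a 95% interval (with only ~5 samples a Student's t critical value would be more accurate), not existing tooling:

#include <cmath>
#include <utility>
#include <vector>

// Sketch: 95% confidence interval for the mean, assuming roughly normal samples.
static std::pair<double, double> mean_ci95(const std::vector<double>& values) {
    double mean = 0.0;
    for (double v : values) {
        mean += v;
    }
    mean /= values.size();
    double sum_sq = 0.0;
    for (double v : values) {
        sum_sq += (v - mean) * (v - mean);
    }
    double sem = std::sqrt(sum_sq / (values.size() - 1)) / std::sqrt(values.size());
    double half_width = 1.96 * sem;   // e.g. reported as "mean ± half_width tps"
    return {mean - half_width, mean + half_width};
}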
