Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cirrus: use faster VM's for integration tests #22698

Merged
merged 1 commit into from
May 21, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented May 14, 2024

Use 4 core VM vompred to the standard 2 cores, integration tests scale
almost linear with extra cores, as such doubling the cores makes the
tests almost twice as fast. This brings the test time down to 15-17 min
in CI.

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels May 14, 2024
Copy link
Contributor

openshift-ci bot commented May 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2024
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

Copy link

Cockpit tests failed for commit 79bd109. @martinpitt, @jelly, @mvollmer please check.

Copy link

Cockpit tests failed for commit f5f1f74. @martinpitt, @jelly, @mvollmer please check.

@Luap99
Copy link
Member Author

Luap99 commented May 14, 2024

Timings with 4 core 4 GB

type distro user DB local remote container
int rawhide root 16:14 15:03
int rawhide rootless 16:15
int fedora-40 root 15:04 16:12 15:21
int fedora-40 rootless 15:22
int fedora-39 root boltdb 15:57 16:29 15:01
int fedora-39 rootless boltdb 15:26
int debian-13 root 16:05 17:07
int debian-13 rootless 15:02
sys rawhide root 39:08 25:51
sys rawhide rootless 39:00
sys fedora-40-aarch64 root 32:06 22:14
sys fedora-40 root 37:07 25:38
sys fedora-40 rootless 41:12 25:08
sys fedora-39 root boltdb 40:46 27:28
sys fedora-39 rootless boltdb 39:48
sys debian-13 root 36:28 30:08
sys debian-13 rootless 39:33
bud fedora-40 root 31:01 30:29
mach fedora-40-aarch64 rootless 19:17
mach fedora-40 rootless 15:03

int tests make great use of the extra cpu cores here 15-17 min vs 27-30 min on main right now

@Luap99
Copy link
Member Author

Luap99 commented May 14, 2024

However it doesn't seem to help at all for system tests, which also makes sense as they run in sequence not in parallel so extra cores do not help

Use 4 core VM vompred to the standard 2 cores, integration tests scale
almost linear with extra cores, as such doubling the cores makes the
tests almost twice as fast. This brings the test time down to 15-17 min
in CI.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 changed the title WIP: gce use faster VM's cirrus: use faster VM's for integration tests May 16, 2024
@openshift-ci openshift-ci bot added release-note-none and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels May 16, 2024
@Luap99
Copy link
Member Author

Luap99 commented May 16, 2024

I made the change to only effect the integration tasks as other tests does not seem to benifit from more cores and it would be a waste of money to use more cores there.

I leave it as draft until I get approval from @cevich.

@edsantiago
Copy link
Collaborator

The approach I've been working on (side branch) for speeding up system tests is:

  1. invoke bats with --jobs 4 --no-parallelize-across-files. This enables parallelization within individual .bats files. However:
  2. also need to, for start, export BATS_NO_PARALLELIZE_WITHIN_FILE=true within setup() for each individual file, then
  3. go file by file, checking and fixing hardcoded image names ("test", "ctr", etc), and removing the NO_PARALLELIZE
  4. in some cases like 001 which is almost entirely parallelizable except for one rmi test, just move the unparallelizable tests to a separate file
  5. probably have to fiddle with basic_setup() and basic_teardown() to be less strict about stray containers/images
  6. and then figure out what other problems come up, and if this is even worth it

I'm focusing on the local-registry thing today and probably next week, because I see that as higher ROI.

@rhatdan
Copy link
Member

rhatdan commented May 16, 2024

Does this cost us seriously more money? Or even Less? IE if E2E tests complete in X time for Y Dollars/time, is this less or greater then before and how much difference? If it ain't much we should make the change, or at least get Red Hat an estimate of additional cost.

Saving time on tests helps developers a great deal especially when we get near releases.

@Luap99
Copy link
Member Author

Luap99 commented May 16, 2024

The approach I've been working on (side branch) for speeding up system tests is:

1. invoke bats with `--jobs 4 --no-parallelize-across-files`. This enables parallelization within individual `.bats` files. However:

2. also need to, for start, `export BATS_NO_PARALLELIZE_WITHIN_FILE=true` within `setup()` for each individual file, then

3. go file by file, checking and fixing hardcoded image names ("test", "ctr", etc), and removing the `NO_PARALLELIZE`

4. in some cases like `001` which is almost entirely parallelizable except for one `rmi` test, just move the unparallelizable tests to a separate file

5. probably have to fiddle with `basic_setup()` and `basic_teardown()` to be less strict about stray containers/images

6. and then figure out what other problems come up, and if this is even worth it

I'm focusing on the local-registry thing today and probably next week, because I see that as higher ROI.

Yes, agreed.

Note right now I focus only on e2e per the current Jira card (https://issues.redhat.com/browse/RUN-2107)
The system tests are the next step for next week and is assigned to me but I am very happy to collaborate with you on that one (https://issues.redhat.com/browse/RUN-2108)

@Luap99
Copy link
Member Author

Luap99 commented May 16, 2024

Does this cost us seriously more money? Or even Less? IE if E2E tests complete in X time for Y Dollars/time, is this less or greater then before and how much difference? If it ain't much we should make the change, or at least get Red Hat an estimate of additional cost.

Saving time on tests helps developers a great deal especially when we get near releases.

I don't know how the pricing structure works, but assuming double the cores double the price we now only need half the time so it should be not significantly more expensive IMO. But that only goes for integration tests which this is about.

The pricing comment is about tests that do not scale at all see timings comment above. There is absolutely no point in running system tests on higher core count VM's right now as we would pay more for no time gain.

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@edsantiago
Copy link
Collaborator

Latest change (47f01e8) LGTM

@edsantiago
Copy link
Collaborator

type distro user DB local remote container
int rawhide root 15:01 15:12
int rawhide rootless 15:25
int fedora-40 root 16:24 15:08 15:35
int fedora-40 rootless 14:43
int fedora-39 root boltdb 16:31 16:02 15:42
int fedora-39 rootless boltdb 16:25
int debian-13 root 16:36 16:04
int debian-13 rootless 13:43

int test timings are phenomenal.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024
@rhatdan
Copy link
Member

rhatdan commented May 16, 2024

@Luap99 Can you remove Draft so this will merge.

@cevich
Copy link
Member

cevich commented May 21, 2024

@Luap99 this LGTM.

I don't know how the pricing structure works, but assuming double the cores double the price

The time developers spend waiting for tests to complete is FAR more expensive than the VMs. Granted this is also hard to measure, since developers aren't likely just sitting on their hands waiting. In either case, there are cost-increases built-in to cloud-spend budgeting. It's incredibly rare that testing is reduced and/or cloud-costs go down magically. That's all to say, I'm not concerned at all about the cost increase by this PR.

@Luap99 Luap99 marked this pull request as ready for review May 21, 2024 14:10
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit d85d563 into containers:main May 21, 2024
89 of 91 checks passed
@Luap99 Luap99 deleted the CI-VM branch May 21, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants