
rawhide kernel 6.10.0 >=20240514 - podman update device-read-bps = 0 #22701

Open
edsantiago opened this issue May 14, 2024 · 20 comments

@edsantiago (Collaborator)

Seen in OpenQA. No logs available; it's a weird thing that only records movies, and I don't have the desire to hand-type the whole error. It basically looks like:

FAIL
device-read-bps
expected a lot
got 0

This is just a placeholder for now. Smells like a kernel bug to me, but it could also be a bug on our end (including in tests). If I see this blowing up (as measured by openQA emails) I will explore further. Until then, nothing to do.

@Luap99 (Member) commented May 16, 2024

Can you create a simple reproducer? AFAIK the cgroup setup depends on podman -> crun -> systemd -> kernel, so maybe check whether the other components changed too.
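
A quick way to compare those components between a passing and a failing environment (a minimal sketch, assuming an rpm-based Fedora system):

# versions of everything in the podman -> crun -> systemd -> kernel chain
rpm -q podman crun systemd
uname -r
# and which cgroup manager / cgroups version podman is actually using
podman info | grep -i cgroup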

@edsantiago (Collaborator, Author)

"Can you create a simple reproducer?"

That has been my goal, as you might have predicted. However, dnf --enablerepo=updates-testing upgrade kernel does not bring in any affected (6.10) kernel, only 6.9, and I'm much too lazy to hunt down all the 6.10 packages. But okay, I'll find some time to do so.

@AdamWill

@edsantiago there is no testing repo for Rawhide, so if an update fails gating there isn't really a proper repo to get it from, unfortunately. You have to get it from Koji. You can use koji download-build --arch=x86_64 --arch=noarch <NVR> to download all the packages from the build, but for kernels that's a lot of packages, so I usually just cherry-pick the few packages I need to install from the web UI.
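
For example, pulling the whole 20240517 build referenced later in this thread would look something like this (the wget of just the kernel, kernel-core, kernel-modules, and kernel-modules-core packages shown further down is the cherry-picked equivalent):

# downloads every x86_64/noarch subpackage of that kernel build into the current directory
koji download-build --arch=x86_64 --arch=noarch kernel-6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.fc41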

openQA does record logs, but we don't happen to pipe the output of this specific test command to a file at present. It would be easy to do that if it's useful, though.

@Luap99 it's the kernel that is causing this. The same test passes just fine on every other Rawhide update; it fails only on kernel updates.

@Luap99 (Member) commented May 17, 2024

Thanks @AdamWill, I guess then we have to get a simple reproducer and file a kernel bug.

@edsantiago (Collaborator, Author)

I'm being lazy again: the failure is on a 0514 kernel build. I see a 0517 koji build and have not seen any OpenQA error emails about it. Until I have reason to suspect otherwise, I'll assume the problem is fixed. (And I'll save myself the time of pulling the kernel and looking for a reproducer.)

@edsantiago (Collaborator, Author)

sigh... never mind. 0517 did fail in OpenQA.

Reproducer:

# uname -r
6.9.0-0.rc7.20240510git448b3fe5a0ea.62.fc41.x86_64
# dnf -y install podman-tests

# podman run -d --name foo quay.io/libpod/testimage:20240123 sleep inf
<cid>
# podman exec foo cat /sys/fs/cgroup/io.max
# podman update --device-read-bps=/dev/zero:10mb foo
<cid>
# podman exec foo cat /sys/fs/cgroup/io.max
1:5 rbps=10485760 wbps=max riops=max wiops=max    <<<<< THIS IS GOOD

Then:

# wget https://kojipkgs.fedoraproject.org//packages/kernel/6.10.0/0.rc0.20240517gitea5f6ad9ad96.6.fc41/x86_64/kernel{,-core,-modules,-modules-core}-6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.fc41.x86_64.rpm
# dnf install kern*rpm; reboot

Then:

# uname -r
6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.fc41.x86_64
# podman rm -f -a
[repeat the podman run/update/exec from above]
1:5 rbps=0 wbps=0 riops=0 wiops=0       <<<<<< THIS IS NOT GOOD
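
For context on that output: 1:5 is the major:minor number of /dev/zero (the device passed to --device-read-bps), and rbps/wbps/riops/wiops are the read/write bandwidth and IOPS limits in cgroup v2's io.max format. The device numbers can be double-checked with:

# the "1, 5" in this listing is the major, minor pair that io.max keys on
ls -l /dev/zero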

@edsantiago changed the title from "kernel-6.10.0-0.rc0.20240514gita5131c3fdf26.2.fc41 - podman update device-read-bps = 0" to "rawhide kernel 6.10.0 >=20240514 - podman update device-read-bps = 0" on May 20, 2024
@edsantiago (Collaborator, Author) commented May 20, 2024

Filed rhbz2281805

@Luap99 (Member) commented May 29, 2024

Does this still happen with 6.10 rc1?

@edsantiago (Collaborator, Author)

If by rc1 you mean 6.10.0-0.rc1.17, then yes.

@Luap99 (Member) commented May 29, 2024

A CLI reproducer should be something like this:

mkdir /sys/fs/cgroup/test-cgroup
echo "1:5 rbps=10485760" > /sys/fs/cgroup/test-cgroup/io.max
cat /sys/fs/cgroup/test-cgroup/io.max
rmdir /sys/fs/cgroup/test-cgroup

@Luap99 (Member) commented May 29, 2024

I tried to get a Rawhide VM going to test the install myself, but it seems like something with dnf is terribly broken there, as I cannot install anything due to checksum errors. I tried several VMs; all fail in the same way...

@AdamWill

huh, that seems odd? I'm running Rawhide here and not seeing anything like that, and our automated tests aren't either.

@edsantiago (Collaborator, Author)

On 1mt, a minute or two ago, I saw a ton of red checksum errors, but dnf install podman ended up succeeding.

@AdamWill

I do see this mail, which might be relevant. I hadn't updated to that yet. But openQA did pass its tests today... which includes doing quite a lot of package installs...

@Luap99 (Member) commented May 29, 2024

Yeah, it seems to be working again now; not sure what happened.

@Luap99 (Member) commented May 29, 2024

Tried 6.10.0-0.rc1.20240528git2bfcfd584ff5.18 and can reproduce with the shell commands above; you may need to add the io controller first on a fresh boot.

echo +io > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/test-cgroup
echo "1:5 rbps=10485760" > /sys/fs/cgroup/test-cgroup/io.max
cat /sys/fs/cgroup/test-cgroup/io.max
rmdir /sys/fs/cgroup/test-cgroup
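
Based on the outputs captured earlier in this thread, the cat should print the configured limit on a good kernel but all zeros on the affected 6.10 snapshot builds:

1:5 rbps=10485760 wbps=max riops=max wiops=max    <- expected (6.9-based kernels)
1:5 rbps=0 wbps=0 riops=0 wiops=0                 <- what the affected 6.10.0 rc0/rc1 builds print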

I think this must be reported to kernel upstream; I don't see this getting solved just sitting in the Fedora Bugzilla.

@AdamWill

well, @jmflinuxtx - the Fedora kernel maintainer - is aware of the issue, so I was kinda leaving it to him to report it to the appropriate upstream venues. I find it pretty impossible to know where to send kernel issues.

@jmflinuxtx

Yes, I am aware; I passed this on to Waiman Long. He thought there was a patch for it, but it turned out not to cover this case, so he was looking again. In the meantime, we just hit RC1, so bug fixes are coming in fast, and it is possible that someone else has a fix. Worst case, I can bisect later this week.
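
If it does come to a bisect, a rough sketch using the snapshot commits embedded in the good/bad kernel NVRs from this thread (build, boot, and re-run the io.max reproducer at each step; the choice of endpoints is an assumption based on the versions reported above):

# in a mainline linux.git checkout
git bisect start
git bisect bad ea5f6ad9ad96     # 6.10.0-0.rc0.20240517git... snapshot: reproduces the bug
git bisect good 448b3fe5a0ea    # 6.9.0-0.rc7.20240510git... snapshot: still correct
# then, after booting each candidate kernel and re-running the reproducer:
git bisect good    # or: git bisect bad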

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue May 30, 2024
torvalds pushed a commit to torvalds/linux that referenced this issue May 31, 2024
Commit bf20ab5 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
attempts to revert the code change introduced by commit cd5ab1b
("blk-throttle: add .low interface").  However, it leaves behind the
bps_conf[] and iops_conf[] fields in the throtl_grp structure which
aren't set anywhere in the new blk-throttle.c code but are still being
used by tg_prfill_limit() to display the limits in io.max. Now io.max
always displays the following values if a block queue is used:

	<m>:<n> rbps=0 wbps=0 riops=0 wiops=0

Fix this problem by removing bps_conf[] and iops_conf[] and use bps[]
and iops[] instead to complete the revert.

Fixes: bf20ab5 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Reported-by: Justin Forbes <jforbes@redhat.com>
Closes: containers/podman#22701 (comment)
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240530134547.970075-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>