Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple intermitent failures in seveal SMP targets due to UART missbehaviour #72858

Open
aescolar opened this issue May 16, 2024 · 10 comments
Open
Assignees
Labels
area: ARM64 ARM (64-bit) Architecture area: RISCV RISCV Architecture (32-bit & 64-bit) area: SMP Symmetric multiprocessing area: UART Universal Asynchronous Receiver-Transmitter bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@aescolar
Copy link
Member

aescolar commented May 16, 2024

Describe the bug
Multiple tests are failing intermittently in CI qemu_riscv32/qemu_virt_riscv32/smp and qemu_cortex_a53//smp.
From the logs it appears characters from the UART prompt and output are being mingled breaking messages and losing characters
and causing tests which check the output against a regex to fail.

To Reproduce
Steps to reproduce the behavior:

  1. twister -p qemu_riscv32/qemu_virt_riscv32/smp -T samples/posix/env -s sample.posix.env
    OR
  2. twister -p qemu_riscv32/qemu_virt_riscv32/smp -T samples/subsys/logging/logger/ -s sample.logger.basic
  3. See error (sometimes)

Expected behavior
No tests errors (and no mingled shell prompt and output messages)

Impact
main CI fails sometimes.

Logs and console output
https://github.com/zephyrproject-rtos/zephyr/actions/runs/9108203173/job/25038560360#step:11:245
https://github.com/zephyrproject-rtos/zephyr/actions/runs/9108203173/job/25038560360#step:11:1245

INFO    - 1136/1136 qemu_riscv32/qemu_virt_riscv32/smp samples/posix/env/sample.posix.env                  FAILED Timeout (qemu 119.933s)
INFO    - /__w/zephyr/zephyr/twister-out/qemu_riscv32_qemu_virt_riscv32_smp/samples/posix/env/sample.posix.env/handler.log
ERROR   - *** Booting Zephyr OS build v3.6.0-4069-gaed0fbf7741e ***


BOARD=qemu_riscv32
uart:~$ UILD_VERSIOmN=v3.6.0-4069-gaed0fbf7741e
ALERT=

INFO    - /__w/zephyr/zephyr/twister-out/qemu_riscv32_qemu_virt_riscv32_smp/samples/posix/env/sample.posix.env/handler.log
INFO    - 3 Iteration:
INFO    - Adding tasks to the queue...

INFO    - Added initial list of jobs to queue
INFO    - 1136/1136 qemu_riscv32/qemu_virt_riscv32/smp samples/posix/env/sample.posix.env                  FAILED Timeout (qemu 119.881s)
INFO    - /__w/zephyr/zephyr/twister-out/qemu_riscv32_qemu_virt_riscv32_smp/samples/posix/env/sample.posix.env/handler.log
ERROR   - *** Booting Zephyr OS build v3.6.0-4069-gaed0fbf7741e ***


uart:~$ mOARD=qemu_riscv32
BUILD_VERSION=v3.6.0-4069-gaed0fbf7741e
ALERT=

INFO    - /__w/zephyr/zephyr/twister-out/qemu_riscv32_qemu_virt_riscv32_smp/samples/posix/env/sample.posix.env/handler.log
INFO    - 4 Iteration:
INFO    - Adding tasks to the queue...
INFO    - Added initial list of jobs to queue

INFO    - 1136/1136 qemu_riscv32/qemu_virt_riscv32/smp samples/posix/env/sample.posix.env                  FAILED Timeout (qemu 119.919s)
INFO    - /__w/zephyr/zephyr/twister-out/qemu_riscv32_qemu_virt_riscv32_smp/samples/posix/env/sample.posix.env/handler.log
ERROR   - *** Booting Zephyr OS build v3.6.0-4069-gaed0fbf7741e ***


BARD=qemu_riscv32
[1;32mBUILD_VERSION=v3.6.0-4069-gaed0fbf7741e
ALERT=

Environment (please complete the following information):

  • OS: Linux CI (Ubuntu 22.04)
  • Toolchain (e.g Zephyr SDK, ...): Zephyr SDK (CI)
  • Commit SHA or Version used: Main as of now c8cd57f

Aditional info
The issue can be reproduced at least back to 527e712

@aescolar aescolar added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug Regression Something, which was working, does not anymore area: RISCV RISCV Architecture (32-bit & 64-bit) and removed Regression Something, which was working, does not anymore labels May 16, 2024
@aescolar
Copy link
Member Author

aescolar added a commit to aescolar/zephyr that referenced this issue May 16, 2024
There is some instability problems with qemu_riscv32
zephyrproject-rtos#72858
which cause this sample test to fail in CI at random.
Let's change the integration platform to a reliable one,
so this test focuses on the sample and does not produce
false test failures due to the platform.

Signed-off-by: Alberto Escolar Piedras <alberto.escolar.piedras@nordicsemi.no>
@aescolar aescolar added the area: UART Universal Asynchronous Receiver-Transmitter label May 16, 2024
@aescolar
Copy link
Member Author

CC @dcpleung

@fkokosinski
Copy link
Member

fkokosinski commented May 16, 2024

Hi @aescolar, I've checked other QEMU-based platforms that have SMP variants, and I could get this behavior to trigger with qemu_cortex_a53//smp as well:

*** Booting Zephyr OS build v3.6.0-4072-g83cd4d36ab52 ***
Secondary CPU core 1 (MPID:0x1) is up


uart:~$ mOARD=qemu_cortex_a53
BUILD_VERSION=v3.6.0-4072-g83cd4d36ab52
ALERT=

So just FYI this affects other archs as well

aescolar added a commit to aescolar/zephyr that referenced this issue May 16, 2024
There is some instability problems with qemu_riscv32
zephyrproject-rtos#72858
which cause this sample test to fail in CI at random.
Let's change the integration platform to a reliable one,
so this test focuses on the sample and does not produce
false test failures due to the platform.

Signed-off-by: Alberto Escolar Piedras <alberto.escolar.piedras@nordicsemi.no>
@fkokosinski fkokosinski added area: SMP Symmetric multiprocessing area: ARM64 ARM (64-bit) Architecture labels May 17, 2024
carlescufi pushed a commit that referenced this issue May 17, 2024
There is some instability problems with qemu_riscv32
#72858
which cause this sample test to fail in CI at random.
Let's change the integration platform to a reliable one,
so this test focuses on the sample and does not produce
false test failures due to the platform.

Signed-off-by: Alberto Escolar Piedras <alberto.escolar.piedras@nordicsemi.no>
@aescolar aescolar added priority: medium Medium impact/importance bug area: SMP Symmetric multiprocessing and removed priority: high High impact/importance bug area: SMP Symmetric multiprocessing labels May 19, 2024
@aescolar
Copy link
Member Author

Lowering priority to medium as this happens seldom enough that the 3 retry mechanism in CI tends to be enough to avoid CI failures, and when not a rerun will with high likelihood be enough.

@aescolar aescolar changed the title Multiple intermitent failures in qemu_riscv32/qemu_virt_riscv32/smp in CI Multiple intermitent failures in seveal SMP targets May 19, 2024
@aescolar aescolar changed the title Multiple intermitent failures in seveal SMP targets Multiple intermitent failures in seveal SMP targets due to UART missbehaviour May 19, 2024
@aescolar
Copy link
Member Author

CC @ithinuel @andyross

@andyross
Copy link
Contributor

I'm a little confused. Is that 'uart:~$' business in handler.log a twister feature? It doesn't appear on the west run output nor does it seem to be part of the test code. Maybe the raw host file descriptor from qemu is inappropriately shared with something in the rig?

My vague guess is this is a twister bug. @nashif ?

coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue May 20, 2024
There is some instability problems with qemu_riscv32
zephyrproject-rtos/zephyr#72858
which cause this sample test to fail in CI at random.
Let's change the integration platform to a reliable one,
so this test focuses on the sample and does not produce
false test failures due to the platform.

(cherry picked from commit 19df415)

Original-Signed-off-by: Alberto Escolar Piedras <alberto.escolar.piedras@nordicsemi.no>
GitOrigin-RevId: 19df415
Change-Id: Ie35bb6f889211353018ad1139744845aa48bae93
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5549530
Reviewed-by: Fabio Baltieri <fabiobaltieri@google.com>
Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com>
Commit-Queue: Fabio Baltieri <fabiobaltieri@google.com>
@davidivan-og
Copy link

davidivan-og commented May 28, 2024

I'm a little confused. Is that 'uart:~$' business in handler.log a twister feature? It doesn't appear on the west run output nor does it seem to be part of the test code. Maybe the raw host file descriptor from qemu is inappropriately shared with something in the rig?

That string comes from here and is used here. I've seen this issue on our system when using SMP without using twister. I haven't been able to come up with a consistent repro for the issue unfortunately :/. All I know is that it happens sporadically with printk when SMP is enabled. I've never seen this happen with shell_print().

@andyross
Copy link
Contributor

So the bug is just that we have multiple contexts multiplexing the console output and interleaving? Have you tried CONFIG_PRINTK_SYNC=y? That will separate the output cleanly by individual printk() calls (usually but not always the same as "lines") and might work around your bug. But it seems like the actual bug is in the test, making assumptions about serialization that don't work in SMP.

@davidivan-og
Copy link

davidivan-og commented May 29, 2024

So the bug is just that we have multiple contexts multiplexing the console output and interleaving? Have you tried CONFIG_PRINTK_SYNC=y? That will separate the output cleanly by individual printk() calls (usually but not always the same as "lines") and might work around your bug. But it seems like the actual bug is in the test, making assumptions about serialization that don't work in SMP.

That's actually enabled by default when SMP is enabled & there is > 1 core enabled

config PRINTK_SYNC

This is where my debugging had gotten me and then I got stumped. I was gonna try to create and post a repro here but I'm sidetracked right now :/

@ithinuel
Copy link
Collaborator

ithinuel commented Jun 4, 2024

I have seen related issues with running subsys/ipc tests/samples on an521 where printk messages where corrupted because core0 and core1 were using the same uart concurently.

In these cases, it was 2 independent Zephyr instances running (one on each core).

I havin’t had time to investigate how to properly fix it, I temporarily removed one of the offending message so that it’s not corrupting the one the test is looking for.
I’m not sure using the same uart from 2 independent Zephyr instances is expected to be sound in the first place.

I have also seen printk output corruptions in tflm-ethosu sample where printk is used for multiple threads on the same core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM64 ARM (64-bit) Architecture area: RISCV RISCV Architecture (32-bit & 64-bit) area: SMP Symmetric multiprocessing area: UART Universal Asynchronous Receiver-Transmitter bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

7 participants