
twister: handle exceptions when running binaries #72893

Open · nashif wants to merge 1 commit into main from topic/twister/exception_unit
Conversation

@nashif (Member) commented May 16, 2024

Deal with binaries not being found. Also set the default return code to something other than zero, to catch issues and report the correct status.

Fixes #72807

Signed-off-by: Anas Nashif anas.nashif@intel.com
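
A minimal sketch of the approach the commit describes (the helper name is hypothetical, and the instance status/reason fields only mirror the handler's conventions; this is not the actual BinaryHandler code):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def _run_binary(self, command, timeout):
    # Default to a nonzero return code so that a crash before the
    # process reports back is never mistaken for success.
    returncode = 1
    try:
        proc = subprocess.Popen(command,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        proc.communicate(timeout=timeout)
        returncode = proc.returncode
    except FileNotFoundError as e:
        # The binary to execute does not exist; record the error
        # instead of letting the exception kill the worker.
        logger.error(f"Binary not found: {e}")
        self.instance.status = "error"
        self.instance.reason = "binary not found"
    return returncode
```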

@hakehuang (Collaborator)

Similar to #72892, but the fix in this PR looks more carefully considered.

hakehuang previously approved these changes May 17, 2024
@arbrauns (Contributor) left a comment

If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, this one is specific to the BinaryHandler.

There's something missing further up the call/exec stack. Without the try/except, the exception propagated up to somewhere where it got printed; that's where a check is needed.

@nashif (Member, Author) commented May 17, 2024

> If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, this one is specific to the BinaryHandler.

OK, makes sense. Updated to be more general.

@arbrauns (Contributor)

I still feel like there's something missing. Previously run() threw an exception, which then killed the entire runner, right? Presumably that caused a Python process to exit with a nonzero exit code somewhere, right? Where did that happen, and why isn't it caught?

@arbrauns (Contributor)

Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception, as it should be.

@PerMac (Member) left a comment

Please also add filling of the device.log:

```python
logger.error(f"Error: {e}")
self.instance.status = "error"
self.instance.reason = "general runtime exception"
```


Suggested change:

```python
d_log = f"{self.instance.build_dir}/device.log"
with open(d_log, "a") as dlog_fp:
    dlog_fp.write(str(e))
```

Without it, reports contain a useless build.log.

@PerMac (Member) commented May 17, 2024

> Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception, as it should be.

I am still not sure what you are referring to. With the current version, twister will exit with exitcode==1 because there were errors in the run: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/twister_main.py#L236
Is this what you mean by "fail the CI run"? We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

@arbrauns (Contributor) commented May 17, 2024

The original problem was that the runner crashed (the reason doesn't matter), half the tests weren't run, and CI was still completely green. With the changes in this PR, an exception thrown by self.run() is now caught and the status is reported as an error, but if the runner crashes anywhere else for any other reason, the same problem of CI ending up completely green persists, right?

> We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

If the runner crashes in a way that it can't handle and report by itself, then that is exactly what needs to happen. Catastrophic failure needs to result in more errors, not no errors.

Let's approach it like this: which piece of code starts the runner? Why doesn't it notice and report when the runner crashes?

@PerMac (Member) commented May 17, 2024

I believe I finally understand what you mean. I saw the original error in our CI, but since there are several workers (I think this is what you call the runner?) running, it didn't look like anything was missing besides those unit tests. In the logs I saw errors (not reported), but it didn't look like anything else was skipped. However, the scope was huge and varied between runs, so it was impossible to tell whether something was not executed.
I was able to reproduce what you're talking about locally, calling

```
scripts/twister --verbose -T tests/unit/ -p unit_testing -p qemu_x86 -T samples/hello_world/ -j 1 --shuffle-tests --subset 1/1
```

After the worker failed, nothing more was executed, the summary was printed without reporting errors (although passed was also 0, but who looks at that), and twister exited with 0.
I think I was able to resolve the issue by wrapping this line with try/except and doing instance status reporting: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/runner.py#L1325
IMO such a fix can be yet another PR. I think I will also have to add a new log file, e.g. "pipeline.log", and dump the error there, and also pass errors from this file to the report (which is what I also asked for in this PR).
@arbrauns does this sound aligned with your issue?
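
A hedged sketch of the try/except wrapping described above; the process() call site and its argument names are assumed for illustration, not copied from runner.py:

```python
import logging

logger = logging.getLogger(__name__)

def process_instance(pb, pipeline, done_queue, task, lock, results):
    """Run one pipeline step, turning an internal crash into a
    reported instance error instead of a dead worker."""
    try:
        # Corresponds to the call at runner.py line 1325 (arguments assumed).
        pb.process(pipeline, done_queue, task, lock, results)
    except Exception as e:
        logger.error(f"Pipeline error: {e}")
        pb.instance.status = "error"
        pb.instance.reason = f"Pipeline error: {e}"
```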

@arbrauns (Contributor)

I think what should happen is that twister simply exits with a nonzero exit code. If a worker crashes so hard that it can't even report back its failure, it's fine to kill the CI run, investigate and fix the problem, then retry the CI run. It shouldn't happen often.

@PerMac (Member) commented May 17, 2024

@arbrauns What about this? #72945
I tested it on main with the unit_testing fix reverted, and instead of silently crashing I got twister going over every test, reporting a pipeline error and dumping an error traceback into the results report. If you also want to see the error in the online logs, you can run twister with -v --inline-logs. This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

When I also added #72893 into the mix I was no longer getting pipeline errors, since those unit_testing errors were already handled.

nashif force-pushed the topic/twister/exception_unit branch from 14768cc to 4ef5243 on May 17, 2024 13:56
@nashif (Member, Author) commented May 17, 2024

> @arbrauns What about this? #72945

Well, those are not pipeline errors (not sure what that means exactly). As the bug fix for the binary shows, it was an issue in twister itself trying to run the wrong binary, something we should abort on, stopping execution ASAP. So I agree with @arbrauns completely, and the latest revision of this PR does just that: stop twister and do not try to push this into the logs and pretend all is good.

@arbrauns (Contributor) commented May 17, 2024

@PerMac

> This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

Of course it's better if it's handled gracefully, but the core problem here is that errors which aren't handled gracefully (of which there are infinitely many; you can't possibly cover all error scenarios) simply vanish without a trace. #72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem.

@nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.
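
A minimal sketch of the kind of check being suggested, assuming the worker Process objects are collected in a list (illustrative, not the actual TwisterRunner.execute code):

```python
import sys
import logging
from multiprocessing import Process

logger = logging.getLogger(__name__)

def join_and_check(processes: list[Process]) -> None:
    """Join worker processes and surface any crash via the exit code."""
    for proc in processes:
        proc.join()
        # exitcode is 0 on a clean exit, nonzero on error, and negative
        # if the process was killed by a signal.
        if proc.exitcode != 0:
            logger.error(f"Worker {proc.pid} died with exit code {proc.exitcode}")
            sys.exit(1)
```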

@nashif (Member, Author) commented May 17, 2024

> @nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.

Yes, but this will be very difficult to handle with multiprocessing. I could check for this and exit, but it would also be nice to capture the backtrace and error, not only react to an exit code.
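
One common pattern for getting the backtrace out of a multiprocessing worker, sketched here with a hypothetical run_pipeline() entry point: the child catches the exception, ships the formatted traceback to the parent through a shared queue, and re-raises so its exit code stays nonzero and the parent's exitcode check still fires:

```python
import traceback
from multiprocessing import Queue

def worker(results_queue: Queue) -> None:
    try:
        run_pipeline()  # hypothetical: the worker's real entry point
    except Exception:
        # Ship the full traceback to the parent; the exit code alone
        # says nothing about what went wrong.
        results_queue.put({"status": "error",
                           "traceback": traceback.format_exc()})
        raise  # exit nonzero so the parent can still detect the crash
```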

nashif force-pushed the topic/twister/exception_unit branch from 4ef5243 to ab8761f on May 18, 2024 19:27
nashif requested a review from PerMac May 18, 2024 19:27
@PerMac (Member) commented May 20, 2024

@arbrauns @nashif

"#72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem."

It is not a bandaid for one error case but for the whole set of exceptions you want to catch if you don't want to be specific:
https://docs.python.org/3/library/exceptions.html#exception-hierarchy
And from PEP 8: "A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:)." https://peps.python.org/pep-0008/#programming-recommendations:~:text=If%20you%20want%20to%20catch%20all%20exceptions%20that%20signal%20program%20errors%2C%20use%20except%20Exception%3A

The solution from this PR uses the exact same mechanism at the same line, i.e. wrapping pb.process from L1325 with try/except Exception. My PR detects the same set of errors and marks tests causing these errors accordingly, allowing the run to go forward. This PR instead adds a result to the results_queue and adds a new mechanism of scanning the results_queue for errors and exiting when one is found. @nashif what is the benefit of doing this extra queuing and scanning, instead of just making twister exit with an error as part of the except Exception? Also, something is amiss with the current PR if the idea was to exit on the first error: right now Twister checks for errors only after the whole set is executed. E.g. try

```
git revert 7d3dc4e7e377746ce752390082e634120b7edb8e
scripts/twister -v -T tests/unit -p unit_testing
```

It will go over all 62 configurations, erroring on each, and only then error-exit (without any summary). My PR will set results to errors for all those cases, put the corresponding error traces in the reports (one can see the full error in twister logs with --inline-logs, which is the default workflow), and exit gracefully with a summary and exit code==1 (due to errors in the results).
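
For contrast, a hedged sketch of the queue-scan-then-exit mechanism described above (the shape of the queued results is assumed for illustration, not taken from the PR):

```python
import sys
import logging
from queue import Empty
from multiprocessing import Queue

logger = logging.getLogger(__name__)

def check_results_queue(results_queue: Queue) -> None:
    """After the run, scan the queued results and exit hard on the
    first internal error found."""
    while True:
        try:
            result = results_queue.get_nowait()
        except Empty:
            break
        if result.get("status") == "error":
            logger.error(result.get("reason", "internal twister error"))
            sys.exit(1)
```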

@PerMac (Member) left a comment

The current implementation won't exit until the whole scope is finished.
Also, please have a look at #72945. I think it is a better approach (but of course I'm biased).

@nashif (Member, Author) commented May 20, 2024

> My PR detects the same set of errors and marks tests causing these errors accordingly, allowing the run to go forward.

The main problem here is that the "errors" or failures are not test related but are issues in the runner, and a test should not generate an exception in the Python code of the runner. Just counting those as test errors because of wrong logic or something gone wrong in twister is not a good idea; it should just quit, and a fix needs to be provided.

@PerMac (Member) commented May 21, 2024

@nashif I am fine with such an approach; I just proposed a way to see what the problem was while still being able to execute the whole scope. My +/-1 refers to the fact that your current implementation checks the queue only after every test in the scope is evaluated. So instead of quitting on the first pipeline error, one potentially has to wait until the end just to see twister crashing.

nashif force-pushed the topic/twister/exception_unit branch from ab8761f to 37b8dd5 on May 23, 2024 15:13
PerMac previously approved these changes May 24, 2024
@PerMac (Member) left a comment

LGTM

@nashif (Member, Author) commented May 24, 2024

@LukaszMrugala @KamilxPaszkiet can you help with the failing test please?

hakehuang previously approved these changes May 27, 2024
When something goes wrong with execution due to twister issues and bugs, do not continue with execution; abort the process, report the error, and return an error code.

Fixes zephyrproject-rtos#72807

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
nashif dismissed stale reviews from hakehuang and PerMac via b69adeb May 29, 2024 13:45
nashif force-pushed the topic/twister/exception_unit branch from 37b8dd5 to b69adeb on May 29, 2024 13:45
Successfully merging this pull request may close this issue: Lots of twister errors in CI that don't get reported as failures