
twister: handle exceptions when running binaries #72893

Open · nashif wants to merge 1 commit into main from topic/twister/exception_unit
Conversation

@nashif (Member) commented May 16, 2024

Deal with binaries not being found. Also set the default return code to something other than zero, to catch issues and report the correct status.

Fixes #72807

Signed-off-by: Anas Nashif anas.nashif@intel.com
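
A minimal sketch of the approach the commit describes (the helper name is hypothetical, and the instance status/reason fields only mirror the handler's conventions; this is not the actual BinaryHandler code):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def _run_binary(self, command, timeout):
    # Default to a nonzero return code so that a crash before the
    # process reports back is never mistaken for success.
    returncode = 1
    try:
        proc = subprocess.Popen(command,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        proc.communicate(timeout=timeout)
        returncode = proc.returncode
    except FileNotFoundError as e:
        # The binary to execute does not exist; record the error
        # instead of letting the exception kill the worker.
        logger.error(f"Binary not found: {e}")
        self.instance.status = "error"
        self.instance.reason = "binary not found"
    return returncode
```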

@hakehuang (Collaborator)

Similar to #72892, but the fix in this PR looks more carefully considered.

hakehuang previously approved these changes May 17, 2024
@arbrauns (Contributor) left a comment

If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, this one is specific to the BinaryHandler.

There's something missing further up the call/exec stack. Without the try/except, the exception propagated up to somewhere where it got printed; that's where a check is needed.

@nashif (Member, Author) commented May 17, 2024

> If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, this one is specific to the BinaryHandler.

OK, makes sense. Updated to be more general.

@arbrauns (Contributor)

I still feel like there's something missing. Previously run() threw an exception, which then killed the entire runner, right? Presumably that caused a Python process to exit with a nonzero exit code somewhere, right? Where did that happen, and why isn't it caught?

@arbrauns (Contributor)

Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception, as it should be.

@PerMac (Member) left a comment

Please also add filling of the device.log:

```python
logger.error(f"Error: {e}")
self.instance.status = "error"
self.instance.reason = "general runtime exception"
```


Suggested change:

```python
d_log = f"{self.instance.build_dir}/device.log"
with open(d_log, "a") as dlog_fp:
    dlog_fp.write(str(e))
```

Without it, reports contain a useless build.log.

@PerMac (Member) commented May 17, 2024

> Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in process(), it was being passed on to the calling function as an exception, as it should be.

I am still not sure what you are referring to. With the current version, twister will exit with exitcode==1 because there were errors in the run: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/twister_main.py#L236
Is this what you mean by "fail the CI run"? We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

@arbrauns (Contributor) commented May 17, 2024

The original problem was that the runner crashed (the reason doesn't matter), half the tests weren't run, and CI was still completely green. With the changes in this PR, an exception thrown by self.run() is now caught and the status is reported as an error, but if the runner crashes anywhere else for any other reason, the same problem of CI ending up completely green persists, right?

> We don't want to crash the run, i.e. abruptly exit on an error, since we want the run to finish so we can see all test results.

If the runner crashes in a way that it can't handle and report by itself, then that is exactly what needs to happen. Catastrophic failure needs to result in more errors, not no errors.

Let's approach it like this: which piece of code starts the runner? Why doesn't it notice and report when the runner crashes?

@PerMac (Member) commented May 17, 2024

I believe I finally understand what you mean. I saw the original error in our CI, but since there are several workers (I think this is what you call the runner?) running, it didn't look like anything was missing besides those unit tests. In the logs I saw errors (not reported), but it didn't look like anything else was skipped. However, the scope was huge and varied between runs, so it was impossible to tell whether something was not executed.
I was able to reproduce what you're talking about locally, calling

```
scripts/twister --verbose -T tests/unit/ -p unit_testing -p qemu_x86 -T samples/hello_world/ -j 1 --shuffle-tests --subset 1/1
```

After the worker failed, nothing more was executed, the summary was printed without reporting errors (although passed was also 0, but who looks at that), and twister exited with 0.
I think I was able to resolve the issue by wrapping this line with try/except and doing instance status reporting: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/runner.py#L1325
IMO such a fix can be yet another PR. I think I will also have to add a new log file, e.g. "pipeline.log", and dump the error there, and also pass errors from this file to the report (which is what I also asked for in this PR).
@arbrauns does this sound aligned with your issue?
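
A hedged sketch of the try/except wrapping described above; the process() call site and its argument names are assumed for illustration, not copied from runner.py:

```python
import logging

logger = logging.getLogger(__name__)

def process_instance(pb, pipeline, done_queue, task, lock, results):
    """Run one pipeline step, turning an internal crash into a
    reported instance error instead of a dead worker."""
    try:
        # Corresponds to the call at runner.py line 1325 (arguments assumed).
        pb.process(pipeline, done_queue, task, lock, results)
    except Exception as e:
        logger.error(f"Pipeline error: {e}")
        pb.instance.status = "error"
        pb.instance.reason = f"Pipeline error: {e}"
```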

@arbrauns (Contributor)

I think what should happen is that twister simply exits with a nonzero exit code. If a worker crashes so hard that it can't even report back its failure, it's fine to kill the CI run, investigate and fix the problem, then retry the CI run. It shouldn't happen often.

@PerMac (Member) commented May 17, 2024

@arbrauns What about this? #72945
I tested it on main with the unit_testing fix reverted, and instead of silently crashing I got twister going over every test, reporting a pipeline error and dumping an error traceback into the results report. If you also want to see the error in the online logs, you can run twister with -v --inline-logs. This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

When I also added #72893 into the mix I was no longer getting pipeline errors, since those unit_testing errors were already handled.

nashif force-pushed the topic/twister/exception_unit branch from 14768cc to 4ef5243 on May 17, 2024 13:56
@nashif (Member, Author) commented May 17, 2024

> @arbrauns What about this? #72945

Well, those are not pipeline errors (not sure what that means exactly). As the bug fix for the binary shows, it was an issue in twister itself trying to run the wrong binary, something we should abort on, stopping execution ASAP. So I agree with @arbrauns completely, and the latest revision of this PR does just that: stop twister and do not try to push this into the logs and pretend all is good.

@arbrauns (Contributor) commented May 17, 2024

@PerMac

> This is the standard for all other errors/failures. I think it is better if we can handle such errors without crashing.

Of course it's better if it's handled gracefully, but the core problem here is that errors which aren't handled gracefully (of which there are infinitely many; you can't possibly cover all error scenarios) simply vanish without a trace. #72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem.

@nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.
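
A minimal sketch of the kind of check being suggested, assuming the worker Process objects are collected in a list (illustrative, not the actual TwisterRunner.execute code):

```python
import sys
import logging
from multiprocessing import Process

logger = logging.getLogger(__name__)

def join_and_check(processes: list[Process]) -> None:
    """Join worker processes and surface any crash via the exit code."""
    for proc in processes:
        proc.join()
        # exitcode is 0 on a clean exit, nonzero on error, and negative
        # if the process was killed by a signal.
        if proc.exitcode != 0:
            logger.error(f"Worker {proc.pid} died with exit code {proc.exitcode}")
            sys.exit(1)
```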

@nashif (Member, Author) commented May 17, 2024

> @nashif after looking into it deeper, I think what's actually missing is a check for Process.exitcode in TwisterRunner.execute after the .join(). That's where the exit code of the runner is being swallowed.

Yes, but this will be very difficult to handle with multiprocessing. I could check for this and exit, but it would also be nice to capture the backtrace and error, not only react to an exit code.
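
One common pattern for getting the backtrace out of a multiprocessing worker, sketched here with a hypothetical run_pipeline() entry point: the child catches the exception, ships the formatted traceback to the parent through a shared queue, and re-raises so its exit code stays nonzero and the parent's exitcode check still fires:

```python
import traceback
from multiprocessing import Queue

def worker(results_queue: Queue) -> None:
    try:
        run_pipeline()  # hypothetical: the worker's real entry point
    except Exception:
        # Ship the full traceback to the parent; the exit code alone
        # says nothing about what went wrong.
        results_queue.put({"status": "error",
                           "traceback": traceback.format_exc()})
        raise  # exit nonzero so the parent can still detect the crash
```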

nashif force-pushed the topic/twister/exception_unit branch from 4ef5243 to ab8761f on May 18, 2024 19:27
nashif requested a review from PerMac May 18, 2024 19:27
@PerMac (Member) commented May 20, 2024

@arbrauns @nashif

"#72945 is yet another bandaid that gracefully handles one error case, but it doesn't fix the root problem."

It is not a bandaid for one error case but for the whole set of exceptions you want to catch if you don't want to be specific:
https://docs.python.org/3/library/exceptions.html#exception-hierarchy
And from PEP 8: "A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:)." https://peps.python.org/pep-0008/#programming-recommendations:~:text=If%20you%20want%20to%20catch%20all%20exceptions%20that%20signal%20program%20errors%2C%20use%20except%20Exception%3A

The solution from this PR uses the exact same mechanism at the same line, i.e. wrapping pb.process from L1325 with try/except Exception. My PR detects the same set of errors and marks tests causing these errors accordingly, allowing the run to go forward. This PR instead adds a result to the results_queue and adds a new mechanism of scanning the results_queue for errors and exiting when one is found. @nashif what is the benefit of doing this extra queuing and scanning, instead of just making twister exit with an error as part of the except Exception? Also, something is amiss with the current PR if the idea was to exit on the first error: right now Twister checks for errors only after the whole set is executed. E.g. try

```
git revert 7d3dc4e7e377746ce752390082e634120b7edb8e
scripts/twister -v -T tests/unit -p unit_testing
```

It will go over all 62 configurations, erroring on each, and only then error-exit (without any summary). My PR will set results to errors for all those cases, put the corresponding error traces in the reports (one can see the full error in twister logs with --inline-logs, which is the default workflow), and exit gracefully with a summary and exit code==1 (due to errors in the results).
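
For contrast, a hedged sketch of the queue-scan-then-exit mechanism described above (the shape of the queued results is assumed for illustration, not taken from the PR):

```python
import sys
import logging
from queue import Empty
from multiprocessing import Queue

logger = logging.getLogger(__name__)

def check_results_queue(results_queue: Queue) -> None:
    """After the run, scan the queued results and exit hard on the
    first internal error found."""
    while True:
        try:
            result = results_queue.get_nowait()
        except Empty:
            break
        if result.get("status") == "error":
            logger.error(result.get("reason", "internal twister error"))
            sys.exit(1)
```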

@PerMac (Member) left a comment

The current implementation won't exit until the whole scope is finished.
Also, please have a look at #72945. I think it is a better approach (but of course I'm biased).

@nashif (Member, Author) commented May 20, 2024

> My PR detects the same set of errors and marks tests causing these errors accordingly, allowing the run to go forward.

The main problem here is that the "errors" or failures are not test related but are issues in the runner, and a test should not generate an exception in the Python code of the runner. Just counting those as test errors because of wrong logic or something gone wrong in twister is not a good idea; it should just quit, and a fix needs to be provided.

@PerMac (Member) commented May 21, 2024

@nashif I am fine with such an approach; I just proposed a way to see what the problem was while still being able to execute the whole scope. My +/-1 refers to the fact that your current implementation checks the queue only after every test in the scope is evaluated. So instead of quitting on the first pipeline error, one potentially has to wait until the end just to see twister crashing.

nashif force-pushed the topic/twister/exception_unit branch from ab8761f to 37b8dd5 on May 23, 2024 15:13
PerMac previously approved these changes May 24, 2024
@PerMac (Member) left a comment

LGTM

@nashif (Member, Author) commented May 24, 2024

@LukaszMrugala @KamilxPaszkiet can you help with the failing test please?

hakehuang previously approved these changes May 27, 2024
When something goes wrong with execution due to twister issues and bugs, do not continue with execution; abort the process, report the error, and return an error code.

Fixes zephyrproject-rtos#72807

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
nashif dismissed stale reviews from hakehuang and PerMac via b69adeb May 29, 2024 13:45
nashif force-pushed the topic/twister/exception_unit branch from 37b8dd5 to b69adeb on May 29, 2024 13:45
Successfully merging this pull request may close this issue: Lots of twister errors in CI that don't get reported as failures