Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Rust transform death spiral by not panicking on transform returning Err #18530

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

voutilad
Copy link
Contributor

If a Rust transform returns Err, the runtime panics because it currently calls expect on the Result. This makes any Err returned by a users transform put the wasm runtime into a death spiral of panic and restart loops. Until the runtime has a concept of handling "bad" records that cause errors (e.g. DLQ, drop, etc.) the transform should be responsible for this. Returning Err should just result in a log message, at worst.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • Don't panic if a Rust transform returns Err when processing a message.

Bug Fixes

  • Stop panic-loops in the Rust runtime when a transform returns Err.

If a Rust transform returns Err, the runtime panics because it currently
calls expect on the Result. This makes any Err returned by a users
transform put the wasm runtime into a death spiral of panic and restart
loops. Until the runtime has a concept of handling "bad" records that
cause errors (e.g. DLQ, drop, etc.) the transform should be responsible
for this. Returning Err should just result in a log message, at worst.
@github-actions github-actions bot added the area/wasm WASM Data Transforms label May 16, 2024
@voutilad voutilad changed the title Prevent Rust transform death spiral by not panicking on transform Err. Prevent Rust transform death spiral by not panicking on transform returning Err May 16, 2024
@voutilad voutilad requested a review from oleiman May 17, 2024 11:11
Comment on lines +182 to +183
.inspect_err(|e| eprintln!("transforming record failed with error: {:?}", e))
.unwrap_or(())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and thanks for the change. I'm hesitant to merge right away since the panic-on-expect behavior feels intentional, and it may be directly related to the transform execution model viz how we calculate backoffs and such. Just need to give it some though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will drop records that fail on the floor, which is not a safe default because there is no way to know what failed where. I don't agree with this change (of course my opinion doesn't matter anymore), users should opt into this. For consistency other SDKs should change to this (go sdk panics on failures).

This default pauses until a deploy happens and the developers fix the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rockwotj I don’t see the pause behavior you describe. What I see is the same record being attempted to be reprocessed repeatedly, causing the same panic over and over again spamming the logs.

If the behavior is to panic, there’s also no reason to use a Result type. Just have the transform code panic directly. Returning Err just to cause a panic is very surprising as someone writing code against the api.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there are retred because we can't tell the difference between an OOM where a retry can fix and a permanent failure right now. That is fixable without SDK changes (basically if the user code fails as the first batch it will never succeed)

RE API - it's future proofing it. We should have an SDK or yanl configurable way to set the dead letter queue, then we can do the dead letter queue pattern automatically for a user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/wasm WASM Data Transforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants