[BUG] robus infinite collision condition still exist #483

Open
nicolas-rabault opened this issue Apr 17, 2024 · 0 comments · Fixed by #487

nicolas-rabault commented Apr 17, 2024

Details

Which version of the bug has been detected on

Luos engine 3.1.0 and all others before that

Description of the bug

Robus can experience message collisions on the network because the protocol is multi-master. After a collision, Robus has to retry sending the message and back off so that it does not collide again. However, there still seems to be one condition in which collision avoidance doesn't work.

Context and environment

A few explanations about the basic protocol timeout

On Robus, a timeout is used to avoid transmitting during a reception, which would cause a collision. The idea is to lock transmission as soon as we receive something and unlock it after a timeout. More information about the timeout is available in the related documentation page.
To manage this, Robus resets a timer to a specific value each time a byte is received, so that after a period of inactivity on the bus all the nodes can send messages again.
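
The lock-and-unlock behaviour described above can be sketched as follows. This is only an illustration, not the Luos engine code: the names `bus_timer_t`, `on_byte_received`, `on_tick` and the value of `RX_TIMEOUT_TICKS` are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value: bus inactivity, in timer ticks, required before TX is unlocked. */
#define RX_TIMEOUT_TICKS 20u

/* Hypothetical state of the single bus timer. */
typedef struct {
    uint32_t deadline;  /* tick at which the TX lock is released */
    bool     tx_locked; /* true while transmission is forbidden */
} bus_timer_t;

/* Called for every received byte: lock TX and restart the inactivity timeout. */
static void on_byte_received(bus_timer_t *t, uint32_t now)
{
    t->tx_locked = true;
    t->deadline  = now + RX_TIMEOUT_TICKS;
}

/* Called periodically: unlock TX once the bus has been idle long enough.
 * The signed difference keeps the comparison valid across counter wrap-around. */
static void on_tick(bus_timer_t *t, uint32_t now)
{
    if (t->tx_locked && (int32_t)(now - t->deadline) >= 0) {
        t->tx_locked = false;
    }
}
```

With these assumed values, a byte received at tick 100 followed by another at tick 115 keeps transmission locked until tick 135.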

Timeout used for collision avoidance

Sometimes 2 nodes will try to send messages at the same time. In this condition, the reception timeout does not help and we still have a collision on the network. This collision will be detected, and Robus will retry sending the message after a timeout period that depends on its node ID, to avoid colliding with the same node again:
image
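
As a rough illustration of a retry delay that depends on the node ID, the sketch below uses a linear formula. The formula, the function name `collision_backoff_ticks`, and both constants are assumptions made for the example; the issue only states that the delay depends on the node ID.

```c
#include <stdint.h>

/* Hypothetical values, in timer ticks. */
#define RX_TIMEOUT_TICKS   20u /* normal reception timeout */
#define BACKOFF_SLOT_TICKS 10u /* extra delay per node ID */

/* Assumed linear backoff: each node ID gets its own retry slot, so two nodes
 * that collided once do not retry at the same instant. */
static uint32_t collision_backoff_ticks(uint16_t node_id)
{
    return RX_TIMEOUT_TICKS + (uint32_t)node_id * BACKOFF_SLOT_TICKS;
}
```

Under these assumptions node 1 retries before node 2, and node 2 sees node 1's retry on the bus before its own delay has elapsed.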
However, the collision avoidance timer is the same timer as the one used for normal reception, so in reality node 2's collision avoidance timeout is overwritten when it receives node 1's transmission:
image
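
The overwrite can be reproduced with a minimal model of a single shared timer. This is a sketch under assumptions: `shared_deadline`, `arm_collision_backoff`, `arm_rx_timeout` and the tick values are hypothetical names and numbers, not the Luos engine implementation.

```c
#include <stdint.h>

/* Hypothetical value: normal reception timeout, in timer ticks. */
#define RX_TIMEOUT_TICKS 20u

/* One timer deadline shared by reception timeout and collision backoff. */
static uint32_t shared_deadline;

/* After a detected collision: wait a node-specific delay before retrying. */
static void arm_collision_backoff(uint32_t now, uint32_t backoff_ticks)
{
    shared_deadline = now + backoff_ticks;
}

/* On each received byte: unconditionally restart the reception timeout.
 * This silently discards any longer backoff that was pending. */
static void arm_rx_timeout(uint32_t now)
{
    shared_deadline = now + RX_TIMEOUT_TICKS;
}
```

For example, a backoff of 60 ticks armed at tick 100 should expire at tick 160, but a byte received at tick 105 moves the deadline back to tick 125, so the node-specific delay is lost.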
This leads us to a case where collision avoidance can fail:
image
Here we have 3 nodes colliding and then a fourth node colliding with the retry of node 1. This leads us to a collision loop.

How to reproduce the bug

@houkhouk has seen it in only one specific condition over the years, so it's almost impossible to reproduce deliberately.

Possible solution

To avoid this, the timeout timer could give priority to the latest timeout. If a normal reception timeout would expire before a pending collision avoidance timeout, we should not reset the timer.
In other words, this timer should always keep the latest timeout possible:
image
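
One way to express "keep the latest timeout" is to re-arm the timer only when the new deadline is later than the pending one. The sketch below illustrates that idea and is not the fix merged in the linked pull request; `arm_timeout`, `tx_allowed`, `is_later` and the tick values are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t deadline; /* tick at which transmission becomes allowed again */
static bool     armed;    /* true while a timeout is pending */

/* Wrap-safe "a is later than b" for a free-running 32-bit tick counter. */
static bool is_later(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Used for both reception timeouts and collision backoff: a new request
 * can extend the pending deadline but never shorten it. */
static void arm_timeout(uint32_t now, uint32_t duration)
{
    uint32_t candidate = now + duration;
    if (!armed || is_later(candidate, deadline)) {
        deadline = candidate;
        armed    = true;
    }
}

/* Returns true when no timeout is pending, i.e. the node may transmit. */
static bool tx_allowed(uint32_t now)
{
    if (armed && !is_later(deadline, now)) {
        armed = false; /* deadline reached */
    }
    return !armed;
}
```

With a 60-tick backoff armed at tick 100, a 20-tick reception timeout requested at tick 105 leaves the deadline at tick 160, while one requested at tick 150 extends it to tick 170.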

@nicolas-rabault nicolas-rabault added this to the 3.1.0 milestone Apr 17, 2024
@nicolas-rabault nicolas-rabault self-assigned this Apr 17, 2024
@nicolas-rabault nicolas-rabault linked a pull request May 14, 2024 that will close this issue
7 tasks
@nicolas-rabault nicolas-rabault linked a pull request May 23, 2024 that will close this issue
7 tasks
@nicolas-rabault nicolas-rabault mentioned this issue May 23, 2024
8 tasks