[BUG] robus infinite collision condition still exist #483

nicolas-rabault · 2024-04-17T08:02:22Z

Details

Version on which the bug was detected

Luos engine 3.1.0 and all earlier versions

Description of the bug

Because the protocol is multi-master, Robus can experience message collisions on the network. After a collision, Robus has to retry sending the message and take steps to avoid colliding again. However, there still seems to be one condition in which collision avoidance doesn't work.

Context and environment

A few explanations about the basic protocol timeout

In Robus, a timeout is used to avoid transmitting during a reception, which would cause a collision. The idea is to lock transmission as soon as we receive something and unlock it after a timeout. More information about the timeout is available in the related documentation page.
To manage this, Robus resets a timer to a specific value each time a byte is received, so that after a period of inactivity on the bus all the nodes can send messages again.
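
The lock-and-timeout mechanism described above can be sketched as follows. This is a minimal illustration, not the Luos engine code: the type `bus_lock_t`, the functions `bus_on_rx_byte` and `bus_on_tick`, and the value of `RX_TIMEOUT_TICKS` are all hypothetical names and numbers chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value: ticks of bus inactivity required before TX is allowed again. */
#define RX_TIMEOUT_TICKS 20u

typedef struct {
    bool     tx_locked;  /* true while the bus is considered busy */
    uint32_t ticks_left; /* remaining ticks before the lock is released */
} bus_lock_t;

/* Called for every received byte: lock TX and restart the inactivity timeout. */
static void bus_on_rx_byte(bus_lock_t *bus)
{
    assert(bus != NULL);
    bus->tx_locked  = true;
    bus->ticks_left = RX_TIMEOUT_TICKS;
}

/* Called once per timer tick: release the lock when the timeout expires. */
static void bus_on_tick(bus_lock_t *bus)
{
    assert(bus != NULL);
    if (bus->ticks_left > 0u && --bus->ticks_left == 0u) {
        bus->tx_locked = false;
    }
}
```

As long as bytes keep arriving, `ticks_left` is reloaded and the lock never releases; only a full quiet period on the bus unlocks transmission.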

Timeout used for collision avoidance

Sometimes two nodes try to send messages at the same time. In this case, the timeout does not help and a collision still occurs on the network. The collision is detected, and Robus retries sending the message after a timeout period that depends on its node ID, to avoid colliding with the same node again.
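
The exact rule Robus uses to derive the retry delay from the node ID is not stated in this issue. As a sketch only, assuming a delay that grows linearly with the node ID (the function `collision_retry_delay` and the constant `BASE_TIMEOUT_TICKS` are hypothetical), the idea looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical base timeout, in timer ticks. */
#define BASE_TIMEOUT_TICKS 20u

/* A zero base would give every node the same (null) delay. */
static_assert(BASE_TIMEOUT_TICKS > 0u, "base timeout must be non-zero");

/* Assumed rule: the retry delay grows linearly with the node ID, so two
 * nodes that just collided wait for different durations before retrying. */
static uint32_t collision_retry_delay(uint16_t node_id)
{
    return BASE_TIMEOUT_TICKS * ((uint32_t)node_id + 1u);
}
```

With these assumed values, node 1 would retry after 40 ticks and node 2 after 60 ticks, so node 1 gets the bus first.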

However, the collision avoidance timer is the same timer used for normal reception, so in practice node 2's collision avoidance timeout is overwritten when node 2 receives node 1's transmission.
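
The overwrite can be illustrated with a single shared countdown. This is a simplified model of the behaviour described in this issue, not the actual Robus implementation; `shared_timer_t`, both function names, and the tick values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reception timeout, in timer ticks. */
#define RX_TIMEOUT_TICKS 20u

typedef struct {
    uint32_t ticks_left; /* one countdown shared by reception and collision backoff */
} shared_timer_t;

/* Arm the timer with a (longer) collision avoidance delay. */
static void timer_arm_collision_backoff(shared_timer_t *t, uint32_t backoff_ticks)
{
    assert(t != NULL);
    t->ticks_left = backoff_ticks;
}

/* Problematic step: every received byte unconditionally reloads the timer,
 * discarding any longer backoff that was pending. */
static void timer_on_rx_byte(shared_timer_t *t)
{
    assert(t != NULL);
    t->ticks_left = RX_TIMEOUT_TICKS;
}
```

If node 2 arms a 60-tick backoff and then receives a byte from node 1, its timer drops back to 20 ticks, so the node-specific delay is lost.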

This leads to a case where collision avoidance can fail.

In this case, three nodes collide, and then a fourth node collides with node 1's retry. This results in a collision loop.

How to reproduce the bug

@houkhouk has seen it in only one specific condition over several years, so it is almost impossible to reproduce deliberately.

Possible solution

To avoid this, the timeout timer could give priority to the latest timeout. If a normal timeout would expire before a pending collision avoidance timeout, we should not reset the timer.
In other words, the timer should always keep the latest possible timeout.
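
A minimal sketch of this proposal, assuming the timer is a simple countdown in ticks, is shown below. The function `timeout_request` is a hypothetical name used for illustration; it is not taken from the linked pull requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Request a timeout on the shared timer, keeping whichever deadline is later.
 * A shorter request (e.g. a normal reception timeout) never shortens a longer
 * pending one (e.g. a collision avoidance delay). */
static void timeout_request(uint32_t *ticks_left, uint32_t requested_ticks)
{
    assert(ticks_left != NULL);
    if (requested_ticks > *ticks_left) {
        *ticks_left = requested_ticks;
    }
}
```

With this rule, a node holding a 60-tick collision avoidance delay that receives a byte (a 20-tick request) keeps its 60 ticks, while a later, longer request still extends the timer.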

nicolas-rabault added this to the 3.1.0 milestone Apr 17, 2024

nicolas-rabault self-assigned this Apr 17, 2024

nicolas-rabault linked a pull request May 14, 2024 that will close this issue

Fix timeout management in ROBUS protocol #484

Closed

7 tasks

nicolas-rabault linked a pull request May 23, 2024 that will close this issue

Fix stability issues #487

Merged

7 tasks

nicolas-rabault mentioned this issue May 23, 2024

Rc 3.1.0 #458

Open

8 tasks
