Consumer not receiving messages when power off and restart, consumer's ack floor is ahead of stream's last sequence #5412
Comments
After cutting the power, could you snapshot the store directory and share it with us privately after reboot but before you restart the nats-server? Also, what size payloads have you set the system to use? I see it is complaining they are over 8MB, which we strongly discourage.
Or provide more information on the number of messages inbound into the WQ, how big they are, and how the consumers operate. Does the stream usually have messages, or do the consumers usually consume messages quickly and keep the number of messages very low or at zero?
We see the same problem in our KV tests when rebooting the container by draining the Docker node or rebooting the VM: #5205
I have encountered this problem. The consumer's sequence exceeds the stream's, so I cannot get new messages from the stream until the stream's sequence catches up past the consumer's.
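For anyone who wants to check for this condition, here is a minimal diagnostic sketch in Go, assuming the nats.go client and placeholder stream/consumer names ("ORDERS" / "worker"), that compares the consumer's ack floor against the stream's last sequence:

```go
// Minimal sketch: detect a consumer whose ack floor is ahead of the stream's
// last sequence. "ORDERS" and "worker" are placeholder names for illustration.
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	si, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	ci, err := js.ConsumerInfo("ORDERS", "worker")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("stream last_seq=%d consumer ack_floor.stream=%d\n",
		si.State.LastSeq, ci.AckFloor.Stream)
	if ci.AckFloor.Stream > si.State.LastSeq {
		fmt.Println("consumer ack floor is ahead of the stream's last sequence; delivery will stall")
	}
}
```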
We are trying to reproduce, so any and all information on how you are interacting with the stream is helpful. How fast are messages being sent into the system? In general, is the stream near empty or empty most of the time?
In our case, 1 msg per second was being sent, and we drained the nodes hosting the meta leader and the stream leader.
We also see problems when we drain a non-leader node.
Thanks everyone for sharing the symptoms of this issue; we're currently working on a reproducible example that trips this condition.
We have standard 1 MB payloads. We tested versions from master down to 2.10.5, and every time we restart the Docker server the consumer problem and the last-sequence problem reproduce.
In my testing, the payload size is around 50KB, with a consumption rate of 100 messages per second. The consumers usually consume messages quickly and keep the number of pending messages very low. To avoid incidents in production, I temporarily forked a branch and made an aggressive fix: if consumer initialization finds the ack floor higher than the stream's lastSeq, it forcibly resets it to lastSeq (a simplified sketch follows below). This allows the consumer to consume normally, but data loss occurs. I believe the fundamental issue is that the stream's persistence mechanism during a power outage can be improved so that the cursor is not lost.
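A standalone, simplified sketch of that workaround (not the actual fork; the type and field names below are stand-ins for the server's internal consumer state):

```go
// Simplified illustration of the "aggressive" workaround described above:
// if the recovered consumer cursor points past the stream's last sequence,
// clamp it to lastSeq so delivery can resume. Messages between lastSeq and
// the old cursor are accepted as lost, which is why this is only a workaround.
package main

import "fmt"

// recoveredState is a simplified stand-in for the persisted consumer cursor.
type recoveredState struct {
	AckFloorStream  uint64 // ack floor in stream-sequence space
	DeliveredStream uint64 // last delivered stream sequence
}

// clampToStream resets the cursor when it points past the end of the stream.
// It returns true if the state had to be adjusted.
func clampToStream(st *recoveredState, lastSeq uint64) bool {
	if st.AckFloorStream <= lastSeq && st.DeliveredStream <= lastSeq {
		return false // state is consistent, nothing to do
	}
	if st.AckFloorStream > lastSeq {
		st.AckFloorStream = lastSeq
	}
	if st.DeliveredStream > lastSeq {
		st.DeliveredStream = lastSeq
	}
	return true
}

func main() {
	st := &recoveredState{AckFloorStream: 12345, DeliveredStream: 12345}
	// Stream recovered with LastSeq == 0 after power loss.
	if clampToStream(st, 0) {
		fmt.Printf("cursor reset to %d, consumer can deliver again\n", st.AckFloorStream)
	}
}
```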
I have already taken a snapshot of the store directory after rebooting but before restarting the nats-server. How can I share it with you? Thanks for the hard work!
@stitchcula can you send it to wally@nats.io, if possible?
Observed behavior
When the server is unexpectedly powered off and then restarted, all consumers are unable to self-recover and fail to receive messages. The log is as follows:
After the restart, the situation described in this issue reproduces almost 100% of the time: the consumer's ack floor is greater than the stream's last sequence, so none of the consumers can consume.
Note that:
This appears to be the last blk index recorded in index.db, which cannot actually be found on disk. The inconsistency may be caused by the non-atomic writing of index.db and the blk files to disk. When func (fs *fileStore) recoverFullState() (rerr error) in server/filestore.go cannot find the corresponding [index].blk, it falls back to setting the stream's LastSeq to 0. The consumer's meta is persisted independently, and when func (o *consumer) checkStateForInterestStream() error in server/consumer.go reads it, it finds that the recorded AckFloor is higher than the stream's actual LastSeq and throws an error (see the sketch below). By the way, I've noticed that the persistence and recovery mechanisms of the filestore vary slightly between server versions.
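To make the failure mode concrete, here is a minimal standalone sketch of the inconsistency described above; the store layout (msgs/<n>.blk under the stream directory) and the placeholder path and block index are assumptions for illustration, not the server's actual recovery code:

```go
// Sketch: the stream's full-state file (index.db) can reference a message
// block that never reached disk before power was cut. If the referenced
// block is missing, recovery cannot see the newest messages and the stream's
// LastSeq ends up lower than the consumers' persisted ack floors.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// lastBlockExists reports whether the block index recorded as "last" in the
// stream's full state is actually present in the store directory.
func lastBlockExists(streamDir string, lastBlkIndex uint64) bool {
	// Assumed layout: <streamDir>/msgs/<index>.blk
	blk := filepath.Join(streamDir, "msgs", fmt.Sprintf("%d.blk", lastBlkIndex))
	_, err := os.Stat(blk)
	return err == nil
}

func main() {
	streamDir := "/data/jetstream/$G/streams/ORDERS" // placeholder path
	if !lastBlockExists(streamDir, 42) {             // 42 is a placeholder index
		fmt.Println("index.db references a block that is not on disk; " +
			"recovery will not see the newest messages")
	}
}
```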
I tested various sync time periods, including 'sync: always', but it doesn't seem to have much effect:
The related issue is #4566 . I believe this is a server-side problem and should not be circumvented by client-side operations like deleting and re-adding consumers.
This issue is beyond my ability to fix quickly. I earnestly seek assistance. @derekcollison @bruth
If you need more information or logs, please feel free to contact me, and I will provide them as soon as possible.
Expected behavior
After the server is unexpectedly powered off and then restarted, all consumers should be able to self-recover, and messages should be consumed normally.
Some degree of data loss is acceptable.
Server and client version
I conducted power-off tests on the following server versions, and all exhibited the issue:
All client versions have been cross-tested:
Host environment
I have used multiple servers for testing, and have tested various file systems (xfs, ext4) and types of disks (NAND flash, SSD, mechanical hard disk) to rule out the possibility of randomness.
e.g.:
Steps to reproduce
After normal production and consumption for a period of time, cut the server's power supply directly and then restart; the issue reproduces almost 100% of the time.
Using reboot or kill -9 does not seem to simulate this process. A sketch of the traffic pattern is below.
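For reference, a hedged sketch of the traffic pattern described in this thread (steady publishes, a pull consumer keeping the stream near empty), using the nats.go client. The stream name, the work-queue retention policy, and the payload size are assumptions taken from the reports above, not a confirmed reproduction; run it while cutting power to the host, then reboot and restart nats-server:

```go
// Sketch of the load pattern reported in this issue: ~1 msg/s of ~50KB
// payloads into a file-backed work-queue stream, drained quickly by a
// durable pull consumer.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Work-queue stream similar to the setups described above.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:      "WQ",
		Subjects:  []string{"wq.>"},
		Retention: nats.WorkQueuePolicy,
		Storage:   nats.FileStorage,
	}); err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer that keeps the stream near empty, which is the
	// state in which the reports above hit the bug.
	sub, err := js.PullSubscribe("wq.>", "worker")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		for {
			msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
			if err != nil {
				continue // fetch timeout when the stream is empty
			}
			for _, m := range msgs {
				m.Ack()
			}
		}
	}()

	// Publish roughly one message per second; cut power to the host while
	// this loop is running.
	payload := make([]byte, 50*1024) // ~50KB payloads, as in one report
	for i := 0; ; i++ {
		if _, err := js.Publish("wq.test", payload); err != nil {
			log.Printf("publish %d failed: %v", i, err)
		}
		time.Sleep(time.Second)
	}
}
```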