
Qdrant on ECS, with efs storage mounted: Service internal error: RocksDB open error: IO error: While lock file, Resource temporarily unavailable #4145

Open
msciancalepore98 opened this issue Apr 30, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@msciancalepore98

I have deployed a Qdrant instance on ECS, where the storage path is mounted on an EFS disk.
Scenario: the first deploy goes just fine and everything works. If I then re-trigger the ECS task deployment for any reason, a RocksDB-related problem occurs:

Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 261: Failed to load local shard "./storage/collections/test-collection/0": Service internal error: RocksDB open error: IO error: While lock file: ./storage/collections/test-collection/0/segments/6ccb0e5a-6176-49d0-8eb7-6c5eb4b0d3b8/LOCK: Resource temporarily unavailable

From this error, it seems that a rolling update of Qdrant on ECS/k8s could never work: the first replica is still attached to the storage while the new replica concurrently tries to attach to the same storage in read/write mode. Am I missing anything here? It also seems that only one RocksDB process can open the same DB at a time. Right now this is a big blocker and I haven't found any solution. Any ideas?

(A basic workaround would be to store data in the container's ephemeral storage, but then I would lose all the vectors if the container restarts.)
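
A possible mitigation, assuming the problem really is the old and new tasks overlapping during a rolling deployment, is to configure the ECS service so it stops the old task before starting the new one. A rough sketch with the AWS CLI (cluster and service names are placeholders):

# Stop the old task before starting the new one, so only one Qdrant process
# ever has the EFS-backed storage directory open at a time.
aws ecs update-service \
  --cluster my-cluster \
  --service qdrant-service \
  --deployment-configuration "maximumPercent=100,minimumHealthyPercent=0"

The trade-off is a short window of downtime on each deployment, since the service briefly drops to zero running tasks.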

msciancalepore98 added the bug label Apr 30, 2024
@generall
Member

How do you deploy Qdrant in your k8s?

We have several ready-made solutions, like https://github.com/qdrant/qdrant-helm or Hybrid Cloud (https://hybrid-cloud.qdrant.tech/), where those problems are all resolved.

@msciancalepore98
Author

It's just a simple ECS task deployment, where EFS is used to provide persistence. Is it possible that I need a non-shareable disk? I saw in that Helm example that the PVC is of type ReadWriteOnce.

@msciancalepore98
Author

msciancalepore98 commented May 3, 2024

@generall If I delete all the LOCK files on disk using:

sudo rm */0/segments/*/payload_index/LOCK && sudo rm */0/segments/*/LOCK

and then trigger a rolling update, it goes fine and the new Qdrant instance recreates the LOCK files.

Now, why is this "Failed to load local shard" error happening even when no other Qdrant instance is up at the same time? In that situation no process is holding the LOCK at all, so the new Qdrant instance should be able to access the collection shards and restore them properly.

Also, when a Qdrant instance is shut down, it should clean up the LOCK files properly. (I can see this locally as well: even after Qdrant is shut down, LOCK files are left all over the place. Is there a reason for this? What makes it weirder is that I cannot reproduce this panic locally.)
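
For debugging, one way to check whether any local process actually still holds a LOCK file before deleting it; a rough sketch assuming the storage layout from the paths above and that lsof/fuser are available on the host:

# No output means no local process has the file open.
sudo lsof ./storage/collections/test-collection/0/segments/*/LOCK
# Alternative: fuser prints the PIDs of processes using the file.
sudo fuser -v ./storage/collections/test-collection/0/segments/*/LOCK
# Caveat: on EFS/NFS the lock may be held by a process on another host or task,
# which a local lsof/fuser will not show.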

@generall
Member

generall commented May 3, 2024

Hey @msciancalepore98, I can't give you any guarantees about how Qdrant will behave, or about what is expected to happen or not, if you keep butchering storage internals like this.

@msciancalepore98
Author

If you could help with some proper debugging hints, that would be great as well; I am trying different things to get to the root cause of this behaviour with EFS.

Also, I can't use any auto-managed solution in my environment, only deploying tasks on ECS.

@generall
Member

generall commented May 3, 2024

We never tested Qdrant on EFS, and I am not sure it is a good idea to use it. Also, I don't know what exactly you are trying to do, but if you are trying to mount the same FS to multiple instances of Qdrant, it is not going to work.

@ryanlee588

I am facing a similar issue. Commenting to stay up to date.

@timvisee
Member

timvisee commented May 3, 2024

Is it possible that I need a non-shareable disk?

Correct. At least, when it comes to file shares, we do recommend not using them.

Also, each instance must have its very own storage directory. These cannot be shared. The cluster itself will take care of sharing all your data across the cluster and putting it in each storage directory separately.

We never tested qdrant on EFS

@generall In their FAQ they do promise strong consistency and support for proper file locking. But I also feel like we've seen issues with this before.
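
To make the "very own storage directory" point concrete, a minimal sketch with two local containers and separate host paths (paths and ports are placeholders; cluster configuration is omitted):

# Each Qdrant node gets its own, non-shared storage directory.
docker run -d --name qdrant-node-0 -p 6333:6333 \
  -v /mnt/qdrant-node-0/storage:/qdrant/storage qdrant/qdrant
docker run -d --name qdrant-node-1 -p 6343:6333 \
  -v /mnt/qdrant-node-1/storage:/qdrant/storage qdrant/qdrant
# Mounting the same /qdrant/storage into both containers would reproduce the LOCK error.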

@janicetyp

janicetyp commented May 3, 2024

Hi @timvisee, I'm encountering a similar issue. Do you have any advice on how to preserve the existing collections while resolving the LOCK error? The Qdrant instance we've got running on ECS keeps crashing for this reason, and I don't see a way to resolve it without rebuilding the whole thing from scratch. TIA, appreciate the help!

To add a bit more info: we're deploying Qdrant on ECS with an EFS mount. We were facing the "too many open files" error and increased the limit to 120k, but soon after we encountered a disk quota error. After asking on the Qdrant Discord, we tried to upgrade from 1.6.1 to 1.9.0, which did not resolve the issue, and we are now facing this LOCK problem after reverting to 1.6.1 with the same setup.
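
For reference, a quick way to check the effective limit and the open-file count of the running Qdrant process (a rough sketch; the pgrep pattern and shell access to the container are assumptions):

# Per-process open-file limit in the current shell.
ulimit -n
# Limit as seen by the running Qdrant process, and how many files it has open right now.
QDRANT_PID=$(pgrep -f qdrant | head -n1)
grep "Max open files" /proc/"$QDRANT_PID"/limits
ls /proc/"$QDRANT_PID"/fd | wc -l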

@timvisee
Member

timvisee commented May 3, 2024

And this happens on every restart, and you're 100% sure you don't have another instance running on the same data?

To be honest, I'm not entirely sure. We haven't hit this ourselves yet.

You might end up having to purge lock files yourself, but I have no idea what other damage that might do.

@pvieito

pvieito commented May 3, 2024

Hey @timvisee @generall:

And this happens on every restart, and you're 100% sure you don't have another instance running on the same data?

This is an issue, for example, when you deploy Qdrant in a service that automatically monitors and relaunches it on failure, like ECS or Kubernetes. Imagine Kubernetes is doing a health check on the Qdrant endpoint; the check starts to fail, so Kubernetes launches a new Qdrant to replace the old one and connects it to the same storage, which still contains the LOCKs from the failed instance. Qdrant should have some sort of env var or configuration option to do a clean-up on start and remove any locks left over from previous failed instances/runs.
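
Until something like that exists, one workaround is a container entrypoint wrapper that does the clean-up before launching Qdrant. A rough sketch only, not an official Qdrant feature, and safe only if you can guarantee no other instance is running against the same storage when the container starts:

#!/bin/sh
# Hypothetical entrypoint wrapper, not part of Qdrant itself.
set -e
STORAGE_DIR="${STORAGE_DIR:-/qdrant/storage}"   # assumed mount point for the EFS volume
# Remove stale RocksDB LOCK files left behind by a previously crashed instance.
find "$STORAGE_DIR" -type f -name LOCK -delete
# Start Qdrant (binary path as in the official image; adjust if different).
exec /qdrant/qdrant "$@"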

@timvisee
Member

timvisee commented May 3, 2024

Qdrant should have some sort of env-var or configuration to do a clean-up on start and remove any locks from previous failed instances / runs.

As far as I'm aware, it does this already.

Running locally and killing with kill -9 doesn't show this. We don't see this problem in normal k8s operation either. That's why I wonder whether locking on EFS is as good as they promise it to be.

Or are you saying the failed instance is still running while the new instance starts? In that case this would be expected behavior and that should be prevented.

I'll try to do some debugging later to see whether I can catch the same problem.
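
One quick experiment for that debugging: from two tasks/hosts that both mount the same EFS volume (the mount path below is a placeholder), try to take the same advisory lock from both sides. RocksDB uses fcntl-style locks rather than flock, so this is only a rough proxy, but it should show whether EFS lock semantics hold across clients:

# On host/task A: hold an exclusive lock on a test file for 60 seconds.
flock -x /mnt/efs/lock-test -c 'echo "A holds the lock"; sleep 60'
# On host/task B, while A is sleeping: attempt a non-blocking lock on the same file.
flock -n -x /mnt/efs/lock-test -c 'echo "B got the lock - locking is broken"' \
  || echo "B could not get the lock - locking works"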
