[Bug]: vector_id update error in document_stores.faiss in pipelines #8346

Open
1 task done
MaxHouxu opened this issue Apr 29, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

MaxHouxu commented Apr 29, 2024

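Software environment

- paddlepaddle: 2.6.0
- paddlepaddle-gpu: 
- paddlenlp: 2.6.1

Duplicate issues

  • I have searched the existing issues

Bug description

After documents are deleted and new ones are added, vector_id values are duplicated, which causes a conflict in the SQL update.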

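Steps to reproduce & code

When update_embeddings adds new data, vector_id is assigned sequentially, starting from the current document count:

# Starting id: total number of vectors across all FAISS indexes
vector_id = sum([index.ntotal for index in self.faiss_indexes.values()])

# Each document in the batch gets the next id, suffixed with the index name
vector_id_map = {}
for doc in document_batch:
    vector_id_map[str(doc.id)] = str(vector_id) + "_" + index
    vector_id += 1
self.update_vector_ids(vector_id_map, index=index)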

After documents are deleted and new ones are added, vector_id values are duplicated, which causes a conflict in the SQL update:

(sqlite3.IntegrityError) UNIQUE constraint failed: document.vector_id
[SQL: UPDATE document SET vector_id=CASE document.id WHEN ? THEN ? WHEN ? THEN ? WHEN ? THEN ? END WHERE document.id IN (?, ?, ?) AND document."index" = ?]
[parameters: ('a11c96f3f9729487bb584f52e404a5a', '275_faiss_index', 'cd10279366bb16cd8a48696b179bdd3', '276_faiss_index', 'd81e0d6d84af695310371c6d915e2293', '277_faiss_index', 'a11c96f3f9729487bb584f52e404a5a', 'cd10279366bb16cd8a48696b179bdd3', 'd81e0d6d84af695310371c6d915e2293', 'faiss_index')]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
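
The collision described above can be illustrated without FAISS or SQLite. The following is only a minimal sketch of the numbering logic, not the library's code: `next_vector_ids` and `find_collisions` are hypothetical helpers, a plain dict stands in for the UNIQUE `document.vector_id` column, and the index is assumed to report a smaller `ntotal` after a deletion.

```python
# Toy model (assumption): a dict stands in for the SQL `document.vector_id`
# column, and an integer stands in for the FAISS index's ntotal.

def next_vector_ids(ntotal, doc_ids, index="faiss_index"):
    """Mirror the numbering quoted above: start at ntotal and count up."""
    return {doc_id: f"{ntotal + offset}_{index}" for offset, doc_id in enumerate(doc_ids)}


def find_collisions(stored, new_ids):
    """Return the vector_ids in new_ids that another document already holds."""
    owners = {vid: doc_id for doc_id, vid in stored.items()}
    return sorted(
        vid for doc_id, vid in new_ids.items()
        if vid in owners and owners[vid] != doc_id
    )


# Three documents are indexed and receive vector_ids 0, 1 and 2.
stored = next_vector_ids(0, ["a", "b", "c"])

# Document "a" is deleted. The index now holds 2 vectors, but document "c"
# still owns "2_faiss_index" in the SQL table.
del stored["a"]
ntotal_after_delete = 2

# A new document "d" is numbered starting from ntotal, i.e. from 2.
new_ids = next_vector_ids(ntotal_after_delete, ["d"])
collisions = find_collisions(stored, new_ids)
print(collisions)  # ['2_faiss_index']
```

In this toy model the new document is given an id that a surviving document still holds, which would correspond to the UNIQUE constraint failure in the log above.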

MaxHouxu added the bug label Apr 29, 2024

w5688414 (Contributor) commented

Please provide a minimal code example that reproduces the problem so that we can locate it quickly.

MaxHouxu (Author) commented May 16, 2024

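After delete_documents in FAISSDocumentStore removes some documents from the vector store, the FAISS index automatically renumbers its vectors so that the ids stay contiguous. The vector_id stored in the SQL document metadata is not updated, however, so the two no longer match.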
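
One possible direction, sketched here under assumptions rather than as the project's actual fix, is to renumber the surviving vector_id values after a deletion so that they follow the index's compacted positions. `resync_vector_ids` is a hypothetical helper, the dict again stands in for the SQL metadata, and the index is assumed to shift every later vector down when one is removed, as described in the comment above.

```python
def resync_vector_ids(stored, deleted_doc_ids, index="faiss_index"):
    """Renumber surviving documents to match compacted index positions.

    `stored` maps doc_id -> "<position>_<index>". Assumption: removing a
    vector shifts all later vectors down by one, preserving their order.
    """
    deleted = set(deleted_doc_ids)
    # Sort survivors by their old numeric position to keep relative order.
    survivors = sorted(
        (int(vid.split("_", 1)[0]), doc_id)
        for doc_id, vid in stored.items()
        if doc_id not in deleted
    )
    return {
        doc_id: f"{new_pos}_{index}"
        for new_pos, (_, doc_id) in enumerate(survivors)
    }


stored = {"a": "0_faiss_index", "b": "1_faiss_index", "c": "2_faiss_index"}
resynced = resync_vector_ids(stored, ["a"])
print(resynced)  # {'b': '0_faiss_index', 'c': '1_faiss_index'}

# With 2 vectors left, the next id handed out would be "2_faiss_index",
# which no surviving document holds after the renumbering.
next_id = f"{len(resynced)}_faiss_index"
print(next_id in resynced.values())  # False
```

This only models the bookkeeping. A real change would also have to write the renumbered ids back to the SQL table, which this sketch does not attempt.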
