Cleanlab support for text/markdown note/document? #732

Answered by jwmueller
vasnt asked this question in Q&A
Yes. If you break the text into reasonably sized chunks, use a model to compute a vector embedding for each chunk, and then pass those embeddings as features into Datalab, it will automatically find exact and near-duplicate chunks for you. Learn more about Datalab here:

https://docs.cleanlab.ai/stable/tutorials/datalab/text.html
https://docs.cleanlab.ai/stable/tutorials/datalab/datalab_quickstart.html

Datalab will also detect other issues in your data, such as outliers.
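
The workflow described above could look roughly like the sketch below. It is only an illustration under some assumptions: `chunk_text` and `find_duplicate_chunks` are hypothetical helper names, the paragraph/word-count chunking rule is just one simple choice, and `sentence-transformers` with the `all-MiniLM-L6-v2` model is an assumed embedding model (any model that returns one vector per chunk should work). The Datalab calls follow the tutorials linked above.

```python
from typing import List


def chunk_text(text: str, max_words: int = 100) -> List[str]:
    """Split a document into chunks of at most `max_words` words.

    Hypothetical helper: paragraphs (separated by blank lines) are kept
    apart, and long paragraphs are cut into consecutive word windows.
    """
    chunks: List[str] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def find_duplicate_chunks(chunks: List[str]):
    """Embed each chunk and ask Datalab for near-duplicate (and outlier) issues.

    Third-party imports live inside the function so the chunking helper
    above can be used without them installed.
    """
    from cleanlab import Datalab
    from sentence_transformers import SentenceTransformer  # assumed embedding library

    # Assumption: any sentence-embedding model could be substituted here.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks)  # array of shape (n_chunks, embedding_dim)

    # No labels are needed; the embeddings are passed as features.
    lab = Datalab(data={"text": chunks})
    lab.find_issues(
        features=embeddings,
        issue_types={"near_duplicate": {}, "outlier": {}},
    )
    # One row per chunk, with a flag, a score, and the set of near duplicates.
    return lab.get_issues("near_duplicate")
```

Calling `find_duplicate_chunks(chunk_text(open("notes.md").read()))` (with `notes.md` standing in for your own file) would return a table with one row per chunk; rows where `is_near_duplicate_issue` is true are the ones worth reviewing.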

Answer selected by vasnt