Can Cleanlab find duplicate data within a text/markdown note or document?
Answered by
jwmueller
May 30, 2023
Answer selected by
vasnt
Yes. If you break the text into reasonably sized chunks, use a model to obtain vector embeddings for each chunk, and then pass them as `features` into Datalab, it will automatically find nearly (and exactly) duplicated chunks for you (it will also detect other issues in your data, like outliers). Learn more about Datalab here:
https://docs.cleanlab.ai/stable/tutorials/datalab/text.html
https://docs.cleanlab.ai/stable/tutorials/datalab/datalab_quickstart.html
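
The steps above could look roughly like the sketch below. It is an illustration, not an official recipe: `chunk_text` and `find_duplicate_chunks` are hypothetical helper names, the paragraph-and-word-count chunking rule is just one possible choice, and the code assumes a recent cleanlab release (installed with the `datalab` extra) in which `Datalab` can be created without a `label_name`. The embeddings are expected to come from whatever model you prefer.

```python
from typing import List


def chunk_text(text: str, max_words: int = 100) -> List[str]:
    """Split a document into paragraph-based chunks of at most max_words words.

    Hypothetical helper: the chunking rule here is an assumption, not part of cleanlab.
    """
    chunks = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Long paragraphs are cut into consecutive windows of max_words words;
        # empty paragraphs produce no chunk at all.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def find_duplicate_chunks(chunks, embeddings):
    """Run Datalab's near-duplicate check on precomputed chunk embeddings.

    `embeddings` should be a 2D numpy array with one row per chunk.
    """
    # Imported lazily so the chunking helper works without cleanlab installed.
    from cleanlab import Datalab

    # Assumption: no labels are needed when only checking for duplicates.
    lab = Datalab(data={"text": chunks})
    lab.find_issues(features=embeddings, issue_types={"near_duplicate": {}})
    issues = lab.get_issues("near_duplicate")
    # Keep only the rows (chunks) that Datalab flagged as near duplicates.
    return issues[issues["is_near_duplicate_issue"]]
```

To use it, compute `chunks = chunk_text(document)`, embed the chunks with any model (for instance, a sentence-transformers model's `encode` method returns a suitable array), and call `find_duplicate_chunks(chunks, embeddings)`. The check relies on nearest neighbors, so it is likely to need more than a handful of chunks to run.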