Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH]: adding capability for Spacy embeddings #2049

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

Vishnunkumar
Copy link

Description of changes

  • New functionality: added feature to leverage spacy models

Test plan

  • will add unit test cases

Documentation Changes

  • Docstrings are added with proper Error Handling.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@tazarov
Copy link
Contributor

tazarov commented Apr 26, 2024

@Vishnunkumar, this looks good. Thank you.

Can you also add a test for this under chromadb/test/ef/test_spacy_ef.py? We're in the process of refactoring the embedding functions DX with an emphasis on testing and usability.

Did you also create a PR for the docs with links to official docs etc?

Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few points to address:

  • Tests
  • Docs (unless already done)
  • Additional options on the EF:
    • can spacy cache models (if yes, add caching dir with a default value)
    • can other models be loaded other than en_code_web_{model} (if yes, maybe make things more flexible
    • Are there any other constraints that can/should be applied to the library that can/are useful?
    • Is batching supported where multiple docs can be passed in a single round instead of a loop?

@Vishnunkumar
Copy link
Author

Vishnunkumar commented Apr 27, 2024

@Vishnunkumar, this looks good. Thank you.

Can you also add a test for this under chromadb/test/ef/test_spacy_ef.py? We're in the process of refactoring the embedding functions DX with an emphasis on testing and usability. - Working on it with ETA of 2 days

Did you also create a PR for the docs with links to official docs etc? - will look into this

Working on the comments, Thanks for the review @tazarov

@Vishnunkumar
Copy link
Author

Sharing the snap of report for test cases.

image

@Vishnunkumar
Copy link
Author

Vishnunkumar commented Apr 28, 2024

  • Tests - Added unit test cases
  • Docs PR - working on it: Raised a PR for the same
  • spacy models doesn't have any seperate process for caching, please look at this for reference
  • can other models be loaded other than en_code_web_{model} (if yes, maybe make things more flexible - Possible models with weights are "md" and "lg", the function loads lg model by default if they are downloaded already. else it will throw an error.
  • I believe the below piece of code is handling the multiple docs when the input is list of multiple docs
    cast(Embeddings, [list(self._nlp(doc).vector.astype("float")) for doc in input])

@tazarov
Copy link
Contributor

tazarov commented Apr 29, 2024

@Vishnunkumar, looking at spacy docs, I can see that there are many models also, some in other languages - https://spacy.io/models. Also, I see some additional models that trade off accuracy for efficiency. Can we make it so that people can specify those other models as well? e.g., specify the whole model - en_core_web_sm or de_core_news_lg?

@Vishnunkumar
Copy link
Author

@Vishnunkumar, looking at spacy docs, I can see that there are many models also, some in other languages - https://spacy.io/models. Also, I see some additional models that trade off accuracy for efficiency. Can we make it so that people can specify those other models as well? e.g., specify the whole model - en_core_web_sm or de_core_news_lg?

  • Understood, I was going only for english models, will update on this accordingly.

@Vishnunkumar Vishnunkumar changed the title Spacy Embeddings [ENH]: adding capability for Spacy embeddings May 1, 2024
@Vishnunkumar
Copy link
Author

Vishnunkumar commented May 2, 2024

  • Tests - Added unit test cases
  • Docs PR - working on it: Raised a PR for the same. PR#244
  • spacy models doesn't have any seperate process for caching, please look at this for reference
  • can other models be loaded other than en_code_web_{model} (if yes, maybe make things more flexible - Possible models with weights are "md" and "lg", the function loads lg model by default if they are downloaded already. else it will throw an error.
  • I believe the below piece of code is handling the multiple docs when the input is list of multiple docs
    cast(Embeddings, [list(self._nlp(doc).vector.astype("float")) for doc in input])

Hi @tazarov , I have update with necessary requirments. Please review when you are avaiable

Copy link

vercel bot commented May 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
chroma ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 13, 2024 4:40am

Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. Thanks @Vishnunkumar.

There are only a couple of minor things to look at.

chromadb/test/ef/test_spacy_ef.py Show resolved Hide resolved
chromadb/test/ef/test_spacy_ef.py Show resolved Hide resolved

def test_spacyembddingfunction_throwserror_whenmodel_notfound():
with pytest.raises(ValueError,
match="""spacy models are not downloaded yet, please download them using `spacy download model_name`, Please checkout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a portion of this text as match can do loose matching.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@Vishnunkumar
Copy link
Author

This is good. Thanks @Vishnunkumar.

There are only a couple of minor things to look at.

Hi @tazarov , I will look into these.

@Vishnunkumar
Copy link
Author

Vishnunkumar commented May 6, 2024

This is good. Thanks @Vishnunkumar.
There are only a couple of minor things to look at.

Hi @tazarov , I will look into these. Please review as per availability

Hi @tazarov, I have updated the changes.

@Vishnunkumar Vishnunkumar requested a review from tazarov May 6, 2024 12:01
Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can I ask you to run the following to make sure everything is formatted as it should:

pip install -r requirements_dev.txt # install dev deps
pre-commit run --all-files trailing-whitespace                                                                                                                                                                                                                                                      
pre-commit run --all-files mixed-line-ending
pre-commit run --all-files end-of-file-fixer
pre-commit run --all-files requirements-txt-fixer
pre-commit run --all-files check-xml
pre-commit run --all-files check-merge-conflict
pre-commit run --all-files check-case-conflict
pre-commit run --all-files check-docstring-first
pre-commit run --all-files black
pre-commit run --all-files flake8
pre-commit run --all-files prettier
pre-commit run --all-files check-yaml

@tazarov
Copy link
Contributor

tazarov commented May 9, 2024

@Vishnunkumar, thanks for sticking with this. I think this is ready to merge now.

@Vishnunkumar
Copy link
Author

@Vishnunkumar, thanks for sticking with this. I think this is ready to merge now.

Hey @tazarov , thanks for reviewing and helping me throughout the PR. Always feeling great to contribute to open source communities.

@Vishnunkumar
Copy link
Author

Vishnunkumar commented May 9, 2024

Hey @tazarov, just wanted to check who will be merging this PR? Also it seems the docs repository has been archived. Is there any other repo that I should look for adding docs.

@tazarov
Copy link
Contributor

tazarov commented May 12, 2024

@Vishnunkumar, we've recently merged the docs repo here alongside a big refactor. The new docs reside in this repo now under https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com/pages/integrations. Just create a new .md, and don't forget to update the sidebar.js

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants