
Add support for llm in spacy init config #12820

Draft · wants to merge 8 commits into base: v4

Conversation

@rmitsch (Contributor) commented on Jul 13, 2023

Description

Add support for the llm component in spacy init config.

Types of change

Enhancement.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@rmitsch added labels enhancement (Feature requests and improvements) and feat / cli (Feature: Command-line interface) on Jul 13, 2023
@rmitsch rmitsch self-assigned this Jul 13, 2023
@rmitsch (Contributor, Author) commented on Jul 14, 2023

As of now spacy-llm is required if llm is added to the pipeline. This is because we need its registry entries to resolve --llm.task ner to e.g. spacy.NER.v2 (note that in the current state we don't select the highest version; this is still a TODO). This means we'd have to install spacy-llm for the CI tests though. Are we ok with that?

Alternatively, we could also require the registry handle to be set explicitly, e.g. --llm.task spacy.NER.v2. In this case there's no need to require spacy-llm, but it's somewhat less consistent with the way --pipeline is configured.

Opinions?
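
(To make the dependency concrete, here is a minimal sketch of the kind of registry lookup described above. It is only an illustration, not code from this PR, and it assumes spacy-llm is installed so that its task registry is importable as spacy_llm.registry.registry and queryable via llm_tasks.get_all(); the helper name is made up.)

from typing import List

# NOTE: spacy_llm is only importable when spacy-llm is installed.
from spacy_llm.registry import registry


def matching_task_handles(user_value: str) -> List[str]:
    """Collect registered spacy-llm task handles whose name matches a short alias like "ner"."""
    handles: List[str] = []
    for reg_name in registry.llm_tasks.get_all():  # e.g. "spacy.NER.v1", "spacy.NER.v2", ...
        parts = reg_name.split(".")  # "spacy.NER.v2" -> ["spacy", "NER", "v2"]
        if len(parts) == 3 and parts[1].lower() == user_value.lower():
            handles.append(reg_name)
    return handles

For "ner" this would collect both spacy.NER.v1 and spacy.NER.v2, which is where the version-selection question discussed further down comes in.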

@@ -3,7 +3,7 @@ the docs and the init config command. It encodes various best practices and
can help generate the best possible configuration, given a user's requirements. #}
{%- set use_transformer = hardware != "cpu" and transformer_data -%}
{%- set transformer = transformer_data[optimize] if use_transformer else {} -%}
- {%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "span_finder", "spancat", "spancat_singlelabel", "trainable_lemmatizer"] -%}
+ {%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "span_finder", "spancat", "spancat_singlelabel", "trainable_lemmatizer", "llm"] -%}
Contributor:

llm shouldn't be in this list of components with listeners, which will cause tok2vec or transformer to be added (depending on the rest of the template, of course).

Contributor (Author):

Makes sense. I vaguely remember that the LLM block was not output properly if "llm" was not in listener_components. I'll check.

Comment on lines +600 to +610
{% if "llm" in components -%}
[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "{{ llm_spec['model'] }}"

[components.llm.task]
@llm_tasks = "{{ llm_spec['task'] }}"
{% endif -%}

Contributor:

I think (depending on exactly how this gets designed in the end), it should be possible to only have this block once because it does not depend on the listener setup.

@pytest.mark.parametrize("pipeline", [["llm"]])
@pytest.mark.parametrize("llm_model", ["noop"])
@pytest.mark.parametrize("llm_task", ["ner", "sentiment"])
def test_init_config_llm(pipeline, llm_model, llm_task):
Contributor:

I think this test should probably be in spacy-llm unless it's installed by default (similar to spacy-transformers, where I'm not sure this is tested at all, but where it could be tested).

@svlandeg (Member) commented on Jul 17, 2023:

IMO we should consider installing spacy-llm by default going forward. At some point we want to move it to the core codebase, but for now, having it installed by default + kept as separate repo has the advantages that we can still release updates more quickly with spacy-llm while not troubling users with an additional install command. The requirements are minimal.

@svlandeg (Member) left a comment:

This is currently not working for most open-source HF models, as they don't have a default name the way e.g. "spacy.GPT-3-5.v1" does.

For instance, spacy init config myconfig.cfg -p "llm" --llm.model "Falcon" --llm.task 'ner'

will fail with

✘ Config validation error
llm.model -> name       field required
{'@llm_models': 'spacy.Falcon.v1'}
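
(For HF models the generated [components.llm.model] block would additionally need an explicit name on top of the registered handle, roughly like the snippet below; the concrete model name here is only illustrative and is not something this PR currently fills in.)

[components.llm.model]
@llm_models = "spacy.Falcon.v1"
name = "falcon-7b-instruct"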

valid_values.add(reg_name)
if reg_name.lower() == user_value:
    spec["matched_reg_handle"] = reg_handle
    break
Member:

If there are 2 versions of a model/task, is this code guaranteed to give you the latest version?

Member:

It doesn't - I just checked - it produces spacy.NER.v1 instead of spacy.NER.v2 when given "ner"

Contributor (Author):

No, it just grabs the first occurrence as of now. I left that as TBD since it wasn't clear how we want to go forward.
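
(For reference, a sketch of how the highest version could be preferred over the first match, assuming handles follow the spacy.<Name>.v<N> pattern; this helper is hypothetical and not part of the PR.)

import re
from typing import Iterable, Optional


def latest_handle(handles: Iterable[str], user_value: str) -> Optional[str]:
    """Return the matching handle with the highest version, e.g. spacy.NER.v2 over spacy.NER.v1."""
    pattern = re.compile(r"^spacy\.(?P<name>[^.]+)\.v(?P<version>\d+)$")
    best, best_version = None, -1
    for handle in handles:
        match = pattern.match(handle)
        if match and match.group("name").lower() == user_value.lower():
            version = int(match.group("version"))
            if version > best_version:
                best, best_version = handle, version
    return best

With this, latest_handle(["spacy.NER.v1", "spacy.NER.v2"], "ner") returns "spacy.NER.v2".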

@svlandeg svlandeg changed the base branch from master to main January 29, 2024 09:53