[WIP] Support adding pipeline component instances #12710

Open
honnibal wants to merge 16 commits into base branch v4
Conversation

honnibal (Member)

Add a method Language.add_pipe_instance that allows you to add component instances to the pipeline. This is primarily intended to make it easier to assemble a pipeline dynamically in a script, which is useful for exploration or for situations where you don't care so much about serialisation.

Components that are added this way cannot be reconstructed automatically during deserialisation. Instead, a pipe_instances argument is added to spacy.load().

Example usage 1: creating an NLP object with a function, and using nlp.to_disk() and nlp.from_disk().

If users wish to assemble the pipeline themselves in a function, they don't need to worry about the config system. They just have to make sure they use the same function to construct the pipeline each time.

def make_nlp():
    nlp = spacy.blank("en")
    tagger = Tagger(nlp.vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
    nlp.add_pipe_instance(tagger)
    return nlp

nlp = make_nlp()
nlp.to_disk("/tmp/my_nlp")
# Later
nlp = make_nlp()
nlp.from_disk("/tmp/my_nlp")

Example usage 2: Passing component instances to spacy.load()

If pipe instances are added directly, we won't be able to reconstruct them in spacy.load(). Instead of just making the pipeline impossible to load this way, you can pass the missing instances using the pipe_instances argument. This approach is kind of impractical for components that need the Vocab instance, but I definitely think it's better than just not supporting it.

vocab = Vocab()
tagger = Tagger(vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
nlp = spacy.load("/tmp/my_nlp", vocab=vocab, pipe_instances={"tagger": tagger})

I added some logic to fetch the vocab instance from a pipe instance, so that nlp = spacy.load("/tmp/my_nlp", pipe_instances={"tagger": tagger}) would work. The value of that logic is a bit dubious.
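The PR implements this as a _get_instantiated_vocab() helper; purely as an illustrative sketch (hypothetical names and shape, not the actual implementation), the idea is roughly:

from typing import Any, Dict, Optional
import warnings

from spacy.vocab import Vocab

def vocab_from_pipe_instances(pipe_instances: Dict[str, Any]) -> Optional[Vocab]:
    # Collect the `vocab` attribute from any instance that exposes one.
    vocabs = {
        name: pipe.vocab
        for name, pipe in pipe_instances.items()
        if isinstance(getattr(pipe, "vocab", None), Vocab)
    }
    if not vocabs:
        return None
    first = next(iter(vocabs.values()))
    if any(v is not first for v in vocabs.values()):
        warnings.warn("pipe instances carry more than one distinct Vocab")
    return first

spacy.load() could then fall back on something like this when no vocab argument is passed.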

Questions for discussion

  • We could overload add_pipe, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API.
  • We could support pipe metadata in add_pipe_instance, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.
  • Naming of pipe_instance argument?
  • Fishing the vocab instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the vocab argument be set?
  • Wording of the warnings about vocab instances?
  • Should we check for vocab instances more robustly? We could go through the __dict__, but that still doesn't cover nested objects. I felt that the vocab member was a decent thing to check, albeit not watertight. (See the sketch after this list.)
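A recursive __dict__ scan, for comparison, might look like this (hypothetical sketch; it still misses state held in __slots__ or at the C level):

from typing import Any, List, Optional, Set

from spacy.vocab import Vocab

def find_vocabs(obj: Any, seen: Optional[Set[int]] = None) -> List[Vocab]:
    # Recursively walk __dict__ attributes, collecting Vocab instances.
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return []
    seen.add(id(obj))
    found = []
    for value in getattr(obj, "__dict__", {}).values():
        if isinstance(value, Vocab):
            found.append(value)
        else:
            found.extend(find_vocabs(value, seen))
    return found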

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg added the labels enhancement (Feature requests and improvements) and feat / pipeline (Feature: Processing pipeline and components) on Jun 10, 2023.
@svlandeg (Member) left a comment

I definitely see what you're aiming to do here, and I know that the config system is occasionally a bit of a hurdle for users, so it makes sense to think of alternative ways that don't rely on the config so much.

The "Example usage 1" to me is crucial to understand how this functionality would/should look like in terms of UX. The code that you cited in the PR description won't actually work, because DEFAULT_TAGGER_MODEL is a Config object (aka a dict), not an actual Thinc Model.

If I'm not mistaken, you'd have to do something like this instead:

tok2vec = build_hash_embed_cnn_tok2vec(pretrained_vectors=None, width=96, depth=4, embed_size=2000, window_size=1, maxout_pieces=3, subword_features=True)
model = build_tagger_model(tok2vec=tok2vec)
tagger = Tagger(nlp.vocab, model=model, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)

Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now the docs focus on "spacy.HashEmbedCNN.v2" and don't even mention build_hash_embed_cnn_tok2vec).

Alternatively, we'd have something like this:

RESOLVED_TAGGER_MODEL = registry.resolve(Config().from_str(default_model_config))["model"]

which would be an actual Thinc model, and something we can provide as part of the core library. But it doesn't really address the core issue to me, which is that if you want to go in and change settings of this tagger model, you'd still end up having to edit the default_model_config, so you'd really still be working with the config system.
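For example, tweaking a single setting of the default tagger model would still round-trip through the config string (the import location and key path here are assumptions for illustration):

from thinc.api import Config
from spacy import registry
from spacy.pipeline.tagger import default_model_config  # assumed location

cfg = Config().from_str(default_model_config)
cfg["model"]["tok2vec"]["width"] = 128  # assumed key path into the default config
model = registry.resolve(cfg)["model"]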

We'd probably always run into this kind of config dependency with the default settings, because we've decided for v3 to record all of the default values in these default model config strings (representing dicts), and we explicitly didn't want to repeat them in the functions themselves (like in build_hash_embed_cnn_tok2vec) to avoid this constant overwriting of defaults we had before... 🤔

@svlandeg (Member)

They just have to make sure they use the same function to construct the pipeline each time.

To continue a bit on the usage 1 example: this could become problematic in between spaCy versions, no? If we changed the default tagger model, this make_nlp function would bypass the versioning and could break or produce different results?

@honnibal (Member, Author)

The "Example usage 1" to me is crucial to understand how this functionality would/should look like in terms of UX.

I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now the docs focus on "spacy.HashEmbedCNN.v2" and don't even mention build_hash_embed_cnn_tok2vec).

These are good points to keep in mind. I would suggest getting the core functionality in first, so that there's at least some way to do this, even if the support for the overall workflow in the docs and the library is shaky.

We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the locations in the source code that those names resolve to. I think the VS Code extension does this, right? We need something like that on the command-line as well.
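In the meantime, a rough interactive equivalent might be (sketch; the location shown in the comment is from memory and may differ across versions):

from spacy import registry

func = registry.architectures.get("spacy.HashEmbedCNN.v2")
# Prints something like: spacy.ml.models.tok2vec build_hash_embed_cnn_tok2vec
print(func.__module__, func.__qualname__)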

To continue a bit on the usage 1 example: this could become problematic in between spaCy versions, no? If we changed the default tagger model, this make_nlp function would bypass the versioning and could break or produce different results?

Well, let's say people go and reference undocumented stuff with this workflow. Yes, their things can break between versions. That's a risk they should take in a well-informed way.

Trying to prevent every mistake or shield users from potential problems can be at the expense of flexibility. If we ask what we expect the Python API to be able to do, from a simple flexibility standpoint, to me it definitely should be able to add components by instance.

@svlandeg (Member)

I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

So which use-cases are we really aiming to support here? The matcher doesn't seem applicable either - at least the built-in Matcher can't actually be used as a pipeline component, as it returns a list of matches on __call__, not the annotated docs like other components do.

If we're thinking about stateless functions like the evil_component test case added here, it feels like a small effort to add a one-liner to allow for nlp.add_pipe("evil_instance"):

@Language.component("evil_instance")
def evil_component(doc):
    ...

I can imagine other use-cases where you want to instantiate the component class directly from within code though, like for spacy-llm's (v0.3.0) LLM component:

from spacy_llm.pipeline import LLMWrapper
from spacy_llm.tasks import NERTask
from spacy_llm.backends.rest.openai import OpenAIBackend
from spacy_llm.cache import BatchCache

cache = BatchCache(path=None, batch_size=342, max_batches_in_mem=42)
backend = OpenAIBackend(config={"model": "text-davinci-003"}, strict=False, max_tries=4, interval=1.0, max_request_time=40)
instance = LLMWrapper(vocab=nlp.vocab, task=NERTask, backend=backend, cache=cache, save_io=False)

nlp.add_pipe_instance(instance, name="my_llm")

Like with the tagger use-case I tried to work through in my first example, you end up specifying all values of the Python objects you need to use, because we're currently handling all the defaults through the registered functions & the config.

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.

Agreed - this would be good functionality to have!

@@ -675,6 +675,7 @@ def load_model_from_init_py(
    enable: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
    exclude: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
    config: Union[Dict[str, Any], Config] = SimpleFrozenDict(),
+   pipe_instances: Dict[str, Any] = SimpleFrozenDict(),
@svlandeg (Member) commented on Jun 26, 2023:
Not having this in load_model_from_init_py broke nlp = spacy.load("en_core_web_sm"), which wasn't caught by the test suite :|

@honnibal (Member, Author)

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

I think it's better than nothing, yeah.

I'm interested in whether there's a way we can somehow provide the defaults in Python without duplicating them in the configs. Perhaps we could inspect the function to build the default config? Anyway, we can leave that for a separate question. I'd still want add_pipe_instance even if we don't.
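As a rough sketch of the inspection idea (hypothetical helper; real factories would also need their nested model arguments resolved):

import inspect
from typing import Any, Callable, Dict

def defaults_from_signature(func: Callable) -> Dict[str, Any]:
    # Read keyword defaults straight off a factory function's signature.
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }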

If people build a suite of their own functions and things, they can manage their own defaults, and they should be able to express a reasonable Python API for assembling a pipeline without having to go through our config system.

@svlandeg (Member) left a comment

We could overload add_pipe, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API.

I'm personally not a huge fan of overloading, for type & clarity reasons. For now, I think it's fine as a separate method.

We could support pipe metadata in add_pipe_instance, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.

This metadata typically contains the defaults from the factory, so it seems irrelevant in this case?

Naming of pipe_instance argument?

I can't think of a better one...

Fishing the vocab instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the vocab argument be set?

Should we check for vocab instances more robustly? We could go through the __dict__, but that still doesn't cover nested objects. I felt that the vocab member was a decent thing to check, albeit not watertight.

I think the _get_instantiated_vocab() method looks reasonable as-is and will provide useful warnings when there are conflicts.

self._pipe_meta[name] = self.get_factory_meta(factory_name)
pipe_index = self._get_pipe_index(before, after, first, last)
self._components.insert(pipe_index, (name, pipe_component))
@svlandeg (Member):

We can revert the changes to add_pipe in this PR, if we're not overloading.

@@ -32,6 +32,7 @@ def load(
    enable: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
    exclude: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
    config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
+   pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
@svlandeg (Member):

Suggested change:
- pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
+ pipe_instances: Dict[str, PipeCallable] = util.SimpleFrozenDict(),

cf. the type we're using in add_pipe_instance()
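For reference, that alias is just a typed callable over Doc (as defined in spacy/language.py, if memory serves):

from typing import Callable
from spacy.tokens import Doc

PipeCallable = Callable[[Doc], Doc]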

@svlandeg (Member):

Same comment for the other methods, too.

where you provide a config and let spaCy construct the instance). See 'spacy.load'
for details of how to load back a pipeline with components added by instance.

pipe_instance (Callable[[Doc], Doc]): The component to add.
@svlandeg (Member):

Suggested change:
- pipe_instance (Callable[[Doc], Doc]): The component to add.
+ component (Callable[[Doc], Doc]): The component to add.

last (bool): If True, insert component last in the pipeline.
RETURNS (Callable[[Doc], Doc]): The pipeline component.

DOCS: https://spacy.io/api/language#add_pipe_instance
@svlandeg (Member):

TODO: add documentation to API page

Comment on lines +867 to +869
name = name if name is not None else getattr(component, "name", None)
if name is None:
    raise ValueError("TODO error")
@svlandeg (Member):

Alternatively, we could require name to be a str in this method, and not Optional.

spacy/language.py (outdated, resolved)
Comment on lines 1901 to 1916
if pipe_name not in pipeline:
    opts = ", ".join(pipeline.keys())
    raise ValueError(Errors.E956.format(name=pipe_name, opts=opts))
if pipe_name in pipe_instances:
    if pipe_name in exclude:
        continue
    else:
        nlp.add_pipe_instance(pipe_instances[pipe_name])
# Is it important that we instantiate pipes that
# aren't excluded? It seems like we would want
# the exclude check above. I've left it how it
# is though, in case there's some sort of crazy
# load-bearing side-effects someone is relying on?
pipe_cfg = util.copy_config(pipeline[pipe_name])
raw_config = Config(filled["components"][pipe_name])
if pipe_name not in exclude:
@svlandeg (Member):
I agree that it looks like we should move the exclude check to be the first thing in the for loop - we specifically say that excluded components won't be loaded.

Let's move that to a separate PR though?
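For concreteness, the reordering being discussed would look roughly like this (a sketch over the quoted loop, not the final code):

for pipe_name in pipeline:
    if pipe_name in exclude:
        continue  # excluded components are never constructed at all
    if pipe_name in pipe_instances:
        nlp.add_pipe_instance(pipe_instances[pipe_name])
        continue
    pipe_cfg = util.copy_config(pipeline[pipe_name])
    raw_config = Config(filled["components"][pipe_name])
    # ... construct the component from its config as before ...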

spacy/errors.py Outdated
@@ -219,6 +219,9 @@ class Warnings(metaclass=ErrorsWithCodes):
    W125 = ("The StaticVectors key_attr is no longer used. To set a custom "
            "key attribute for vectors, configure it through Vectors(attr=) or "
            "'spacy init vectors --attr'")
+   W126 = ("Pipe instance '{name}' is being added with a vocab "
+           "instance that will not match other components. This is "
@svlandeg (Member):
Suggested change:
- "instance that will not match other components. This is "
+ "instance that does not match other components. This is "

@svlandeg changed the base branch from master to main on January 29, 2024 at 09:52.