Values in json_schema_extra get sorted automatically during model_json_schema #508

Open
1 of 8 tasks
mwildehahn opened this issue Mar 16, 2024 · 4 comments

Comments

@mwildehahn
Contributor

  • This is actually a bug report.
  • I am not getting good LLM Results
  • I have tried asking for help in the community on discord or discussions and have not received a response.
  • I have tried searching the documentation and have not found an answer.

What Model are you using?

  • gpt-3.5-turbo
  • gpt-4-turbo
  • gpt-4
  • Other (please specify)

Describe the bug
This isn't a bug with instructor per se, but the keys of any examples you add to json_schema_extra get sorted alphabetically: pydantic/pydantic#7580. This can hurt generation quality: if you have a "thought" field to trigger CoT, for instance, it can end up after the fields it is supposed to precede, which steers the model towards generating it late too.

There should be an option to avoid this in pydantic, but I thought it was worth noting in the example since this threw me off.

@jxnl
Owner

jxnl commented Mar 17, 2024

i've always used chain_of_thought and never noticed, will talk to the pydantic team

@mwildehahn
Contributor Author

mwildehahn commented Mar 17, 2024 via email

@jxnl
Owner

jxnl commented Mar 17, 2024

can you give a code snippet? i can't see what you're thinking about

@mwildehahn
Contributor Author

Something like this:

from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ]
        }
    )

I ended up going with something like this:

class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ],
            "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output."
        }
    )

which seems to work well.

this is the sorting issue I was talking about:

>>> print(json.dumps(NextAction.model_json_schema(), indent=2))
{
  "examples": [
    {
      "action": "laugh",
      "thought": "jokes are funny, i should laugh",
      "user": "a funny joke"
    },
    {
      "action": "cry",
      "thought": "the story is sad, i should cry",
      "user": "a sad story"
    }
  ],
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    }
  },
  "required": [
    "thought",
    "action"
  ],
  "title": "NextAction",
  "type": "object"
}

this breaks CoT because the model thinks it should output the action key first, which isn't what we want.

While debugging all of this, I also noticed that the schema that gets sent uses $ref references for every nested model. For simple schemas where you're not even re-using the models, this increases the token count and, from what I can tell, makes the output less reliable.

i.e.:

class NestedModel(BaseModel):
    note: str


class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]
    nested: NestedModel

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ],
            "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
        }
    )

turns into:

{
  "$defs": {
    "NestedModel": {
      "properties": {
        "note": {
          "title": "Note",
          "type": "string"
        }
      },
      "required": [
        "note"
      ],
      "title": "NestedModel",
      "type": "object"
    }
  },
  "examples": [
    {
      "action": "laugh",
      "thought": "jokes are funny, i should laugh",
      "user": "a funny joke"
    },
    {
      "action": "cry",
      "thought": "the story is sad, i should cry",
      "user": "a sad story"
    }
  ],
  "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    },
    "nested": {
      "$ref": "#/$defs/NestedModel"
    }
  },
  "required": [
    "thought",
    "action",
    "nested"
  ],
  "title": "NextAction",
  "type": "object"
}

the LLM has to resolve the references correctly as well as do the task, and I was getting poor results with that.

I worked around the sorting + this referencing issue with this:

from typing import Any

import jsonref
from pydantic import BaseModel


class DereferencedJsonSchema(BaseModel):

    @classmethod
    def model_json_schema(cls, *args, **kwargs):
        """Custom model_json_schema method to make it easier for an LLM to understand the schema.

        By default, pydantic will create refs stored within `$defs` for any
        nested schema. This is usually what you want, but it unnecessarily
        complicates the schemas being sent to the LLM. If you're not referencing
        a schema multiple times, this will also increase the token count.
        Additionally, pydantic will also sort keys. If you provide examples to
        `json_schema_extra`, pydantic will sort the keys within those examples,
        which can negatively influence how the model generates its output.

        We work around both those issues with this function and have seen much
        better results when using it.

        """
        json_schema = super(DereferencedJsonSchema, cls).model_json_schema(
            *args, **kwargs
        )
        json_schema_extra = cls.model_config.get("json_schema_extra")

        # json_schema_extra may also be a callable; only a dict can be re-applied
        if isinstance(json_schema_extra, dict):
            # re-apply the extras (in their original key order) to get around sorting
            json_schema.update(json_schema_extra)

        # dereference the schema
        dereferenced: dict[str, Any] = jsonref.replace_refs(json_schema, proxies=False)  # type: ignore

        # remove defs; a model without nested models has no "$defs" key
        dereferenced.pop("$defs", None)
        return dereferenced

which results in:

{
  "examples": [
    {
      "user": "a funny joke",
      "thought": "jokes are funny, i should laugh",
      "action": "laugh"
    },
    {
      "user": "a sad story",
      "thought": "the story is sad, i should cry",
      "action": "cry"
    }
  ],
  "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    },
    "nested": {
      "properties": {
        "note": {
          "title": "Note",
          "type": "string"
        }
      },
      "required": [
        "note"
      ],
      "title": "NestedModel",
      "type": "object"
    }
  },
  "required": [
    "thought",
    "action",
    "nested"
  ],
  "title": "NextAction",
  "type": "object"
}

depends on the schema and how often you're referencing the same class, but I got significantly better results with this.
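
If pulling in jsonref is not an option, the dereferencing step can be approximated with the standard library. This is only a rough sketch, not what the workaround above uses: `inline_local_refs` is a hypothetical helper, it only handles local `#/$defs/<name>` references, it ignores JSON Pointer escaping, and it will recurse forever on self-referencing models.

```python
from typing import Any


def inline_local_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with local `#/$defs/<name>` refs inlined."""
    prefix = "#/$defs/"
    defs = schema.get("$defs", {})

    def resolve(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith(prefix):
                target = defs[ref[len(prefix):]]
                # Keys sitting next to "$ref" (e.g. a description) are kept
                # and take precedence over the inlined definition.
                siblings = {k: v for k, v in node.items() if k != "$ref"}
                return {**resolve(target), **resolve(siblings)}
            return {k: resolve(v) for k, v in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    # Drop the top-level "$defs" once everything is inlined.
    return resolve({k: v for k, v in schema.items() if k != "$defs"})


schema = {
    "$defs": {
        "NestedModel": {
            "properties": {"note": {"title": "Note", "type": "string"}},
            "required": ["note"],
            "title": "NestedModel",
            "type": "object",
        }
    },
    "properties": {"nested": {"$ref": "#/$defs/NestedModel"}},
    "required": ["nested"],
    "title": "NextAction",
    "type": "object",
}

flat = inline_local_refs(schema)
print(flat["properties"]["nested"]["title"])
```

The helper builds new dicts and lists rather than editing in place, so the original schema (including its `$defs`) is left untouched.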
