Values in json_schema_extra get sorted automatically during `model_json_schema` #508

mwildehahn · 2024-03-16T20:36:48Z

This is actually a bug report.
I am not getting good LLM Results
I have tried asking for help in the community on discord or discussions and have not received a response.
I have tried searching the documentation and have not found an answer.

What Model are you using?

gpt-3.5-turbo
gpt-4-turbo
gpt-4
Other (please specify)

Describe the bug
This isn't a bug in instructor per se, but the keys of any examples you add to json_schema_extra get sorted alphabetically: pydantic/pydantic#7580. This can hurt generation quality. For instance, if you have a "thought" field to trigger CoT, sorting moves it after the fields it is supposed to precede in the examples, which then steers the model towards outputting it in that order too.

There should be an option to avoid this in pydantic but thought it was worth noting in the example since this threw me off.


jxnl · 2024-03-17T00:04:53Z

i've always used chain_of_thought and never noticed, will talk to the pydantic team

mwildehahn · 2024-03-17T00:23:30Z

I also was putting `user` as a key to simulate the user request and then the response. The QA example didn’t seem practical because then it requires the user prompt in the model? Am I missing something there?


jxnl · 2024-03-17T01:34:22Z

can you give a code snippet? i can't see what you're thinking about

mwildehahn · 2024-03-17T04:48:34Z

Something like this:

from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ]
        }
    )

I ended up going with something like this:

class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ],
            "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output."
        }
    )

which seems to work well.

this is the sorting issue I was talking about:

>>> print(json.dumps(NextAction.model_json_schema(), indent=2))
{
  "examples": [
    {
      "action": "laugh",
      "thought": "jokes are funny, i should laugh",
      "user": "a funny joke"
    },
    {
      "action": "cry",
      "thought": "the story is sad, i should cry",
      "user": "a sad story"
    }
  ],
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    }
  },
  "required": [
    "thought",
    "action"
  ],
  "title": "NextAction",
  "type": "object"
}

this breaks CoT because the model thinks it should output the action key first, which isn't what we want.
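One way to catch this kind of reordering early is a small check on the generated schema dict. The following is a minimal sketch using only plain dictionaries, so it does not depend on any particular pydantic version. The helper name `examples_follow_property_order` is hypothetical and not part of pydantic or instructor; the two schemas are trimmed-down copies of the ones shown above.

```python
def examples_follow_property_order(schema: dict) -> bool:
    """Return True if every example lists the model's fields in declaration order."""
    field_order = list(schema.get("properties", {}))
    for example in schema.get("examples", []):
        # Ignore illustrative keys such as "user" that are not model fields.
        seen = [key for key in example if key in field_order]
        expected = [name for name in field_order if name in example]
        if seen != expected:
            return False
    return True


# Example keys sorted alphabetically, as in the schema output above.
sorted_schema = {
    "examples": [
        {"action": "laugh", "thought": "jokes are funny, i should laugh", "user": "a funny joke"},
    ],
    "properties": {"thought": {"type": "string"}, "action": {"type": "string"}},
}

# Example keys in the order they were written in json_schema_extra.
original_schema = {
    "examples": [
        {"user": "a funny joke", "thought": "jokes are funny, i should laugh", "action": "laugh"},
    ],
    "properties": {"thought": {"type": "string"}, "action": {"type": "string"}},
}

print(examples_follow_property_order(sorted_schema))    # False
print(examples_follow_property_order(original_schema))  # True
```

Running a check like this against the output of `model_json_schema()` in a unit test would flag the problem before the schema reaches the LLM.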

In debugging all of this, I also noticed that the schema that gets sent includes references for everything. For simple schemas where you're not even re-using the models this increases the token count + makes the usage more unreliable from what I can tell.

i.e.:

class NestedModel(BaseModel):
    note: str


class NextAction(BaseModel):
    thought: str = Field(
        description="Think about the best next action based on what the user said."
    )
    action: Literal["laugh", "cry"]
    nested: NestedModel

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "user": "a funny joke",
                    "thought": "jokes are funny, i should laugh",
                    "action": "laugh",
                },
                {
                    "user": "a sad story",
                    "thought": "the story is sad, i should cry",
                    "action": "cry",
                },
            ],
            "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
        }
    )

turns into:

{
  "$defs": {
    "NestedModel": {
      "properties": {
        "note": {
          "title": "Note",
          "type": "string"
        }
      },
      "required": [
        "note"
      ],
      "title": "NestedModel",
      "type": "object"
    }
  },
  "examples": [
    {
      "action": "laugh",
      "thought": "jokes are funny, i should laugh",
      "user": "a funny joke"
    },
    {
      "action": "cry",
      "thought": "the story is sad, i should cry",
      "user": "a sad story"
    }
  ],
  "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    },
    "nested": {
      "$ref": "#/$defs/NestedModel"
    }
  },
  "required": [
    "thought",
    "action",
    "nested"
  ],
  "title": "NextAction",
  "type": "object"
}

the LLM has to resolve the references correctly in addition to doing the actual task, and I was getting poor results with that.

I worked around the sorting + this referencing issue with this:

from typing import Any

import jsonref
from pydantic import BaseModel


class DereferencedJsonSchema(BaseModel):

    @classmethod
    def model_json_schema(cls, *args, **kwargs):
        """Custom model_json_schema method to make it easier for an LLM to understand the schema.

        By default, pydantic will create refs stored within `$defs` for any
        schema. This is usually what you want but it unnecessarily complicates
        the schemas being sent to the LLM. If you're not referencing a schema
        multiple times, this will also increase the token count. Additionally,
        pydantic will also sort keys. If you provide examples to
        `json_schema_extra`, pydantic will sort the keys within those examples
        which can negatively influence how the model generates its output.

        We work around both those issues with this function and have seen much
        better results when using it.

        """
        json_schema = super(DereferencedJsonSchema, cls).model_json_schema(
            *args, **kwargs
        )
        json_schema_extra = cls.model_config.get("json_schema_extra")

        # json_schema_extra may also be a callable; only a dict can be re-applied
        if isinstance(json_schema_extra, dict):
            # re-apply the extras to get around sorting
            json_schema.update(json_schema_extra)  # type: ignore

        # dereference the schema
        dereferenced: dict[str, Any] = jsonref.replace_refs(json_schema, proxies=False)  # type: ignore

        # remove defs (absent when the model has no nested models)
        dereferenced.pop("$defs", None)
        return dereferenced

which results in:

{
  "examples": [
    {
      "user": "a funny joke",
      "thought": "jokes are funny, i should laugh",
      "action": "laugh"
    },
    {
      "user": "a sad story",
      "thought": "the story is sad, i should cry",
      "action": "cry"
    }
  ],
  "examples_description": "The `user` in examples is only for illustrative purposes, you should not include it in your output.",
  "properties": {
    "thought": {
      "description": "Think about the best next action based on what the user said.",
      "title": "Thought",
      "type": "string"
    },
    "action": {
      "enum": [
        "laugh",
        "cry"
      ],
      "title": "Action",
      "type": "string"
    },
    "nested": {
      "properties": {
        "note": {
          "title": "Note",
          "type": "string"
        }
      },
      "required": [
        "note"
      ],
      "title": "NestedModel",
      "type": "object"
    }
  },
  "required": [
    "thought",
    "action",
    "nested"
  ],
  "title": "NextAction",
  "type": "object"
}

depends on the schema and how often you're referencing the same class, but I got significantly better results with this.
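For anyone who would rather not add the jsonref dependency, the dereferencing step can be sketched in plain Python. This is only an illustration of the idea, not the jsonref implementation: `inline_refs` is a hypothetical helper, it assumes every reference is a local `#/$defs/...` pointer, it drops any keys that sit next to a `$ref`, and it would recurse forever on self-referencing models.

```python
import copy
import json


def inline_refs(schema: dict) -> dict:
    """Return a copy of `schema` with local `#/$defs/...` references inlined."""
    prefix = "#/$defs/"
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith(prefix):
                # Replace the reference with a resolved copy of its definition.
                # Assumption: sibling keys next to "$ref" are not needed.
                return resolve(copy.deepcopy(defs[ref[len(prefix):]]))
            # Dict comprehensions preserve insertion order, so key order is kept.
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    # Drop the top-level "$defs" block, since everything is inlined.
    return resolve({key: value for key, value in schema.items() if key != "$defs"})


schema = {
    "$defs": {
        "NestedModel": {
            "properties": {"note": {"title": "Note", "type": "string"}},
            "required": ["note"],
            "title": "NestedModel",
            "type": "object",
        }
    },
    "properties": {
        "thought": {"title": "Thought", "type": "string"},
        "nested": {"$ref": "#/$defs/NestedModel"},
    },
    "required": ["thought", "nested"],
    "title": "NextAction",
    "type": "object",
}

flat = inline_refs(schema)
print(json.dumps(flat, indent=2))
```

Whether inlining actually saves tokens depends on how often a nested model is reused: a definition referenced many times is repeated in full at every use.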
