Adds option for JSON schema optimization #863

Open
wants to merge 5 commits into
base: main

Contributor

@leloykun commented May 4, 2024

Pydantic's .model_json_schema() and get_schema_from_signature don't actually make optional fields/arguments optional in the JSON schema: they still appear under "required". This forces the model to output those keys even when the values are null, which slows down inference more the larger the schema is and the more optional fields there are.


For example, for this Pydantic class:

class Test(BaseModel):
    field_a: int
    field_b: Optional[int]
    field_c: None

.model_json_schema() builds this schema:

{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "Field B",
        },
        "field_c": {"title": "Field C", "type": "null"},
    },
    "required": ["field_a", "field_b", "field_c"],
    "title": "Test",
    "type": "object",
}

optimize_schema in this PR reduces this to:

{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {"title": "Field B", "type": "integer"},
    },
    "required": ["field_a"],
    "title": "Test",
    "type": "object",
}

Likewise, get_schema_from_signature converts this function:

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

to this schema:

{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "B",
        },
    },
    "required": ["a", "b"],
    "title": "Arguments",
    "type": "object",
}

optimize_schema reduces this to:

{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {"title": "B", "type": "integer"},
    },
    "required": ["a"],
    "title": "Arguments",
    "type": "object",
}

I decided to add a flag, enable_schema_optimization, and set it to False by default because the optimization further restricts the support of the output distribution and thus might break models finetuned without this setting.
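
For readers who want to see the reduction spelled out, here is a minimal sketch of the transformation illustrated by the two examples above. It is not the implementation in this PR: `optimize_schema_sketch` is a hypothetical name, it only handles top-level properties (no nested objects or `$defs`), and it assumes nullable fields are expressed exactly as `{"type": "null"}` or as an `anyOf` containing `{"type": "null"}`.

```python
from copy import deepcopy

NULL = {"type": "null"}


def optimize_schema_sketch(schema: dict) -> dict:
    """Hypothetical sketch: drop null-only fields and make nullable fields optional."""
    out = deepcopy(schema)  # never mutate the caller's schema
    required = list(out.get("required", []))
    properties = {}

    for name, prop in out.get("properties", {}).items():
        variants = prop.get("anyOf")
        non_null = [v for v in variants if v != NULL] if variants else None

        # Field that can only ever be null: remove it entirely.
        if prop.get("type") == "null" or (variants and not non_null):
            if name in required:
                required.remove(name)
            continue

        # Nullable field: strip the null variant and stop requiring the key.
        if variants and NULL in variants:
            rest = {k: v for k, v in prop.items() if k != "anyOf"}
            if len(non_null) == 1:
                prop = {**rest, **non_null[0]}
            else:
                prop = {**rest, "anyOf": non_null}
            if name in required:
                required.remove(name)

        properties[name] = prop

    out["properties"] = properties
    if "required" in out:
        out["required"] = required
    return out
```

Applied to the `Test` schema above, this sketch removes `field_c`, collapses `field_b` to a plain integer, and leaves only `field_a` under `"required"`.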

@eitanturok
Contributor

There seems to be another potential bug here. Given the function

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

the function get_schema_from_signature outputs "title": "Arguments" both when optimize_schema is used and when it is not used. It seems like the output should have "title": "test_add".

Perhaps I should raise this in a separate issue.

@leloykun
Contributor Author

leloykun commented May 6, 2024

@eitanturok I don't think this is a bug cuz we don't use the title field when building the FSM (& when generating outputs)

Can you provide an example where this breaks something?

@eitanturok
Contributor

@leloykun

I'm using outlines to make my models better at function calling and this current setup causes me some issues.

At a high level, I take the generated schema and use it 1) in the system prompt and 2) to create a regex. The schema goes into the system prompt so the model knows which functions it has access to. But if the JSON schema does NOT contain the function's name, the model won't know how to call it.

Here is an example:

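import json

# Import paths as of outlines at the time of this PR.
from outlines import generate
from outlines.fsm.json_schema import build_regex_from_schema, get_schema_from_signature

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

# Build the schema from the function signature, then a regex from the schema.
schema_json = get_schema_from_signature(test_add)
schema_str = json.dumps(schema_json).strip()
schema_regex = build_regex_from_schema(schema_str, whitespace_pattern=None)

# The same schema string goes into the system prompt.
system_prompt = f"You are an expert at function calling and have access to the following tools: {schema_str}. "
system_prompt += "Please call one of these functions."

# `model` is an outlines model loaded elsewhere.
generator = generate.regex(model, schema_regex)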

If the schema generated by get_schema_from_signature does not include the function name, the prompt never mentions test_add and the model cannot call it by name.

@leloykun
Contributor Author

leloykun commented May 7, 2024

@eitanturok, we should raise this as a separate issue

I'm thinking of replacing this line in get_schema_from_signature

model = create_model("Arguments", **arguments)

with

model = create_model(fn.__name__, **arguments)

or

try:
    fn_name = fn.__name__
except AttributeError:
    fn_name = "Arguments"
model = create_model(fn_name, **arguments)

just to be safer

what do you think?

@eitanturok
Contributor

I was thinking the same thing. I'll raise this as a separate issue.

@eitanturok
Contributor

Raised the issue in #878. Future discussions should take place there.

@leloykun force-pushed the fc--add-schema-optimization branch from 0a4b076 to dbf193e on May 20, 2024 12:54