Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

synthetic data generation #1

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aowen14
Copy link

@aowen14 aowen14 commented Apr 2, 2024

Add an example for synthetic dataset generation. Looking for general feedback first on how commented the example is, and anything regarding the experiment, etc.

Note: I haven't done any linting stuff yet, this is more for a vibe check first.

Copy link
Contributor

@willbakst willbakst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, there's some formatting stuff to do. I only commented on structure and left formatting for after linting etc. is done.

### Strategy:
We'll generate synthetic queries with Task Templates. We'll start with just one query example and with LLM's and some insight, we'll generate thousands of synthetic queries.

To Generate Task Tamples, we will describe query generation process to a Language model and ask it to come up with samples of templated queries, which we can use to expand later by generating different data to fill the templates. We will start with GPT-4 generate our first examples (roughly 5), and GPT-4-Turbo.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type: Tamples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also generally sentence structure here is lacking.

from typing import List, Dict, Literal


class QueryAnswerPrompt(OpenAICall):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be updated to use the newly released GroqCall in v0.9.1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want that to be the primary way that this is done? I get that it can use Groq, but wouldn't you want the code to be portable to other providers instantly with a url and api key change?


USER:
I'm looking to create a set of synthetic <Queries, Response> pairs.
The queries will be asking for asking the model to provide code for tasks using {library_focus} to perform {task_type} tasks that users might want done from the command line.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asking for asking?

from pydantic import BaseModel, Field
from typing import List, Dict, Literal, Type

class SyntheticQueryGenerationPrompt(OpenAICall):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would name this SyntheticQueryGenerator and put it in a calls folder.

Optionally, use mirascope init --prompts_location calls to initialize a .mirascope project. This will create the calls folder for you.

I would then call mirascope add synthetic_query_generator so that this is versioned as 0001.

print(examples_string)
return examples_string

class QueryTemplateList(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I would but this base model and it's corresponding extractor in their own file in the calls folder.

unextracted_query_list : str


class TemplateVariableOptions(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants