synthetic data generation #1
As mentioned, there's some formatting stuff to do. I only commented on structure and left formatting for after linting etc. is done.
### Strategy:
We'll generate synthetic queries with Task Templates. We'll start with just one query example and with LLM's and some insight, we'll generate thousands of synthetic queries.

To Generate Task Tamples, we will describe query generation process to a Language model and ask it to come up with samples of templated queries, which we can use to expand later by generating different data to fill the templates. We will start with GPT-4 generate our first examples (roughly 5), and GPT-4-Turbo.
typo: Tamples
also generally sentence structure here is lacking.
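For readers following the thread, the template-expansion step described in the Strategy text can be sketched in plain Python. This is only an illustration under stated assumptions: the template string, the option lists, and the helper names `template_fields` and `expand_template` are hypothetical and do not come from the PR, which uses LLM calls to produce both the templates and the fill-in data.

```python
from itertools import product
from string import Formatter


def template_fields(template: str) -> list[str]:
    """Return the placeholder names of a str.format-style template, in order."""
    seen: list[str] = []
    for _, field, _, _ in Formatter().parse(template):
        # `field` is None for trailing literal text with no placeholder.
        if field and field not in seen:
            seen.append(field)
    return seen


def expand_template(template: str, options: dict[str, list[str]]) -> list[str]:
    """Fill the template with every combination of the given option values."""
    fields = template_fields(template)
    combos = product(*(options[name] for name in fields))
    return [template.format(**dict(zip(fields, combo))) for combo in combos]


# Hypothetical template and options, for illustration only.
template = "Use {library} to {action} every {file_type} file in a directory."
options = {
    "library": ["pathlib", "shutil"],
    "action": ["rename", "copy"],
    "file_type": ["CSV", "JSON", "log"],
}

queries = expand_template(template, options)
print(len(queries))  # 2 * 2 * 3 = 12 combinations
print(queries[0])
```

One template with a handful of options per slot already yields a dozen queries, which is how a few seed templates could plausibly grow into thousands.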
from typing import List, Dict, Literal


class QueryAnswerPrompt(OpenAICall):
this can be updated to use the newly released `GroqCall` in v0.9.1
Do you want that to be the primary way that this is done? I get that it can use Groq, but wouldn't you want the code to be portable to other providers instantly with a url and api key change?
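The portability point can be illustrated with a rough sketch of a provider-agnostic request builder. It assumes both providers expose an OpenAI-compatible `/chat/completions` endpoint; the base URLs, model names, and the `ProviderConfig` / `build_chat_request` names are assumptions for illustration and are not Mirascope's API. The request is only constructed here, never sent.

```python
import json
from dataclasses import dataclass
from urllib.request import Request


@dataclass(frozen=True)
class ProviderConfig:
    """Everything that changes when switching providers (hypothetical helper)."""
    base_url: str
    api_key: str
    model: str


def build_chat_request(config: ProviderConfig, prompt: str) -> Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": config.model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        url=f"{config.base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {config.api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed endpoints and placeholder credentials/models; swap in real values.
openai_cfg = ProviderConfig("https://api.openai.com/v1", "OPENAI_KEY", "model-a")
groq_cfg = ProviderConfig("https://api.groq.com/openai/v1", "GROQ_KEY", "model-b")

for cfg in (openai_cfg, groq_cfg):
    req = build_chat_request(cfg, "Generate five templated queries.")
    print(req.get_method(), req.full_url)
```

The calling code stays identical across providers; only the config object differs, which is the trade-off being weighed against a dedicated provider class.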
USER:
I'm looking to create a set of synthetic <Queries, Response> pairs.
The queries will be asking for asking the model to provide code for tasks using {library_focus} to perform {task_type} tasks that users might want done from the command line.
asking for asking?
from pydantic import BaseModel, Field
from typing import List, Dict, Literal, Type


class SyntheticQueryGenerationPrompt(OpenAICall):
would name this `SyntheticQueryGenerator` and put it in a `calls` folder.
Optionally, use `mirascope init --prompts_location calls` to initialize a `.mirascope` project. This will create the `calls` folder for you.
I would then call `mirascope add synthetic_query_generator` so that this is versioned as `0001`.
print(examples_string)
return examples_string


class QueryTemplateList(BaseModel):
again, I would put this base model and its corresponding extractor in their own file in the calls folder.
unextracted_query_list : str


class TemplateVariableOptions(BaseModel):
ditto
Add an example for synthetic dataset generation. Looking for general feedback first on how commented the example is, and anything regarding the experiment, etc.
Note: I haven't done any linting stuff yet, this is more for a vibe check first.