Evaluate hosted OpenAI GPT / Google Vertex AI PaLM2 / Gemini or local Ollama models against a task.
Distribute arbitrary tasks as YAML to local or hosted language models. In MODEL FORGE, tasks are broken down into: agents, an optional postprocessor, and evaluators. Tasks have a top-level prompt -- the actual work to do. For example, you could use the following as a task prompt: "Implement a simple example of malloc in C with the following signature: void* malloc(size_t size)". Next, you could include a postprocessor request to a local model to extract only the program's source code from the agent's response. Finally, your evaluator would be instructed to act as an expert in the task, ideally with CoT (Chain of Thought) based examples included.
- macOS / Linux (Ubuntu as a distro was tested)
- Python 3.10+
- Depending on use case:
- OpenAI API key
- Google Cloud Vertex AI service account credentials (.json)
- Ollama installed
- Clone the repository
- Setup a Python environment and install dependencies
- Execute the entry point script:
python src/main.py
git clone https://github.com/Brandon7CC/MODELFORGE
cd MODELFORGE/
python -m venv forge-env
source forge-env/bin/activate
pip install -r requirements.txt
python src/main.py -h
echo "Done! Next, you can try FizzBuzz with Ollama locally!"
python src/main.py task_configs/FizzBuzz.yaml
- OpenAI chat completion models. For example:
- gpt-3.5-turbo
- gpt-4
- gpt-4-1106-preview
- Google Vertex AI PaLM 2 / Gemini text/code completion models. For example:
- gemini-pro
- text-unicorn@001
- code-bison
- OSS models via Ollama, e.g. LLaMA, Orca 2, Vicuna, Mixtral 8x7B, Mistral, Phi-2, etc.
- Evaluate model(s) against a common task
- Produce examples of creative ways to solve a problem
- Chain models together to enable a simple thought loop
FizzBuzz is a classic "can you code" question. It's simple, but it can provide a level of insight into how a developer thinks through a problem -- for example, in Python, the use of control flow, lambdas, etc. Here's the problem statement:
Write a program to display numbers from 1 to n. For multiples of three, print "Fizz" instead of the number, and for the multiples of five, print "Buzz". For numbers which are multiples of both three and five, print "FizzBuzz".
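For reference, a straightforward solution in Python -- one of many forms an agent might produce -- looks like this:

```python
def fizzbuzz(n: int) -> list[str]:
    """Return the FizzBuzz sequence from 1 to n as a list of strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:        # multiple of both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


if __name__ == "__main__":
    print("\n".join(fizzbuzz(15)))
```

An evaluator would check exactly the properties listed later in this walkthrough: correct substitution, correct range, and basic input validation.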
Next, we'll make our task configuration file (this is already done for you in task_configs/FizzBuzz.yaml), but we'll walk you through it. To do so, we'll define a top-level task called "FizzBuzz", give it a prompt, and set the number of times we want the model to solve the problem.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      Write a program to display numbers from 1 to n. For multiples of three, print "Fizz"
      instead of the number, and for the multiples of five, print "Buzz". For numbers which
      are multiples of both three and five, print "FizzBuzz".
      Let's think step by step.
Now we'll define our "agent" -- the model which will act as an expert to complete our task. The model can be any of the supported hosted or local Ollama models (e.g. Google's Gemini, OpenAI's GPT-4, or Mistral AI's Mixtral 8x7B via Ollama).
tasks:
  - name: FizzBuzz
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: mixtral:8x7b-instruct-v0.1-q4_1
      temperature: 0.98
      system_prompt: |
        You're an expert Python developer. Follow these requirements **exactly**:
        - The code you produce is at the principal level;
        - You follow modern object-oriented programming patterns;
        - You list your requirements and design a simple test before implementing.
        Review the user's request and follow these requirements.
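For local models, the `base_model`, `temperature`, and `system_prompt` fields map naturally onto an Ollama Modelfile. A per-task custom model could plausibly be materialized like this (a sketch -- the exact mechanism MODEL FORGE uses may differ):

```
# Modelfile (hypothetical, derived from the agent config above)
FROM mixtral:8x7b-instruct-v0.1-q4_1
PARAMETER temperature 0.98
SYSTEM """
You're an expert Python developer. Follow these requirements **exactly**:
- The code you produce is at the principal level;
- You follow modern object-oriented programming patterns;
- You list your requirements and design a simple test before implementing.
Review the user's request and follow these requirements.
"""
```

Registering it would then be a single `ollama create fizzbuzz-agent -f Modelfile`.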
Optionally, we can create a "postprocessor". We only want the code produced by the agent to be evaluated, so here we'll have our postprocessor model extract the source code from the agent's response.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: gpt-4-1106-preview
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: mistral
      temperature: 0.1
      system_prompt: |
        You have one job: return the source code provided in the user's message.
        **ONLY** return the exact source code. Your response is not read by a human.
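The postprocessor does with a model what a simple fence-stripper does mechanically: pull the code out of a chatty reply. A minimal (hypothetical) deterministic fallback might look like:

```python
import re

FENCE = "`" * 3  # a Markdown code fence: three backticks

def extract_code(response: str) -> str:
    """Return the body of the first fenced code block in `response`,
    or the whole response (stripped) if no fence is found."""
    pattern = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1).strip() if match else response.strip()


if __name__ == "__main__":
    reply = "Sure, here it is:\n" + FENCE + "python\nprint('hi')\n" + FENCE
    print(extract_code(reply))  # print('hi')
```

A model-based postprocessor handles the messier cases a regex can't: multiple code blocks, inline snippets, or code interleaved with prose.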
Lastly, you'll want an "evaluator" model which will act as an expert in reviewing the output from the agent/postprocessor. The job of the evaluator is to return TRUE / FALSE. Additionally, we can fail up to 10 times -- that is, re-query the agent. Here's where a bit of the magic comes in: we'll include a brief summary of the failed attempt -- a critique -- within the next query to the agent. This enables the agent to iterate on itself much more effectively. Here we'll want our evaluator to review the implementation of FizzBuzz.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: codellama
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: gemini-pro
      temperature: 0.1
      system_prompt: |
        ...
    # Evaluators have defined system prompts to only return true / false for their domain.
    evaluator:
      base_model: gpt-4-1106-preview
      temperature: 0.1
      system_prompt: |
        Assess if a given sample program correctly implements FizzBuzz.
        The program should display numbers from 1 to n. For multiples of three, it should
        print "Fizz" instead of the number, for the multiples of five, it should print "Buzz",
        and for numbers which are multiples of both three and five, it should print "FizzBuzz".

        Guidelines for evaluation:
        - Correctness: Verify that the program outputs "Fizz" for multiples of 3, "Buzz" for
          multiples of 5, and "FizzBuzz" for numbers that are multiples of both 3 and 5. For
          all other numbers, it should output the number itself.
        - Range handling: Check if the program correctly handles the range from 1 to n, where
          n is the upper limit provided as input.
        - Error handling: Assess if the program includes basic error handling, such as ensuring
          the input is a positive integer.
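Putting the pieces together, the agent → postprocessor → evaluator loop with critique feedback can be sketched as follows (the function names and signatures here are illustrative, not MODEL FORGE's actual API):

```python
MAX_FAILURES = 10  # re-query the agent up to 10 times

def run_task(prompt, agent, postprocessor, evaluator, critic):
    """Query the agent until the evaluator passes, feeding a critique
    of each failed attempt back into the next query."""
    critique = ""
    for _ in range(MAX_FAILURES):
        # Fold the previous failure's critique into the next query, if any.
        full_prompt = prompt if not critique else (
            f"{prompt}\n\nYour previous attempt failed: {critique}"
        )
        response = agent(full_prompt)       # expert model does the work
        code = postprocessor(response)      # strip everything but the code
        if evaluator(code):                 # TRUE / FALSE verdict
            return code
        critique = critic(code)             # summarize why it failed
    return None                             # gave up after MAX_FAILURES
```

The critique is what makes the loop more than blind retrying: each query carries forward a summary of what went wrong, so the agent can iterate on its own output.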
This work was inspired by Google DeepMind's FunSearch approach to finding a novel solution to the cap set problem. At the macro level this was done by developing CoT (Chain of Thought) based examples, repeatedly prompting PaLM 2 to generate a large number of programs, and then evaluating those programs on several levels.