
Enhance Regression Testing and Implement Eval Class for Robust Agent Evaluation #10

Open
emrgnt-cmplxty opened this issue Jun 20, 2023 · 0 comments
Labels: good first issue

Comments

emrgnt-cmplxty (Owner) commented Jun 20, 2023

I'd like to raise the subject of our regression testing and propose a more comprehensive approach to evaluating our agents' performance. Currently, our regression testing covers little beyond search functionality, which leaves much of the codebase under-tested. A robust, comprehensive testing framework is critical for ensuring code quality, catching bugs early, and facilitating smooth code integration, and it is a key contributor to long-term maintainability.

To this end, I propose two main initiatives:

1. Expand our regression testing: By enhancing our suite of regression tests, we can ensure that changes in one part of the code don't break something elsewhere. This will help us maintain system integrity and minimize the risks associated with ongoing development. (A minimal pytest sketch follows this list.)
2. Introduce an evaluation suite built around an Eval class: In addition to expanded regression testing, we should implement a comprehensive evaluation suite featuring an Eval class, which will let us evaluate how well our agents perform by comparing their actions with expected outcomes.
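On the first point, the sketch below shows the kind of regression test we could add, using pytest. Note that `toy_search`, its index, and the expected results are stand-ins invented for illustration; the real tests would call the project's actual search entry point.

# test_search_regression.py -- regression-test pattern sketch; `toy_search` is a
# stand-in for the project's real search entry point, not an existing API.
from typing import List

import pytest

_TOY_INDEX = ["AutomataAgent", "AutomataAgentConfig", "Eval", "EvalAction"]


def toy_search(query: str) -> List[str]:
    """Stand-in search: return indexed symbols whose names contain the query."""
    return [symbol for symbol in _TOY_INDEX if query.lower() in symbol.lower()]


@pytest.mark.parametrize(
    "query, expected_top_result",
    [
        ("agent", "AutomataAgent"),
        ("eval", "Eval"),
    ],
)
def test_search_top_result_is_stable(query, expected_top_result):
    # Pin today's behaviour so a future change that silently alters the ranking
    # (or drops results entirely) fails loudly in CI.
    results = toy_search(query)
    assert results, f"no results returned for {query!r}"
    assert results[0] == expected_top_result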
On the second point, here's a rough skeleton of what the Eval class could look like:


import abc
import logging
from typing import List

# Project imports (module paths may differ in the current codebase):
# AutomataAgentConfigFactory, AutomataAgentFactory, AutomataActionExtractor,
# Action, EvalAction, OpenAIChatMessage, calc_eval_result

logger = logging.getLogger(__name__)


class Eval(abc.ABC):
    """
    Evaluation classes generally should override two methods:
    `generate_eval_result`: Takes an instruction and a list of expected actions and evaluates the correctness of the agent's actions.
    `_extract_actions`: Extracts the actions from a passed list of messages.
    """

    def __init__(self, *args, **kwargs):
        if "main_config" not in kwargs:
            raise ValueError("main_config must be provided to Eval")
        # Forward the arguments to the config factory rather than passing the
        # raw args/kwargs containers themselves.
        self.config = AutomataAgentConfigFactory.create_config(*args, **kwargs)

    def generate_eval_result(self, instructions: str, expected_actions: List[EvalAction]):
        """Evaluates a single sample."""
        logger.debug("Evaluating Instructions: %s", instructions)
        agent = AutomataAgentFactory.create_agent(instructions=instructions, config=self.config)
        agent.run()
        # _extract_non_instruction_messages filters the initial instruction
        # messages out of the agent's conversation (see the old Eval implementation).
        messages = Eval._extract_non_instruction_messages(agent)
        extracted_actions = Eval._extract_actions(messages)
        return calc_eval_result(extracted_actions, expected_actions)

    @staticmethod
    def _extract_actions(messages: List[OpenAIChatMessage]) -> List[Action]:
        """Extracts actions from a list of messages."""
        extracted_actions: List[Action] = []
        for message in messages:
            actions = AutomataActionExtractor.extract_actions(message.content)
            extracted_actions.extend(actions)
        return extracted_actions

Logic for the old Eval implementation can be seen here.
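For context, here is one possible sketch of the `calc_eval_result` helper referenced above, together with a result container. The `EvalResult` fields and the equality-based matching rule are assumptions for illustration, not the existing implementation.

from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalResult:
    """Possible result container (an assumption, not an existing type)."""
    full_match: bool
    matched_actions: List = field(default_factory=list)
    missing_actions: List = field(default_factory=list)


def calc_eval_result(extracted_actions: List, expected_actions: List) -> EvalResult:
    """Compare the actions the agent actually took against the expected ones.

    The matching rule here (equality-based membership) is a simplification; the
    real comparison may need fuzzier matching on action payloads.
    """
    matched = [action for action in expected_actions if action in extracted_actions]
    missing = [action for action in expected_actions if action not in extracted_actions]
    return EvalResult(
        full_match=not missing,
        matched_actions=matched,
        missing_actions=missing,
    )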

The Eval class's primary purpose is to take a set of instructions and the corresponding expected actions and evaluate whether executing the instructions produces the anticipated actions. It's important to note that a good suite of evaluations won't necessarily pass with 100% success; rather, it gives us a performance baseline and a clear target to strive for.
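To make the intended workflow concrete, a driver might look roughly like the following; the `EvalAction` fields and the `main_config` value shown here are illustrative assumptions, not the final API.

# Illustrative driver, assuming the classes above are importable.
expected_actions = [
    # The EvalAction constructor arguments are placeholders for illustration.
    EvalAction(tool_name="context-oracle", tool_args=["How does the agent run tools?"]),
]

evaluator = Eval(main_config="automata_main")  # config name is a placeholder
result = evaluator.generate_eval_result(
    instructions="Explain how AutomataAgent executes tool calls.",
    expected_actions=expected_actions,
)
print(result)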

As always, don't hesitate to ask if you have any questions or need further clarification. Your contributions to this project are highly valued!

emrgnt-cmplxty added the good first issue label on Jun 23, 2023
emrgnt-cmplxty added a commit that referenced this issue Aug 28, 2023
Huntemall pushed a commit to Huntemall/automata-dev that referenced this issue Oct 30, 2023