Huggingface agent #2599

Open · wants to merge 20 commits into base: main
Conversation

@whiskyboy whiskyboy commented May 5, 2024

Why are these changes needed?

This PR introduces a new agent, HuggingFaceAgent, which connects to models on the Hugging Face Hub to provide several multimodal capabilities.

The agent essentially pairs an assistant agent with a user-proxy agent, and the Hugging Face Hub model capabilities are registered as tools on both. Users can access the agent's multimodal capabilities out of the box, without manually registering toolkits for execution.
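A minimal usage sketch (the constructor arguments and token handling shown here are illustrative assumptions, not the final API):

```python
# Hypothetical usage sketch of the proposed HuggingFaceAgent; the constructor
# arguments (llm_config, hf_api_token) are assumptions for illustration only.
from autogen import UserProxyAgent
from autogen.agentchat.contrib.huggingface_agent import HuggingFaceAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}

# The agent internally pairs an assistant with a user-proxy agent; the
# Hugging Face Hub model calls are registered as tools on both.
hf_agent = HuggingFaceAgent(
    name="huggingface_agent",
    llm_config=llm_config,       # assumed: config for the text-only LLM driving the agent
    # hf_api_token="hf_...",     # assumed: token for the Hugging Face Inference API
)

user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)
user.initiate_chat(hf_agent, message="Generate an image of a cat wearing a space suit.")
```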

Some key changes:

  1. added HuggingFaceClient class in autogen/agentchat/contrib/huggingface_utils.py: this class simplifies calling HuggingFace models locally or remotely.
  2. added HuggingFaceAgent class in autogen/agentchat/contrib/huggingface_agent.py: this agent utilizes HuggingFaceClient to achieve multimodal capabilities.
  3. added HuggingFaceImageGenerator class in autogen/agentchat/contrib/capabilities/generate_images.py: this class enables text-based LLMs to generate images using HuggingFaceClient.
  4. added notebook samples to demonstrate how these new classes work
  5. fixed some bugs

Related issue number

The second approach mentioned in #2577

Checks

codecov-commenter commented May 5, 2024

Codecov Report

Attention: Patch coverage is 0% with 206 lines in your changes missing coverage. Please review.

Project coverage is 19.10%. Comparing base (372ac1e) to head (0cf54fd).
Report is 17 commits behind head on main.

Files | Patch % | Lines
autogen/agentchat/contrib/huggingface_agent.py | 0.00% | 96 Missing and 1 partial ⚠️
autogen/agentchat/contrib/huggingface_utils.py | 0.00% | 87 Missing ⚠️
.../agentchat/contrib/capabilities/generate_images.py | 0.00% | 22 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2599       +/-   ##
===========================================
- Coverage   33.11%   19.10%   -14.02%     
===========================================
  Files          86       88        +2     
  Lines        9108     9444      +336     
  Branches     1938     2173      +235     
===========================================
- Hits         3016     1804     -1212     
- Misses       5837     7524     +1687     
+ Partials      255      116      -139     
Flag | Coverage Δ
unittests | 19.04% <0.00%> (-14.07%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@sonichi sonichi requested a review from BeibinLi May 5, 2024 17:07
@sonichi sonichi added the multimodal (language + vision, speech etc.), integration (software integration), and alt-models (Pertains to using alternate, non-GPT, models, e.g., local models, llama, etc.) labels May 5, 2024
@WaelKarkoub
Collaborator

@whiskyboy thanks for the PR! I had a couple of design questions and wanted your opinion on them.

Autogen has an image generation capability, which allows anyone to add text-to-image capabilities to any LLM.

class ImageGeneration(AgentCapability):

What do you think about implementing a new custom ImageGenerator that uses Hugging Face's APIs, as opposed to creating a new agent type? We have the DALL-E image generator implemented for reference.
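For concreteness, a rough sketch of such a generator (following the generate_image/cache_key shape of the existing image generators in autogen/agentchat/contrib/capabilities/generate_images.py and using huggingface_hub's InferenceClient; the model name is just an example):

```python
# Sketch of a Hugging Face-backed ImageGenerator; the class and model name are
# illustrative, not the PR's actual implementation.
from typing import Optional

from PIL.Image import Image
from huggingface_hub import InferenceClient


class HuggingFaceImageGenerator:
    def __init__(self, model: str = "stabilityai/stable-diffusion-2-1", token: Optional[str] = None):
        self._model = model
        self._client = InferenceClient(token=token)

    def generate_image(self, prompt: str) -> Image:
        # InferenceClient.text_to_image returns a PIL image.
        return self._client.text_to_image(prompt, model=self._model)

    def cache_key(self, prompt: str) -> str:
        # Key cached results on both the prompt and the model used.
        return f"{self._model}:{prompt}"
```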

For image-to-text, we also have a capability called VisionCapability. @BeibinLi has more information on the design choices for that capability but I just wanted to bring it up for awareness.

class VisionCapability(AgentCapability):

@whiskyboy
Collaborator Author

@WaelKarkoub Thanks for your comment!
Yes, in fact I was inspired by and learned a lot from the design of the two capabilities you mentioned above, as well as from MultimodalConversableAgent and LLaVAAgent, during development. Here are my thoughts:

  1. Can we achieve the same functionality within the current multimodal capability implementations?
    Certainly: we could implement a custom ImageGenerator or provide a custom_caption_func to realize the text-to-image and image-to-text capabilities using Hugging Face's APIs. However, Hugging Face offers the potential for many other multimodal capabilities, such as image-to-image, audio-to-audio, etc., which go beyond the current implementations. (A full list can be found here.) For now, this draft PR serves only as a PoC to show how a Hugging Face agent works. Once we align on the design, I'll proceed with implementing additional capabilities.
  2. Should we add a new agent type, or should we add new multimodal capabilities that leverage Hugging Face multimodal models?
    Both designs make sense to me. Introducing a new agent type makes it easy to cover a diverse range of multimodal capabilities for general-purpose use, while registering a new capability is more suitable for a specific task. (But we can also have a general capability, or register multiple capabilities to one agent, so I'm flexible and open to either approach.)
  3. Do we really need built-in support for Hugging Face multimodal models?
    The idea was inspired by Transformers Agents and JARVIS. It's appealing (to me at least) to have a non-OpenAI, out-of-the-box solution for adding multimodal capabilities to a text-only LLM in autogen. Hugging Face stands out as a suitable choice due to its diverse range of multimodal models, spanning general-purpose to domain-specific areas. Additionally, it offers a cost-effective solution.

@WaelKarkoub
Collaborator

@whiskyboy This is very cool and I appreciate your efforts! Your reasoning fits well with what I think now. Both approaches could be beneficial to the autogen community and could coexist. We can have standalone huggingface conversable agents as well as huggingface image generators, audio generators, etc.

I look at Autogen as a lego world where users can mix and match different useful tools (lego pieces), and the tools you've developed are valuable and versatile enough to be applicable across many areas (e.g., agent capabilities). For a concrete example, what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize? The HuggingFaceAgent wouldn't implement it as a capability but could directly use this newly decoupled logic. We could apply a similar strategy to other modalities as well.
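To illustrate, the decoupled piece could then be plugged in both ways (reusing the HuggingFaceImageGenerator sketched earlier; the ImageGeneration capability is existing autogen code, while the direct-use path is an assumption about how HuggingFaceAgent might consume it):

```python
# Illustrative wiring of a decoupled generator (HuggingFaceImageGenerator as sketched above).
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.generate_images import ImageGeneration

generator = HuggingFaceImageGenerator(model="stabilityai/stable-diffusion-2-1")

# 1) As a capability added to any text-based LLM agent:
agent = ConversableAgent(
    "assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]},
)
ImageGeneration(image_generator=generator).add_to_agent(agent)

# 2) Used directly by HuggingFaceAgent, e.g. inside the tool function it
#    registers for the text-to-image task (no capability involved):
image = generator.generate_image("a watercolor painting of a lighthouse at dusk")
```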

One last question, is the image-to-image capability the same as image editing? If so, I'm considering improving the image generator capability to allow for this.

@whiskyboy
Collaborator Author

whiskyboy commented May 6, 2024

@WaelKarkoub Glad to know we are working towards the same goal!

what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize?

Sounds like a versatile lego block that could be utilized by both standalone agents and agent capabilities? I think it's a good idea: it would improve reusability and make the code more readable and maintainable.

is the image-to-image capability the same as image editing?

Yes, some typical user scenarios include style transfer, image inpainting, etc. For instance, the timbrooks/instruct-pix2pix model can transform a dog in one image into a cat. These models are usually diffusion models that accept a source image and a prompt text as input.
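A rough sketch of such an image-to-image call through huggingface_hub's InferenceClient (file names and token are placeholders):

```python
# Illustrative image-editing call via the Hugging Face Inference API;
# the file paths and token are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")

# instruct-pix2pix takes a source image plus an instruction prompt and
# returns the edited image as a PIL image.
edited = client.image_to_image(
    "dog.jpg",
    prompt="turn the dog into a cat",
    model="timbrooks/instruct-pix2pix",
)
edited.save("cat.jpg")
```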

@whiskyboy
Collaborator Author

@WaelKarkoub @BeibinLi would you mind taking a review of this PR? I'll add documentation and tests once you approve the design.

@whiskyboy whiskyboy marked this pull request as ready for review May 17, 2024 04:04
Comment on lines +115 to +119
@self._user_proxy.register_for_execution()
@self._assistant.register_for_llm(
name=HuggingFaceCapability.TEXT_TO_IMAGE.name,
description="Generates images from input text.",
)
Collaborator

What's the idea behind using function registration instead of using the text analyzer agent?

Comment on lines +59 to +64
self._assistant = AssistantAgent(
self.name + "_inner_assistant",
system_message=system_message,
llm_config=inner_llm_config,
is_termination_msg=lambda x: False,
)
Collaborator

We may have to expose these two agents to users by initializing them in the constructor, for a couple of reasons:

  1. Users can apply the transform messages capability to limit token count by either truncation or compression (see the sketch below).
  2. It makes clear to users that we'll be making extra API calls.
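A quick sketch of point 1, assuming the inner assistant is exposed (the inner_assistant attribute name is hypothetical):

```python
# Sketch of applying the transform messages capability to the exposed inner
# assistant; `inner_assistant` is a hypothetical attribute name.
from autogen.agentchat.contrib.capabilities import transform_messages, transforms

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}
hf_agent = HuggingFaceAgent(name="huggingface_agent", llm_config=llm_config)

# Truncate the history passed to the inner assistant to stay under a token budget.
context_handling = transform_messages.TransformMessages(
    transforms=[transforms.MessageTokenLimiter(max_tokens=2000)]
)
context_handling.add_to_agent(hf_agent.inner_assistant)  # hypothetical attribute
```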

from autogen.agentchat.contrib import img_utils


class HuggingFaceClient:
Collaborator

Is this meant to be a model client?

class ModelClient(Protocol):

Labels
alt-models (Pertains to using alternate, non-GPT, models, e.g., local models, llama, etc.), integration (software integration), multimodal (language + vision, speech etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants