
[Feature Request]: Connect to the HuggingFace Hub to achieve a multimodal capability #2577

Open
whiskyboy opened this issue May 3, 2024 · 5 comments
Labels
enhancement (New feature or request), multimodal (language + vision, speech etc.)

Comments

@whiskyboy
Collaborator

Is your feature request related to a problem? Please describe.

The HuggingFace Hub provides an elegant Python client that lets users control 100,000+ HuggingFace models and run inference on them for a variety of multimodal tasks, such as image-to-text and text-to-speech. By connecting to this hub, a text-only LLM like gpt-3.5-turbo could also gain the multimodal capability to handle images, video, audio, and documents in a cost-efficient way.

However, it still takes some additional coding work to let an autogen agent interact with a huggingface-hub client, such as wrapping the client methods into functions, parsing the different input/output types, and managing model deployment. That's why I'm asking whether autogen could offer an out-of-the-box solution for this connection.
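
For illustration, a minimal sketch of the kind of wrapping involved, using huggingface_hub's InferenceClient (the model name below is only an example):

```python
# Minimal sketch: wrapping a huggingface_hub inference call as a plain Python
# function so it can later be registered as a tool with an autogen agent.
from huggingface_hub import InferenceClient

def caption_image(image_path: str) -> str:
    """Caption a local image with an image-to-text model from the Hub."""
    client = InferenceClient()
    # Example model only; any image-to-text model on the Hub could be used.
    result = client.image_to_text(image_path, model="Salesforce/blip-image-captioning-base")
    # Depending on the huggingface_hub version, the result is a plain string
    # or a dataclass with a `generated_text` field.
    return getattr(result, "generated_text", result)
```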

Other similar works: JARVIS, Transformers Agent

Describe the solution you'd like

  1. The simplest and most straightforward way is to provide a huggingface-hub toolkit with inference functions. Users can then easily register this toolkit with an autogen agent according to their requirements (a sketch follows after this list). However, it's not clear to me where the best place for this toolkit would be.
  2. The second approach is to provide a huggingface_agent, like Transformers Agent. This agent would essentially be a pairing of an assistant and a user-proxy agent, both registered with the huggingface-hub toolkit. Users could seamlessly use this agent to leverage its multimodal capabilities without having to register the toolkit for execution themselves.
  3. The third approach is to create a multimodal capability and add it to any given agent by hooking the process_last_received_message method. However, this may not be straightforward for some tasks, such as text-to-image.
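
A rough sketch of what option 1 could look like, assuming a Hub-backed caption_image function like the one sketched above and placeholder LLM config values:

```python
import os
import autogen

# Placeholder config; any tool-calling-capable model works here.
config_list = [{"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]}]

assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
user_proxy = autogen.UserProxyAgent(
    "user_proxy", human_input_mode="NEVER", code_execution_config=False
)

# Two-sided registration: the assistant may propose the call, the proxy executes it.
autogen.register_function(
    caption_image,
    caller=assistant,
    executor=user_proxy,
    name="caption_image",
    description="Generate a text caption for a local image file.",
)

user_proxy.initiate_chat(assistant, message="What is shown in coco.jpg?")
```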

Additional context

I'd like to hear your suggestions, and I'm happy to contribute in whichever way fits best.

@whiskyboy whiskyboy added the enhancement label May 3, 2024
@ekzhu
Collaborator

ekzhu commented May 3, 2024

Re 1. We have some ongoing work in #1929 and #2414 for adding functions as a function store. This is similar to your idea of adding built-in Hugging Face tools. It's best to discuss this direction with @gagb @afourney and @LeoLjl and see if you can combine efforts.

Re 2. Sounds interesting! I think we can start with a notebook example to show how this works, and then decide whether to just do a notebook PR or a contrib agent.

Re 3. cc @WaelKarkoub @BeibinLi: we do have a text-to-image capability already.

@WaelKarkoub
Collaborator

WaelKarkoub commented May 3, 2024

@whiskyboy we implemented VisionCapability that adds the vision modality to any LLM (image-to-text): https://microsoft.github.io/autogen/docs/notebooks/agentchat_lmm_gpt-4v/#behavior-with-and-without-visioncapability-for-agents.

We also implemented an ImageGeneration capability that allows for any LLM to generate images (text-to-image): https://microsoft.github.io/autogen/docs/notebooks/agentchat_image_generation_capability
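
For reference, a usage sketch of these two capabilities, with import paths and class names as used in the linked notebooks; the config values are placeholders:

```python
import os
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability
from autogen.agentchat.contrib.capabilities import generate_images

api_key = os.environ["OPENAI_API_KEY"]
agent = ConversableAgent(
    "assistant", llm_config={"config_list": [{"model": "gpt-4", "api_key": api_key}]}
)

# Image-to-text: describe images found in incoming messages.
VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": api_key}]}
).add_to_agent(agent)

# Text-to-image: generate images through a DALL-E backed generator.
dalle_generator = generate_images.DalleImageGenerator(
    llm_config={"config_list": [{"model": "dall-e-3", "api_key": api_key}]}
)
generate_images.ImageGeneration(image_generator=dalle_generator).add_to_agent(agent)
```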

Other multimodal features are currently being worked on; you can track their progress in the roadmap #1975. Let me know if you have other ideas we could add to the roadmap.

@whiskyboy
Collaborator Author

@ekzhu
It's good to know there will be a function store in AutoGen soon! I will also try to provide a PoC of the second approach in the next couple of days.

@WaelKarkoub
Thank you for sharing this awesome roadmap! I'm also thinking of adding some similar multimodal capabilities, like TTS or document QA, but with non-OpenAI models (more specifically, with open-source models on the HuggingFace Hub). Although the current implementation of some capabilities accepts a custom process function, built-in support for HuggingFace models is also attractive (to me at least). Additionally, we could achieve more capabilities, like image-to-image and audio separation, by leveraging the hf-hub.
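
To make the idea concrete, a minimal sketch of a Hub-backed TTS helper using huggingface_hub's InferenceClient; the model name is only an example and would need to be served by the inference endpoint in use:

```python
from huggingface_hub import InferenceClient

def text_to_speech(text: str, output_path: str = "speech.flac") -> str:
    """Synthesize speech for `text` with an open-source TTS model and save it."""
    client = InferenceClient()
    # Example model only; any text-to-speech model on the Hub could be used.
    audio_bytes = client.text_to_speech(text, model="facebook/mms-tts-eng")
    with open(output_path, "wb") as f:
        f.write(audio_bytes)
    return output_path
```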

@WaelKarkoub
Collaborator

@whiskyboy just for awareness, I have a PR that handles text-to-speech and speech-to-text: #2098. I'm still experimenting with the architecture, but it mostly works.

@ekzhu ekzhu added the multimodal label May 3, 2024
@whiskyboy whiskyboy mentioned this issue May 5, 2024
@whiskyboy
Collaborator Author

@WaelKarkoub @ekzhu
Drafted a PR here: #2599
