Large multimodal models (LMMs) are gradually replacing large language models (LLMs). Unlike LLMs, LMMs accept inputs in multiple modalities, and some also produce multimodal outputs. This flexibility lets a model handle a wider range of tasks, so supporting multimodal models in agents will potentially enhance the Camel Agent's capability. Well-known recent LMMs include GPT-4V, Gemini, and Claude 3. This feature request focuses mainly on GPT-4V, but the interface should stay general enough to support other kinds of LMMs.
Solution
Basic Multimodal Agent (with GPT-4V):
Enable image input by adding image_url to the Camel agent's input_message, where image_url can be either a URL to an image or base64-encoded image data. This may require modifying BaseMessage.
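As a concrete sketch, assuming BaseMessage keeps its string-based fields and gains an optional image_url, the message could be serialized into the typed content-part list that the GPT-4V chat endpoint accepts. The class and helper names here are illustrative, not the current CAMEL API:

```python
import base64
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalMessage:
    # Hypothetical extension of BaseMessage: text plus an optional image.
    role_name: str
    content: str
    image_url: Optional[str] = None  # http(s) URL or a base64 data URI

    def to_openai_content(self) -> list:
        # OpenAI's vision models expect a list of typed content parts
        # instead of a single string when an image is attached.
        parts = [{"type": "text", "text": self.content}]
        if self.image_url is not None:
            parts.append(
                {"type": "image_url", "image_url": {"url": self.image_url}}
            )
        return parts

def encode_image_as_data_uri(raw_bytes: bytes, mime: str = "image/png") -> str:
    # base64-encode raw image bytes into a data URI the API accepts in
    # place of a remote URL.
    b64 = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

Because image_url stays a plain string in both forms, the rest of the message-passing code would not need to change.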
Agent memory needs to support image storage; some kinds of memory may not support it. The default ChatHistoryMemory should work well.
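Why a chat-history memory should cope: it only appends and replays records, so a record that carries its image as a string (URL or base64 data URI) needs no schema change. A toy sketch, not the real ChatHistoryMemory interface:

```python
from typing import Any, Dict, List

class SimpleChatHistoryMemory:
    # Minimal stand-in for a chat-history memory; illustrates why a
    # string-serializable record format can hold images unchanged.
    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def write(self, record: Dict[str, Any]) -> None:
        self._records.append(record)

    def retrieve(self) -> List[Dict[str, Any]]:
        return list(self._records)

memory = SimpleChatHistoryMemory()
memory.write({"role": "user",
              "content": "What is in this image?",
              "image_url": "data:image/png;base64,iVBORw0KGgo="})
```

Memories that summarize or embed their records would need real changes, since an image string cannot be summarized as text.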
Update OpenAITokenCounter to count image tokens as well.
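For reference, OpenAI's published accounting for GPT-4V charges a flat 85 tokens for low-detail images; for high-detail images it rescales the image, then charges 170 tokens per 512x512 tile plus a base of 85. A sketch of the rule OpenAITokenCounter would need:

```python
import math

def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    # Token cost rules published for GPT-4V ("gpt-4-vision-preview").
    if detail == "low":
        return 85
    # Scale down so the longest side is at most 2048 px.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Then scale down so the shortest side is at most 768 px.
    scale = 768 / min(width, height)
    if scale < 1.0:
        width, height = width * scale, height * scale
    # 170 tokens per 512x512 tile, plus a base of 85.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```

For example, a 1024x1024 high-detail image scales to 768x768, covering 4 tiles, which matches OpenAI's documented cost of 765 tokens.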
Add image-related examples such as OCR or object detection to verify the agent with the image modality.
This requires adding new image-related prompts in the prompts folder.
Brainstorm more interesting examples where the user and assistant can use the image modality collaboratively.
Advanced Multimodal Agent (GPT-4V):
Enable the image modality in EmbodiedAgent and create some interesting examples.
Multimodal Agent with different LMMs:
Support other LMMs such as Claude 3 and Gemini in addition to GPT-4V.
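To keep the interface general across providers, the provider-specific image payload shapes could sit behind a small adapter layer. The class names below are illustrative; the Anthropic block assumes the caller already holds base64 data, since the Claude 3 Messages API takes base64 source blocks rather than URLs:

```python
from abc import ABC, abstractmethod

class MultimodalBackend(ABC):
    # Hypothetical adapter so agent code is not tied to GPT-4V's format.
    @abstractmethod
    def format_image(self, image: str) -> dict:
        ...

class OpenAIBackend(MultimodalBackend):
    def format_image(self, image: str) -> dict:
        # OpenAI accepts either an http(s) URL or a data URI here.
        return {"type": "image_url", "image_url": {"url": image}}

class AnthropicBackend(MultimodalBackend):
    def format_image(self, image: str) -> dict:
        # Claude 3 expects raw base64 data in a "source" block; the
        # media type is assumed to be PNG for this sketch.
        return {"type": "image", "source": {"type": "base64",
                                            "media_type": "image/png",
                                            "data": image}}
```

The agent would hold a MultimodalBackend and never touch provider-specific dictionaries directly, so adding Gemini later means adding one more subclass.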
Alternatives
No response
Additional context
No response
An important modification will be supporting multimodality in BaseMessage, which is our primary data exchange format. Refactoring it may require a lot of code changes.
@dandansamax IMO it depends on how we want to store the images. If we just store the images in base64-encoded format (which is also a string format), then we may not need too many changes. We can discuss the details of how to store the images offline.
@zechengz I agree, using base64 for image storage seems promising. However, I'm concerned about potential slowdowns when editing images, and we still need to modify the BaseMessage structure to differentiate between image and text content. Let's delve into this further in a meeting.
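For context on the trade-off discussed here: base64 keeps images inside string fields at the cost of roughly 33% size inflation and a decode/re-encode round trip for every edit. A minimal sketch:

```python
import base64

def store_image(raw: bytes) -> str:
    # Store images as base64 strings so they fit BaseMessage's str fields.
    return base64.b64encode(raw).decode("ascii")

def load_image(stored: str) -> bytes:
    # Any edit must decode back to bytes first, which is the overhead
    # raised above: edit = decode -> transform -> re-encode.
    return base64.b64decode(stored)

raw = bytes(range(256)) * 4  # stand-in for real image bytes
stored = store_image(raw)
```

The round trip is lossless, but base64 encodes every 3 bytes as 4 characters, so stored size grows by about a third.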