[Roadmap] Multimodal Agent Roadmap #454

Open · 2 tasks done
zechengz opened this issue Mar 7, 2024 · 4 comments

Labels: Agent (Related to camel agents), call for contribution, enhancement (New feature or request), Example

@zechengz (Member) commented Mar 7, 2024

Required prerequisites

Motivation

Currently, large multimodal models (LMMs) are gradually replacing large language models (LLMs). Unlike LLMs, LMMs accept inputs in multiple modalities, and some also produce multimodal outputs. The additional modalities give LMMs more flexibility and let them perform a wider range of tasks, so supporting multimodal models in agents will potentially enhance the CAMEL agent's capability. Notable recent LMMs include GPT-4V, Gemini, and Claude 3. This feature request focuses mainly on GPT-4V, but the interface should be general enough to cover other kinds of LMMs.

Solution

Basic Multimodal Agent (with GPT-4V):

  • Enable and add image_url to the CAMEL agent's input_message, where image_url can be a URL to an image or base64-encoded image data. This may require modifying BaseMessage (see the first sketch after this list).
  • Agent memory needs to support image storage; some kinds of memory may not support it. The default ChatHistoryMemory should work well.
  • Update OpenAITokenCounter to include counting image tokens (see the token-counting sketch after this list).
  • Add image-related examples such as OCR or object detection to verify the agent with the image modality.
    • This requires adding new image-related prompts to the prompts folder.
  • Brainstorm more interesting examples where the user and assistant can use the image modality in a collaborative way.
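
A minimal sketch of the first item above, assuming a dataclass-style BaseMessage. The image_url field and the to_openai_user_message helper are illustrative names, not existing CAMEL API; the payload layout follows OpenAI's documented vision message format.

```python
# Hypothetical sketch: attaching an image to the agent's input message.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseMessage:  # illustrative subset of fields
    role_name: str
    content: str
    # Either an http(s) URL or a "data:image/png;base64,..." data URI.
    image_url: Optional[str] = None

    def to_openai_user_message(self) -> dict:
        """Build an OpenAI chat message, switching to the vision content
        format (a list of text and image_url parts) only when an image
        is attached."""
        if self.image_url is None:
            return {"role": "user", "content": self.content}
        return {
            "role": "user",
            "content": [
                {"type": "text", "text": self.content},
                {"type": "image_url", "image_url": {"url": self.image_url}},
            ],
        }
```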
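
For the OpenAITokenCounter item, a hedged sketch of image token counting based on OpenAI's published tile-based accounting for GPT-4V (85 base tokens plus 170 tokens per 512 px tile at "high" detail, 85 tokens flat at "low" detail). The function name is illustrative and the constants may change.

```python
import math


def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4V token cost for one image (illustrative helper)."""
    if detail == "low":
        return 85
    # Scale the image to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = int(width * scale), int(height * scale)
    # Then scale so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = int(width * scale), int(height * scale)
    # 170 tokens per 512 px tile, plus a fixed 85-token base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```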

Advanced Multimodal Agent (GPT-4V):

  • Enable the image modality in EmbodiedAgent and create some interesting examples.

Multimodal Agent with different LMMs:

  • Support other LMMs such as Claude 3 and Gemini in addition to GPT-4V.

Alternatives

No response

Additional context

No response

zechengz added the enhancement (New feature or request), Agent (Related to camel agents), and Example labels on Mar 7, 2024
zechengz changed the title from "[Feature Request] Multimodal Agent Support" to "[Feature Request] Multimodal Agent Roadmap" on Mar 7, 2024
@dandansamax (Collaborator) commented:

An important modification will be supporting multimodality in BaseMessage, which is our primary data exchange format. Refactoring it may require a lot of code changes.

@zechengz (Member Author) commented Mar 8, 2024

@dandansamax IMO it depends on how we want to store the images. If we just store them in base64-encoded form (which is still a string), then we may not need too many changes. We can discuss the details of image storage offline.

@dandansamax (Collaborator) commented:


@zechengz I agree, using base64 for image storage seems promising. However, I'm concerned about potential slowdowns in image-editing workflows, and we still need to modify the BaseMessage structure to differentiate between image and text content. Let's delve into this further in a meeting.

@zechengz (Member Author) commented:

Discussed with @dandansamax offline; in general we will:

  • Modify the BaseMessage
    • Add image: Optional[PIL.Image] (a sketch follows this list)
      • We store the image itself because we need some image stats, such as the image size
    • Focus on base64 only, not image URLs
    • Some memories only support text; we can detect this and raise an error
  • See the previous multimodal prompt PR: [Feature] Multimodal agents demo #320
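
A minimal sketch of the design above, assuming a dataclass-style BaseMessage. Apart from the image: Optional[PIL.Image] field and the raise-on-unsupported-memory behaviour mentioned in this thread, the names here are illustrative rather than actual CAMEL API.

```python
import base64
import io
from dataclasses import dataclass
from typing import Optional

from PIL import Image


@dataclass
class BaseMessage:  # illustrative subset of fields
    role_name: str
    content: str
    # Store the PIL image itself so stats such as image size stay available.
    image: Optional[Image.Image] = None

    def image_as_base64(self) -> Optional[str]:
        """Encode the stored image as base64 PNG for the model payload."""
        if self.image is None:
            return None
        buffer = io.BytesIO()
        self.image.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode("utf-8")


class TextOnlyMemory:
    """Illustrative memory that only supports text: detect images and raise."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def write_record(self, message: BaseMessage) -> None:
        if message.image is not None:
            raise ValueError(
                f"{type(self).__name__} only supports text content, "
                "but the message carries an image."
            )
        self._records.append(message.content)
```

With this layout, message.image.size still exposes the width and height needed for image token counting, while image_as_base64 covers the base64-only transport decision.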

Wendong-Fan changed the title from "[Feature Request] Multimodal Agent Roadmap" to "[Roadmap] Multimodal Agent Roadmap" on Apr 6, 2024
Project status: 🚀 Roadmap
4 participants