Large multimodal models (LMMs) are gradually replacing large language models (LLMs). Unlike LLMs, LMMs accept inputs in multiple modalities, and some also produce multimodal outputs. This flexibility lets a model handle a wider range of tasks, so supporting multimodal models in agents will potentially enhance the Camel Agent's capability. Well-known recent LMMs include GPT-4V, Gemini, and Claude 3. This feature request focuses mainly on GPT-4V, but the interface should stay general enough to support other kinds of LMMs.
Solution
Basic Multimodal Agent (with GPT-4V):
Enable image input by adding image_url to the Camel agent's input_message, where image_url can be either a URL to an image or base64-encoded image data. This may require modifying BaseMessage.
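As a concrete sketch, assuming BaseMessage keeps its string-based fields and gains an optional image_url, the message could be serialized into the typed content-part list that the GPT-4V chat endpoint accepts. The class and helper names here are illustrative, not the current CAMEL API:

```python
import base64
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalMessage:
    # Hypothetical extension of BaseMessage: text plus an optional image.
    role_name: str
    content: str
    image_url: Optional[str] = None  # http(s) URL or a base64 data URI

    def to_openai_content(self) -> list:
        # OpenAI's vision models expect a list of typed content parts
        # instead of a single string when an image is attached.
        parts = [{"type": "text", "text": self.content}]
        if self.image_url is not None:
            parts.append(
                {"type": "image_url", "image_url": {"url": self.image_url}}
            )
        return parts

def encode_image_as_data_uri(raw_bytes: bytes, mime: str = "image/png") -> str:
    # base64-encode raw image bytes into a data URI the API accepts in
    # place of a remote URL.
    b64 = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

Because image_url stays a plain string in both forms, the rest of the message-passing code would not need to change.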
Agent memory needs to support image storage; some kinds of memory may not support it. The default ChatHistoryMemory should work well.
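Why a chat-history memory should cope: it only appends and replays records, so a record that carries its image as a string (URL or base64 data URI) needs no schema change. A toy sketch, not the real ChatHistoryMemory interface:

```python
from typing import Any, Dict, List

class SimpleChatHistoryMemory:
    # Minimal stand-in for a chat-history memory; illustrates why a
    # string-serializable record format can hold images unchanged.
    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def write(self, record: Dict[str, Any]) -> None:
        self._records.append(record)

    def retrieve(self) -> List[Dict[str, Any]]:
        return list(self._records)

memory = SimpleChatHistoryMemory()
memory.write({"role": "user",
              "content": "What is in this image?",
              "image_url": "data:image/png;base64,iVBORw0KGgo="})
```

Memories that summarize or embed their records would need real changes, since an image string cannot be summarized as text.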
Update OpenAITokenCounter to count image tokens as well.
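For reference, OpenAI's published accounting for GPT-4V charges a flat 85 tokens for low-detail images; for high-detail images it rescales the image, then charges 170 tokens per 512x512 tile plus a base of 85. A sketch of the rule OpenAITokenCounter would need:

```python
import math

def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    # Token cost rules published for GPT-4V ("gpt-4-vision-preview").
    if detail == "low":
        return 85
    # Scale down so the longest side is at most 2048 px.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Then scale down so the shortest side is at most 768 px.
    scale = 768 / min(width, height)
    if scale < 1.0:
        width, height = width * scale, height * scale
    # 170 tokens per 512x512 tile, plus a base of 85.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```

For example, a 1024x1024 high-detail image scales to 768x768, covering 4 tiles, which matches OpenAI's documented cost of 765 tokens.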
Add image-related examples such as OCR or object detection to verify the agent with the image modality.
This requires adding new image-related prompts in the prompts folder.
Brainstorm more interesting examples where the user and assistant can use the image modality collaboratively.
Advanced Multimodal Agent (GPT-4V):
Enable the image modality in EmbodiedAgent and create some interesting examples.
Multimodal Agent with different LMMs:
Support other LMMs such as Claude 3 and Gemini in addition to GPT-4V.
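To keep the interface general across providers, the provider-specific image payload shapes could sit behind a small adapter layer. The class names below are illustrative; the Anthropic block assumes the caller already holds base64 data, since the Claude 3 Messages API takes base64 source blocks rather than URLs:

```python
from abc import ABC, abstractmethod

class MultimodalBackend(ABC):
    # Hypothetical adapter so agent code is not tied to GPT-4V's format.
    @abstractmethod
    def format_image(self, image: str) -> dict:
        ...

class OpenAIBackend(MultimodalBackend):
    def format_image(self, image: str) -> dict:
        # OpenAI accepts either an http(s) URL or a data URI here.
        return {"type": "image_url", "image_url": {"url": image}}

class AnthropicBackend(MultimodalBackend):
    def format_image(self, image: str) -> dict:
        # Claude 3 expects raw base64 data in a "source" block; the
        # media type is assumed to be PNG for this sketch.
        return {"type": "image", "source": {"type": "base64",
                                            "media_type": "image/png",
                                            "data": image}}
```

The agent would hold a MultimodalBackend and never touch provider-specific dictionaries directly, so adding Gemini later means adding one more subclass.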
Alternatives
No response
Additional context
No response
An important modification will be supporting multimodality in BaseMessage, which is our primary data exchange format. Refactoring it may require a lot of code changes.
@dandansamax IMO it depends on how we want to store the images. If we just store the images in base64-encoded format (which is also a string format), then we may not need too many changes. We can discuss the details of how to store the images offline.
@zechengz I agree, using base64 for image storage seems promising. However, I'm concerned about potential slowdowns when editing images, and we still need to modify the BaseMessage structure to differentiate between image and text content. Let's delve into this further in a meeting.
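For context on the trade-off discussed here: base64 keeps images inside string fields at the cost of roughly 33% size inflation and a decode/re-encode round trip for every edit. A minimal sketch:

```python
import base64

def store_image(raw: bytes) -> str:
    # Store images as base64 strings so they fit BaseMessage's str fields.
    return base64.b64encode(raw).decode("ascii")

def load_image(stored: str) -> bytes:
    # Any edit must decode back to bytes first, which is the overhead
    # raised above: edit = decode -> transform -> re-encode.
    return base64.b64decode(stored)

raw = bytes(range(256)) * 4  # stand-in for real image bytes
stored = store_image(raw)
```

The round trip is lossless, but base64 encodes every 3 bytes as 4 characters, so stored size grows by about a third.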