
👁️ VLM Nodes

🔽Examples below • 📙 Visit my other repo to learn more about Vision Language Models


Usage

  • For Windows and Linux:
cd custom_nodes
git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git
  • For macOS or AMD GPUs (ROCm), use the mac branch: download the repository as a zip and unzip it into the custom_nodes folder.

VLM Nodes

Utilizes llama-cpp-python for integration of LLaVa models. You can load and use any LLaVa-based VLM in GGUF format with these nodes.
You need to download a model such as ggml-model-q4_k.gguf and its clip projector such as mmproj-model-f16.gguf from these repositories (in Files and versions).
Python >= 3.9 is required.
Put all of the files inside models/LLavacheckpoints.
Note that every model's clip projector is different!
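
As an illustration, here is a minimal sketch (assuming the huggingface_hub package; the repo id and filenames below are examples from one LLaVa 1.6 GGUF repository and will differ for the model you pick) of fetching a model and its matching clip projector into models/LLavacheckpoints:

```python
# Minimal sketch: download a GGUF model and its matching clip projector into
# ComfyUI's models/LLavacheckpoints folder. Repo id and filenames are examples;
# pick the ones listed on the model page you are actually using.
from huggingface_hub import hf_hub_download

CHECKPOINT_DIR = "ComfyUI/models/LLavacheckpoints"
REPO_ID = "cjpais/llava-1.6-mistral-7b-gguf"  # example repository

# the quantized model weights
hf_hub_download(repo_id=REPO_ID, filename="llava-v1.6-mistral-7b.Q4_K_M.gguf",
                local_dir=CHECKPOINT_DIR)
# the clip projector that belongs to THIS model
hf_hub_download(repo_id=REPO_ID, filename="mmproj-model-f16.gguf",
                local_dir=CHECKPOINT_DIR)
```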

Structured Output

Getting structured outputs can be quite challenging through prompt engineering alone.
I've added the Structured Output node to VLM Nodes.
Now, you can obtain your answers reliably.
You can extract entities and numbers, classify prompts into given classes, or generate one specific prompt, to name just a few examples.
You can add additional descriptions to fields and choose the attributes you want it to return.
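
Conceptually this works like filling a small schema of named, described fields; a minimal pydantic sketch (illustrative field names, not the node's internal definition) of how field descriptions and the attributes you pick constrain the returned JSON:

```python
# Minimal sketch of a structured-output schema. Field names are illustrative,
# not the node's internal definition; descriptions steer the model and the
# declared attributes are the only ones it is asked to return.
from pydantic import BaseModel, Field

class PromptInfo(BaseModel):
    entities: list[str] = Field(description="Named entities found in the text")
    number_of_people: int = Field(description="How many people are mentioned")
    category: str = Field(description="One of: portrait, landscape, abstract")

# The node emits JSON-looking text, which can be validated like this:
raw = '{"entities": ["castle", "dragon"], "number_of_people": 0, "category": "landscape"}'
info = PromptInfo.model_validate_json(raw)
print(info.category)  # -> landscape
```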

Image to Music

Utilizes VLMs, LLMs, and AudioLDM-2 to make music from images.
Use the SaveAudioNode to save the music inside the output folder.
It will automatically download the necessary files into models/LLavacheckpoints/files_for_audioldm2.
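
The final step of that chain is plain text-to-audio; a minimal sketch of that stage alone (assuming the diffusers package and the cvssp/audioldm2 weights, with the prompt standing in for the caption-derived music description the VLM/LLM pair would produce):

```python
# Minimal sketch of the text-to-audio stage. The prompt stands in for the
# music description produced earlier in the node chain.
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "calm orchestral music inspired by a misty mountain lake at dawn"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM-2 generates 16 kHz audio
scipy.io.wavfile.write("output/image_to_music.wav", rate=16000, data=audio)
```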


LLM to Music

Utilizes Chat Musician, an open-source LLM that integrates intrinsic musical abilities.
ChatMusician Demo Page
You can try prompts from this demo page.

Download the GGUF file from the ChatMusician GGUF Files repository.
ChatMusician.Q5_K_M.gguf or ChatMusician.Q5_K_S.gguf is recommended.

BIG BIG BIG Warning: It does NOT work perfectly. If you get an error, accept it and queue the prompt again with the same settings!!


InternLM-XComposer2-VL Node

Utilizes AutoGPTQ for integration of InternLM-XComposer2-VL Model. It will automatically download the necessary files into models/LLavacheckpoints/files_for_internlm. This is one of the best models for visual perception.
Important Note: This model is heavy.
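
For orientation, a minimal sketch of talking to the model directly (this mirrors the full-precision internlm/internlm-xcomposer2-vl-7b model card; the node itself wraps the 4-bit AutoGPTQ build, and the trust_remote_code interface can change between revisions):

```python
# Minimal sketch: visual Q&A with InternLM-XComposer2-VL via transformers
# remote code. The node wraps the quantized AutoGPTQ variant instead.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "internlm/internlm-xcomposer2-vl-7b"
model = AutoModel.from_pretrained(ckpt, torch_dtype=torch.float16,
                                  trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

query = "<ImageHere>Please describe this image in detail."
with torch.cuda.amp.autocast():
    response, _ = model.chat(tokenizer, query=query, image="example.jpg",
                             history=[], do_sample=False)
print(response)
```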

Automatic Prompt Generation and Suggestion Nodes

Get Keyword node: It can take LLava outputs and extract keywords from them.
LLava PromptGenerator node: It can create prompts given descriptions or keywords (the input prompt can be the Get Keyword output or the LLava output directly).
Suggester node: It can generate 5 different prompts based on the original prompt, either consistent variations (choose consistent in the options) or random ones (choose random in the options).

  • Works best with LLava 1.5 and 1.6.

Play with the temperature for creative or consistent results: the higher the temperature, the more creative the results.
If you want to dive deep, see LLM Settings.
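
For reference, a minimal llama-cpp-python sketch (model path, system prompt, and keywords are placeholders) showing where temperature enters sampling; low values give repeatable prompts, high values more varied ones:

```python
# Minimal sketch: sampling prompt suggestions at two temperatures with
# llama-cpp-python. The model path is any GGUF file in models/LLavacheckpoints.
from llama_cpp import Llama

llm = Llama(model_path="models/LLavacheckpoints/ggml-model-q4_k.gguf", n_ctx=4096)

for temperature in (0.2, 1.0):  # consistent vs. creative
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You write Stable Diffusion prompts."},
            {"role": "user", "content": "Keywords: castle, fog, sunrise"},
        ],
        temperature=temperature,
        max_tokens=128,
    )
    print(temperature, out["choices"][0]["message"]["content"])
```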

Outputs are JSON-looking texts; you can view them as plain text using the JsonToText node.
You can view any string output with the ViewText node.
You can set any string input using the SimpleText node.
Utilizes llama-cpp-agents for getting structured outputs.

LLM Prompt Generation from text nodes

LLM PromptGenerator node: Qwen 1.8B Stable Diffusion Prompt
IF prompt MKR
These LLMs work best for prompt generation for now.
LLMSampler node: You can chat with any LLM in GGUF format; you can also use LLava models as an LLM.

API PromptGenerator node: You can use the ChatGPT and DeepSeek APIs to create prompts (see the sketch below). https://platform.deepseek.com/ gives 10M free tokens.

  • ChatGPT-4
  • ChatGPT-3.5
  • DeepSeek
You can also use them for simple chat; there is an option in the node.
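
A minimal sketch of the API path (DeepSeek exposes an OpenAI-compatible endpoint; the environment variable name, model name, and prompts here are placeholders, and for ChatGPT you would drop base_url and use an OpenAI model):

```python
# Minimal sketch: prompt generation through an OpenAI-compatible API.
# For ChatGPT, omit base_url and pick an OpenAI model instead.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],   # placeholder env var
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You write detailed Stable Diffusion prompts."},
        {"role": "user", "content": "A lighthouse on a stormy coast"},
    ],
    temperature=0.8,
)
print(response.choices[0].message.content)
```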

UForm-Gen2 Qwen Node

UForm-Gen2 is an extremely fast small generative vision-language model primarily designed for Image Captioning and Visual Question Answering.
UForm-Gen2 Qwen
It will automatically download the necessary files into models/LLavacheckpoints/files_for_uform_gen2_qwen
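
A minimal captioning sketch (following the unum-cloud/uform-gen2-qwen-500m model card; the trust_remote_code interface may change between revisions, and the image path is a placeholder):

```python
# Minimal sketch: image captioning with UForm-Gen2 Qwen via transformers
# remote code; interface follows the model card and may change between revisions.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "unum-cloud/uform-gen2-qwen-500m"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(text=["Describe the image in detail."], images=[image],
                   return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=256)

prompt_len = inputs["input_ids"].shape[1]
print(processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)[0])
```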

Kosmos-2 Node

Kosmos-2: Grounding Multimodal Large Language Models to the World. It will automatically download the necessary files into models/LLavacheckpoints/files_for_kosmos2.
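
A minimal grounded-captioning sketch (assuming the transformers Kosmos-2 integration and the microsoft/kosmos-2-patch14-224 weights; the image path is a placeholder):

```python
# Minimal sketch: grounded image captioning with Kosmos-2. The processor's
# post-processing splits the output into a caption plus grounded entities.
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = Kosmos2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
inputs = processor(text="<grounding>An image of", images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

caption, entities = processor.post_process_generation(generated_text)
print(caption)   # plain caption
print(entities)  # [(phrase, (start, end), [normalized bounding boxes]), ...]
```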

moondream1 and moondream2 Node

This node is designed to work with the Moondream model, a powerful small vision language model built by @vikhyatk using SigLIP, Phi-1.5, and the LLaVa training dataset. The model boasts 1.6 billion parameters and is made available for research purposes only; commercial use is not allowed.

moondream2 is a small vision language model designed to run efficiently on edge devices.

It will automatically download the necessary files into models/LLavacheckpoints/files_for__moondream and models/LLavacheckpoints/files_for_moondream2
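
A minimal visual Q&A sketch for moondream2 (following the vikhyatk/moondream2 model card; the trust_remote_code interface has changed between revisions, so pin a revision if you reproduce this, and the image path is a placeholder):

```python
# Minimal sketch: visual question answering with moondream2 via transformers
# remote code. encode_image/answer_question come from the model's remote code
# and may differ between revisions.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```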

JoyTag Node

@fpgamine's JoyTag is a state-of-the-art AI vision model for tagging images, with a focus on sex positivity and inclusivity.
It uses the Danbooru tagging schema, but works across a wide range of images, from hand drawn to photographic. It will automatically download the necessary files into models/LLavacheckpoints/files_for_joytagger

Acknowledgements

Example LLaVa Nodes


Example Image to Music


Example InternLM-XComposer Node


Example Using Automatic Prompt Generation


LLM Nodes

VLM + LLM

Example UForm-Gen2 Qwen Node


Example Kosmos-2 Node


Example moondream


Example Joytag


Example Prompt Generation


Example SimpleChat


Example LLava Sampler Advanced
