
Image Processing in LLMs – TaskMatrix (Visual ChatGPT)

Just as a human needs a separate skill to draw images, image generation in machine learning also requires dedicated architectures – such as diffusion models, ControlNet, and so on.
TaskMatrix (formerly Visual ChatGPT) integrates these expert visual models and performs complex image processing tasks through LLM reasoning.

You can download and run the reference implementation from the GitHub repository.

GitHub: TaskMatrix – Microsoft
https://github.com/microsoft/visual-chatgpt

Note: You can also run TaskMatrix (Visual ChatGPT) hosted on Hugging Face.

(Image from: TaskMatrix – GitHub)

In my previous post, I demystified the Reasoning+Acting (ReAct) framework. TaskMatrix is also built on this framework to perform image processing with LLM reasoning.

This post briefly shows you how TaskMatrix works on this framework.
This idea will also give you hints for building practical image processing with AI.

TaskMatrix (Visual ChatGPT) Overview

If you’re familiar with the ReAct chain in LLMs (see here), the idea behind TaskMatrix is very simple.

Instead of training a new model from scratch on multiple modalities of data (such as text, images, videos, etc.), TaskMatrix simply integrates existing models (foundation models) and tools into a ReAct flow.
To perform external visual actions from text instructions, stable visual foundation models (VFMs) – such as GANs, Stable Diffusion, BLIP, etc. – are integrated as tools in the ReAct chain. (This chain is built on LangChain; see my previous post for building ReAct on LangChain.)
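For reference, here is a minimal sketch of how a visual tool can be plugged into a conversational ReAct agent on LangChain. This assumes the legacy initialize_agent API, and the tool body is a stub – it is not TaskMatrix's exact code:

```python
# Minimal sketch (illustrative): registering a visual foundation model as a
# tool in a LangChain conversational ReAct agent. Assumes the legacy
# langchain agent API; the tool body here is a placeholder, not repo code.
from langchain.agents import Tool, initialize_agent
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

def image_captioning(image_path: str) -> str:
    """Placeholder: run a captioning model (e.g. BLIP) and return the caption."""
    return "a living room with a couch"  # stub result

tools = [
    Tool(
        name="Get Photo Description",
        func=image_captioning,
        description="useful when you want to know what is inside the photo. "
                    "The input to this tool should be a string, representing the image_path.",
    ),
]

llm = OpenAI(temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history")
agent = initialize_agent(
    tools, llm,
    agent="conversational-react-description",  # ReAct-style agent with chat history
    memory=memory,
    verbose=True,  # print Thought / Action / Observation steps
)
agent.run("describe image/9bb5e03b.png")
```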

The following is the list of available tools (capabilities) in the current GitHub implementation.

Available Tools

ImageCaptioning – Get Photo Description
Text2Image – Generate Image From User Input Text
ImageEditing.inference_remove() – Remove Something From The Photo
ImageEditing.inference_replace() – Replace Something From The Photo
InstructPix2Pix – Instruct Image Using Text
VisualQuestionAnswering – Answer Question About The Image
Image2Canny – Edge Detection On Image
CannyText2Image – Generate Image Condition On Canny Image
Image2Line – Line Detection On Image
LineText2Image – Generate Image Condition On Line Image
Image2Hed – Hed Detection On Image
HedText2Image – Generate Image Condition On Soft Hed Boundary Image
Image2Seg – Segmentation On Image
SegText2Image – Generate Image Condition On Segmentations
Image2Depth – Predict Depth On Image
DepthText2Image – Generate Image Condition On Depth
Image2Normal – Predict Normal Map On Image
NormalText2Image – Generate Image Condition On Normal Map
Image2Scribble – Sketch Detection On Image
ScribbleText2Image – Generate Image Condition On Sketch Image
Image2Pose – Pose Detection On Image
PoseText2Image – Generate Image Condition On Pose Image
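Each of these tools is essentially a thin wrapper around a model or library call that reads an image file and writes a new one. As an illustration, here is what a minimal Image2Canny-style tool could look like with OpenCV (a sketch; the thresholds and file naming are my assumptions, not the repo's exact code):

```python
# Illustrative sketch of an Image2Canny-style tool (not the repo's exact code).
# It loads an image, runs Canny edge detection, and saves the result under a
# new file name that the agent can pass to the next tool.
import uuid
import cv2

def image_to_canny(image_path: str, low: int = 100, high: int = 200) -> str:
    """Edge Detection On Image: returns the path of the generated edge map."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(image, low, high)  # thresholds are assumptions
    out_path = f"image/{uuid.uuid4().hex[:8]}_canny.png"
    cv2.imwrite(out_path, edges)
    return out_path  # the ReAct chain records this path as the Observation
```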

Now, let’s assume that I submit the following instruction:

replace the sofa in this image with a desk and then make it like a water-color painting

In this example, the instruction is decomposed into the following two actions, and the corresponding tools (in this case, CLIPSeg, Inpaint, and InstructPix2Pix – all on Hugging Face) are invoked in the corresponding steps of the ReAct chain.

Action 1: Replace Something From The Photo – First the image is segmented by the CLIPSeg model, and the segmented part is then repainted by the Inpaint pipeline.
Action 2: Instruct Image Using Text – The image is processed according to the human instruction (text) with the InstructPix2Pix pipeline.

You can process images with a variety of combinations – e.g., extract a human pose using Image2Pose, and then generate a new image with the same pose from a text instruction using PoseText2Image.
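For instance, that pose combination roughly corresponds to the following Hugging Face pipelines (a minimal sketch; the model IDs, file names, and prompt are illustrative assumptions, not necessarily what TaskMatrix loads):

```python
# Minimal sketch of an Image2Pose + PoseText2Image style combination using
# Hugging Face diffusers / controlnet_aux. Model IDs, file names, and the
# prompt are illustrative assumptions.
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Image2Pose: detect the human pose in a source photo (hypothetical file name)
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose = openpose(Image.open("image/person.png"))

# PoseText2Image: generate a new image conditioned on that pose
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
result = pipe("an astronaut, same pose",
              image=pose, num_inference_steps=20).images[0]
result.save("image/pose_result.png")
```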

See the prompt in the background

Now let’s briefly see the prompt in the background.

First, when the user uploads the original image, TaskMatrix saves this image on the server. (Here I assume that the file path is image/9bb5e03b.png.)
After the image is successfully saved, TaskMatrix generates a text description (caption) of this image with the BLIP model on Hugging Face, and the following prompt is sent to ChatGPT.
ChatGPT then returns “Received” as its response (the last line below).

Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a couch in the corner. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received".
AI: Received.
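The captioning step that produced the description above roughly corresponds to the following (a sketch; the exact BLIP checkpoint is an assumption):

```python
# Minimal sketch of the BLIP captioning step (the checkpoint is an assumption).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("image/9bb5e03b.png").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # e.g. "a living room with a couch and a couch in the corner"
```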

Note (Updated Dec 2023): This example uses ChatGPT, but the new GPT-4 Vision (a multimodal model, GPT-4V for short) can accept prompts containing visual objects, so the integration with LLM reasoning can be further simplified and improved.
With multimodal models, you no longer need to pass a file name and an image caption; you can pass the image itself together with the instructions.
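For example, with the OpenAI Python API (v1), an image can be passed directly as follows (a sketch assuming the gpt-4-vision-preview model available at the time of this update):

```python
# Sketch of passing an image directly to a multimodal model via the OpenAI API
# (assumes the gpt-4-vision-preview model available as of Dec 2023).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("image/9bb5e03b.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "replace the sofa in this image with a desk"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```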

When the user then instructs “replace the sofa in this image with a desk and then make it like a water-color painting“, the chain is invoked.

First in this chain, the following prompt is sent to ChatGPT, and the final Thought / Action / Action Input lines are returned as the response.

As you can see below, the ReAct chain in this example is performed by zero-shot style reasoning with the previous chat history (which includes the file name and the image caption).
If there were no need for external actions, ChatGPT would respond “No” to the question “Thought: Do I need to use a tool?“.
In this case, however, ChatGPT has responded to run the action “Replace Something From The Photo“.

Note: Here I have configured only 4 tools, which are the minimal tools required to run this example.
You can configure all available tools to run various types of instructions as you need, but the tools consume a lot of GPU memory and need more capacity. (When you configure all tools, the following prompt includes the descriptions of all of these tools.)

prompt 1

Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. Visual ChatGPT is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Visual ChatGPT is able to process and understand large amounts of text and images. As a language model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different visual tasks. Each image will have a file name formed as "image/xxx.png", and Visual ChatGPT can invoke different tools to indirectly understand pictures. When talking about images, Visual ChatGPT is very strict to the file name and will never fabricate nonexistent files. When using tools to generate new image files, Visual ChatGPT is also known that the image may not be the same as the user's demand, and will use other visual question answering tools or description tools to observe the real image. Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather than faking the image content and image file name. It will remember to provide the file name from the last tool observation, if a new image is generated.

Human may provide new figures to Visual ChatGPT with a description. The description helps Visual ChatGPT to understand this image, but Visual ChatGPT should use tools to finish following tasks, rather than directly imagine from the description.

Overall, Visual ChatGPT is a powerful visual dialogue assistant tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics.

TOOLS:
------
Visual ChatGPT has access to the following tools:

> Get Photo Description: useful when you want to know what is inside the photo. receives image_path as input. The input to this tool should be a string, representing the image_path.
> Remove Something From The Photo: useful when you want to remove and object or something from the photo from its description or location. The input to this tool should be a comma seperated string of two, representing the image_path and the object need to be removed.
> Replace Something From The Photo: useful when you want to replace an object from the object description or location with another object from its description. The input to this tool should be a comma seperated string of three, representing the image_path, the object to be replaced, the object to be replaced with
> Instruct Image Using Text: useful when you want to the style of the image to be like the text. like: make it look like a painting. or make it like a robot. The input to this tool should be a comma seperated string of two, representing the image_path and the text.

To use a tool, please use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Get Photo Description, Remove Something From The Photo, Replace Something From The Photo, Instruct Image Using Text]
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

You are very strict to the filename correctness and will never fake a file name if it does not exist.
You will remember to provide the image file name loyally if it's provided in the last tool observation.

Begin!

Previous conversation history:
Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a couch in the corner. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received".
AI: Received.

New input: replace the sofa in this image with a desk and then make it like a water-color painting
Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather than imagination.
The thoughts and observations are only visible for Visual ChatGPT, Visual ChatGPT should remember to repeat important information in the final response for Human.

Thought: Do I need to use a tool? Yes
Action: Replace Something From The Photo
Action Input: image/9bb5e03b.png, couch, desk

After ChatGPT has responded to run “Replace Something From The Photo“, the chain in the framework captures this response and issues the corresponding external action.
In this case, the following steps are executed in this action (a rough sketch with Hugging Face pipelines follows the list):

  1. The sofa (couch) in the image is segmented by predicting with the CLIPSeg model.
  2. A desk is then painted into the segmented region by running the pipeline with the Inpaint model.
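A minimal sketch of these two steps with Hugging Face pipelines (the checkpoints, mask threshold, and resizing are assumptions, not the repo's exact code):

```python
# Minimal sketch of the segment-then-inpaint action (checkpoints, threshold,
# and resizing are assumptions, not the repo's exact code).
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import StableDiffusionInpaintPipeline

image = Image.open("image/9bb5e03b.png").convert("RGB")

# 1. Segment the couch with CLIPSeg
seg_processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
inputs = seg_processor(text=["couch"], images=[image], return_tensors="pt")
with torch.no_grad():
    logits = seg_model(**inputs).logits          # low-resolution mask logits
mask = (torch.sigmoid(logits) > 0.5).squeeze().numpy().astype("uint8") * 255
mask_image = Image.fromarray(mask).resize(image.size)

# 2. Inpaint a desk into the masked region
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
result = pipe(prompt="a desk",
              image=image.resize((512, 512)),
              mask_image=mask_image.resize((512, 512))).images[0]
result.save("image/5737_replace-something_9bb5e03b_9bb5e03b.png")
```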

Here we assume that the newly generated image is saved as image/5737_replace-something_9bb5e03b_9bb5e03b.png on the server.

In the next prompt, the following text is sent to ChatGPT. (The final Thought / Action / Action Input lines are again the response from ChatGPT.)

As you can see below, the path of the generated file (in this case, image/5737_replace-something_9bb5e03b_9bb5e03b.png) is filled into this prompt as the Observation, and ChatGPT has then responded to run the next action “Instruct Image Using Text“.

prompt 2

Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. Visual ChatGPT is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Visual ChatGPT is able to process and understand large amounts of text and images. As a language model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different visual tasks. Each image will have a file name formed as "image/xxx.png", and Visual ChatGPT can invoke different tools to indirectly understand pictures. When talking about images, Visual ChatGPT is very strict to the file name and will never fabricate nonexistent files. When using tools to generate new image files, Visual ChatGPT is also known that the image may not be the same as the user's demand, and will use other visual question answering tools or description tools to observe the real image. Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather than faking the image content and image file name. It will remember to provide the file name from the last tool observation, if a new image is generated.

Human may provide new figures to Visual ChatGPT with a description. The description helps Visual ChatGPT to understand this image, but Visual ChatGPT should use tools to finish following tasks, rather than directly imagine from the description.

Overall, Visual ChatGPT is a powerful visual dialogue assistant tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics.

TOOLS:
------
Visual ChatGPT has access to the following tools:

> Get Photo Description: useful when you want to know what is inside the photo. receives image_path as input. The input to this tool should be a string, representing the image_path.
> Remove Something From The Photo: useful when you want to remove and object or something from the photo from its description or location. The input to this tool should be a comma seperated string of two, representing the image_path and the object need to be removed.
> Replace Something From The Photo: useful when you want to replace an object from the object description or location with another object from its description. The input to this tool should be a comma seperated string of three, representing the image_path, the object to be replaced, the object to be replaced with
> Instruct Image Using Text: useful when you want to the style of the image to be like the text. like: make it look like a painting. or make it like a robot. The input to this tool should be a comma seperated string of two, representing the image_path and the text.

To use a tool, please use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Get Photo Description, Remove Something From The Photo, Replace Something From The Photo, Instruct Image Using Text]
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

You are very strict to the filename correctness and will never fake a file name if it does not exist.
You will remember to provide the image file name loyally if it's provided in the last tool observation.

Begin!

Previous conversation history:
Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a couch in the corner. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received".
AI: Received.

New input: replace the sofa in this image with a desk and then make it like a water-color painting
Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather than imagination.
The thoughts and observations are only visible for Visual ChatGPT, Visual ChatGPT should remember to repeat important information in the final response for Human.

Thought: Do I need to use a tool? Yes
Action: Replace Something From The Photo
Action Input: image/9bb5e03b.png, couch, desk
Observation: image/5737_replace-something_9bb5e03b_9bb5e03b.png
Thought: Do I need to use a tool? Yes
Action: Instruct Image Using Text
Action Input: image/5737_replace-something_9bb5e03b_9bb5e03b.png, make it like a water-color painting

The chain in the framework then captures this response and issues the next action, in which the image is processed with the instruction “make it like a water-color painting” by running the pipeline with the InstructPix2Pix model.
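This step roughly corresponds to the following (a sketch; the checkpoint and guidance settings are assumptions, not the repo's exact configuration):

```python
# Sketch of the InstructPix2Pix step (checkpoint and guidance settings are
# assumptions, not the repo's exact configuration).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")

image = Image.open("image/5737_replace-something_9bb5e03b_9bb5e03b.png").convert("RGB")
result = pipe("make it like a water-color painting",
              image=image,
              num_inference_steps=20,
              image_guidance_scale=1.5).images[0]
result.save("image/770e_pix2pix_5737_9bb5e03b.png")
```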

After this action is performed, the following text is sent to ChatGPT. (The final Thought / AI lines are the response from ChatGPT.)
In this final prompt, ChatGPT now responds “No” to the question “Do I need to use a tool?“.

prompt 3

Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. Visual ChatGPT is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Visual ChatGPT is able to process and understand large amounts of text and images. As a language model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different visual tasks. Each image will have a file name formed as "image/xxx.png", and Visual ChatGPT can invoke different tools to indirectly understand pictures. When talking about images, Visual ChatGPT is very strict to the file name and will never fabricate nonexistent files. When using tools to generate new image files, Visual ChatGPT is also known that the image may not be the same as the user's demand, and will use other visual question answering tools or description tools to observe the real image. Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather than faking the image content and image file name. It will remember to provide the file name from the last tool observation, if a new image is generated.

Human may provide new figures to Visual ChatGPT with a description. The description helps Visual ChatGPT to understand this image, but Visual ChatGPT should use tools to finish following tasks, rather than directly imagine from the description.

Overall, Visual ChatGPT is a powerful visual dialogue assistant tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics.

TOOLS:
------
Visual ChatGPT has access to the following tools:

> Get Photo Description: useful when you want to know what is inside the photo. receives image_path as input. The input to this tool should be a string, representing the image_path.
> Remove Something From The Photo: useful when you want to remove and object or something from the photo from its description or location. The input to this tool should be a comma seperated string of two, representing the image_path and the object need to be removed.
> Replace Something From The Photo: useful when you want to replace an object from the object description or location with another object from its description. The input to this tool should be a comma seperated string of three, representing the image_path, the object to be replaced, the object to be replaced with
> Instruct Image Using Text: useful when you want to the style of the image to be like the text. like: make it look like a painting. or make it like a robot. The input to this tool should be a comma seperated string of two, representing the image_path and the text.

To use a tool, please use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Get Photo Description, Remove Something From The Photo, Replace Something From The Photo, Instruct Image Using Text]
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

You are very strict to the filename correctness and will never fake a file name if it does not exist.
You will remember to provide the image file name loyally if it's provided in the last tool observation.

Begin!

Previous conversation history:
Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a couch in the corner. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say "Received".
AI: Received.

New input: replace the sofa in this image with a desk and then make it like a water-color painting
Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather than imagination.
The thoughts and observations are only visible for Visual ChatGPT, Visual ChatGPT should remember to repeat important information in the final response for Human.

Thought: Do I need to use a tool? Yes
Action: Replace Something From The Photo
Action Input: image/9bb5e03b.png, couch, desk
Observation: image/5737_replace-something_9bb5e03b_9bb5e03b.png
Thought: Do I need to use a tool? Yes
Action: Instruct Image Using Text
Action Input: image/5737_replace-something_9bb5e03b_9bb5e03b.png, make it like a water-color painting
Observation: image/770e_pix2pix_5737_9bb5e03b.png
Thought: Do I need to use a tool? No
AI: Here is the image you requested.
![image/770e_pix2pix_5737_9bb5e03b.png](image/770e_pix2pix_5737_9bb5e03b.png)
. . .

The ReAct chain is then completed and stopped. (Note that all text below “Do I need to use a tool? No” in the response is ignored by the framework.)
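This stopping behavior comes from the agent's output parser, which looks for an Action / Action Input pair to continue, or the “No” marker to finish. A simplified sketch of such parsing logic (illustrative only, not LangChain's exact code):

```python
# Simplified sketch of the ReAct output parsing that drives (and stops) the
# chain. Illustrative only; LangChain's actual parser differs in details.
import re

def parse_llm_output(text: str):
    """Return ("action", name, input) to continue, or ("finish", answer)."""
    if "Do I need to use a tool? No" in text:
        # Final answer: take the text after "AI:" as the response to the user
        answer = text.split("AI:", 1)[-1].strip()
        return ("finish", answer)
    match = re.search(r"Action: (.*?)[\n]+Action Input: (.*)", text, re.DOTALL)
    if match:
        return ("action", match.group(1).strip(), match.group(2).strip())
    return ("finish", text.strip())

kind, *rest = parse_llm_output(
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Replace Something From The Photo\n"
    "Action Input: image/9bb5e03b.png, couch, desk")
print(kind, rest)  # action ['Replace Something From The Photo', 'image/9bb5e03b.png, couch, desk']
```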

Challenges for image processing with LLMs …

As I mentioned above, today’s multimodal models will be useful for this work.
In the example above, we submitted a file name (not the image itself) in the instructions (prompts), but the reasoning for image processing can be accomplished more precisely and flexibly with multimodal models, in which the image itself as well as text can be passed as instructions.
Bubeck et al., 2023 (Microsoft Research) shows that a multimodal model (in this case, GPT-4 Vision) can draw with significantly higher quality than previous single-modal models in many cases – such as simple drawing in TikZ, drawing a photo using SVG, drawing a 3D model in HTML using JavaScript and three.js, etc.

Another interesting research work for image processing is Lian et al., 2023 (UC Berkeley, UCSF), which achieves more precise drawing with the help of LLM reasoning, compared with vanilla Stable Diffusion. For example, “3 apples arranged in an L-shape” is difficult for vanilla Stable Diffusion to draw precisely, but becomes possible when the LLM generates the layout of the apples during reasoning.

There are many research works exploring how to drive image processing with reasoning by LLMs.

 


Replies

  1. Thanks for the post. But I wasn’t able to use Visual ChatGPT with my free API key, although I did the same as shown on the GitHub page. Is API usage not allowed without a paid subscription?


    • Did you specify the following 3 visual foundation tools at least? (The example in this post uses these 3 models.)
      Of course, you can set all possible foundation tools to run, but it needs a large amount of GPU resources (especially GPU memory), such as a Tesla V100 or A100.

      python visual_chatgpt.py --load ImageCaptioning_cuda:0,ImageEditing_cuda:0,InstructPix2Pix_cuda:0


  2. Hi,
    I hope this message finds you well. I have been recently following your work and I’m quite intrigued by one particular aspect. I noticed that you have the ability to “see the prompt in the background” and I was hoping you could explain how you accomplish this.

    Could you kindly detail the process or steps you take to achieve this? Also, are there any specific tools or software applications that you use to make this possible?

    I believe understanding this process will significantly improve my workflow and efficiency, and your expertise would be of great help. Any tips or suggestions would be greatly appreciated.

    Thank you for taking the time to read this and I look forward to hearing back from you soon.


    • Hi,
      I’m sorry, but I’m not using any tools or utilities (such as protocol-capturing developer tools) for this tracking.
      Both TaskMatrix (Visual ChatGPT) and LangChain are open source, so we can track what prompts are really used in the background.
      For instance, the method predict() in langchain.chains.llm.LLMChain is called internally and finally _call() is invoked. (You can then trace these inputs and responses.)
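      For example, a minimal custom callback handler can print every prompt passed to the LLM (a sketch, assuming a recent LangChain version):

```python
# Minimal sketch: log every prompt sent to the LLM with a LangChain callback.
from langchain.callbacks.base import BaseCallbackHandler

class PromptLogger(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        for p in prompts:
            print("----- PROMPT -----")
            print(p)

# pass callbacks=[PromptLogger()] when running the agent or chain
```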
      I hope it helps you.

