
Vision Block

Analyze images with AI vision models for descriptions and extraction


The Vision block sends images to multimodal AI models (GPT-4o, Claude, Gemini) for visual analysis. It can describe images, extract text (OCR), identify objects, read charts, and answer questions about visual content.

Overview

Property    Value
Type        vision
Category    Core Block
Color       #8B5CF6 (Violet)

When to Use

  • Extract text from images or screenshots (OCR)
  • Describe or caption images
  • Analyze charts, graphs, or diagrams
  • Answer questions about visual content
  • Process receipts, invoices, or documents

Configuration

Setting        Type          Description
Image          File upload   Upload an image or provide a URL
Image URL      Short input   Direct URL to an image
Prompt         Long text     What to analyze (e.g., "Extract all text from this image")
Model          Dropdown      Vision-capable model (GPT-4o, Claude 3, etc.)
API Key        Password      Provider API key
Detail Level   Dropdown      low (fast) or high (detailed)
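For reference, these settings map onto a standard multimodal request: an OpenAI-style chat completions payload pairs the prompt text with an image part, and Detail Level corresponds to the `detail` field. The sketch below shows that assumed payload shape only; the function name and defaults are illustrative, not the block's actual implementation.

```python
# Sketch of the request payload a Vision block might build for an
# OpenAI-style chat completions endpoint (assumed shape; the block's
# real implementation may differ).

def build_vision_payload(image_url: str, prompt: str,
                         model: str = "gpt-4o",
                         detail: str = "high") -> dict:
    """Pair one text prompt with one image in a chat completion request."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # "detail" maps to the block's Detail Level setting
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

payload = build_vision_payload(
    "https://example.com/receipt.jpg",
    "Extract all text from this image",
)
```

The same payload works whether the URL is a public http(s) link or a base64 data URL.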

Outputs

Field     Type     Description
content   string   AI's analysis of the image
text      string   Extracted text (OCR mode)
objects   json     Detected objects/elements

Example: Receipt Data Extraction

Goal: Extract structured data from receipt photos.

Workflow:

[Starter: Upload Receipt] → [Vision] → [Function: Parse] → [Google Sheets]

Configuration:

  • Image: {{starter.file}}
  • Prompt:
    Extract structured data from this receipt. Return JSON with:
    - store_name
    - date
    - items (array of {name, quantity, price})
    - subtotal, tax, total
  • Model: gpt-4o
  • Detail Level: high

The Function block parses the JSON from {{vision.content}} and formats it for Google Sheets.
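A minimal sketch of that parsing step, assuming the model sometimes wraps its JSON in markdown fences or adds commentary around it (the function name and the validated fields are illustrative; the Function block's runtime and language depend on your setup):

```python
import json
import re

def parse_receipt(content: str) -> dict:
    """Pull the JSON object out of a Vision block's text response."""
    # Strip markdown code fences like ```json ... ``` if present
    cleaned = re.sub(r"```(?:json)?", "", content)
    # Grab the first {...} span in the remaining text
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in vision output")
    data = json.loads(match.group(0))
    # Ensure the fields requested in the prompt are at least present
    for key in ("store_name", "date", "items", "subtotal", "tax", "total"):
        data.setdefault(key, None)
    return data
```

Extracting the first `{...}` span before calling `json.loads` makes the step tolerant of models that preface their answer with a sentence instead of returning bare JSON.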

Tips

  • GPT-4o and Claude 3 are strong choices for vision tasks — both handle complex images well
  • High detail costs more tokens but is essential for small text and fine details
  • Combine with structured output in the prompt to get parseable JSON from image analysis
  • URL input is useful when images come from other blocks (Image Search, API responses)
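When an image lives on disk rather than at a URL, vision APIs generally also accept a base64 data URL in place of an http(s) link. A minimal encoding sketch (the helper name is an assumption; the data-URL format itself is standard):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable as an image URL."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/png"  # fall back when the type can't be guessed
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

The resulting string can be dropped into the Image URL setting wherever a plain URL is accepted.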