ZIP · no registration · usage license included in the file

📖 What this Skill includes

When to use

"multimodal AI", "vision AI", "GPT-4o vision", "Claude vision", "Gemini multimodal", "video understanding", "photo + text + voice".

Working instructions

1. Leading models, 2026

Model	Text	Image	Video	Audio	Vision quality
GPT-5 (OpenAI)	✅	✅ in/out	✅ in	✅ S2S (speech-to-speech)	Excellent
Claude Opus 4.7	✅	✅ in	❌ direct	❌	Excellent on charts/docs
Gemini 2.5 Pro	✅	✅ in/out (Imagen)	✅ native, 2M context	✅	Excellent on video
Gemini 2.5 Flash	✅	✅	✅	✅	Fast and cheap
Llama 3.2 Vision	✅	✅	❌	❌	open source
Pixtral Large (Mistral)	✅	✅	❌	❌	open weights

2. Capabilities Matrix

Capability	Best model 2026
Document understanding (charts, tables, PDFs)	Claude Opus, Gemini 2.5
Video summarization	Gemini 2.5 (1-2 hours native)
OCR + handwriting	Claude, GPT-5
Image generation	DALL-E 3, FLUX, Imagen 4
Realtime voice	OpenAI Realtime
Agentic with vision	Claude (computer use), GPT-5
Code from screenshot	Claude, GPT-5

3. Leading use cases

Document AI

PDF contract → extract terms + visualize.
Receipt → expense report.
Insurance claim form → structured.

Visual Q&A

"What is in the image?"
Product identification (e-commerce).
Damage assessment (insurance).

Video Understanding

Meeting recording → minutes + action items.
Surveillance event detection.
Sports highlights.
Tutorial → step-by-step doc.

Accessibility

Screen reader for blind users.
Sign language recognition (early-stage).

Agentic / Computer Use

Claude controls browser (computer use API).
Screenshot → understand UI → click.

4. Architecture Patterns

A. Single multimodal call

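import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {...}},  # placeholder: supply a real image source, e.g. base64 as in section 5
            {"type": "text", "text": "Extract invoice data"}
        ]
    }]
)
print(response.content[0].text)  # the model's text reply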

Simple, fast, high quality.

B. Pipeline (specialized → unified)

PDF → OCR (specialized) → text → LLM
Image → object detection → metadata → LLM
Video → frame extraction + Whisper → text + frames → LLM

Sometimes preferable: it gives more control over each stage.

C. Multi-step agent

Vision model identifies → tool call → action → vision verify

Example: agentic browsing.
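
The identify → tool call → action → verify loop can be sketched as below. This is a minimal illustration, not any vendor's computer-use API: `observe`, `plan`, `act`, and the `Action` type are hypothetical names, and the toy usage drives a fake UI instead of a real browser or model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str    # e.g. "click" or "type" (hypothetical action names)
    target: str  # hypothetical UI element identifier


def run_agent(
    observe: Callable[[], str],
    plan: Callable[[str, str], Optional[Action]],
    act: Callable[[Action], None],
    goal: str,
    max_steps: int = 10,
) -> bool:
    """Look at the screen, pick an action, execute it, then look again.

    `plan` returns None once it judges the goal reached (the verify step).
    """
    for _ in range(max_steps):
        screen = observe()           # vision model describes the current state
        action = plan(goal, screen)  # model chooses a tool call, or None if done
        if action is None:
            return True
        act(action)                  # execute the tool call
    return False                     # step budget exhausted without verification


# Toy usage: a fake UI with a single state variable.
state = {"page": "login"}
done = run_agent(
    observe=lambda: state["page"],
    plan=lambda goal, screen: None if screen == goal else Action("click", "submit"),
    act=lambda action: state.update(page="dashboard"),
    goal="dashboard",
)
```

The `max_steps` budget is a design choice worth keeping in any real agent: it bounds cost and stops loops when verification never succeeds.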

5. Document Processing: how to do it right

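# Convert PDF → images → analyze each page
import base64
import io

import anthropic
from pdf2image import convert_from_path

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def encode_to_base64(img):
    """Serialize a PIL image to a base64-encoded PNG string."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")

images = convert_from_path("doc.pdf", dpi=200)  # one PIL image per page
for img in images:
    base64_img = encode_to_base64(img)
    result = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64_img}},
                {"type": "text", "text": "Extract: invoice_number, amount, date, vendor"}
            ]
        }]
    )
    print(result.content[0].text)  # extracted fields for this page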

Tips:

200 DPI is the sweet spot (quality vs. cost).
The prompt must be precise: "Extract X. Output JSON. No prose."
Multi-page documents: send multiple images in one request, or loop page by page.
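
When the prompt asks for JSON only, the reply still needs checking before it goes into downstream code. The sketch below assumes the four fields from the extraction prompt above; `parse_extraction` and `REQUIRED_FIELDS` are illustrative names, not part of any SDK.

```python
import json

REQUIRED_FIELDS = ("invoice_number", "amount", "date", "vendor")
FENCE = "`" * 3  # Markdown code fence marker


def parse_extraction(reply: str) -> dict:
    """Parse a model reply that is supposed to be a bare JSON object."""
    text = reply.strip()
    # Models sometimes wrap JSON in a Markdown fence despite "No prose."
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```

A reply that fails here is a good candidate for a retry or for human review rather than silent acceptance.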

6. Video Understanding

Gemini native video

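import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")  # model id assumed from the table in section 1
video_file = genai.upload_file("meeting.mp4")
while video_file.state.name == "PROCESSING":  # uploads are processed asynchronously
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
response = model.generate_content([
    video_file,
    "Summarize this meeting. List action items."
])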

Supports up to about 2 hours of video in context, including questions about long videos.

Frame sampling pattern

Extract a frame every 1-5 seconds.
Audio → Whisper → transcript.
Combine timestamps + frames + transcript.
LLM summary.
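
The "combine timestamps + frames + transcript" step can be sketched in plain Python. The sketch assumes the transcript arrives as timed segments (as speech-to-text tools typically produce); `Segment`, `frame_timestamps`, and `align` are illustrative names, and actual frame extraction and transcription are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


def frame_timestamps(duration_s: float, every_s: float = 2.0) -> List[float]:
    """Timestamps at which to grab frames, one every `every_s` seconds."""
    count = int(duration_s // every_s) + 1
    return [round(i * every_s, 3) for i in range(count)]


def align(frames: List[float], segments: List[Segment]) -> List[dict]:
    """Attach to each frame timestamp the transcript text spoken at that moment."""
    timeline = []
    for t in frames:
        spoken = " ".join(s.text for s in segments if s.start <= t < s.end)
        timeline.append({"t": t, "speech": spoken})
    return timeline


# Toy usage: a 5-second clip sampled every 2 seconds.
segments = [Segment(0, 3, "Hello"), Segment(3, 6, "Action items")]
timeline = align(frame_timestamps(5, every_s=2), segments)
```

Each timeline entry, together with its frame image, can then be passed to the LLM for the final summary.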

7. Cost Considerations

Operation	Approximate cost
Image (1024×1024) Claude	~$0.001-0.003
Image GPT-5	~$0.001-0.005
Video minute Gemini	~$0.01-0.05
Audio minute Realtime	~$0.06-0.30
Generated image DALL-E/FLUX	$0.04
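
To turn the per-unit ranges above into a monthly budget, multiply by volume. A small sketch; `monthly_cost` is an illustrative helper, and the figures are the approximate ranges from the table, not quoted prices.

```python
from typing import Tuple


def monthly_cost(units: int, low: float, high: float) -> Tuple[float, float]:
    """Return the (low, high) monthly cost in USD for `units` operations."""
    return round(units * low, 2), round(units * high, 2)


# 10,000 Claude images per month at roughly $0.001-0.003 per image.
low, high = monthly_cost(10_000, 0.001, 0.003)
print(f"${low:.2f} to ${high:.2f} per month")  # prints "$10.00 to $30.00 per month"
```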

Optimization:

Resize images before sending.
Choose resolution per task (low for thumbnails, high for OCR).
Cache hashable inputs.
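
A sketch of two of these optimizations using only the standard library: computing a downscaled size that preserves aspect ratio, and caching model replies by content hash. `fit_within`, `cached_call`, and the `call_model` callable are hypothetical names; the actual resize would be done with an imaging library.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most `max_side`."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


_cache: Dict[str, str] = {}


def cached_call(image_bytes: bytes, prompt: str,
                call_model: Callable[[bytes, str], str]) -> str:
    """Call the model once per unique (image, prompt) pair."""
    key = hashlib.sha256(image_bytes + b"\x00" + prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(image_bytes, prompt)
    return _cache[key]
```

For example, `fit_within(4000, 3000, 1024)` gives `(1024, 768)`. The in-memory dict is only for illustration; a production cache would need persistence and eviction.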

8. Vision Limitations 2026

Counting many objects: still sometimes wrong.
Complex spatial reasoning: a weakness.
Tiny text: quality drops.
Occlusion: handled only partially.
Cultural / regional context: bias toward Western.

Always validate critical extractions with a human reviewer or an automated regression suite.
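
One cheap form of automated validation is a set of sanity rules that route suspicious extractions to a human. The sketch below assumes an invoice-style extraction with `amount` and `date` fields; `review_reasons` and the specific rules are illustrative, not a complete policy.

```python
from datetime import date
from typing import List


def review_reasons(extraction: dict) -> List[str]:
    """Return reasons this extraction should be checked by a human (empty if none)."""
    reasons = []
    amount = extraction.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount <= 0:
        reasons.append("amount missing or not a positive number")
    try:
        date.fromisoformat(str(extraction.get("date")))
    except ValueError:
        reasons.append("date is not an ISO date")
    return reasons
```

An empty list means the extraction passed these checks, not that it is correct; the rules only catch obvious failures.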

9. Multimodal Safety

Image moderation: always run an NSFW detector.
Real people in images → privacy concern.
Generated images → watermarking (C2PA).
Audio clones → consent + detection.
Cross-modal jailbreaks: images with hidden instructions.

10. Production Patterns

Pattern 1: Triage with vision

"Is this a receipt? Is the quality usable? What language?" → fast, cheap model → route.

Pattern 2: Vision + RAG

Detect product in image → search product DB → answer.
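
A minimal sketch of the detect → search → answer flow. The `detect` and `generate` callables stand in for a vision model and an LLM, and `PRODUCT_DB` is a hypothetical in-memory catalog in place of a real product database or vector index.

```python
from typing import Callable, Dict, Optional

# Hypothetical catalog; a real system would query a product DB or vector index.
PRODUCT_DB: Dict[str, Dict[str, str]] = {
    "espresso machine": {"sku": "SKU-123", "price": "899 ILS"},
}


def answer_from_image(
    image_bytes: bytes,
    question: str,
    detect: Callable[[bytes], Optional[str]],
    generate: Callable[[str], str],
) -> str:
    label = detect(image_bytes)  # vision step: name the product, or None
    record = PRODUCT_DB.get(label.lower()) if label else None  # retrieval step
    if record is None:
        return "Product not found in catalog."
    # Ground the answer in the retrieved record, not in the model's memory.
    prompt = f"Question: {question}\nProduct: {label}\nFacts: {record}"
    return generate(prompt)
```

Returning early when retrieval fails keeps the LLM from inventing product details it was never given.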

Pattern 3: Multimodal agents

Claude/GPT controls UI by viewing screen + planning actions.

Pattern 4: Synthesis

Text + image + audio combined → unified output (newsletter, report).

11. Israeli context

Israeli multimodal startups

Hour One — synthetic presenters.
D-ID — talking avatars from photo.
Lightricks — LTX video generation.
Tabnine — code completion (text only).
AI21 — Jamba multimodal updates.
ZenCity — video + text analysis for municipalities.

Hebrew multimodal

Text in images: Claude/GPT/Gemini handle Hebrew well.
Hebrew handwriting OCR: weak. Build a dedicated pipeline.
Hebrew speech in videos: Whisper is not great; Gemini is better.

12. Implementation Roadmap

Identify modality mix: image-heavy? video? voice?
Pick frontier model: Claude (docs), Gemini (video), GPT-5 (general).
Prototype: a single API call, tried on a few examples.
Build evals: golden multimodal set, 100+ cases.
Pipeline: specialized + unified when needed.
Cost optimize: resize, cache, model tier.
Production: monitoring, fallbacks, safety.
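
The "build evals" step can start as a small harness that scores extractions field by field against the golden set. A sketch under assumptions: golden cases are dicts with `file` and `expected` keys, and `extract` is whatever function wraps the model call; both are illustrative, not a fixed format.

```python
from typing import Callable, Dict, List


def field_accuracy(golden: List[dict], extract: Callable[[str], dict]) -> Dict[str, float]:
    """Fraction of golden cases in which each expected field was extracted exactly."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in golden:
        got = extract(case["file"])
        for field, want in case["expected"].items():
            total[field] = total.get(field, 0) + 1
            if got.get(field) == want:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


# Toy usage with a canned extractor instead of a real model.
golden = [
    {"file": "a.png", "expected": {"amount": 10, "vendor": "X"}},
    {"file": "b.png", "expected": {"amount": 20, "vendor": "Y"}},
]
canned = {"a.png": {"amount": 10, "vendor": "X"}, "b.png": {"amount": 21, "vendor": "Y"}}
scores = field_accuracy(golden, lambda path: canned[path])
```

Per-field scores show where a model or prompt change regressed, which a single overall accuracy number would hide.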

Required input

Field	Description
Modalities	image / video / audio / mix
Volume	files/month
Latency	real-time / batch
Languages	including Hebrew?
Compliance	privacy on faces, audio?

Expected output

Deliverable	Content
Model + provider	Claude / Gemini / GPT-5
Architecture	single-call / pipeline / agent
Cost model	per file, monthly
Eval set	multimodal golden cases
Safety + privacy	moderation + consent

Working rules

Output in Hebrew, technical terms in English
Prices per image/minute/1M tokens
2026: Gemini king of video, Claude king of docs

Red flags

Storing images with real people's faces without consent → privacy risk.
Critical spatial reasoning without human validation → errors.
Audio cloning without consent → prohibited.
No moderation on image inputs/outputs → risk.
Hebrew handwriting OCR without verification → low accuracy.

Important notes

Multimodal is becoming the default in 2026. Every new app considers it.
Gemini 2M context = a 2-hour video in a single call.
Claude Opus = champion of charts + docs.
Voice + vision agent (Claude computer use) = a revolution in 2025-26.
In Hebrew: handwriting + speech are relatively weak. Plan accordingly.

Example prompts

Design a multimodal app that summarizes Zoom meetings (video + audio + slides).

Compare Claude vs Gemini for Hebrew receipt OCR, 10K/month.

How do I build an agent that takes a screenshot of a website and proposes a redesign?

📥 Install in half a minute

1. Download and open the ZIP file. You will get a folder named multimodal-applications.
2. In Claude Code: move the folder to ~/.claude/skills/.
In the app (Claude / Cowork): Settings → Capabilities → Skills → Upload.
3. Ask Claude for what you need in Hebrew; it will activate the skill on its own when relevant.