📖 What this Skill includes

When to use

"Production AI", "AI workflow", "Orchestration", "Multiple AI calls", "Reliability".

Working instructions

1. Why Orchestration

Single LLM call ≠ Production AI. Real apps need: chaining, fallbacks, caching, monitoring.

2. Common Patterns

Sequential Chain

Input → Translate → Classify → Generate → Output
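Here is a minimal Python sketch of the sequential chain. It assumes each step is an async function that takes the previous step's text and returns new text. `translate`, `classify`, `generate`, and `run_chain` are hypothetical names. The steps are stubs that stand in for real LLM calls, so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable

# A step takes the previous step's text and returns new text.
Step = Callable[[str], Awaitable[str]]


# Hypothetical stand-ins for LLM calls, stubbed so the sketch runs offline.
async def translate(text: str) -> str:
    return f"EN({text})"


async def classify(text: str) -> str:
    return f"[question] {text}"


async def generate(text: str) -> str:
    return f"answer for {text}"


async def run_chain(steps: list[Step], user_input: str) -> str:
    """Run steps in order, feeding each output into the next step."""
    result = user_input
    for step in steps:
        result = await step(result)
    return result


if __name__ == "__main__":
    print(asyncio.run(run_chain([translate, classify, generate], "hello")))
```

Running it prints `answer for [question] EN(hello)`. Each stub wraps the output of the step before it, which makes the ordering visible.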

Parallel Calls

Input → [Summarize | Categorize | Extract] → Combine → Output
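Here is a sketch of the parallel pattern using `asyncio.gather`. It assumes the three calls are independent of each other. `summarize`, `categorize`, `extract`, and `analyze` are hypothetical stubs. Each one sleeps 0.1 s to imitate network latency.

```python
import asyncio


# Hypothetical stand-ins for three independent LLM calls.
async def summarize(text: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return text[:20]


async def categorize(text: str) -> str:
    await asyncio.sleep(0.1)
    return "support" if "refund" in text.lower() else "general"


async def extract(text: str) -> list[str]:
    await asyncio.sleep(0.1)
    return [word for word in text.split() if word.isdigit()]


async def analyze(text: str) -> dict:
    # gather() runs the three calls concurrently, so total latency is
    # roughly that of the slowest call, not the sum of all three.
    summary, category, numbers = await asyncio.gather(
        summarize(text), categorize(text), extract(text)
    )
    # Combine step: merge the independent results into one payload.
    return {"summary": summary, "category": category, "numbers": numbers}


if __name__ == "__main__":
    print(asyncio.run(analyze("Refund order 1234 please, it arrived broken")))
```

Because the calls overlap, `analyze` takes about 0.1 s instead of 0.3 s. By default, if one call raises, `gather` propagates that exception. Pass `return_exceptions=True` to collect failures alongside the results instead.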

Conditional Routing

Input → Classify → If A: Path 1 / If B: Path 2
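One way to sketch conditional routing is with a classifier and a lookup table. The classifier returns a label, and the table maps each label to a handler, with a default path for anything unexpected. The classifier and handlers below are hypothetical stubs. The normalization step matters because real model output often carries stray whitespace or different casing.

```python
from typing import Callable


# Hypothetical handlers; in practice each would use a different prompt or model.
def handle_billing(text: str) -> str:
    return f"billing flow: {text}"


def handle_technical(text: str) -> str:
    return f"technical flow: {text}"


def handle_general(text: str) -> str:
    return f"general flow: {text}"


ROUTES: dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "technical": handle_technical,
}


def classify(text: str) -> str:
    # Stand-in for a cheap classifier call that returns messy free text.
    lowered = text.lower()
    if "invoice" in lowered or "refund" in lowered:
        return " Billing\n"
    if "error" in lowered:
        return "TECHNICAL"
    return "other"


def route(text: str) -> str:
    label = classify(text).strip().lower()  # normalize before lookup
    handler = ROUTES.get(label, handle_general)  # unknown labels -> default path
    return handler(text)
```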

Fallback

Try Claude Opus → If fail/timeout → Try Sonnet → If fail → Static response
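Here is a sketch of the fallback chain that uses `asyncio.wait_for` to enforce the timeout. The model names `slow-model`, `down-model`, and `backup-model` are hypothetical placeholders; in the text's terms they would be Opus, then Sonnet. `call_model` simulates one timeout and one outage so that both failure modes are exercised.

```python
import asyncio

STATIC_RESPONSE = "Sorry, we can't answer right now. Please try again shortly."


async def call_model(model: str, prompt: str) -> str:
    """Hypothetical client call with simulated failures."""
    if model == "slow-model":
        await asyncio.sleep(10)  # simulates a request that times out
    if model == "down-model":
        raise ConnectionError("provider unavailable")
    return f"{model}: {prompt}"


async def with_fallback(prompt: str, models: list[str], timeout: float) -> str:
    """Try each model in order; return a static response if all fail."""
    for model in models:
        try:
            return await asyncio.wait_for(call_model(model, prompt), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            # Real SDKs raise their own error types; add them here.
            continue
    return STATIC_RESPONSE


if __name__ == "__main__":
    chain = ["slow-model", "down-model", "backup-model"]
    print(asyncio.run(with_fallback("hi", chain, timeout=0.5)))
```

Running it prints `backup-model: hi` after roughly half a second. The first model times out, the second raises, and the third answers.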

Retry with Backoff

Try → Fail → Wait 1s → Try → Fail → Wait 5s → Try → Fail → Alert
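Here is a sketch of retry with backoff that mirrors the 1 s / 5 s schedule above. `TransientError` and `alert` are hypothetical. In practice you would retry on your SDK's rate-limit, timeout, and server-error exceptions. You would also often add random jitter to the delays so that many clients don't retry in lockstep.

```python
import asyncio
from typing import Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, 5xx, timeouts)."""


def alert(message: str) -> None:
    # Placeholder: page on-call, post to a chat channel, etc.
    print(f"ALERT: {message}")


async def retry_with_backoff(
    call: Callable[[], Awaitable[str]],
    delays: tuple[float, ...] = (1.0, 5.0),
) -> str:
    """Try, wait delays[0], try, wait delays[1], try, then alert and re-raise."""
    attempts = len(delays) + 1
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except TransientError as exc:
            if attempt == attempts:
                alert(f"giving up after {attempts} attempts: {exc}")
                raise
            await asyncio.sleep(delays[attempt - 1])
    raise RuntimeError("unreachable")
```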

3. Tools / Frameworks

Code-Based

LangChain — comprehensive.
LlamaIndex — RAG-focused.
Haystack — production-grade.
Custom — full control.

No-Code

Make.com / n8n — visual orchestration.
Zapier AI Actions.

LLM Routers

OpenRouter — multi-model fallback.
LiteLLM — unified API.
Portkey — gateway with caching.

4. Production Concerns

Latency

LLM calls take roughly 1-30 sec.
Stream when possible.
Parallelize independent calls.
Cache repeat calls.

Cost

Track spend per feature.
Use cheaper models when possible.
Cache aggressively.
Batch requests when results can arrive asynchronously.
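Here is a minimal per-feature cost ledger. It assumes you can read input and output token counts from each response. The model names and prices in `PRICES` are made-up placeholders, not any provider's actual rates.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens; not real provider rates.
PRICES = {
    "cheap-model": {"input": 1.00, "output": 5.00},
    "expensive-model": {"input": 15.00, "output": 75.00},
}


class CostTracker:
    """Accumulates estimated spend per product feature."""

    def __init__(self) -> None:
        self.by_feature: defaultdict[str, float] = defaultdict(float)

    def record(self, feature: str, model: str, input_tokens: int, output_tokens: int) -> float:
        price = PRICES[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.by_feature[feature] += cost
        return cost
```

For example, `CostTracker().record("summaries", "cheap-model", 2000, 500)` returns 0.0045 (USD) under these placeholder prices.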

Reliability

Retry transient failures.
Fall back to alternative models.
Static responses when all fail.
Circuit breakers.
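Here is a sketch of a circuit breaker. After a run of consecutive failures, it stops calling the provider for a cool-down period and raises `CircuitOpenError` immediately. Callers can then go straight to a fallback model or a static response. The thresholds and names are illustrative assumptions, and this version is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling a provider that is currently failing."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures and rejects calls for
    `reset_after` seconds; after that, calls go through again and the first
    failure reopens it (half-open)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: use the fallback path")
            self.opened_at = None  # half-open: let a trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```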

Observability

Log every LLM call.
Track latency, cost, errors.
Alert on anomalies.
Tools: LangSmith, Helicone, Langfuse, Portkey.
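Here is a sketch of logging every LLM call with a decorator that records the model, status, and latency. `CALL_LOG` is an in-memory stand-in for an observability backend such as the tools listed above. Those tools have their own SDK APIs, which this sketch does not reproduce. Token counts and cost could be added to each record from the provider's response.

```python
import functools
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("llm")
CALL_LOG: list[dict] = []  # in-memory stand-in for an observability backend


def observed(model: str) -> Callable:
    """Decorator that records model, status, and latency for every call."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # Runs on success and on exceptions, so failures are logged too.
                record = {
                    "model": model,
                    "fn": fn.__name__,
                    "status": status,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 1),
                }
                CALL_LOG.append(record)
                logger.info("llm_call %s", record)

        return wrapper

    return decorator
```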

5. Caching Strategy

Levels

Exact match: same input, return the cached output.
Semantic: similar input, return the cached output of the closest earlier match.
Prompt cache (Anthropic): a repeated prompt prefix, such as the system prompt, is cached by the provider.
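Here is a sketch of the exact-match level. An in-memory dict stands in for Redis, and each entry carries a TTL. Keys are a SHA-256 hash of the model name plus the prompt, so the same prompt sent to two different models doesn't share an entry. In Redis, the equivalent write is `SET key value EX seconds`.

```python
import hashlib
import json
import time
from typing import Optional


class ExactMatchCache:
    """In-memory stand-in for Redis; key = hash of (model, prompt)."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        entry = self._store.get(self.key(model, prompt))
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            return None  # expired entries count as misses
        return value

    def set(self, model: str, prompt: str, value: str, ttl: float) -> None:
        self._store[self.key(model, prompt)] = (time.monotonic() + ttl, value)
```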

Tools

Redis — exact match.
GPTCache — semantic.
Portkey — built-in.
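Here is a sketch of the semantic level. It assumes an `embed` function that maps text to a vector; any embedding model would do, so it is injected here rather than named. Lookups compare cosine similarity against stored queries and count as a hit only above a threshold. The linear scan is for illustration only; tools like GPTCache use a vector index instead. The 0.9 threshold is an arbitrary starting point that needs tuning: set it too low and the cache returns answers to questions that were merely similar.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a new query's embedding is close enough
    to a stored one. `embed` is any text -> vector function (hypothetical)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for stored_vec, answer in self.entries:  # linear scan for clarity
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def set(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```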

6. Sample Production Workflow

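import time

# `cache`, `call_llm`, `full_prompt`, `basic_prompt` and `log_call` are
# application helpers defined elsewhere; model IDs are shorthand placeholders.

async def process_request(user_input):
    start = time.perf_counter()

    # 1. Cache check: return early on a hit
    cached = await cache.get(user_input)
    if cached is not None:
        return cached

    # 2. Classify (cheap, fast model)
    category = await call_llm(
        model="haiku-4-5",
        prompt=f"Classify as 'simple' or 'complex'. Reply with one word: {user_input}",
        timeout=5,
    )

    # 3. Route based on category; normalize the raw model text first
    if category.strip().lower() == "complex":
        # Most capable (and most expensive) model
        response = await call_llm(
            model="opus-4",
            prompt=full_prompt(user_input),
            timeout=30,
            retry=3,
        )
    else:
        # Cheaper model for everything else
        response = await call_llm(
            model="sonnet-4-6",
            prompt=basic_prompt(user_input),
            timeout=10,
            retry=2,
        )

    # 4. Cache result for one hour
    await cache.set(user_input, response, ttl=3600)

    # 5. Log + observe; cost can be derived from the provider's token usage
    latency = time.perf_counter() - start
    log_call(user_input, response, latency=latency)

    return response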

7. Observability — Top Tools 2026

Tool	Strengths
LangSmith	LangChain-native
Helicone	Easy integration, dashboards
Langfuse	Open source
Portkey	Gateway + observability
PromptLayer	Prompt versioning

8. Cost Optimization

Strategies

Cheaper model first, expensive only when needed.
Prompt caching (Anthropic: cached input tokens are read at 10% of the base price, i.e. 90% off).
Semantic caching for repeated queries.
Batch API (50% off, async).
Quantization (for self-hosted open-source models).
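Here is a worked estimate of what prompt caching saves on input tokens. It assumes a long shared prefix, such as a system prompt, that is written to the cache once and then read on every later request before it expires. The 1.25× write and 0.1× read multipliers match Anthropic's documented pricing for 5-minute cache entries at the time of writing. The $3 per million token base price in the example is a placeholder.

```python
def input_cost(
    requests: int,
    cached_tokens: int,
    fresh_tokens: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) in USD for input tokens.

    Assumes the first request writes the cached prefix and every later
    request reads it before it expires.
    """
    if requests < 1:
        return 0.0, 0.0
    per_token = base_price_per_mtok / 1_000_000
    without = requests * (cached_tokens + fresh_tokens) * per_token
    with_cache = (
        cached_tokens * write_multiplier * per_token  # first request writes the prefix
        + (requests - 1) * cached_tokens * read_multiplier * per_token  # later reads
        + requests * fresh_tokens * per_token  # uncached part of every request
    )
    return without, with_cache
```

Consider 1,000 requests, each with a 10,000-token cached prefix and 500 fresh tokens, at a $3/MTok base price. `input_cost(1000, 10_000, 500, 3.0)` gives about $31.50 without caching versus about $4.53 with it.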

9. Error Patterns

Common Errors

Rate limit (429) → backoff.
Timeout → retry or fallback.
Bad output (JSON parse failure) → retry with a stricter prompt.
Hallucination → validate output.
API down → switch provider.
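Here is a sketch of the "retry with a stricter prompt" pattern for JSON output. `call_llm` is any prompt-to-text function and is hypothetical here. The required-keys check is a deliberately simple form of output validation; a schema library would be stricter.

```python
import json
from typing import Callable


def get_json(
    call_llm: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Parse model output as JSON; on failure, retry with a stricter prompt."""
    current_prompt = prompt
    last_error = ""
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            # Tighten the instructions and tell the model what went wrong.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({last_error}). "
                "Reply with ONLY a JSON object, no prose, no code fences."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```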

10. Security

API keys in env vars / secrets vault.
Input sanitization (to reduce prompt injection).
Output filtering (PII, harmful content).
Rate limit per user.
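Here is a sketch of per-user rate limiting with a token bucket. Each user can burst up to `capacity` requests, and the bucket refills at `rate` tokens per second. The clock is injectable so the logic can be tested. This in-memory version only works within one process; a multi-instance deployment would need shared state, for example in Redis.

```python
import time
from typing import Callable


class PerUserRateLimiter:
    """Token bucket per user: bursts of up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.buckets: dict[str, tuple[float, float]] = {}  # user -> (tokens, last_seen)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(user_id, (float(self.capacity), now))
        # Refill for the time elapsed since this user's last request.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[user_id] = (tokens - 1, now)
            return True
        self.buckets[user_id] = (tokens, now)
        return False
```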

11. Israel Specifics

Multi-region considerations (data residency).
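Hebrew prompts produce their own cache keys, so an exact-match cache will not reuse an answer cached for the English wording of the same question.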
Privacy — review data flows.
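One way to reduce needless cache misses on Hebrew input is to normalize text before hashing it into an exact-match key. What to normalize is an assumption on our part. This sketch strips niqqud and cantillation marks (the Hebrew combining marks in U+0591 to U+05C7) and collapses whitespace. As a result, vowelized and unvowelized spellings of the same word share a key. Whether that equivalence is safe depends on your content.

```python
import hashlib
import unicodedata


def normalize_hebrew(text: str) -> str:
    """Decompose to NFD, drop Hebrew nonspacing marks (niqqud and
    cantillation), collapse whitespace, and recompose to NFC.
    Accents on non-Hebrew letters are left untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )
    return unicodedata.normalize("NFC", " ".join(stripped.split()))


def cache_key(prompt: str) -> str:
    return hashlib.sha256(normalize_hebrew(prompt).encode("utf-8")).hexdigest()
```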

12. Close with a recommendation.

Example prompts

Build production AI orchestration. Stack?

AI app, latency 30 sec. Optimize.

Failover plan when Claude API down.

📥 Install in 30 seconds

1. Download and unzip the ZIP file; you'll get a folder named ai-orchestration.

2. In Claude Code: move the folder to ~/.claude/skills/.
In the app (Claude / Cowork): Settings → Capabilities → Skills → Upload.

3. Ask Claude for what you need in Hebrew; it will activate the skill on its own when relevant.

AI Orchestration Skill: AI systems orchestration for Claude