When to use
"Production AI", "AI workflow", "Orchestration", "Multiple AI calls", "Reliability".
Instructions
1. Why Orchestration
A single LLM call is not production AI. Real apps also need chaining, fallbacks, caching, and monitoring.
2. Common Patterns
Sequential Chain
Input → Translate → Classify → Generate → Output
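A sequential chain like the one above can be sketched as a list of steps where each output feeds the next step. This is a minimal sketch: `run_chain` and the three step functions are hypothetical stand-ins for LLM-backed calls, not part of any library.

```python
from typing import Callable

Step = Callable[[str], str]

def run_chain(text: str, steps: list[Step]) -> str:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        text = step(text)
    return text

# Hypothetical stand-ins for LLM-backed steps.
def translate(text: str) -> str:
    return text.lower()

def classify(text: str) -> str:
    label = "billing" if "invoice" in text else "general"
    return f"[{label}] {text}"

def generate(text: str) -> str:
    return f"Reply to {text}"

result = run_chain("Where is my INVOICE?", [translate, classify, generate])
print(result)
```

Keeping the steps in a plain list makes it easy to reorder or remove a stage without touching the runner.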
Parallel Calls
Input → [Summarize | Categorize | Extract] → Combine → Output
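Independent calls can run concurrently with `asyncio.gather`, which returns results in argument order. The sketch below assumes three hypothetical async functions (`summarize`, `categorize`, `extract`) in place of real model calls.

```python
import asyncio

# Hypothetical async stand-ins for three independent LLM calls.
async def summarize(text: str) -> str:
    await asyncio.sleep(0.01)
    return text[:20]

async def categorize(text: str) -> str:
    await asyncio.sleep(0.01)
    return "support"

async def extract(text: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [word for word in text.split() if word.istitle()]

async def analyze(text: str) -> dict:
    # gather() runs the three coroutines concurrently and keeps argument order.
    summary, category, entities = await asyncio.gather(
        summarize(text), categorize(text), extract(text)
    )
    # Combine step: merge the partial results into one record.
    return {"summary": summary, "category": category, "entities": entities}

if __name__ == "__main__":
    print(asyncio.run(analyze("Dana from Haifa asked about refunds")))
```

Total latency is then close to the slowest call rather than the sum of all three.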
Conditional Routing
Input → Classify → If A: Path 1 / If B: Path 2
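Conditional routing can be sketched as a lookup table from label to handler, with a default path for labels the classifier was not expected to return. All names here (`ROUTES`, `classify`, the handlers) are illustrative assumptions; the keyword classifier stands in for a cheap model.

```python
from typing import Callable

# Hypothetical handlers; in practice each would use a different prompt or model.
def handle_billing(text: str) -> str:
    return "billing: " + text

def handle_technical(text: str) -> str:
    return "technical: " + text

def handle_default(text: str) -> str:
    return "general: " + text

ROUTES: dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "technical": handle_technical,
}

def classify(text: str) -> str:
    """Stand-in for a cheap classifier model."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "error" in lowered:
        return "technical"
    return "other"

def route(text: str) -> str:
    # Normalize the label and fall back to a default path for unknown labels.
    label = classify(text).strip().lower()
    return ROUTES.get(label, handle_default)(text)
```

The default handler matters because a model-based classifier can return labels outside the expected set.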
Fallback
Try Claude Opus → If fail/timeout → Try Sonnet → If fail → Static response
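The fallback chain above can be sketched as a loop over providers, each wrapped in a timeout, ending in a static response. `with_fallback`, `primary`, and `secondary` are hypothetical names; real provider clients would replace the two stand-in coroutines.

```python
import asyncio
from typing import Awaitable, Callable

STATIC_RESPONSE = "Sorry, the service is busy. Please try again later."

async def with_fallback(
    prompt: str,
    providers: list[Callable[[str], Awaitable[str]]],
    timeout: float = 10.0,
) -> str:
    """Try each provider in order; return a static response if all fail."""
    for call in providers:
        try:
            return await asyncio.wait_for(call(prompt), timeout=timeout)
        except Exception:
            # Covers timeouts and provider errors; a real system would log here.
            continue
    return STATIC_RESPONSE

# Hypothetical providers for illustration.
async def primary(prompt: str) -> str:
    raise ConnectionError("primary model unavailable")

async def secondary(prompt: str) -> str:
    return f"secondary answer to: {prompt}"

if __name__ == "__main__":
    print(asyncio.run(with_fallback("hello", [primary, secondary])))
```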
Retry with Backoff
Try → Fail → Wait 1s → Try → Fail → Wait 5s → Try → Fail → Alert
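The retry schedule above (wait 1s, then 5s, then alert) can be sketched as follows. `retry_with_backoff` is a hypothetical helper; the `sleep` and `on_give_up` parameters are assumptions added so the waits and the alert can be swapped out, for example in tests.

```python
import time
from typing import Callable

def retry_with_backoff(
    call: Callable[[], str],
    delays: tuple[float, ...] = (1.0, 5.0),
    sleep: Callable[[float], None] = time.sleep,
    on_give_up: Callable[[Exception], None] = lambda exc: None,
) -> str:
    """Call once, then once more after each delay; alert and re-raise at the end."""
    for delay in (*delays, None):
        try:
            return call()
        except Exception as exc:
            if delay is None:
                on_give_up(exc)  # final failure: trigger the alert
                raise
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In production it is common to retry only transient errors (rate limits, timeouts) and to add random jitter to the delays; both are left out here to keep the sketch short.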
3. Tools / Frameworks
Code-Based
- LangChain — comprehensive.
- LlamaIndex — RAG-focused.
- Haystack — production-grade.
- Custom — full control.
No-Code
- Make.com / n8n — visual orchestration.
- Zapier AI Actions.
LLM Routers
- OpenRouter — multi-model fallback.
- LiteLLM — unified API.
- Portkey — gateway with caching.
4. Production Concerns
Latency
- LLM calls typically take 1-30 sec.
- Stream when possible.
- Parallelize independent calls.
- Cache repeat calls.
Cost
- Track per-feature.
- Use cheaper models when possible.
- Cache aggressively.
- Batch when async.
Reliability
- Retry transient failures.
- Fall back to alternative models.
- Static responses when all fail.
- Circuit breakers.
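A circuit breaker stops calling a failing dependency for a cool-down period instead of retrying it on every request. The class below is a minimal sketch; the thresholds and the injectable `clock` parameter are assumptions for illustration.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str], fallback: str) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback      # open: skip the call entirely
            self.opened_at = None    # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0            # success closes the circuit
        return result
```

While the circuit is open, requests get the fallback immediately, which protects latency and avoids adding load to a provider that is already struggling.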
Observability
- Log every LLM call.
- Track latency, cost, errors.
- Alert on anomalies.
- Tools: LangSmith, Helicone, Langfuse, Portkey.
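Logging every call with latency, cost, and success can be sketched with a thin wrapper. Everything here is an assumption for illustration: the `observed_call` helper, the model names, and the prices in `PRICE_PER_1K` are placeholders, and the wrapped `call` is assumed to return the text together with a token count.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("llm")

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICE_PER_1K = {"small-model": 0.001, "large-model": 0.015}

def observed_call(model: str, prompt: str,
                  call: Callable[[str, str], tuple[str, int]]) -> tuple[str, dict]:
    """Run one LLM call and return (text, metrics). `call` returns (text, tokens)."""
    start = time.perf_counter()
    metrics = {"model": model, "ok": False, "tokens": 0, "cost_usd": 0.0}
    try:
        text, tokens = call(model, prompt)
        metrics.update(ok=True, tokens=tokens,
                       cost_usd=tokens / 1000 * PRICE_PER_1K[model])
        return text, metrics
    finally:
        # Runs on success and on failure, so errors are logged with latency too.
        metrics["latency_s"] = time.perf_counter() - start
        logger.info("llm_call %s", metrics)
```

The hosted tools listed above collect the same kind of record; a wrapper like this is mainly useful when building a custom stack.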
5. Caching Strategy
Levels
- Exact match — same input, return cached output.
- Semantic — similar input, return the output cached for the closest earlier input.
- Prompt cache (Anthropic) — provider-side cache of a repeated prompt prefix, such as the system prompt.
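The semantic level can be sketched with a similarity threshold over stored prompts. This is a toy illustration: `embed` uses word counts so the example runs without dependencies, whereas a real system would call an embedding model and a vector index. The 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        # Find the stored prompt most similar to the query.
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def set(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))
```

A threshold that is too low returns answers to the wrong question, so semantic caches usually need tuning against real traffic.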
Tools
- Redis — exact match.
- GPTCache — semantic.
- Portkey — built-in.
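Exact-match caching comes down to a stable key plus a time-to-live. The class below is an in-memory sketch standing in for a store such as Redis; `ExactMatchCache` and its key scheme (SHA-256 of model plus trimmed prompt) are assumptions for illustration.

```python
import hashlib
import time
from typing import Callable, Optional

class ExactMatchCache:
    """In-memory stand-in for a key-value store, with a per-entry TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._store: dict[str, tuple[float, str]] = {}
        self._clock = clock

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Include the model so two models never share an entry for one prompt.
        raw = f"{model}\n{prompt.strip()}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        entry = self._store.get(self.key(model, prompt))
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired
        return entry[1]

    def set(self, model: str, prompt: str, response: str, ttl: float = 3600) -> None:
        self._store[self.key(model, prompt)] = (self._clock() + ttl, response)
```

Hashing keeps keys short and avoids storing raw user text as the key itself.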
6. Sample Production Workflow
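import time

# Assumed to be defined elsewhere: cache, call_llm, full_prompt, basic_prompt, log_call.
async def process_request(user_input):
    start = time.monotonic()

    # 1. Cache check (compare with None so an empty cached response still counts as a hit)
    cached = await cache.get(user_input)
    if cached is not None:
        return cached

    # 2. Classify (cheap model)
    category = await call_llm(
        model="haiku-4-5",
        prompt=f"Classify: {user_input}",
        timeout=5,
    )

    # 3. Route based on category (normalize the label before comparing)
    if category.strip().lower() == "complex":
        # Use expensive model
        response = await call_llm(
            model="opus-4",
            prompt=full_prompt(user_input),
            timeout=30,
            retry=3,
        )
    else:
        # Use cheap model
        response = await call_llm(
            model="sonnet-4-6",
            prompt=basic_prompt(user_input),
            timeout=10,
            retry=2,
        )

    # 4. Cache result
    await cache.set(user_input, response, ttl=3600)

    # 5. Log + observe
    latency = time.monotonic() - start
    cost = None  # placeholder: derive from the provider's reported token usage
    log_call(user_input, response, latency, cost)
    return response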
7. Observability — Top Tools 2026
| Tool | Strengths |
|---|---|
| LangSmith | LangChain-native |
| Helicone | Easy integration, dashboards |
| Langfuse | Open source |
| Portkey | Gateway + observability |
| PromptLayer | Prompt versioning |
8. Cost Optimization
Strategies
- Cheaper model first, expensive only when needed.
- Prompt caching (Anthropic: cached input reads cost about 90% less; cache writes cost extra).
- Semantic caching for repeated queries.
- Batch API (50% off, async).
- Quantization (for self-hosted open-source models).
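The "cheaper model first" strategy can be sketched as a two-tier cascade that escalates only when the cheap answer fails a quality check. `cascade` and `is_good_enough` are hypothetical names; the check itself (length, schema validation, a confidence score) depends on the application.

```python
from typing import Callable

def cascade(prompt: str,
            cheap: Callable[[str], str],
            expensive: Callable[[str], str],
            is_good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Return (answer, tier). Escalate only when the cheap answer fails the check."""
    answer = cheap(prompt)
    if is_good_enough(answer):
        return answer, "cheap"
    return expensive(prompt), "expensive"

# Hypothetical stand-ins for two model tiers.
def cheap_model(prompt: str) -> str:
    return "" if "prove" in prompt else "short answer"

def expensive_model(prompt: str) -> str:
    return "detailed answer"

def non_empty(answer: str) -> bool:
    return bool(answer.strip())

print(cascade("What is 2 + 2?", cheap_model, expensive_model, non_empty))
print(cascade("Please prove this theorem", cheap_model, expensive_model, non_empty))
```

Returning the tier alongside the answer makes it possible to track how often escalation happens, which feeds directly into per-feature cost tracking.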
9. Error Patterns
Common Errors
- Rate limit (429) → backoff.
- Timeout → retry or fallback.
- Bad output (JSON parse fail) → retry with a stricter prompt.
- Hallucination → validate output.
- API down → switch provider.
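The "bad output" case can be sketched as parse, validate, and retry with a stricter instruction appended. `call_for_json` and `STRICT_SUFFIX` are hypothetical; `call` stands for whatever function sends a prompt to the model and returns raw text.

```python
import json
from typing import Callable, Optional

STRICT_SUFFIX = "\nReturn only a valid JSON object. No prose, no code fences."

def call_for_json(prompt: str, call: Callable[[str], str],
                  max_attempts: int = 2) -> dict:
    """Parse the model output as JSON; on failure retry with a stricter prompt."""
    current = prompt
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call(current)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                return data
            last_error = ValueError("expected a JSON object")
        except json.JSONDecodeError as exc:
            last_error = exc
        # Next attempt: same task, stricter output instruction.
        current = prompt + STRICT_SUFFIX
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Parsing only checks the shape of the output. Guarding against hallucinated values still needs a separate validation step against trusted data.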
10. Security
- API keys in env vars / secrets vault.
- Input sanitization (prompt injection).
- Output filtering (PII, harmful).
- Rate limit per user.
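Per-user rate limiting can be sketched as a sliding window over recent request timestamps. `PerUserRateLimiter` is a hypothetical in-memory class; a multi-instance deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable

class PerUserRateLimiter:
    """Sliding window: at most `limit` requests per user per `window` seconds."""

    def __init__(self, limit: int = 5, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self.hits[user_id]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Checking `allow(user_id)` before any model call caps both abuse and the cost a single user can generate.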
11. Israel Specifics
- Multi-region considerations (data residency).
- Hebrew prompts in caching = unique cache keys.
- Privacy — review data flows.
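One reading of the cache-key note above is that Hebrew text which looks the same to a user, with or without niqqud or with different spacing, produces different keys unless it is normalized first. The sketch below shows one possible normalization; `normalize_for_cache` and `cache_key` are hypothetical helpers, and stripping vowel marks is an assumption that may not suit every application.

```python
import hashlib
import unicodedata

def normalize_for_cache(text: str) -> str:
    """Normalize text so visually equivalent prompts share one cache key."""
    text = unicodedata.normalize("NFKC", text)
    # Drop combining marks (Unicode category Mn) that remain after
    # normalization; this covers Hebrew niqqud and cantillation marks.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Collapse whitespace runs and fold case for any Latin text in the prompt.
    return " ".join(text.split()).casefold()

def cache_key(text: str) -> str:
    return hashlib.sha256(normalize_for_cache(text).encode("utf-8")).hexdigest()

with_niqqud = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # "shalom" with niqqud
plain = "\u05e9\u05dc\u05d5\u05dd"                          # "shalom" without niqqud
print(cache_key(with_niqqud) == cache_key(plain))
```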
12. I will close with a recommendation.
Example prompts
Build production AI orchestration. Stack?
AI app, latency 30 sec. Optimize.
Failover plan when Claude API down.
© 2026 AI Expert Pro | Version 1.0.0