When to use
"RAG", "Retrieval augmented", "Custom AI", "AI on my data", "Knowledge base AI".
Working instructions
1. Why RAG
Combine an LLM's reasoning with your own data. For most use cases this works better than fine-tuning, and updating the knowledge base requires no retraining.
2. Architecture
INGESTION (one-time / periodic):
Documents → Chunks → Embeddings → Vector DB
QUERY (per request):
Question → Embedding → Vector search (top K) →
Context + Question → LLM → Answer with sources
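The last step of the query flow (context + question → LLM → answer with sources) could be sketched as follows. This is only an illustration: the function name `build_prompt`, the chunk keys `text` and `source`, and the instruction wording are all assumptions, not part of any library.

```python
def build_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite it as [n]."""
    numbered = [
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(numbered)
    return (
        "Answer using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical retrieved chunks, as they might come back from a vector search
chunks = [
    {"text": "Refunds are accepted within 30 days.", "source": "policy.pdf"},
    {"text": "Shipping takes 3-5 business days.", "source": "faq.html"},
]
prompt = build_prompt("What is the refund window?", chunks)
print(prompt)
```

Numbering the chunks lets the citations in the answer be mapped back to source documents after the LLM call.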
3. Components
a. Document Loaders
- PDFs, docs, web pages, CSVs.
- Tools: LangChain loaders, LlamaIndex.
b. Chunking
- Split docs into 512-1024 token chunks.
- Overlap 10-20%.
- Strategy: fixed / semantic / recursive.
c. Embeddings
- Convert text → vector.
- Models: OpenAI text-embedding-3, Cohere multilingual (supports Hebrew), Voyage AI.
d. Vector DB
- Stores embeddings.
- Pinecone (managed), Weaviate, Qdrant, Chroma, pgvector.
e. Retrieval
- Top-K nearest neighbors.
- Filter by metadata.
f. Reranking (optional, recommended)
- Re-score the top 50 candidates and keep the top 5.
- Tools: Cohere Rerank, Voyage AI.
g. LLM
- Claude / GPT-4 with the retrieved context.
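To make the retrieval component concrete, here is a minimal in-memory sketch of top-K nearest-neighbor search with a metadata filter. A real vector DB does this with an approximate index; the record layout (`id`, `vector`, `metadata`) and the `where` parameter are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, records, k=5, where=None):
    """Return the k best (score, id) pairs among records matching `where`."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == val for key, val in where.items())
    ]
    scored = [(cosine(query_vec, r["vector"]), r["id"]) for r in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Toy 2-dimensional "embeddings" (hypothetical data)
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"lang": "he"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"lang": "en"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"lang": "he"}},
]
hits = top_k([1.0, 0.0], records, k=2, where={"lang": "he"})
print(hits)
```

Record "b" is close to the query but is excluded by the metadata filter before scoring, which is the behavior a filtered vector search is expected to have.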
4. Implementation — Simple (Python)
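from openai import OpenAI
from pinecone import Pinecone

# Setup (OpenAI() reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()
pc = Pinecone(api_key="...")
index = pc.Index("my-knowledge")

def chunk_text(text, size=512, overlap=50):
    # Placeholder splitter (an assumption; the original leaves chunk_text undefined).
    # It counts whitespace-separated words, which only approximates tokens.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# Ingest
def ingest_doc(doc_id, text, metadata):
    chunks = chunk_text(text, size=512, overlap=50)
    for i, chunk in enumerate(chunks):
        embedding = openai_client.embeddings.create(
            input=chunk, model="text-embedding-3-small"
        ).data[0].embedding
        # Prefix IDs with doc_id so chunks from different documents do not overwrite each other
        index.upsert([(f"{doc_id}-{i}", embedding, {"text": chunk, **metadata})])

# Query
def query(question):
    # Embed the question with the same model used at ingestion time
    q_embed = openai_client.embeddings.create(
        input=question, model="text-embedding-3-small"
    ).data[0].embedding
    results = index.query(vector=q_embed, top_k=5, include_metadata=True)
    context = "\n\n".join([r.metadata["text"] for r in results.matches])
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content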
5. Chunking Strategies
Fixed Size
- 512 tokens with a 50-token overlap.
- Simple; works in most cases.
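A fixed-size chunker with overlap can be sketched as below. It operates on a list of tokens; here the tokens come from a whitespace split, which is an assumption for the sake of a self-contained example (a real pipeline would use the embedding model's tokenizer). The name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Slide a window of `size` tokens, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for c in chunks:
    print(c)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears at least partially in both chunks.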
Semantic
- Split at sentence/paragraph boundaries.
- Better preservation of meaning.
Recursive
- Try paragraphs, fall back to sentences if too big.
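The recursive idea can be sketched roughly like this: split on the coarsest separator, pack adjacent pieces together while they fit, and recurse with a finer separator only for pieces that are still too large. The sketch measures length in characters rather than tokens and drops the separator at chunk boundaries; both are simplifying assumptions, and `split_recursive` is a hypothetical name rather than a library function.

```python
def split_recursive(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: hard cut as a last resort
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too big: fall back to the finer separators
            chunks.extend(split_recursive(part, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

text = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. "
    "This is the second one, and it runs longer."
)
chunks = split_recursive(text, max_chars=40)
for c in chunks:
    print(repr(c))
```

The short paragraph survives intact, while the long one is broken at its sentence boundary first and at word boundaries only where a sentence still exceeds the limit.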
Document-specific
- Headings, sections, tables.
6. Hybrid Search (Better Quality)
Combine:
- Semantic search (vector similarity)
- Keyword search (BM25)
Weight the two scores, for example 70% semantic + 30% keyword.
Result: better recall, especially for proper nouns and other exact terms that embeddings tend to blur.
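One way to combine the two signals is weighted score fusion, sketched below with a small BM25 implementation. The min-max normalization step, the helper names, and the semantic scores are assumptions for illustration; production systems often use a search engine's built-in BM25 and may prefer rank-based fusion instead.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the standard BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def hybrid(semantic, keyword, w_semantic=0.7, w_keyword=0.3):
    s, k = min_max(semantic), min_max(keyword)
    return [w_semantic * a + w_keyword * b for a, b in zip(s, k)]

docs = [
    "contract law overview for tenants",
    "ruling in cohen v levi on tenant rights",
    "general guide to renting an apartment",
]
semantic = [0.80, 0.79, 0.70]  # hypothetical vector-similarity scores
keyword = bm25_scores("cohen v levi", docs)
combined = hybrid(semantic, keyword)
best = max(range(len(docs)), key=combined.__getitem__)
print(docs[best])
```

In this toy data, semantic similarity alone narrowly prefers the first document, while the exact match on the proper nouns lifts the second document to the top once the keyword score is blended in.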
7. Reranking (Highly Recommended)
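import cohere

co = cohere.Client("...")

# After initial retrieval: question is the user query, top_50_chunks is a list of chunk texts
reranked = co.rerank(
    query=question,
    documents=top_50_chunks,
    top_n=5,
    model="rerank-english-v3.0",  # use "rerank-multilingual-v3.0" for Hebrew or mixed-language content
)

# Map the reranked results back to the original chunk texts
top_5 = [top_50_chunks[r.index] for r in reranked.results]

Reranking usually improves answer quality noticeably, because the reranker scores each chunk directly against the question instead of relying on embedding distance alone.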
8. Evaluation
Retrieval Quality
- Context Precision: were retrieved chunks relevant?
- Context Recall: were all relevant chunks retrieved?
Answer Quality
- Faithfulness: is every claim in the answer supported by the retrieved context (no hallucinations)?
- Answer Relevance: does the answer actually address the question?
Tools
- RAGAS (Python framework).
- TruLens.
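The two retrieval metrics above can be approximated with plain set arithmetic when a hand-labeled set of relevant chunk IDs is available. This is a simplified sketch under that assumption; frameworks such as RAGAS define these metrics differently (for example with LLM judgments and rank weighting), so the numbers are not directly comparable.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

# Hypothetical labels for one test question
retrieved = ["c1", "c4", "c7", "c9", "c2"]
relevant = {"c1", "c2", "c3"}

p = context_precision(retrieved, relevant)  # 2 of 5 retrieved are relevant
r = context_recall(retrieved, relevant)     # 2 of 3 relevant were retrieved
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these over a small fixed set of test questions gives a baseline to compare against whenever chunking, embeddings, or reranking change.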
9. Common Pitfalls
❌ Bad chunking: splits mid-sentence.
❌ No reranking: quality drops 20-40%.
❌ No evaluation: quality unknown.
❌ No source citations: hallucinations look real.
❌ Stale embeddings: outdated answers.
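For the stale-embeddings pitfall, one possible approach is to store a content hash per document at ingestion time and re-embed only what changed. The sketch below assumes the hashes are kept in a simple dictionary; the function names and storage layout are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, stored_hashes):
    """Compare current documents with the hashes saved at the last ingest.

    current_docs: {doc_id: text}, stored_hashes: {doc_id: hash}.
    Returns (ids to embed again, ids to delete from the vector DB).
    """
    to_embed = [
        doc_id for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in current_docs]
    return to_embed, to_delete

# Hypothetical state: "a" was edited, "b" is unchanged, "gone" was removed, "new" was added
stored = {
    "a": content_hash("version 1"),
    "b": content_hash("unchanged text"),
    "gone": content_hash("old document"),
}
current = {"a": "version 2", "b": "unchanged text", "new": "fresh document"}

to_embed, to_delete = plan_reindex(current, stored)
print(to_embed, to_delete)
```

Running such a check on a schedule keeps the periodic ingestion step cheap, since unchanged documents are skipped.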
10. Cost (approximate)
Small RAG
- Embeddings (one-time): $5-20.
- Vector DB (Pinecone): $70/month.
- LLM (Haiku): $1-5 per 1K queries.
- Reranking (Cohere): $1 per 1K queries.
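The figures above mix a fixed monthly fee with per-query costs, so a monthly estimate depends on traffic. A quick worked example, assuming a hypothetical volume of 50,000 queries per month and ignoring the one-time embedding cost:

```python
def monthly_cost(queries_per_month, llm_per_1k, rerank_per_1k=1.0, vector_db=70.0):
    """Fixed vector DB fee plus per-1K-query LLM and reranking costs, in USD."""
    per_1k = llm_per_1k + rerank_per_1k
    return vector_db + per_1k * queries_per_month / 1000

queries = 50_000  # hypothetical traffic
low = monthly_cost(queries, llm_per_1k=1.0)   # 70 + (1 + 1) * 50
high = monthly_cost(queries, llm_per_1k=5.0)  # 70 + (5 + 1) * 50
print(f"${low:.0f} to ${high:.0f} per month")
```

At low traffic the fixed vector DB fee dominates; at higher traffic the LLM cost per query becomes the main driver.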
11. Hebrew RAG
- Embedding: Cohere multilingual works.
- OCR for Hebrew PDFs: Google Vision / Claude Vision.
- Reranking: Cohere multilingual.
- LLM: Claude handles Hebrew very well.
12. Frameworks
- LangChain — most popular, complex.
- LlamaIndex — RAG-focused.
- Haystack — production-grade.
- Custom — full control.
13. Close with a recommendation.
Example prompts
Build RAG over 10K Israeli legal docs (Hebrew).
RAG quality bad. Diagnostic.
Vector DB choice: Pinecone vs Qdrant?
© 2026 AI Expert Pro | Version 1.0.0