📖 מה ה-Skill הזה כולל

מתי להשתמש

"RAG", "Retrieval augmented", "Custom AI", "AI on my data", "Knowledge base AI".

הוראות עבודה

1. Why RAG

Combine LLM's reasoning with your specific data. Better than fine-tuning for most use cases. No retraining needed.

2. Architecture

INGESTION (one-time / periodic):
Documents → Chunks → Embeddings → Vector DB

QUERY (per request):
Question → Embedding → Vector search (top K) →
Context + Question → LLM → Answer with sources

3. Components

א. Document Loaders

PDFs, docs, web pages, CSVs.
Tools: LangChain loaders, LlamaIndex.

ב. Chunking

Split docs into 512-1024 token chunks.
Overlap 10-20%.
Strategy: fixed / semantic / recursive.

ג. Embeddings

Convert text → vector.
Models: OpenAI text-embedding-3, Cohere multilingual (Hebrew!), Voyage AI.

ד. Vector DB

Stores embeddings.
Pinecone (managed), Weaviate, Qdrant, Chroma, pgvector.

ה. Retrieval

Top-K nearest neighbors.
Filter by metadata.

ו. Reranking (optional, recommended)

Re-score top 50 → top 5.
Tools: Cohere Rerank, Voyage AI.

ז. LLM

Claude / GPT-4 with retrieved context.

4. Implementation — Simple (Python)

from openai import OpenAI
from pinecone import Pinecone

# Setup
openai_client = OpenAI()
pc = Pinecone(api_key="...")
index = pc.Index("my-knowledge")

# Ingest
def ingest_doc(text, metadata):
    chunks = chunk_text(text, size=512, overlap=50)
    for i, chunk in enumerate(chunks):
        embedding = openai_client.embeddings.create(
            input=chunk, model="text-embedding-3-small"
        ).data[0].embedding
        index.upsert([(f"doc-{i}", embedding, {"text": chunk, **metadata})])

# Query
def query(question):
    q_embed = openai_client.embeddings.create(
        input=question, model="text-embedding-3-small"
    ).data[0].embedding
    results = index.query(vector=q_embed, top_k=5, include_metadata=True)
    
    context = "\n\n".join([r.metadata["text"] for r in results.matches])
    
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content

5. Chunking Strategies

Fixed Size

512 tokens, 50 overlap.
Simple, works most cases.

Semantic

Split at sentence/paragraph boundaries.
Better preservation of meaning.

Recursive

Try paragraphs, fall back to sentences if too big.

Document-specific

Headings, sections, tables.

6. Hybrid Search (Better Quality)

Combine:
- Semantic search (vector similarity)
- Keyword search (BM25)
- Weighted: 70% semantic + 30% keyword

Result: better recall, especially for proper nouns.

7. Reranking (Highly Recommended)

import cohere
co = cohere.Client("...")

# After initial retrieval (top 50)
reranked = co.rerank(
    query=question,
    documents=top_50_chunks,
    top_n=5,
    model="rerank-english-v3.0"  # or rerank-multilingual
)

# Use top 5 reranked

Improves answer quality significantly.

8. Evaluation

Retrieval Quality

Context Precision: were retrieved chunks relevant?
Context Recall: were all relevant chunks retrieved?

Answer Quality

Faithfulness: answer matches context (no hallucinations)?
Answer Relevance: answers the question?

Tools

RAGAS (Python framework).
TruLens.

9. Common Pitfalls

❌ Bad chunking — splits mid-sentence. ❌ No reranking — quality drops 20-40%. ❌ No evaluation — quality unknown. ❌ No source citations — hallucinations look real. ❌ Stale embeddings — outdated answers.

10. Cost (per 1K queries)

Small RAG

Embeddings (one-time): $5-20.
Vector DB (Pinecone): $70/m.
LLM (Haiku): $1-5 per 1K queries.
Reranking (Cohere): $1 per 1K queries.

11. Hebrew RAG

Embedding: Cohere multilingual works.
OCR for Hebrew PDFs: Google Vision / Claude Vision.
Reranking: Cohere multilingual.
LLM: Claude excellent for Hebrew.

12. Frameworks

LangChain — most popular, complex.
LlamaIndex — RAG-focused.
Haystack — production-grade.
Custom — full control.

13. אסיים בהמלצה.

פרומפט לדוגמה

Build RAG over 10K Israeli legal docs (Hebrew).

RAG quality bad. Diagnostic.

Vector DB choice: Pinecone vs Qdrant?

📥 התקנה בחצי דקה

1. הורד ופתח את קובץ ה-ZIP — תקבל תיקייה בשם rag-systems.

2. ב-Claude Code: העבר את התיקייה אל ~/.claude/skills/.
באפליקציה (Claude / Cowork): הגדרות ← Capabilities ← Skills ← העלאה.

3. בקש מ-Claude את מה שצריך בעברית — הוא יפעיל את ה-skill לבד כשזה רלוונטי.

Skill מערכות RAG — ארכיטקטורה ויישום ל-Claude