frissonitte's RAG Project Assistant

A Retrieval-Augmented Generation system that answers questions about personal ML and automation projects. It combines AST-extracted code structure with hand-curated documentation in a hybrid knowledge base, and adds similarity-threshold gating and per-project metadata routing.

Overview

When building portfolio projects, developers often document them in disparate locations: code repositories, README files, blog posts, and internal journals. To unify these knowledge sources and provide a single conversational interface for technical screening, I built the RAG Project Assistant.

This project is a Retrieval-Augmented Generation (RAG) system deployed as a containerized FastAPI service. It answers questions about my engineering projects by retrieving context from a hybrid knowledge base that contains both AST-extracted code structure and hand-curated Markdown documentation.


Architecture & Workflow

The architecture is designed to prevent two common failure modes in RAG systems: hallucination on out-of-scope queries (e.g., asking the bot about pizza recipes) and cross-project context contamination (e.g., mixing up the database synchronization logic of two different systems).

graph TD
    Query[User Query] --> Detector[Project Detector Router]
    Detector -->|Detects Specific Project| MetaFilter[ChromaDB Metadata Filter]
    Detector -->|No Specific Project| FullCorpus[Full Corpus Search]

    MetaFilter --> RetVal[Retrieve Chunks]
    FullCorpus --> RetVal

    RetVal --> Gate[L2 Similarity Gate L2 < 1.40]
    Gate -->|Within Threshold| LLM[LLM Generator: Llama 3.3 70B]
    Gate -->|Exceeds Threshold| Refuse[Reject Out-of-Scope Prompt]

    LLM --> Answer[Grounded Response with Sources]
    Refuse --> RefuseMsg[Polite Refusal: Out of Scope]

Results at a Glance

| Phase | Component | Engineering Approach | Outcome / Latency |
|-------|-----------|----------------------|-------------------|
| 1 | Hybrid Knowledge Ingestion | AST Code Extraction + Markdown Documentation | Clean, context-rich chunks |
| 2 | Guardrails & Gating | ChromaDB L2 Similarity Gate ($L_2 < 1.40$) | Zero hallucination on out-of-scope queries |
| 3 | Query Routing | Keyword-based project detection + Metadata filters | Correct context mapping (No contamination) |
| 4 | Deployment | FastAPI on HF Spaces (Docker) + Groq Llama 3.3 70B | Sub-second response times |

Phase 1: AST Code Ingestion & Hybrid Representation

To give the LLM concrete knowledge of code architectures without overwhelming the context window, the ingestion pipeline separates code structure from raw prose.

I wrote a parser that uses Python’s ast module to extract class signatures, function definitions, docstrings, and execution flows. These code signatures are stored alongside high-level project write-ups.

import ast

class CodeStructureExtractor(ast.NodeVisitor):
    def __init__(self):
        self.classes = []
        self.functions = []

    def visit_ClassDef(self, node):
        # Collect the names of methods defined directly in the class body
        methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
        self.classes.append({
            "name": node.name,
            "methods": methods,
            "docstring": ast.get_docstring(node)
        })
        # Descend into the class body; public methods are therefore
        # also recorded in self.functions by visit_FunctionDef
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        if not node.name.startswith('_'):  # Extract public interfaces
            self.functions.append({
                "name": node.name,
                "args": [arg.arg for arg in node.args.args],  # positional args only
                "docstring": ast.get_docstring(node)
            })
        # Continue into nested definitions
        self.generic_visit(node)
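
The post does not show how the extracted signatures are serialized before embedding, so the following is only a minimal sketch of one way to turn a module's structure into a compact text chunk. The `render_chunk` function, its output format, and the `SAMPLE` module are assumptions made for illustration.

```python
import ast

def render_chunk(source: str, module_name: str) -> str:
    """Render a module's top-level public structure as a short text chunk."""
    tree = ast.parse(source)
    lines = [f"module: {module_name}"]
    for node in tree.body:  # top-level definitions only
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            doc = ast.get_docstring(node) or "no docstring"
            lines.append(f"class {node.name} [methods: {', '.join(methods)}]: {doc}")
        elif isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            args = ", ".join(a.arg for a in node.args.args)
            doc = ast.get_docstring(node) or "no docstring"
            lines.append(f"def {node.name}({args}): {doc}")
    return "\n".join(lines)

# Hypothetical module source, used only to exercise the sketch
SAMPLE = '''
class Syncer:
    """Keeps two stores in sync."""
    def push(self): pass
    def pull(self): pass

def run(config, dry_run):
    """Entry point."""

def _helper(): pass
'''

chunk = render_chunk(SAMPLE, "sync.py")
print(chunk)
```

A chunk like this stays small enough to embed alongside the prose write-up while still naming the classes and functions a question might refer to.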

Phase 2: Similarity-Threshold Gating

Standard RAG pipelines return the top-$k$ nearest neighbors regardless of how far they are from the query vector. If a user asks something completely unrelated to the portfolio, the vector database still returns the “closest” chunks, and the LLM is left to synthesize an answer from irrelevant context, which invites hallucination.

To prevent this, I calibrated a similarity-threshold gate using L2 distance over the embedding space.

def retrieve_and_gate(query_text, collection, limit=3, threshold=1.40):
    # Retrieve closest matches
    results = collection.query(
        query_texts=[query_text],
        n_results=limit,
        include=["documents", "metadatas", "distances"]
    )

    # Extract distances
    distances = results["distances"][0]
    valid_documents = []
    sources = []

    for doc, meta, dist in zip(results["documents"][0], results["metadatas"][0], distances):
        # L2 Distance gate check
        if dist < threshold:
            valid_documents.append(doc)
            sources.append(meta.get("source", "Unknown"))

    if not valid_documents:
        # Exceeds threshold: Trigger low-confidence refusal flow
        return {
            "answer": "I'm sorry, but that query appears to be out of the scope of Emirhan's projects. Ask me about WBC Analyzer, Kinematic Pipeline, Popcorn Wagon, or his backend engineering experience.",
            "sources": [],
            "low_confidence": True
        }

    # Deduplicate sources while preserving retrieval rank order
    return {"documents": valid_documents, "sources": list(dict.fromkeys(sources)), "low_confidence": False}
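
The post does not describe how the 1.40 cutoff was calibrated. One plausible approach, sketched below, is to record the top-1 distance for a set of known in-scope questions and a set of known out-of-scope questions, then place the threshold between the two groups. The `calibrate_threshold` helper is hypothetical, and the distance values are placeholders rather than measurements from this system.

```python
def calibrate_threshold(in_scope, out_of_scope):
    """Pick a cutoff midway between the worst in-scope and best out-of-scope distance."""
    worst_in = max(in_scope)        # farthest top-1 match among valid questions
    best_out = min(out_of_scope)    # closest top-1 match among unrelated questions
    if worst_in >= best_out:
        raise ValueError("distance distributions overlap; no clean cutoff exists")
    return (worst_in + best_out) / 2

# Placeholder top-1 distances (illustrative only, not real measurements)
in_scope_distances = [0.5, 0.75, 1.0]
out_of_scope_distances = [1.5, 1.75, 2.0]

threshold = calibrate_threshold(in_scope_distances, out_of_scope_distances)
print(threshold)
```

If the two groups overlap, no single cutoff separates them, and the sketch raises an error instead of silently picking a value that would either refuse valid questions or admit unrelated ones.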

Phase 3: Project-Filtering Metadata Routing

A question such as “How is the database synchronized?” could apply to more than one project (e.g., zidorun or Platech Coating). Plain semantic search returns a mix of chunks from both codebases, which contaminates the answer.

The RAG Assistant resolves this by running a lightweight keyword-based project router before querying ChromaDB.

  1. Routing: If a project is explicitly or implicitly identified in the query, the retrieval applies a metadata filter on project_id.
  2. Fallback: If the filtered query yields no results, or if no project is detected, the system queries the entire corpus to allow for cross-project comparison queries (e.g., “Compare Popcorn Wagon with WBC Analyzer”).
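
The routing step above can be sketched roughly as follows. The keyword table, the `project_id` values, and both helper functions are assumptions for illustration; the only library detail relied on is that ChromaDB's `collection.query` accepts a `where` metadata filter. Skipping the filter when several projects are mentioned is one guess at how comparison queries reach the full corpus.

```python
import re

# Hypothetical alias table: project_id -> keywords that identify the project
PROJECT_KEYWORDS = {
    "wbc_analyzer": ["wbc analyzer", "wbc"],
    "popcorn_wagon": ["popcorn wagon"],
    "zidorun": ["zidorun"],
}

def detect_projects(query: str) -> list[str]:
    """Return every project whose keyword appears as a whole word or phrase."""
    text = query.lower()
    hits = []
    for project_id, keywords in PROJECT_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords):
            hits.append(project_id)
    return hits

def build_query_kwargs(query: str, limit: int = 3) -> dict:
    """Build arguments for collection.query, filtering only on a single match."""
    kwargs = {"query_texts": [query], "n_results": limit}
    projects = detect_projects(query)
    if len(projects) == 1:
        # Exactly one project named: restrict retrieval to its chunks
        kwargs["where"] = {"project_id": projects[0]}
    # Zero or several projects: no filter, so the full corpus is searched
    return kwargs

print(build_query_kwargs("How does zidorun sync its database?"))
print(build_query_kwargs("Compare Popcorn Wagon with WBC Analyzer"))
```

The empty-result fallback from step 2 would then be a second call with the `where` key removed whenever the filtered query returns no documents.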

Phase 4: API Development & Deployment

The system is deployed as a Dockerized API service using FastAPI and hosted on Hugging Face Spaces.

Inference, rate limiting, and frontend:

  • Groq Inference: Migrated from local execution (Ollama) to Groq Cloud (Llama 3.3 70B) to achieve sub-second generation speeds.
  • Slowapi: Configured rate-limiting decorators (slowapi) on the /query endpoint, throttling requests per IP address to mitigate denial-of-service attempts.
  • Frontend Widget: A vanilla JS chat widget embedded on the portfolio site interacts asynchronously with the Hugging Face Space endpoint, rendering streaming-like responses and formatting source attributions clearly.
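
To make the per-IP throttling idea concrete, here is a standard-library sketch of a sliding-window limiter. It is not slowapi's implementation and not the code used in this service; the class name, the limit of 2 requests per 60 seconds, and the injected clock are all assumptions chosen to keep the example small.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # a web handler would answer with HTTP 429 here
        hits.append(now)
        return True

# Hypothetical limit: 2 requests per 60 seconds per IP
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

Each client address gets its own window, so one noisy client exhausts only its own budget and other visitors are unaffected.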

Verification & Summary

By combining metadata routing with an L2 threshold gate, the RAG Project Assistant keeps its answers grounded in the source documents and refuses questions that fall outside them. It serves as an automated technical screening representative integrated directly into this portfolio site.

This post is licensed under CC BY 4.0 by the author.