Complete Tutorial: Building Your First RAG System
Follow a real-world example project step-by-step. This tutorial shows you exactly what prompts to copy-paste, what conversations to have with Claude Code, and what results to expect at each stage.
Before You Start
Make sure you've completed Getting Started. You should have ScholaRAG cloned and Claude Code installed.
Example Project
Context:
- Researcher: PhD student in Education
- Field: Language Learning
- Time: 30 min active + 2 hrs automated
Research Question:
"Do AI chatbots improve speaking proficiency in university language learners?"
Goal: 60-80 papers for dissertation
The 7-Stage Workflow
Each stage uses a dedicated prompt that you copy-paste to Claude Code. Claude reads the prompt, has a conversation with you, then automatically executes Python scripts when ready.
STAGES 1-3
Planning
Define scope, design queries, configure PRISMA criteria
~25 minutes
STAGES 4-5
Building
Fetch papers, screen with AI-PRISMA, build vector DB
~1-2 hours (automated)
STAGES 6-7
Research
Query your RAG, write documentation
Ongoing
Step 0: Initialize Project
Before starting Stage 1, create your project folder:
cd ScholaRAG
python scholarag_cli.py init
# Answer the prompts:
# Project name: AI-Chatbots-Language-Learning
# Research question: Do AI chatbots improve speaking proficiency?
# Domain: education
This creates a timestamped project folder with a standardized structure:
projects/2025-10-24_AI-Chatbots-Language-Learning/
├── config.yaml
├── data/
├── rag/
└── outputs/
Stage 1: Research Domain Setup (15 min)
What This Stage Does
Refines your research question through conversation. Claude asks clarifying questions about scope, constraints, and criteria. Result: Updated config.yaml with precise parameters.
How to Run
- 1. Open project in VS Code: cd projects/2025-10-24_AI-Chatbots-Language-Learning
- 2. Open Claude Code: Press Cmd/Ctrl + Shift + P → "Claude: Open Chat"
- 3. Copy Stage 1 prompt: Open ScholaRAG/prompts/01_research_domain_setup.md
- 4. Paste to Claude Code and follow conversation
Example Conversation
Claude:
"Are you focusing on ESL or all foreign languages?"
You:
"Both ESL and foreign languages"
Claude:
"Should we include rule-based chatbots or only AI-powered?"
You:
"Only AI-powered (neural networks)"
...and so on for 3-5 rounds
Stage 1 Complete When:
- config.yaml updated with refined criteria
- Research question is specific and answerable
- Expected paper count is reasonable (20-500)
Stage 2: Query Strategy (10 min)
What This Stage Does
Designs Boolean search queries for each database (Semantic Scholar, OpenAlex, arXiv). Claude suggests keywords, synonyms, and exclusion terms. Result: Search queries in config.yaml.
How to Run
- 1. Open ScholaRAG/prompts/02_query_strategy.md
- 2. Copy entire prompt → Paste to Claude Code
- 3. Claude generates queries → You review and approve
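To make the shape of such a query concrete, here is a minimal sketch of how keyword groups could be combined into one Boolean string. The keyword lists come from this tutorial's example project; the helper names `or_group` and `build_query` are hypothetical and are not part of ScholaRAG. Whether a given database accepts this exact syntax is an assumption you should check per source.

```python
def or_group(terms):
    """Join terms with OR, quoting any multi-word phrase."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(include_groups, exclude_terms=()):
    """AND the OR-groups together, then append an optional NOT group."""
    query = " AND ".join(or_group(group) for group in include_groups)
    if exclude_terms:
        query += " NOT " + or_group(exclude_terms)
    return query


# Keyword groups from the example project (technology, context, outcome).
include = [
    ["chatbot", "conversational agent", "dialogue system"],
    ["language learning", "second language", "foreign language"],
    ["speaking", "pronunciation", "fluency", "oral proficiency"],
]
exclude = ["children", "primary school", "elementary"]

print(build_query(include, exclude))
```

Keeping each concept in its own list makes it easy to add synonyms to one group without touching the others, which is the usual way to broaden a query that returns too few papers.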
Example Query Output
Semantic Scholar Query:
(chatbot OR "conversational agent" OR "dialogue system") AND
("language learning" OR "second language" OR "foreign language") AND
(speaking OR pronunciation OR fluency OR "oral proficiency")
Exclusion:
NOT (children OR "primary school" OR "elementary")
Stage 3: PRISMA Configuration (20 min)
What This Stage Does
Creates AI-PRISMA rubric with inclusion/exclusion criteria. Defines how AI will evaluate papers. Result: data/prisma/ai_prisma_rubric.yaml
How to Run
- 1. Copy prompts/03_prisma_configuration.md to Claude
- 2. Define criteria through conversation (research design, outcome measures, etc.)
- 3. Claude generates rubric → You review
Stage 4: RAG Design (15 min)
What This Stage Does
Plans vector database configuration (chunk size, embedding model, etc.). Result: RAG config in config.yaml
This stage is mostly automated. Claude uses sensible defaults (ChromaDB, 512-token chunks, OpenAI embeddings). You just confirm.
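To show what a 512-token chunk size means in practice, here is a minimal sketch of fixed-size chunking. It is not ScholaRAG's implementation: it treats whitespace-separated words as a stand-in for tokens (the real tokenizer is not specified here), and the `overlap` parameter is an assumption added for illustration.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=0):
    """Split a token list into windows of chunk_size, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# A 1200-"token" document (whitespace words used as an approximation).
tokens = ("word " * 1200).split()
print([len(chunk) for chunk in chunk_tokens(tokens)])  # [512, 512, 176]
```

Smaller chunks give more precise retrieval hits but less context per hit; a nonzero overlap reduces the chance that a sentence is cut in half at a chunk boundary.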
Stage 5: Execution (1-2 hours, automated)
What This Stage Does
Runs 5 Python scripts sequentially: fetch → deduplicate → screen → download PDFs → build RAG. Claude executes these automatically. You can monitor progress.
Automated Steps
01_fetch_papers.py
Queries Semantic Scholar, OpenAlex, arXiv
Output: data/open_access/*.csv
02_deduplicate.py
Removes duplicates by DOI, title similarity
Output: data/combined/deduplicated.csv
03_screen_papers.py
AI-PRISMA screening with Claude
Output: data/prisma/screened.csv
04_download_pdfs.py
Downloads PDFs from open access URLs
Output: data/pdfs/*.pdf
05_build_rag.py
Chunks PDFs, generates embeddings, builds ChromaDB
Output: rag/chroma_db/
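The deduplication step above (by DOI, then by title similarity) can be sketched as follows. This is an illustration rather than the contents of 02_deduplicate.py: the record fields `doi` and `title`, the 0.9 similarity threshold, and the use of `difflib.SequenceMatcher` are all assumptions, and the sample records are placeholders.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def deduplicate(papers, title_threshold=0.9):
    """Keep the first record for each DOI and each near-identical title."""
    kept, seen_dois, seen_titles = [], set(), []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")
        if doi and doi in seen_dois:
            continue  # exact DOI match (case-insensitive)
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= title_threshold
            for other in seen_titles
        ):
            continue  # title is close enough to one already kept
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.append(title)
        kept.append(paper)
    return kept


# Placeholder records: one DOI duplicate and one title duplicate.
records = [
    {"doi": "10.1000/example-1", "title": "AI Chatbots for Speaking Practice"},
    {"doi": "10.1000/EXAMPLE-1", "title": "Unrelated placeholder"},
    {"doi": "", "title": "AI chatbots for speaking practice."},
    {"doi": "10.1000/example-2", "title": "A Different Placeholder Title"},
]
print([p["doi"] for p in deduplicate(records)])
# ['10.1000/example-1', '10.1000/example-2']
```

The title check matters because the same paper often arrives from different databases with a missing or differently formatted DOI.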
Stage 5 Complete When:
- Vector database built: rag/chroma_db/
- PDFs downloaded: data/pdfs/
- PRISMA diagram generated: outputs/prisma_diagram.png
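If you prefer to verify these three conditions from a script, a small check like the one below may help. It is a sketch, not part of ScholaRAG: the function name `stage5_status` is hypothetical, and it only tests that the paths listed above exist and are non-empty, not that their contents are valid.

```python
from pathlib import Path


def stage5_status(project_dir):
    """Report whether each Stage 5 output exists under the project folder."""
    root = Path(project_dir)
    chroma = root / "rag" / "chroma_db"
    pdfs = root / "data" / "pdfs"
    return {
        # Directory exists and contains at least one entry.
        "vector_db": chroma.is_dir() and any(chroma.iterdir()),
        # At least one downloaded PDF is present.
        "pdfs": pdfs.is_dir() and any(pdfs.glob("*.pdf")),
        "prisma_diagram": (root / "outputs" / "prisma_diagram.png").is_file(),
    }


# Project folder name taken from the example in this tutorial.
status = stage5_status("projects/2025-10-24_AI-Chatbots-Language-Learning")
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```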
Stage 6: Research Conversation (ongoing)
What This Stage Does
Query your RAG system to extract insights. Use specialized prompts from Prompt Library.
Instead of direct Claude chat, you use the RAG interface:
python scripts/06_query_rag.py
This ensures all answers are grounded in your screened papers, not Claude's general knowledge.
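The idea of grounding can be illustrated with a toy retrieval step: pick the stored chunks most similar to the question, then build a prompt that contains only those chunks. The sketch below is not how 06_query_rag.py works. It swaps the embedding search for a simple bag-of-words cosine similarity so that it runs with the standard library alone, and the chunk texts are placeholders, not real excerpts.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = bow(question)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]


def grounded_prompt(question, chunks, k=2):
    """Build a prompt that restricts the answer to the retrieved excerpts."""
    excerpts = retrieve(question, chunks, k)
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(excerpts, 1))
    return f"Answer using only the excerpts below.\n{context}\nQuestion: {question}"


# Placeholder chunk texts standing in for passages from screened papers.
chunks = [
    "excerpt about chatbot speaking practice",
    "excerpt about vocabulary apps",
    "excerpt about teacher feedback",
]
print(grounded_prompt("chatbot speaking practice", chunks, k=1))
```

Because the model only sees the retrieved excerpts, an answer can be traced back to specific passages instead of to general training knowledge.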
Research Conversation Guide
Learn query strategies and best practices
Prompt Library
7 ready-to-use research prompts
Stage 7: Documentation & Writing (ongoing)
What This Stage Does
Generate PRISMA diagrams, organize findings, prepare publication materials.
Generate PRISMA diagram:
python scripts/07_generate_prisma.py
This creates a publication-ready PRISMA 2020 flow diagram with your paper counts.
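The paper counts in a flow diagram follow from simple subtraction between pipeline stages. The sketch below shows that arithmetic under the assumption that each stage only removes records. The function name, the dictionary keys, and the sample numbers are hypothetical placeholders and are not output from 07_generate_prisma.py.

```python
def prisma_counts(identified, after_dedup, passed_screening, pdfs_downloaded):
    """Derive flow-diagram counts from the size of each pipeline stage."""
    if not (identified >= after_dedup >= passed_screening >= pdfs_downloaded >= 0):
        raise ValueError("each stage must be no larger than the one before it")
    return {
        "records_identified": identified,
        "duplicates_removed": identified - after_dedup,
        "records_screened": after_dedup,
        "records_excluded": after_dedup - passed_screening,
        "reports_sought": passed_screening,
        "reports_not_retrieved": passed_screening - pdfs_downloaded,
        "included_in_rag": pdfs_downloaded,
    }


# Placeholder numbers for illustration only.
for label, count in prisma_counts(400, 300, 90, 60).items():
    print(f"{label}: {count}")
```

Checking that the counts decrease monotonically is a quick way to catch a stage that was re-run against stale files.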
Documentation & Writing Guide
Learn how to structure systematic reviews and manage bibliographies
Common Issues
No papers found in Stage 5
Cause: Query too restrictive or databases don't have papers in your domain
Solution:
- 1. Go back to Stage 2, broaden queries
- 2. Adjust year range (e.g., 2010-2025 instead of 2020-2025)
- 3. Add more synonym keywords
AI-PRISMA screening rejected all papers
Cause: Inclusion criteria too strict
Solution:
- 1. Review data/prisma/ai_prisma_rubric.yaml
- 2. Relax criteria (e.g., allow quasi-experimental, not just RCT)
- 3. Re-run python scripts/03_screen_papers.py
PDF download rate very low (<30%)
Cause: Most papers behind paywalls (common in medicine/psychology)
Solution:
- 1. Use institutional access if available
- 2. Add arXiv preprints to search
- 3. Consider "Knowledge Repository Mode" (see README) for broader coverage
What's Next?
You now have a working RAG system! The real research begins in Stage 6. Explore the Prompt Library for specialized query templates, or dive into Research Conversation to learn advanced querying techniques.