This chapter corresponds to code in the ScholaRAG repository

Complete Tutorial: Building Your First RAG System

Follow a real-world example project step-by-step. This tutorial shows you exactly what prompts to copy-paste, what conversations to have with Claude Code, and what results to expect at each stage.

📖 Before You Start

Make sure you've completed Getting Started. You should have ScholaRAG cloned and Claude Code installed.

📚 Example Project

Context:

• Researcher: PhD student in Education
• Field: Language Learning
• Time: 30 min active + 2 hrs automated

Research Question:

"Do AI chatbots improve speaking proficiency in university language learners?"

Goal: 60-80 papers for dissertation

The 7-Stage Workflow

Each stage uses a dedicated prompt that you copy-paste to Claude Code. Claude reads the prompt, has a conversation with you, then automatically executes Python scripts when ready.

STAGES 1-3

Planning

Define scope, design queries, configure PRISMA criteria

⏱️ ~25 minutes

STAGES 4-5

Building

Design the RAG configuration, fetch papers, screen with AI-PRISMA, build vector DB

⏱️ ~1-2 hours (automated)

STAGES 6-7

Research

Query your RAG, write documentation

⏱️ Ongoing

Step 0: Initialize Project

Before starting Stage 1, create your project folder:

cd ScholaRAG
python scholarag_cli.py init

# Answer the prompts:
# Project name: AI-Chatbots-Language-Learning
# Research question: Do AI chatbots improve speaking proficiency?
# Domain: education

This creates a timestamped project folder with a standardized structure:

projects/2025-10-24_AI-Chatbots-Language-Learning/
├── config.yaml
├── data/
├── rag/
└── outputs/
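As a rough illustration of the layout above, the sketch below builds the same folder tree in plain Python. The `create_project` helper is hypothetical and is not the code behind `scholarag_cli.py init`; it assumes only the naming pattern (date, underscore, project name) and the four entries shown in the listing.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def create_project(root: Path, name: str, today: Optional[date] = None) -> Path:
    """Create <root>/<YYYY-MM-DD>_<name>/ with config.yaml, data/, rag/, outputs/.

    Hypothetical helper for illustration; the real CLI may do more.
    """
    today = today or date.today()
    project = root / f"{today.isoformat()}_{name}"
    for sub in ("data", "rag", "outputs"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # Empty placeholder; the real CLI presumably writes your answers here.
    (project / "config.yaml").touch()
    return project
```

Calling `create_project(Path("projects"), "AI-Chatbots-Language-Learning")` would produce a folder named after today's date, matching the pattern in the example.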

Stage 1: Research Domain Setup (15 min)

📋 What This Stage Does

Refines your research question through conversation. Claude asks clarifying questions about scope, constraints, and criteria. Result: Updated config.yaml with precise parameters.

How to Run

1. Go to the project folder and open it in VS Code:

cd projects/2025-10-24_AI-Chatbots-Language-Learning

2. Open Claude Code: Press Cmd/Ctrl + Shift + P → "Claude: Open Chat"
3. Copy Stage 1 prompt: Open ScholaRAG/prompts/01_research_domain_setup.md
4. Paste it into Claude Code and follow the conversation

💬 Example Conversation

Claude:

"Are you focusing on ESL or all foreign languages?"

You:

"Both ESL and foreign languages"

Claude:

"Should we include rule-based chatbots or only AI-powered?"

You:

"Only AI-powered (neural networks)"

...and so on for 3-5 rounds

✅ Stage 1 Complete When:

• config.yaml updated with refined criteria
• Research question is specific and answerable
• Expected paper count is reasonable (20-500)

Stage 2: Query Strategy (10 min)

📋 What This Stage Does

Designs Boolean search queries for each database (Semantic Scholar, OpenAlex, arXiv). Claude suggests keywords, synonyms, and exclusion terms. Result: Search queries in config.yaml.

How to Run

1. Open ScholaRAG/prompts/02_query_strategy.md
2. Copy entire prompt → Paste to Claude Code
3. Claude generates queries → You review and approve

📝 Example Query Output

Semantic Scholar Query:
(chatbot OR "conversational agent" OR "dialogue system") AND
("language learning" OR "second language" OR "foreign language") AND
(speaking OR pronunciation OR fluency OR "oral proficiency")

Exclusion:
NOT (children OR "primary school" OR "elementary")
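The structure of the query above (OR within a concept group, AND between groups, NOT for exclusions) can be assembled programmatically. The following is a minimal sketch with hypothetical helper names; it only builds the string, and whether a given database accepts this exact Boolean syntax is an assumption you should check against that database's documentation.

```python
def or_group(terms):
    """Join terms with OR, quoting any multi-word phrase."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(include_groups, exclude_terms=()):
    """AND together the OR-groups, then append a NOT clause if given."""
    query = " AND ".join(or_group(group) for group in include_groups)
    if exclude_terms:
        query += " NOT " + or_group(exclude_terms)
    return query


if __name__ == "__main__":
    print(build_query(
        [
            ["chatbot", "conversational agent", "dialogue system"],
            ["language learning", "second language", "foreign language"],
            ["speaking", "pronunciation", "fluency", "oral proficiency"],
        ],
        exclude_terms=["children", "primary school", "elementary"],
    ))
```

Quoting multi-word phrases automatically avoids the common mistake of leaving a phrase such as language learning unquoted, which many search engines would treat as two separate terms.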

Stage 3: PRISMA Configuration (20 min)

📋 What This Stage Does

Creates an AI-PRISMA rubric with inclusion/exclusion criteria that defines how the AI will evaluate papers. Result: data/prisma/ai_prisma_rubric.yaml

How to Run

1. Copy prompts/03_prisma_configuration.md and paste it into Claude Code
2. Define criteria through conversation (research design, outcome measures, etc.)
3. Claude generates rubric → You review
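To make the idea of inclusion and exclusion criteria concrete, here is a deliberately simplified sketch. It is not the AI-PRISMA rubric format and not how ScholaRAG screens papers: the `Rubric` class and `screen` function are hypothetical, and plain keyword matching stands in for the AI evaluation described above.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    # Hypothetical structure; the real rubric lives in ai_prisma_rubric.yaml.
    include_any: list = field(default_factory=list)  # at least one must appear
    exclude_any: list = field(default_factory=list)  # none may appear


def screen(abstract: str, rubric: Rubric) -> str:
    """Return 'include', 'exclude', or 'unsure' for one abstract."""
    text = abstract.lower()
    # Exclusion criteria take priority over inclusion criteria.
    if any(term in text for term in rubric.exclude_any):
        return "exclude"
    if any(term in text for term in rubric.include_any):
        return "include"
    return "unsure"
```

The three-way result mirrors a useful habit when reviewing a generated rubric: check what happens to papers that match neither list, since those are the ones most likely to need human judgment.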

Stage 4: RAG Design (15 min)

📋 What This Stage Does

Plans vector database configuration (chunk size, embedding model, etc.). Result: RAG config in config.yaml

This stage is mostly automated. Claude uses sensible defaults (ChromaDB, 512-token chunks, OpenAI embeddings). You just confirm.
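For intuition about what a 512-token chunk size means, the sketch below splits a token list into fixed-size chunks with optional overlap. It is an assumption-laden illustration, not the chunker in `05_build_rag.py`: the `chunk_tokens` name and the `overlap` parameter are hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=0):
    """Split a token list into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens (0 means no overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
        start += step
    return chunks


if __name__ == "__main__":
    words = ("lorem ipsum " * 600).split()  # 1200 stand-in tokens
    print([len(c) for c in chunk_tokens(words)])
```

Smaller chunks give more precise retrieval but less context per hit; overlap reduces the chance that a sentence relevant to your question is cut in half at a chunk boundary.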

Stage 5: Execution (1-2 hours, automated)

📋 What This Stage Does

Runs 5 Python scripts sequentially: fetch → deduplicate → screen → download PDFs → build RAG. Claude executes these automatically. You can monitor progress.
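If you ever need to run the pipeline yourself rather than through Claude, the sequence can be sketched as a small runner. This assumes the five scripts live under `scripts/` (as the later `python scripts/...` commands suggest) and take no command-line arguments, which this tutorial does not confirm; `run_pipeline` is a hypothetical helper.

```python
import subprocess
import sys

# Script names as listed in this tutorial; the scripts/ prefix is an assumption.
SCRIPTS = [
    "scripts/01_fetch_papers.py",
    "scripts/02_deduplicate.py",
    "scripts/03_screen_papers.py",
    "scripts/04_download_pdfs.py",
    "scripts/05_build_rag.py",
]


def run_pipeline(scripts=SCRIPTS, runner=subprocess.run):
    """Run each script in order, stopping at the first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run on incomplete data.
        runner([sys.executable, script], check=True)


if __name__ == "__main__":
    run_pipeline()
```

Stopping at the first failure matters here because each stage consumes the previous stage's output file.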

Automated Steps

01_fetch_papers.py

Queries Semantic Scholar, OpenAlex, arXiv

Output: data/open_access/*.csv

02_deduplicate.py

Removes duplicates by DOI, title similarity

Output: data/combined/deduplicated.csv

03_screen_papers.py

AI-PRISMA screening with Claude

Output: data/prisma/screened.csv

04_download_pdfs.py

Downloads PDFs from open access URLs

Output: data/pdfs/*.pdf

05_build_rag.py

Chunks PDFs, generates embeddings, builds ChromaDB

Output: rag/chroma_db/
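The deduplication step (by DOI, then by title similarity) can be illustrated with the standard library alone. This is a minimal sketch, not the logic in `02_deduplicate.py`: the `deduplicate` function, the dictionary keys, and the 0.9 similarity threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(title.lower().split())


def deduplicate(papers, threshold=0.9):
    """Keep the first occurrence of each paper.

    A paper is a duplicate if its DOI was already seen (case-insensitive)
    or its title is at least `threshold` similar to a kept title.
    """
    kept = []
    seen_dois = set()
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        if doi and doi in seen_dois:
            continue
        title = normalize_title(paper.get("title") or "")
        if title and any(
            SequenceMatcher(None, title, normalize_title(k.get("title") or "")).ratio()
            >= threshold
            for k in kept
        ):
            continue
        if doi:
            seen_dois.add(doi)
        kept.append(paper)
    return kept
```

The pairwise title comparison is quadratic in the number of papers, which is fine for a few thousand records but would need blocking or hashing for much larger sets.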

✅ Stage 5 Complete When:

• Vector database built: rag/chroma_db/
• PDFs downloaded: data/pdfs/
• PRISMA diagram generated: outputs/prisma_diagram.png

Stage 6: Research Conversation (ongoing)

📋 What This Stage Does

Query your RAG system to extract insights. Use specialized prompts from the Prompt Library.

Instead of chatting with Claude directly, you use the RAG interface:

python scripts/06_query_rag.py

This ensures all answers are grounded in your screened papers, not Claude's general knowledge.
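The grounding idea can be shown with a toy retriever: rank stored chunks by similarity to the question, then build a prompt from the top matches only. This sketch is not `06_query_rag.py`. Word counts stand in for the embeddings, a plain list stands in for ChromaDB, and every function name here is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt that restricts the answer to retrieved excerpts."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the excerpts below.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Because the model only sees the retrieved excerpts, an answer can be traced back to specific chunks of your screened papers.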

📖 Research Conversation Guide

Learn query strategies and best practices

Stage 7: Documentation & Writing (ongoing)

📋 What This Stage Does

Generate PRISMA diagrams, organize findings, prepare publication materials.

Generate PRISMA diagram:

python scripts/07_generate_prisma.py

This creates a publication-ready PRISMA 2020 flow diagram with your paper counts.
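The paper counts in a PRISMA flow diagram follow from simple subtraction between pipeline stages. The sketch below shows that arithmetic under stated assumptions: `prisma_counts` is a hypothetical helper, not part of `07_generate_prisma.py`, and the numbers in the usage example are invented purely for illustration.

```python
def prisma_counts(identified, after_dedup, included, retrieved):
    """Derive flow-diagram counts from four pipeline totals.

    identified:  records returned by all database searches
    after_dedup: records left after deduplication
    included:    records that passed screening
    retrieved:   PDFs actually downloaded
    """
    stages = [identified, after_dedup, included, retrieved]
    if any(n < 0 for n in stages) or stages != sorted(stages, reverse=True):
        raise ValueError("counts must be non-negative and non-increasing")
    return {
        "identified": identified,
        "duplicates_removed": identified - after_dedup,
        "screened": after_dedup,
        "excluded_at_screening": after_dedup - included,
        "sought_for_retrieval": included,
        "not_retrieved": included - retrieved,
        "included_in_rag": retrieved,
    }


if __name__ == "__main__":
    # Invented example numbers, not results from a real run.
    for box, n in prisma_counts(400, 310, 95, 70).items():
        print(f"{box}: {n}")
```

Checking that the counts never increase from one stage to the next is a cheap way to catch a stale CSV or a re-run that only updated part of the pipeline.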

📄 Documentation & Writing Guide

Learn how to structure systematic reviews and manage bibliographies

Common Issues

No papers found in Stage 5

Cause: Query too restrictive or databases don't have papers in your domain

Solution:

1. Go back to Stage 2 and broaden the queries
2. Adjust year range (e.g., 2010-2025 instead of 2020-2025)
3. Add more synonym keywords

AI-PRISMA screening rejected all papers

Cause: Inclusion criteria too strict

Solution:

1. Review data/prisma/ai_prisma_rubric.yaml
2. Relax criteria (e.g., allow quasi-experimental, not just RCT)
3. Re-run python scripts/03_screen_papers.py

PDF download rate very low (<30%)

Cause: Most papers are behind paywalls (common in medicine/psychology)

Solution:

1. Use institutional access if available
2. Add arXiv preprints to search
3. Consider "Knowledge Repository Mode" (see README) for broader coverage

What's Next?

You now have a working RAG system! The real research begins in Stage 6. Explore the Prompt Library for specialized query templates, or dive into Research Conversation to learn advanced querying techniques.
