Introduction to ScholaRAG
Learn how ScholaRAG transforms the traditional literature review process, turning weeks of manual screening and note-taking into hours of AI-assisted processing.
What is ScholaRAG?
ScholaRAG is an open-source, conversational AI-guided system that helps researchers build custom RAG (Retrieval-Augmented Generation) systems for academic literature review. Built on top of VS Code and Claude Code, it guides you through every step of creating a systematic review pipeline.
💡 Key Insight
Unlike generic chatbots, ScholaRAG creates a dedicated knowledge base from your specific research domain, ensuring every answer is grounded in the papers you've screened and approved.

The Problem It Solves
Traditional Literature Review (6-8 weeks)
If you've ever conducted a systematic review, you know the pain:
- Database Search: Spend days crafting queries for PubMed, ERIC, Web of Science
- Export & Screen: Download 500+ papers, export to Excel, read abstracts one by one
- Full-Text Review: Manually review 200+ PDFs for inclusion criteria
- Data Extraction: Copy-paste findings, methods, and statistics into spreadsheets
- Citation Hell: Constantly re-read papers to verify citations and quotes
The result? 67-75% of your time spent on mechanical tasks instead of analysis.
⚠️ Common Pain Point
"I've read this paper three times, but I still can't remember which one had the meta-analysis on sample size calculations." – Every PhD student, ever.
With ScholaRAG (2-3 weeks)
- 30-minute Setup: Build your RAG system with step-by-step Claude Code guidance
- 2-hour Screening: PRISMA pipeline screens thousands of papers automatically
- Instant Queries: Ask questions and get answers with specific paper citations
- Never Forget: Your RAG system remembers every relevant detail across all papers
✅ Real Results
PhD students using ScholaRAG complete literature reviews in 2-3 weeks instead of 6-8 weeks, spending more time on analysis and writing.
What You'll Build
In approximately 30 minutes of active setup (plus 3-4 hours of automated processing), you'll create:
PRISMA Pipeline
Screen 500+ papers down to 50-150 highly relevant ones using systematic criteria
Database Strategy
ScholaRAG supports multi-database coverage, combining free open-access sources with optional institutional databases for broader reach:
Open Access Databases (Free, No Setup Required)
Semantic Scholar
CS, Engineering, and General Sciences
- ✅ 200M+ papers indexed
- ✅ Free API (no key needed)
- ✅ ~40% open access PDFs
- ✅ AI-generated TL;DR summaries
OpenAlex
All fields, comprehensive metadata
- ✅ 250M+ works catalogued
- ✅ Free API (unlimited)
- ✅ ~50% open access links
- ✅ Rich metadata (citations, authors)
arXiv
STEM preprints
- ✅ 2.4M+ preprints
- ✅ Free API (no key needed)
- ✅ 100% PDF access
- ✅ Latest research (pre-publication)
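To show what "free API, no key needed" looks like in practice, here is a minimal sketch against the public Semantic Scholar Graph API. The query string and field list are illustrative, not ScholaRAG's internal code:

```python
import requests

# Search the free Semantic Scholar Graph API (no API key required).
# ScholaRAG builds queries like this from your Stage 2 query strategy.
resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "retrieval augmented generation systematic review",
        "fields": "title,year,abstract,openAccessPdf",
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    pdf = (paper.get("openAccessPdf") or {}).get("url", "no PDF")
    print(f"{paper.get('year')}  {paper['title']}  ->  {pdf}")
```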
Institutional Databases (Optional, Requires Access)
Scopus
Comprehensive multidisciplinary index
- ✅ 87M+ records (1788-present)
- ⚠️ Requires institutional access
- Metadata only (no PDFs)
- ✅ Excellent for broad coverage
Web of Science
High-impact research index
- ✅ 171M+ records (1900-present)
- ⚠️ Requires institutional subscription
- Metadata only (no PDFs)
- ✅ Citation network analysis
💡 Complete Retrieval Strategy
ScholaRAG fetches ALL available papers from each database (no arbitrary limits):
- ✅ Comprehensive coverage - never miss relevant papers
- ✅ Newest-first ordering - recent papers prioritized
- ✅ Smart pagination - handles databases with 20K+ results (see the sketch after this list)
- ✅ User confirmation - interactive prompts for large datasets
- ✅ Year cutoff suggestions - manage scope effectively
Institutional databases provide metadata only but dramatically increase paper identification (3-5x more papers found).
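To make "smart pagination" concrete, here is a minimal sketch of cursor-based paging against the free OpenAlex API. The cursor parameters are OpenAlex's documented interface; the search term is illustrative, and this is not ScholaRAG's actual retrieval code:

```python
import requests

# Cursor-based paging against the free OpenAlex API: pass cursor=* on the
# first request, then follow meta.next_cursor until the result set is
# exhausted. In practice you would cap this loop or apply a year cutoff,
# as the interactive prompts described above suggest.
url = "https://api.openalex.org/works"
params = {
    "search": "retrieval augmented generation",
    "sort": "publication_date:desc",  # newest-first ordering
    "per-page": 200,
    "cursor": "*",
}
works = []
while params["cursor"]:
    page = requests.get(url, params=params, timeout=30).json()
    works.extend(page["results"])
    params["cursor"] = page["meta"].get("next_cursor")  # None when done
print(f"Retrieved {len(works)} works")
```

Cursor paging is what makes 20K+ result sets tractable: each response hands back an opaque next_cursor, so the client never has to compute offsets.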
You'll set up access to these databases in Step 6 of Getting Started, and learn to query them effectively in Stage 2 of the workflow.
Core Concepts
1. AI-Powered PRISMA Screening
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is the gold standard for systematic reviews. ScholaRAG implements PRISMA 2020 with AI-enhanced multi-dimensional evaluation:
- Identification: Comprehensive database search with complete retrieval (no limits)
- Screening: AI-powered multi-dimensional evaluation using large language models
- Eligibility: Confidence-based routing (auto-include/exclude/human-review)
- Inclusion: Validated final set with optional human agreement metrics (Cohen's Kappa)
✅ Multi-Dimensional AI Evaluation
ScholaRAG uses an AI-PRISMA rubric with transparent criteria:
- Sub-criteria scoring - Population, Intervention, Comparison, Outcomes (PICO framework)
- Evidence grounding - AI must quote abstract text to justify decisions
- Confidence thresholds - Auto-include ≥90%, auto-exclude ≤10%, human review 11-89%
- Hallucination detection - Cross-check quoted evidence against abstracts
- Human validation - Optional quality check with inter-rater reliability (Cohen's κ)
This approach achieves 10-20% pass rates, in line with manual systematic review standards (versus ~93% with simple keyword matching).
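A minimal sketch of the confidence-based routing and hallucination check described above, assuming the thresholds from the list (the function name and data shapes are hypothetical, not ScholaRAG's API):

```python
def route_paper(confidence: float, quoted_evidence: str, abstract: str) -> str:
    """Route one screened paper using the thresholds described above.

    `confidence` is the AI's inclusion confidence in [0, 1];
    `quoted_evidence` is the abstract text the model cited to justify
    its decision. Names and shapes here are illustrative.
    """
    # Hallucination check: the quoted evidence must actually appear in
    # the abstract, otherwise the decision goes to a human.
    if quoted_evidence and quoted_evidence not in abstract:
        return "human-review"  # model quoted text that is not there
    if confidence >= 0.90:
        return "auto-include"
    if confidence <= 0.10:
        return "auto-exclude"
    return "human-review"      # 11-89%: ambiguous, needs a person

# Example: high confidence but a fabricated quote -> human review
print(route_paper(0.95, "randomized controlled trial", "A survey of LLMs."))
```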
2. RAG (Retrieval-Augmented Generation)
RAG combines two powerful capabilities:
- Retrieval: Semantic search finds the most relevant papers and sections
- Generation: LLM synthesizes answers grounded in retrieved content
This architecture prevents hallucinations by ensuring every statement is backed by actual research. Learn more about RAG in our Implementation Guide.
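As an illustration of the retrieve-then-generate pattern, here is a sketch using Chroma as a stand-in vector database (ScholaRAG's actual vector store and embedding model are whatever you configure in Stage 4 below):

```python
import chromadb

# Index a few screened abstracts in an in-memory Chroma collection.
client = chromadb.Client()
papers = client.create_collection("papers")
papers.add(
    ids=["smith2023", "lee2024"],
    documents=[
        "Smith (2023): RAG reduces hallucination in clinical QA systems.",
        "Lee (2024): Embedding choice strongly affects retrieval recall.",
    ],
)

# Retrieval step: semantic search for the chunks most relevant to a question.
hits = papers.query(query_texts=["Does RAG reduce hallucinations?"], n_results=1)
context = "\n".join(hits["documents"][0])

# Generation step: the LLM answers *only* from the retrieved context.
prompt = f"Answer using only these excerpts, with citations:\n{context}\n\nQ: Does RAG reduce hallucinations?"
print(prompt)  # this prompt would be sent to the LLM
```

Because the prompt contains only retrieved excerpts, the model has nothing to answer from except the screened papers, which is the grounding guarantee described above.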
3. 7-Stage Workflow
ScholaRAG breaks down the complex process into 7 conversational stages with Claude Code:
1. Research Domain Setup (15 min) - Define your research question, scope, and objectives
2. Query Strategy Design (10 min) - Craft Boolean search queries for multiple databases
3. PRISMA Configuration (20 min) - Set inclusion criteria and screen papers automatically
4. RAG System Design (15 min) - Configure the vector database and embedding model
5. Execution Plan (10 min) - Review the automation pipeline before execution
6. Research Conversation (2-3 hrs, automated) - Download PDFs, build the RAG system, run queries
7. Documentation Writing (1-2 hrs) - Generate PRISMA diagrams and research reports
Who Should Use ScholaRAG?
🎓 PhD Students
Dissertation literature reviews, qualifying exams, and proposal development
🔬 Researchers
Meta-analysis preparation, grant writing, and systematic reviews
👨‍🏫 Professors
Course material updates, research synthesis, and mentoring students
📚 Librarians
Systematic review consulting and research data management
Prerequisites
Before starting, ensure you have:
- VS Code with Claude Code extension installed
- Python 3.9+ on your system
- Anthropic API key (free tier available)
- 30 minutes for initial setup + 3-4 hours for automated processing
- Basic familiarity with your research domain
Note on API Costs & Efficiency
ScholaRAG supports the latest AI coding models optimized for research automation:
- Claude Sonnet 4.5 (Oct 2025): Currently the most effective coding model for research automation, achieving state-of-the-art performance on SWE-bench
- Claude Haiku 4.5 (Oct 2025): Frontier performance at 1/3 cost, 4-5x faster than Sonnet 3.5 - excellent for high-volume screening tasks
- GPT-5-Codex: Advanced code generation model with superior reasoning for complex research workflows
A typical literature review (500 papers screened, 150 included) costs under $20 with Haiku 4.5 or $25-40 with Sonnet 4.5. Compare this to weeks of manual labor!
Next Steps
Ready to start building? Head to Chapter 2: Getting Started to set up your environment and run your first ScholaRAG workflow.
Quick start preview:
```bash
# Clone the repository
git clone https://github.com/HosungYou/ScholaRAG.git
cd ScholaRAG

# Install dependencies
pip install -r requirements.txt
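
# Make your Anthropic API key available (ANTHROPIC_API_KEY is the
# standard variable read by the Anthropic SDK; replace the placeholder
# with your own key from the Anthropic console)
export ANTHROPIC_API_KEY="sk-ant-..."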

# Open in VS Code with Claude Code
code .
```

Further Reading: PRISMA Guidelines · Contextual Retrieval (Anthropic) · Templates & Examples