Introduction to ScholaRAG

Learn how ScholaRAG transforms the traditional literature review process from weeks of manual work into hours of AI-powered efficiency.

What is ScholaRAG?

ScholaRAG is an open-source, conversational AI-guided system that helps researchers build custom RAG (Retrieval-Augmented Generation) systems for academic literature review. Built on top of VS Code and Claude Code, it guides you through every step of creating a systematic review pipeline.

💡 Key Insight

Unlike generic chatbots, ScholaRAG creates a dedicated knowledge base from your specific research domain, ensuring every answer is grounded in the papers you've screened and approved.

[Diagram: ScholaRAG's AI knowledge flow, from academic papers through PRISMA filtering to the RAG system and AI assistant]

The Problem It Solves

Traditional Literature Review (6-8 weeks)

If you've ever conducted a systematic review, you know the pain:

  1. Database Search: Spend days crafting queries for PubMed, ERIC, Web of Science
  2. Export & Screen: Download 500+ papers, export to Excel, read abstracts one by one
  3. Full-Text Review: Manually review 200+ PDFs for inclusion criteria
  4. Data Extraction: Copy-paste findings, methods, and statistics into spreadsheets
  5. Citation Hell: Constantly re-read papers to verify citations and quotes

The result: 67-75% of your time goes to mechanical tasks instead of analysis.

โš ๏ธ Common Pain Point

"I've read this paper three times, but I still can't remember which one had the meta-analysis on sample size calculations." โ€” Every PhD student, ever.

With ScholaRAG (2-3 weeks)

  1. 30-minute Setup: Build your RAG system with step-by-step Claude Code guidance
  2. 2-hour Screening: PRISMA pipeline screens thousands of papers automatically
  3. Instant Queries: Ask questions and get answers with specific paper citations
  4. Never Forget: Your RAG system remembers every relevant detail across all papers

✅ Real Results

PhD students using ScholaRAG complete literature reviews in 2-3 weeks instead of 6-8 weeks, spending more time on analysis and writing.

What You'll Build

In approximately 30 minutes of active setup (plus 3-4 hours of automated processing), you'll create:

🔍 PRISMA Pipeline

Screen 500+ papers down to 50-150 highly relevant ones using systematic criteria

🗄️ Vector Database

Semantic search across your papers using ChromaDB or FAISS

🤖 Research RAG

Query system powered by Claude Sonnet 4.5 with paper-specific citations

Database Strategy

ScholaRAG combines free, open-access sources with optional institutional databases for comprehensive multi-database coverage (a minimal API-fetch sketch follows the open-access list below):

Open Access Databases (Free, No Setup Required)

📚 Semantic Scholar

CS, Engineering, and General Sciences

  • ✅ 200M+ papers indexed
  • ✅ Free API (no key needed)
  • ✅ ~40% open access PDFs
  • ✅ AI-generated TL;DR summaries

๐ŸŒ OpenAlex

All fields, comprehensive metadata

  • ✅ 250M+ works catalogued
  • ✅ Free API (unlimited)
  • ✅ ~50% open access links
  • ✅ Rich metadata (citations, authors)

📄 arXiv

STEM preprints

  • ✅ 2.4M+ preprints
  • ✅ Free API (no key needed)
  • ✅ 100% PDF access
  • ✅ Latest research (pre-publication)
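
For a concrete sense of how these free APIs work, here is a minimal Python sketch that queries OpenAlex with the requests library. The search string, year cutoff, and email address are placeholders, and this illustrates the public OpenAlex API rather than ScholaRAG's actual fetch code:

# Minimal OpenAlex search sketch (placeholder query, cutoff, and email)
import requests

resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": "adaptive learning chatbot",         # your keyword/Boolean query
        "filter": "from_publication_date:2020-01-01",  # optional year cutoff
        "sort": "publication_date:desc",               # newest first
        "per-page": 25,
        "mailto": "you@university.edu",                # polite-pool identifier
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print(f"Total matches: {data['meta']['count']}")
for work in data["results"]:
    pdf_url = (work.get("open_access") or {}).get("oa_url")
    print(work["publication_year"], work["display_name"], "| PDF:", pdf_url or "metadata only")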

Institutional Databases (Optional, Requires Access)

🔬 Scopus

Comprehensive multidisciplinary index

  • ✅ 87M+ records (1788-present)
  • ⚠️ Requires institutional access
  • 📊 Metadata only (no PDFs)
  • ✅ Excellent for broad coverage

📖 Web of Science

High-impact research index

  • ✅ 171M+ records (1900-present)
  • ⚠️ Requires institutional subscription
  • 📊 Metadata only (no PDFs)
  • ✅ Citation network analysis

💡 Complete Retrieval Strategy

ScholaRAG fetches ALL available papers from each database, with no arbitrary limits (see the pagination sketch after this list):

  • ✅ Comprehensive coverage - never miss relevant papers
  • ✅ Newest-first ordering - recent papers prioritized
  • ✅ Smart pagination - handles databases with 20K+ results
  • ✅ User confirmation - interactive prompts for large datasets
  • ✅ Year cutoff suggestions - manage scope effectively
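
A cursor-paginated fetch under these rules might look like the sketch below; the 20,000-record confirmation threshold and the example year cutoff are illustrative values, not ScholaRAG's actual defaults:

# Complete-retrieval sketch: OpenAlex cursor pagination with a confirmation prompt
import requests

BASE = "https://api.openalex.org/works"
params = {
    "search": "adaptive learning chatbot",  # placeholder query
    "sort": "publication_date:desc",        # newest-first ordering
    "per-page": 200,                        # OpenAlex's maximum page size
    "cursor": "*",                          # start cursor pagination
    "mailto": "you@university.edu",
}

# Check the total count first so very large result sets can be confirmed or narrowed.
total = requests.get(BASE, params={**params, "per-page": 1}, timeout=30).json()["meta"]["count"]
if total > 20_000:
    reply = input(f"{total} records found. Fetch all (y) or apply a year cutoff (n)? ")
    if reply.lower() != "y":
        params["filter"] = "from_publication_date:2021-01-01"  # example cutoff

records = []
while params["cursor"]:
    page = requests.get(BASE, params=params, timeout=30).json()
    records.extend(page["results"])
    params["cursor"] = page["meta"]["next_cursor"]  # None once results are exhausted

print(f"Retrieved {len(records)} records")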

Institutional databases provide metadata only but dramatically increase paper identification (3-5x more papers found).

You'll set up access to these databases in Step 6 of Getting Started, and learn to query them effectively in Stage 2 of the workflow.

Core Concepts

1. AI-Powered PRISMA Screening

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is the gold standard for systematic reviews. ScholaRAG implements PRISMA 2020 with AI-enhanced multi-dimensional evaluation:

  • Identification: Comprehensive database search with complete retrieval (no limits)
  • Screening: AI-powered multi-dimensional evaluation using large language models
  • Eligibility: Confidence-based routing (auto-include/exclude/human-review)
  • Inclusion: Validated final set with optional human agreement metrics (Cohen's Kappa)

✅ Multi-Dimensional AI Evaluation

ScholaRAG uses an AI-PRISMA Rubric with transparent criteria:

  • Sub-criteria scoring - Population, Intervention, Comparison, Outcomes (PICO framework)
  • Evidence grounding - AI must quote abstract text to justify decisions
  • Confidence thresholds - Auto-include ≥90%, auto-exclude ≤10%, human-review 11-89%
  • Hallucination detection - Cross-check quoted evidence against abstracts
  • Human validation - Optional quality check with inter-rater reliability (κ)

This approach yields pass rates of 10-20%, in line with manual systematic review standards, versus roughly 93% with simple keyword matching.
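
To make the routing rule concrete, here is a minimal sketch. The thresholds come from the rubric above, but the decision-dictionary format and field names are assumptions for illustration, not ScholaRAG's actual interface:

# Confidence-based routing plus a simple evidence-grounding check (illustrative format)
def route(decision: dict) -> str:
    """Route one screened paper by the model's inclusion confidence (0-100)."""
    confidence = decision["confidence"]
    if confidence >= 90:
        return "auto-include"
    if confidence <= 10:
        return "auto-exclude"
    return "human-review"  # 11-89%: a human makes the final call

def evidence_is_grounded(decision: dict, abstract: str) -> bool:
    """Hallucination check: every quoted justification must appear in the abstract."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return all(norm(q) in norm(abstract) for q in decision["quotes"])

# Hypothetical model output for one abstract
decision = {"confidence": 95, "quotes": ["randomized controlled trial with 120 learners"]}
abstract = "We conducted a randomized controlled trial with 120 learners using an AI tutor."
print(route(decision), evidence_is_grounded(decision, abstract))  # auto-include True

On the human-review subset, agreement between human and AI decisions can then be summarized with an inter-rater statistic such as Cohen's κ (for example, via scikit-learn's cohen_kappa_score).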

2. RAG (Retrieval-Augmented Generation)

RAG combines two powerful capabilities:

  1. Retrieval: Semantic search finds the most relevant papers and sections
  2. Generation: LLM synthesizes answers grounded in retrieved content

This architecture prevents hallucinations by ensuring every statement is backed by actual research. Learn more about RAG in our Implementation Guide.
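
As a rough sketch of how the two steps fit together, the snippet below indexes two placeholder abstract chunks in ChromaDB, retrieves the best matches for a question, and asks Claude to answer using only those excerpts. The collection name, model id, and documents are placeholders, not ScholaRAG's internal pipeline:

# Retrieve-then-generate sketch with ChromaDB and the Anthropic SDK (placeholder data)
import chromadb
from anthropic import Anthropic

store = chromadb.PersistentClient(path="./rag_db")
papers = store.get_or_create_collection("screened_papers")  # uses the default embedding model

papers.add(  # in practice: chunked full-text PDFs with richer metadata
    ids=["authorA2023_chunk4", "authorB2024_chunk2"],
    documents=[
        "Placeholder abstract chunk about effect sizes of AI tutoring interventions.",
        "Placeholder abstract chunk about chatbot-based speaking practice outcomes.",
    ],
    metadatas=[{"citation": "Author A, 2023"}, {"citation": "Author B, 2024"}],
)

question = "What effect sizes are reported for AI tutoring interventions?"
hits = papers.query(query_texts=[question], n_results=2)  # retrieval step

context = "\n\n".join(
    f"[{meta['citation']}] {doc}"
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0])
)

# Generation step: the model is instructed to stay grounded in the retrieved chunks.
reply = Anthropic().messages.create(  # reads ANTHROPIC_API_KEY from the environment
    model="claude-sonnet-4-5",        # placeholder model id
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": f"Answer using ONLY these excerpts and cite them:\n\n{context}\n\nQuestion: {question}",
    }],
)
print(reply.content[0].text)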

3. 7-Stage Workflow

ScholaRAG breaks down the complex process into 7 conversational stages with Claude Code:

  1. Research Domain Setup (15 min): Define your research question, scope, and objectives
  2. Query Strategy Design (10 min): Craft Boolean search queries for multiple databases
  3. PRISMA Configuration (20 min): Set inclusion criteria and screen papers automatically
  4. RAG System Design (15 min): Configure the vector database and embedding model
  5. Execution Plan (10 min): Review the automation pipeline before execution
  6. Research Conversation (2-3 hrs, automated): Download PDFs, build the RAG system, run queries
  7. Documentation Writing (1-2 hrs): Generate PRISMA diagrams and research reports

Who Should Use ScholaRAG?

🎓 PhD Students

Dissertation literature reviews, qualifying exams, and proposal development

🔬 Researchers

Meta-analysis preparation, grant writing, and systematic reviews

๐Ÿ‘จโ€๐Ÿซ Professors

Course material updates, research synthesis, and mentoring students

📚 Librarians

Systematic review consulting and research data management

Prerequisites

Before starting, ensure you have:

๐Ÿ“ Note on API Costs & Efficiency

ScholaRAG supports the latest AI coding models optimized for research automation:

  • Claude Sonnet 4.5 (Oct 2025): Currently the most effective coding model for research automation, achieving state-of-the-art performance on SWE-bench
  • Claude Haiku 4.5 (Oct 2025): Frontier performance at one-third the cost, 4-5x faster than Sonnet 3.5; excellent for high-volume screening tasks
  • GPT-5-Codex: Advanced code generation model with superior reasoning for complex research workflows

A typical literature review (500 papers screened, 150 included) costs under $20 with Haiku 4.5 or $25-40 with Sonnet 4.5. Compare this to weeks of manual labor!

Next Steps

Ready to start building? Head to Chapter 2: Getting Started to set up your environment and run your first ScholaRAG workflow.

Quick start preview:

# Clone the repository
git clone https://github.com/HosungYou/ScholaRAG.git
cd ScholaRAG

# Install dependencies
pip install -r requirements.txt

# Open in VS Code with Claude Code
code .

Further Reading: PRISMA Guidelines · Contextual Retrieval (Anthropic) · Templates & Examples