Core Concepts

Understand the key technologies and methodologies behind ScholaRAG: why we use PRISMA for systematic reviews, why RAG beats generic chatbots, and why these specific tools were chosen.

πŸ’‘ For Researchers

This chapter explains why ScholaRAG works this way, not how to code it. Technical implementation details are in the Codebook.

PRISMA: The Gold Standard for Systematic Reviews

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is an evidence-based framework for conducting transparent, reproducible literature reviews. Updated in 2020, it's the standard for academic systematic reviews and meta-analyses.

Why ScholaRAG Uses PRISMA

🚫 Generic RAG Systems

  • ❌ Dump random PDFs into vector DB
  • ❌ No quality control or screening
  • ❌ Can't defend why papers were included
  • ❌ Mix high-quality and low-quality sources
  • ❌ Not reproducible by other researchers
  • ❌ Can't publish findings

"I threw 500 random PDFs from Google Scholar into a database."

βœ… ScholaRAG with PRISMA

  • βœ“ Systematic database search with documented queries
  • βœ“ Clear inclusion/exclusion criteria
  • βœ“ AI-powered screening with transparent rubric
  • βœ“ Only high-quality, relevant papers included
  • βœ“ Fully reproducible methodology
  • βœ“ Publication-ready systematic review

"67 papers screened from 1,243 using PRISMA 2020 guidelines."

⚠️ Critical Understanding

PRISMA is NOT optionalβ€”it's what makes your RAG system academically valid. Stages 1-3 (PRISMA screening) happen BEFORE building your vector database (Stages 4-5).

PRISMA 2020 Flow

ScholaRAG automates the screening stages (C and D) using AI-PRISMA rubrics, saving weeks of manual work while maintaining academic rigor.

AI-PRISMA: Transparent Automated Screening

AI-PRISMA is ScholaRAG's approach to combining PRISMA 2020 systematic review methodology with AI automation. Unlike traditional human screening, where individual decisions are rarely documented, AI-PRISMA makes every decision transparent, traceable, and verifiable.

Human-AI Collaboration Model

AI-PRISMA follows a 3-zone hybrid workflow where AI and humans collaborate based on decision confidence and task type:

βœ… Zone 1: 100% AI Automation

Deduplication

  • β€’ Exact duplicate detection (DOI, arXiv ID)
  • β€’ Title similarity matching (β‰₯90%)
  • β€’ No human review needed
  • β€’ Deterministic, verifiable rules

⚠️ Zone 2: AI-Assisted

High-confidence screening

  • β€’ Score β‰₯90% or ≀10%: Auto-include/exclude
  • β€’ 10-20% random sample validation
  • β€’ Cohen's Kappa β‰₯ 0.61 required
  • β€’ AI provides transparent rationale

πŸ‘€ Zone 3: Human-Required

Borderline cases

  • β€’ Score 11-89%: Manual dual screening
  • β€’ AI provides dimension breakdown
  • β€’ Human makes final decision
  • β€’ Required for Systematic Review workflow
πŸ’‘ Project Type Determines Thresholds:
  • β€’ Systematic Review: 90/10 thresholds (strict) - Zone 3 human review required
  • β€’ Knowledge Repository: 50/20 thresholds (lenient) - Zone 3 optional, AI-only screening acceptable

Multi-Dimensional Scoring System

Unlike simple keyword matching, AI-PRISMA uses 6 weighted dimensions to score each paper. This provides transparency and prevents arbitrary decisions. Total score range: -20 to 50 points.

| Dimension | Points | Evaluates | Example Keywords |
|---|---|---|---|
| Domain | 0-10 | Core research area relevance | "language learning", "chatbot", "AI tutor" |
| Intervention | 0-10 | Specific treatment/tool | "conversational agent", "dialogue system", "feedback" |
| Method | 0-5 | Study design quality | "RCT", "quasi-experimental", "qualitative" |
| Outcomes | 0-10 | Measured results | "speaking fluency", "pronunciation", "motivation" |
| Exclusion | -20 to 0 | Hard exclusions (penalize) | "animal study", "K-12", "non-English" |
| Title Bonus | 0 or 10 | Direct title-query match | Title contains all query keywords |

🔍 Evidence Grounding: All dimension scores must be supported by direct quotes from the abstract. If the AI cannot find supporting evidence, the dimension receives 0 points. Hallucinated quotes result in a -20 confidence penalty.

πŸ“Š Example Scoring

Paper: "AI Chatbots for Speaking Practice in EFL Classrooms"

Domain: 10/10 (language learning)
Intervention: 10/10 (conversational AI)
Method: 4/5 (quasi-experimental)
Outcomes: 9/10 (speaking fluency)
Exclusion: 0 (no exclusion criteria)
Title Bonus: 10 (all keywords match)

Total: 43/50 (86% confidence) β†’ AUTO-INCLUDE (Zone 2)

Paper: "Using Mobile Apps for Pronunciation Feedback"

Domain: 7/10 (language, not chatbots)
Intervention: 5/10 (app, not conversational)
Method: 3/5 (descriptive study)
Outcomes: 8/10 (pronunciation)
Exclusion: 0 (no exclusion criteria)
Title Bonus: 0 (missing keywords)

Total: 23/50 (46% confidence) β†’ HUMAN REVIEW (Zone 3)

Paper: "Grammar Checkers in K-12 Writing Instruction"

Domain: 3/10 (writing, not speaking)
Intervention: 2/10 (grammar tool)
Method: 3/5 (descriptive study)
Outcomes: 0/10 (writing outcomes)
Exclusion: -15 (K-12 excluded)
Title Bonus: 0 (no keyword match)

Total: -7/50 (-14% confidence) β†’ AUTO-EXCLUDE (Zone 2)
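
To make the arithmetic concrete, here is a small sketch that reproduces the first example above by summing the six dimension scores and converting the total to a percent confidence (the dictionary layout is illustrative):

```python
def total_and_confidence(scores: dict) -> tuple[int, float]:
    """Sum the six dimension scores and express the total as a percent of the 50-point scale."""
    total = sum(scores.values())
    return total, 100 * total / 50

chatbot_paper = {
    "domain": 10, "intervention": 10, "method": 4,
    "outcomes": 9, "exclusion": 0, "title_bonus": 10,
}
print(total_and_confidence(chatbot_paper))  # (43, 86.0) -> routed to a zone by the project-type thresholds
```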

Transparency & Validation

AI-PRISMA generates detailed audit trails for every decision:

  • βœ“ Score breakdown: Which keywords matched, how many points per dimension
  • βœ“ AI rationale: Why the paper was included/excluded (generated by LLM)
  • βœ“ Confidence score: How certain is the AI (0-100%)
  • βœ“ Human override: Researchers can correct AI decisions, providing reasons
  • βœ“ Exportable reports: CSV with all scores, PRISMA flowchart with counts

πŸ”¬ Academic Validation Status

AI-PRISMA is currently under academic validation. The multi-dimensional scoring system and confidence thresholds require empirical validation for:

  • β€’ Inter-rater reliability (AI vs. human agreement rates)
  • β€’ Domain-specific weight optimization (education, medicine, etc.)
  • β€’ Threshold calibration (auto-include/exclude cutoffs)

Early adopters should manually validate a sample of AI decisions (a 10-20% random sample is recommended) and report findings to help refine the methodology.
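
A minimal sketch of that validation step, assuming you have exported AI and human include/exclude decisions as simple dictionaries and use scikit-learn's cohen_kappa_score (the data here is randomly stubbed for illustration):

```python
import random
from sklearn.metrics import cohen_kappa_score

# Hypothetical screening output: paper ID -> AI decision (stubbed with random data)
ai_decisions = {f"paper_{i}": random.choice(["include", "exclude"]) for i in range(200)}

# 1. Draw a ~15% random validation sample
sample_ids = random.sample(list(ai_decisions), k=int(0.15 * len(ai_decisions)))

# 2. A human screens the same sample independently (stubbed here)
human_decisions = {pid: random.choice(["include", "exclude"]) for pid in sample_ids}

# 3. Cohen's kappa >= 0.61 is conventionally read as "substantial agreement"
ai_labels = [ai_decisions[pid] for pid in sample_ids]
human_labels = [human_decisions[pid] for pid in sample_ids]
print(f"kappa on {len(sample_ids)} papers: {cohen_kappa_score(ai_labels, human_labels):.2f}")
```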

Project Type: Different Workflows

ScholaRAG supports two distinct project types with different workflows, thresholds, and validation requirements. Choose based on your research goals:

Systematic Review: Publication-Quality Rigor

For meta-analysis, dissertation chapters, and journal publications requiring PRISMA 2020 compliance.

βœ… Requirements (MANDATORY)
  • β€’ PICO-based 6-dimension scoring rubric
  • β€’ Human validation on 10-20% random sample
  • β€’ Cohen's Kappa β‰₯ 0.61 (substantial agreement)
  • β€’ PRISMA 2020 flow diagram with AI transparency
πŸ“Š Characteristics
  • Thresholds: 90/10 (strict auto-include/exclude)
  • Human review: Required for all 11-89% confidence papers
  • Final papers: 50-300 (highly selective)
  • Validation: Cohen's Kappa β‰₯ 0.61 on 10-20% sample
  • Output: Publication-ready systematic review + RAG chatbot
πŸ”„ Workflow Overview
Stage 1-2: Narrow, precise queries β†’ Target 500-2,000 papers
Stage 3: Strict PICO criteria β†’ Define inclusion/exclusion rules
Stage 5: AI screening (90/10 thresholds) β†’ 3-zone separation
β†’ Zone 2: Auto-include (β‰₯90% confidence)
β†’ Zone 2: Auto-exclude (≀10% confidence)
β†’ Zone 3: Human review (11-89% confidence) ⚠️
Stage 5b: Human validation β†’ Expert review of borderline cases
Stage 5c: Cohen's Kappa β†’ Calculate inter-rater reliability
Stage 6-7: RAG + PRISMA diagram β†’ Final 50-300 papers

⚠️ Important: This path requires significant manual effort (10-50 hours for human review). Only choose if you need publication-quality output.

💡 Decision Guide

Choose Systematic Review if:

  • βœ“ You plan to publish in academic journals (BMJ, Lancet, PLOS, etc.)
  • βœ“ You're writing a dissertation/thesis systematic review chapter
  • βœ“ You need meta-analysis or quantitative synthesis
  • βœ“ You require PRISMA 2020 compliance

Choose Knowledge Repository if:

  • βœ“ You're doing exploratory research or background reading
  • βœ“ You need comprehensive domain coverage (10,000+ papers)
  • βœ“ You want a RAG chatbot for quick literature queries
  • βœ“ You do NOT plan to publish a systematic review paper

Configuration: Project type is set in Stage 1 (Research Domain Setup) and cannot be changed after Stage 3. The system auto-adjusts all screening behavior, thresholds, and validation requirements based on your choice. See Stage 3 tutorial for detailed PRISMA configuration.

Database Strategy: Open Access + Institutional

ScholaRAG supports two types of academic databases: open-access APIs (with PDFs) and institutional subscription APIs (metadata only).

Open-Access Databases (Primary)

These databases provide direct PDF access through their APIs, enabling full automation without institutional subscriptions.

Semantic Scholar

AI-powered academic search

Coverage: 200M+ papers across all fields

Open Access: ~40% have PDF URLs

API: Free, no authentication required

Best for: Broad interdisciplinary searches

OpenAlex

Open catalog of scholarly papers

Coverage: 240M+ works

Open Access: ~50% with OA URLs

API: Free, polite pool available

Best for: Comprehensive coverage

arXiv

Preprint repository

Coverage: 2.4M+ preprints

Open Access: 100% free PDFs

API: Free XML API

Best for: CS, physics, math, stats

βœ… Combined Strategy

ScholaRAG queries all three and deduplicates by DOI/title. This achieves:

  • βœ“ ~50-60% overall PDF retrieval success
  • βœ“ Maximum coverage across domains
  • βœ“ Fallback when one source is incomplete
  • βœ“ No institutional subscriptions required

Institutional Databases (Optional)

If your institution has subscriptions to Scopus, Web of Science, or PubMed, ScholaRAG can fetch metadata only through their APIs. PDFs must be downloaded separately.

Scopus

Elsevier's abstract & citation database

Coverage: 84M+ records, all fields

API Access: Requires institutional API key + Inst Token

Data Available: Title, abstract, DOI, authors, citations

PDFs: ❌ Not available via API (metadata only)

Web of Science

Clarivate's research database

Coverage: 171M+ records, curated journals

API Access: Requires institutional API key

Data Available: Title, abstract, DOI, authors, WoS ID

PDFs: ❌ Not available via API (metadata only)

PubMed

NCBI's biomedical database

Coverage: 36M+ biomedical literature

API Access: Free (E-utilities API), no key required

Data Available: Title, abstract, PMID, authors, MeSH terms

PDFs: ⚠️ Some via PubMed Central (PMC), most metadata-only
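
Because the E-utilities API is free, PubMed metadata can be fetched with a few lines. A minimal sketch using NCBI's public esearch/esummary endpoints (parameters should be double-checked against the E-utilities documentation):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def pubmed_search(term: str, retmax: int = 20) -> list[str]:
    """Return PubMed IDs (PMIDs) matching a query."""
    r = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

def pubmed_summaries(pmids: list[str]) -> list[dict]:
    """Fetch title metadata for a list of PMIDs."""
    r = requests.get(
        f"{EUTILS}/esummary.fcgi",
        params={"db": "pubmed", "id": ",".join(pmids), "retmode": "json"},
        timeout=30,
    )
    r.raise_for_status()
    result = r.json()["result"]
    return [{"pmid": pid, "title": result[pid]["title"]} for pid in pmids]

pmids = pubmed_search("conversational agent language learning", retmax=5)
for record in pubmed_summaries(pmids):
    print(record["pmid"], record["title"])
```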

⚠️ Important: Metadata-Only Limitation

Institutional APIs provide bibliographic metadata (title, abstract, DOI) but NOT PDF files. You must:

  1. Fetch metadata via API (automated)
  2. Download PDFs manually via your institution's library portal (or use DOI links)
  3. Match filenames to DOIs using ScholaRAG's PDF matcher

Why metadata-only? Publisher licensing restrictions prevent API-based PDF distribution. Even with institutional access, PDFs must be accessed through authenticated library gateways (e.g., EZProxy, Shibboleth).
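
Step 3 can be as simple as encoding each DOI into a filesystem-safe filename when you save the PDFs and then looking files up by that name. This is an illustrative approach, not ScholaRAG's actual PDF matcher:

```python
from pathlib import Path

def doi_to_filename(doi: str) -> str:
    """Encode a DOI as a filesystem-safe PDF filename (slashes are not allowed in names)."""
    return doi.lower().replace("/", "_") + ".pdf"

def match_pdfs(dois: list[str], pdf_dir: str) -> dict[str, Path | None]:
    """Map each DOI to its downloaded PDF, or None if it is still missing."""
    files = {p.name: p for p in Path(pdf_dir).glob("*.pdf")}
    return {doi: files.get(doi_to_filename(doi)) for doi in dois}

matches = match_pdfs(["10.1000/xyz123", "10.1234/abc456"], "downloads/")
missing = [doi for doi, path in matches.items() if path is None]
print(f"{len(missing)} PDFs still to download: {missing}")
```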

When to Use Institutional Databases

βœ… Good Use Cases

  • High-quality metadata: Need accurate citation counts, journal rankings, or curated indexes
  • Complementary search: Combine with open-access APIs to maximize coverage
  • Domain-specific: PubMed for medicine, Scopus for engineering
  • Publication-ready: Scopus/WoS required for some journal submissions

❌ Not Ideal For

  • Full automation: Manual PDF download breaks workflow
  • Large-scale projects: Downloading 1,000+ PDFs manually is impractical
  • No institutional access: API keys require institutional subscription
  • PDF-only needs: If you only need full text, stick to open-access APIs

Setup Instructions (Brief)

To enable institutional databases in ScholaRAG:

1. Obtain API Keys

  • β€’ Scopus: Request from your library β†’ Get API Key + Inst Token
  • β€’ Web of Science: Contact Clarivate rep β†’ Get API Key
  • β€’ PubMed: Optional (no key required, but recommended for higher rate limits)

2. Add to .env file

SCOPUS_API_KEY=your_scopus_key_here
SCOPUS_INST_TOKEN=your_institution_token
WOS_API_KEY=your_wos_key_here
PUBMED_API_KEY=your_pubmed_key  # Optional

3. Enable in config.yaml

databases:
  open_access:
    semantic_scholar: true
    openalex: true
    arxiv: true

  institutional:  # NEW: Enable institutional APIs
    scopus:
      enabled: true
    web_of_science:
      enabled: true
    pubmed:
      enabled: false  # Only if needed

Full guide: See docs/INSTITUTIONAL_APIS.md in the ScholaRAG repository for detailed setup, query syntax, and troubleshooting.

Recommended Hybrid Workflow

🎯 Best Practice: Open Access First

  1. Fetch from Semantic Scholar + OpenAlex + arXiv (get ~50-60% PDFs automatically)
  2. Run PRISMA screening on available metadata
  3. Identify high-priority papers missing PDFs
  4. Query Scopus/WoS for those specific DOIs (fill metadata gaps)
  5. Manually download remaining PDFs via library portal (batch ~50-200 papers, not 10,000)

This minimizes manual work while maximizing coverage. Most papers (80%+) come from open-access APIs with PDFs included.
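
Step 4 of this workflow can be scripted against the Elsevier Scopus Search API once your library issues a key. The endpoint, headers, and response keys below follow Elsevier's published conventions, but treat them as assumptions to verify against your institution's API documentation:

```python
import requests

def scopus_lookup_doi(doi: str, api_key: str, inst_token: str | None = None) -> dict | None:
    """Fetch Scopus metadata for a single DOI (metadata only; no PDF)."""
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    if inst_token:
        headers["X-ELS-Insttoken"] = inst_token
    r = requests.get(
        "https://api.elsevier.com/content/search/scopus",
        params={"query": f"DOI({doi})"},
        headers=headers,
        timeout=30,
    )
    r.raise_for_status()
    entries = r.json().get("search-results", {}).get("entry", [])
    return entries[0] if entries else None

# Usage: fill metadata gaps for papers whose PDFs are still missing
# record = scopus_lookup_doi("10.1000/example-doi", api_key="...", inst_token="...")
```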

RAG vs. Generic Chatbots: Why RAG?

You might wonder: "Why not just ask ChatGPT or Claude about my research topic?" Here's why RAG is essential for academic work:

❌ Direct ChatGPT/Claude

Training Data Cutoff

Doesn't know papers published after training

Hallucinations

Can invent citations that don't exist

No Verification

Can't check if claims are accurate

Generic Knowledge

Doesn't focus on your specific corpus

βœ… ScholaRAG System

Current & Complete

Searches up-to-date databases (2025)

Grounded Answers

Every claim backed by actual paper in your DB

Verifiable Citations

Includes paper titles, authors, page numbers

Your Curated Knowledge

Only searches PRISMA-screened papers

How RAG Works

Key insight: The LLM (Claude/GPT) doesn't "remember" papersβ€”it only sees the 5-10 most relevant chunks you give it. This prevents hallucinations and ensures citations are real.
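
A minimal sketch of that retrieve-then-answer loop with ChromaDB (the collection name, metadata fields, and prompt wording are illustrative, and the actual LLM call is left as a comment):

```python
import chromadb

client = chromadb.PersistentClient(path="./scholarag_db")  # local folder; path is illustrative
collection = client.get_or_create_collection("papers")

question = "What are the benefits of conversational AI for pronunciation?"

# 1. Retrieve only the most relevant chunks -- the LLM never sees the full corpus
results = collection.query(query_texts=[question], n_results=5)

# 2. Pair every chunk with its citation metadata so claims stay verifiable
context_blocks = []
for chunk, meta in zip(results["documents"][0], results["metadatas"][0]):
    citation = f"{meta['authors']} ({meta['year']}), {meta['title']}"
    context_blocks.append(f"[{citation}]\n{chunk}")

prompt = (
    "Answer using ONLY the excerpts below and cite a paper for every claim.\n\n"
    + "\n\n".join(context_blocks)
    + f"\n\nQuestion: {question}"
)

# 3. Send `prompt` to Claude or GPT; the answer can only cite papers in your database
print(prompt)
```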

Why Vector Databases?

Traditional databases use exact keyword matching. Vector databases enable semantic searchβ€”finding papers by meaning, not just keywords.

Example: Semantic Search

Your Question:

"What are the benefits of conversational AI for pronunciation?"

Papers Found (semantically similar):

  • βœ“ "Effects of chatbot interaction on L2 speaking fluency"
  • βœ“ "Dialogue systems for accent reduction in ESL learners"
  • βœ“ "AI-powered feedback on oral proficiency"

Note: None use exact words "conversational AI" or "pronunciation"

Why ChromaDB for ScholaRAG?

ChromaDB (Recommended)

  • βœ“ Zero configuration setup
  • βœ“ Works locally (no cloud required)
  • βœ“ Handles 50-500 papers easily
  • βœ“ Python-native integration
  • βœ“ Open-source and free

Alternatives

FAISS: For 10,000+ papers (complex setup)

Qdrant: For cloud deployment (requires server)

pgvector: For PostgreSQL users (complex)

πŸ“ ScholaRAG Default

Start with ChromaDB. It's what Claude Code sets up automatically. You can migrate to FAISS/Qdrant later if you scale to thousands of papers.
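
Getting started takes only a few lines. A sketch of ingesting a screened paper chunk with citation metadata (the IDs, metadata fields, and chunking are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./scholarag_db")  # stored locally, no server required
collection = client.get_or_create_collection("papers")

# Each chunk gets a unique ID plus the citation metadata used for grounding later
collection.add(
    ids=["smith2023_chunk_001"],
    documents=["Learners who practiced with the chatbot improved speaking fluency scores by ..."],
    metadatas=[{
        "title": "AI Chatbots for Speaking Practice in EFL Classrooms",
        "authors": "Smith et al.", "year": 2023, "doi": "10.1000/example",
    }],
)

hits = collection.query(query_texts=["effects of chatbots on oral fluency"], n_results=3)
print(hits["documents"][0])
```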

What Are Embeddings? (Simplified)

Embeddings convert text into numbers (vectors) that capture meaning. Similar concepts get similar numbers, enabling semantic search.

Intuitive Example

"Machine Learning" β†’ [0.23, -0.45, 0.12, ...] (768 numbers)

"Artificial Intelligence" β†’ [0.21, -0.43, 0.15, ...] (close to above)

"Pizza Recipe" β†’ [-0.67, 0.82, -0.34, ...] (far from above)

The database calculates distance between vectors to find similar papers.
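
A small sketch of this idea using the OpenAI embeddings endpoint and cosine similarity (the model name matches the ScholaRAG default described below; an OPENAI_API_KEY must be set in your environment):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = ["Machine Learning", "Artificial Intelligence", "Pizza Recipe"]
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [np.array(d.embedding) for d in resp.data]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 = related meaning, near 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # high: related concepts
print(cosine(vectors[0], vectors[2]))  # low: unrelated
```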

Which Embedding Model?

| Model | Cost | Quality | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 / 1M tokens | ⭐⭐⭐⭐ | Most users (default) |
| sentence-transformers | Free (local) | ⭐⭐⭐ | Budget-conscious |
| Voyage AI | $0.10 / 1M tokens | ⭐⭐⭐⭐⭐ | Highest accuracy needed |

πŸ“ ScholaRAG Default

Uses OpenAI text-embedding-3-small. For a typical project (100 papers), embedding costs ~$0.50 total. High quality, low cost.

Putting It All Together

STEP 1-3

PRISMA Screening

Ensures only high-quality, relevant papers

STEP 4-5

Vector Database

Enables semantic search across papers

STEP 6-7

RAG Queries

Grounded answers with real citations

What's Next?

Now that you understand why ScholaRAG uses these technologies, you're ready to build your own system. The Complete Tutorial walks you through all 7 stages with a real example.

Learn More

PRISMA 2020 Official Guidelines β€” Comprehensive guide to systematic reviews

RAG Paper (Lewis et al., 2020) β€” Original research on Retrieval-Augmented Generation

ChromaDB Documentation β€” Learn more about vector databases

Contextual Retrieval (Anthropic) β€” Advanced RAG techniques