Core Concepts
Understand the key technologies and methodologies behind ScholaRAG: why we use PRISMA for systematic reviews, why RAG beats generic chatbots, and why these specific tools were chosen.
💡 For Researchers
This chapter explains why ScholaRAG works this way, not how to code it. Technical implementation details are in the Codebook.
PRISMA: The Gold Standard for Systematic Reviews
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is an evidence-based framework for conducting transparent, reproducible literature reviews. Updated in 2020, it's the standard for academic systematic reviews and meta-analyses.
Why ScholaRAG Uses PRISMA
🚫 Generic RAG Systems
- ❌ Dump random PDFs into vector DB
- ❌ No quality control or screening
- ❌ Can't defend why papers were included
- ❌ Mix high-quality and low-quality sources
- ❌ Not reproducible by other researchers
- ❌ Can't publish findings
"I threw 500 random PDFs from Google Scholar into a database."
✅ ScholaRAG with PRISMA
- ✅ Systematic database search with documented queries
- ✅ Clear inclusion/exclusion criteria
- ✅ AI-powered screening with transparent rubric
- ✅ Only high-quality, relevant papers included
- ✅ Fully reproducible methodology
- ✅ Publication-ready systematic review
"67 papers screened from 1,243 using PRISMA 2020 guidelines."
⚠️ Critical Understanding
PRISMA is NOT optional: it's what makes your RAG system academically valid. Stages 1-3 (PRISMA screening) happen BEFORE building your vector database (Stages 4-5).
PRISMA 2020 Flow
ScholaRAG automates the screening stages (C and D) using AI-PRISMA rubrics, saving weeks of manual work while maintaining academic rigor.
AI-PRISMA: Transparent Automated Screening
AI-PRISMA is ScholaRAG's approach to combining PRISMA 2020 systematic review methodology with AI automation. Unlike traditional screening, where decisions live in a reviewer's head as a "black box", AI-PRISMA makes every decision transparent, traceable, and verifiable.
Human-AI Collaboration Model
AI-PRISMA follows a 3-zone hybrid workflow where AI and humans collaborate based on decision confidence and task type:
✅ Zone 1: 100% AI Automation
Deduplication
- Exact duplicate detection (DOI, arXiv ID)
- Title similarity matching (≥90%)
- No human review needed
- Deterministic, verifiable rules (sketched below)
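A minimal sketch of what these deterministic Zone 1 rules might look like (the record schema and helper name are illustrative, not ScholaRAG's actual code):

```python
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, threshold: float = 0.90) -> bool:
    """Zone 1 sketch: exact-ID match first, then title similarity (>= 90%)."""
    # Exact duplicate detection: identical DOI or arXiv ID.
    for key in ("doi", "arxiv_id"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    # Title similarity matching: normalized ratio of the two titles.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold
```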
⚠️ Zone 2: AI-Assisted
High-confidence screening
- Score ≥90% or ≤10%: auto-include/exclude
- 10-20% random sample validation
- Cohen's Kappa ≥ 0.61 required
- AI provides transparent rationale
👤 Zone 3: Human-Required
Borderline cases
- Score 11-89%: manual dual screening
- AI provides dimension breakdown
- Human makes final decision
- Required for Systematic Review workflow
Thresholds differ by project type (see the sketch below):
- Systematic Review: 90/10 thresholds (strict); Zone 3 human review required
- Knowledge Repository: 50/20 thresholds (lenient); Zone 3 optional, AI-only screening acceptable
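Putting the zones and the two threshold pairs together, a hypothetical triage function might look like this (function and key names are illustrative):

```python
# Auto-include / auto-exclude confidence cutoffs per project type (from above).
THRESHOLDS = {
    "systematic_review":    (90, 10),  # strict
    "knowledge_repository": (50, 20),  # lenient
}

def triage(confidence: float, project_type: str) -> str:
    """Route a paper to a zone based on its confidence score (0-100%)."""
    include_at, exclude_at = THRESHOLDS[project_type]
    if confidence >= include_at:
        return "auto_include"   # Zone 2: AI decides, sample-validated later
    if confidence <= exclude_at:
        return "auto_exclude"   # Zone 2
    return "human_review"       # Zone 3: borderline, manual dual screening

print(triage(95, "systematic_review"))  # auto_include
print(triage(46, "systematic_review"))  # human_review
```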
Multi-Dimensional Scoring System
Unlike simple keyword matching, AI-PRISMA uses 6 weighted dimensions to score each paper. This provides transparency and prevents arbitrary decisions. Total score range: -20 to 50 points.
| Dimension | Points | Evaluates | Example Keywords |
|---|---|---|---|
| Domain | 0-10 | Core research area relevance | "language learning", "chatbot", "AI tutor" |
| Intervention | 0-10 | Specific treatment/tool | "conversational agent", "dialogue system", "feedback" |
| Method | 0-5 | Study design quality | "RCT", "quasi-experimental", "qualitative" |
| Outcomes | 0-10 | Measured results | "speaking fluency", "pronunciation", "motivation" |
| Exclusion | -20 to 0 | Hard exclusions (penalize) | "animal study", "K-12", "non-English" |
| Title Bonus | 0 or 10 | Direct title-query match | Title contains all query keywords |
📊 Example Scoring
Paper: "AI Chatbots for Speaking Practice in EFL Classrooms"
Total: 43/50 (86% confidence) → AUTO-INCLUDE (Zone 2)
Paper: "Using Mobile Apps for Pronunciation Feedback"
Total: 23/50 (46% confidence) → HUMAN REVIEW (Zone 3)
Paper: "Grammar Checkers in K-12 Writing Instruction"
Total: -7/50 (-14% confidence) → AUTO-EXCLUDE (Zone 2)
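The worked examples above can be reproduced with a simple keyword rubric. Below is an illustrative sketch, assuming keyword lists like those in the table; the real rubric's per-keyword weighting and exclusion handling may differ:

```python
# Illustrative rubric: dimension -> (max points, example keywords from the table).
RUBRIC = {
    "domain":       (10, ["language learning", "chatbot", "ai tutor"]),
    "intervention": (10, ["conversational agent", "dialogue system", "feedback"]),
    "method":       (5,  ["rct", "quasi-experimental", "qualitative"]),
    "outcomes":     (10, ["speaking fluency", "pronunciation", "motivation"]),
}
EXCLUSION_TERMS = ["animal study", "k-12", "non-english"]

def score_paper(title: str, abstract: str, query_terms: list[str]) -> tuple[int, float]:
    text = f"{title} {abstract}".lower()
    total = 0
    for _, (max_pts, keywords) in RUBRIC.items():
        hits = sum(kw in text for kw in keywords)
        # Proportional fill per dimension; real weights may be per-keyword.
        total += min(max_pts, round(max_pts * hits / len(keywords)))
    if any(term in text for term in EXCLUSION_TERMS):
        total -= 20                      # hard exclusion penalty (-20 to 0)
    if all(t.lower() in title.lower() for t in query_terms):
        total += 10                      # title bonus: all query keywords in title
    confidence = 100 * total / 50        # reported as a percentage of 50
    return total, confidence
```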
Transparency & Validation
AI-PRISMA generates detailed audit trails for every decision:
- ✅ Score breakdown: Which keywords matched, how many points per dimension
- ✅ AI rationale: Why the paper was included/excluded (generated by LLM)
- ✅ Confidence score: How certain the AI is (0-100%)
- ✅ Human override: Researchers can correct AI decisions, providing reasons
- ✅ Exportable reports: CSV with all scores, PRISMA flowchart with counts
🔬 Academic Validation Status
AI-PRISMA is currently under academic validation. The multi-dimensional scoring system and confidence thresholds still require empirical validation for:
- Inter-rater reliability (AI vs. human agreement rates)
- Domain-specific weight optimization (education, medicine, etc.)
- Threshold calibration (auto-include/exclude cutoffs)
Early adopters should manually validate a sample of AI decisions (a 10-20% random sample is recommended) and report findings to help refine the methodology.
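As a starting point for that validation, here is a minimal sketch using scikit-learn's cohen_kappa_score, assuming you have collected human include/exclude labels for the sampled papers (all names are illustrative):

```python
import random
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

def validate_sample(ai_decisions: dict[str, str], human_labels: dict[str, str],
                    sample_frac: float = 0.15, seed: int = 42) -> float:
    """Cohen's Kappa between AI and human decisions on a random sample."""
    ids = sorted(ai_decisions)
    random.Random(seed).shuffle(ids)                 # reproducible sample
    sample = ids[: max(1, int(len(ids) * sample_frac))]
    ai = [ai_decisions[i] for i in sample]
    human = [human_labels[i] for i in sample]        # labels gathered for these IDs
    kappa = cohen_kappa_score(ai, human)
    if kappa < 0.61:
        print(f"Warning: kappa = {kappa:.2f} is below substantial agreement (0.61)")
    return kappa
```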
Project Type: Different Workflows
ScholaRAG supports two distinct project types with different workflows, thresholds, and validation requirements. Choose based on your research goals:
Systematic Review: Publication-Quality Rigor
For meta-analysis, dissertation chapters, and journal publications requiring PRISMA 2020 compliance.
✅ Requirements (MANDATORY)
- PICO-based 6-dimension scoring rubric
- Human validation on 10-20% random sample
- Cohen's Kappa ≥ 0.61 (substantial agreement)
- PRISMA 2020 flow diagram with AI transparency
📋 Characteristics
- Thresholds: 90/10 (strict auto-include/exclude)
- Human review: Required for all 11-89% confidence papers
- Final papers: 50-300 (highly selective)
- Validation: Cohen's Kappa ≥ 0.61 on 10-20% sample
- Output: Publication-ready systematic review + RAG chatbot
📊 Workflow Overview
💡 Decision Guide
Choose Systematic Review if:
- ✅ You plan to publish in academic journals (BMJ, Lancet, PLOS, etc.)
- ✅ You're writing a dissertation/thesis systematic review chapter
- ✅ You need meta-analysis or quantitative synthesis
- ✅ You require PRISMA 2020 compliance
Choose Knowledge Repository if:
- ✅ You're doing exploratory research or background reading
- ✅ You need comprehensive domain coverage (10,000+ papers)
- ✅ You want a RAG chatbot for quick literature queries
- ✅ You do NOT plan to publish a systematic review paper
Configuration: Project type is set in Stage 1 (Research Domain Setup) and cannot be changed after Stage 3. The system auto-adjusts all screening behavior, thresholds, and validation requirements based on your choice. See Stage 3 tutorial for detailed PRISMA configuration.
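To make the contrast concrete, the two project types effectively pin down a small bundle of screening parameters, roughly like this (a hypothetical summary, not ScholaRAG's actual configuration schema):

```python
PROJECT_TYPES = {
    "systematic_review": {
        "auto_include_at": 90, "auto_exclude_at": 10,   # strict thresholds
        "zone3_human_review": "required",
        "validation": "Cohen's Kappa >= 0.61 on a 10-20% sample",
        "expected_corpus": "50-300 papers",
    },
    "knowledge_repository": {
        "auto_include_at": 50, "auto_exclude_at": 20,   # lenient thresholds
        "zone3_human_review": "optional",
        "validation": "spot checks recommended",
        "expected_corpus": "up to 10,000+ papers",
    },
}
```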
Database Strategy: Open Access + Institutional
ScholaRAG supports two types of academic databases: open-access APIs (with PDFs) and institutional subscription APIs (metadata only).
Open-Access Databases (Primary)
These databases provide direct PDF access through their APIs, enabling full automation without institutional subscriptions.
Semantic Scholar
AI-powered academic search
Coverage: 200M+ papers across all fields
Open Access: ~40% have PDF URLs
API: Free, no authentication required
Best for: Broad interdisciplinary searches
OpenAlex
Open catalog of scholarly papers
Coverage: 240M+ works
Open Access: ~50% with OA URLs
API: Free, polite pool available
Best for: Comprehensive coverage
arXiv
Preprint repository
Coverage: 2.4M+ preprints
Open Access: 100% free PDFs
API: Free XML API
Best for: CS, physics, math, stats
✅ Combined Strategy
ScholaRAG queries all three and deduplicates by DOI/title (a fetch-and-dedupe sketch follows this list). This achieves:
- ✅ ~50-60% overall PDF retrieval success
- ✅ Maximum coverage across domains
- ✅ Fallback when one source is incomplete
- ✅ No institutional subscriptions required
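A condensed sketch of this strategy, using the public Semantic Scholar and OpenAlex endpoints (the field selections, mailto address, and normalized record schema are assumptions; arXiv's XML API is omitted for brevity):

```python
import requests

def search_semantic_scholar(query: str, limit: int = 100) -> list[dict]:
    """Free Graph API; openAccessPdf holds a PDF URL when one exists."""
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit,
                "fields": "title,externalIds,openAccessPdf"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json().get("data", [])

def search_openalex(query: str, per_page: int = 100) -> list[dict]:
    """Adding a mailto parameter places requests in OpenAlex's polite pool."""
    r = requests.get(
        "https://api.openalex.org/works",
        params={"search": query, "per-page": per_page,
                "mailto": "you@example.org"},  # use your own address
        timeout=30,
    )
    r.raise_for_status()
    return r.json().get("results", [])

def dedupe_by_doi(records: list[dict]) -> list[dict]:
    """First pass: drop exact DOI duplicates (a title-similarity pass follows).
    Assumes records were normalized to a common schema with a 'doi' key."""
    seen, unique = set(), []
    for rec in records:
        doi = (rec.get("doi") or "").lower()
        if doi and doi in seen:
            continue
        seen.add(doi)
        unique.append(rec)
    return unique
```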
Institutional Databases (Optional)
If your institution has subscriptions to Scopus, Web of Science, or PubMed, ScholaRAG can fetch metadata only through their APIs. PDFs must be downloaded separately.
Scopus
Elsevier's abstract & citation database
Coverage: 84M+ records, all fields
API Access: Requires institutional API key + Inst Token
Data Available: Title, abstract, DOI, authors, citations
PDFs: ❌ Not available via API (metadata only)
Web of Science
Clarivate's research database
Coverage: 171M+ records, curated journals
API Access: Requires institutional API key
Data Available: Title, abstract, DOI, authors, WoS ID
PDFs: ❌ Not available via API (metadata only)
PubMed
NCBI's biomedical database
Coverage: 36M+ biomedical literature
API Access: Free (E-utilities API), no key required
Data Available: Title, abstract, PMID, authors, MeSH terms
PDFs: ⚠️ Some via PubMed Central (PMC), most metadata-only
⚠️ Important: Metadata-Only Limitation
Institutional APIs provide bibliographic metadata (title, abstract, DOI) but NOT PDF files. You must:
1. Fetch metadata via API (automated)
2. Download PDFs manually via your institution's library portal (or use DOI links)
3. Match filenames to DOIs using ScholaRAG's PDF matcher
Why metadata-only? Publisher licensing restrictions prevent API-based PDF distribution. Even with institutional access, PDFs must be accessed through authenticated library gateways (e.g., EZProxy, Shibboleth).
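The matching step can be as simple as slugifying each DOI into a safe filename and checking the download folder. A toy illustration (not ScholaRAG's actual PDF matcher):

```python
import re
from pathlib import Path

def doi_to_filename(doi: str) -> str:
    """Make a DOI filesystem-safe: '10.1000/j.x' -> '10.1000_j.x.pdf'."""
    return re.sub(r"[^A-Za-z0-9.\-]", "_", doi) + ".pdf"

def match_pdfs_to_dois(pdf_dir: str, dois: list[str]) -> dict[str, Path | None]:
    """Map each DOI to its downloaded PDF, or None if still missing."""
    files = {p.name: p for p in Path(pdf_dir).glob("*.pdf")}
    return {doi: files.get(doi_to_filename(doi)) for doi in dois}
```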
When to Use Institutional Databases
✅ Good Use Cases
- High-quality metadata: Need accurate citation counts, journal rankings, or curated indexes
- Complementary search: Combine with open-access APIs to maximize coverage
- Domain-specific: PubMed for medicine, Scopus for engineering
- Publication-ready: Scopus/WoS required for some journal submissions
❌ Not Ideal For
- Full automation: Manual PDF download breaks workflow
- Large-scale projects: Downloading 1,000+ PDFs manually is impractical
- No institutional access: API keys require institutional subscription
- PDF-only needs: If you only need full text, stick to open-access APIs
Setup Instructions (Brief)
To enable institutional databases in ScholaRAG:
1. Obtain API Keys
- Scopus: Request from your library → get API Key + Inst Token
- Web of Science: Contact your Clarivate rep → get API Key
- PubMed: Optional (no key required; an NCBI API key raises rate limits)
2. Add to .env file
```bash
SCOPUS_API_KEY=your_scopus_key_here
SCOPUS_INST_TOKEN=your_institution_token
WOS_API_KEY=your_wos_key_here
PUBMED_API_KEY=your_pubmed_key  # Optional
```
3. Enable in config.yaml
```yaml
databases:
  open_access:
    semantic_scholar: true
    openalex: true
    arxiv: true
  institutional:  # NEW: enable institutional APIs
    scopus:
      enabled: true
    web_of_science:
      enabled: true
    pubmed:
      enabled: false  # only if needed
```
Full guide: See docs/INSTITUTIONAL_APIS.md in the ScholaRAG repository for detailed setup, query syntax, and troubleshooting.
Recommended Hybrid Workflow
🎯 Best Practice: Open Access First
- Step 1: Fetch from Semantic Scholar + OpenAlex + arXiv (get ~50-60% of PDFs automatically)
- Step 2: Run PRISMA screening on available metadata
- Step 3: Identify high-priority papers missing PDFs
- Step 4: Query Scopus/WoS for those specific DOIs (fill metadata gaps)
- Step 5: Manually download remaining PDFs via library portal (batch ~50-200 papers, not 10,000)
This minimizes manual work while maximizing coverage. Most papers (80%+) come from open-access APIs with PDFs included.
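Step 3 of this workflow reduces to a simple filter over your screened records. A sketch, assuming each record carries included, pdf_path, and doi fields (hypothetical names):

```python
def missing_pdf_report(records: list[dict]) -> list[str]:
    """DOIs of PRISMA-included papers that still lack a PDF."""
    return [
        rec["doi"]
        for rec in records
        if rec.get("included") and not rec.get("pdf_path") and rec.get("doi")
    ]

# Write the gap list for a batch download session at the library portal:
# Path("missing_pdfs.txt").write_text("\n".join(missing_pdf_report(screened_records)))
```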
RAG vs. Generic Chatbots: Why RAG?
You might wonder: "Why not just ask ChatGPT or Claude about my research topic?" Here's why RAG is essential for academic work:
❌ Direct ChatGPT/Claude
Training Data Cutoff
Doesn't know papers published after training
Hallucinations
Can invent citations that don't exist
No Verification
Can't check if claims are accurate
Generic Knowledge
Doesn't focus on your specific corpus
✅ ScholaRAG System
Current & Complete
Searches up-to-date databases (2025)
Grounded Answers
Every claim backed by actual paper in your DB
Verifiable Citations
Includes paper titles, authors, page numbers
Your Curated Knowledge
Only searches PRISMA-screened papers
How RAG Works
Key insight: The LLM (Claude/GPT) doesn't "remember" papers; it only sees the 5-10 most relevant chunks you give it. This prevents hallucinations and ensures citations are real.
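A minimal sketch of that retrieval step with ChromaDB (the collection name and metadata keys are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./scholarag_db")    # local, no cloud needed
papers = client.get_or_create_collection("prisma_included")  # hypothetical name

def retrieve_context(question: str, k: int = 5) -> str:
    """Fetch the k most relevant chunks; only these reach the LLM prompt."""
    hits = papers.query(query_texts=[question], n_results=k)
    chunks, metas = hits["documents"][0], hits["metadatas"][0]
    # Attach citation metadata so every claim traces back to a real paper.
    return "\n\n".join(
        f"[{m.get('title', 'unknown')}, p.{m.get('page', '?')}]\n{c}"
        for c, m in zip(chunks, metas)
    )
```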
Why Vector Databases?
Traditional databases use exact keyword matching. Vector databases enable semantic search: finding papers by meaning, not just keywords.
Example: Semantic Search
Your Question:
"What are the benefits of conversational AI for pronunciation?"
Papers Found (semantically similar):
- β "Effects of chatbot interaction on L2 speaking fluency"
- β "Dialogue systems for accent reduction in ESL learners"
- β "AI-powered feedback on oral proficiency"
Note: None use exact words "conversational AI" or "pronunciation"
Why ChromaDB for ScholaRAG?
ChromaDB (Recommended)
- ✅ Zero-configuration setup
- ✅ Works locally (no cloud required)
- ✅ Handles 50-500 papers easily
- ✅ Python-native integration
- ✅ Open-source and free
Alternatives
FAISS: For 10,000+ papers (complex setup)
Qdrant: For cloud deployment (requires server)
pgvector: For PostgreSQL users (complex)
📌 ScholaRAG Default
Start with ChromaDB. It's what Claude Code sets up automatically. You can migrate to FAISS/Qdrant later if you scale to thousands of papers.
What Are Embeddings? (Simplified)
Embeddings convert text into numbers (vectors) that capture meaning. Similar concepts get similar numbers, enabling semantic search.
Intuitive Example
"Machine Learning" β [0.23, -0.45, 0.12, ...] (768 numbers)
"Artificial Intelligence" β [0.21, -0.43, 0.15, ...] (close to above)
"Pizza Recipe" β [-0.67, 0.82, -0.34, ...] (far from above)
The database calculates distance between vectors to find similar papers.
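That "distance" is typically cosine similarity. A toy demonstration, using 3-dimensional stand-ins for real 768-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """~1.0 = same direction (similar meaning); near 0 or negative = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ml  = np.array([0.23, -0.45, 0.12])
ai  = np.array([0.21, -0.43, 0.15])
pie = np.array([-0.67, 0.82, -0.34])

print(cosine_similarity(ml, ai))   # ~0.99: "Machine Learning" ~ "Artificial Intelligence"
print(cosine_similarity(ml, pie))  # negative: "Pizza Recipe" points the other way
```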
Which Embedding Model?
| Model | Cost | Quality | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 / 1M tokens | ★★★★ | Most users (default) |
| sentence-transformers | Free (local) | ★★★ | Budget-conscious |
| Voyage AI | $0.10 / 1M tokens | ★★★★★ | Highest accuracy needed |
📌 ScholaRAG Default
Uses OpenAI text-embedding-3-small. For a typical project (100 papers), embedding costs ~$0.50 total. High quality, low cost.
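For reference, a single embedding call with the current OpenAI Python SDK looks roughly like this (it reads OPENAI_API_KEY from the environment; the input string is just an example):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Effects of chatbot interaction on L2 speaking fluency"],
)
vector = resp.data[0].embedding
print(len(vector))  # 1536 dimensions for text-embedding-3-small
```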
Putting It All Together
Stages 1-3: PRISMA Screening
Ensures only high-quality, relevant papers
Stages 4-5: Vector Database
Enables semantic search across papers
Stages 6-7: RAG Queries
Grounded answers with real citations
What's Next?
Now that you understand why ScholaRAG uses these technologies, you're ready to build your own system. The Complete Tutorial walks you through all 7 stages with a real example.
Learn More
PRISMA 2020 Official Guidelines – comprehensive guide to systematic reviews
RAG Paper (Lewis et al., 2020) – the original research on Retrieval-Augmented Generation
ChromaDB Documentation – learn more about vector databases
Contextual Retrieval (Anthropic) – advanced RAG techniques