System Architecture

🏗️ For Developers & Code Reviewers

This page explains how all files in ScholaRAG connect and communicate. Essential reading for contributors, code reviewers, and AI assistants.

High-Level Architecture

User (via Claude Code)
    ↓
prompts/*.md (Stage 1-7 conversation flows)
    ↓
scholarag_cli.py (Orchestration & initialization)
    ↓
config.yaml (Project configuration)
    ↓
scripts/*.py (Automated pipeline execution)
    ↓
data/ (Processed results)
    ↓
outputs/ (Final RAG system + PRISMA diagram)

ScholaRAG follows a layered architecture where each layer has a specific responsibility:

  • Conversation Layer: prompts/*.md guide users through decisions
  • Configuration Layer: config.yaml stores all project settings
  • Orchestration Layer: scholarag_cli.py coordinates script execution
  • Execution Layer: scripts/*.py process data
  • Data Layer: data/ folders store intermediate results

File Dependency Map

This diagram shows the complete file dependency flow in ScholaRAG, organized into 4 distinct layers. Each layer has a specific role in the automated research pipeline.

[Figure: ScholaRAG Architecture Diagram]

🔴 Critical: project_type Branching

Red nodes (03_screen_papers.py, 07_generate_prisma.py) read project_type from config.yaml and adjust their behavior accordingly. Thick red arrows (==>) indicate critical branching points.

  • 03_screen_papers.py: Sets screening threshold (50% for knowledge_repository, 90% for systematic_review)
  • 07_generate_prisma.py: Changes PRISMA diagram title based on project type
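
The branching described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in 03_screen_papers.py: the names SCREENING_THRESHOLDS, screening_threshold, and is_relevant are hypothetical, and it assumes the percentages are compared against a relevance score between 0 and 1.

```python
# Hypothetical sketch: only the two threshold values come from this page.
SCREENING_THRESHOLDS = {
    "knowledge_repository": 0.50,  # broader inclusion
    "systematic_review": 0.90,     # stricter inclusion
}


def screening_threshold(config):
    """Return the screening threshold for the configured project_type."""
    project_type = config.get("project_type")
    if project_type not in SCREENING_THRESHOLDS:
        raise ValueError(f"Unknown or missing project_type: {project_type!r}")
    return SCREENING_THRESHOLDS[project_type]


def is_relevant(score, config):
    """Assumption: a paper passes when its score meets the threshold."""
    return score >= screening_threshold(config)
```

Keeping the mapping in one place means a threshold change touches a single dictionary rather than several conditionals.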

config.yaml: The Central Hub

config.yaml is the single source of truth for all project settings. Every script reads from it, making it the most important file in the system.

Critical Fields by Script

| Script | Reads From config.yaml | Why It Matters |
|---|---|---|
| 01_fetch_papers.py | search_query, databases | Determines which databases to query and what keywords to use |
| 03_screen_papers.py | project_type, ai_prisma_rubric | Critical: Sets screening thresholds (50% for knowledge_repository, 90% for systematic_review) |
| 05_build_rag.py | rag_settings.embedding_model, rag_settings.llm | Determines quality and cost of RAG system |
| 06_query_rag.py | rag_settings.llm, rag_settings.temperature | Controls answer generation quality and randomness |
| 07_generate_prisma.py | project_type, project_name | Critical: Changes diagram title based on project type |

💡 Design Principle

Scripts never hardcode values. Everything comes from config.yaml, making projects portable and reproducible.

Data Flow: Stage by Stage

1. 01_fetch_papers.py

Fetches papers from Semantic Scholar, OpenAlex, and arXiv using the configured query.

Input: config.yaml (search_query, databases)
Output: data/01_identification/*.csv

2. 02_deduplicate.py

Removes duplicates by DOI, arXiv ID, and title similarity.

Input: data/01_identification/*.csv
Output: data/01_identification/deduplicated.csv
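
A minimal sketch of the three-pass deduplication idea described for this stage, using only the standard library. The field names doi, arxiv_id, and title, the function names, and the 0.9 similarity cutoff are assumptions for illustration; the real 02_deduplicate.py may use different columns and a different similarity measure.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def deduplicate(papers, title_threshold=0.9):
    """Keep the first occurrence of each paper (hypothetical field names)."""
    seen_dois, seen_arxiv_ids, kept_titles = set(), set(), []
    unique = []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        arxiv_id = (paper.get("arxiv_id") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")

        # Exact identifier matches are checked first because they are cheap.
        if doi and doi in seen_dois:
            continue
        if arxiv_id and arxiv_id in seen_arxiv_ids:
            continue
        # Fuzzy title match against every title kept so far.
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= title_threshold
            for other in kept_titles
        ):
            continue

        if doi:
            seen_dois.add(doi)
        if arxiv_id:
            seen_arxiv_ids.add(arxiv_id)
        if title:
            kept_titles.append(title)
        unique.append(paper)
    return unique
```

The pairwise title comparison is quadratic in the number of kept papers, which is fine for a sketch but may need blocking or hashing on large result sets.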

3. 03_screen_papers.py

⚠️ CRITICAL: Adjusts the screening threshold based on project_type.

Input: data/01_identification/deduplicated.csv + config.yaml (project_type)
Output: data/02_screening/relevant.csv, data/02_screening/excluded.csv

4. 04_download_pdfs.py

Downloads PDFs from open_access URLs with retry logic.

Input: data/02_screening/relevant.csv
Output: data/pdfs/*.pdf
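
The retry logic mentioned for this stage could look something like the sketch below. It assumes exponential backoff and a caller-supplied fetch function; download_with_retry and its parameters are hypothetical names, and the actual retry policy in 04_download_pdfs.py is not documented on this page.

```python
import time


def download_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff.

    fetch is any callable that returns the PDF bytes or raises OSError
    (urllib.error.URLError is a subclass of OSError).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as err:
            last_error = err
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"Failed to download {url} after {max_attempts} attempts"
    ) from last_error
```

Passing fetch and sleep as parameters keeps the function testable without network access; in real use, fetch might wrap urllib.request.urlopen(url, timeout=30).read().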

5. 05_build_rag.py

Chunks PDFs, generates embeddings, and stores them in a vector database.

Input: data/pdfs/*.pdf + config.yaml (RAG settings)
Output: data/chroma/ (ChromaDB)
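
To illustrate the chunking step only, here is a small sketch of fixed-size character chunking with overlap. The function name chunk_text and the default sizes are assumptions; 05_build_rag.py may chunk by tokens, sentences, or sections instead, and the embedding and ChromaDB steps are not shown.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows (hypothetical defaults)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of storing some text twice.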

6. 06_query_rag.py

Retrieves relevant chunks and generates answers with citations.

Input: data/chroma/ + config.yaml (LLM)
Output: Interactive console output
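
The retrieval half of this stage can be pictured with the plain-Python sketch below. In the real pipeline the similarity search is presumably delegated to ChromaDB; the functions here (cosine_similarity, retrieve_top_k, format_context) and the chunk dictionary keys are hypothetical and only show the idea of ranking chunks and numbering them so an answer can cite its sources.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunks, k=3):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def format_context(chunks):
    """Number each chunk so a generated answer can cite it as [1], [2], ..."""
    return "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
```

The numbered context string would then be placed in the LLM prompt together with the user's question.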

7. 07_generate_prisma.py

⚠️ CRITICAL: The diagram title changes based on project_type.

Input: All data/ folders + config.yaml (project_type)
Output: outputs/prisma_diagram.png

AI Assistant Integration: CLAUDE.md + Skills + Scripts

🤖 Three-Layer Architecture for Claude Code

ScholaRAG uses a complementary three-layer system where CLAUDE.md, skills/, and scripts/ work together without conflict. Each layer activates at different times and serves different purposes.

Layer Architecture

┌─────────────────────────────────────────────────────────────┐
│  Layer 1: CLAUDE.md (Foundation - Always Active)            │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│  • Loaded when Claude Code opens ScholaRAG directory        │
│  • Provides base behavior rules, automation principles      │
│  • Defines researcher profile, detection patterns           │
│  • Handles: "I want to review AI chatbots" → Stage 1        │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: skills/ (Knowledge - Conditional Load)            │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│  • SKILL.md → Entry point (Claude Skills feature)           │
│  • skills/claude_only/ → Detailed stage guides              │
│  • skills/reference/ → API docs, decision trees             │
│  • Triggered: "Help me with Stage 3" or stage transition    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Layer 3: scripts/ (Execution - After Completion)           │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│  • Actual Python code execution                             │
│  • 01_fetch_papers.py → 07_generate_prisma.py               │
│  • Called when stage conversation completes                 │
└─────────────────────────────────────────────────────────────┘

When Each Layer Activates

| Scenario | CLAUDE.md | skills/ | Result |
|---|---|---|---|
| User opens ScholaRAG folder | ✅ Active | ⏸ Standby | Base rules loaded |
| "I want to review AI chatbots" | ✅ Pattern detected | ⏸ Standby | Stage 1 starts |
| "Help me with Stage 3" | ✅ Active | ✅ Loaded | Detailed PRISMA guide |
| Stage 1 → Stage 2 transition | ✅ Active | ⚡ Auto-load | Query strategy guide |
| Stage conversation completes | ✅ Active | ✅ Active | scripts/*.py executed |

Why This Works: No Conflicts

📘 CLAUDE.md

Always-on foundation

  • Detection patterns
  • Researcher profile
  • Automation rules
  • CLI command formats

🎯 skills/

Extended knowledge

  • Stage conversation flows
  • Turn-by-turn patterns
  • Divergence handling
  • API reference details

⚙️ scripts/

Actual execution

  • Python implementation
  • API calls
  • Data processing
  • File I/O

💡 Key Insight

Skills enhance but don't replace CLAUDE.md. Even without explicit skill triggers, CLAUDE.md provides sufficient guidance. Skills add detailed conversation patterns for complex stages, making them complementary, not competing.

File Structure

# Layer 1: Foundation (always loaded)
CLAUDE.md → Base behavior, automation rules

# Layer 2: Skills (loaded on-demand)
SKILL.md → Entry point for Claude Skills
skills/claude_only/
├── stage1_research_setup.md
├── stage2_query_strategy.md
├── ...
└── stage7_documentation.md
skills/reference/
├── api_reference.md
└── project_type_decision_tree.md

# Layer 3: Execution (called after conversation)
scripts/
├── 01_fetch_papers.py
├── ...
└── 07_generate_prisma.py

Example Workflow

User: "I want to conduct a systematic review on AI chatbots for language learning"
CLAUDE.md: Detects pattern → activates Stage 1 behavior
CLAUDE.md: Asks clarifying questions, recommends project_type
skills/: stage1_research_setup.md loads for detailed turn-by-turn guidance
skills/: Validates completion checklist, handles divergences
scripts/: scholarag_cli.py init --project-type systematic_review

Common Pitfalls for Contributors

Adding a new field to config.yaml

❌ Problem: You add a field but forget to update documentation

✅ Solution:
  1. Update templates/config_base.yaml with inline comments
  2. Update relevant prompts/*.md to collect this field
  3. Update relevant scripts/*.py to read this field
  4. Update ARCHITECTURE.md to document dependencies
  5. Update RELEASE_NOTES_vX.X.X.md

Changing project_type logic

❌ Problem: You modify thresholds but only update one script

✅ Solution:
  • Files to update: 03_screen_papers.py, 07_generate_prisma.py
  • Prompts to update: 01_research_domain_setup.md, 03_prisma_configuration.md
  • Template to update: templates/config_base.yaml
  • Search for all occurrences: grep -r "project_type" .

Creating a new script

❌ Problem: Script reads config but doesn't validate required fields

✅ Solution:
  1. Add load_config() method to validate required fields
  2. Use self.config.get('field', default_value) with safe defaults
  3. Print clear error messages: "❌ Missing required field: X"
  4. Add script to dependency map in ARCHITECTURE.md
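
The first three steps above might be sketched as follows. This is an illustration under assumptions, not the project's actual base class: PipelineScript, REQUIRED_FIELDS, and the 0.0 temperature default are hypothetical, and the config is passed in as an already-parsed dictionary (for example the result of PyYAML's yaml.safe_load on config.yaml).

```python
# Hypothetical list; each real script would declare the fields it needs.
REQUIRED_FIELDS = ("project_type", "search_query")


class PipelineScript:
    def __init__(self, raw_config):
        self.config = self.load_config(raw_config)

    @staticmethod
    def load_config(raw_config):
        """Fail fast with a clear message if a required field is absent."""
        missing = [name for name in REQUIRED_FIELDS if not raw_config.get(name)]
        if missing:
            raise ValueError(
                "; ".join(f"❌ Missing required field: {name}" for name in missing)
            )
        return raw_config

    def temperature(self):
        """Optional setting read with a safe default (0.0 is an assumption)."""
        return self.config.get("rag_settings", {}).get("temperature", 0.0)
```

Validating at construction time means a misconfigured project stops before any API calls are made.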

Quick Reference Tables

Where is X defined?

| What | Defined In | Used By |
|---|---|---|
| project_type | config.yaml (Stage 1) | 03_screen_papers.py, 07_generate_prisma.py, all prompts |
| search_query | config.yaml (Stage 2) | 01_fetch_papers.py |
| ai_prisma_rubric | config.yaml (Stage 3) | 03_screen_papers.py |
| rag_settings | config.yaml (Stage 4) | 05_build_rag.py, 06_query_rag.py |
| API keys | .env (Stage 5) | 03_screen_papers.py, 05_build_rag.py, 06_query_rag.py |

Script Execution Order

1. 01_fetch_papers.py → data/01_identification/*.csv
2. 02_deduplicate.py → data/01_identification/deduplicated.csv
3. 03_screen_papers.py → data/02_screening/*.csv
4. 04_download_pdfs.py → data/pdfs/*.pdf
5. 05_build_rag.py → data/chroma/
6. 06_query_rag.py → Interactive
7. 07_generate_prisma.py → outputs/prisma_diagram.png
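
As a rough picture of how the order above could be driven programmatically, the sketch below builds the command list without running anything. It is not how scholarag_cli.py is known to work: build_commands is a hypothetical helper, the scripts are assumed to live in scripts/ and to need no extra arguments, and the interactive query script is skipped by default.

```python
import sys
from pathlib import Path

# Script names and order are taken from the list above.
PIPELINE = [
    "01_fetch_papers.py",
    "02_deduplicate.py",
    "03_screen_papers.py",
    "04_download_pdfs.py",
    "05_build_rag.py",
    "06_query_rag.py",
    "07_generate_prisma.py",
]


def build_commands(scripts_dir="scripts", skip_interactive=True):
    """Return one argv list per script, in execution order."""
    commands = []
    for name in PIPELINE:
        if skip_interactive and name == "06_query_rag.py":
            continue  # interactive console session, not suited to batch runs
        commands.append([sys.executable, str(Path(scripts_dir) / name)])
    return commands
```

A caller could then run each entry with subprocess.run(cmd, check=True) so that a failing stage stops the pipeline before later stages read incomplete data.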

Ready to Contribute?

Now that you understand the architecture, check out the detailed script documentation to see how each component works internally.

View Script Documentation →