System Architecture

๐Ÿ—๏ธ For Developers & Code Reviewers

This page explains how the files in ScholaRAG connect and communicate. It is essential reading for contributors, code reviewers, and AI assistants.

High-Level Architecture

User (via Claude Code)
    ↓
prompts/*.md (Stage 1-7 conversation flows)
    ↓
scholarag_cli.py (Orchestration & initialization)
    ↓
config.yaml (Project configuration)
    ↓
scripts/*.py (Automated pipeline execution)
    ↓
data/ (Processed results)
    ↓
outputs/ (Final RAG system + PRISMA diagram)

ScholaRAG follows a layered architecture where each layer has a specific responsibility:

  • Conversation Layer: prompts/*.md guide users through decisions
  • Configuration Layer: config.yaml stores all project settings
  • Orchestration Layer: scholarag_cli.py coordinates script execution
  • Execution Layer: scripts/*.py process data
  • Data Layer: data/ folders store intermediate results

File Dependency Map

This diagram shows the complete file dependency flow in ScholaRAG, organized into 4 distinct layers. Each layer has a specific role in the automated research pipeline.

[Figure: ScholaRAG Architecture Diagram]

🔴 Critical: project_type Branching

Red nodes (03_screen_papers.py, 07_generate_prisma.py) read project_type from config.yaml and adjust their behavior accordingly. Thick red arrows (==>) indicate critical branching points.

  • 03_screen_papers.py: Sets screening threshold (50% for knowledge_repository, 90% for systematic_review)
  • 07_generate_prisma.py: Changes PRISMA diagram title based on project type
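The threshold branching can be sketched as a simple lookup. This is a minimal illustration, assuming config.yaml has already been parsed into a dict; the function name and error handling are hypothetical, and only the two project types and their thresholds come from this page.

```python
# Thresholds (in percent) as stated above; the names are illustrative only.
SCREENING_THRESHOLDS = {
    "knowledge_repository": 50,
    "systematic_review": 90,
}


def screening_threshold(config: dict) -> int:
    """Return the screening threshold (percent) for the configured project_type."""
    project_type = config.get("project_type")
    if project_type not in SCREENING_THRESHOLDS:
        # Fail loudly rather than silently picking a default threshold.
        raise ValueError(f"Unknown or missing project_type: {project_type!r}")
    return SCREENING_THRESHOLDS[project_type]
```

Keeping the mapping in one place means a threshold change touches a single definition instead of several scattered conditionals.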

config.yaml: The Central Hub

config.yaml is the single source of truth for all project settings. Every script reads from it, making it the most important file in the system.

Critical Fields by Script

| Script | Reads from config.yaml | Why it matters |
| --- | --- | --- |
| 01_fetch_papers.py | search_query, databases | Determines which databases to query and what keywords to use |
| 03_screen_papers.py | project_type, ai_prisma_rubric | Critical: Sets screening thresholds (50% for knowledge_repository, 90% for systematic_review) |
| 05_build_rag.py | rag_settings.embedding_model, rag_settings.llm | Determines quality and cost of RAG system |
| 06_query_rag.py | rag_settings.llm, rag_settings.temperature | Controls answer generation quality and randomness |
| 07_generate_prisma.py | project_type, project_name | Critical: Changes diagram title based on project type |

💡 Design Principle

Scripts never hardcode values. Everything comes from config.yaml, making projects portable and reproducible.
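To illustrate the principle, here is a small sketch of reading nested settings such as rag_settings.llm from a parsed config instead of hardcoding them. The helper get_setting and the placeholder values are assumptions for illustration, not part of ScholaRAG; the config is shown as a plain dict, as it might look after parsing with PyYAML's yaml.safe_load.

```python
from typing import Any

_MISSING = object()  # sentinel so that None can be a legitimate default


def get_setting(config: dict, dotted_key: str, default: Any = _MISSING) -> Any:
    """Look up a nested key like 'rag_settings.llm' in a parsed config dict."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is _MISSING:
                raise KeyError(f"Missing required field: {dotted_key}")
            return default
        node = node[part]
    return node


# Placeholder values, not ScholaRAG defaults.
config = {
    "project_type": "systematic_review",
    "rag_settings": {"llm": "example-llm", "temperature": 0.2},
}

llm = get_setting(config, "rag_settings.llm")
embedding_model = get_setting(config, "rag_settings.embedding_model", default=None)
```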

Data Flow: Stage by Stage

1. 01_fetch_papers.py

Fetches papers from Semantic Scholar, OpenAlex, and arXiv using the configured query.

Input: config.yaml (search_query, databases)
Output: data/01_identification/*.csv
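The shape of this stage can be sketched as follows. The sketch assumes one CSV per configured database, a particular set of columns, and a dict of fetcher callables; all of these are hypothetical stand-ins, since the page does not specify file names, columns, or how each API is called.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

Paper = Dict[str, str]
Fetcher = Callable[[str], List[Paper]]  # hypothetical: takes a query, returns papers


def fetch_all(config: dict, fetchers: Dict[str, Fetcher], out_dir: Path) -> List[Path]:
    """Run the configured query against each configured database, one CSV per database."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in config["databases"]:
        papers = fetchers[name](config["search_query"])
        path = out_dir / f"{name}.csv"  # assumed naming scheme
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=["title", "doi", "source"])
            writer.writeheader()
            for paper in papers:
                writer.writerow({
                    "title": paper.get("title", ""),
                    "doi": paper.get("doi", ""),
                    "source": name,
                })
        written.append(path)
    return written
```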
2. 02_deduplicate.py

Removes duplicates by DOI, arXiv ID, and title similarity.

Input: data/01_identification/*.csv
Output: data/01_identification/deduplicated.csv
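One way the three deduplication criteria could be combined is sketched below. The field names (doi, arxiv_id, title), the 0.9 similarity cutoff, and the use of difflib are assumptions for illustration; the actual script may use different columns or a different similarity measure.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(papers: List[Dict[str, str]], title_threshold: float = 0.9) -> List[Dict[str, str]]:
    """Keep the first occurrence of each paper, matching on DOI, arXiv ID, then title."""
    seen_dois, seen_arxiv = set(), set()
    kept_titles: List[str] = []
    unique = []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        arxiv_id = (paper.get("arxiv_id") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")
        if doi and doi in seen_dois:
            continue
        if arxiv_id and arxiv_id in seen_arxiv:
            continue
        if title and any(
            SequenceMatcher(None, title, kept).ratio() >= title_threshold
            for kept in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        if arxiv_id:
            seen_arxiv.add(arxiv_id)
        if title:
            kept_titles.append(title)
        unique.append(paper)
    return unique
```

The pairwise title comparison is quadratic in the number of kept papers, which is acceptable for a sketch but may need blocking or hashing for large result sets.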
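3. 03_screen_papers.py

⚠️ CRITICAL: Adjusts the screening threshold based on project_type (50% for knowledge_repository, 90% for systematic_review).

Input: data/01_identification/deduplicated.csv + config.yaml (project_type, ai_prisma_rubric)
Output: data/02_screening/relevant.csv, data/02_screening/excluded.csv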
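The split into relevant and excluded papers might look like the sketch below. It assumes each paper already carries a numeric relevance score from 0 to 100 under a hypothetical "score" key, and that a score equal to the threshold counts as relevant; how the real script derives scores from ai_prisma_rubric is not described on this page.

```python
from typing import Dict, List, Tuple

Paper = Dict[str, object]


def split_by_threshold(
    scored_papers: List[Paper], threshold_percent: float
) -> Tuple[List[Paper], List[Paper]]:
    """Partition papers into (relevant, excluded) by comparing score to the threshold."""
    relevant: List[Paper] = []
    excluded: List[Paper] = []
    for paper in scored_papers:
        # Assumption: inclusive comparison; the real cutoff rule may differ.
        if paper["score"] >= threshold_percent:
            relevant.append(paper)
        else:
            excluded.append(paper)
    return relevant, excluded
```

With the same scored input, a knowledge_repository project (threshold 50) keeps more papers than a systematic_review project (threshold 90), which is the behavioral difference the warning above refers to.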
4. 04_download_pdfs.py

Downloads PDFs from open_access URLs with retry logic.

Input: data/02_screening/relevant.csv
Output: data/pdfs/*.pdf
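Retry logic of this kind is often written as a loop with exponential backoff. The sketch below is one such pattern, not the script's actual implementation: the attempt count, the delays, and the injected fetch callable are all assumptions. In practice fetch could wrap urllib.request.urlopen, whose URLError is a subclass of OSError.

```python
import time
from typing import Callable, Optional


def download_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url), retrying on OSError with exponential backoff.

    Returns the downloaded bytes, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts:
                return None
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Passing fetch and sleep as parameters keeps the function testable without network access or real waiting.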
5. 05_build_rag.py

Chunks PDFs, generates embeddings, and stores them in a vector database.

Input: data/pdfs/*.pdf + config.yaml (RAG settings)
Output: data/chroma/ (ChromaDB)
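The chunking step can be illustrated with a fixed-size, overlapping window over already-extracted text. The chunk size and overlap below are arbitrary assumptions, and PDF text extraction, embedding generation, and ChromaDB storage are deliberately left out; the real script may chunk by tokens or sentences instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of chunk_size characters, each sharing `overlap` characters with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Overlap helps keep sentences that straddle a chunk boundary retrievable from at least one chunk.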
6. 06_query_rag.py

Retrieves relevant chunks and generates answers with citations.

Input: data/chroma/ + config.yaml (LLM)
Output: Interactive console output
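Conceptually, retrieval ranks stored chunks by how similar their embeddings are to the query embedding. The pure-Python stand-in below only illustrates that idea with cosine similarity; in the real pipeline the vector database presumably performs this search, and the function names and the value of k here are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the text of the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

The retrieved chunk texts would then be passed to the configured LLM as context for the answer.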
7. 07_generate_prisma.py

⚠️ CRITICAL: The PRISMA diagram title changes based on project_type.

Input: All data/ folders + config.yaml (project_type, project_name)
Output: outputs/prisma_diagram.png
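Because this stage reads from all data/ folders, a natural first step is to collect counts from the files earlier stages produced. The sketch below uses the paths named on this page, but the function names and the exact set of counts are assumptions; it does not render the diagram or choose the title.

```python
import csv
from pathlib import Path
from typing import Dict


def count_rows(csv_path: Path) -> int:
    """Count data rows (excluding the header) in a CSV; 0 if the file is missing."""
    if not csv_path.exists():
        return 0
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return sum(1 for _ in csv.DictReader(fh))


def prisma_counts(data_dir: Path) -> Dict[str, int]:
    """Gather per-stage counts that a PRISMA flow diagram could report."""
    pdf_dir = data_dir / "pdfs"
    return {
        "after_deduplication": count_rows(data_dir / "01_identification" / "deduplicated.csv"),
        "relevant": count_rows(data_dir / "02_screening" / "relevant.csv"),
        "excluded": count_rows(data_dir / "02_screening" / "excluded.csv"),
        "pdfs_downloaded": len(list(pdf_dir.glob("*.pdf"))) if pdf_dir.is_dir() else 0,
    }
```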

Common Pitfalls for Contributors

Adding a new field to config.yaml

โŒ Problem: You add a field but forget to update documentation

โœ… Solution:
  • 1. Update templates/config_base.yaml with inline comments
  • 2. Update relevant prompts/*.md to collect this field
  • 3. Update relevant scripts/*.py to read this field
  • 4. Update ARCHITECTURE.md to document dependencies
  • 5. Update RELEASE_NOTES_vX.X.X.md

Changing project_type logic

โŒ Problem: You modify thresholds but only update one script

โœ… Solution:
  • Files to update: 03_screen_papers.py, 07_generate_prisma.py
  • Prompts to update: 01_research_domain_setup.md, 03_prisma_configuration.md
  • Template to update: templates/config_base.yaml
  • Search for all occurrences: grep -r "project_type" .

Creating a new script

โŒ Problem: Script reads config but doesn't validate required fields

โœ… Solution:
  • 1. Add load_config() method to validate required fields
  • 2. Use self.config.get('field', default_value) with safe defaults
  • 3. Print clear error messages: "โŒ Missing required field: X"
  • 4. Add script to dependency map in ARCHITECTURE.md
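The first three steps could be sketched as follows. The class name, the choice of required fields, and raising ValueError are hypothetical; the sketch takes an already-parsed dict (for example from PyYAML's yaml.safe_load) rather than reading config.yaml itself.

```python
from typing import Any, Dict, Tuple


class ExampleScript:
    """Hypothetical script skeleton showing config validation at startup."""

    # Illustrative choice; each real script would list its own required fields.
    REQUIRED_FIELDS: Tuple[str, ...] = ("project_type", "project_name")

    def __init__(self, raw_config: Dict[str, Any]) -> None:
        self.config = self.load_config(raw_config)

    def load_config(self, raw_config: Dict[str, Any]) -> Dict[str, Any]:
        """Validate required fields and return a copy of the config."""
        missing = [field for field in self.REQUIRED_FIELDS if field not in raw_config]
        for field in missing:
            print(f"❌ Missing required field: {field}")
        if missing:
            raise ValueError(f"Invalid config, missing: {', '.join(missing)}")
        return dict(raw_config)

    def output_format(self) -> str:
        # Optional field with a safe default, as in step 2 above.
        return self.config.get("output_format", "png")
```

Validating once in load_config() means the rest of the script can assume required fields exist and reserve .get() for genuinely optional ones.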

Quick Reference Tables

Where is X defined?

| What | Defined in | Used by |
| --- | --- | --- |
| project_type | config.yaml (Stage 1) | 03_screen_papers.py, 07_generate_prisma.py, all prompts |
| search_query | config.yaml (Stage 2) | 01_fetch_papers.py |
| ai_prisma_rubric | config.yaml (Stage 3) | 03_screen_papers.py |
| rag_settings | config.yaml (Stage 4) | 05_build_rag.py, 06_query_rag.py |
| API keys | .env (Stage 5) | 03_screen_papers.py, 05_build_rag.py, 06_query_rag.py |

Script Execution Order

1. 01_fetch_papers.py → data/01_identification/*.csv
2. 02_deduplicate.py → data/01_identification/deduplicated.csv
3. 03_screen_papers.py → data/02_screening/*.csv
4. 04_download_pdfs.py → data/pdfs/*.pdf
5. 05_build_rag.py → data/chroma/
6. 06_query_rag.py → Interactive
7. 07_generate_prisma.py → outputs/prisma_diagram.png
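The execution order above can be captured as an ordered list from which an orchestrator builds commands. This is only a sketch of the idea: the helper name, the start_at parameter, and the absence of command-line arguments are assumptions, and it says nothing about how scholarag_cli.py is actually implemented.

```python
import sys
from pathlib import Path
from typing import List

# Script order as listed above.
PIPELINE = [
    "01_fetch_papers.py",
    "02_deduplicate.py",
    "03_screen_papers.py",
    "04_download_pdfs.py",
    "05_build_rag.py",
    "06_query_rag.py",
    "07_generate_prisma.py",
]


def build_commands(scripts_dir: Path, start_at: int = 1) -> List[List[str]]:
    """Return one command per script, beginning at the 1-based step start_at."""
    if not 1 <= start_at <= len(PIPELINE):
        raise ValueError(f"start_at must be between 1 and {len(PIPELINE)}")
    return [
        [sys.executable, str(scripts_dir / name)]
        for name in PIPELINE[start_at - 1:]
    ]
```

Each command could then be executed in order with subprocess.run(cmd, check=True), so a failing step stops the pipeline. Note that step 6 is interactive and would block until the user exits.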

Ready to Contribute?

Now that you understand the architecture, check out the detailed script documentation to see how each component works internally.
