Scripts Workflow
Understanding why scripts must run in a specific sequence (30% of Codebook content).
The Data Dependency Chain
ScholaRAG's scripts MUST run in order (01 → 02 → 03 → 04 → 05 → 06 → 07) because each script depends on the output of the previous one. It's like cooking - you can't frost a cake before baking it!
Think of it as an assembly line: raw materials → processing → quality check → packaging → shipping. Each stage transforms the output of the previous stage.
Complete ScholaRAG Pipeline:
┌────────────────────────────────────────────────────────────┐
│                     SCHOLARAG PIPELINE                     │
│              Data flows DOWN through each stage            │
└────────────────────────────────────────────────────────────┘
                       config.yaml + .env
                               │
           ┌───────────────────┴─────────────────────┐
           │                                         │
           ▼                                         ▼
┌─────────────────────┐                    ┌───────────────────┐
│ 01_fetch_papers     │                    │ Your research     │
│ Search & Fetch      │                    │ question &        │
│                     │                    │ criteria          │
│ Queries databases   │◄───────────────────┤                   │
│ Downloads metadata  │                    └───────────────────┘
└──────────┬──────────┘
           │
           │ papers.json (500-5000 papers)
           │ [title, authors, abstract, year, DOI...]
           │
           ▼
┌─────────────────────┐
│ 02_title_abstract   │
│ Initial Screen      │
│                     │
│ Claude reads:       │
│ - Title             │◄──── Needs: papers.json
│ - Abstract          │      Your PRISMA criteria
│ Fast filtering      │
└──────────┬──────────┘
           │
           │ screened.json (100-500 papers)
           │ [included=true/false, reason...]
           │
           ▼
┌─────────────────────┐
│ 03_full_text        │
│ Deep Screen         │
│                     │
│ Claude reads:       │
│ - Full paper PDF    │◄──── Needs: screened.json
│ - Methods section   │      (only included=true)
│ Detailed analysis   │
└──────────┬──────────┘
           │
           │ eligible.json (30-100 papers)
           │ [final_included=true, quality_score...]
           │
           ▼
┌─────────────────────┐
│ 04_embeddings       │
│ Vectorize           │
│                     │
│ OpenAI converts:    │
│ Text → Vectors      │◄──── Needs: eligible.json
│ Stores in ChromaDB  │      (only final papers)
└──────────┬──────────┘
           │
           │ ChromaDB collection
           │ [1536-dimensional vectors for each paper]
           │
           ▼
┌─────────────────────┐
│ 05_rag_query        │
│ Interactive Q&A     │
│                     │
│ Your questions →    │
│ Semantic search →   │◄──── Needs: ChromaDB populated
│ Claude answers      │      with paper vectors
│ with evidence       │
└──────────┬──────────┘
           │
           │ insights.json
           │ [queries, answers, citations...]
           │
           ▼
┌─────────────────────┐
│ 06_synthesis        │
│ Meta-Analysis       │
│                     │
│ Claude analyzes:    │
│ - Patterns          │◄──── Needs: insights.json
│ - Effect sizes      │      eligible.json
│ - Gaps              │
└──────────┬──────────┘
           │
           │ synthesis.json
           │ [themes, statistics, recommendations...]
           │
           ▼
┌─────────────────────┐
│ 07_documentation    │
│ Write Report        │
│                     │
│ Generates:          │
│ - PRISMA flowchart  │◄──── Needs: ALL previous outputs
│ - Methods section   │      (papers → screened →
│ - Results tables    │       eligible → synthesis)
│ - Bibliography      │
└──────────┬──────────┘
           │
           │ Final outputs/
           │ ├── prisma_flowchart.md
           │ ├── methods_section.md
           │ ├── results_tables.md
           │ └── bibliography.bib
           │
           ▼
  Publication Ready!

Script-by-Script Breakdown
Fetch Papers - The Foundation
🎯 What it does:
Searches academic databases (Semantic Scholar, PubMed, ERIC, etc.) and downloads paper metadata (title, authors, abstract, DOI, year).
📥 Inputs:
- config.yaml - Research question, databases, date range
- .env - API keys for database access
📤 Outputs:
- data/papers.json - All fetched papers with metadata
Why run this FIRST?
You need papers before you can screen them! This creates the initial dataset. Without papers.json, scripts 02-07 have nothing to work with.
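To make this concrete, here is a minimal sketch of what a fetch stage like this could look like, using Semantic Scholar's public paper-search endpoint. The query string, field list, and papers.json layout are illustrative assumptions, not ScholaRAG's actual implementation:

```python
import json
import requests

# Sketch only: query Semantic Scholar's search API and save basic
# metadata to data/papers.json (the schema here is an assumption).
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def fetch_papers(query: str, limit: int = 100) -> list:
    response = requests.get(SEARCH_URL, params={
        "query": query,
        "limit": limit,
        "fields": "title,authors,abstract,year,externalIds",
    })
    response.raise_for_status()
    return response.json().get("data", [])

papers = fetch_papers("AI chatbots language learning")  # example query
with open("data/papers.json", "w") as f:
    json.dump(papers, f, indent=2)
print(f"Fetched {len(papers)} papers")
```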
Title/Abstract Screening - Quick Filter
🎯 What it does:
Claude AI reads each paper's title and abstract, applies your PRISMA inclusion/exclusion criteria, and marks papers as included/excluded with reasoning.
📥 Inputs:
- data/papers.json - From script 01
- config.yaml - PRISMA screening criteria
- .env - Anthropic API key for Claude
📤 Outputs:
- data/screened.json - Papers with screening decisions
Why run this AFTER 01?
Depends on papers.json existing. Can't screen papers you haven't fetched yet! This reduces 5000 papers to ~500 relevant ones quickly (reading only abstracts, not full PDFs).
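A sketch of the screening loop this stage describes, using the anthropic Python SDK. The model name, criteria text, and reply format are placeholders, not ScholaRAG's exact prompt:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CRITERIA = "Include empirical studies of <your topic>, 2015-2025, in English."  # placeholder

def screen_abstract(paper: dict) -> dict:
    """Ask Claude for an include/exclude decision on one title + abstract."""
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use your configured model
        max_tokens=200,
        messages=[{"role": "user", "content": (
            f"PRISMA criteria: {CRITERIA}\n\n"
            f"Title: {paper['title']}\n"
            f"Abstract: {paper.get('abstract') or ''}\n\n"
            'Reply with bare JSON: {"included": true|false, "reason": "<one sentence>"}'
        )}],
    )
    # Assumes the model follows the instruction to reply with bare JSON.
    return {**paper, **json.loads(reply.content[0].text)}

with open("data/papers.json") as f:
    papers = json.load(f)
screened = [screen_abstract(p) for p in papers]
with open("data/screened.json", "w") as f:
    json.dump(screened, f, indent=2)
```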
Full-Text Screening - Deep Dive
🎯 What it does:
Downloads and reads FULL PDFs of papers that passed abstract screening. Claude evaluates methodology, data quality, and detailed inclusion criteria.
📥 Inputs:
- data/screened.json - From script 02 (only included=true)
- config.yaml - Detailed eligibility criteria
📤 Outputs:
- data/eligible.json - Final included papers with quality ratings
- pdfs/ folder - Downloaded full-text PDFs
Why run this AFTER 02?
Only screens papers that passed abstract screening (screened.json where included=true). Reading 500 full PDFs is expensive and slow - script 02 filters it down to ~100 first. Saves time and API costs!
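The dependency on script 02 shows up in the very first lines such a script would run - it keeps only the papers that survived abstract screening. A sketch, assuming the included flag shown above:

```python
import json

# Load script 02's output and keep only papers marked included=true;
# these are the only ones worth the cost of full-PDF review.
with open("data/screened.json") as f:
    screened = json.load(f)

to_review = [p for p in screened if p.get("included")]
print(f"{len(to_review)} of {len(screened)} papers advance to full-text screening")
```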
Build Embeddings - Create Search Index
🎯 What it does:
Converts each eligible paper into a semantic vector (1536 numbers) using OpenAI embeddings. Stores vectors in ChromaDB for lightning-fast semantic search.
📥 Inputs:
- data/eligible.json - From script 03 (final papers only)
- .env - OpenAI API key
📤 Outputs:
- chroma_db/ - Vector database with paper embeddings
Why run this AFTER 03?
Only embeds papers that passed FULL screening (eligible.json). No point creating vectors for papers you're going to exclude! This is like creating an index for a book - but the book (eligible papers) must exist first.
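A sketch of the embedding step using the openai and chromadb packages. The collection name, metadata fields, and the choice to embed title + abstract are assumptions; text-embedding-3-small does return 1536-dimensional vectors:

```python
import json
import chromadb
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
db = chromadb.PersistentClient(path="chroma_db")
collection = db.get_or_create_collection("papers")  # collection name assumed

with open("data/eligible.json") as f:
    papers = json.load(f)

for i, paper in enumerate(papers):
    # Embedding title + abstract is an assumption; a real pipeline
    # might chunk and embed the full text instead.
    text = f"{paper['title']}\n{paper.get('abstract') or ''}"
    vec = oai.embeddings.create(model="text-embedding-3-small", input=text)
    collection.add(
        ids=[str(i)],
        embeddings=[vec.data[0].embedding],  # 1536-dimensional vector
        documents=[text],
        metadatas=[{"title": paper["title"]}],
    )
```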
RAG Query - Interactive Research
🎯 What it does:
You ask research questions in natural language. The system searches ChromaDB for relevant papers, then Claude answers using evidence from those papers with citations.
📥 Inputs:
- chroma_db/ - From script 04 (populated vector database)
- data/eligible.json - Paper metadata for citations
- Your questions (interactive)
📤 Outputs:
- data/insights.json - Q&A pairs with citations
- Console output (your research conversation)
Why run this AFTER 04?
Requires ChromaDB to be populated with embeddings! Can't do semantic search on an empty database. Think of it like asking a librarian questions - the library must have books (vectors) first!
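One RAG turn, sketched end to end: embed the question, retrieve the nearest papers from ChromaDB, and let Claude answer from only that context. The collection and model names are placeholders:

```python
import anthropic
import chromadb
from openai import OpenAI

oai = OpenAI()
claude = anthropic.Anthropic()
collection = chromadb.PersistentClient(path="chroma_db").get_collection("papers")

question = "What outcome measures do these studies use?"

# 1. Embed the question with the SAME model used in script 04.
q_vec = oai.embeddings.create(model="text-embedding-3-small", input=question)

# 2. Retrieve the five most similar papers from the vector database.
hits = collection.query(query_embeddings=[q_vec.data[0].embedding], n_results=5)
context = "\n\n".join(hits["documents"][0])

# 3. Answer strictly from the retrieved excerpts.
reply = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=500,
    messages=[{"role": "user", "content":
        f"Answer using ONLY these excerpts:\n\n{context}\n\nQuestion: {question}"}],
)
print(reply.content[0].text)
```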
Synthesis - Meta-Analysis
🎯 What it does:
Claude analyzes ALL eligible papers together, identifying patterns, calculating aggregate statistics, finding research gaps, and synthesizing themes.
📥 Inputs:
- data/insights.json - From script 05 (research findings)
- data/eligible.json - All final papers
📤 Outputs:
- data/synthesis.json - Meta-analysis results
- Themes, patterns, effect sizes, recommendations
Why run this AFTER 05?
Builds on insights from RAG queries. Uses both insights.json (specific findings) and eligible.json (all papers) to identify cross-study patterns. Can't synthesize what you haven't analyzed yet!
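A sketch of the synthesis call, assuming the insights.json and eligible.json files from earlier stages fit in a single prompt (a real run over a large review would need chunking):

```python
import json
import anthropic

with open("data/insights.json") as f:
    insights = json.load(f)
with open("data/eligible.json") as f:
    papers = json.load(f)

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": (
        f"Q&A findings from a systematic review of {len(papers)} papers:\n"
        f"{json.dumps(insights, indent=2)}\n\n"
        "Synthesize recurring themes, effect-size patterns, and research gaps."
    )}],
)
with open("data/synthesis.json", "w") as f:
    json.dump({"analysis": reply.content[0].text}, f, indent=2)
```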
Documentation - Publication Ready
🎯 What it does:
Generates publication-ready documentation: PRISMA flowchart, methods section, results tables, discussion, bibliography in APA/BibTeX format.
📥 Inputs:
- data/papers.json - Total fetched (for flowchart numbers)
- data/screened.json - Abstract screening results
- data/eligible.json - Final included papers
- data/synthesis.json - Meta-analysis findings
📤 Outputs:
- outputs/prisma_flowchart.md
- outputs/methods_section.md
- outputs/results_tables.md
- outputs/bibliography.bib
Why run this LAST?
Needs data from ALL previous scripts! PRISMA flowchart shows the entire journey (fetched โ screened โ eligible). Methods describe the full pipeline. Results come from synthesis. This is the final report that ties everything together.
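Why all four inputs? The PRISMA flowchart alone needs counts from three different stages. A sketch with assumed field names:

```python
import json

def count(path, flag=None):
    """Count records in a JSON list, optionally only those where flag is true."""
    with open(path) as f:
        items = json.load(f)
    return sum(1 for p in items if p.get(flag)) if flag else len(items)

fetched = count("data/papers.json")
screened_in = count("data/screened.json", flag="included")
eligible = count("data/eligible.json")

with open("outputs/prisma_flowchart.md", "w") as f:
    f.write(f"Records identified: {fetched}\n"
            f"Passed title/abstract screening: {screened_in}\n"
            f"Included after full-text review: {eligible}\n")
```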
⚠️ What Happens If You Skip Steps?
❌ Skip 01 (Fetch) → Run 02 (Screen)
Error: FileNotFoundError: data/papers.json does not exist
Can't screen papers that don't exist!
❌ Skip 03 (Full-text) → Run 04 (Embeddings)
Error: FileNotFoundError: data/eligible.json does not exist
Can't vectorize papers that haven't been screened!
❌ Skip 04 (Embeddings) → Run 05 (RAG)
Error: ChromaDB collection is empty or doesn't exist
Can't search an empty vector database!
❌ Skip 05-06 → Run 07 (Documentation)
Result: Incomplete documentation with missing synthesis and insights
Documentation needs ALL previous outputs to be complete!
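These failures are easy to catch early. A small guard each script could run before doing any work (the file names match the pipeline above; the helper itself is hypothetical):

```python
import sys
from pathlib import Path

def require(path: str, produced_by: str) -> None:
    """Fail fast with a clear message if a prerequisite output is missing."""
    if not Path(path).exists():
        sys.exit(f"Missing {path} - run {produced_by} first.")

# e.g. at the top of 04_embeddings:
require("data/eligible.json", "03_full_text")
```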
💡 Key Takeaway
The pipeline is a dependency chain. Each script is like a Lego brick that builds on the previous one. You can't build the roof before the foundation!
The Order Ensures:
- Data integrity: Each stage validates and transforms data correctly
- Efficiency: Filter early (abstract screening) before expensive operations (full-text, embeddings)
- Reproducibility: Same order = same results every time
- Traceability: PRISMA flowchart tracks papers through every stage