
🔄 Scripts Workflow

Understanding why scripts must run in a specific sequence (30% of Codebook content).

🔗 The Data Dependency Chain

ScholaRAG's scripts MUST run in order (01 → 02 → 03 → 04 → 05 → 06 → 07) because each script depends on the output of the previous one. It's like cooking - you can't frost a cake before baking it!

Think of it as an assembly line: raw materials โ†’ processing โ†’ quality check โ†’ packaging โ†’ shipping. Each stage transforms the output of the previous stage.

Complete ScholaRAG Pipeline:

┌────────────────────────────────────────────────────────────────────┐
│                         SCHOLARAG PIPELINE                         │
│                  Data flows DOWN through each stage                │
└────────────────────────────────────────────────────────────────────┘

   📁 config.yaml + .env
         │
         ├───────────────────────────────────────────┐
         │                                           │
         ▼                                           ▼
┌─────────────────────┐                    ┌──────────────────┐
│  01_fetch_papers    │                    │  Your research   │
│  🔍 Search & Fetch  │                    │  question &      │
│                     │                    │  criteria        │
│  Queries databases  │◄───────────────────┤                  │
│  Downloads metadata │                    └──────────────────┘
└──────────┬──────────┘
           │
           │ papers.json (500-5000 papers)
           │ [title, authors, abstract, year, DOI...]
           │
           ▼
┌─────────────────────┐
│  02_title_abstract  │
│  📋 Initial Screen  │
│                     │
│  Claude reads:      │
│  - Title            │◄──── Needs: papers.json
│  - Abstract         │      Your PRISMA criteria
│  Fast filtering     │
└──────────┬──────────┘
           │
           │ screened.json (100-500 papers)
           │ [included=true/false, reason...]
           │
           ▼
┌─────────────────────┐
│  03_full_text       │
│  📄 Deep Screen     │
│                     │
│  Claude reads:      │
│  - Full paper PDF   │◄──── Needs: screened.json
│  - Methods section  │      (only included=true)
│  Detailed analysis  │
└──────────┬──────────┘
           │
           │ eligible.json (30-100 papers)
           │ [final_included=true, quality_score...]
           │
           ▼
┌─────────────────────┐
│  04_embeddings      │
│  🧠 Vectorize       │
│                     │
│  OpenAI converts:   │
│  Text → Vectors     │◄──── Needs: eligible.json
│  Stores in ChromaDB │      (only final papers)
└──────────┬──────────┘
           │
           │ ChromaDB collection
           │ [1536-dimensional vectors for each paper]
           │
           ▼
┌─────────────────────┐
│  05_rag_query       │
│  💬 Interactive Q&A │
│                     │
│  Your questions →   │
│  Semantic search →  │◄──── Needs: ChromaDB populated
│  Claude answers     │      with paper vectors
│  with evidence      │
└──────────┬──────────┘
           │
           │ insights.json
           │ [queries, answers, citations...]
           │
           ▼
┌─────────────────────┐
│  06_synthesis       │
│  📊 Meta-Analysis   │
│                     │
│  Claude analyzes:   │
│  - Patterns         │◄──── Needs: insights.json
│  - Effect sizes     │      eligible.json
│  - Gaps             │
└──────────┬──────────┘
           │
           │ synthesis.json
           │ [themes, statistics, recommendations...]
           │
           ▼
┌─────────────────────┐
│  07_documentation   │
│  📝 Write Report    │
│                     │
│  Generates:         │
│  - PRISMA flowchart │◄──── Needs: ALL previous outputs
│  - Methods section  │      (papers → screened →
│  - Results tables   │       eligible → synthesis)
│  - Bibliography     │
└─────────────────────┘
           │
           │ Final outputs/
           │ ├── prisma_flowchart.md
           │ ├── methods_section.md
           │ ├── results_tables.md
           │ └── bibliography.bib
           │
           ▼
     Publication Ready! 🎉

Script-by-Script Breakdown

01: Fetch Papers - The Foundation

🎯 What it does:

Searches academic databases (Semantic Scholar, PubMed, ERIC, etc.) and downloads paper metadata (title, authors, abstract, DOI, year).

📥 Inputs:

  • config.yaml - Research question, databases, date range
  • .env - API keys for database access

📤 Outputs:

  • data/papers.json - All fetched papers with metadata

❓ Why run this FIRST?

You need papers before you can screen them! This creates the initial dataset. Without papers.json, scripts 02-07 have nothing to work with.
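A minimal sketch of the kind of record a fetch step might collect before writing data/papers.json. The field names follow the diagram above (title, authors, abstract, year, DOI), but the exact schema and sample values here are assumptions for illustration:

```python
import json

# Hypothetical paper record; field names follow the pipeline diagram
# (title, authors, abstract, year, DOI), actual schema may differ.
papers = [
    {
        "title": "AI Chatbots in Language Learning",
        "authors": ["Kim, J.", "Lee, S."],
        "abstract": "We examine conversational agents for speaking practice...",
        "year": 2023,
        "doi": "10.1234/example.5678",
    }
]

# Serialize the way a fetch script might before writing data/papers.json
serialized = json.dumps(papers, indent=2)
print(serialized.splitlines()[0])  # "["
```

Everything downstream treats this file as the single source of truth, which is why the fetch step must come first.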

02: Title/Abstract Screening - Quick Filter

🎯 What it does:

Claude AI reads each paper's title and abstract, applies your PRISMA inclusion/exclusion criteria, and marks papers as included/excluded with reasoning.

📥 Inputs:

  • data/papers.json - From script 01
  • config.yaml - PRISMA screening criteria
  • .env - Anthropic API key for Claude

📤 Outputs:

  • data/screened.json - Papers with screening decisions

❓ Why run this AFTER 01?

Depends on papers.json existing. Can't screen papers you haven't fetched yet! This reduces 5000 papers to ~500 relevant ones quickly (reading only abstracts, not full PDFs).
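The hand-off to the next stage can be sketched in a few lines: script 03 should only ever see papers marked included=true. The "included" and "reason" field names follow the screened.json annotation in the diagram and are assumptions:

```python
# Sketch of the screening hand-off. Only papers that passed
# title/abstract screening advance to full-text review.
screened = [
    {"title": "Paper A", "included": True,  "reason": "Meets criteria"},
    {"title": "Paper B", "included": False, "reason": "Wrong population"},
    {"title": "Paper C", "included": True,  "reason": "Relevant intervention"},
]

# This filter is effectively what script 03 applies to screened.json
to_full_text = [p for p in screened if p["included"]]
print(len(to_full_text))  # 2
```

Because excluded papers carry a "reason", every decision remains traceable for the PRISMA flowchart later.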

03: Full-Text Screening - Deep Dive

🎯 What it does:

Downloads and reads FULL PDFs of papers that passed abstract screening. Claude evaluates methodology, data quality, and detailed inclusion criteria.

📥 Inputs:

  • data/screened.json - From script 02 (only included=true)
  • config.yaml - Detailed eligibility criteria

📤 Outputs:

  • data/eligible.json - Final included papers with quality ratings
  • pdfs/ folder - Downloaded full-text PDFs

❓ Why run this AFTER 02?

Only screens papers that passed abstract screening (screened.json where included=true). Reading 500 full PDFs is expensive and slow - script 02 filters it down to ~100 first. Saves time and API costs!
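The cost argument can be made concrete with back-of-the-envelope arithmetic. All figures below (token counts, paper counts) are made up purely for illustration:

```python
# Why filtering order matters: rough, hypothetical numbers only.
fetched, after_abstract = 5000, 500

abstract_tokens, fulltext_tokens = 300, 15000  # assumed sizes per paper

# Screening everything by full text vs. abstracts-first:
naive_cost = fetched * fulltext_tokens
staged_cost = fetched * abstract_tokens + after_abstract * fulltext_tokens

print(naive_cost // staged_cost)  # 8 -> staged pipeline ~8x cheaper here
```

The exact ratio depends on your corpus, but cheap filters before expensive ones is the general principle the pipeline order encodes.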

04: Build Embeddings - Create Search Index

🎯 What it does:

Converts each eligible paper into a semantic vector (1536 numbers) using OpenAI embeddings. Stores vectors in ChromaDB for lightning-fast semantic search.

📥 Inputs:

  • data/eligible.json - From script 03 (final papers only)
  • .env - OpenAI API key

📤 Outputs:

  • chroma_db/ - Vector database with paper embeddings

❓ Why run this AFTER 03?

Only embeds papers that passed FULL screening (eligible.json). No point creating vectors for papers you're going to exclude! This is like creating an index for a book - but the book (eligible papers) must exist first.
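Why vectors at all? Semantically similar texts get nearby vectors, and "nearby" is usually measured with cosine similarity. A toy 3-dimensional sketch of the idea (real embeddings have 1536 dimensions; the vectors below are invented stand-ins):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 3-d stand-ins for real 1536-d OpenAI embeddings
chatbot_paper  = [0.9, 0.1, 0.2]
tutoring_paper = [0.8, 0.2, 0.3]
geology_paper  = [0.1, 0.9, 0.1]

# A chatbot paper sits closer to a tutoring paper than to a geology paper
print(cosine_similarity(chatbot_paper, tutoring_paper) >
      cosine_similarity(chatbot_paper, geology_paper))  # True
```

ChromaDB performs this kind of comparison at scale, which is what makes step 05's semantic search fast.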

05: RAG Query - Interactive Research

🎯 What it does:

You ask research questions in natural language. The system searches ChromaDB for relevant papers, then Claude answers using evidence from those papers with citations.

📥 Inputs:

  • chroma_db/ - From script 04 (populated vector database)
  • data/eligible.json - Paper metadata for citations
  • Your questions (interactive)

📤 Outputs:

  • data/insights.json - Q&A pairs with citations
  • Console output (your research conversation)

❓ Why run this AFTER 04?

Requires ChromaDB to be populated with embeddings! Can't do semantic search on an empty database. Think of it like asking a librarian questions - the library must have books (vectors) first!
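The retrieval half of the RAG loop can be sketched without any external services. Here a plain dict stands in for the ChromaDB collection and a dot product for its similarity search; in the real pipeline the question would be embedded via OpenAI and the top hits handed to Claude as evidence:

```python
# Stand-in "vector store": paper id -> toy embedding
store = {
    "paper_01": [0.9, 0.1, 0.2],
    "paper_02": [0.2, 0.8, 0.1],
    "paper_03": [0.8, 0.3, 0.1],
}

def retrieve(query_vec, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(
        store,
        key=lambda pid: sum(q * v for q, v in zip(query_vec, store[pid])),
        reverse=True,
    )
    return scored[:k]

question_vec = [1.0, 0.0, 0.0]        # pretend embedding of your question
context_ids = retrieve(question_vec)  # papers passed to Claude as evidence
print(context_ids)  # ['paper_01', 'paper_03']
```

If the store is empty, there is nothing to rank - which is exactly why script 05 fails when script 04 hasn't run.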

06: Synthesis - Meta-Analysis

🎯 What it does:

Claude analyzes ALL eligible papers together, identifying patterns, calculating aggregate statistics, finding research gaps, and synthesizing themes.

📥 Inputs:

  • data/insights.json - From script 05 (research findings)
  • data/eligible.json - All final papers

📤 Outputs:

  • data/synthesis.json - Meta-analysis results
  • Themes, patterns, effect sizes, recommendations

❓ Why run this AFTER 05?

Builds on insights from RAG queries. Uses both insights.json (specific findings) and eligible.json (all papers) to identify cross-study patterns. Can't synthesize what you haven't analyzed yet!
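The aggregation side of synthesis is ordinary cross-study bookkeeping. A minimal sketch, where the per-paper "theme" and "effect_size" fields and their values are invented for illustration (not ScholaRAG's actual schema):

```python
from collections import Counter
from statistics import mean

# Hypothetical per-paper findings pulled from insights/eligible data
findings = [
    {"theme": "speaking fluency", "effect_size": 0.45},
    {"theme": "speaking fluency", "effect_size": 0.62},
    {"theme": "learner anxiety",  "effect_size": -0.30},
]

theme_counts = Counter(f["theme"] for f in findings)  # recurring patterns
avg_effect = mean(f["effect_size"] for f in findings)  # aggregate statistic

print(theme_counts.most_common(1)[0][0])  # speaking fluency
print(round(avg_effect, 2))               # 0.26
```

None of this is possible until every paper's findings exist, which is why synthesis sits after the query stage.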

07: Documentation - Publication Ready

🎯 What it does:

Generates publication-ready documentation: PRISMA flowchart, methods section, results tables, discussion, and bibliography in APA/BibTeX format.

📥 Inputs:

  • data/papers.json - Total fetched (for flowchart numbers)
  • data/screened.json - Abstract screening results
  • data/eligible.json - Final included papers
  • data/synthesis.json - Meta-analysis findings

📤 Outputs:

  • outputs/prisma_flowchart.md
  • outputs/methods_section.md
  • outputs/results_tables.md
  • outputs/bibliography.bib

❓ Why run this LAST?

Needs data from ALL previous scripts! PRISMA flowchart shows the entire journey (fetched → screened → eligible). Methods describe the full pipeline. Results come from synthesis. This is the final report that ties everything together.
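The PRISMA flowchart numbers fall straight out of the stage outputs, which is why every earlier file must exist. A sketch with illustrative counts (chosen to match the ranges in the diagram above, not real data):

```python
# Illustrative counts; in practice these come from len() over the
# papers.json / screened.json / eligible.json stage outputs.
fetched     = 4200  # records in papers.json
screened_in = 480   # screened.json entries with included=true
eligible    = 85    # eligible.json entries with final_included=true

flowchart = (
    f"Records identified: {fetched}\n"
    f"Excluded at title/abstract: {fetched - screened_in}\n"
    f"Full-text assessed: {screened_in}\n"
    f"Excluded at full text: {screened_in - eligible}\n"
    f"Studies included: {eligible}"
)
print(flowchart)
```

Every box in the flowchart is a simple difference between adjacent stages, so a missing stage file leaves a hole in the chart.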

โš ๏ธWhat Happens If You Skip Steps?

โŒ Skip 01 (Fetch) โ†’ Run 02 (Screen)

Error: FileNotFoundError: data/papers.json does not exist

Can't screen papers that don't exist!

โŒ Skip 03 (Full-text) โ†’ Run 04 (Embeddings)

Error: FileNotFoundError: data/eligible.json does not exist

Can't vectorize papers that haven't been screened!

โŒ Skip 04 (Embeddings) โ†’ Run 05 (RAG)

Error: ChromaDB collection is empty or doesn't exist

Can't search an empty vector database!

โŒ Skip 05-06 โ†’ Run 07 (Documentation)

Result: Incomplete documentation with missing synthesis and insights

Documentation needs ALL previous outputs to be complete!
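A guard like the following, run at the top of each script, turns these failures into an immediate, explicit error instead of a mid-run surprise. The file names come from the pipeline above; the function itself is a sketch, not ScholaRAG's actual code:

```python
from pathlib import Path

# Prerequisite files per script, taken from the pipeline diagram.
REQUIRES = {
    "02_title_abstract": ["data/papers.json"],
    "03_full_text": ["data/screened.json"],
    "04_embeddings": ["data/eligible.json"],
    "07_documentation": [
        "data/papers.json", "data/screened.json",
        "data/eligible.json", "data/synthesis.json",
    ],
}

def check_prerequisites(script, base=Path(".")):
    """Raise FileNotFoundError if an upstream output is missing."""
    missing = [p for p in REQUIRES.get(script, []) if not (base / p).exists()]
    if missing:
        raise FileNotFoundError(
            f"{script} needs {missing} - run the earlier scripts first"
        )
```

Calling check_prerequisites("02_title_abstract") in an empty project directory raises immediately, which is exactly the dependency chain made executable.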

💡 Key Takeaway

The pipeline is a dependency chain. Each script is like a Lego brick that builds on the previous one. You can't build the roof before the foundation!

The Order Ensures:

  • Data integrity: Each stage validates and transforms data correctly
  • Efficiency: Filter early (abstract screening) before expensive operations (full-text, embeddings)
  • Reproducibility: Same order = same results every time
  • Traceability: PRISMA flowchart tracks papers through every stage