Scripts Workflow
Understanding why scripts must run in a specific sequence (30% of Codebook content).
The Data Dependency Chain
ScholaRAG's scripts MUST run in order (01 → 02 → 03 → 04 → 05 → 06 → 07) because each script depends on the output of the previous one. It's like cooking - you can't frost a cake before baking it!
Think of it as an assembly line: raw materials → processing → quality check → packaging → shipping. Each stage transforms the output of the previous stage.
Complete ScholaRAG Pipeline:
┌────────────────────────────────────────────────────────────┐
│                     SCHOLARAG PIPELINE                     │
│              Data flows DOWN through each stage            │
└────────────────────────────────────────────────────────────┘
                       config.yaml + .env
                               │
           ┌───────────────────┴─────────────────────┐
           │                                         │
           ▼                                         ▼
┌─────────────────────┐                    ┌───────────────────┐
│ 01_fetch_papers     │                    │ Your research     │
│ Search & Fetch      │                    │ question &        │
│                     │                    │ criteria          │
│ Queries databases   │◄───────────────────┤                   │
│ Downloads metadata  │                    └───────────────────┘
└──────────┬──────────┘
           │
           │ papers.json (500-5000 papers)
           │ [title, authors, abstract, year, DOI...]
           │
           ▼
┌─────────────────────┐
│ 02_title_abstract   │
│ Initial Screen      │
│                     │
│ Claude reads:       │
│ - Title             │◄──── Needs: papers.json
│ - Abstract          │      Your PRISMA criteria
│ Fast filtering      │
└──────────┬──────────┘
           │
           │ screened.json (100-500 papers)
           │ [included=true/false, reason...]
           │
           ▼
┌─────────────────────┐
│ 03_full_text        │
│ Deep Screen         │
│                     │
│ Claude reads:       │
│ - Full paper PDF    │◄──── Needs: screened.json
│ - Methods section   │      (only included=true)
│ Detailed analysis   │
└──────────┬──────────┘
           │
           │ eligible.json (30-100 papers)
           │ [final_included=true, quality_score...]
           │
           ▼
┌─────────────────────┐
│ 04_embeddings       │
│ Vectorize           │
│                     │
│ OpenAI converts:    │
│ Text → Vectors      │◄──── Needs: eligible.json
│ Stores in ChromaDB  │      (only final papers)
└──────────┬──────────┘
           │
           │ ChromaDB collection
           │ [1536-dimensional vectors for each paper]
           │
           ▼
┌─────────────────────┐
│ 05_rag_query        │
│ Interactive Q&A     │
│                     │
│ Your questions →    │
│ Semantic search →   │◄──── Needs: ChromaDB populated
│ Claude answers      │      with paper vectors
│ with evidence       │
└──────────┬──────────┘
           │
           │ insights.json
           │ [queries, answers, citations...]
           │
           ▼
┌─────────────────────┐
│ 06_synthesis        │
│ Meta-Analysis       │
│                     │
│ Claude analyzes:    │
│ - Patterns          │◄──── Needs: insights.json
│ - Effect sizes      │      eligible.json
│ - Gaps              │
└──────────┬──────────┘
           │
           │ synthesis.json
           │ [themes, statistics, recommendations...]
           │
           ▼
┌─────────────────────┐
│ 07_documentation    │
│ Write Report        │
│                     │
│ Generates:          │
│ - PRISMA flowchart  │◄──── Needs: ALL previous outputs
│ - Methods section   │      (papers → screened →
│ - Results tables    │       eligible → synthesis)
│ - Bibliography      │
└──────────┬──────────┘
           │
           │ Final outputs/
           │ ├── prisma_flowchart.md
           │ ├── methods_section.md
           │ ├── results_tables.md
           │ └── bibliography.bib
           │
           ▼
  Publication Ready!

Script-by-Script Breakdown
Fetch Papers - The Foundation
🎯 What it does:
Searches academic databases (Semantic Scholar, PubMed, ERIC, etc.) and downloads paper metadata (title, authors, abstract, DOI, year).
📥 Inputs:
- config.yaml - Research question, databases, date range
- .env - API keys for database access
📤 Outputs:
- data/papers.json - All fetched papers with metadata
Why run this FIRST?
You need papers before you can screen them! This creates the initial dataset. Without papers.json, scripts 02-07 have nothing to work with.
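To make this concrete, here is a minimal sketch of what a fetch stage like this could look like, using Semantic Scholar's public paper-search endpoint. The query string, field list, and papers.json layout are illustrative assumptions, not ScholaRAG's actual implementation:

```python
import json
import requests

# Sketch only: query Semantic Scholar's search API and save basic
# metadata to data/papers.json (the schema here is an assumption).
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def fetch_papers(query: str, limit: int = 100) -> list:
    response = requests.get(SEARCH_URL, params={
        "query": query,
        "limit": limit,
        "fields": "title,authors,abstract,year,externalIds",
    })
    response.raise_for_status()
    return response.json().get("data", [])

papers = fetch_papers("AI chatbots language learning")  # example query
with open("data/papers.json", "w") as f:
    json.dump(papers, f, indent=2)
print(f"Fetched {len(papers)} papers")
```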
Title/Abstract Screening - Quick Filter
🎯 What it does:
Claude AI reads each paper's title and abstract, applies your PRISMA inclusion/exclusion criteria, and marks papers as included/excluded with reasoning.
📥 Inputs:
- data/papers.json - From script 01
- config.yaml - PRISMA screening criteria
- .env - Anthropic API key for Claude
📤 Outputs:
- data/screened.json - Papers with screening decisions
Why run this AFTER 01?
Depends on papers.json existing. Can't screen papers you haven't fetched yet! This reduces 5000 papers to ~500 relevant ones quickly (reading only abstracts, not full PDFs).
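A sketch of the screening loop this stage describes, using the anthropic Python SDK. The model name, criteria text, and reply format are placeholders, not ScholaRAG's exact prompt:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CRITERIA = "Include empirical studies of <your topic>, 2015-2025, in English."  # placeholder

def screen_abstract(paper: dict) -> dict:
    """Ask Claude for an include/exclude decision on one title + abstract."""
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use your configured model
        max_tokens=200,
        messages=[{"role": "user", "content": (
            f"PRISMA criteria: {CRITERIA}\n\n"
            f"Title: {paper['title']}\n"
            f"Abstract: {paper.get('abstract') or ''}\n\n"
            'Reply with bare JSON: {"included": true|false, "reason": "<one sentence>"}'
        )}],
    )
    # Assumes the model follows the instruction to reply with bare JSON.
    return {**paper, **json.loads(reply.content[0].text)}

with open("data/papers.json") as f:
    papers = json.load(f)
screened = [screen_abstract(p) for p in papers]
with open("data/screened.json", "w") as f:
    json.dump(screened, f, indent=2)
```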
Full-Text Screening - Deep Dive
🎯 What it does:
Downloads and reads FULL PDFs of papers that passed abstract screening. Claude evaluates methodology, data quality, and detailed inclusion criteria.
📥 Inputs:
- data/screened.json - From script 02 (only included=true)
- config.yaml - Detailed eligibility criteria
📤 Outputs:
- data/eligible.json - Final included papers with quality ratings
- pdfs/ folder - Downloaded full-text PDFs
Why run this AFTER 02?
Only screens papers that passed abstract screening (screened.json where included=true). Reading 500 full PDFs is expensive and slow - script 02 filters it down to ~100 first. Saves time and API costs!
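The dependency on script 02 shows up in the very first lines such a script would run - it keeps only the papers that survived abstract screening. A sketch, assuming the included flag shown above:

```python
import json

# Load script 02's output and keep only papers marked included=true;
# these are the only ones worth the cost of full-PDF review.
with open("data/screened.json") as f:
    screened = json.load(f)

to_review = [p for p in screened if p.get("included")]
print(f"{len(to_review)} of {len(screened)} papers advance to full-text screening")
```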
Build Embeddings - Create Search Index
🎯 What it does:
Converts each eligible paper into a semantic vector (1536 numbers) using OpenAI embeddings. Stores vectors in ChromaDB for lightning-fast semantic search.
📥 Inputs:
- data/eligible.json - From script 03 (final papers only)
- .env - OpenAI API key
📤 Outputs:
- chroma_db/ - Vector database with paper embeddings
Why run this AFTER 03?
Only embeds papers that passed FULL screening (eligible.json). No point creating vectors for papers you're going to exclude! This is like creating an index for a book - but the book (eligible papers) must exist first.
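A sketch of the embedding step using the openai and chromadb packages. The collection name, metadata fields, and the choice to embed title + abstract are assumptions; text-embedding-3-small does return 1536-dimensional vectors:

```python
import json
import chromadb
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
db = chromadb.PersistentClient(path="chroma_db")
collection = db.get_or_create_collection("papers")  # collection name assumed

with open("data/eligible.json") as f:
    papers = json.load(f)

for i, paper in enumerate(papers):
    # Embedding title + abstract is an assumption; a real pipeline
    # might chunk and embed the full text instead.
    text = f"{paper['title']}\n{paper.get('abstract') or ''}"
    vec = oai.embeddings.create(model="text-embedding-3-small", input=text)
    collection.add(
        ids=[str(i)],
        embeddings=[vec.data[0].embedding],  # 1536-dimensional vector
        documents=[text],
        metadatas=[{"title": paper["title"]}],
    )
```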
RAG Query - Interactive Research
🎯 What it does:
You ask research questions in natural language. The system searches ChromaDB for relevant papers, then Claude answers using evidence from those papers with citations.
📥 Inputs:
- chroma_db/ - From script 04 (populated vector database)
- data/eligible.json - Paper metadata for citations
- Your questions (interactive)
📤 Outputs:
- data/insights.json - Q&A pairs with citations
- Console output (your research conversation)
Why run this AFTER 04?
Requires ChromaDB to be populated with embeddings! Can't do semantic search on an empty database. Think of it like asking a librarian questions - the library must have books (vectors) first!
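One RAG turn, sketched end to end: embed the question, retrieve the nearest papers from ChromaDB, and let Claude answer from only that context. The collection and model names are placeholders:

```python
import anthropic
import chromadb
from openai import OpenAI

oai = OpenAI()
claude = anthropic.Anthropic()
collection = chromadb.PersistentClient(path="chroma_db").get_collection("papers")

question = "What outcome measures do these studies use?"

# 1. Embed the question with the SAME model used in script 04.
q_vec = oai.embeddings.create(model="text-embedding-3-small", input=question)

# 2. Retrieve the five most similar papers from the vector database.
hits = collection.query(query_embeddings=[q_vec.data[0].embedding], n_results=5)
context = "\n\n".join(hits["documents"][0])

# 3. Answer strictly from the retrieved excerpts.
reply = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=500,
    messages=[{"role": "user", "content":
        f"Answer using ONLY these excerpts:\n\n{context}\n\nQuestion: {question}"}],
)
print(reply.content[0].text)
```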
Synthesis - Meta-Analysis
🎯 What it does:
Claude analyzes ALL eligible papers together, identifying patterns, calculating aggregate statistics, finding research gaps, and synthesizing themes.
📥 Inputs:
- data/insights.json - From script 05 (research findings)
- data/eligible.json - All final papers
📤 Outputs:
- data/synthesis.json - Meta-analysis results
- Themes, patterns, effect sizes, recommendations
Why run this AFTER 05?
Builds on insights from RAG queries. Uses both insights.json (specific findings) and eligible.json (all papers) to identify cross-study patterns. Can't synthesize what you haven't analyzed yet!
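A sketch of the synthesis call, assuming the insights.json and eligible.json files from earlier stages fit in a single prompt (a real run over a large review would need chunking):

```python
import json
import anthropic

with open("data/insights.json") as f:
    insights = json.load(f)
with open("data/eligible.json") as f:
    papers = json.load(f)

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": (
        f"Q&A findings from a systematic review of {len(papers)} papers:\n"
        f"{json.dumps(insights, indent=2)}\n\n"
        "Synthesize recurring themes, effect-size patterns, and research gaps."
    )}],
)
with open("data/synthesis.json", "w") as f:
    json.dump({"analysis": reply.content[0].text}, f, indent=2)
```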
Documentation - Publication Ready
🎯 What it does:
Generates publication-ready documentation: PRISMA flowchart, methods section, results tables, discussion, bibliography in APA/BibTeX format.
📥 Inputs:
- data/papers.json - Total fetched (for flowchart numbers)
- data/screened.json - Abstract screening results
- data/eligible.json - Final included papers
- data/synthesis.json - Meta-analysis findings
📤 Outputs:
- outputs/prisma_flowchart.md
- outputs/methods_section.md
- outputs/results_tables.md
- outputs/bibliography.bib
Why run this LAST?
Needs data from ALL previous scripts! PRISMA flowchart shows the entire journey (fetched โ screened โ eligible). Methods describe the full pipeline. Results come from synthesis. This is the final report that ties everything together.
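Why all four inputs? The PRISMA flowchart alone needs counts from three different stages. A sketch with assumed field names:

```python
import json

def count(path, flag=None):
    """Count records in a JSON list, optionally only those where flag is true."""
    with open(path) as f:
        items = json.load(f)
    return sum(1 for p in items if p.get(flag)) if flag else len(items)

fetched = count("data/papers.json")
screened_in = count("data/screened.json", flag="included")
eligible = count("data/eligible.json")

with open("outputs/prisma_flowchart.md", "w") as f:
    f.write(f"Records identified: {fetched}\n"
            f"Passed title/abstract screening: {screened_in}\n"
            f"Included after full-text review: {eligible}\n")
```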
⚠️ What Happens If You Skip Steps?
❌ Skip 01 (Fetch) → Run 02 (Screen)
Error: FileNotFoundError: data/papers.json does not exist
Can't screen papers that don't exist!
❌ Skip 03 (Full-text) → Run 04 (Embeddings)
Error: FileNotFoundError: data/eligible.json does not exist
Can't vectorize papers that haven't been screened!
❌ Skip 04 (Embeddings) → Run 05 (RAG)
Error: ChromaDB collection is empty or doesn't exist
Can't search an empty vector database!
❌ Skip 05-06 → Run 07 (Documentation)
Result: Incomplete documentation with missing synthesis and insights
Documentation needs ALL previous outputs to be complete!
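These failures are easy to catch early. A small guard each script could run before doing any work (the file names match the pipeline above; the helper itself is hypothetical):

```python
import sys
from pathlib import Path

def require(path: str, produced_by: str) -> None:
    """Fail fast with a clear message if a prerequisite output is missing."""
    if not Path(path).exists():
        sys.exit(f"Missing {path} - run {produced_by} first.")

# e.g. at the top of 04_embeddings:
require("data/eligible.json", "03_full_text")
```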
💡 Key Takeaway
The pipeline is a dependency chain. Each script is like a Lego brick that builds on the previous one. You can't build the roof before the foundation!
The Order Ensures:
- Data integrity: Each stage validates and transforms data correctly
- Efficiency: Filter early (abstract screening) before expensive operations (full-text, embeddings)
- Reproducibility: Same order = same results every time
- Traceability: PRISMA flowchart tracks papers through every stage