
🌱 Fundamentals

If you've never written code before, start here. We explain the basic building blocks of ScholaRAG in simple, non-technical language.

What is a "Script"?

A script is a file containing step-by-step instructions for a computer to follow. Think of it like a recipe:

🍰Cooking Recipe

Preheat oven to 350°F
Mix flour and sugar
Add eggs and butter
Bake for 30 minutes

💻Python Script

Connect to database
Search for papers
Filter by year
Save results to file

In ScholaRAG, we have 7 scripts (named 01_fetch_papers.py through 07_generate_prisma.py). Each script does one specific job in your research workflow.
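To make the recipe comparison concrete, here is a tiny toy script that follows the same four steps as the Python Script list above. It is only an illustrative sketch: the paper titles, the keyword, and the output file name are made up, and the real ScholaRAG scripts are more involved.

```python
# A toy script that mirrors the four steps listed above.
# All data and names here are invented for illustration only.
import json
import tempfile
from pathlib import Path

# Step 1: "Connect to database" (here, just a small in-memory list)
papers = [
    {"title": "AI Tutoring and Learning Outcomes", "year": 2023},
    {"title": "Learning Analytics in Schools", "year": 2015},
    {"title": "A History of Chalkboards", "year": 2019},
]

# Step 2: "Search for papers" (keep titles that mention a keyword)
keyword = "learning"
matches = [p for p in papers if keyword in p["title"].lower()]

# Step 3: "Filter by year" (keep papers from 2020 onward)
recent = [p for p in matches if p["year"] >= 2020]

# Step 4: "Save results to file" (written to the system temp folder)
output_path = Path(tempfile.gettempdir()) / "example_results.json"
output_path.write_text(json.dumps(recent, indent=2))

print(f"Saved {len(recent)} paper(s) to {output_path}")
```

Notice that the computer runs the lines from top to bottom, one step at a time, just like following a recipe.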

What is Python?

Python is a programming language: the "language" that scripts are written in. Just as English is a language for humans, Python is a language for giving instructions to computers.

Why Python for research?

1. Easy to Read

Python looks almost like English. Even non-programmers can understand what code is trying to do.

2. Powerful Libraries

Python has thousands of pre-built tools (called libraries) for AI, data analysis, and scientific computing.

3. Used by Researchers

One of the most widely used languages in academia, with users in biology, psychology, economics, and physics.

4. Free and Open Source

Anyone can download and use Python for free. No licenses or subscriptions needed.

💡 For ScholaRAG users:

You don't need to learn Python deeply! ScholaRAG scripts are already written. You just need to understand what they do, not how they work internally.

What is Terminal / Command Line?

The Terminal (also called "Command Line" or "Shell") is a text-based way to control your computer. Instead of clicking buttons with a mouse, you type commands with your keyboard.

🖱️ GUI (Graphical User Interface)

What you're used to:

Click folder icons to open them
Drag files to move them
Use mouse to navigate

⌨️ Terminal (Command Line)

Text-based control:

Type cd folder to open it
Type mv file.txt new/ to move
Use keyboard only

Common Terminal commands for ScholaRAG:

cd /path/to/project

Change directory - move to your project folder

python 01_fetch_papers.py

Run a Python script

ls

List files in current directory

⚠️ Don't worry if Terminal feels scary!

You'll only need to type a few simple commands. The documentation provides exact commands to copy and paste.

What is an API?

API stands for "Application Programming Interface." Think of it as a waiter at a restaurant:

👤

You (Customer)

You want food but can't go to the kitchen

🍽️

Waiter (API)

Takes your order to the kitchen and brings back your food

👨‍🍳

Kitchen (Service)

Prepares the food but doesn't interact with you directly

In ScholaRAG, APIs let your scripts communicate with external services:

Anthropic API (Claude AI)

Your script sends a paper → Claude reads and screens it → Returns include/exclude decision

OpenAI API

Your script sends text → OpenAI creates semantic embedding → Returns vector (numbers)

Semantic Scholar API

Your script sends search query → Semantic Scholar searches database → Returns matching papers
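To show what "sending a search query" can look like, here is a minimal sketch that only builds the web address for a search request and does not actually contact any service. The endpoint and the parameter names (query, limit, fields) are assumptions based on the public Semantic Scholar Graph API, so check the official API documentation before relying on them. The function name build_search_url is hypothetical, and ScholaRAG's own scripts may do this differently.

```python
# Build (but do not send) a paper-search request URL.
# The endpoint and parameter names are assumptions; verify them in the API docs.
from urllib.parse import urlencode

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 5) -> str:
    """Return the full URL a script could request for a search query."""
    params = {"query": query, "limit": limit, "fields": "title,year"}
    # urlencode turns the dictionary into "query=...&limit=...&fields=..."
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("AI tutoring")
print(url)
```

In the restaurant picture, this URL is the written order: it tells the service exactly what you want, and the service replies with the matching papers.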

🔑 API Keys:

To use an API, you need an API key - like a password that identifies you. Keep these secret!
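One common way to keep a key secret is to store it in an environment variable instead of typing it into a script. The sketch below shows the idea. The variable name EXAMPLE_API_KEY and the helper functions are hypothetical and used only for this demonstration; the variable names ScholaRAG actually expects may differ.

```python
# Read an API key from an environment variable instead of hard-coding it.
# EXAMPLE_API_KEY and both helper functions are hypothetical.
import os


def get_api_key(name: str) -> str:
    """Return the key stored in the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


def mask(key: str) -> str:
    """Show only the first four characters so the key never appears in full."""
    return key[:4] + "..." if len(key) > 4 else "..."


# For this demo only, set a fake key if none exists yet.
os.environ.setdefault("EXAMPLE_API_KEY", "demo-key-not-real")

print("Loaded key:", mask(get_api_key("EXAMPLE_API_KEY")))
```

Because the key lives outside the script, you can share the script with colleagues without sharing your key.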

What is a Vector Database?

A vector database is a special kind of database that stores information as meaning-based coordinates instead of exact text.

Traditional Database

Exact matching only:

Search: "machine learning"

Finds: Papers with exact phrase "machine learning"

Misses: Papers about "neural networks", "deep learning", "AI models"

Vector Database

Meaning-based search:

Search: "machine learning"

Finds: Papers about "machine learning"

Also finds: "neural networks", "deep learning", "AI models" (similar concepts!)

How it works:

Convert to vectors: Each paper becomes a list of numbers (e.g., [0.23, -0.15, 0.89, ...]) that represents its meaning in "semantic space"
Store vectors: The database stores these number lists, usually alongside the original text, so it can compare meanings and still show you the source
Search by similarity: When you search, it finds papers with similar number patterns, which means similar meanings
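The three steps above can be sketched with a few made-up vectors. Real embeddings have hundreds or thousands of numbers and come from an AI model; the three-number vectors below are invented purely for illustration. The sketch uses cosine similarity, one common way to measure how closely two vectors point in the same direction. This is not necessarily the exact measure your vector database is configured to use.

```python
# Rank toy "papers" by how similar their vectors are to a query vector.
# All vectors are invented; real embeddings are produced by an AI model.
import math


def cosine_similarity(a, b):
    """Return a score near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Steps 1 and 2: each paper has been converted to a vector and stored
stored = {
    "neural networks": [0.80, 0.90, 0.20],
    "deep learning": [0.85, 0.75, 0.15],
    "medieval poetry": [0.10, 0.00, 0.90],
}

# Step 3: the search text "machine learning" is also converted to a vector
query_vector = [0.90, 0.80, 0.10]

ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query_vector, stored[name]),
    reverse=True,
)

for name in ranked:
    score = cosine_similarity(query_vector, stored[name])
    print(f"{name}: {score:.2f}")
```

The two AI-related topics score close to 1.0, while the poetry paper scores far lower, even though none of the stored titles contains the words "machine learning".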

💡 In ScholaRAG:

We use ChromaDB as our vector database. It lets you ask questions in natural language and find relevant papers based on meaning, not just keywords.

What is RAG (Retrieval-Augmented Generation)?

RAG combines searching for information with AI-generated answers. It's like having a research assistant who can read all your papers and answer questions with citations.

The RAG Process (4 steps):

Ask a Question

"What learning outcomes were reported in AI tutoring studies?"

Retrieval

Search vector database for papers about "learning outcomes" and "AI tutoring"

Augmentation

Give Claude AI your question + relevant paper excerpts as context

Generation

Claude reads the excerpts and writes an answer with citations
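The four steps above can be sketched in plain Python. This is a simplified illustration under several assumptions: retrieval here counts shared words instead of using a vector database, the excerpts and source labels are placeholders, the function names are hypothetical, and the final generation step is left out because it would require sending the prompt to an AI service.

```python
# A toy walk-through of the RAG steps: ask, retrieve, augment.
# Excerpts are placeholders; retrieval uses simple word overlap,
# not the vector search a real RAG system would use.

def retrieve(question, excerpts, top_k=2):
    """Return the top_k excerpts sharing the most words with the question."""
    question_words = set(question.lower().split())

    def overlap(excerpt):
        return len(question_words & set(excerpt["text"].lower().split()))

    return sorted(excerpts, key=overlap, reverse=True)[:top_k]


def build_prompt(question, retrieved):
    """Combine the question with the retrieved excerpts as context."""
    context = "\n".join(f"[{e['source']}] {e['text']}" for e in retrieved)
    return (
        "Answer using only the excerpts below and cite their sources.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}"
    )


excerpts = [
    {"source": "Paper A", "text": "AI tutoring improved learning outcomes on tests"},
    {"source": "Paper B", "text": "Students using AI tutoring reported higher engagement"},
    {"source": "Paper C", "text": "Soil acidity affects crop yields"},
]

# Step 1: ask a question
question = "What learning outcomes were reported in AI tutoring studies?"

# Step 2: retrieval
retrieved = retrieve(question, excerpts)

# Step 3: augmentation
prompt = build_prompt(question, retrieved)
print(prompt)

# Step 4: generation would send `prompt` to the AI model (not shown here)
```

The key point is that the AI only sees the excerpts placed in the prompt, which is why its answer can cite your papers.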

❌ Without RAG:

Question: "What effect sizes were reported?"

Claude AI: "I don't have access to your specific papers. I can only provide general information."

No citations, generic answer

✅ With RAG:

Question: "What effect sizes were reported?"

Claude AI: "Based on your papers: Smith (2023) reported d=0.72 for test scores. Lee (2024) found d=0.58 for retention..."

Specific to your research, with citations!

💡 Why RAG is powerful:

AI models like Claude don't "know" about your specific research papers. RAG gives the AI temporary access to your papers during each conversation, so it can answer based on your actual data.

What is PRISMA 2020?

PRISMA 2020 (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is a reporting guideline for systematic literature reviews. Think of it as a checklist and roadmap that helps make your review rigorous and transparent.

The 4-Stage PRISMA Process:

Identification

Search databases and identify all potentially relevant papers (e.g., 5000 papers)

Screening

Read titles and abstracts, remove clearly irrelevant papers (5000 → 500 papers)

Eligibility

Read full texts, apply detailed inclusion criteria (500 → 100 papers)

Included

Final set of papers for your systematic review and meta-analysis (100 papers)
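The arithmetic behind these stages is simple subtraction: a PRISMA flow diagram reports how many papers were removed at each step. Here is a small sketch using the example numbers above. The variable names are hypothetical, and this is not how ScholaRAG's own flowchart script is implemented.

```python
# Count how many papers are excluded at each stage,
# using the example numbers from the stages above.
remaining = [
    ("Identification", 5000),  # papers found by the database search
    ("Screening", 500),        # papers left after title/abstract screening
    ("Eligibility", 100),      # papers left after full-text review
]

excluded = {}
# Pair each stage with the one before it and subtract
for (_, before), (stage, after) in zip(remaining, remaining[1:]):
    excluded[stage] = before - after

included = remaining[-1][1]

for stage, count in excluded.items():
    print(f"Excluded during {stage}: {count}")
print(f"Included in the review: {included}")
```

With the example numbers, 4500 papers are removed during Screening and 400 during Eligibility, leaving 100 included papers.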

Why PRISMA matters:

Reproducibility: Other researchers can follow your exact process and get the same results
Transparency: You document every decision (why papers were included or excluded)
Publication requirement: Many journals require or strongly encourage PRISMA compliance for systematic reviews
Quality assurance: Helps reduce bias and encourages comprehensive coverage

💡 In ScholaRAG:

ScholaRAG automates the PRISMA workflow! Scripts 01-03 handle Identification, Screening, and Eligibility. Script 07 generates the PRISMA flowchart diagram required for publication.
