📄 File Formats

ScholaRAG uses different file formats for different purposes. Here's what each one is and why it exists.

YAML Files (.yaml)

YAML stands for "YAML Ain't Markup Language" (yes, it's recursive!). Think of it as a configuration checklist - like filling out a form where you set all your preferences.

Why YAML for configuration?

Human-readable: Easy to read and edit, even without programming knowledge
Hierarchical: Shows relationships clearly with indentation (like an outline)
No mess: No curly braces or commas - just clean text
Standard: Used across research tools, AI systems, and web services

Example: config.yaml

# Research Project Settings
project_name: "AI in Education Meta-Analysis"
research_question: "How effective is AI tutoring?"

# Which databases to search
databases:
  - semantic_scholar
  - pubmed
  - eric

# AI settings
ai_model: "claude-3-5-sonnet"
max_papers: 5000

⚠️ Important: Indentation matters!

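YAML uses spaces (not tabs) for indentation. By convention, two spaces = one level deeper, and items at the same level must line up exactly. If the spacing is wrong, the file either fails to load or is read with a different structure than you intended.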
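The indentation rule can be checked before a file is ever loaded. The sketch below is a minimal illustration, not a YAML parser: `find_indentation_problems` is a hypothetical helper that only flags tabs and indents that are not a multiple of two spaces, assuming the two-space convention used in the example above. A real YAML library remains the authority on whether a file is valid.

```python
def find_indentation_problems(text, indent_width=2):
    """Return (line_number, message) pairs for suspicious indentation.

    Hypothetical helper for illustration only: it checks the two
    conventions described above, not full YAML syntax.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        # Skip blank lines and comment-only lines.
        if not stripped or stripped.startswith("#"):
            continue
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent_width != 0:
            problems.append(
                (number, f"{len(leading)} spaces is not a multiple of {indent_width}")
            )
    return problems


good = "databases:\n  - semantic_scholar\n  - pubmed\n"
bad = "databases:\n\t- semantic_scholar\n   - pubmed\n"

print(find_indentation_problems(good))
for number, message in find_indentation_problems(bad):
    print(f"line {number}: {message}")
```

Running a check like this on `config.yaml` before starting the pipeline points to the exact line to fix.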

JSON Files (.json)

JSON stands for "JavaScript Object Notation". Think of it as a structured storage container - like organizing your research data in labeled boxes within boxes.

Why JSON for data?

Structured: Data organized in key-value pairs (like a dictionary)
Machine-readable: Easy for programs to read and write
Flexible: Can store numbers, text, lists, and nested data
Universal: Supported by virtually every programming language and platform

Example: papers.json

{
  "papers": [
    {
      "title": "AI Tutoring Systems: A Meta-Analysis",
      "authors": ["Smith, J.", "Lee, K."],
      "year": 2023,
      "doi": "10.1234/example",
      "citations": 45,
      "screened": true,
      "included": false,
      "exclusion_reason": "Not RCT design"
    }
  ],
  "total_count": 503,
  "last_updated": "2024-01-15"
}

💡 In ScholaRAG:

JSON files store your fetched papers, screening results, and analysis outputs. They're like your research filing cabinet - organized and searchable.
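To make "organized and searchable" concrete, here is a minimal sketch that reads a trimmed copy of the `papers.json` example with Python's built-in `json` module and lists the papers that were screened but excluded. The data is embedded as a string so the example runs on its own; how ScholaRAG's scripts actually read these files is not shown here, so treat the variable names as assumptions.

```python
import json

# A trimmed copy of the papers.json example, embedded as text.
raw = """
{
  "papers": [
    {
      "title": "AI Tutoring Systems: A Meta-Analysis",
      "year": 2023,
      "screened": true,
      "included": false,
      "exclusion_reason": "Not RCT design"
    }
  ],
  "total_count": 503,
  "last_updated": "2024-01-15"
}
"""

# json.loads turns JSON text into Python dicts, lists, numbers, and booleans.
data = json.loads(raw)

# Search the "filing cabinet": screened papers that were not included.
excluded = [p for p in data["papers"] if p["screened"] and not p["included"]]
for paper in excluded:
    print(f'{paper["title"]} ({paper["year"]}): {paper["exclusion_reason"]}')

# json.dumps writes the same structure back out as indented text.
text = json.dumps(data, indent=2)
```

For a file on disk, `json.load(open_file)` and `json.dump(data, open_file)` do the same job as `loads` and `dumps` do for strings.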

Markdown Files (.md)

Markdown is a simple formatting language - like writing with basic formatting shortcuts. Think of it as "Microsoft Word, but using symbols instead of toolbar buttons."

Why Markdown for documentation?

Simple syntax: # = heading, ** = bold, - = bullet point
Plain text: Opens in any text editor, with no proprietary format to go out of date
Version control friendly: Easy to track changes in Git
Converts easily: Can become PDF, HTML, Word docs

You write this:

# Methods

## Inclusion Criteria

- Published 2020-2024
- **RCT design**
- Sample size > 30

> Important: Must report
> effect sizes.

It becomes this:

Methods

Inclusion Criteria

Published 2020-2024
RCT design
Sample size > 30

Important: Must report effect sizes.
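The step from symbols to formatting is mechanical, which a toy converter can show. The sketch below handles only the three rules named above (`#` headings, `**` bold, `-` bullets) and turns each line into HTML. `render_line` is a hypothetical function written for this illustration. It is not how GitHub or any real Markdown renderer works, and it ignores most of the syntax.

```python
import re


def render_line(line):
    """Convert one Markdown line to HTML using three toy rules."""
    # **text** becomes <strong>text</strong>
    line = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)

    # One to six # characters followed by a space mark a heading.
    heading = re.match(r"(#{1,6}) (.*)", line)
    if heading:
        level = len(heading.group(1))
        return f"<h{level}>{heading.group(2)}</h{level}>"

    # A leading "- " marks a bullet point.
    if line.startswith("- "):
        return f"<li>{line[2:]}</li>"

    return line


sample = ["# Methods", "## Inclusion Criteria", "- **RCT design**"]
rendered = [render_line(line) for line in sample]
for html in rendered:
    print(html)
```

In practice a dedicated converter does this work. The point is only that the plain-text symbols carry all the information needed to produce the formatted version.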

💡 In ScholaRAG:

All prompts (01-07.md) and documentation are written in Markdown. It's the universal language for research documentation and GitHub.

Python Files (.py)

A .py file contains Python code - the actual instructions that make things happen. We covered Python earlier, but here's what the file itself represents.

Structure of a Python script:

Imports: Loading tools and libraries (like importing cookware)
Configuration: Setting up variables and settings
Functions: Reusable blocks of code (like sub-recipes)
Main execution: The actual work that runs when you execute the script

Example: 01_fetch_papers.py (simplified)

# 1. IMPORTS - Load tools
import requests
from datetime import datetime

# 2. CONFIGURATION - Settings
API_KEY = "your-key-here"
MAX_PAPERS = 5000

# 3. FUNCTIONS - Reusable logic
def fetch_from_database(query):
    """Fetch papers from API"""
    papers = []  # placeholder: the real script fills this from the API response
    # ... code here ...
    return papers

# 4. MAIN EXECUTION - What runs
if __name__ == "__main__":
    results = fetch_from_database("AI tutoring")
    print(f"Found {len(results)} papers")

⚠️ Don't edit Python files unless:

You know what you're doing! Changing code can break the entire pipeline. Start with configuration files (YAML) instead.

Environment Files (.env)

A .env file stores secret information like passwords and API keys. Think of it as your personal keychain - you don't share it with anyone!

🚨 CRITICAL: Security Rules

Never share: .env files contain sensitive secrets
Never commit to Git: These should NOT be uploaded to GitHub. List .env in your .gitignore file so Git skips it
Use .env.example: Share templates without real keys
Regenerate if exposed: If leaked, create new API keys immediately

❌ Bad: .env (real secrets)

# NEVER SHARE THIS FILE!
ANTHROPIC_API_KEY=sk-ant-abc123...
OPENAI_API_KEY=sk-proj-xyz789...
DATABASE_PASSWORD=MySecret123

⚠️ Do NOT upload to GitHub!

✅ Good: .env.example (template)

# Share this template safely
ANTHROPIC_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
DATABASE_PASSWORD=your-password-here

✅ Safe to share as template

💡 How it works:

Create .env file in project root
Add your API keys: ANTHROPIC_API_KEY=sk-ant-...
Python scripts load these variables when they start (typically through a helper library such as python-dotenv) and read them from the environment
Keys stay private, code stays shareable
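The steps above can be sketched with the standard library alone. This is a simplified illustration, assuming a `.env` file made of plain `KEY=VALUE` lines with no quoting. `load_env_file`, `require_key`, and the variable name `SCHOLARAG_DEMO_KEY` are hypothetical names invented for this example. Many projects use the python-dotenv library for this step, and whether ScholaRAG's scripts do is not stated here.

```python
import os
import tempfile


def load_env_file(path):
    """Copy KEY=VALUE lines from a file into os.environ.

    Values already set in the environment are left untouched.
    """
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything without "=".
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())


def require_key(name):
    """Return a variable from the environment or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a throwaway file and a placeholder value (not a real key).
with tempfile.TemporaryDirectory() as folder:
    env_path = os.path.join(folder, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# placeholder only\nSCHOLARAG_DEMO_KEY=your-key-here\n")
    load_env_file(env_path)

print(require_key("SCHOLARAG_DEMO_KEY"))
```

Because the script only ever refers to the variable's name, the code can be shared freely while the value stays in the private `.env` file.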
