
📄 File Formats

ScholaRAG uses different file formats for different purposes. Here's what each one is and why it exists.

YAML Files (.yaml)

YAML stands for "YAML Ain't Markup Language" (yes, it's recursive!). Think of it as a configuration checklist, like filling out a form where you set all your preferences.

Why YAML for configuration?

  • Human-readable: Easy to read and edit, even without programming knowledge
  • Hierarchical: Shows relationships clearly with indentation (like an outline)
  • Minimal punctuation: No curly braces or commas needed in the usual block style, just clean text
  • Standard: Used across research tools, AI systems, and web services

Example: config.yaml

# Research Project Settings
project_name: "AI in Education Meta-Analysis"
research_question: "How effective is AI tutoring?"

# Which databases to search
databases:
  - semantic_scholar
  - pubmed
  - eric

# AI settings
ai_model: "claude-3-5-sonnet"
max_papers: 5000

โš ๏ธ Important: Indentation matters!

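YAML uses spaces (never tabs) for indentation. Each nested level must be indented further than its parent, and items at the same level must line up exactly; two spaces per level is the common convention. If the spacing is inconsistent, the file either fails to load or is read with a different structure than you intended.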
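
One way to catch the most common indentation mistake before running anything is to scan a YAML file for tab characters in its leading whitespace. The sketch below uses only the Python standard library; the function name `find_tab_indented_lines` is hypothetical (it is not part of ScholaRAG), and it is a quick check rather than a full YAML validator.

```python
def find_tab_indented_lines(yaml_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


if __name__ == "__main__":
    # The third line is indented with a tab instead of spaces
    sample = "databases:\n  - semantic_scholar\n\t- pubmed\n"
    print(find_tab_indented_lines(sample))
```

An empty list means no tabs were found in the indentation; it does not guarantee that the nesting itself is what you intended.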

JSON Files (.json)

JSON stands for "JavaScript Object Notation". Think of it as a structured storage container, like organizing your research data in labeled boxes within boxes.

Why JSON for data?

  • Structured: Data organized in key-value pairs (like a dictionary)
  • Machine-readable: Easy for programs to read and write
  • Flexible: Can store numbers, text, lists, and nested data
  • Universal: Supported by virtually every programming language and platform

Example: papers.json

{
  "papers": [
    {
      "title": "AI Tutoring Systems: A Meta-Analysis",
      "authors": ["Smith, J.", "Lee, K."],
      "year": 2023,
      "doi": "10.1234/example",
      "citations": 45,
      "screened": true,
      "included": false,
      "exclusion_reason": "Not RCT design"
    }
  ],
  "total_count": 503,
  "last_updated": "2024-01-15"
}

💡 In ScholaRAG:

JSON files store your fetched papers, screening results, and analysis outputs. They're like your research filing cabinet - organized and searchable.
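
To make "organized and searchable" concrete, here is a minimal sketch of reading a file shaped like the papers.json example and counting screening outcomes with Python's built-in `json` module. The helper name `summarize_screening` and the inline `SAMPLE` string are illustrative assumptions; the actual ScholaRAG scripts may structure this differently.

```python
import json

# Inline sample shaped like the papers.json example (normally read from disk)
SAMPLE = """
{
  "papers": [
    {
      "title": "AI Tutoring Systems: A Meta-Analysis",
      "year": 2023,
      "screened": true,
      "included": false,
      "exclusion_reason": "Not RCT design"
    }
  ],
  "total_count": 503
}
"""


def summarize_screening(data):
    """Count screened, included, and excluded papers in a papers.json-style dict."""
    papers = data.get("papers", [])
    screened = [p for p in papers if p.get("screened")]
    included = [p for p in screened if p.get("included")]
    return {
        "screened": len(screened),
        "included": len(included),
        "excluded": len(screened) - len(included),
    }


if __name__ == "__main__":
    data = json.loads(SAMPLE)  # JSON true/false become Python True/False
    print(summarize_screening(data))
```

For a file on disk, `json.load(open_file)` does the same job as `json.loads` does for a string.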

Markdown Files (.md)

Markdown is a simple formatting language - like writing with basic formatting shortcuts. Think of it as "Microsoft Word, but using symbols instead of toolbar buttons."

Why Markdown for documentation?

  • Simple syntax: # = heading, ** = bold, - = bullet point
  • Plain text: Works everywhere, never becomes outdated
  • Version control friendly: Easy to track changes in Git
  • Converts easily: Can become PDF, HTML, Word docs

You write this:

# Methods

## Inclusion Criteria

- Published 2020-2024
- **RCT design**
- Sample size > 30

> Important: Must report
> effect sizes.

It becomes this:

Methods

Inclusion Criteria

  • Published 2020-2024
  • RCT design
  • Sample size > 30

  Important: Must report effect sizes.

💡 In ScholaRAG:

All prompts (01-07.md) and documentation are written in Markdown. It's the universal language for research documentation and GitHub.

Python Files (.py)

A .py file contains Python code - the actual instructions that make things happen. We covered Python earlier, but here's what the file itself represents.

Structure of a Python script:

  • Imports: Loading tools and libraries (like importing cookware)
  • Configuration: Setting up variables and settings
  • Functions: Reusable blocks of code (like sub-recipes)
  • Main execution: The actual work that runs when you execute the script

Example: 01_fetch_papers.py (simplified)

# 1. IMPORTS - Load tools
import requests
from datetime import datetime

# 2. CONFIGURATION - Settings
API_KEY = "your-key-here"
MAX_PAPERS = 5000

# 3. FUNCTIONS - Reusable logic
def fetch_from_database(query):
    """Fetch papers from API"""
    papers = []  # placeholder so the example runs; the real code fills this list
    # ... code here ...
    return papers

# 4. MAIN EXECUTION - What runs
if __name__ == "__main__":
    results = fetch_from_database("AI tutoring")
    print(f"Found {len(results)} papers")

โš ๏ธ Don't edit Python files unless:

You know what you're doing! Changing code can break the entire pipeline. Start with configuration files (YAML) instead.

Environment Files (.env)

A .env file stores secret information like passwords and API keys. Think of it as your personal keychain - you don't share it with anyone!

🚨 CRITICAL: Security Rules

  • Never share: .env files contain sensitive secrets
  • Never commit to Git: These should NOT be uploaded to GitHub
  • Use .env.example: Share templates without real keys
  • Regenerate if exposed: If leaked, create new API keys immediately

โŒ Bad: .env (real secrets)

# NEVER SHARE THIS FILE!
ANTHROPIC_API_KEY=sk-ant-abc123...
OPENAI_API_KEY=sk-proj-xyz789...
DATABASE_PASSWORD=MySecret123

โš ๏ธ Do NOT upload to GitHub!

✅ Good: .env.example (template)

# Share this template safely
ANTHROPIC_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
DATABASE_PASSWORD=your-password-here

✅ Safe to share as template

💡 How it works:

  1. Create .env file in project root
  2. Add your API keys: ANTHROPIC_API_KEY=sk-ant-...
  3. Python scripts load these variables when they start (commonly through a helper library such as python-dotenv), so you never paste keys into the code
  4. Keys stay private, code stays shareable
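
The steps above can be sketched with the standard library alone. This is a simplified illustration, not ScholaRAG's actual loading code: the function `load_env_file` is hypothetical, and it handles only plain `KEY=VALUE` lines. Many projects use the python-dotenv package (`from dotenv import load_dotenv`) for the same purpose.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Blank lines and lines starting with '#' are skipped. Variables that are
    already set in the environment are left untouched.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded


if __name__ == "__main__":
    if os.path.exists(".env"):
        load_env_file(".env")
    # The script refers to the key by name; the secret itself never appears in code
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    print("Key loaded" if api_key else "ANTHROPIC_API_KEY is not set")
```

Because the code only ever mentions the variable name, the script can be committed and shared while the .env file stays on your machine.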