File Formats
ScholaRAG uses different file formats for different purposes. Here's what each one is and why it exists.
YAML Files (.yaml)
YAML stands for "YAML Ain't Markup Language" (yes, it's recursive!). Think of it as a configuration checklist: like filling out a form where you set all your preferences.
Why YAML for configuration?
- Human-readable: Easy to read and edit, even without programming knowledge
- Hierarchical: Shows relationships clearly with indentation (like an outline)
- No mess: No curly braces or commas - just clean text
- Standard: Used across research tools, AI systems, and web services
Example: config.yaml
# Research Project Settings
project_name: "AI in Education Meta-Analysis"
research_question: "How effective is AI tutoring?"

# Which databases to search
databases:
  - semantic_scholar
  - pubmed
  - eric

# AI settings
ai_model: "claude-3-5-sonnet"
max_papers: 5000
Important: Indentation matters!
YAML uses spaces (never tabs) for indentation. By convention, two spaces = one level deeper. If the spacing is inconsistent, the file either fails to parse or is read with the wrong structure.
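The indentation rule can be checked mechanically before a config file is used. The sketch below is a plain style check written for illustration, not a YAML parser and not part of ScholaRAG; the function name `find_indentation_problems` and the two-space width are assumptions. Valid YAML may use other widths, so treat the second warning as a convention check only.

```python
def find_indentation_problems(yaml_text, indent_width=2):
    """Return (line_number, message) pairs for suspicious indentation."""
    problems = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no indentation meaning
        # Everything before the first non-whitespace character
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading:
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent_width != 0:
            problems.append(
                (number, f"indent of {len(leading)} spaces is not a multiple of {indent_width}")
            )
    return problems


# Hypothetical sample: line 3 uses a tab, line 4 uses three spaces
sample = "databases:\n  - semantic_scholar\n\t- pubmed\n   - eric\n"
for number, message in find_indentation_problems(sample):
    print(f"line {number}: {message}")
```

For the sample above, the check flags line 3 (tab) and line 4 (three spaces), while the correctly indented line 2 passes.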
JSON Files (.json)
JSON stands for "JavaScript Object Notation". Think of it as a structured storage container: like organizing your research data in labeled boxes within boxes.
Why JSON for data?
- Structured: Data organized in key-value pairs (like a dictionary)
- Machine-readable: Easy for programs to read and write
- Flexible: Can store numbers, text, lists, and nested data
- Universal: Works across all programming languages and platforms
Example: papers.json
{
  "papers": [
    {
      "title": "AI Tutoring Systems: A Meta-Analysis",
      "authors": ["Smith, J.", "Lee, K."],
      "year": 2023,
      "doi": "10.1234/example",
      "citations": 45,
      "screened": true,
      "included": false,
      "exclusion_reason": "Not RCT design"
    }
  ],
  "total_count": 503,
  "last_updated": "2024-01-15"
}
In ScholaRAG:
JSON files store your fetched papers, screening results, and analysis outputs. They're like your research filing cabinet - organized and searchable.
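To show why JSON is convenient for programs, here is a minimal sketch that reads data shaped like the papers.json example and counts screening outcomes. It uses only Python's standard `json` module. The field names are taken from the example above; the `summarize` helper is hypothetical, and real ScholaRAG output files may be structured differently.

```python
import json

# A shortened copy of the papers.json example, inlined so the sketch runs on its own
raw = """
{
  "papers": [
    {"title": "AI Tutoring Systems: A Meta-Analysis", "year": 2023,
     "screened": true, "included": false, "exclusion_reason": "Not RCT design"}
  ],
  "total_count": 503,
  "last_updated": "2024-01-15"
}
"""


def summarize(data):
    """Count screened and included papers and collect exclusion reasons."""
    papers = data["papers"]
    screened = [p for p in papers if p.get("screened")]
    included = [p for p in screened if p.get("included")]
    excluded = [p for p in screened if not p.get("included")]
    return {
        "screened": len(screened),
        "included": len(included),
        "excluded_reasons": [p.get("exclusion_reason") for p in excluded],
    }


data = json.loads(raw)  # JSON text -> Python dicts, lists, numbers, booleans
summary = summarize(data)
print(summary)
```

JSON `true`/`false` become Python `True`/`False`, and nested objects become dictionaries, which is what makes the "boxes within boxes" directly usable in code.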
Markdown Files (.md)
Markdown is a simple formatting language - like writing with basic formatting shortcuts. Think of it as "Microsoft Word, but using symbols instead of toolbar buttons."
Why Markdown for documentation?
- Simple syntax: # = heading, ** = bold, - = bullet point
- Plain text: Works everywhere, never becomes outdated
- Version control friendly: Easy to track changes in Git
- Converts easily: Can become PDF, HTML, Word docs
You write this:
# Methods
## Inclusion Criteria
- Published 2020-2024
- **RCT design**
- Sample size > 30
> Important: Must report
> effect sizes.
It becomes this:
Methods
Inclusion Criteria
- Published 2020-2024
- RCT design
- Sample size > 30
Important: Must report effect sizes.
In ScholaRAG:
All prompts (01-07.md) and documentation are written in Markdown. It's the universal language for research documentation and GitHub.
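The mapping from symbols to formatting can be made concrete with a toy converter. The sketch below handles only headings, bullets, and bold, one line at a time, and turns them into HTML; `render_line` is a hypothetical helper written for illustration. Real converters cover the full syntax (block quotes, nesting, links), so this is not how GitHub or any production tool renders Markdown.

```python
import html
import re


def render_line(line):
    """Convert one Markdown line to HTML (headings, bullets, bold only)."""
    heading = re.match(r"^(#{1,6})\s+(.*)$", line)
    if heading:
        level = len(heading.group(1))  # number of leading '#' characters
        tag, text = f"h{level}", heading.group(2)
    elif line.startswith("- "):
        tag, text = "li", line[2:]
    else:
        tag, text = "p", line
    text = html.escape(text)  # so a literal '>' or '<' stays visible as text
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    return f"<{tag}>{text}</{tag}>"


for source_line in ["# Methods", "## Inclusion Criteria", "- **RCT design**"]:
    print(render_line(source_line))
```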
Python Files (.py)
A .py file contains Python code - the actual instructions that make things happen. We covered Python earlier, but here's what the file itself represents.
Structure of a Python script:
- Imports: Loading tools and libraries (like importing cookware)
- Configuration: Setting up variables and settings
- Functions: Reusable blocks of code (like sub-recipes)
- Main execution: The actual work that runs when you execute the script
Example: 01_fetch_papers.py (simplified)
# 1. IMPORTS - Load tools
import requests
from datetime import datetime

# 2. CONFIGURATION - Settings
API_KEY = "your-key-here"
MAX_PAPERS = 5000

# 3. FUNCTIONS - Reusable logic
def fetch_from_database(query):
    """Fetch papers from API"""
    papers = []  # placeholder so the simplified example runs
    # ... code here ...
    return papers

# 4. MAIN EXECUTION - What runs
if __name__ == "__main__":
    results = fetch_from_database("AI tutoring")
    print(f"Found {len(results)} papers")
Don't edit Python files unless:
You know what you're doing! Changing code can break the entire pipeline. Start with configuration files (YAML) instead.
Environment Files (.env)
A .env file stores secret information like passwords and API keys. Think of it as your personal keychain - you don't share it with anyone!
CRITICAL: Security Rules
- Never share: .env files contain sensitive secrets
- Never commit to Git: These should NOT be uploaded to GitHub
- Use .env.example: Share templates without real keys
- Regenerate if exposed: If leaked, create new API keys immediately
Bad: .env (real secrets)
# NEVER SHARE THIS FILE!
ANTHROPIC_API_KEY=sk-ant-abc123...
OPENAI_API_KEY=sk-proj-xyz789...
DATABASE_PASSWORD=MySecret123
Do NOT upload to GitHub!
Good: .env.example (template)
# Share this template safely
ANTHROPIC_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
DATABASE_PASSWORD=your-password-here
Safe to share as template
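One way to produce a shareable template from a real .env file is to keep the variable names and replace every value with a placeholder. The sketch below does that on text in memory; `make_env_template`, the placeholder string, and the sample values are all hypothetical, and it assumes simple `KEY=VALUE` lines with `#` comments.

```python
def make_env_template(env_text, placeholder="your-value-here"):
    """Return env_text with every value replaced by a placeholder."""
    output = []
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            output.append(line)  # keep blank lines and comments unchanged
            continue
        key, _, _secret = stripped.partition("=")  # the secret is discarded
        output.append(f"{key.strip()}={placeholder}")
    return "\n".join(output) + "\n"


# Obviously fake values, used only to demonstrate the masking
real = "# local secrets\nANTHROPIC_API_KEY=fake-key-for-demo\nDATABASE_PASSWORD=fake-password\n"
template = make_env_template(real)
print(template)
```

Only the template text should ever be written to `.env.example`; comments are copied through, so check that none of them contain secrets before sharing.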
How it works:
- Create .env file in project root
- Add your API keys: ANTHROPIC_API_KEY=sk-ant-...
- Python scripts read these variables automatically
- Keys stay private, code stays shareable
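The "read automatically" step usually means that something loads the file into environment variables when a script starts. The source does not say which mechanism ScholaRAG uses; the python-dotenv package (`load_dotenv`) is a common choice. As an illustration only, here is a standard-library sketch of the same idea. `load_env_file` is a hypothetical helper that handles simple `KEY=VALUE` lines and nothing more.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from path into os.environ; return the keys found."""
    loaded = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")  # drop optional quotes
            os.environ.setdefault(key, value)  # never override a real env var
            loaded.append(key)
    return loaded


if os.path.exists(".env"):
    load_env_file()

# The script only ever sees the variable, never a key typed into the code
api_key = os.environ.get("ANTHROPIC_API_KEY")
print("key loaded" if api_key else "ANTHROPIC_API_KEY is not set")
```

Because the code asks the environment for `ANTHROPIC_API_KEY` instead of containing the key itself, the script can be committed and shared while the .env file stays local.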