
🛠️ Tools & Technologies

Why ScholaRAG uses specific tools and technologies.

ChromaDB - The Vector Database

ChromaDB is a vector database - a special kind of database that understands meaning, not just exact matches. We covered what a vector database is earlier, but why ChromaDB specifically?

Why ChromaDB?

Easy to use: No complex database setup - just install and run
Runs locally: Your data stays on your computer (privacy!)
Python-friendly: Integrates seamlessly with research scripts
Fast semantic search: Find similar papers in milliseconds

💡 In ScholaRAG:

ChromaDB stores all your papers as "embeddings" (meaning vectors). When you ask a question, it finds the most relevant papers based on conceptual similarity, not keyword matching.

Example: Searching "learning outcomes" will also find papers about "educational achievement" and "student performance" - without you specifying those exact terms!
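To make "conceptual similarity" concrete, here is a minimal sketch of ranking papers by how close their vectors are to a query vector. It uses cosine similarity, one common measure. The three-number vectors and paper titles are invented for illustration, and this is not ChromaDB's own code: ChromaDB handles the storage, indexing, and distance calculation for you, and its distance metric is configurable.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "embeddings"; real ones come from an embedding model
# and have far more dimensions.
papers = {
    "Educational achievement in online courses": [0.9, 0.1, 0.0],
    "Student performance and feedback": [0.8, 0.2, 0.1],
    "Soil chemistry of alpine meadows": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stands in for the query "learning outcomes"

def rank_papers(query, corpus):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_papers(query_vector, papers)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

The two education papers score close to 1 and the soil chemistry paper scores near 0, even though none of the titles contains the words "learning outcomes".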

Claude AI - The Screening Assistant

Claude is Anthropic's AI assistant - think of it as your tireless research assistant who can read hundreds of papers, apply screening criteria, and explain its reasoning.

Why Claude for screening?

Large context window: Can read entire papers in a single pass (the context window holds 200,000+ tokens)
Strong reasoning: Applies complex inclusion/exclusion criteria accurately
Explains decisions: Shows why a paper was included or excluded
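Consistent: Doesn't get tired, and applies the same criteria to the first paper and the five-hundredth (it can still make mistakes or carry its own biases, which is why human review matters)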

⚠️ AI as Assistant, Not Replacement

Claude helps with initial screening and organization, but researchers should always review final decisions. AI accelerates the process; you maintain the quality.

💡 In ScholaRAG:

Claude runs the 02_screening and 03_eligibility stages, applying your PRISMA criteria to hundreds of papers in minutes instead of weeks.
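As a rough illustration of what "applying criteria and explaining decisions" can look like in code, the sketch below builds a screening prompt and parses a structured reply. The function names, prompt wording, and JSON format are hypothetical and are not taken from ScholaRAG's scripts. The actual call to Claude is left out, so a canned reply stands in for the model's answer.

```python
import json

def build_screening_prompt(title, abstract, include, exclude):
    """Assemble a plain-text screening prompt from criteria lists (hypothetical format)."""
    lines = [
        "You are screening papers for a systematic review.",
        "Inclusion criteria:",
        *[f"- {c}" for c in include],
        "Exclusion criteria:",
        *[f"- {c}" for c in exclude],
        "",
        f"Title: {title}",
        f"Abstract: {abstract}",
        "",
        'Reply with JSON only: {"decision": "include" or "exclude", "reason": "..."}',
    ]
    return "\n".join(lines)

def parse_decision(reply_text):
    """Parse the model's JSON reply; anything unexpected is flagged for human review."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return {"decision": "needs_review", "reason": "reply was not valid JSON"}
    if not isinstance(data, dict) or data.get("decision") not in ("include", "exclude"):
        return {"decision": "needs_review", "reason": "unexpected decision value"}
    return {"decision": data["decision"], "reason": str(data.get("reason", ""))}

prompt = build_screening_prompt(
    title="Chatbots and learning outcomes",
    abstract="We study the effect of AI tutors on student performance.",
    include=["Empirical study", "Reports learning outcomes"],
    exclude=["Not peer-reviewed"],
)
canned_reply = '{"decision": "include", "reason": "Empirical study reporting outcomes."}'
print(parse_decision(canned_reply))
```

Routing malformed replies to a "needs_review" bucket keeps the researcher in the loop instead of silently guessing a decision.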

OpenAI - The Embedding Engine

OpenAI (the company behind ChatGPT) provides the embedding models that convert your papers into semantic vectors. Think of it as the translator that turns text into meaning-coordinates.

Why OpenAI embeddings?

Industry standard: text-embedding-3-small is fast, accurate, and affordable
Semantic quality: Captures nuanced meaning and context
Multilingual: Works across languages (useful for international research)
Well-documented: Easy to integrate and troubleshoot

How embeddings work:

Paper text → OpenAI API → [0.23, -0.15, 0.89, ...]

With text-embedding-3-small, the vector contains 1536 numbers by default, representing the paper's meaning in "semantic space".
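The text-to-vector step above can be sketched as a small helper. This is an illustrative sketch, not ScholaRAG's Script 04: the function name embed_texts is hypothetical, and the client argument is assumed to behave like the object returned by OpenAI() in the official openai Python package, whose embeddings.create call accepts a model name and a list of input texts.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding vector (a list of floats) per input text, in input order.

    `client` is assumed to behave like `openai.OpenAI()`.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the text it belongs to;
    # sorting by it keeps vectors aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the openai package installed and an API key set, usage would look like embed_texts(OpenAI(), ["abstract one", "abstract two"]) after "from openai import OpenAI". Passing the client in as an argument keeps the helper easy to try out with a stand-in object before spending API credits.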

💡 In ScholaRAG:

Script 04 uses OpenAI to create embeddings for all your papers, then stores them in ChromaDB for fast semantic search during the RAG stage.

GitHub - The Code Repository

GitHub is where we store and share code. Think of it as Google Drive for programmers - but with powerful features like version history, collaboration, and an off-site copy of everything you push.

Why GitHub?

Version control: Every change is tracked - you can go back to any previous version
Collaboration: Multiple researchers can work on the same project
Open source: Share your methods with the research community
Documentation: README files, wikis, and issue tracking built-in

Key GitHub concepts:

Repository (repo): A project folder containing all files and history
Commit: A saved snapshot of your changes (like "Save Version")
Clone: Download a copy of a repository to your computer
Fork: Create your own copy to customize
Pull: Download latest updates from the repository

💡 In ScholaRAG:

The ScholaRAG code lives on GitHub at github.com/HosungYou/ScholaRAG. You clone it to your computer, customize for your research, and can contribute improvements back to the community.

Git - The Version Control System

Git is the underlying technology that powers GitHub. While GitHub is the website, Git is the tool that tracks changes. Think: Git = the engine, GitHub = the car.

Why Git?

Time machine: Go back to any previous version of your code
Safe experimentation: Try changes in "branches" without breaking your main code
Accountability: See who changed what and when
Industry standard: Used by virtually all software developers

Basic Git workflow:

git clone [repository-url]    # Download project
git pull                      # Get latest updates
git add .                     # Stage your changes
git commit -m "message"       # Save snapshot
git push                      # Upload to GitHub

⚠️ For researchers:

You don't need to master Git to use ScholaRAG! Basic commands (clone, pull) are enough to get started. Think of it like using Word - you don't need to know how spell-check works internally.

Python Libraries (Packages)

Python libraries are pre-built tools that add functionality to Python. Think of them as specialized kitchen appliances - you don't build a blender from scratch, you just use one!

Key libraries in ScholaRAG:

anthropic: Communicates with Claude AI
openai: Creates semantic embeddings
chromadb: Vector database for semantic search
requests: Fetches data from academic APIs
pandas: Organizes data in tables (like Excel, but in Python)
python-dotenv: Reads API keys from .env files
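Since the API keys end up as environment variables, a quick check before running any script can save a confusing error later. The sketch below assumes the keys are named ANTHROPIC_API_KEY and OPENAI_API_KEY, which are the names the anthropic and openai packages look for by default. Whether ScholaRAG's .env file uses exactly these names is an assumption, and the helper name missing_api_keys is hypothetical. It also assumes python-dotenv's load_dotenv() has already been called if the keys live in a .env file.

```python
import os

# Assumed variable names (the defaults read by the anthropic and openai packages).
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")

def missing_api_keys(env=None):
    """Return the names of required API keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```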

Installing libraries:

pip install anthropic openai chromadb

This command downloads and installs the three libraries named in it, along with their dependencies. The other libraries in the list install the same way.

💡 In ScholaRAG:

All required libraries are listed in requirements.txt. Just run pip install -r requirements.txt and everything installs automatically!
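To confirm the install worked, a short check like the one below can help. It is a sketch using only Python's standard library. The helper name missing_modules is hypothetical, and the module list mirrors the libraries named in this section rather than the actual contents of requirements.txt.

```python
import importlib.util

# Import names for the libraries named in this section.
# Note: the python-dotenv package is imported as "dotenv".
EXPECTED_MODULES = ["anthropic", "openai", "chromadb", "requests", "pandas", "dotenv"]

def missing_modules(names):
    """Return the module names that Python cannot find in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

not_found = missing_modules(EXPECTED_MODULES)
if not_found:
    print("Not installed:", ", ".join(not_found))
else:
    print("All expected libraries are installed.")
```

If anything is reported as not installed, make sure you ran pip in the same Python environment you use to run the scripts.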
