Codebook/Tools & Technologies

๐Ÿ› ๏ธ Tools & Technologies

Why ScholaRAG uses specific tools and technologies.

ChromaDB - The Vector Database

ChromaDB is a vector database - a special kind of database that understands meaning, not just exact matches. We covered what a vector database is earlier, but why ChromaDB specifically?

Why ChromaDB?

  • Easy to use: No complex database setup - just install and run
  • Runs locally: Your data stays on your computer (privacy!)
  • Python-friendly: Integrates seamlessly with research scripts
  • Fast semantic search: Find similar papers in milliseconds

๐Ÿ’ก In ScholaRAG:

ChromaDB stores all your papers as "embeddings" (meaning vectors). When you ask a question, it finds the most relevant papers based on conceptual similarity, not keyword matching.

Example: Searching "learning outcomes" will also find papers about "educational achievement" and "student performance" - without you specifying those exact terms!

Claude AI - The Screening Assistant

Claude is Anthropic's AI assistant - think of it as your tireless research assistant who can read hundreds of papers, apply screening criteria, and explain its reasoning.

Why Claude for screening?

  • Large context window: Can read entire papers (200,000+ tokens)
  • Strong reasoning: Applies complex inclusion/exclusion criteria accurately
  • Explains decisions: Shows why a paper was included or excluded
  • Consistent: Doesn't get tired or biased like human reviewers can

โš ๏ธ AI as Assistant, Not Replacement

Claude helps with initial screening and organization, but researchers should always review final decisions. AI accelerates the process; you maintain the quality.

๐Ÿ’ก In ScholaRAG:

Claude runs the 02_screening and 03_eligibility stages, applying your PRISMA criteria to hundreds of papers in minutes instead of weeks.

OpenAI - The Embedding Engine

OpenAI (the company behind ChatGPT) provides the embedding models that convert your papers into semantic vectors. Think of it as the translator that turns text into meaning-coordinates.

Why OpenAI embeddings?

  • Industry standard: text-embedding-3-small is fast, accurate, and affordable
  • Semantic quality: Captures nuanced meaning and context
  • Multilingual: Works across languages (useful for international research)
  • Well-documented: Easy to integrate and troubleshoot

How embeddings work:

Paper textโ†’OpenAI APIโ†’[0.23, -0.15, 0.89, ...]

The vector contains 1536 numbers representing the paper's meaning in "semantic space"

๐Ÿ’ก In ScholaRAG:

Script 04 uses OpenAI to create embeddings for all your papers, then stores them in ChromaDB for fast semantic search during the RAG stage.

GitHub - The Code Repository

GitHub is where we store and share code. Think of it as Google Drive for programmers - but with powerful features like version history, collaboration, and automatic backups.

Why GitHub?

  • Version control: Every change is tracked - you can go back to any previous version
  • Collaboration: Multiple researchers can work on the same project
  • Open source: Share your methods with the research community
  • Documentation: README files, wikis, and issue tracking built-in

Key GitHub concepts:

  • Repository (repo): A project folder containing all files and history
  • Commit: A saved snapshot of your changes (like "Save Version")
  • Clone: Download a copy of a repository to your computer
  • Fork: Create your own copy to customize
  • Pull: Download latest updates from the repository

๐Ÿ’ก In ScholaRAG:

The ScholaRAG code lives on GitHub at github.com/HosungYou/researcherRAG. You clone it to your computer, customize for your research, and can contribute improvements back to the community.

Git - The Version Control System

Git is the underlying technology that powers GitHub. While GitHub is the website, Git is the tool that tracks changes. Think: Git = the engine, GitHub = the car.

Why Git?

  • Time machine: Go back to any previous version of your code
  • Safe experimentation: Try changes in "branches" without breaking your main code
  • Accountability: See who changed what and when
  • Industry standard: Used by virtually all software developers

Basic Git workflow:

git clone [repository-url]

# Download project

git pull

# Get latest updates

git add .

# Stage your changes

git commit -m "message"

# Save snapshot

git push

# Upload to GitHub

โš ๏ธ For researchers:

You don't need to master Git to use ScholaRAG! Basic commands (clone, pull) are enough to get started. Think of it like using Word - you don't need to know how spell-check works internally.

Python Libraries (Packages)

Python libraries are pre-built tools that add functionality to Python. Think of them as specialized kitchen appliances - you don't build a blender from scratch, you just use one!

Key libraries in ScholaRAG:

  • anthropic: Communicates with Claude AI
  • openai: Creates semantic embeddings
  • chromadb: Vector database for semantic search
  • requests: Fetches data from academic APIs
  • pandas: Organizes data in tables (like Excel, but in Python)
  • python-dotenv: Reads API keys from .env files

Installing libraries:

pip install anthropic openai chromadb

This command downloads and installs all necessary tools automatically!

๐Ÿ’ก In ScholaRAG:

All required libraries are listed in requirements.txt. Just run pip install -r requirements.txt and everything installs automatically!