Script Documentation

πŸ”¬ For Code Reviewers & Contributors

Deep dive into each Python script: parameters, logic, error handling, and extension points. Essential for understanding implementation details and contributing improvements.

Pipeline Scripts (7 Stages)

01_fetch_papers.py

Purpose

Fetches papers from Semantic Scholar, OpenAlex, and arXiv using the search query configured in config.yaml

Command Line Usage

python scripts/01_fetch_papers.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory containing config.yaml

config.yaml Dependencies

  • ⚠️ search_query.simple β†’ Main search query string
  • databases.open_access.* β†’ Which databases to search
  • retrieval_settings.year_range β†’ Filter papers by publication year

Core Logic

  1. Loads config.yaml and reads the search query
  2. For each enabled database (Semantic Scholar, OpenAlex, arXiv):
     - Constructs an API request with the query and filters
     - Fetches results with pagination, handling rate limits (see the sketch below)
     - Parses the response and extracts metadata (title, abstract, DOI, etc.)
     - Saves to data/01_identification/{database}_results.csv
  3. Logs the total papers fetched from each database
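
A minimal sketch of the pagination in step 2, assuming the offset/limit parameters the Semantic Scholar search endpoint accepts; the max_results cap and the loop structure are illustrative, not the script's exact code.

import requests

def fetch_all_pages(query: str, limit: int = 100, max_results: int = 1000) -> list:
    """Page through search results until the last page or a result cap."""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    papers, offset = [], 0
    while offset < max_results:
        params = {"query": query, "limit": limit, "offset": offset}
        response = requests.get(url, params=params)
        response.raise_for_status()
        batch = response.json().get("data", [])
        papers.extend(batch)
        if len(batch) < limit:  # short page: nothing left to fetch
            break
        offset += limit
    return papers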

Error Handling

❌ API rate limit exceeded

β†’ Script automatically retries with exponential backoff. If the problem persists, add a Semantic Scholar API key to .env

❌ No results found

β†’ Query too narrow. Broaden search terms or remove year constraints in config.yaml

❌ Network timeout

β†’ Check internet connection. Script will retry failed requests automatically

Extension Points

πŸ’‘ How to Extend This Script

  • Add new database: Create new method fetch_from_X(), add to database loop, update config.yaml schema
  • Custom filters: Add field-specific filters (e.g., only open access) by modifying API request parameters

Key Code Snippet

import requests
from typing import Dict, List

def fetch_from_semantic_scholar(self, query: str) -> List[Dict]:
    """Fetch papers from the Semantic Scholar API"""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        "query": query,
        "fields": "title,abstract,authors,year,doi,openAccessPdf",
        "limit": 100
    }

    # Add API key if available
    headers = {}
    if self.api_key:
        headers["x-api-key"] = self.api_key

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    return response.json().get("data", [])

02_deduplicate.py

Purpose

Removes duplicate papers using DOI, arXiv ID, and title similarity

Command Line Usage

python scripts/02_deduplicate.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • None β†’ This script does not read config.yaml

Core Logic

  1. Loads all CSV files from data/01_identification/
  2. Combines them into a single DataFrame
  3. Applies the deduplication strategy, in order (see the sketch below):
     1. Exact DOI match β†’ keep first occurrence
     2. Exact arXiv ID match β†’ keep first occurrence
     3. Title similarity > 90% (fuzzy matching) β†’ keep first occurrence
  4. Saves deduplicated results to data/01_identification/deduplicated.csv
  5. Logs: X papers β†’ Y papers (Z duplicates removed)
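
A minimal sketch of that ordered strategy with pandas; the doi, arxiv_id, and title column names are assumptions about the combined CSV schema, and the fuzzy pass is quadratic in the number of papers.

import pandas as pd
from difflib import SequenceMatcher

def deduplicate(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop duplicates by DOI, then arXiv ID, then fuzzy title matching."""
    # 1. Exact DOI match: drop later occurrences, keep rows with no DOI
    df = df[~(df["doi"].notna() & df.duplicated("doi"))]
    # 2. Exact arXiv ID match, same rule
    df = df[~(df["arxiv_id"].notna() & df.duplicated("arxiv_id"))]
    # 3. Fuzzy title pass: keep the first of any near-identical pair (O(n^2))
    keep, seen = [], []
    for idx, title in df["title"].items():
        lowered = str(title).lower()
        if any(SequenceMatcher(None, lowered, s).ratio() >= threshold for s in seen):
            continue
        seen.append(lowered)
        keep.append(idx)
    return df.loc[keep]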

Error Handling

❌ No CSV files found

β†’ Run 01_fetch_papers.py first to generate identification data

❌ Empty CSV files

β†’ Check if 01_fetch_papers.py succeeded. Verify API keys if needed

Extension Points

πŸ’‘ How to Extend This Script

  • Custom similarity threshold: Adjust SIMILARITY_THRESHOLD constant (default 0.9) for stricter/looser matching
  • Manual review: Add --interactive flag to review borderline duplicates before removing

Key Code Snippet

from difflib import SequenceMatcher

def is_duplicate_title(title1: str, title2: str, threshold=0.9) -> bool:
    """Check if two titles are similar enough to be duplicates"""
    similarity = SequenceMatcher(None, title1.lower(), title2.lower()).ratio()
    return similarity >= threshold

03_screen_papers.py

Purpose

AI-assisted screening using Claude API with project_type-aware thresholds

Command Line Usage

python scripts/03_screen_papers.py --project projects/my-project

Parameters

Parameter       Type    Required    Description
--project       Path    βœ“ Yes       Path to project directory
--batch-size    int     β—‹ No        Number of papers to screen per API call (default: 1)

config.yaml Dependencies

  • ⚠️ project_type β†’ CRITICAL: Sets screening thresholds
  • ⚠️ ai_prisma_rubric.decision_confidence.auto_include β†’ Minimum confidence % to auto-include
  • ⚠️ ai_prisma_rubric.decision_confidence.auto_exclude β†’ Maximum confidence % to auto-exclude
  • ai_prisma_rubric.human_validation.required β†’ Whether to prompt for human review
  • ⚠️ research_question β†’ Used in AI prompt for relevance assessment

Core Logic

  1. Loads config.yaml and sets thresholds based on project_type:
     - knowledge_repository: auto_include=50, auto_exclude=20, no human review
     - systematic_review: auto_include=90, auto_exclude=10, optional human review
  2. For each paper in deduplicated.csv:
     - Constructs the prompt: "Is this relevant to [research_question]?"
     - Calls the Claude API with the title + abstract
     - Parses the response: relevance score (0-100) + reasoning
     - Classifies: Include (β‰₯ threshold), Exclude (< threshold), or Review (borderline; see the sketch below)
  3. Saves relevant.csv, excluded.csv, and review_queue.csv
  4. If human review is required: prompts the user to review borderline cases
Error Handling

❌ ANTHROPIC_API_KEY not found

β†’ Add ANTHROPIC_API_KEY=sk-ant-... to .env file
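
The key is read from .env at startup; a minimal sketch, assuming the python-dotenv package is installed:

import os

from dotenv import load_dotenv

load_dotenv()  # copies entries from .env into os.environ
api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
    raise SystemExit("ANTHROPIC_API_KEY not found - add it to .env")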

❌ Rate limit exceeded

β†’ Script pauses and retries automatically. If limits are hit repeatedly, request a higher API rate limit from Anthropic

❌ Paper has no abstract

β†’ Script auto-excludes papers without abstracts (cannot assess relevance)

Extension Points

πŸ’‘ How to Extend This Script

  • Multi-criteria scoring: Modify prompt to score on multiple dimensions (methodology, population, outcomes) instead of single relevance score
  • Batch processing: Increase --batch-size to screen multiple papers per API call (faster but less accurate)

Key Code Snippet

import yaml

def load_config(self):
    """Load config and set thresholds based on project_type"""
    config_file = self.project_path / "config.yaml"
    with open(config_file) as f:
        self.config = yaml.safe_load(f)

    project_type = self.config.get('project_type', 'systematic_review')

    if project_type == 'knowledge_repository':
        # Lenient thresholds for comprehensive coverage
        self.screening_threshold = 50
        self.exclude_threshold = 20
        self.require_human_review = False
    else:
        # Strict thresholds for systematic review
        self.screening_threshold = 90
        self.exclude_threshold = 10
        self.require_human_review = True

04_download_pdfs.py

Purpose

Downloads PDFs from open_access URLs with retry logic and error handling

Command Line Usage

python scripts/04_download_pdfs.py --project projects/my-project

Parameters

Parameter        Type    Required    Description
--project        Path    βœ“ Yes       Path to project directory
--max-workers    int     β—‹ No        Number of parallel downloads (default: 5)

config.yaml Dependencies

  • None β†’ Reads from data/02_screening/relevant.csv

Core Logic

  1. Loads relevant.csv (papers that passed screening)
  2. For each paper with a pdf_url, in parallel up to --max-workers downloads at once (see the sketch below):
     - Attempts the download with a 30-second timeout
     - Saves to data/pdfs/{doi or arxiv_id}.pdf
     - On failure: retries up to 3 times with exponential backoff
     - If it still fails: logs the error and continues
  3. Reports: X/Y PDFs downloaded successfully
  4. Creates data/pdfs/failed.csv listing papers that failed to download
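
A minimal sketch of how the --max-workers parallelism could be wired around the download_pdf method shown below; the papers list of dicts with pdf_url and output_path keys is an assumed shape.

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def download_all(self, papers: list, max_workers: int = 5) -> list:
    """Download PDFs in parallel; return the papers whose downloads failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(self.download_pdf, paper["pdf_url"], Path(paper["output_path"])): paper
            for paper in papers
        }
        for future in as_completed(futures):
            if not future.result():  # download_pdf returns False after final retry
                failed.append(futures[future])
    return failed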

Error Handling

❌ Download timeout

β†’ Increase timeout in script or skip large PDFs (>50MB). Failed papers logged to failed.csv

❌ 403 Forbidden / 404 Not Found

β†’ PDF URL expired or restricted. Paper will be excluded from RAG (but kept in metadata)

❌ Disk space full

β†’ Free up space or move data/ to external drive

Extension Points

πŸ’‘ How to Extend This Script

  • Publisher authentication: Add institution proxy support for paywalled papers (requires login credentials)
  • OCR fallback: For scanned PDFs, add OCR processing using Tesseract

Key Code Snippet

import time
from pathlib import Path

import requests

def download_pdf(self, url: str, output_path: Path, retries=3):
    """Download PDF with retry logic"""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30, stream=True)
            response.raise_for_status()

            with open(output_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            return True
        except Exception as e:
            if attempt == retries - 1:
                print(f"❌ Failed after {retries} attempts: {e}")
                return False
            time.sleep(2 ** attempt)  # Exponential backoff

05_build_rag.py

Purpose

Chunks PDFs, generates embeddings, and stores in ChromaDB vector database

Command Line Usage

python scripts/05_build_rag.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • ⚠️ rag_settings.embedding_model β†’ Which embedding model to use (affects quality)
  • rag_settings.chunk_size β†’ Size of text chunks (default: 1000)
  • rag_settings.chunk_overlap β†’ Overlap between chunks (default: 200)
  • rag_settings.vector_db β†’ Database backend (chromadb or faiss)

Core Logic

  1. Loads all PDFs from data/pdfs/
  2. For each PDF (see the sketch below):
     - Extracts text using PyMuPDF
     - Splits it into chunks (chunk_size with chunk_overlap)
     - Generates embeddings using the OpenAI API or a local model
     - Stores chunks + embeddings in ChromaDB
  3. Creates a collection in data/chroma/ with metadata:
     - paper_id, title, authors, year, doi
     - chunk_index, page_number
  4. Reports: X papers β†’ Y chunks β†’ ChromaDB
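
A minimal sketch of steps 2-3, assuming PyMuPDF (imported as fitz) and ChromaDB's PersistentClient; for brevity it relies on ChromaDB's built-in default embedding function rather than the configurable model the script actually uses, and chunker stands in for the chunk_text method shown below.

import chromadb
import fitz  # PyMuPDF

def index_pdf(pdf_path: str, paper_id: str, chunker) -> None:
    """Extract text page by page, chunk it, and store the chunks in ChromaDB."""
    client = chromadb.PersistentClient(path="data/chroma")
    collection = client.get_or_create_collection("papers")

    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for chunk_index, chunk in enumerate(chunker(page.get_text())):
            collection.add(
                ids=[f"{paper_id}-p{page_number}-c{chunk_index}"],
                documents=[chunk],
                metadatas=[{
                    "paper_id": paper_id,
                    "page_number": page_number,
                    "chunk_index": chunk_index,
                }],
            )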

Error Handling

❌ OPENAI_API_KEY not found

β†’ Add OPENAI_API_KEY=sk-... to .env if using OpenAI embeddings

❌ PDF extraction failed (corrupted PDF)

β†’ Script skips corrupted PDFs and logs warning. Paper metadata kept but no embeddings

❌ Out of memory

β†’ Reduce chunk_size or process PDFs in smaller batches

Extension Points

πŸ’‘ How to Extend This Script

  • Custom chunking strategy: Instead of fixed-size chunks, split by sections (Introduction, Methods, etc.)
  • Multi-modal embeddings: Extract figures/tables and embed separately using vision models

Key Code Snippet

def chunk_text(self, text: str, chunk_size: int, overlap: int):
    """Split text into overlapping chunks"""
    if overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars of context

    return chunks

06_query_rag.py

Purpose

Interactive RAG query system with semantic search and LLM answer generation

Command Line Usage

python scripts/06_query_rag.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory
--query      str     β—‹ No        Direct query (non-interactive mode)

config.yaml Dependencies

  • rag_settings.retrieval_k β†’ Number of chunks to retrieve (default: 10)
  • ⚠️ rag_settings.llm β†’ Which LLM to use for answer generation
  • rag_settings.llm_temperature β†’ Randomness of answers (0.0-1.0)

Core Logic

  1. Loads the ChromaDB collection from data/chroma/
  2. Starts an interactive loop (see the sketch below):
     - User enters a query
     - Generates the query embedding
     - Searches ChromaDB for the top-k most similar chunks
     - Constructs the prompt: "Answer based on these papers: [chunks]"
     - Calls the LLM (Claude/GPT) to generate an answer
     - Displays the answer with paper citations
  3. User can ask follow-up questions in the same session
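
The loop itself is a small REPL around the query() method shown in the Key Code Snippet below; a minimal sketch (the title and year metadata fields are assumptions):

def run_interactive(self) -> None:
    """Minimal REPL: read a question, answer it, repeat until the user quits."""
    while True:
        question = input("\nQuery (or 'quit'): ").strip()
        if question.lower() in {"quit", "exit", ""}:
            break
        result = self.query(question)
        print(f"\n{result['answer']}\n")
        for source in result["sources"]:  # paper citations from ChromaDB metadata
            print(f"  - {source.get('title')} ({source.get('year')})")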

Error Handling

❌ ChromaDB not found

β†’ Run 05_build_rag.py first to create vector database

❌ No relevant chunks found

β†’ Query too specific or outside domain. Try broader query terms

❌ LLM timeout

β†’ Reduce retrieval_k to retrieve fewer chunks (less context)

Extension Points

πŸ’‘ How to Extend This Script

  • Conversation history: Maintain conversation context across multiple queries for follow-up questions
  • Citation formatting: Auto-format citations in APA/MLA style

Key Code Snippet

def query(self, question: str):
    """Query RAG system and generate answer"""
    # Retrieve relevant chunks
    results = self.collection.query(
        query_texts=[question],
        n_results=self.k
    )

    # Construct prompt with context
    context = "\n\n".join(results['documents'][0])
    prompt = f"Answer this question based on the papers:\n{context}\n\nQuestion: {question}"

    # Generate answer
    response = self.llm.generate(prompt)

    return {
        "answer": response,
        "sources": results['metadatas'][0]  # Paper citations
    }

07_generate_prisma.py

Purpose

Generates PRISMA 2020 flow diagram with project_type-aware title

Command Line Usage

python scripts/07_generate_prisma.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • ⚠️ project_type β†’ CRITICAL: Changes diagram title
  • project_name β†’ Displayed on diagram

Core Logic

  1. Collects statistics from all data/ folders (see the sketch below):
     - data/01_identification/*.csv β†’ papers fetched
     - data/01_identification/deduplicated.csv β†’ duplicates removed
     - data/02_screening/*.csv β†’ papers screened and excluded
     - data/pdfs/ β†’ PDFs downloaded
     - data/chroma/ β†’ papers in RAG
  2. Generates the PRISMA 2020 flow diagram using matplotlib
  3. Title changes based on project_type:
     - knowledge_repository β†’ "Paper Processing Pipeline"
     - systematic_review β†’ "PRISMA 2020 Flow Diagram"
  4. Saves to outputs/prisma_diagram.png
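
Collecting those counts is mostly row-counting; a minimal sketch with pandas, assuming the file names used throughout this page:

from pathlib import Path

import pandas as pd

def collect_stats(project: Path) -> dict:
    """Count papers at each pipeline stage for the PRISMA boxes."""
    ident = project / "data" / "01_identification"
    fetched = sum(len(pd.read_csv(f)) for f in ident.glob("*_results.csv"))
    deduplicated = len(pd.read_csv(ident / "deduplicated.csv"))
    included = len(pd.read_csv(project / "data" / "02_screening" / "relevant.csv"))
    pdfs = len(list((project / "data" / "pdfs").glob("*.pdf")))
    return {
        "fetched": fetched,
        "duplicates_removed": fetched - deduplicated,
        "included": included,
        "pdfs_downloaded": pdfs,
    }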

Error Handling

❌ Missing data files

β†’ Ensure all previous stages completed successfully

❌ Matplotlib font errors

β†’ Install system fonts or use default sans-serif

Extension Points

πŸ’‘ How to Extend This Script

  • Interactive PRISMA: Generate HTML version with clickable stages that show detailed breakdowns
  • Export to LaTeX: Generate TikZ code for publication-quality diagrams

Key Code Snippet

import matplotlib.pyplot as plt

def create_prisma_diagram(self, stats):
    """Generate PRISMA diagram with project-type-aware title"""
    project_type = self.config.get('project_type', 'systematic_review')

    if project_type == 'knowledge_repository':
        title = 'Paper Processing Pipeline'
        subtitle = 'Comprehensive Knowledge Repository'
    else:
        title = 'PRISMA 2020 Flow Diagram'
        subtitle = 'Systematic Literature Review'

    # Create matplotlib figure
    fig, ax = plt.subplots(figsize=(12, 14))
    ax.text(5, 13.5, title, ha='center', fontsize=16, fontweight='bold')
    ax.text(5, 13, subtitle, ha='center', fontsize=12, style='italic')

Ready to Review Code?

Check out the Code Review Guide for best practices, testing strategies, and contribution guidelines.

View Code Review Guide β†’