Script Documentation

πŸ”¬ For Code Reviewers & Contributors

Deep dive into each Python script: parameters, logic, error handling, and extension points. Essential for understanding implementation details and contributing improvements.

Pipeline Scripts (7 Stages)

01_fetch_papers.py

Purpose

Fetches papers from Semantic Scholar, OpenAlex, and arXiv using the search query configured in config.yaml

Command Line Usage

python scripts/01_fetch_papers.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory containing config.yaml

config.yaml Dependencies

  • ⚠️ search_query.simple β†’ Main search query string
  • databases.open_access.* β†’ Which databases to search
  • retrieval_settings.year_range β†’ Filter papers by publication year

Core Logic

  1. Loads config.yaml and reads the search query
  2. For each enabled database (Semantic Scholar, OpenAlex, arXiv):
     - Constructs an API request with the query and filters
     - Fetches results with pagination, handling rate limits (see the sketch below)
     - Parses the response and extracts metadata (title, abstract, DOI, etc.)
     - Saves to data/01_identification/{database}_results.csv
  3. Logs the total papers fetched from each database
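
A minimal sketch of the pagination in step 2, assuming the offset/limit parameters the Semantic Scholar search endpoint accepts; the max_results cap and the loop structure are illustrative, not the script's exact code.

import requests

def fetch_all_pages(query: str, limit: int = 100, max_results: int = 1000) -> list:
    """Page through search results until the last page or a result cap."""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    papers, offset = [], 0
    while offset < max_results:
        params = {"query": query, "limit": limit, "offset": offset}
        response = requests.get(url, params=params)
        response.raise_for_status()
        batch = response.json().get("data", [])
        papers.extend(batch)
        if len(batch) < limit:  # short page: nothing left to fetch
            break
        offset += limit
    return papers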

Error Handling

❌ API rate limit exceeded

β†’ Script automatically retries with exponential backoff. If the problem persists, add a Semantic Scholar API key to .env

❌ No results found

β†’ Query too narrow. Broaden search terms or remove year constraints in config.yaml

❌ Network timeout

β†’ Check internet connection. Script will retry failed requests automatically

Extension Points

πŸ’‘ How to Extend This Script

  • Add new database: Create new method fetch_from_X(), add to database loop, update config.yaml schema
  • Custom filters: Add field-specific filters (e.g., only open access) by modifying API request parameters

Key Code Snippet

import requests
from typing import Dict, List

def fetch_from_semantic_scholar(self, query: str) -> List[Dict]:
    """Fetch papers from the Semantic Scholar API"""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        "query": query,
        "fields": "title,abstract,authors,year,doi,openAccessPdf",
        "limit": 100
    }

    # Add API key if available
    headers = {}
    if self.api_key:
        headers["x-api-key"] = self.api_key

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    return response.json().get("data", [])

02_deduplicate.py

Purpose

Removes duplicate papers using DOI, arXiv ID, and title similarity

Command Line Usage

python scripts/02_deduplicate.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • None β†’ This script does not read config.yaml

Core Logic

  1. Loads all CSV files from data/01_identification/
  2. Combines them into a single DataFrame
  3. Applies the deduplication strategy, in order (see the sketch below):
     1. Exact DOI match β†’ keep first occurrence
     2. Exact arXiv ID match β†’ keep first occurrence
     3. Title similarity > 90% (fuzzy matching) β†’ keep first occurrence
  4. Saves deduplicated results to data/01_identification/deduplicated.csv
  5. Logs: X papers β†’ Y papers (Z duplicates removed)
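
A minimal sketch of that ordered strategy with pandas; the doi, arxiv_id, and title column names are assumptions about the combined CSV schema, and the fuzzy pass is quadratic in the number of papers.

import pandas as pd
from difflib import SequenceMatcher

def deduplicate(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop duplicates by DOI, then arXiv ID, then fuzzy title matching."""
    # 1. Exact DOI match: drop later occurrences, keep rows with no DOI
    df = df[~(df["doi"].notna() & df.duplicated("doi"))]
    # 2. Exact arXiv ID match, same rule
    df = df[~(df["arxiv_id"].notna() & df.duplicated("arxiv_id"))]
    # 3. Fuzzy title pass: keep the first of any near-identical pair (O(n^2))
    keep, seen = [], []
    for idx, title in df["title"].items():
        lowered = str(title).lower()
        if any(SequenceMatcher(None, lowered, s).ratio() >= threshold for s in seen):
            continue
        seen.append(lowered)
        keep.append(idx)
    return df.loc[keep]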

Error Handling

❌ No CSV files found

β†’ Run 01_fetch_papers.py first to generate identification data

❌ Empty CSV files

β†’ Check if 01_fetch_papers.py succeeded. Verify API keys if needed

Extension Points

πŸ’‘ How to Extend This Script

  • Custom similarity threshold: Adjust SIMILARITY_THRESHOLD constant (default 0.9) for stricter/looser matching
  • Manual review: Add --interactive flag to review borderline duplicates before removing

Key Code Snippet

from difflib import SequenceMatcher

def is_duplicate_title(title1: str, title2: str, threshold=0.9) -> bool:
    """Check if two titles are similar enough to be duplicates"""
    similarity = SequenceMatcher(None, title1.lower(), title2.lower()).ratio()
    return similarity >= threshold

03_screen_papers.py

Purpose

AI-assisted screening using Claude API with project_type-aware thresholds

Command Line Usage

python scripts/03_screen_papers.py --project projects/my-project

Parameters

Parameter       Type    Required    Description
--project       Path    βœ“ Yes       Path to project directory
--batch-size    int     β—‹ No        Number of papers to screen per API call (default: 1)

config.yaml Dependencies

  • ⚠️ project_type β†’ CRITICAL: Sets screening thresholds
  • ⚠️ ai_prisma_rubric.decision_confidence.auto_include β†’ Minimum confidence % to auto-include
  • ⚠️ ai_prisma_rubric.decision_confidence.auto_exclude β†’ Maximum confidence % to auto-exclude
  • ai_prisma_rubric.human_validation.required β†’ Whether to prompt for human review
  • ⚠️ research_question β†’ Used in AI prompt for relevance assessment

Core Logic

  1. Loads config.yaml and sets thresholds based on project_type:
     - knowledge_repository: auto_include=50, auto_exclude=20, no human review
     - systematic_review: auto_include=90, auto_exclude=10, optional human review
  2. For each paper in deduplicated.csv:
     - Constructs the prompt: "Is this relevant to [research_question]?"
     - Calls the Claude API with the title + abstract
     - Parses the response: relevance score (0-100) + reasoning
     - Classifies: Include (β‰₯ threshold), Exclude (< threshold), or Review (borderline; see the sketch below)
  3. Saves relevant.csv, excluded.csv, and review_queue.csv
  4. If human review is required: prompts the user to review borderline cases
Error Handling

❌ ANTHROPIC_API_KEY not found

β†’ Add ANTHROPIC_API_KEY=sk-ant-... to .env file
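
The key is read from .env at startup; a minimal sketch, assuming the python-dotenv package is installed:

import os

from dotenv import load_dotenv

load_dotenv()  # copies entries from .env into os.environ
api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
    raise SystemExit("ANTHROPIC_API_KEY not found - add it to .env")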

❌ Rate limit exceeded

β†’ Script pauses and retries automatically. If limits are hit repeatedly, request a higher API rate limit from Anthropic

❌ Paper has no abstract

β†’ Script auto-excludes papers without abstracts (cannot assess relevance)

Extension Points

πŸ’‘ How to Extend This Script

  • Multi-criteria scoring: Modify prompt to score on multiple dimensions (methodology, population, outcomes) instead of single relevance score
  • Batch processing: Increase --batch-size to screen multiple papers per API call (faster but less accurate)

Key Code Snippet

import yaml

def load_config(self):
    """Load config and set thresholds based on project_type"""
    config_file = self.project_path / "config.yaml"
    with open(config_file) as f:
        self.config = yaml.safe_load(f)

    project_type = self.config.get('project_type', 'systematic_review')

    if project_type == 'knowledge_repository':
        # Lenient thresholds for comprehensive coverage
        self.screening_threshold = 50
        self.exclude_threshold = 20
        self.require_human_review = False
    else:
        # Strict thresholds for systematic review
        self.screening_threshold = 90
        self.exclude_threshold = 10
        self.require_human_review = True

04_download_pdfs.py

Purpose

Downloads PDFs from open_access URLs with retry logic and error handling

Command Line Usage

python scripts/04_download_pdfs.py --project projects/my-project

Parameters

Parameter        Type    Required    Description
--project        Path    βœ“ Yes       Path to project directory
--max-workers    int     β—‹ No        Number of parallel downloads (default: 5)

config.yaml Dependencies

  • None β†’ Reads from data/02_screening/relevant.csv

Core Logic

  1. Loads relevant.csv (papers that passed screening)
  2. For each paper with a pdf_url, in parallel up to --max-workers downloads at once (see the sketch below):
     - Attempts the download with a 30-second timeout
     - Saves to data/pdfs/{doi or arxiv_id}.pdf
     - On failure: retries up to 3 times with exponential backoff
     - If it still fails: logs the error and continues
  3. Reports: X/Y PDFs downloaded successfully
  4. Creates data/pdfs/failed.csv listing papers that failed to download
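
A minimal sketch of how the --max-workers parallelism could be wired around the download_pdf method shown below; the papers list of dicts with pdf_url and output_path keys is an assumed shape.

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def download_all(self, papers: list, max_workers: int = 5) -> list:
    """Download PDFs in parallel; return the papers whose downloads failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(self.download_pdf, paper["pdf_url"], Path(paper["output_path"])): paper
            for paper in papers
        }
        for future in as_completed(futures):
            if not future.result():  # download_pdf returns False after final retry
                failed.append(futures[future])
    return failed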

Error Handling

❌ Download timeout

β†’ Increase timeout in script or skip large PDFs (>50MB). Failed papers logged to failed.csv

❌ 403 Forbidden / 404 Not Found

β†’ PDF URL expired or restricted. Paper will be excluded from RAG (but kept in metadata)

❌ Disk space full

β†’ Free up space or move data/ to external drive

Extension Points

πŸ’‘ How to Extend This Script

  • Publisher authentication: Add institution proxy support for paywalled papers (requires login credentials)
  • OCR fallback: For scanned PDFs, add OCR processing using Tesseract

Key Code Snippet

import time
from pathlib import Path

import requests

def download_pdf(self, url: str, output_path: Path, retries=3):
    """Download PDF with retry logic"""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30, stream=True)
            response.raise_for_status()

            with open(output_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            return True
        except Exception as e:
            if attempt == retries - 1:
                print(f"❌ Failed after {retries} attempts: {e}")
                return False
            time.sleep(2 ** attempt)  # Exponential backoff

05_build_rag.py

Purpose

Chunks PDFs, generates embeddings, and stores in ChromaDB vector database

Command Line Usage

python scripts/05_build_rag.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • ⚠️ rag_settings.embedding_model β†’ Which embedding model to use (affects quality)
  • rag_settings.chunk_size β†’ Size of text chunks (default: 1000)
  • rag_settings.chunk_overlap β†’ Overlap between chunks (default: 200)
  • rag_settings.vector_db β†’ Database backend (chromadb or faiss)

Core Logic

  1. Loads all PDFs from data/pdfs/
  2. For each PDF (see the sketch below):
     - Extracts text using PyMuPDF
     - Splits it into chunks (chunk_size with chunk_overlap)
     - Generates embeddings using the OpenAI API or a local model
     - Stores chunks + embeddings in ChromaDB
  3. Creates a collection in data/chroma/ with metadata:
     - paper_id, title, authors, year, doi
     - chunk_index, page_number
  4. Reports: X papers β†’ Y chunks β†’ ChromaDB
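
A minimal sketch of steps 2-3, assuming PyMuPDF (imported as fitz) and ChromaDB's PersistentClient; for brevity it relies on ChromaDB's built-in default embedding function rather than the configurable model the script actually uses, and chunker stands in for the chunk_text method shown below.

import chromadb
import fitz  # PyMuPDF

def index_pdf(pdf_path: str, paper_id: str, chunker) -> None:
    """Extract text page by page, chunk it, and store the chunks in ChromaDB."""
    client = chromadb.PersistentClient(path="data/chroma")
    collection = client.get_or_create_collection("papers")

    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for chunk_index, chunk in enumerate(chunker(page.get_text())):
            collection.add(
                ids=[f"{paper_id}-p{page_number}-c{chunk_index}"],
                documents=[chunk],
                metadatas=[{
                    "paper_id": paper_id,
                    "page_number": page_number,
                    "chunk_index": chunk_index,
                }],
            )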

Error Handling

❌ OPENAI_API_KEY not found

β†’ Add OPENAI_API_KEY=sk-... to .env if using OpenAI embeddings

❌ PDF extraction failed (corrupted PDF)

β†’ Script skips corrupted PDFs and logs warning. Paper metadata kept but no embeddings

❌ Out of memory

β†’ Reduce chunk_size or process PDFs in smaller batches

Extension Points

πŸ’‘ How to Extend This Script

  • Custom chunking strategy: Instead of fixed-size chunks, split by sections (Introduction, Methods, etc.)
  • Multi-modal embeddings: Extract figures/tables and embed separately using vision models

Key Code Snippet

def chunk_text(self, text: str, chunk_size: int, overlap: int):
    """Split text into overlapping chunks"""
    if overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars of context

    return chunks

06_query_rag.py

Purpose

Interactive RAG query system with semantic search and LLM answer generation

Command Line Usage

python scripts/06_query_rag.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory
--query      str     β—‹ No        Direct query (non-interactive mode)

config.yaml Dependencies

  • rag_settings.retrieval_k β†’ Number of chunks to retrieve (default: 10)
  • ⚠️ rag_settings.llm β†’ Which LLM to use for answer generation
  • rag_settings.llm_temperature β†’ Randomness of answers (0.0-1.0)

Core Logic

  1. Loads the ChromaDB collection from data/chroma/
  2. Starts an interactive loop (see the sketch below):
     - User enters a query
     - Generates the query embedding
     - Searches ChromaDB for the top-k most similar chunks
     - Constructs the prompt: "Answer based on these papers: [chunks]"
     - Calls the LLM (Claude/GPT) to generate an answer
     - Displays the answer with paper citations
  3. User can ask follow-up questions in the same session
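
The loop itself is a small REPL around the query() method shown in the Key Code Snippet below; a minimal sketch (the title and year metadata fields are assumptions):

def run_interactive(self) -> None:
    """Minimal REPL: read a question, answer it, repeat until the user quits."""
    while True:
        question = input("\nQuery (or 'quit'): ").strip()
        if question.lower() in {"quit", "exit", ""}:
            break
        result = self.query(question)
        print(f"\n{result['answer']}\n")
        for source in result["sources"]:  # paper citations from ChromaDB metadata
            print(f"  - {source.get('title')} ({source.get('year')})")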

Error Handling

❌ ChromaDB not found

β†’ Run 05_build_rag.py first to create vector database

❌ No relevant chunks found

β†’ Query too specific or outside domain. Try broader query terms

❌ LLM timeout

β†’ Reduce retrieval_k to retrieve fewer chunks (less context)

Extension Points

πŸ’‘ How to Extend This Script

  • Conversation history: Maintain conversation context across multiple queries for follow-up questions
  • Citation formatting: Auto-format citations in APA/MLA style

Key Code Snippet

def query(self, question: str):
    """Query RAG system and generate answer"""
    # Retrieve relevant chunks
    results = self.collection.query(
        query_texts=[question],
        n_results=self.k
    )

    # Construct prompt with context
    context = "\n\n".join(results['documents'][0])
    prompt = f"Answer this question based on the papers:\n{context}\n\nQuestion: {question}"

    # Generate answer
    response = self.llm.generate(prompt)

    return {
        "answer": response,
        "sources": results['metadatas'][0]  # Paper citations
    }

07_generate_prisma.py

Purpose

Generates PRISMA 2020 flow diagram with project_type-aware title

Command Line Usage

python scripts/07_generate_prisma.py --project projects/my-project

Parameters

Parameter    Type    Required    Description
--project    Path    βœ“ Yes       Path to project directory

config.yaml Dependencies

  • ⚠️ project_type β†’ CRITICAL: Changes diagram title
  • project_name β†’ Displayed on diagram

Core Logic

  1. Collects statistics from all data/ folders (see the sketch below):
     - data/01_identification/*.csv β†’ papers fetched
     - data/01_identification/deduplicated.csv β†’ duplicates removed
     - data/02_screening/*.csv β†’ papers screened and excluded
     - data/pdfs/ β†’ PDFs downloaded
     - data/chroma/ β†’ papers in RAG
  2. Generates the PRISMA 2020 flow diagram using matplotlib
  3. Title changes based on project_type:
     - knowledge_repository β†’ "Paper Processing Pipeline"
     - systematic_review β†’ "PRISMA 2020 Flow Diagram"
  4. Saves to outputs/prisma_diagram.png
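
Collecting those counts is mostly row-counting; a minimal sketch with pandas, assuming the file names used throughout this page:

from pathlib import Path

import pandas as pd

def collect_stats(project: Path) -> dict:
    """Count papers at each pipeline stage for the PRISMA boxes."""
    ident = project / "data" / "01_identification"
    fetched = sum(len(pd.read_csv(f)) for f in ident.glob("*_results.csv"))
    deduplicated = len(pd.read_csv(ident / "deduplicated.csv"))
    included = len(pd.read_csv(project / "data" / "02_screening" / "relevant.csv"))
    pdfs = len(list((project / "data" / "pdfs").glob("*.pdf")))
    return {
        "fetched": fetched,
        "duplicates_removed": fetched - deduplicated,
        "included": included,
        "pdfs_downloaded": pdfs,
    }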

Error Handling

❌ Missing data files

β†’ Ensure all previous stages completed successfully

❌ Matplotlib font errors

β†’ Install system fonts or use default sans-serif

Extension Points

πŸ’‘ How to Extend This Script

  • Interactive PRISMA: Generate HTML version with clickable stages that show detailed breakdowns
  • Export to LaTeX: Generate TikZ code for publication-quality diagrams

Key Code Snippet

import matplotlib.pyplot as plt

def create_prisma_diagram(self, stats):
    """Generate PRISMA diagram with project-type-aware title"""
    project_type = self.config.get('project_type', 'systematic_review')

    if project_type == 'knowledge_repository':
        title = 'Paper Processing Pipeline'
        subtitle = 'Comprehensive Knowledge Repository'
    else:
        title = 'PRISMA 2020 Flow Diagram'
        subtitle = 'Systematic Literature Review'

    # Create matplotlib figure
    fig, ax = plt.subplots(figsize=(12, 14))
    ax.text(5, 13.5, title, ha='center', fontsize=16, fontweight='bold')
    ax.text(5, 13, subtitle, ha='center', fontsize=12, style='italic')

Ready to Review Code?

Check out the Code Review Guide for best practices, testing strategies, and contribution guidelines.

View Code Review Guide β†’