Script Documentation
🔬 For Code Reviewers & Contributors
Deep dive into each Python script: parameters, logic, error handling, and extension points. Essential for understanding implementation details and contributing improvements.
Pipeline Scripts (7 Stages)
- 01_fetch_papers.py (Stage 5: Fetch): Fetches papers from Semantic Scholar, OpenAlex, and arXiv using the configured query
- 02_deduplicate.py (Stage 5: Dedup): Removes duplicate papers using DOI, arXiv ID, and title similarity
- 03_screen_papers.py (Stage 5: Screen): AI-assisted screening using the Claude API with project_type-aware thresholds
- 04_download_pdfs.py (Stage 5: Download): Downloads PDFs from open_access URLs with retry logic and error handling
- 05_build_rag.py (Stage 5: RAG Build): Chunks PDFs, generates embeddings, and stores them in a ChromaDB vector database
- 06_query_rag.py (Stage 6: Query): Interactive RAG query system with semantic search and LLM answer generation
- 07_generate_prisma.py (Stage 7: PRISMA): Generates the PRISMA 2020 flow diagram with a project_type-aware title
01_fetch_papers.py
Purpose
Fetches papers from Semantic Scholar, OpenAlex, and arXiv using configured query
Command Line Usage
```bash
python scripts/01_fetch_papers.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory containing config.yaml |
config.yaml Dependencies
- ⚠️ search_query.simple → Main search query string
- databases.open_access.* → Which databases to search
- retrieval_settings.year_range → Filter papers by publication year
Core Logic
- Loads config.yaml and reads the search query
- For each enabled database (Semantic Scholar, OpenAlex, arXiv):
  - Constructs an API request with the query and filters
  - Fetches results with pagination and rate-limit handling (see the sketch below)
  - Parses the response and extracts metadata (title, abstract, DOI, etc.)
  - Saves results to data/01_identification/{database}_results.csv
- Logs the total papers fetched from each database
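The paging loop is worth seeing in full. Below is a minimal sketch of how stage 1 might page through Semantic Scholar results while respecting rate limits; the offset/limit parameters and the 429 status code are part of the real API, but the function name and the caps are illustrative.

```python
import time

import requests

def fetch_all_pages(query: str, page_size: int = 100, max_results: int = 1000) -> list[dict]:
    """Illustrative paging loop: walk search results, backing off on HTTP 429."""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    papers, offset = [], 0
    while offset < max_results:
        resp = requests.get(url, params={"query": query, "offset": offset, "limit": page_size})
        if resp.status_code == 429:  # rate limited: wait, then retry the same page
            time.sleep(5)
            continue
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:  # no more results
            break
        papers.extend(batch)
        offset += page_size
    return papers
```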
Error Handling
❌ API rate limit exceeded
→ The script automatically retries with exponential backoff. If the problem persists, add a Semantic Scholar API key to .env
❌ No results found
→ Query too narrow. Broaden the search terms or remove year constraints in config.yaml
❌ Network timeout
→ Check your internet connection. The script retries failed requests automatically
Extension Points
💡 How to Extend This Script
- Add new database: Create a new fetch_from_X() method, add it to the database loop, and update the config.yaml schema (a sketch follows)
- Custom filters: Add field-specific filters (e.g., only open access) by modifying the API request parameters
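To make the first extension point concrete, here is a hedged sketch of a hypothetical fetch_from_openalex() method. The /works endpoint, the search and per-page parameters, and the results/title/doi/publication_year fields come from OpenAlex's public API; the method itself and the normalized output shape are assumptions.

```python
import requests

def fetch_from_openalex(self, query: str) -> list[dict]:
    """Hypothetical new fetcher: search the OpenAlex /works endpoint."""
    url = "https://api.openalex.org/works"
    response = requests.get(url, params={"search": query, "per-page": 100})
    response.raise_for_status()
    works = response.json().get("results", [])
    # Normalize to the same metadata shape the other fetchers produce
    return [
        {"title": w.get("title"), "doi": w.get("doi"), "year": w.get("publication_year")}
        for w in works
    ]
```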
Key Code Snippet
```python
from typing import Dict, List

import requests

def fetch_from_semantic_scholar(self, query: str) -> List[Dict]:
    """Fetch papers from the Semantic Scholar API."""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        "query": query,
        "fields": "title,abstract,authors,year,doi,openAccessPdf",
        "limit": 100,
    }
    # Add API key if available (grants higher rate limits)
    headers = {}
    if self.api_key:
        headers["x-api-key"] = self.api_key
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    return response.json().get("data", [])
```
02_deduplicate.py
Purpose
Removes duplicate papers using DOI, arXiv ID, and title similarity
Command Line Usage
```bash
python scripts/02_deduplicate.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
config.yaml Dependencies
None → This script does not read config.yaml
Core Logic
- Loads all CSV files from data/01_identification/
- Combines them into a single DataFrame
- Applies the deduplication strategy in order (see the sketch below):
  1. Exact DOI match → keep first occurrence
  2. Exact arXiv ID match → keep first occurrence
  3. Title similarity > 90% (fuzzy matching) → keep first occurrence
- Saves deduplicated results to data/01_identification/deduplicated.csv
- Logs: X papers → Y papers (Z duplicates removed)
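Steps 1-2 of the strategy map directly onto pandas; a minimal sketch (the doi and arxiv_id column names are assumptions about the CSV schema):

```python
import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact DOI and arXiv ID duplicates, keeping the first occurrence."""
    for key in ("doi", "arxiv_id"):
        has_key = df[key].notna()  # rows missing the key must not collapse into one another
        deduped = df[has_key].drop_duplicates(subset=key, keep="first")
        df = pd.concat([deduped, df[~has_key]], ignore_index=True)
    return df
```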
Error Handling
❌ No CSV files found
→ Run 01_fetch_papers.py first to generate identification data
❌ Empty CSV files
→ Check whether 01_fetch_papers.py succeeded. Verify API keys if needed
Extension Points
💡 How to Extend This Script
- Custom similarity threshold: Adjust SIMILARITY_THRESHOLD constant (default 0.9) for stricter/looser matching
- Manual review: Add --interactive flag to review borderline duplicates before removing
Key Code Snippet
```python
from difflib import SequenceMatcher

def is_duplicate_title(title1: str, title2: str, threshold: float = 0.9) -> bool:
    """Check whether two titles are similar enough to count as duplicates."""
    similarity = SequenceMatcher(None, title1.lower(), title2.lower()).ratio()
    return similarity >= threshold
```
03_screen_papers.py
Purpose
AI-assisted screening using Claude API with project_type-aware thresholds
Command Line Usage
```bash
python scripts/03_screen_papers.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
| --batch-size | int | ❌ No | Number of papers to screen per API call (default: 1) |
config.yaml Dependencies
- ⚠️ project_type → CRITICAL: Sets screening thresholds
- ⚠️ ai_prisma_rubric.decision_confidence.auto_include → Minimum confidence % to auto-include
- ⚠️ ai_prisma_rubric.decision_confidence.auto_exclude → Maximum confidence % to auto-exclude
- ai_prisma_rubric.human_validation.required → Whether to prompt for human review
- ⚠️ research_question → Used in the AI prompt for relevance assessment
Core Logic
- Loads config.yaml and sets thresholds based on project_type:
  - knowledge_repository: auto_include=50, auto_exclude=20, no human review
  - systematic_review: auto_include=90, auto_exclude=10, optional human review
- For each paper in deduplicated.csv (see the sketch below):
  - Constructs the prompt: "Is this relevant to [research_question]?"
  - Calls the Claude API with the title + abstract
  - Parses the response: a relevance score (0-100) plus reasoning
  - Classifies: Include (≥ threshold), Exclude (< threshold), or Review (borderline)
- Saves relevant.csv, excluded.csv, and review_queue.csv
- If human review is required: prompts the user to review borderline cases
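A hedged sketch of the per-paper API call using the anthropic SDK. The prompt wording, helper name, and model string are illustrative; the actual script builds its prompt from research_question as described above.

```python
import anthropic

def score_relevance(client: anthropic.Anthropic, question: str, title: str, abstract: str) -> str:
    """Ask Claude for a 0-100 relevance score plus a one-sentence reason."""
    prompt = (
        f"Research question: {question}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Rate this paper's relevance to the research question from 0 to 100.\n"
        "Reply as:\nSCORE: <number>\nREASON: <one sentence>"
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative; use whichever model the project configures
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```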
Error Handling
❌ ANTHROPIC_API_KEY not found
→ Add ANTHROPIC_API_KEY=sk-ant-... to the .env file
❌ Rate limit exceeded
→ The script pauses automatically. Upgrade your Anthropic API plan for higher limits
❌ Paper has no abstract
→ The script auto-excludes papers without abstracts (relevance cannot be assessed)
Extension Points
💡 How to Extend This Script
- Multi-criteria scoring: Modify the prompt to score multiple dimensions (methodology, population, outcomes) instead of a single relevance score (a parsing sketch follows)
- Batch processing: Increase --batch-size to screen multiple papers per API call (faster but less accurate)
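For multi-criteria scoring, the main change besides the prompt is the parser. A sketch under the assumption that the model replies with one "dimension: score" line per criterion:

```python
# Assumed response format: one "dimension: score" line per criterion
CRITERIA = {"methodology", "population", "outcomes"}

def parse_multi_scores(response_text: str) -> float:
    """Parse 'methodology: 80'-style lines and return the mean score."""
    scores = []
    for line in response_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in CRITERIA and value.strip().isdigit():
            scores.append(int(value.strip()))
    return sum(scores) / len(scores) if scores else 0.0
```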
Key Code Snippet
```python
import yaml

def load_config(self):
    """Load config.yaml and set screening thresholds based on project_type."""
    config_file = self.project_path / "config.yaml"
    with open(config_file) as f:
        self.config = yaml.safe_load(f)
    project_type = self.config.get('project_type', 'systematic_review')
    if project_type == 'knowledge_repository':
        # Lenient thresholds for comprehensive coverage
        self.screening_threshold = 50
        self.exclude_threshold = 20
        self.require_human_review = False
    else:
        # Strict thresholds for systematic review
        self.screening_threshold = 90
        self.exclude_threshold = 10
        self.require_human_review = True
```
04_download_pdfs.py
Purpose
Downloads PDFs from open_access URLs with retry logic and error handling
Command Line Usage
```bash
python scripts/04_download_pdfs.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
| --max-workers | int | ❌ No | Number of parallel downloads (default: 5) |
config.yaml Dependencies
None → Reads from data/02_screening/relevant.csv
Core Logic
- Loads relevant.csv (papers that passed screening)
- For each paper with a pdf_url (downloads run in parallel; see the sketch below):
  - Attempts the download with a 30-second timeout
  - Saves to data/pdfs/{doi or arxiv_id}.pdf
  - On failure: retries up to 3 times with exponential backoff
  - If it still fails: logs the error and continues
- Reports: X/Y PDFs downloaded successfully
- Creates data/pdfs/failed.csv with the papers that failed to download
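The --max-workers flag implies a thread pool around download_pdf(). A minimal sketch with concurrent.futures (pdf_url, paper_id, and self.pdf_dir are assumed names):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(self, papers: list[dict], max_workers: int = 5) -> list[dict]:
    """Run download_pdf() in parallel, collecting failures for failed.csv."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(self.download_pdf, p["pdf_url"], self.pdf_dir / f"{p['paper_id']}.pdf"): p
            for p in papers
        }
        for future in as_completed(futures):
            if not future.result():  # download_pdf returns False on failure
                failed.append(futures[future])
    return failed
```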
Error Handling
❌ Download timeout
→ Increase the timeout in the script or skip large PDFs (>50MB). Failed papers are logged to failed.csv
❌ 403 Forbidden / 404 Not Found
→ The PDF URL is expired or restricted. The paper will be excluded from the RAG (but kept in the metadata)
❌ Disk space full
→ Free up space or move data/ to an external drive
Extension Points
💡 How to Extend This Script
- Publisher authentication: Add institution proxy support for paywalled papers (requires login credentials)
- OCR fallback: For scanned PDFs, add OCR processing using Tesseract
Key Code Snippet
```python
import time
from pathlib import Path

import requests

def download_pdf(self, url: str, output_path: Path, retries: int = 3) -> bool:
    """Download a PDF with retry logic and exponential backoff."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30, stream=True)
            response.raise_for_status()
            with open(output_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return True
        except Exception as e:
            if attempt == retries - 1:
                print(f"❌ Failed after {retries} attempts: {e}")
                return False
            time.sleep(2 ** attempt)  # Exponential backoff
```
05_build_rag.py
Purpose
Chunks PDFs, generates embeddings, and stores in ChromaDB vector database
Command Line Usage
```bash
python scripts/05_build_rag.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
config.yaml Dependencies
- ⚠️ rag_settings.embedding_model → Which embedding model to use (affects quality)
- rag_settings.chunk_size → Size of text chunks (default: 1000)
- rag_settings.chunk_overlap → Overlap between chunks (default: 200)
- rag_settings.vector_db → Database backend (chromadb or faiss)
Core Logic
- Loads all PDFs from data/pdfs/
- For each PDF:
  - Extracts text using PyMuPDF
  - Splits it into chunks (chunk_size with chunk_overlap)
  - Generates embeddings using the OpenAI API or a local model
  - Stores chunks + embeddings in ChromaDB (see the sketch below)
- Creates a collection in data/chroma/ with metadata:
  - paper_id, title, authors, year, doi
  - chunk_index, page_number
- Reports: X papers → Y chunks → ChromaDB
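A hedged sketch of the storage step with the chromadb client API. The persistent path mirrors data/chroma/ above; the collection name "papers" and the exact metadata keys are assumptions.

```python
import chromadb

def store_chunks(chroma_path: str, paper: dict, chunks: list[str], embeddings: list[list[float]]) -> None:
    """Write one paper's chunks and embeddings into a persistent ChromaDB collection."""
    client = chromadb.PersistentClient(path=chroma_path)
    collection = client.get_or_create_collection(name="papers")
    collection.add(
        ids=[f"{paper['paper_id']}_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[
            {"paper_id": paper["paper_id"], "title": paper["title"], "chunk_index": i}
            for i in range(len(chunks))
        ],
    )
```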
Error Handling
❌ OPENAI_API_KEY not found
→ Add OPENAI_API_KEY=sk-... to .env if using OpenAI embeddings
❌ PDF extraction failed (corrupted PDF)
→ The script skips corrupted PDFs and logs a warning. The paper's metadata is kept, but no embeddings are created
❌ Out of memory
→ Reduce chunk_size or process PDFs in smaller batches
Extension Points
💡 How to Extend This Script
- Custom chunking strategy: Instead of fixed-size chunks, split by sections (Introduction, Methods, etc.) as sketched below
- Multi-modal embeddings: Extract figures/tables and embed them separately using vision models
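A sketch of the section-based chunking idea: split on common headings instead of fixed-size windows. The heading patterns are illustrative, not exhaustive, and text before the first detected heading is ignored here.

```python
import re

# Matches lines like "3. Methods" or "Results" on their own line
SECTION_PATTERN = re.compile(
    r"^\s*(?:\d+\.?\s+)?(Abstract|Introduction|Methods?|Results|Discussion|Conclusions?|References)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def chunk_by_section(text: str) -> list[str]:
    """Return one chunk per detected section; fall back to the whole text if none are found."""
    starts = [m.start() for m in SECTION_PATTERN.finditer(text)]
    if not starts:
        return [text]
    boundaries = starts + [len(text)]
    return [text[boundaries[i]:boundaries[i + 1]].strip() for i in range(len(starts))]
```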
Key Code Snippet
```python
def chunk_text(self, text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # advance so consecutive chunks share `overlap` characters
    return chunks
```
06_query_rag.py
Purpose
Interactive RAG query system with semantic search and LLM answer generation
Command Line Usage
```bash
python scripts/06_query_rag.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
| --query | str | ❌ No | Direct query (non-interactive mode) |
config.yaml Dependencies
- rag_settings.retrieval_k → Number of chunks to retrieve (default: 10)
- ⚠️ rag_settings.llm → Which LLM to use for answer generation
- rag_settings.llm_temperature → Randomness of answers (0.0-1.0)
Core Logic
- Loads the ChromaDB collection from data/chroma/
- Starts an interactive loop:
  - User enters a query
  - Generates the query embedding
  - Searches ChromaDB for the top-k most similar chunks
  - Constructs the prompt: "Answer based on these papers: [chunks]"
  - Calls the LLM (Claude/GPT) to generate an answer
  - Displays the answer with paper citations
- User can ask follow-up questions in the same session
Error Handling
❌ ChromaDB not found
→ Run 05_build_rag.py first to create the vector database
❌ No relevant chunks found
→ The query is too specific or outside the corpus domain. Try broader query terms
❌ LLM timeout
→ Reduce retrieval_k to retrieve fewer chunks (less context)
Extension Points
💡 How to Extend This Script
- Conversation history: Maintain conversation context across multiple queries so follow-up questions resolve correctly (a sketch follows)
- Citation formatting: Auto-format citations in APA/MLA style
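A sketch of the conversation-history extension: a thin, hypothetical wrapper that folds recent turns into each new question before calling the existing query() method shown below.

```python
class ConversationalRAG:
    """Hypothetical wrapper that carries prior turns into each new query."""

    def __init__(self, rag):
        self.rag = rag      # the existing query system
        self.history = []   # list of (question, answer) pairs

    def ask(self, question: str) -> str:
        # Prepend the last few turns so follow-up questions resolve correctly
        past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history[-3:])
        contextual = f"{past}\n\nFollow-up: {question}" if past else question
        result = self.rag.query(contextual)
        self.history.append((question, result["answer"]))
        return result["answer"]
```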
Key Code Snippet
```python
def query(self, question: str) -> dict:
    """Query the RAG system and generate a cited answer."""
    # Retrieve the most relevant chunks
    results = self.collection.query(
        query_texts=[question],
        n_results=self.k,
    )
    # Construct the prompt with retrieved context
    context = "\n\n".join(results['documents'][0])
    prompt = f"Answer this question based on the papers:\n{context}\n\nQuestion: {question}"
    # Generate the answer
    response = self.llm.generate(prompt)
    return {
        "answer": response,
        "sources": results['metadatas'][0],  # Paper citations
    }
```
07_generate_prisma.py
Purpose
Generates PRISMA 2020 flow diagram with project_type-aware title
Command Line Usage
```bash
python scripts/07_generate_prisma.py --project projects/my-project
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| --project | Path | ✅ Yes | Path to project directory |
config.yaml Dependencies
- ⚠️ project_type → CRITICAL: Changes the diagram title
- project_name → Displayed on the diagram
Core Logic
- Collects statistics from all data/ folders (see the sketch below):
  - data/01_identification/*.csv → papers fetched
  - data/01_identification/deduplicated.csv → duplicates removed
  - data/02_screening/*.csv → papers screened and excluded
  - data/pdfs/ → PDFs downloaded
  - data/chroma/ → papers in the RAG
- Generates the PRISMA 2020 flow diagram using matplotlib
- The title changes based on project_type:
  - knowledge_repository → "Paper Processing Pipeline"
  - systematic_review → "PRISMA 2020 Flow Diagram"
- Saves to outputs/prisma_diagram.png
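A minimal sketch of the statistics-collection step, counting rows and files at each stage. The file names follow the paths listed above; the method name and the returned keys are illustrative.

```python
import pandas as pd

def collect_stats(self) -> dict:
    """Count papers at each pipeline stage from the data/ folders."""
    data = self.project_path / "data"
    ident = data / "01_identification"
    fetched = sum(len(pd.read_csv(f)) for f in ident.glob("*_results.csv"))
    deduped = len(pd.read_csv(ident / "deduplicated.csv"))
    included = len(pd.read_csv(data / "02_screening" / "relevant.csv"))
    pdfs = len(list((data / "pdfs").glob("*.pdf")))
    return {
        "identified": fetched,
        "duplicates_removed": fetched - deduped,
        "included": included,
        "pdfs_downloaded": pdfs,
    }
```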
Error Handling
❌ Missing data files
→ Ensure all previous stages completed successfully
❌ Matplotlib font errors
→ Install system fonts or use the default sans-serif
Extension Points
💡 How to Extend This Script
- Interactive PRISMA: Generate HTML version with clickable stages that show detailed breakdowns
- Export to LaTeX: Generate TikZ code for publication-quality diagrams
Key Code Snippet
```python
import matplotlib.pyplot as plt

def create_prisma_diagram(self, stats):
    """Generate the PRISMA diagram with a project_type-aware title."""
    project_type = self.config.get('project_type', 'systematic_review')
    if project_type == 'knowledge_repository':
        title = 'Paper Processing Pipeline'
        subtitle = 'Comprehensive Knowledge Repository'
    else:
        title = 'PRISMA 2020 Flow Diagram'
        subtitle = 'Systematic Literature Review'
    # Create the matplotlib figure and place the title text
    fig, ax = plt.subplots(figsize=(12, 14))
    ax.text(5, 13.5, title, ha='center', fontsize=16, fontweight='bold')
    ax.text(5, 13, subtitle, ha='center', fontsize=12, style='italic')
```
Ready to Review Code?
Check out the Code Review Guide for best practices, testing strategies, and contribution guidelines.
View Code Review Guide →