How to Scrape 4 Million Research Papers
Files
Download How to Scrape 4 Million Research Papers Transcription (369 KB)
Description
What does it take to scrape 4 million research papers? More patience than code. This talk distills hard-won lessons from building an instrument extraction pipeline across the chemistry literature. We'll cover the problem space—why existing research databases lack the structured metadata we needed—and the technical approach combining OpenAlex, PDF scraping, and AI-powered extraction. But the real focus is on what went wrong and how we adapted: discovering misclassified documents, learning to validate early and often, and accepting that data quality trumps data quantity. Attendees will leave with a practical framework for approaching large-scale research data projects and realistic expectations for timeline and effort.
Publication Date
2-11-2026
Recommended Citation
Chan, Alex, "How to Scrape 4 Million Research Papers" (2026). Love Data Week. 7.
https://scholarship.shu.edu/rds-ldw/7