
Keeping up with the hydrology literature the traditional way is impossible. New papers come out every day, faster than anyone can even read the abstracts. Keyword-based email alerts and RSS feeds thin the flood a little, but there are still far too many. A few weeks ago I decided to stop fighting it and instead build a tool that does the tracking for me, tailored to my interests, on a schedule, while I sleep. The result is HydroSense: an automated paper-harvest blog for hydrology and water-resources research. Almost every line of the code, the Jekyll site, the skills, and even the scheduling was designed and polished together with Claude. This post is the story of how it works.

What HydroSense does

Two things, on two different cadences:

  • Daily harvest — Every day, it fetches new papers published ~7 days ago from 11 top-tier journals (Science, Nature, etc.) via the CrossRef API, enriches each one with field classifications and abstracts from Semantic Scholar and OpenAlex, filters by topic keywords, and then asks Claude to judge which ones are actually relevant to my research interests. The survivors get written up as a Jekyll post with short summaries, committed, pushed, and tweeted — all without me touching anything.
  • Weekly review — Every Monday, a second workflow runs a keyword-based search across Semantic Scholar and OpenAlex (no journal restrictions this time) and asks Claude to synthesize the results into thematic paragraphs grouped by subfield. This is the "what was the field talking about last week" view.
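The daily pipeline boils down to fetch → enrich → filter → judge. Here is a minimal sketch of the deterministic keyword pre-filter stage that runs before Claude's relevance judgment; the function name, keyword list, and paper dicts are illustrative, not the actual harvest.py code:

```python
# Hypothetical sketch of the keyword pre-filter; names are illustrative.
TOPIC_KEYWORDS = {"hydrology", "streamflow", "reservoir", "flood", "drought"}

def keyword_match(paper: dict) -> bool:
    """Return True if the title or abstract mentions any topic keyword."""
    text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
    return any(kw in text for kw in TOPIC_KEYWORDS)

papers = [
    {"title": "Reservoir operations under climate change", "abstract": ""},
    {"title": "A new catalyst for ammonia synthesis", "abstract": ""},
]
survivors = [p for p in papers if keyword_match(p)]
# Only the reservoir paper passes the pre-filter; Claude then judges
# whether the survivors are actually relevant, not just keyword matches.
```

The cheap string filter cuts the candidate pool down so the expensive Claude call only sees plausible papers.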

Everything is published as a bilingual Jekyll blog hosted on GitHub Pages at hydrosense.simhydro.com, with a working English/Chinese language toggle. The Chinese version is generated by Claude during the same workflow.

The stack

Nothing exotic, just boring tools wired together:

  • Python for the harvesting scripts (harvest.py for journal-based, search.py for keyword-based). Both output JSON so downstream steps can process them.
  • CrossRef + Semantic Scholar + OpenAlex as the paper data sources. Each has its own rate limits and quirks (CrossRef is fast, S2 is slow at 1 req/sec, OpenAlex fills in the gaps when S2 doesn’t have an abstract).
  • Claude Opus for the two tasks a deterministic filter can’t handle well: (1) judging whether a paper is actually relevant to my research interests versus merely matching keywords, and (2) writing the thematic synthesis for the weekly review.
  • Jekyll + Just the Docs remote theme for the site. I picked Just the Docs because the sidebar navigation handles hierarchical content (Year → Month → Day) out of the box.
  • GitHub Actions to rebuild and redeploy the site on every push to main.
  • X API v2 for auto-tweeting each daily post with a short summary.
  • Claude Code scheduled triggers as the cron-equivalent — more on this below.
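One concrete quirk of the OpenAlex fallback is worth showing: OpenAlex does not return plain abstracts. Its works endpoint returns an `abstract_inverted_index` mapping each word to the positions where it occurs, so the enrichment step has to reconstruct the text. A self-contained sketch of that reconstruction:

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild abstract text from OpenAlex's abstract_inverted_index,
    which maps each word to a list of word positions."""
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# A toy inverted index, as OpenAlex would return it:
abstract = reconstruct_abstract({"Streamflow": [0], "is": [1], "rising": [2]})
# -> "Streamflow is rising"
```

In the real pipeline this only runs when Semantic Scholar returns no abstract, which, given its 1 req/sec pace, is also where the slow part of the enrichment lives.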

The automatic workflow

The actual “run every day at 3 AM” part uses Claude Code’s scheduled triggers rather than a traditional cron job. The instruction I give the scheduler looks roughly like this:

Run /daily-harvest for 8 days ago.

After pushing to main, post a tweet about the harvest using scripts/post_tweet.py.

If today is Monday, also run /weekly-review searching from 14 days ago to 7 days ago.

If no relevant papers are found for either task, skip that task entirely (no post, no push, no tweet).
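The tweet step is just a script call, but one detail matters: X caps posts at 280 characters, so the summary has to be truncated with room left for the post URL. A hedged sketch of that logic (the real scripts/post_tweet.py may do this differently, and X actually counts every link as 23 characters via t.co; this version conservatively uses the raw URL length):

```python
TWEET_LIMIT = 280  # X's character cap for a standard post

def build_tweet(summary: str, url: str) -> str:
    """Truncate the summary so 'summary + space + url' fits the limit."""
    budget = TWEET_LIMIT - len(url) - 1  # reserve room for space + URL
    if len(summary) > budget:
        summary = summary[: budget - 1].rstrip() + "…"
    return f"{summary} {url}"
```

Posting itself is a single authenticated call to the X API v2 create-tweet endpoint, which the scheduler only makes after the push to main succeeds.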

The 8-day shift on the daily harvest exists because papers take roughly seven days to be fully indexed in Semantic Scholar; if you query "yesterday," you'll get abstract-less stubs for half the results. The weekly review is shifted by 7 days for the same reason, and also so it covers the same window as the most recent daily posts.
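In code, the shifted windows are simple date arithmetic (a sketch; the real scripts take these dates as arguments from the scheduler instruction):

```python
from datetime import date, timedelta

def harvest_windows(today: date):
    """Daily harvest targets the single day 8 days back; the weekly
    review covers the 7-day span from 14 to 7 days back."""
    daily = today - timedelta(days=8)
    weekly = (today - timedelta(days=14), today - timedelta(days=7))
    return daily, weekly

daily, weekly = harvest_windows(date(2024, 6, 10))  # a Monday
```

Because the weekly span ends 7 days back and the daily harvest also sits 8 days back, the two views overlap the same indexed slice of the literature.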

Things I’d still like to add

  • A real archive / search page that goes beyond the sidebar (right now browsing years manually is clunky once there are hundreds of posts).
  • Per-topic subscription via RSS so a reader can follow only “reservoir operations” or “ML for hydrology.”
  • A rating system for each paper.
  • An automatic system to generate BibTeX files for the papers.
  • A backfill mechanism that retroactively fills in author lists or abstracts that were missing when a paper was first retrieved.

This whole project is still evolving. Every few days I notice something I want to improve, describe it to Claude, and it gets fixed. I can casually maintain a daily-updating bilingual research blog without much effort. Anyone can just fork the repo and change the keywords to fit their own needs.
