Robots.txt Monitor
The Robots.txt Monitor fetches your site's robots.txt every
3 hours and stores a snapshot whenever the content changes. Any change — however small — triggers
a warning email. If the file adds a Disallow: / rule under
User-agent: *, a critical alert fires immediately, because a single
mis-deployed robots.txt can block every search engine crawler and wipe organic traffic within
hours.
What it measures
- Current version: the most recent robots.txt content fetched from your site, with a red BLOCKING badge if Disallow: / is active or a green Normal badge if crawlers are allowed in.
- Last checked: the timestamp of the most recent fetch.
- Change history: every version of the file that differed from the previous one, with the date it was detected, a status badge, the short SHA-256 hash for identification, and an expandable unified diff so you can see exactly what lines were added or removed.
How we compute it
- A Celery Beat task runs every 3 hours and fetches https://<your-domain>/robots.txt with a short timeout.
- We SHA-256 the raw content and compare the hash against the latest stored snapshot for your site.
- If the hash is different (or there is no previous snapshot), we store a new RobotsTxtSnapshot row with the content, timestamp and hash.
- We parse the new content line by line, looking for a User-agent: * block that contains a Disallow: / directive. If found, is_blocking is set to True.
- We compute a unified diff between the previous snapshot and the new one (the same format as diff -u) so you can read exactly what changed.
- An alert email is sent to every member of the site's team. The subject line reads [CRITICAL] Robots.txt blocking detected if the file is blocking crawlers, or [WARNING] Robots.txt changed for any other change. The email is sent with fail_silently=True, so a transient SMTP failure never crashes the task.
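The hash, blocking-check, and diff steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the monitor's actual code: the function names (content_hash, is_blocking, unified_diff, process_fetch) and the returned dict are hypothetical, and the group parsing is deliberately simplified (only Allow and Disallow lines are treated as rules).

```python
import difflib
import hashlib


def content_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the raw robots.txt content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def is_blocking(content: str) -> bool:
    """True if a `User-agent: *` group contains a bare `Disallow: /` rule."""
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in content.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


def unified_diff(old: str, new: str) -> str:
    """Return a unified diff between two versions of the file."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return "\n".join(lines)


def process_fetch(new_content, previous_content):
    """Return None if nothing changed, else a dict describing the new snapshot.

    The dict layout is an assumption made for this sketch.
    """
    new_hash = content_hash(new_content)
    if previous_content is not None and content_hash(previous_content) == new_hash:
        return None  # same hash: nothing to store, no alert
    blocking = is_blocking(new_content)
    subject = ("[CRITICAL] Robots.txt blocking detected" if blocking
               else "[WARNING] Robots.txt changed")
    return {
        "content_hash": new_hash,
        "is_blocking": blocking,
        "diff": unified_diff(previous_content or "", new_content),
        "subject": subject,
    }
```

Calling process_fetch with a new body of "User-agent: *" followed by "Disallow: /" against a previous body that only disallowed /admin/ would yield is_blocking=True, the CRITICAL subject line, and a diff showing the removed and added Disallow lines.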
Scenarios you'll see
- Disallow: / appears under User-agent: *, so every search engine bot is locked out. This usually happens when a staging robots.txt is accidentally deployed to production. Act within minutes: Google may de-index pages if the block persists for hours.
- A new Disallow: /some-path/ rule was added. It could be intentional (protecting admin URLs) or accidental (blocking a product category). Verify the diff against your deploy log before dismissing.
- A Crawl-delay value was increased or the Sitemap: directive was updated. This is low risk but worth logging: crawl-delay changes can slow fresh-content indexing, and an outdated sitemap URL silently breaks sitemap discovery.
- A previous Disallow: / was removed and the file reverted to normal. The monitor captures this as a new snapshot, and the status badge returns to Normal. Submit a recrawl request in Search Console immediately to restore indexing.
- The fetch failed with a server error or timed out. No new snapshot is stored (the old content is not overwritten), but repeated failures mean your robots.txt is currently unreachable. Check your web server's error log and CDN configuration.
- The file's SHA-256 hash matched the previous snapshot, so nothing was stored. This is the normal healthy state. The "Last checked" timestamp updates regardless so you can confirm the monitor is running.
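The last two scenarios (failed fetch versus unchanged file) differ only in how a single check ends. The sketch below shows one way to separate the three outcomes. It is an assumption-laden illustration: check_once and the injected fetch callable are hypothetical names, and the real task may structure this differently. Catching OSError covers timeouts, connection errors, and the urllib HTTP errors, which all derive from it.

```python
import hashlib
from typing import Callable, Optional, Tuple


def check_once(fetch: Callable[[], str],
               previous_hash: Optional[str]) -> Tuple[str, Optional[str]]:
    """Run one check and return (outcome, hash_to_keep).

    `fetch` is a hypothetical callable that returns the robots.txt body,
    or raises OSError on a timeout, connection failure, or server error.
    """
    try:
        body = fetch()
    except OSError:
        # Fetch failed: the previous snapshot stays untouched.
        return "fetch_failed", previous_hash
    new_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if new_hash == previous_hash:
        # Healthy state: nothing is stored, only "Last checked" moves.
        return "unchanged", previous_hash
    # Content differs (or there was no previous snapshot): store a new one.
    return "changed", new_hash
```

Passing the fetcher in as an argument keeps the sketch testable without network access; in practice it would wrap an HTTP request with a short timeout.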
What to do when you get an alert
- If CRITICAL (Disallow: /): open your robots.txt immediately at https://yourdomain.com/robots.txt. If the block is live, revert your last deploy or patch the file directly. Then request a recrawl in Google Search Console.
- If WARNING (other change): open the Robots.txt Monitor report, expand the diff for the new snapshot, and verify each changed line against your recent deploy log or CMS release notes.
- For intentional changes (such as adding a new Disallow for an admin path), no action is needed, but log it in Change Events so the annotation appears on your traffic charts.
- If the change was unintentional, check your deployment pipeline for where robots.txt is generated. CMS plugins, environment-variable templating, and CI/CD jobs are common culprits.
- After resolving a blocking incident, monitor Index Coverage over the following 2–3 days to confirm Google is actively recrawling.
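To confirm whether a block is actually live, you can run the file through Python's standard urllib.robotparser. The helper below is a hypothetical sketch, not part of the monitor: blocked_agents and its default agent list are assumptions, and robotparser's matching rules are its own, so treat the result as a quick sanity check and not as a statement of how any particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt: str, site_url: str,
                   agents=("*", "Googlebot", "Bingbot")):
    """Return the agents from `agents` that may not fetch the site's home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    home = site_url.rstrip("/") + "/"
    return [agent for agent in agents if not parser.can_fetch(agent, home)]


# Example against a live site (needs network access, so it is left commented out):
# from urllib.request import urlopen
# body = urlopen("https://yourdomain.com/robots.txt", timeout=10).read().decode("utf-8")
# print(blocked_agents(body, "https://yourdomain.com"))
```

An empty list means none of the listed agents is barred from the home page. A list containing "*" corresponds to the CRITICAL case.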
Caveats & limits
- Fetches run every 3 hours. A deployment that briefly blocks crawlers and is reverted within 3 hours may not be captured if the window of exposure falls between two fetch cycles.
- The blocking detector checks for the literal string Disallow: / under a User-agent: * block. Non-wildcard user-agent blocks (e.g. User-agent: Googlebot) do not trigger the CRITICAL badge. Review the diff manually if you're concerned about specific bots.
- Only content changes trigger a new snapshot. HTTP 5xx responses and connection timeouts are not stored, because a missing robots.txt or server error is not the same as an explicit block directive.
- The monitor stores up to the last 50 snapshots per site for the history table.
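The first caveat is easier to reason about with a small calculation. The sketch below assumes fetches land exactly on a fixed 3-hour grid starting at a known time, which real schedulers only approximate. The function name block_is_captured is hypothetical.

```python
from datetime import datetime, timedelta

FETCH_INTERVAL = timedelta(hours=3)


def block_is_captured(block_start: datetime, block_end: datetime,
                      first_fetch: datetime,
                      interval: timedelta = FETCH_INTERVAL) -> bool:
    """True if any scheduled fetch falls inside [block_start, block_end)."""
    elapsed = block_start - first_fetch
    # Ceiling division: number of intervals until the first fetch that is
    # at or after block_start (never negative).
    steps = max(0, -(-elapsed // interval))
    next_fetch = first_fetch + steps * interval
    return next_fetch < block_end
```

With fetches at 00:00, 03:00, 06:00 and so on, a block that is live from 00:30 to 02:45 is never seen, while one live from 02:30 to 03:10 is caught by the 03:00 fetch. Under this model, only a block lasting longer than a full interval is guaranteed to be captured.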
Related reports
- Index Coverage — confirm Google is indexing what you expect after any robots.txt change.
- Crawl Efficiency — see which pages Googlebot is visiting and which are being neglected.
- Sitemaps Status — ensure your sitemap is still discoverable after any robots.txt directive change.
- Change Events — log intentional robots.txt updates so they appear as annotations on every traffic chart.