Indexing & Crawl

Robots.txt Monitor

The Robots.txt Monitor fetches your site's robots.txt every 3 hours and stores a snapshot whenever the content changes. Any change — however small — triggers a warning email. If the file adds a Disallow: / rule under User-agent: *, a critical alert fires immediately, because a single mis-deployed robots.txt can block every search engine crawler and wipe organic traffic within hours.

What it measures

Current version — the most recent robots.txt content fetched from your site, with a red BLOCKING badge if Disallow: / is active or a green Normal badge if crawlers are allowed in.
Last checked — the timestamp of the most recent fetch.
Change history — every version of the file that differed from the previous one, with the date it was detected, a status badge, the short SHA-256 hash for identification, and an expandable unified diff so you can see exactly what lines were added or removed.

How we compute it

A Celery Beat task runs every 3 hours and fetches https://<your-domain>/robots.txt with a short timeout.
We compute the SHA-256 hash of the raw content and compare it against the hash of the latest stored snapshot for your site.
If the hash is different (or there is no previous snapshot), we store a new RobotsTxtSnapshot row with the content, timestamp and hash.
We parse the new content line-by-line looking for a User-agent: * block that contains a Disallow: / directive. If found, is_blocking is set to True.
We compute a unified diff between the previous snapshot and the new one (the same format that diff -u produces) so you can read exactly which lines changed.
An alert email is sent to every member of the site's team. The subject line reads [CRITICAL] Robots.txt blocking detected if the file is blocking crawlers, or [WARNING] Robots.txt changed for any other change. The email is sent with fail_silently=True, so a transient SMTP failure never crashes the task.
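
The steps above can be sketched in plain Python. This is a minimal illustration, not the monitor's actual code: the function names (content_hash, is_blocking, unified_diff, check) and the returned dictionary are hypothetical, and the Celery task, the RobotsTxtSnapshot model and the email sending are left out. The group handling in is_blocking (comments stripped, consecutive User-agent lines sharing one group) is an assumption about how such a parser might behave.

```python
import difflib
import hashlib
from typing import Optional


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the raw robots.txt text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def is_blocking(content: str) -> bool:
    """True if a `User-agent: *` group contains a bare `Disallow: /`."""
    in_wildcard_group = False
    previous_was_user_agent = False
    for raw_line in content.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Assumption: consecutive User-agent lines share one rule group.
            if not previous_was_user_agent:
                in_wildcard_group = False
            if value == "*":
                in_wildcard_group = True
            previous_was_user_agent = True
            continue
        previous_was_user_agent = False
        if field == "disallow" and value == "/" and in_wildcard_group:
            return True
    return False


def unified_diff(old: str, new: str) -> str:
    """Unified diff between two versions, in the style of `diff -u`."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(lines)


def check(previous: Optional[str], fetched: str) -> Optional[dict]:
    """Return details for a new snapshot, or None when nothing changed."""
    new_hash = content_hash(fetched)
    if previous is not None and content_hash(previous) == new_hash:
        return None  # same hash: nothing is stored, no email is sent
    blocking = is_blocking(fetched)
    subject = (
        "[CRITICAL] Robots.txt blocking detected"
        if blocking
        else "[WARNING] Robots.txt changed"
    )
    return {
        "hash": new_hash,
        "is_blocking": blocking,
        "diff": unified_diff(previous or "", fetched),
        "subject": subject,
    }
```

Calling check(old_content, new_content) with identical strings returns None, which corresponds to the "No change detected" case. Any difference returns the hash, the blocking flag, the diff and the subject line that would go into the alert.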

Scenarios you'll see

Full crawl block

Disallow: / under User-agent: * — every search engine bot is locked out. Usually happens when a staging robots.txt is accidentally deployed to production. Act within minutes: Google may de-index pages if the block persists for hours.

Partial path disallow added

A new Disallow: /some-path/ rule was added. Could be intentional (protecting admin URLs) or accidental (blocking a product category). Verify the diff against your deploy log before dismissing.

Crawl-delay or sitemap changed

A Crawl-delay value was increased or the Sitemap: directive was updated. Low risk but worth logging — crawl-delay changes can slow fresh-content indexing, and an outdated sitemap URL silently breaks sitemap discovery.

Block removed (recovery)

A previous Disallow: / was removed and the file reverted to normal. The monitor captures this as a new snapshot; the status badge returns to Normal. Submit a recrawl request in Search Console immediately to restore indexing.

File unavailable (5xx / timeout)

The fetch failed with a server error or timed out. No new snapshot is stored (the old content is not overwritten), but repeated failures mean your robots.txt is currently unreachable. Check your web server's error log and CDN configuration.
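
A fetch that leaves the stored snapshot untouched on failure might look like the sketch below. The function name fetch_robots_txt, the opener parameter (there only so the function can be exercised without a network) and the 10-second timeout are all assumptions; the source only says the timeout is "short". Treating every HTTP error status the same way as a 5xx is also an assumption.

```python
import urllib.request
from typing import Callable, Optional


def fetch_robots_txt(
    url: str,
    timeout: float = 10.0,  # assumed value, not taken from the monitor
    opener: Callable = urllib.request.urlopen,
) -> Optional[str]:
    """Return the robots.txt body, or None if the fetch failed."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # urllib.error.URLError, urllib.error.HTTPError (raised by urlopen
        # for 4xx/5xx responses) and TimeoutError are all OSError subclasses.
        return None


def content_after_check(stored: Optional[str], fetched: Optional[str]) -> Optional[str]:
    """A failed fetch (None) never overwrites the previously stored content."""
    return stored if fetched is None else fetched
```

Returning None rather than an empty string keeps "the fetch failed" distinct from "the file is empty", so a server error is never mistaken for a content change.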

No change detected

The file's SHA-256 hash matched the previous snapshot — nothing was stored. This is the normal healthy state. The "Last checked" timestamp updates regardless so you can confirm the monitor is running.

What to do when you get an alert

If CRITICAL (Disallow: /) — open your robots.txt immediately at https://yourdomain.com/robots.txt. If the block is live, revert your last deploy or patch the file directly. Then request a recrawl in Google Search Console.
If WARNING (other change) — open the Robots.txt Monitor report, expand the diff for the new snapshot, and verify each changed line against your recent deploy log or CMS release notes.
For intentional changes (adding a new Disallow for an admin path), no action is needed — but log it in Change Events so the annotation appears on your traffic charts.
If the change was unintentional, check your deployment pipeline for where robots.txt is generated — CMS plugins, environment-variable templating, and CI/CD jobs are common culprits.
After resolving a blocking incident, monitor Index Coverage over the following 2–3 days to confirm Google is actively recrawling.

Caveats & limits

Fetches run every 3 hours. A deployment that blocks crawlers and is reverted before the next fetch is never captured, because the monitor only sees the file as it exists at fetch time.
The blocking detector checks for the exact directive Disallow: / inside a User-agent: * block. Non-wildcard user-agent blocks (e.g. User-agent: Googlebot) do not trigger the BLOCKING badge or the CRITICAL alert, so review the diff manually if you're concerned about specific bots.
Only content changes trigger a new snapshot. Failed fetches (HTTP 5xx responses, connection timeouts) are not stored as snapshots, because a missing robots.txt or a server error is not the same as an explicit block directive.
The monitor stores up to the last 50 snapshots per site for the history table.
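
Because the detector only looks at the User-agent: * block, a manual check for specific bots can be useful. The sketch below uses Python's standard urllib.robotparser to list which named crawlers are denied a given path. It is not part of the monitor; the function name blocked_bots and the bot list are illustrative assumptions.

```python
from typing import List, Sequence
from urllib.robotparser import RobotFileParser

# Illustrative list only; substitute the crawlers you care about.
BOTS = ("Googlebot", "Bingbot", "DuckDuckBot")


def blocked_bots(content: str, bots: Sequence[str] = BOTS, path: str = "/") -> List[str]:
    """Return the bots that the given robots.txt text forbids from `path`."""
    parser = RobotFileParser()
    parser.parse(content.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


robots = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /admin/
"""
print(blocked_bots(robots))  # ['Googlebot']
```

In this example the wildcard block only disallows /admin/, so the file would show the Normal badge even though Googlebot is locked out of the whole site.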

Related reports

Index Coverage — confirm Google is indexing what you expect after any robots.txt change.
Crawl Efficiency — see which pages Googlebot is visiting and which are being neglected.
Sitemaps Status — ensure your sitemap is still discoverable after any robots.txt directive change.
Change Events — log intentional robots.txt updates so they appear as annotations on every traffic chart.