SEO A/B Testing Engine

MetricsTab's SEO Testing Engine lets you measure whether an on-page change actually moved the needle. You define a deployment date, a URL pattern, and two comparison windows (before and after). MetricsTab pulls clicks, impressions, CTR, and average position from your existing Google Search Console data, runs two separate statistical tests, and tells you whether the difference is real or noise.

How the windows work

Every test is defined by four window boundaries, all derived from the deployment date:

  • Pre-window start: deployment_date − pre_window_days
  • Pre-window end: deployment_date − 1 day
  • Post-window start: deployment_date
  • Post-window end: deployment_date + post_window_days − 1

The default is 28 days on each side. Longer windows give more statistical power at the cost of waiting longer for results. The minimum recommended window is 14 days; fewer than 7 days rarely produces meaningful results.
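
The window arithmetic above can be sketched in a few lines of Python. This is only an illustration of the date rules, not MetricsTab's internal code; the function name compute_windows and its parameter names are assumptions made for the example.

```python
from datetime import date, timedelta

def compute_windows(deployment_date, pre_window_days=28, post_window_days=28):
    """Return (pre_start, pre_end, post_start, post_end) as inclusive dates."""
    pre_start = deployment_date - timedelta(days=pre_window_days)
    pre_end = deployment_date - timedelta(days=1)
    post_start = deployment_date  # the deployment day counts as post-window
    post_end = deployment_date + timedelta(days=post_window_days - 1)
    return pre_start, pre_end, post_start, post_end

# Example: a change deployed on 1 March 2024 with the default 28-day windows.
pre_start, pre_end, post_start, post_end = compute_windows(date(2024, 3, 1))
print(pre_start, pre_end, post_start, post_end)
```

Both windows contain exactly the requested number of days because the end dates are inclusive.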

Google Search Console data has a 2–3 day lag. If your deployment date is today, post-window metrics will not appear in MetricsTab for 2–3 days.

The statistical tests

MetricsTab runs two tests on each metric and reports both p-values.

Welch's t-test

Compares the mean daily value in the pre-window against the mean in the post-window. Unlike Student's t-test, Welch's version does not assume that the two windows have equal variance, which suits SEO metrics because their day-to-day variability often shifts with seasonality or indexing spikes. The result tells you whether average daily performance changed.
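
For readers who want to see the calculation, here is a minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom using only the Python standard library. It is an illustration under assumptions, not MetricsTab's implementation: the function name welch_t and the sample values are made up, and turning t and df into a p-value needs a Student's t CDF, which the standard library does not provide (SciPy's stats.ttest_ind with equal_var=False is one option).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(pre, post):
    """Return (t, df) for Welch's unequal-variance t-test.

    pre and post are lists of daily values, each with at least 2 entries.
    Raises ZeroDivisionError if both samples have zero variance.
    """
    n1, n2 = len(pre), len(post)
    # Per-sample squared standard errors, using sample variance (n - 1).
    se1 = variance(pre) / n1
    se2 = variance(post) / n2
    t = (mean(post) - mean(pre)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical daily click counts for a 7-day pre-window and post-window.
pre_clicks = [10, 12, 11, 13, 9, 10, 12]
post_clicks = [14, 15, 13, 16, 15, 14, 17]
t, df = welch_t(pre_clicks, post_clicks)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A positive t means the post-window mean is higher than the pre-window mean.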

Mann-Whitney U test

A non-parametric alternative that compares the distributions of the two windows without assuming normality. It is robust to outlier days (e.g. a viral news spike). MetricsTab uses a normal approximation for the U statistic, which is accurate when each window has at least 8 data points.
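
The normal approximation mentioned above can be sketched as follows. This is a simplified illustration, not MetricsTab's code: the function name mann_whitney_u is an assumption, tied values receive average ranks, and no tie or continuity correction is applied, so the p-value can differ slightly from statistical packages that apply them.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(pre, post):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (u, p_value), where u is the smaller of the two U statistics.
    """
    n1, n2 = len(pre), len(post)
    # Pool both samples, remembering which group (0 = pre, 1 = post) each value came from.
    combined = sorted(
        (value, group) for group, sample in enumerate((pre, post)) for value in sample
    )
    rank_sum_pre = 0.0
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at index i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        pre_in_run = sum(1 for k in range(i, j) if combined[k][1] == 0)
        rank_sum_pre += avg_rank * pre_in_run
        i = j
    u1 = rank_sum_pre - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma  # z <= 0 because u is the smaller statistic
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return u, p_value

# Hypothetical example: every post-window day beats every pre-window day.
u, p = mann_whitney_u(list(range(1, 11)), list(range(11, 21)))
print(u, p)
```

Because the test works on ranks rather than raw values, a single extreme day moves the result far less than it moves a mean.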

Interpreting p-values

  • p < 0.05: the result is statistically significant at the 5% level. A difference this large would be unlikely if nothing had actually changed.
  • p ≥ 0.05: not significant yet. This could mean the post-window is too short, the effect size is small, or there was genuinely no change.
  • p < 0.001: very strong evidence of a real change.

Statistical significance does not mean business significance. A CTR change from 2.0% to 2.01% might be significant but is not worth acting on. Always evaluate the effect size (the Δ% column) alongside the p-value.
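
To make the effect-size point concrete, the sketch below computes a relative change between the two window means. It assumes that Δ% is the change relative to the pre-window mean, which this page does not spell out; the function name relative_change_pct is also an assumption.

```python
def relative_change_pct(pre_mean, post_mean):
    """Change from pre to post, as a percentage of the pre-window mean."""
    if pre_mean == 0:
        return None  # undefined when the baseline is zero
    return (post_mean - pre_mean) / pre_mean * 100

# CTR moving from 2.0% to 2.01% is roughly a 0.5% relative change.
change = relative_change_pct(2.0, 2.01)
print(f"{change:.2f}%")
```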

The four scenarios

Significant win

p < 0.05 and the metric moved in the desired direction (e.g. CTR increased, position improved). Your change worked. What to do: roll out to similar pages and document the win in the Site Changelog.

Significant loss

p < 0.05 and the metric moved in the wrong direction. Your change hurt performance. What to do: revert the change, check for contamination (did something else change?), and revisit the hypothesis.

Not yet significant

p ≥ 0.05. The post-window may not be complete, or the effect is small. What to do: wait for the post-window to finish. If the result is still not significant after a full window, the change likely had no measurable impact on GSC metrics. Consider testing a larger change.

Contaminated

The nightly crawler detected that the title, meta description, H1, or canonical changed on one or more URLs after the test started. What to do: treat results with caution, because you cannot isolate which change caused the metric movement. Consider archiving the test, reverting to a known baseline, and re-running.
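
The four scenarios can be read as a small decision rule, sketched below. This is an illustration only: the function name classify_result, its parameters, the choice to check contamination first, and the use of a single p-value are all assumptions, since this page does not say how the two test results are combined.

```python
def classify_result(p_value, delta_pct, higher_is_better=True,
                    contaminated=False, alpha=0.05):
    """Map a test result onto one of the four scenarios described above."""
    if contaminated:
        # Assumption: a contamination flag takes precedence over the statistics.
        return "contaminated"
    if p_value >= alpha:
        return "not yet significant"
    # For average position a lower number is better, so pass higher_is_better=False.
    improved = delta_pct > 0 if higher_is_better else delta_pct < 0
    return "significant win" if improved else "significant loss"

print(classify_result(0.01, 12.5))                           # CTR up 12.5%
print(classify_result(0.01, -8.0, higher_is_better=False))   # position number fell
print(classify_result(0.30, 4.0))
```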

Page snapshots & contamination detection

When a test is created, MetricsTab immediately crawls the URLs in scope and stores a page snapshot: title, meta description, first H1, canonical URL, and HTTP status code. The crawler runs nightly (03:30 UTC) for every active test.

Each snapshot generates a SHA-256 content hash of the concatenated title + meta description + H1 + canonical. If the hash changes between crawls, the URL is flagged as contaminated and an orange warning banner appears on the test detail page.
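
The hashing step can be sketched with Python's hashlib. This is a minimal illustration, not MetricsTab's exact scheme: the function name snapshot_hash, the UTF-8 encoding, and the newline separator between fields are assumptions (the description above says only that the four fields are concatenated).

```python
import hashlib

def snapshot_hash(title, meta_description, h1, canonical):
    """SHA-256 hex digest over the four tracked on-page fields."""
    # Assumption: fields are joined with a newline so that text moving from
    # one field to another still changes the hash.
    payload = "\n".join([title, meta_description, h1, canonical])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

baseline = snapshot_hash("Blue widgets", "Buy blue widgets.", "Blue Widgets",
                         "https://example.com/widgets")
latest = snapshot_hash("Blue widgets | Sale", "Buy blue widgets.", "Blue Widgets",
                       "https://example.com/widgets")
contaminated = baseline != latest  # True: the title changed between crawls
print(contaminated)
```

The HTTP status code is stored in the snapshot but, per the description above, is not part of the hash.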

If no baseline snapshot exists (i.e. the crawler had not run before your deployment date), contamination detection is unavailable for that test. The results are still computed from GSC data, but you will not receive hash-change alerts.

Caveats & limitations

  • GSC lag: Data is typically available 2–3 days after the date. Results for the last 3 days of the post-window will appear with a delay.
  • No randomised assignment: This is a before/after test on the same URLs, not an A/B split test. Confounding factors (seasonality, algorithm updates, competitor changes) can affect results. Use the Site Changelog to record other concurrent changes.
  • Query-level data not included: Tests compare page-level aggregates. Individual query movements are not broken out.
  • Maximum 500 URLs per test: URL sets larger than 500 are capped. For very large sites, use a narrower URL pattern or split into multiple tests.
  • Position is an impression-weighted average: Avg. position from GSC is weighted by impressions per query per day. It may differ from a simple unweighted mean or median.
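
The last caveat is easiest to see with numbers. The sketch below weights each position by its impressions; the function name weighted_avg_position and the sample rows are made up for illustration.

```python
def weighted_avg_position(rows):
    """rows: list of (position, impressions) pairs, e.g. one per query per day."""
    total_impressions = sum(impressions for _, impressions in rows)
    if total_impressions == 0:
        return None  # no impressions, so no meaningful position
    weighted_sum = sum(position * impressions for position, impressions in rows)
    return weighted_sum / total_impressions

# One query ranks 2nd with 900 impressions, another ranks 10th with 100.
rows = [(2.0, 900), (10.0, 100)]
print(weighted_avg_position(rows))            # 2.8
print(sum(p for p, _ in rows) / len(rows))    # 6.0 (simple unweighted mean)
```

The high-impression query dominates the weighted figure, which is why it can sit far from an unweighted average.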

Test lifecycle & statuses

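Every test moves through a simple lifecycle. You control most transitions manually from the test detail page; the move from Running to Concluded happens automatically once the full post-window has elapsed.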

| Status | What it means | Nightly automation | Typical next step |
|---|---|---|---|
| Draft | Test is saved but not yet active. No crawls or metric recomputes run. | Excluded | Click Activate when you're ready to start tracking. |
| Running | Test is live. URLs are crawled nightly (03:30 UTC) and metrics recomputed each evening (21:00 UTC). | Included | Wait for the post-window to elapse, then read results. |
| Concluded | Automatically set by the system when the full post-window has elapsed. Results are final. | Excluded | Review findings; archive when done or reactivate to extend the window. |
| Archived | Retired test. All historical results and URL snapshots are preserved for reference but the test is dormant. | Excluded | No action needed. Use Reactivate to resume tracking if required. |

Draft vs Archived: Draft = not yet started (typically no results). Archived = finished or retired (results are preserved). Neither counts toward the "Running Tests" badge in the navigation.
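
The lifecycle can be summarised as a small transition map. This sketch includes only the transitions named in the status descriptions above; whether other transitions (for example, archiving a Running test directly) are allowed is not stated here, so they are left out. The names ALLOWED_TRANSITIONS, can_transition, and in_nightly_automation are assumptions for the example.

```python
# Transitions taken from the status descriptions; anything not listed is
# treated as disallowed in this sketch.
ALLOWED_TRANSITIONS = {
    "draft": {"running"},                  # Activate
    "running": {"concluded"},              # automatic when the post-window elapses
    "concluded": {"archived", "running"},  # archive, or reactivate to extend
    "archived": {"running"},               # Reactivate
}

def can_transition(current, target):
    """True if the sketch's transition map allows current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())

def in_nightly_automation(status):
    """Only Running tests are crawled and recomputed."""
    return status == "running"

print(can_transition("draft", "running"))    # True
print(can_transition("draft", "archived"))   # False in this sketch
```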

Related features

  • Robots.txt Monitor — detect accidental crawl blocks that could invalidate your test.
  • Site Changelog — record other changes that happened during your test window to document confounders.
  • Winners & Losers — see which pages moved most in the same period to spot correlated changes.