# CorpusIQ robots.txt
# Canonical host: www.corpusiq.io. The bare apex (corpusiq.io) and the
# legacy .app variants 308 to www at the platform layer (Vercel Domains)
# and at the Next.js layer via the host-rule redirects at the top of
# next.config.ts. Per Decision #10 (2026-05-04), only one Sitemap line is
# allowed and it MUST point at the www host.
# Last reviewed: 2026-06-10
#
# See /ai.txt for the spawning.ai style allowlist for AI training.
# See /llms.txt for the structured LLM site summary.
#
# Group structure (2026-06-09 audit fix H1): per the robots spec
# (RFC 9309), a crawler obeys only the single most specific group that
# matches its user agent. The previous file gave every named bot its own
# "Allow: /" group, which meant those bots never saw the Disallow rules
# under "User-agent: *". The named allow-all groups are removed: every
# allowed crawler now falls through to the default group below, which
# allows the site AND carries the Disallow rules. Only crawlers that need
# rules DIFFERENT from the default get a named group (Bytespider).
#
# AI crawler stance (documented per CLAUDE.md "AI Crawler Posture").
# All of the following are intentionally allowed and governed by the
# default group: Googlebot, Bingbot, DuckDuckBot (organic search),
# GPTBot, ChatGPT-User, OAI-SearchBot (ChatGPT citations and SearchGPT),
# ClaudeBot, Claude-User, Claude-Web, anthropic-ai (Claude citations),
# PerplexityBot, Perplexity-User (Perplexity citations),
# Google-Extended (Google AI training opt-in), CCBot (Common Crawl),
# Applebot, Applebot-Extended (Siri, Spotlight, Apple Intelligence),
# Amazonbot (Alexa), Meta-ExternalAgent, FacebookBot (Meta AI and link
# previews), DuckAssistBot (DuckDuckGo Assist), Cohere-ai, YouBot.
# Do not re-add per-bot "Allow: /" groups for any of these: that would
# reintroduce the inheritance bug by detaching them from the Disallow
# rules below.

# --- Explicitly blocked crawlers ---
# Bytespider (ByteDance). Known for aggressive crawl rates and weak downstream
# traffic quality. Blocking to preserve crawl budget for bots that produce
# real citations. Review annually.
User-agent: Bytespider
Disallow: /

# --- Default rules for all other agents ---
# /_next/static/* is intentionally NOT blocked. These are the compiled JS
# and CSS bundles every bot needs to render the site correctly. Blocking
# them breaks rendering for any crawler not on the allowlist above.
# /_next/data/ is blocked because it exposes server-side data fetches that
# are internal to Next.js route prefetching.
#
# Trailing slashes (2026-06-10 fix): per RFC 9309, Disallow rules are
# prefix matches. "Disallow: /login/" matches /login/anything but NOT
# /login itself, and this site serves routes WITHOUT trailing slashes
# (trailingSlash: false), so the slashed forms blocked nothing. The
# private-route rules below are slash-free so they match both /login and
# /login/*. Do not re-add trailing slashes to these lines.
User-agent: *
Allow: /
Allow: /_next/static/
Disallow: /_next/data/
Disallow: /api
Disallow: /admin
Disallow: /dashboard
Disallow: /login
Disallow: /register
Disallow: /oauth

Sitemap: https://www.corpusiq.io/sitemap.xml