
robots.txt for AI crawlers: a configuration guide

By Abhijay Tondak, Founder · Updated June 25, 2026 · 6 min read

The short answer

robots.txt controls which AI crawlers may access your site, and you configure it per user-agent. The key decision is whether to allow the bots that power answer engines: if you block them, those engines cannot retrieve or cite your content. Many operators allow answer/retrieval bots while making a separate, deliberate choice about training bots, since the two serve different purposes.

Key takeaways

robots.txt directives are per user-agent; you can allow some AI bots and block others.
Blocking an answer engine's crawler usually means you can't be cited by it.
Distinguish training crawlers from live answer/retrieval crawlers - they serve different purposes.
robots.txt is a public, voluntary standard - it's a request, not an enforced lock.
Verify with server logs that the bots you intend to allow are actually getting through.

How robots.txt and AI crawlers interact

robots.txt is a file at your site root that tells crawlers which paths they may request, addressed per user-agent. AI companies operate named crawlers, and you can write rules for each one - allowing a search/answer bot while disallowing another. The mechanism is the same one that's governed search crawlers for years; what's new is the set of user-agents and the stakes.
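
As a rough illustration of per-user-agent rules, the sketch below uses Python's standard-library urllib.robotparser to evaluate a small robots.txt held in a string. The crawler names ExampleAnswerBot and ExampleTrainingBot and the example.com URL are hypothetical placeholders, not real user-agents; substitute the names each vendor documents. Python's parser is only one implementation, and individual crawlers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot names are placeholders, not real crawlers.
ROBOTS_TXT = """\
User-agent: ExampleAnswerBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/guide"

answer_allowed = parser.can_fetch("ExampleAnswerBot", page)      # own group: Allow: /
training_allowed = parser.can_fetch("ExampleTrainingBot", page)  # own group: Disallow: /
other_allowed = parser.can_fetch("SomeOtherBot", page)           # falls back to the * group

print(answer_allowed, training_allowed, other_allowed)
```

Each named group applies only to its own crawler, and any agent without a group falls back to the wildcard rules.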

The crucial point for GEO: if you disallow the crawler that an answer engine uses to fetch live content, that engine generally cannot retrieve your pages and therefore cannot cite them. So robots.txt is not just a technical hygiene file anymore - it's a direct lever on whether you're eligible to appear in AI answers.

Training bots vs. answer/retrieval bots

Not all AI crawlers do the same job, and conflating them leads to mistakes. Broadly, some crawlers gather content to train or update models, while others fetch pages in real time to ground an answer to a question a user is asking right now. The retrieval/answer crawlers are the ones whose access most directly affects whether you get cited in live answers.

This is why the decision is per user-agent rather than all-or-nothing. A publisher might choose to allow answer/retrieval bots (to remain citable) while making a separate, considered decision about training bots based on its own policy. Decide each deliberately rather than blanket-blocking or blanket-allowing, and document why.

Identify the named user-agent for each crawler you care about before writing a rule.
Allow answer/retrieval crawlers if you want to be eligible for live citations.
Make a separate, explicit decision on training crawlers per your content policy.
Don't accidentally catch AI bots in a broad 'Disallow: /' meant for something else.
Re-check periodically - crawler names and behaviors change over time.
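
One way to catch the broad-Disallow mistake from the checklist above is to test a draft robots.txt against the crawlers you intend to allow before deploying it. This is a minimal sketch using urllib.robotparser; ExampleAnswerBot, the find_blocked helper, and the sample URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A blanket rule written for some other purpose; no group names the answer bot.
DRAFT = """\
User-agent: *
Disallow: /
"""

# Same file with an explicit group for the (hypothetical) answer bot.
FIXED = """\
User-agent: ExampleAnswerBot
Allow: /

User-agent: *
Disallow: /
"""


def find_blocked(robots_txt, intended_agents, url):
    """Return the agents from intended_agents that may NOT fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in intended_agents if not parser.can_fetch(agent, url)]


intended = ["ExampleAnswerBot"]
url = "https://example.com/pricing"

blocked_in_draft = find_blocked(DRAFT, intended, url)
blocked_in_fixed = find_blocked(FIXED, intended, url)
print(blocked_in_draft, blocked_in_fixed)
```

A non-empty list for the draft means the wildcard rule is catching a crawler you meant to allow. Running a check like this after every robots.txt change keeps that from going unnoticed.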

robots.txt is voluntary - know its limits

robots.txt is a public, voluntary standard. Well-behaved crawlers honor it, but it is a request, not an enforced barrier - it does not authenticate or block anything at the network level. If you need to actually prevent access, that requires real access controls (authentication, server-side blocking), not a robots rule. And because the file is public, your directives are visible to anyone.
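
For contrast with a robots rule, server-side blocking refuses the request itself. The sketch below shows one hypothetical way to do that as WSGI middleware in Python; block_user_agents, ExampleTrainingBot, and the toy app are illustrative assumptions, not a recommended production setup. Because a User-Agent header can be spoofed, this is still not authentication.

```python
def block_user_agents(app, blocked_tokens):
    """Wrap a WSGI app so requests whose User-Agent contains a blocked token get 403."""
    lowered = [token.lower() for token in blocked_tokens]

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Toy application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app directly and return the status line it sent."""
    seen = []
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/", "HTTP_USER_AGENT": user_agent}
    app(environ, lambda status, headers: seen.append(status))
    return seen[0]


# ExampleTrainingBot is a hypothetical crawler name.
guarded = block_user_agents(site, ["ExampleTrainingBot"])
blocked_status = status_for(guarded, "Mozilla/5.0 (compatible; ExampleTrainingBot/1.0)")
allowed_status = status_for(guarded, "Mozilla/5.0 (compatible; ExampleAnswerBot/1.0)")
print(blocked_status, allowed_status)
```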

The practical takeaway: use robots.txt to express intent to compliant crawlers, but verify reality in your server logs. Confirm the bots you meant to allow are getting 200s and the ones you meant to block aren't being served - intent and outcome can diverge, especially after a config change.
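
A log check along these lines can be sketched in a few lines of Python. The example assumes access logs in the common "combined" format and uses made-up log lines and hypothetical bot names; adjust the pattern to whatever your server actually writes.

```python
import re
from collections import Counter, defaultdict

# Matches the request, status, size, referer, and User-Agent fields of a
# combined-format access log line (an assumption about your log layout).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def status_counts_by_bot(lines, bot_tokens):
    """Tally HTTP status codes per bot token found in the User-Agent field."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for token in bot_tokens:
            if token.lower() in agent:
                counts[token][match.group("status")] += 1
    return {bot: dict(statuses) for bot, statuses in counts.items()}


# Fabricated sample lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [25/Jun/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ExampleAnswerBot/1.0)"',
    '203.0.113.5 - - [25/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; ExampleAnswerBot/1.0)"',
    '198.51.100.7 - - [25/Jun/2026:10:01:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "ExampleTrainingBot/2.1"',
]

report = status_counts_by_bot(SAMPLE, ["ExampleAnswerBot", "ExampleTrainingBot"])
print(report)
```

In this made-up sample, the bot meant to be allowed is receiving a 403 on one path, and the bot meant to be blocked is still being served a 200. Both are the kind of gap between intent and outcome worth investigating.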

Frequently asked questions

If I block AI training bots, will I lose AI citations?

Not necessarily - it depends on which bot. Citations in live answers depend on the answer/retrieval crawler being allowed. Some engines separate the crawler that trains models from the one that fetches pages to ground a live answer, so the per-user-agent decision matters; blocking the wrong one can cost citations.

Does robots.txt actually stop a crawler from accessing my site?

Only voluntarily. It's a standard that well-behaved crawlers obey, but it doesn't authenticate or block at the network level. To truly prevent access you need server-side controls. Treat robots.txt as a clearly stated request, and verify behavior in your logs.

How do I know which AI crawler user-agents to list?

Major AI companies generally publish the user-agent strings for their crawlers, and your server logs show what is actually hitting your site. Identify the named agents for the engines your audience uses, then write per-agent rules rather than guessing or relying on a generic wildcard.
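
To see which crawler names appear in your own logs, a small script can pull bot-like product tokens out of the User-Agent field. This is a heuristic sketch: it assumes the User-Agent is the last quoted field on each line and that crawler tokens end in "bot", which will miss crawlers named differently. The sample lines and bot names are made up.

```python
import re
from collections import Counter

# The last quoted field of a combined-format log line is the User-Agent.
AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')
# Product tokens such as "ExampleAnswerBot/1.0"; the "bot" suffix is a heuristic.
BOT_TOKEN = re.compile(r'([A-Za-z][\w-]*bot)/[\w.]+', re.IGNORECASE)


def crawler_tokens(lines):
    """Count bot-like product tokens found in the User-Agent field."""
    counts = Counter()
    for line in lines:
        field = AGENT_FIELD.search(line)
        if not field:
            continue
        counts.update(BOT_TOKEN.findall(field.group(1)))
    return counts


# Fabricated sample lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [25/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ExampleAnswerBot/1.0)"',
    '203.0.113.5 - - [25/Jun/2026:10:00:09 +0000] "GET /faq HTTP/1.1" 200 640 "-" "Mozilla/5.0 (compatible; ExampleAnswerBot/1.0)"',
    '198.51.100.7 - - [25/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleTrainingBot/2.1"',
    '192.0.2.44 - - [25/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
]

seen = crawler_tokens(SAMPLE)
print(seen.most_common())
```

Treat the resulting names as candidates to confirm against each vendor's published documentation before writing rules for them.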
