What is robots.txt? It is a plain text file that sits at the root of your domain and tells search engine crawlers which parts of your site they're welcome to request and which they should leave alone. One file, a handful of rules, enormous influence over how bots spend their time on your site. The catch that trips up almost everyone: robots.txt controls crawling, not indexing. It can keep a bot from reading a page, but it cannot reliably keep that page out of Google's results. Confusing those two jobs is the most expensive robots.txt mistake there is, and it's worth understanding before you ever touch the file.
What is robots.txt, in plain English?
Robots.txt is a set of instructions for automated visitors. Before a well-behaved crawler requests pages from your site, it first fetches https://yourdomain.com/robots.txt and reads the rules. Those rules say things like "Googlebot, you may go anywhere except /admin/" or "everybody stay out of /internal-search/." The crawler then honors what it found.
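As a rough illustration of that fetch-then-obey flow, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules, the `ExampleBot` agent name, and the `urls_to_request` helper are assumptions made up for this example. A live crawler would download the file first (the parser also offers `set_url()` and `read()` for that), whereas this sketch parses an inline string so it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for the file at https://yourdomain.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-search/
"""


def urls_to_request(robots_text, agent, candidate_urls):
    """Return only the URLs a cooperative crawler would go on to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())  # step 1: read the rules
    # step 2: honor them before requesting anything else
    return [url for url in candidate_urls if parser.can_fetch(agent, url)]


candidates = [
    "https://yourdomain.com/",
    "https://yourdomain.com/admin/login",
    "https://yourdomain.com/internal-search/results",
    "https://yourdomain.com/blog/post",
]
allowed = urls_to_request(ROBOTS_TXT, "ExampleBot", candidates)
print(allowed)
```

Nothing in this code enforces anything on the server side. The filtering happens entirely inside the crawler, which is why the file only works on bots that choose to run a check like this.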
The file lives at the root of each host and must be named robots.txt in lowercase. It governs only that exact host: https://example.com and https://www.example.com are separate hosts that each need their own file, and a file buried at /blog/robots.txt is ignored entirely. Google fetches the root file, caches it for about a day, and applies it across the whole site.
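The host-scoping rule can be sketched in a few lines of Python. The `robots_url_for` helper is a hypothetical name for this example. It shows that the robots.txt location depends only on the scheme and host of a URL, never on the path.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Only scheme and host matter; the page's own path is discarded.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url_for("https://example.com/blog/post?x=1"))
print(robots_url_for("https://www.example.com/blog/post"))
```

The two calls return different URLs because the hosts differ, and neither ever resolves to something like /blog/robots.txt.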
Two ideas do most of the work here. User-agent names which crawler a block of rules applies to (Googlebot, Bingbot, * for everyone). Disallow names a path that crawler should not request. An empty Disallow: means "nothing is off limits," and Allow: carves exceptions back out of a broader block. That's the entire core of the format, and its simplicity is exactly why people overestimate what it can do.
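To make the group and empty-Disallow behavior concrete, here is a small sketch with Python's standard-library parser. The file contents and the `SomeOtherBot` name are invented for illustration. Allow is left out on purpose: `urllib.robotparser` applies rules in file order rather than by path length, so it is not a faithful model of how Google resolves Allow exceptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Bingbot gets an empty Disallow, everyone else is kept out of /private/
ROBOTS_TXT = """\
User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Bingbot matches its own group, where the empty Disallow means nothing is off limits.
bingbot_private = parser.can_fetch("Bingbot", "/private/report")
# Any other crawler falls through to the * group.
other_private = parser.can_fetch("SomeOtherBot", "/private/report")
other_public = parser.can_fetch("SomeOtherBot", "/public/page")

print(bingbot_private, other_private, other_public)
```

A crawler that finds a group naming it follows that group and ignores the `*` group, which is why Bingbot can reach /private/ in this example.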
The most important thing to internalize: robots.txt is a request, not a fence. It works because reputable crawlers choose to obey it. A scraper or a malicious bot can read your Disallow lines as a convenient map of where you'd rather it not look, then go there anyway. So robots.txt is for managing cooperative crawlers, never for security.
How robots.txt works
A robots.txt file is a list of groups. Each group starts with one or more User-agent lines, followed by the Allow and Disallow rules that apply to those agents. Conflicts inside a group are settled by a longest-match rule: when two rules match the same URL, the more specific (longer) path wins. A typical file looks like this:
```
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /*?sessionid=
Allow: /admin/public-policy.html

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```
That file tells every crawler to skip the admin and cart areas and any URL whose query string starts with a sessionid parameter, while still permitting one public file inside /admin/. It tells the GPTBot user-agent to stay out of the whole site. And it points all crawlers to the sitemap, which is the one place robots.txt helps with discovery rather than restriction. Note that * and $ are the only supported wildcards: * matches any sequence of characters, and $ anchors the end of a URL.
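The matching behavior can be sketched in Python. This is an illustrative model, not Google's implementation: `pattern_to_regex` and `is_allowed` are hypothetical helpers, and the tie-break (an Allow beats a Disallow of equal length) is an assumption based on Google's documented preference for the least restrictive rule. The rule list mirrors the `*` group from the file above.

```python
import re

# The * group from the example file, as (kind, pattern) pairs.
RULES = [
    ("disallow", "/admin/"),
    ("disallow", "/cart/"),
    ("disallow", "/*?sessionid="),
    ("allow", "/admin/public-policy.html"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Longest matching pattern wins; with no match the path is allowed."""
    best_length, allowed = -1, True
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty rule restricts nothing
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            is_allow = kind == "allow"
            if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
                best_length, allowed = len(pattern), is_allow
    return allowed


for path in ["/admin/settings", "/admin/public-policy.html",
             "/products?sessionid=abc", "/products?page=2&sessionid=abc"]:
    print(path, is_allowed(path))
```

The last path is the instructive one. Because the pattern contains the literal `?sessionid=`, it only catches URLs where sessionid is the first query parameter, so a URL with sessionid in a later position slips through.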
Crucially, a Disallow blocks the request, not the knowledge of the page. Here's the chain of events that catches people out:
| You do this | The bot does this | The result |
|---|---|---|
| Disallow: /secret-page | Skips fetching the page body | Page is not read |
| Another site links to /secret-page | Sees the link, can't read the target | Can index the bare URL anyway |
| You wanted it gone | Shows it in results, no title, "no information available" | The opposite of what you wanted |
Because the bot never fetches the page, it never sees any noindex instruction on it either. That's the heart of the confusion that sends so many Disallow rules backfiring, and it deserves its own section.
Why robots.txt matters (and where it doesn't)
Robots.txt matters most as a crawl-management tool on sites large or messy enough for crawling to be a real constraint. If a crawler can generate millions of useless URLs from your faceted navigation, calendar pages, internal search, or session parameters, you're handing it an infinite maze and watching it spend its attention on pages that will never rank. Pointing it away from those traps is part of protecting your crawl budget so the pages you care about get visited and refreshed. This is squarely a technical SEO concern, and on a large, parameter-heavy site it's a genuine lever.
On a small site, it matters far less than people think. A 40-page service site does not have a crawling problem, and an aggressive robots.txt there is more likely to cause harm than help: one stray Disallow: / shipped to production can quietly strip an entire site's content out of search. The number of businesses that have launched a redesign with the staging environment's "block everything" file still in place is not small, and the symptom (traffic falling off a cliff) takes days to notice and weeks to recover from.
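One way to guard against that staging-file accident is a check that runs before deploy. The sketch below is a minimal version using Python's standard-library parser. The `agents_blocked_from_root` helper and the agent list are assumptions for this example, and how you wire it into a release process is up to you.

```python
from urllib.robotparser import RobotFileParser


def agents_blocked_from_root(robots_text, agents=("Googlebot", "Bingbot")):
    """Return the agents that are not allowed to fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


staging_file = "User-agent: *\nDisallow: /\n"
production_file = "User-agent: *\nDisallow: /admin/\n"

print(agents_blocked_from_root(staging_file))
print(agents_blocked_from_root(production_file))
```

A non-empty result for the file you are about to ship is a reason to stop the release and look.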
Where robots.txt does not matter, and where you should not reach for it, is anything to do with keeping a page out of search results. It has no power there. A disallowed page can still surface in the SERP as a bare URL. It also has no role in privacy: listing a sensitive path in Disallow advertises it to anyone who reads the file. The same request-not-a-fence logic applies to AI crawlers: blocking GPTBot or Google-Extended only works on bots that choose to obey, and if your goal is to be cited in AI answers rather than shut out of them, that is a generative engine optimization question, not a blocking one.
The one mistake to avoid: using Disallow as noindex
This is the mistake worth tattooing somewhere. If your goal is "I don't want this page showing up in Google," robots.txt is the wrong tool. Use noindex instead, delivered as a <meta name="robots" content="noindex"> tag in the page's <head> or an X-Robots-Tag: noindex HTTP header.
The reason is mechanical. For Google to honor a noindex, it has to crawl the page and read the instruction. If you've blocked the page in robots.txt, Google never fetches it, never sees the noindex, and is free to index the URL from links pointing at it. So the two directives don't stack the way intuition suggests:
- Want a page crawled but not in the index? Allow it in robots.txt, add noindex. Google reads the page, sees the instruction, drops it.
- Want a page neither crawled nor seen? That's not what Disallow gives you. A disallowed page can still rank as a URL-only result.
The correct sequence to remove a page from search is to allow the crawl, add noindex, wait for Google to recrawl and drop it, and only then, if you want, add a Disallow. Block it first and you freeze the page in the index in a half-removed state you can't fix without unblocking it. Get this order wrong and you'll spend a week watching a page you tried to hide sit stubbornly in the results.
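The interaction between the two directives can be checked mechanically. The sketch below is a simplified audit, assuming you already have a page's HTML and response headers in hand. `RobotsMetaFinder`, `has_noindex`, and `audit` are hypothetical names, and the substring test for "noindex" is cruder than a real directive parser would be.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def has_noindex(html, headers):
    """True if noindex appears in a robots meta tag or the X-Robots-Tag header."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    in_header = "noindex" in lowered.get("x-robots-tag", "")
    return in_header or any("noindex" in d for d in finder.directives)


def audit(robots_text, url, html, headers, agent="Googlebot"):
    """Classify how the crawl rule and the index rule combine for one page."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    crawlable = parser.can_fetch(agent, url)
    noindex = has_noindex(html, headers)
    if noindex and not crawlable:
        return "conflict: noindex is present but the crawler is blocked from reading it"
    if noindex:
        return "ok: crawlable and carrying noindex, so it can be dropped"
    if not crawlable:
        return "blocked: not crawled, but the bare URL can still be indexed"
    return "open: crawlable and indexable"


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(audit("User-agent: *\nDisallow: /old-page\n", "https://example.com/old-page", page, {}))
print(audit("User-agent: *\nDisallow:\n", "https://example.com/old-page", page, {}))
```

The first call reports the conflict this section warns about. The second shows the state you want a page to be in while you wait for it to drop out of the index.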
A few other common errors: blocking CSS and JavaScript that Google needs to render the page (allow your rendering assets), forgetting that each subdomain needs its own file, and letting the robots.txt URL return a 5xx server error, which Google treats as a signal to stop crawling the whole site until the error clears. A missing file (404) is safe and means "everything is open." A file that errors out is not.
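The status-code behavior described above can be written as a small lookup. This is a sketch that models only the cases this article mentions, plus the ordinary successful response. `crawl_policy_for_status` and its return strings are hypothetical, and other status codes are deliberately left out.

```python
def crawl_policy_for_status(status):
    """Map the HTTP status of /robots.txt to how Google is described as reacting."""
    if 200 <= status < 300:
        return "parse the file and follow its rules"
    if status == 404:
        return "no file: everything is open"
    if 500 <= status < 600:
        return "server error: stop crawling the whole site"
    return "not covered by this sketch"


for code in (200, 404, 503):
    print(code, crawl_policy_for_status(code))
```

In practice this means the response code of /robots.txt is worth monitoring alongside its contents.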
The bottom line
Robots.txt is a small, blunt, useful instrument: it tells cooperative crawlers where not to spend their time, and on a large or parameter-heavy site that's a real way to protect crawl efficiency and point bots at your sitemap. It is table-stakes technical hygiene, not a growth lever, and on most small sites the safest robots.txt is a short, permissive one.
The single rule that prevents most pain: robots.txt controls crawling, noindex controls indexing, and they are not interchangeable. When you want a page gone from Google, you want it crawlable and carrying a noindex, not blocked. Reach for Disallow to manage traffic and crawl waste, never to hide a page, and you'll avoid the failure mode that has quietly de-ranked countless sites that thought they were being careful.
If your robots.txt, indexing rules, and crawl architecture have drifted out of sync (and on most sites that have been around a few years, they have), our technical SEO team will audit exactly what crawlers can reach, what's leaking into the index, and what important pages are being starved of attention. Email us at admin@moonsauceagency.com and you'll get a clear read on your crawl and indexing setup plus the specific fixes, in priority order.
Sources: Google Search Central: robots.txt documentation · Google Search Central: block indexing with noindex