What Is Robots.txt? Crawl Control Explained

What is robots.txt? It is a plain text file that sits at the root of your domain and tells search engine crawlers which parts of your site they're welcome to request and which they should leave alone. One file, a handful of rules, enormous influence over how bots spend their time on your site. The catch that trips up almost everyone: robots.txt controls crawling, not indexing. It can keep a bot from reading a page, but it cannot reliably keep that page out of Google's results. Confusing those two jobs is the most expensive robots.txt mistake there is, and it's worth understanding before you ever touch the file.

What is robots.txt, in plain English?

Robots.txt is a set of instructions for automated visitors. Before a well-behaved crawler requests pages from your site, it first fetches https://yourdomain.com/robots.txt and reads the rules. Those rules say things like "Googlebot, you may go anywhere except /admin/" or "everybody stay out of /internal-search/." The crawler then honors what it found.
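As a rough illustration of that fetch-then-obey step, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules are the two quoted above; the helper name `may_fetch`, the bot name `SomeOtherBot`, and the sample URLs are made up for the example, and real crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The two example rules from the paragraph above, written as a robots.txt file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /internal-search/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the parsed rules let this user agent request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network request
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("Googlebot", "https://yourdomain.com/admin/users"))
    print(may_fetch("Googlebot", "https://yourdomain.com/pricing"))
    print(may_fetch("SomeOtherBot", "https://yourdomain.com/internal-search/results"))
```

A polite crawler would run a check like this before every request and skip any URL for which it returns False.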

The file lives at the root of each host and must be named robots.txt in lowercase. It governs only that exact host: https://example.com and https://www.example.com are separate hosts that each need their own file, and a file buried at /blog/robots.txt is ignored entirely. Google fetches the root file, caches it for about a day, and applies it across the whole host.
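To make the host scoping concrete, here is a small sketch that works out which robots.txt governs a given page URL. The function name `robots_txt_url` is hypothetical; the logic simply keeps the scheme and host and replaces everything else with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the host of page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://example.com/blog/post?page=2"))
    print(robots_txt_url("https://www.example.com/blog/post?page=2"))
```

The two calls return different URLs, which is the point: the www and non-www hosts are governed by separate files.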

Two ideas do most of the work here. User-agent names which crawler a block of rules applies to (Googlebot, Bingbot, * for everyone). Disallow names a path that crawler should not request. An empty Disallow: means "nothing is off limits," and Allow: carves exceptions back out of a broader block. That's the entire core of the format, and its simplicity is exactly why people overestimate what it can do.

The most important thing to internalize: robots.txt is a request, not a fence. It works because reputable crawlers choose to obey it. A scraper or a malicious bot can read your Disallow lines as a convenient map of where you'd rather it not look, then go there anyway. So robots.txt is for managing cooperative crawlers, never for security.

How robots.txt works

A robots.txt file is a list of groups. Each group starts with one or more User-agent lines, then the Allow and Disallow rules that apply to those agents, all governed by a longest-match rule: when two rules conflict, the more specific (longer) path wins. A typical file looks like this:

```
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /*?sessionid=
Allow: /admin/public-policy.html

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

That file tells every crawler to skip the admin and cart areas and any URL carrying a sessionid parameter, while still permitting one public file inside /admin/. It tells the GPTBot user-agent to stay out of the whole site. And it points all crawlers to the sitemap, which is the one place robots.txt helps with discovery rather than restriction. Note that * and $ are the only supported wildcards: * matches any sequence of characters, and $ anchors the end of a URL.
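The longest-match rule and the two wildcards can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any crawler's actual implementation: the names `pattern_to_regex` and `is_allowed` are hypothetical, the rules cover a single group and are passed as (directive, pattern) pairs, and a tie between equally long Allow and Disallow rules goes to Allow, following Google's documented "least restrictive rule wins" behavior.

```python
import re

# The "User-agent: *" group from the example file above.
EXAMPLE_RULES = [
    ("Disallow", "/admin/"),
    ("Disallow", "/cart/"),
    ("Disallow", "/*?sessionid="),
    ("Allow", "/admin/public-policy.html"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (* and trailing $) into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text; each * becomes "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Apply longest-match precedence; a path with no matching rule is allowed."""
    best_length, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow: restricts nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longer pattern wins; on a tie, prefer the Allow rule.
            if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
                best_length, allowed = len(pattern), is_allow
    return allowed


if __name__ == "__main__":
    for path in ("/admin/settings", "/admin/public-policy.html",
                 "/products?sessionid=abc", "/about"):
        print(path, is_allowed(EXAMPLE_RULES, path))
```

The Allow line wins for /admin/public-policy.html only because its pattern is longer than /admin/, not because of where it appears in the file.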

Crucially, a Disallow blocks the request, not the knowledge of the page. Here's the chain of events that catches people out:

You do this	The bot does this	The result
Disallow: /secret-page	Skips fetching the page body	Page is not read
Another site links to /secret-page	Sees the link, can't read the target	Indexes the bare URL anyway
You wanted it gone	Shows it in results, no title, "no information available"	The opposite of what you wanted

Because the bot never fetches the page, it never sees any noindex instruction on it either. That's the heart of the confusion that sends so many Disallow rules backfiring, and it deserves its own section.

Why robots.txt matters (and where it doesn't)

Robots.txt matters most as a crawl-management tool on sites large or messy enough for crawling to be a real constraint. If a crawler can generate millions of useless URLs from your faceted navigation, calendar pages, internal search, or session parameters, you're handing it an infinite maze and watching it spend its attention on pages that will never rank. Pointing it away from those traps is part of protecting your crawl budget so the pages you care about get visited and refreshed. This is squarely a technical SEO concern, and on a large, parameter-heavy site it's a genuine lever.

On a small site, it matters far less than people think. A 40-page service site does not have a crawling problem, and an aggressive robots.txt there is more likely to cause harm than help: one stray Disallow: / shipped to production stops crawlers from reading any page, and the site's search visibility quietly collapses. The number of businesses that have launched a redesign with the staging environment's "block everything" file still in place is not small, and the symptom (traffic falling off a cliff) takes days to notice and weeks to recover from.
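One way to catch the staging-file accident before it ships is a pre-deploy check. The sketch below is only a suggestion: the function name `blocks_everything` and the choice to test the homepage path are assumptions, and it relies on Python's `urllib.robotparser`, whose matching is simpler than Google's.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt, user_agent="Googlebot"):
    """Return True if the file stops this user agent from fetching the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # If even "/" is off limits, the file is almost certainly a staging leftover.
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    staging_file = "User-agent: *\nDisallow: /\n"
    if blocks_everything(staging_file):
        print("Refusing to deploy: robots.txt blocks the whole site")
```

Run against the production robots.txt in a deploy pipeline, a check like this turns a weeks-long recovery into a failed build.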

Where robots.txt does not matter, and where you should not reach for it, is anything to do with keeping a page out of search results. It has no power there. A disallowed page can still surface in the SERP as a bare URL. It also has no role in privacy: listing a sensitive path in Disallow advertises it to anyone who reads the file. The same request-not-a-wall logic applies to AI crawlers: blocking GPTBot or Google-Extended only works on bots that choose to obey, and if your goal is to be cited in AI answers rather than shut out of them, that is a generative engine optimization question, not a blocking one.

The one mistake to avoid: using Disallow as noindex

This is the mistake worth tattooing somewhere. If your goal is "I don't want this page showing up in Google," robots.txt is the wrong tool. Use noindex instead, delivered as a <meta name="robots" content="noindex"> tag in the page's <head> or an X-Robots-Tag: noindex HTTP header.
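To verify that a page actually carries the instruction, you can check both delivery methods. The following is a simplified sketch: the function name `has_noindex` is hypothetical, it takes response headers as a plain dict and the HTML as a string, and it ignores crawler-specific variants such as a meta tag named after one bot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex(headers, html):
    """Return True if either the X-Robots-Tag header or the meta tag says noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex({}, page))
```

The header route is the only option for non-HTML files such as PDFs, which have no <head> to put a meta tag in.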

The reason is mechanical. For Google to honor a noindex, it has to crawl the page and read the instruction. If you've blocked the page in robots.txt, Google never fetches it, never sees the noindex, and is free to index the URL from links pointing at it. So the two directives don't stack the way intuition suggests:

Want a page crawled but not in the index? Allow it in robots.txt, add noindex. Google reads the page, sees the instruction, drops it.
Want a page neither crawled nor seen? That's not what Disallow gives you. A disallowed page can still rank as a URL-only result.
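The interaction between the two directives can be written down as a tiny decision function. This is a deliberately simplified model of the behavior described above, not Google's algorithm; the function name `search_outcome` and the outcome strings are made up for illustration.

```python
def search_outcome(disallowed, has_noindex):
    """Model what happens to a linked-to page under each combination."""
    if disallowed:
        # The page is never fetched, so any noindex on it is never seen.
        return "not crawled; may still be indexed as a bare URL"
    if has_noindex:
        return "crawled; dropped from the index"
    return "crawled; eligible for indexing"


if __name__ == "__main__":
    for disallowed in (False, True):
        for noindex in (False, True):
            print(disallowed, noindex, "->", search_outcome(disallowed, noindex))
```

Note that once a page is disallowed, the noindex flag makes no difference to the result, which is exactly why the two do not stack.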

The correct sequence to remove a page from search is to allow the crawl, add noindex, wait for Google to recrawl and drop it, and only then, if you want, add a Disallow. Block it first and you freeze the page in the index in a half-removed state you can't fix without unblocking it. Get this order wrong and you'll spend a week watching a page you tried to hide sit stubbornly in the results.

A few other common errors: blocking CSS and JavaScript that Google needs to render the page (allow your rendering assets), forgetting that each subdomain needs its own file, and shipping a robots.txt that returns a 5xx server error, which Google treats, at least temporarily, as a signal to stop crawling the whole site. A missing file (404) is safe and means "everything is open." A broken file is not.
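The status-code behavior in that last point can be summarized in a short sketch. It models only the cases named above (a successful response, a 404, and a 5xx); the function name `robots_fetch_policy` and the returned labels are hypothetical, and other status codes are deliberately left unspecified.

```python
def robots_fetch_policy(status_code):
    """Map the HTTP status of a robots.txt request to the crawl behavior described above."""
    if 200 <= status_code < 300:
        return "parse the file and follow its rules"
    if status_code == 404:
        return "no file: everything is open"
    if 500 <= status_code < 600:
        return "server error: stop crawling the site"
    return "not covered here"


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, "->", robots_fetch_policy(code))
```

A simple uptime monitor on /robots.txt that alerts on any 5xx response is a cheap guard against the dangerous case.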

The bottom line

Robots.txt is a small, blunt, useful instrument: it tells cooperative crawlers where not to spend their time, and on a large or parameter-heavy site that's a real way to protect crawl efficiency and point bots at your sitemap. It is table-stakes technical hygiene, not a growth lever, and on most small sites the safest robots.txt is a short, permissive one.

The single rule that prevents most pain: robots.txt controls crawling, noindex controls indexing, and they are not interchangeable. When you want a page gone from Google, you want it crawlable and carrying a noindex, not blocked. Reach for Disallow to manage traffic and crawl waste, never to hide a page, and you'll avoid the failure mode that has quietly de-ranked countless sites that thought they were being careful.

If your robots.txt, indexing rules, and crawl architecture have drifted out of sync (and on most sites that have been around a few years, they have), our technical SEO team will audit exactly what crawlers can reach, what's leaking into the index, and what important pages are being starved of attention. Email us at admin@moonsauceagency.com and you'll get a clear read on your crawl and indexing setup plus the specific fixes, in priority order.


Sources: Google Search Central: robots.txt documentation · Google Search Central: block indexing with noindex

Frequently asked

What is robots.txt used for?

Robots.txt manages crawler traffic. You use it to keep bots out of areas that waste their time or strain your server: internal search results, faceted navigation, infinite parameter combinations, admin paths, and staging environments. It's a crawl-budget and politeness tool, most useful on large sites. What it is not for is hiding pages from search results. A disallowed page can still be indexed if other pages link to it, so robots.txt is the wrong lever for privacy or de-indexing.

Where does the robots.txt file go?

At the root of each host, always named robots.txt in lowercase. For example.com it lives at https://example.com/robots.txt. It applies only to that exact protocol, subdomain, and port, so https and http, and www and non-www, are treated as separate hosts that each need their own file. A robots.txt in a subfolder like /blog/robots.txt is ignored entirely. Google fetches the root file, caches it for roughly a day, and applies it across the whole host.

What's the difference between robots.txt and noindex?

Robots.txt controls crawling; noindex controls indexing, and they solve different problems. Disallow in robots.txt tells a bot not to request a URL. Noindex (a meta robots tag or X-Robots-Tag header) tells a bot that it may read the page but must not show it in search results. The trap: if you Disallow a page, Google can't crawl it to see the noindex, so the page can still rank as a URL-only result. To remove a page from search, allow the crawl and let it see the noindex.

Does Disallow remove a page from Google?

No, and this is the single most common robots.txt mistake. Disallow stops Google from fetching the page content, but if other sites or your own internal links point to that URL, Google can still index the bare URL and show it in results, often with no title or description and a note that it can't access the page. Removing a page from the index requires noindex on a crawlable page, password protection, or the Search Console removal tool (which only hides the page temporarily), never a Disallow alone.

Can robots.txt block AI crawlers and ChatGPT?

Partly. Reputable AI crawlers publish user-agent tokens (GPTBot, Google-Extended, ClaudeBot, and others) that you can target with Disallow rules, and the well-behaved ones honor them. But robots.txt is a request, not a wall: it relies on the crawler choosing to obey, so it does nothing against bots that ignore it. If you want to be found in AI answers rather than blocked from them, that's a generative engine optimization question, not a blocking one.

What happens if I have no robots.txt file?

Nothing bad. A missing robots.txt (returning a 404) means everything is open: crawlers assume they may request any URL, which is the default. You only need a file when you want to restrict something or point to your sitemap. The dangerous failure is a robots.txt that returns a server error (5xx). Google treats a persistent 5xx as a signal to stop crawling the whole site, so a broken file is far worse than no file.

Should I block CSS and JavaScript in robots.txt?

No. This was common years ago and it now hurts you. Google renders pages much like a browser, so if you Disallow your CSS, JavaScript, or image directories, Googlebot can't see the page the way users do and may misjudge layout, mobile-friendliness, and content. Google's guidance is explicit: allow crawling of the resources needed to render the page. Block application logic and infinite URL spaces if you must, but leave rendering assets open.
