Should You Block AI Crawlers?

Sometime in the last couple of years, a new kind of visitor started showing up in website traffic logs: crawlers from AI companies. GPTBot from OpenAI. ClaudeBot from Anthropic. Google-Extended. PerplexityBot. They read your pages the same way Google's crawler always has, but the content can end up somewhere new: in the answers that AI assistants give people.

That raised a question every website owner now has to answer, whether they realize it or not: do you let these bots in, or do you block them?

Publishers and media companies have been fighting about this loudly, and for them it's a genuinely hard call. But you're probably not a publisher. You're a plumber, a roofer, a cleaning company, a trucking outfit. And for a local service business, I think the answer is clearer than the headlines make it sound. Let's work through it honestly.

What these crawlers actually are

A crawler is just a program that visits web pages and reads them. Search engines have run them for thirty years; it's how Google knows your site exists. The AI crawlers work the same way mechanically, but the content they collect feeds different things:

Training crawlers gather text used to train future AI models. OpenAI's GPTBot is the best-known example, and OpenAI describes its crawlers and how to control them on its own site.
Search and answer crawlers fetch pages so an AI assistant can cite or summarize them when someone asks a question right now. OpenAI runs a separate bot for its search features, and Perplexity and others do similar real-time fetching.
Google-Extended is a special case. It's not a separate crawler at all; it's a control that tells Google whether your content may be used for its AI models. Blocking it does not remove you from regular Google Search. Google explains the distinction in its crawler documentation.

The control mechanism for all of this is robots.txt, a small text file at the root of your website where you state which bots are welcome. It's the same file websites have used since the 1990s. Two caveats. First, it's voluntary: reputable companies honor it and shady scrapers ignore it. Second, it controls access, not use: you can decide which bots may read which parts of your site, but you can't let a crawler in and tell it to only quote you nicely.

One thing to get out of the way immediately: whatever you decide about AI bots, do not block Googlebot or Bingbot, the regular search crawlers. That's how you vanish from search results entirely. We've seen DIY sites do this by accident more than once, usually a leftover setting from when the site was under construction. If you're not sure what your robots.txt says, tools like Bing Webmaster Tools and Google Search Console will show you how crawlers see your site.
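If you'd rather test this than eyeball it, here is a minimal sketch using Python's standard urllib.robotparser module. Both robots.txt samples and the example.com address are made up for illustration: the first mimics a leftover "under construction" setting, the second turns away a single AI crawler and nobody else. Python's parser is only an approximation of how each real crawler reads the file, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical leftover from a site under construction:
# every crawler, including the search engines, is told to stay out.
LEFTOVER_BLOCK = """\
User-agent: *
Disallow: /
"""

# Hypothetical file that blocks one AI crawler and leaves everyone else alone.
BLOCK_ONE_BOT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, bot: str, url: str = "https://example.com/") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


def search_crawlers_blocked(robots_txt: str) -> list[str]:
    """List the regular search crawlers this robots.txt would turn away."""
    return [bot for bot in ("Googlebot", "Bingbot") if not allowed(robots_txt, bot)]


if __name__ == "__main__":
    print(search_crawlers_blocked(LEFTOVER_BLOCK))  # ['Googlebot', 'Bingbot']
    print(search_crawlers_blocked(BLOCK_ONE_BOT))   # []
    print(allowed(BLOCK_ONE_BOT, "GPTBot"))         # False
```

An empty list from search_crawlers_blocked is what you want to see. Anything else means the regular search engines are being turned away.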

The case for blocking

Let's steelman it, because the concerns are real for some businesses.

Your content trains someone else's product. If you've invested years in original writing, photography, or research, AI companies are using it to build commercial products without paying you. For newspapers, stock photo agencies, and content businesses, that's an existential fight.

Answers without visits. If an AI assistant reads your pricing guide and just tells the user the answer, the user may never visit your site. For a business that monetizes page views with ads, that's lost revenue, full stop.

You can't control how you're represented. AI systems summarize, and summaries can be wrong. Some owners would rather not be summarized at all.

It's reversible. Blocking a crawler today doesn't salt the earth. You can change your robots.txt next month. (One asterisk: content already collected for training doesn't get un-collected. The block only applies going forward.)

If your website is your product, meaning you sell the content itself or the ad views around it, blocking training crawlers is a defensible business decision, and you should think hard about it.

The case for allowing, if you're a local business

Now flip to the situation most of our clients are in. Your website is not your product. Your product is fixing the AC, replacing the roof, hauling the freight. The website exists for exactly one reason: so that when someone needs what you do, they find you and call you.

For that business, the math changes completely.

AI assistants are becoming a referral channel. People increasingly ask ChatGPT, Gemini, and their phone's assistant things like "who should I call about a water heater leak in Wilmington" or "what's a fair price for gutter replacement." The assistants answer largely from what they can read on the open web. If your site is blocked, your own pages are not in the pool they can draw on to describe and recommend you. Your competitor who allowed the crawl is in it.

There's no ad revenue to protect. The "zero-click" problem that terrifies publishers mostly doesn't apply to you. You don't care whether the customer reads your FAQ page or hears its contents from an assistant, as long as your name and number come through. We dug into that whole dynamic in Zero-Click Searches: Winning Without the Website Visit.

Your content isn't your moat. Nobody is going to steal your "signs your furnace needs replacing" article and put you out of business with it. The article exists to demonstrate competence and get you found. The more machines that read it and associate your name with furnace expertise in your county, the better it's doing its job.

The downside is mostly theoretical; the upside is a customer. The realistic worst case of allowing AI crawlers is that some model somewhere trained on your service descriptions. The realistic worst case of blocking them is a homeowner asking an assistant for a recommendation and hearing three of your competitors' names. One of those costs you real money.

So our default stance, and what we configure on the sites we build: local service businesses should allow AI crawlers. Be findable everywhere a customer might ask. The businesses we work with, like an HVAC company or a roofing contractor, win by being visible, not by being protected.

A sensible middle ground, if you want one

You don't have to treat this as all-or-nothing across every bot. robots.txt lets you decide per crawler. A reasonable middle position some owners choose:

Allow the regular search crawlers (always).
Allow the AI search and answer bots, since those directly cite and recommend businesses to people who are looking to buy right now.
Block pure training crawlers if the idea of model training bothers you on principle.

That gets you most of the visibility upside while opting out of the part people object to most. Personally, for a local business, I'd still allow everything. The training crawlers are also how some assistants build their underlying knowledge of which businesses exist where, and the cost to you is hard to identify. But the middle path is legitimate, and it's your call.
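For the curious, the middle path could look something like the sketch below, which builds a robots.txt and then checks it with Python's standard urllib.robotparser. The bot lists are my assumptions for illustration: OAI-SearchBot is the name I'm assuming for OpenAI's search bot, and crawler names change, so confirm each token against the vendor's own documentation before you publish anything. The helper functions and the example.com address are hypothetical too.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; verify each against the vendor's current docs.
TRAINING_BOTS = ["GPTBot", "Google-Extended"]    # model-training controls
ANSWER_BOTS = ["OAI-SearchBot", "PerplexityBot"]  # AI search and answer bots
SEARCH_BOTS = ["Googlebot", "Bingbot"]            # regular search crawlers


def build_robots_txt(blocked: list[str]) -> str:
    """Block the listed bots site-wide; leave every other crawler unrestricted."""
    groups = [f"User-agent: {bot}\nDisallow: /\n" for bot in blocked]
    # An empty Disallow line means "nothing is off limits" for everyone else.
    groups.append("User-agent: *\nDisallow:\n")
    return "\n".join(groups)


def access_report(robots_txt: str, bots: list[str],
                  url: str = "https://example.com/") -> dict[str, bool]:
    """Map each bot name to whether this robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    robots_txt = build_robots_txt(TRAINING_BOTS)
    print(robots_txt)
    report = access_report(robots_txt, TRAINING_BOTS + ANSWER_BOTS + SEARCH_BOTS)
    for bot, ok in report.items():
        print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

The design choice here is to name only the bots you want out and leave a permissive catch-all for everyone else, so a new or renamed crawler is allowed by default. Passing an empty list to build_robots_txt gives you the allow-everything file instead.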

What I'd push back on is blocking everything because a headline scared you. The publishers in those headlines have a different business than you do.

What to actually check this week

Three practical steps, in order:

Find out what your robots.txt currently says. Type your domain followed by /robots.txt into a browser. If it's a wall of "disallow" lines, or if it blocks everything, find out why. If your site was built by an agency or a DIY platform, the answer may surprise you, because some platforms made blanket AI-blocking decisions on behalf of every customer without asking.
Make sure there's something worth crawling. A crawler that's allowed in but finds one thin homepage can't help you. You need real pages: one per service, written in plain text, with your service area named. That's the foundation under everything, and it's the core of what we do in our website and SEO service.
Add structured data. Letting machines in is step one; helping them understand what they're reading is step two. Markup that labels your business name, hours, services, and reviews is documented, supported, and underused. We wrote a full plain-English guide: Structured Data: Feeding the Answer Engines.
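To give a feel for the third step, here is a rough sketch in Python that produces a JSON-LD snippet using schema.org's LocalBusiness type. The business name, phone number, hours, and services are invented placeholders, the function name is hypothetical, and the handful of properties shown is my choice for illustration, not a complete or required set. Reviews are left out on purpose, since rating markup should only describe real reviews.

```python
import json


def local_business_jsonld(name: str, phone: str, city: str, region: str,
                          services: list[str], hours: str) -> str:
    """Build a <script> tag holding schema.org LocalBusiness markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": name,
        "telephone": phone,
        "address": {
            "@type": "PostalAddress",
            "addressLocality": city,
            "addressRegion": region,
        },
        "areaServed": city,
        "openingHours": hours,
        # One Offer per service, so each service is labeled individually.
        "makesOffer": [
            {"@type": "Offer", "itemOffered": {"@type": "Service", "name": s}}
            for s in services
        ],
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Placeholder business details, not a real company.
    print(local_business_jsonld(
        name="Example Heating and Air",
        phone="+1-910-555-0100",
        city="Wilmington",
        region="NC",
        services=["AC repair", "Furnace replacement"],
        hours="Mo-Fr 08:00-17:00",
    ))
```

The printed tag goes in the HTML of the page it describes. Before relying on it, run the page through a structured data testing tool to confirm the markup is read the way you intended.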

None of this is exotic. It's an hour of checking and a posture decision, and then the ongoing work of having a site that's actually worth reading, by humans or machines.

Want it set up right the first time?

Every site we build ships with crawler access configured deliberately, structured data installed, and the AI-visibility basics done, not bolted on later. Omnyra is a veteran-owned web shop in Wilmington, NC. We've built 1,500+ small business sites in the last 90 days with a done-with-you process: your site gets built live on a call with you, first draft in 24 hours, live in 7 days, guaranteed.

Structured data and AI-search visibility come standard in our $2,000 tier, plus $200 a month for hosting, maintenance, and monthly content. Tiers run from $500 up to Super Max, which starts at $6,000, with pay-in-4 or Klarna available. Check the details on pricing, or book a call and we'll pull up your robots.txt together and tell you exactly what's getting in and what's being turned away.