Eclipse Marketing

Robots.txt files don’t directly boost search rankings, but they help search engines crawl your site more efficiently. This simple text file acts as a traffic director, guiding crawlers to your most valuable content while blocking low-value pages. With AI Overviews now appearing in over 50% of searches, proper crawler management becomes even more critical for maintaining visibility. When search engines focus their crawl budget on pages that matter, it can lead to better indexing of important content. This comprehensive guide from Eclipse Marketing shows you exactly how to create, configure, and optimize your robots.txt file to maximize your site’s crawling efficiency and protect your server resources.

Understanding the Robots.txt File

What Does a Robots.txt File Do?

A robots.txt file gives directions to web crawlers. It tells them which website pages to crawl and which ones to avoid. 

Take a look at this example:

User-agent: Googlebot
Allow: /blog/*
Allow: /guides/*
Disallow: /private/
Disallow: /temp/

User-agent: Bingbot
Allow: /resources/*
Allow: /articles/*
Disallow: /internal/*
Disallow: /test-data/

User-agent: DuckDuckBot
Allow: /public/*
Disallow: /beta/
Disallow: /drafts/

User-agent: LinkedInBot
Allow: /case-studies/*
Allow: /pdfs/*
Disallow: /confidential/

Robots.txt files can look confusing when you first see them, but the structure is actually straightforward. “Allow” directives tell crawlers they can visit those sections, while “Disallow” directives tell them to stay away. Keep this crucial fact in mind: robots.txt files only suggest how crawlers should act. They cannot guarantee pages will stay out of search results, because other factors like external links can still get pages indexed by Google. To completely block indexing, you need a meta robots tag or the X-Robots-Tag HTTP header instead.

Understanding the Three Main Control Methods

Robots.txt tells search engines what not to crawl, while meta robots tags and X-Robots-Tag headers tell them what not to index. Learning the difference helps you choose the right tool for each situation. 

Here’s how they work:

  • Robots.txt sits in your website’s main directory and gives site-wide instructions to search engine crawlers. It shows them which areas of the site they should crawl and which areas they shouldn’t crawl.
  • Meta robots tags are code snippets in the head sections of individual webpages. They give page-specific instructions to search engines about whether to index pages in search results. They also control whether to follow the links on each page.
  • X-Robots-Tag headers are directives used mainly for non-HTML files like PDFs and images. They go in the file’s HTTP response header instead of the page code.

Want to keep something out of search results completely? Use a noindex meta tag on a crawlable page or password-protect the page instead.
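
For reference, here is roughly what those page-level directives look like. The meta tag goes in an individual page’s head section, while the X-Robots-Tag is sent as an HTTP response header, which is how you’d apply it to a PDF or image (exactly how that header gets added depends on your server configuration).

A meta robots tag that keeps a page out of the index but lets crawlers follow its links:

<meta name="robots" content="noindex, follow">

An X-Robots-Tag header that does the same for a non-HTML file:

X-Robots-Tag: noindex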

Why Should You Care About Robots.txt?

A robots.txt file helps control how bots interact with your site. SEO professionals often use it to manage crawl load and improve efficiency by blocking unimportant or duplicate pages. It can also be used to deter scraping and prevent content from being used to train AI models. Here’s a breakdown of why robots.txt files matter specifically for SEO:

It Makes Your Crawl Budget Work Harder

A robots.txt file helps search engines focus their crawl budget on your most valuable pages. Blocking low-value pages like cart, login, or filter URLs lets bots prioritize the content that actually drives traffic and rankings, which matters most on large sites with thousands of URLs. For example, blocking “/cart/” or “/login/” helps bots spend their crawl budget on your blog posts and product pages instead.
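
As a minimal sketch, the rules behind that example could look like this, assuming your cart and login pages actually live at these paths:

User-agent: *
Disallow: /cart/
Disallow: /login/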

It Helps Shape Your Search Presence

Robots.txt gives you some control over how your site appears in search by managing what gets crawled. While it doesn’t directly affect indexing, it works with other tools to guide search engines toward your important content.

  • A sitemap is a file that lists the important pages on your site to help search engines discover and crawl them more efficiently.
  • Canonical tags are HTML tags that tell search engines which version of a page is the preferred one to index when duplicate or similar content exists.
  • Noindex directives are signals, sent via a meta tag or HTTP header, that tell search engines not to include a specific page in the index used for search results.
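
For context, here is roughly how the first two of those signals look in practice (a noindex example appears earlier in this guide). The Sitemap line lives in robots.txt, while the canonical tag sits in a page’s head section; the URLs below are placeholders:

Sitemap: https://www.yourwebsite.com/sitemap.xml

<link rel="canonical" href="https://www.yourwebsite.com/preferred-page/">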

It Helps Stop Scrapers and Unwanted Bots

Robots.txt is the first line of defense against unwanted crawlers, such as scrapers or bots harvesting content for training AI models.

For example, many sites now disallow AI bots’ user-agents via robots.txt.

This sends a clear signal to bots that respect the protocol and helps reduce server load from non-essential crawlers.
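
As a rough illustration, a site that wants to opt out of AI training crawlers might add groups like these. GPTBot is OpenAI’s crawler and CCBot is Common Crawl’s; check each bot’s documentation for its current user-agent name before relying on it:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /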

We partnered with SEO Consultant Bill Widmer to run a quick experiment and demonstrate how robots.txt rules impact crawler behavior in real-world conditions.

Here’s what happened:

Bill had a rule in his robots.txt file blocking a number of crawlers.

He used Semrush’s Site Audit tool to crawl the entire site, setting the crawl limit high enough to catch all live pages.

But his website wasn’t crawled due to the robots.txt directives. After adjusting the robots.txt file, he ran the crawl again. This time, his website was successfully crawled and included in the report.

Setting Up a Robots.txt File

How to Set Up a Robots.txt File

A robots.txt file is easy to create—decide what to block, write your rules in a text file, and upload it to your site’s root directory.

Just follow these steps:

Step 1: Choose What to Control

Identify which parts of your site should or shouldn’t be crawled. 

  • Consider blocking login and user account pages like /login/ that don’t offer public value and can waste crawl budget.
  • Cart and checkout pages like /cart/ that you don’t want in search results should also be blocked.
  • Thank-you pages or form submission confirmation screens like /thank-you/ that aren’t useful to searchers are good candidates too.

If you’re unsure, it’s best to err on the side of allowing rather than disallowing.

Incorrect disallow rules can cause search engines to miss important content or fail to render your pages correctly.
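
Putting those candidates together, a starting point might look something like this, assuming your site actually uses these paths:

User-agent: *
Disallow: /login/
Disallow: /cart/
Disallow: /thank-you/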

Step 2: Target Specific Bots if Needed

You can write rules for all bots using User-agent: *, or target specific crawlers like Googlebot (User-agent: Googlebot) or Bingbot (User-agent: Bingbot), depending on your needs.

Here are two situations when this makes sense:

  • Controlling aggressive or less important bots becomes necessary when some bots crawl frequently and can put an unnecessary load on your server. You might want to limit or block these types of bots.
  • Blocking AI crawlers used for training generative models helps if you don’t want your content included in the training data for tools like ChatGPT or other LLMs. You can block their crawlers like GPTBot in your robots.txt file.
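
For example, a file might pair a general group with a stricter group for one specific bot, along the lines of this sketch (the paths are placeholders):

# Default rules for every other crawler
User-agent: *
Disallow: /temp/

# Stricter rules for one specific bot
User-agent: GPTBot
Disallow: /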

Step 3: Build Your Robots.txt File and Add Rules

Use a simple text editor like Notepad on Windows or TextEdit on Mac to create your file and save it as “robots.txt.”

In this file, you’ll add your directives—the syntax that tells search engine crawlers which parts of your site they should and shouldn’t access.

A robots.txt file contains one or more groups of directives, and each group includes multiple lines of instructions.

Each group starts with a user-agent and specifies:

  • which user-agent the group applies to
  • which directories or files the user-agent should access
  • which directories or files the user-agent shouldn’t access.

You can also include a sitemap to tell search engines which pages and files are most important. Just don’t forget to submit your sitemap directly in Google Search Console.

Imagine you don’t want Google to crawl your “/clients/” directory because it’s primarily for internal use and doesn’t provide value for searchers.

The first group in your file would look like this block:

User-agent: Googlebot
Disallow: /clients/

You can add more instructions for Google after that, like this:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

Then press enter twice to start a new group of directives.

For example, say you want to prevent all search engines from accessing your “/archive/” and “/support/” directories.

Here’s a block preventing access to those directories:

User-agent: *
Disallow: /archive/
Disallow: /support/

Once you’re finished, add your sitemap:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/

Sitemap: https://www.yourwebsite.com/sitemap.xml

Feeling unsure?

Use a free robots.txt generator to help you generate the text for your robots.txt file. Then, copy and paste the output to a text editor.

Step 4: Upload the File to Your Site’s Main Directory

Search engines will only read your robots.txt file if it’s placed in the root directory of your domain.

This means the file must be at the top level of your site—not in a subfolder.
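
In other words, the file should resolve at the root URL, not under a subdirectory:

Correct: https://www.yourwebsite.com/robots.txt
Incorrect: https://www.yourwebsite.com/blog/robots.txt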

To upload the file correctly, use your web hosting file manager, FTP client, or CMS settings to place the file in the root directory, usually called “public_html” or “/www”.

If you’re using WordPress, a plugin like Yoast SEO or Rank Math can create and edit the robots.txt file in your site’s root directory for you.

Just open the plugin’s settings, navigate to its robots.txt or file editor option, and add your rules there.

Step 5: Check That Your File Uploaded Correctly

Use Google’s robots.txt report in Search Console to check for errors and confirm your rules work as intended.

In Search Console, navigate to the Settings page and click Open Report next to “robots.txt.”

It should have a green checkmark next to “Fetched” under the status column.

But if there was an error, you’ll see a red exclamation mark next to “Not Fetched.” In that case, check Google’s guidelines to determine what the error was and how to fix it.

It can be difficult to understand Google’s solutions to errors if you’re new to robots.txt.

If you want an easier way, use Semrush’s Site Audit tool to check your robots.txt file for technical issues and get detailed instructions on how to fix them.

Set up a project and run an audit.

When the tool is ready, navigate to the Issues tab and search for “robots.txt.”

Click Robots.txt file has format errors if it appears.

View the list of invalid lines to determine exactly what needs to be addressed.

Check your robots.txt file regularly. Even small errors can affect your site’s indexability.

Smart Ways to Use Robots.txt

Follow these best practices to ensure your robots.txt file helps your SEO and site performance:

Use Wildcards with Care

Wildcards like * and $ let you match broad patterns in URLs, and using them precisely is important to avoid accidentally blocking important pages.

The asterisk matches any sequence of characters including slashes. It’s used to block multiple URLs that share a pattern. For example, “Disallow: /search*” blocks “/search,” “/search?q=shoes,” and “/search/results/page/2.”

The dollar sign matches the end of a URL. It’s used when you want to block only URLs that end in a specific way. For example, “Disallow: /thank-you$” blocks “/thank-you” but not “/thank-you/page.”
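
Here’s how those two patterns might look together in a file (the paths are just examples):

User-agent: *
# Block internal search URLs and anything appended to them
Disallow: /search*
# Block only the URL that ends exactly in /thank-you
Disallow: /thank-you$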

Here are some examples of how not to use them:

Disallow: /*.php blocks every URL containing “.php,” which could include important pages like “/product.php” or “/blog-post.php.”

Disallow: /*.html$ blocks all pages ending in “.html,” which might include all your main site content.

If you’re unsure, it’s wise to consult a professional before using wildcards in your robots.txt file.

Don’t Block Important Website Resources

Don’t block CSS, JavaScript, or API endpoints required to render your site. Google needs them to understand layout, functionality, and mobile-readiness.

So, let crawlers access:

  • /assets/
  • /js/
  • /css/
  • /api/

Blocking these could cause Google to see a broken version of your pages and hurt your rankings.
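
One way to stay safe is to keep any Disallow rules narrow and, optionally, add explicit Allow lines for your resource folders. The Allow lines in this sketch are technically redundant when nothing blocks those paths, but they document your intent and guard against a broader Disallow being added later (the paths are placeholders):

User-agent: *
Disallow: /internal/
Allow: /assets/
Allow: /js/
Allow: /css/
Allow: /api/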

Always test your site in Google’s URL Inspection Tool to ensure blocked assets aren’t interfering with rendering.

Enter a URL you want to test.

You should see a green checkmark if it’s done properly. If you see “Blocked by robots.txt,” the page or an asset it depends on is blocked from crawling.

Don’t Use Robots.txt to Hide Pages from Search Results

If a URL is linked from elsewhere, Google can still index it and show it in search results—even if you’ve disallowed it in robots.txt.

That means you shouldn’t rely on robots.txt to hide:

  • Sensitive or private data, like admin dashboards or internal reports
  • Duplicate content, like filtered or paginated URLs
  • Staging or test sites
  • Any other page you don’t want appearing in Google, since these need stronger protection than robots.txt provides

Add Comments to Your File

Use comments to document your rules, so others or future you can understand your intentions.

Start a comment by adding a “#” symbol.

Anything after it on the same line will be ignored by crawlers.

For example:

# Block internal search results but allow all other pages for all crawlers
User-agent: *
Disallow: /search/
Allow: /

Comments are especially important for growing teams and complex sites.

Robots.txt and AI: Should You Block Language Models?

AI tools like ChatGPT and those built on other large language models are trained on web content—and your robots.txt file is the primary way for you to manage how they crawl your site.

To allow or block AI crawlers used for training models, add user-agent directives to your robots.txt file just like you would for Googlebot.

For example, OpenAI’s GPTBot is used to collect publicly available data that can be used for training large language models. To block it, you can include a line like “User-agent: GPTBot” followed by your chosen disallow rule.
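
Based on OpenAI’s published guidance, blocking GPTBot from your entire site looks like this; to block only part of the site, replace “/” with a specific directory path:

User-agent: GPTBot
Disallow: /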

When should you allow or block AI crawlers?

You should allow AI crawlers if:

  • You want to increase exposure and don’t mind your content being used in generative tools.
  • You believe the benefits of increased visibility and brand awareness outweigh control over how your content is used to train generative AI tools.

You should consider blocking AI crawlers if:

  • You’re concerned about your intellectual property.
  • You want to maintain full control over how your content is used.

Note that a new file called llms.txt is being proposed to help AI models understand what your site is about and what content is most important. But it’s not exactly a translation of robots.txt for AI models.

We wanted to see how many .com websites have an llms.txt file to gauge how commonly this new file type is used.

This rough experiment shows that only around 2,830 .com websites indexed in Google have an llms.txt file.

Check Your Website for Robots.txt and Other Technical Problems

A well-configured robots.txt file is a powerful tool for guiding search engines, protecting your resources, and keeping your site efficient.

But it’s important to ensure your file is free from technical errors.

Use Site Audit tools to automatically check for robots.txt errors, crawl issues, broken links, and other technical SEO issues.

Conclusion

Your robots.txt file is more than just a technical necessity—it’s a strategic tool for SEO success. By properly configuring this simple text file, you can guide search engines to your most valuable content, protect your server resources, and maintain control over how bots interact with your site. Remember that robots.txt works best when combined with other SEO tools like sitemaps, canonical tags, and noindex directives. Start small by blocking obvious low-value pages like login and cart URLs, then expand your strategy as you learn what works for your site. Don’t forget to test your file regularly using Google Search Console and site audit tools. A well-optimized robots.txt file won’t directly boost your rankings, but it will help search engines crawl your site more efficiently—and that efficiency can lead to better indexing of your important pages.

Frequently Asked Questions

Does robots.txt directly improve my search rankings? 

No, robots.txt doesn’t directly boost your search rankings. However, it helps search engines crawl your site more efficiently by directing them to your most valuable content. Think of robots.txt as a traffic director rather than a ranking booster.

Can I use robots.txt to completely hide pages from Google? 

No, robots.txt cannot guarantee pages will stay out of search results. If other websites link to a blocked page, Google might still index it. To truly hide pages, use noindex meta tags or password protection instead.

Should I block AI bots like ChatGPT from crawling my content? 

This depends on your goals and concerns about intellectual property. Allow AI crawlers if you want increased exposure and don’t mind your content being used in AI training. Block them if you’re concerned about intellectual property.

What happens if I make a mistake in my robots.txt file? 

Mistakes can prevent search engines from crawling important pages, which could hurt your SEO. Always test your file using Google Search Console’s robots.txt report. When in doubt, it’s better to allow crawling than accidentally block important content.

Where exactly should I place my robots.txt file on my website? 

Your robots.txt file must be placed in your site’s root directory for search engines to find it. This means it should be accessible at yourwebsite.com/robots.txt, not in any subfolder.