AI Crawler

GPTBot

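OpenAI's web crawler for gathering publicly available content to train its models. It can be allowed or disallowed in robots.txt. Blocking it opts your content out of future training use; OpenAI documents separate user agents (OAI-SearchBot and ChatGPT-User) for search and user-initiated browsing, so a GPTBot rule on its own does not decide ChatGPT citation eligibility.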

What it is

GPTBot is OpenAI's web crawler that gathers publicly available content for training foundation models. Search indexing and user-initiated browsing in ChatGPT are handled by separate OpenAI agents (OAI-SearchBot and ChatGPT-User). GPTBot identifies itself with a user-agent string containing the GPTBot token and crawls from IP ranges that OpenAI publishes.

Why it matters

Allowing GPTBot widens the range of your content that OpenAI can ingest, which may influence how well your domain is represented in model knowledge and, indirectly, in citation behaviour. Blocking it keeps your content out of future training corpora (it does not withdraw content already collected) and may reduce long-term familiarity with your brand in generated answers.

How it works

GPTBot honours robots.txt, so access is configured by targeting the GPTBot user-agent with Allow or Disallow rules. You can permit it site-wide, restrict it to specific paths, or block it entirely while leaving other crawlers unaffected.
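
The path-level behaviour described above can be checked locally before deploying a policy. The following is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt contents, the example.com URLs, and the SomeOtherBot name are assumptions made for illustration, and the parser's matching rules may differ in detail from those GPTBot itself applies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: GPTBot may read /articles/ but not /drafts/;
# all other crawlers are unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /articles/
Disallow: /drafts/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# GPTBot is allowed on published articles and blocked on drafts.
gptbot_articles = parser.can_fetch("GPTBot", "https://example.com/articles/intro")
gptbot_drafts = parser.can_fetch("GPTBot", "https://example.com/drafts/wip")

# A different crawler falls through to the wildcard group and is unaffected.
other_drafts = parser.can_fetch("SomeOtherBot", "https://example.com/drafts/wip")

print(gptbot_articles, gptbot_drafts, other_drafts)
```

Running a check like this against a proposed robots.txt is a cheap way to confirm that a GPTBot rule does not accidentally restrict other crawlers.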

When it applies

Allow GPTBot when you want your content to inform model knowledge and citation; block it when content is sensitive or paywalled, or when you wish to opt out of training use.

Examples

robots.txt: User-agent: GPTBot followed by Disallow: / blocks training access entirely
robots.txt: User-agent: GPTBot followed by Allow: /articles/ and Disallow: /drafts/ exposes only published work
Server log: repeated GET requests whose user-agent contains GPTBot, arriving from OpenAI's published IP ranges

How it is measured

Request volume per day from the GPTBot user-agent in server logs
Distinct URLs and path depth crawled by GPTBot
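Ratio of allowed to blocked responses (200 versus 403) served to GPTBot, where blocking is also enforced at the server or firewall; robots.txt rules alone do not change response codes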
Source IP verification against OpenAI's published crawler ranges
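
The measurements listed above can be approximated from a standard access log. The sketch below assumes the common "combined" log format and uses only the Python standard library. The summarize_gptbot function, the sample log lines, and the simplified user-agent strings are hypothetical. The CIDR block is a placeholder from the RFC 5737 documentation ranges, not an OpenAI address range, and should be replaced with the ranges OpenAI currently publishes.

```python
import ipaddress
import re
from collections import Counter

# Placeholder range (RFC 5737 documentation block), NOT OpenAI's addresses.
ASSUMED_GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Assumes the "combined" access log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_gptbot(lines):
    """Count GPTBot requests per day, status codes, distinct paths, and
    source IPs that fall outside the assumed ranges."""
    per_day, statuses = Counter(), Counter()
    paths, unverified_ips = set(), set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or "gptbot" not in m["agent"].lower():
            continue
        per_day[m["day"]] += 1
        statuses[m["status"]] += 1
        paths.add(m["path"])
        try:
            ip = ipaddress.ip_address(m["ip"])
            verified = any(ip in net for net in ASSUMED_GPTBOT_RANGES)
        except ValueError:  # logged host is not a literal IP address
            verified = False
        if not verified:
            unverified_ips.add(m["ip"])
    return {
        "per_day": dict(per_day),
        "statuses": dict(statuses),
        "distinct_paths": len(paths),
        "unverified_ips": unverified_ips,
    }


# Synthetic log lines for illustration only.
SAMPLE = [
    '192.0.2.10 - - [10/Oct/2024:13:55:36 +0000] "GET /articles/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [10/Oct/2024:13:56:01 +0000] "GET /drafts/b HTTP/1.1" 403 128 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [11/Oct/2024:09:00:00 +0000] "GET /articles/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [11/Oct/2024:09:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

summary = summarize_gptbot(SAMPLE)
print(summary)
```

Requests that claim the GPTBot user-agent but arrive from outside the published ranges are worth reviewing separately, since a user-agent string is easy to spoof.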
