Leading Connectivity Cloud has concerns about Perplexity’s Stealth Crawling


Hispanic Engineer & Information Technology
 
POSTED ON Aug 08, 2025
 

Big tech remains in the spotlight: Nvidia is now valued at $4 trillion, and Meta is offering its leading computer scientist a salary akin to that of a top athlete, $10 million annually.

Additionally, the launch of a new version of ChatGPT has sparked discussion, with some critics suggesting it’s more about cutting costs for OpenAI than advancing innovation.

In related news, Cloudflare has raised concerns about the stealthy practices of Perplexity, an AI-powered answer engine. Cloudflare claims that Perplexity is using undeclared crawlers to circumvent website no-crawl directives.

According to a blog post from Cloudflare, while Perplexity generally identifies itself with a declared user agent, it obscures its identity when faced with network blocks, attempting to bypass website preferences.

Furthermore, Cloudflare observed that Perplexity is altering its user agent and changing its source Autonomous System Numbers (ASNs) to hide its crawling activities.
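
As a toy illustration only (Cloudflare's actual detection relies on far richer network and machine-learning signals, which its blog does not fully disclose), a site operator might surface this kind of rotation by flagging source networks that present many distinct user agents in a short window:

    from collections import defaultdict

    # Hypothetical access-log sample: (source ASN, User-Agent string).
    requests = [
        ("AS12345", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."),
        ("AS12345", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."),
        ("AS12345", "Mozilla/5.0 (X11; Linux x86_64) ..."),
        ("AS67890", "PerplexityBot/1.0 (+https://perplexity.ai/perplexitybot)"),
    ]

    # Group the user agents seen from each source network.
    agents_per_asn = defaultdict(set)
    for asn, user_agent in requests:
        agents_per_asn[asn].add(user_agent)

    # Many distinct user agents from one network is one crude rotation signal.
    for asn, agents in agents_per_asn.items():
        if len(agents) >= 3:
            print(f"{asn}: {len(agents)} distinct user agents - possible rotation")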

They noted that Perplexity often ignores or fails to retrieve robots.txt files, which outline a website’s crawling permissions.

In a strong statement, Cloudflare emphasized that the internet, as we have known it for the past three decades, is rapidly changing. However, one constant remains: the internet is built on trust.

Those preferences are clear: crawlers should be transparent, serve a defined purpose, perform specific activities, and, most importantly, adhere to website directives and preferences.

Because Perplexity's observed behavior does not align with those preferences, Cloudflare has de-listed Perplexity as a verified bot.

Additionally, they have implemented heuristics in their managed rules to block stealth crawling.

This issue began when Cloudflare received complaints from customers who had both disallowed Perplexity's crawling in their robots.txt files and created Web Application Firewall (WAF) rules to block Perplexity's declared crawlers, PerplexityBot and Perplexity-User.

These customers indicated that, despite having successfully blocked these bots, Perplexity was still able to access their content.
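
For reference, a robots.txt that disallows Perplexity's declared crawlers, as those customers had done, takes this shape (the user-agent tokens are the ones named in Cloudflare's post):

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Perplexity-User
    Disallow: /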

Cloudflare confirmed that Perplexity’s crawlers were indeed being blocked on the specific pages in question. To investigate further, they performed targeted tests, including creating multiple new domains such as testexample.com and secretexample.com.

These domains were newly purchased, not indexed by any search engine, and not publicly accessible in any discoverable manner. Cloudflare set up a robots.txt file with directives intended to prevent any respectful bots from accessing their website.
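
A blanket robots.txt of the kind Cloudflare describes disallows every path for every user agent, and a compliant crawler would check it before fetching anything. A minimal sketch of that check in Python, using a hypothetical domain in the spirit of Cloudflare's test:

    from urllib.robotparser import RobotFileParser

    # The test sites served a blanket disallow: every path, every user agent.
    robots_txt = """User-agent: *
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A compliant crawler consults the parsed rules before each fetch.
    allowed = parser.can_fetch("PerplexityBot", "https://testexample.com/secret-page")
    print(allowed)  # False: a well-behaved bot would stop here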

They then queried Perplexity AI with questions regarding these newly created domains. Surprisingly, Perplexity still returned detailed information about the exact content hosted on these restricted domains, despite the measures taken to block its crawlers.

In contrast to this behavior, the internet has established clear preferences for how ethical crawlers should operate. Good crawlers, acting in good faith, should:

  • Be transparent: Identify themselves honestly with a unique user agent, a declared list of IP ranges, or Web Bot Auth integration, and provide contact information for any issues.
  • Be well-behaved netizens: Avoid flooding sites with excessive traffic, scraping sensitive data, or using stealth tactics to evade detection.
  • Serve a clear purpose: Clearly define and make accessible the purpose of the bot, whether it be powering a voice assistant, checking product prices, or enhancing website accessibility.
  • Use separate bots for separate activities: Perform distinct activities from distinct bots, which allows site owners to decide which activities to permit without making an all-or-nothing choice.
  • Follow the rules: Respect signals from websites like robots.txt, adhere to rate limits, and do not bypass security protections (a minimal sketch of such a crawler follows this list).
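
Putting those norms together, a crawler that behaves this way might look like the following sketch; the bot name, contact URL, and rate limit are illustrative values, not any real bot's:

    import time
    import urllib.request
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    # Illustrative identity: honest, unique, with a contact URL for site owners.
    USER_AGENT = "ExampleAnswerBot/1.0 (+https://example.org/bot-info)"

    def polite_fetch(url, delay_seconds=2.0):
        """Fetch url only if robots.txt permits it, identifying honestly."""
        parts = urlparse(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

        # Follow the rules: consult robots.txt before touching any page.
        parser = RobotFileParser()
        parser.set_url(robots_url)
        parser.read()
        if not parser.can_fetch(USER_AGENT, url):
            return None  # the site said no; a good crawler stops here

        # Be a well-behaved netizen: self-imposed delay instead of flooding.
        time.sleep(delay_seconds)

        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read()
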
According to the Cloudflare blog, OpenAI is an example of a leading AI company that follows these best practices. They provide clear information about their crawlers and detailed explanations of each crawler's purpose.

OpenAI respects robots.txt files and does not attempt to circumvent any directives or network-level blocks. Furthermore, the ChatGPT Agent signs HTTP requests using the newly proposed open standard, Web Bot Auth.
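
Web Bot Auth builds on the HTTP Message Signatures standard (RFC 9421): a bot signs parts of each request with a key it publishes, so sites can verify who is calling. The Python sketch below shows the general shape only; the covered components, key ID, and header construction are simplified illustrations, not a conformant implementation of the draft:

    import base64
    import time
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # A real bot would use a long-lived key whose public half is published
    # where site owners can verify it; here we simply generate one.
    private_key = Ed25519PrivateKey.generate()

    authority = "example.com"
    created = int(time.time())
    params = f'("@authority");created={created};keyid="example-bot-key";alg="ed25519"'

    # The signature base binds the covered request components to the parameters.
    signature_base = f'"@authority": {authority}\n"@signature-params": {params}'
    signature = private_key.sign(signature_base.encode())

    headers = {
        "User-Agent": "ExampleAnswerBot/1.0 (+https://example.org/bot-info)",
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{base64.b64encode(signature).decode()}:",
    }
    print(headers)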

Finally, Cloudflare has added signature matches for this stealth crawler to its managed rules that block AI crawling activity. These rules are available to all customers, including those on the free plan, allowing them to disallow AI training entirely through the managed robots.txt feature or the managed rule that blocks AI crawlers.

Consequently, every Cloudflare customer can selectively decide which declared AI crawlers can access their content based on their business objectives.
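
The managed robots.txt option works like any robots.txt, with Cloudflare serving the directives on the customer's behalf. Directives aimed at AI training crawlers take this general shape (GPTBot and CCBot are real, documented training crawlers, though the exact list Cloudflare serves may differ):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /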
