Leading Connectivity Cloud has concerns about Perplexity’s Stealth Crawling


Hispanic Engineer & Information Technology
 
POSTED ON Aug 08, 2025
 

Big tech remains in the spotlight: Nvidia is now valued at $4 trillion, and Meta is offering its leading computer scientist a salary akin to that of a top athlete, $10 million annually.

Additionally, the launch of a new version of ChatGPT has sparked discussion, with some critics suggesting it’s more about cutting costs for OpenAI than advancing innovation.

In related news, Cloudflare has raised concerns about the stealthy practices of Perplexity, an AI-powered answer engine. Cloudflare claims that Perplexity is using undeclared crawlers to circumvent website no-crawl directives.

According to a blog post from Cloudflare, while Perplexity generally identifies itself with a declared user agent, it obscures its identity when faced with network blocks, attempting to bypass website preferences.

Furthermore, Cloudflare observed that Perplexity is altering its user agent and changing its source Autonomous System Numbers (ASNs) to hide its crawling activities.
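
As a toy illustration only (Cloudflare's actual detection relies on far richer network and machine-learning signals, which its blog does not fully disclose), a site operator might surface this kind of rotation by flagging source networks that present many distinct user agents in a short window:

    from collections import defaultdict

    # Hypothetical access-log sample: (source ASN, User-Agent string).
    requests = [
        ("AS12345", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."),
        ("AS12345", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."),
        ("AS12345", "Mozilla/5.0 (X11; Linux x86_64) ..."),
        ("AS67890", "PerplexityBot/1.0 (+https://perplexity.ai/perplexitybot)"),
    ]

    # Group the user agents seen from each source network.
    agents_per_asn = defaultdict(set)
    for asn, user_agent in requests:
        agents_per_asn[asn].add(user_agent)

    # Many distinct user agents from one network is one crude rotation signal.
    for asn, agents in agents_per_asn.items():
        if len(agents) >= 3:
            print(f"{asn}: {len(agents)} distinct user agents - possible rotation")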

They noted that Perplexity often ignores or fails to retrieve robots.txt files, which outline a website’s crawling permissions.

In a strong statement, Cloudflare emphasized that the internet, as we have known it for the past three decades, is rapidly changing. However, one constant remains: the internet is built on trust.

Those preferences are clear: crawlers should be transparent, serve a defined purpose, perform specific activities, and, most importantly, adhere to website directives and preferences.

Because Perplexity's observed behavior does not align with those preferences, Cloudflare has de-listed Perplexity as a verified bot.

Additionally, they have implemented heuristics in their managed rules to block stealth crawling.

This issue began when Cloudflare received complaints from customers who had both disallowed Perplexity's crawling in their robots.txt files and created Web Application Firewall (WAF) rules to block Perplexity's declared crawlers, PerplexityBot and Perplexity-User.

These customers indicated that, despite having successfully blocked these bots, Perplexity was still able to access their content.
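
For reference, a robots.txt that disallows Perplexity's declared crawlers, as those customers had done, takes this shape (the user-agent tokens are the ones named in Cloudflare's post):

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Perplexity-User
    Disallow: /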

Cloudflare confirmed that Perplexity’s crawlers were indeed being blocked on the specific pages in question. To investigate further, they performed targeted tests, including creating multiple new domains such as testexample.com and secretexample.com.

These domains were newly purchased, not indexed by any search engine, and not publicly accessible in any discoverable manner. Cloudflare set up a robots.txt file with directives intended to prevent any respectful bots from accessing their website.
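
A blanket robots.txt of the kind Cloudflare describes disallows every path for every user agent, and a compliant crawler would check it before fetching anything. A minimal sketch of that check in Python, using a hypothetical domain in the spirit of Cloudflare's test:

    from urllib.robotparser import RobotFileParser

    # The test sites served a blanket disallow: every path, every user agent.
    robots_txt = """User-agent: *
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A compliant crawler consults the parsed rules before each fetch.
    allowed = parser.can_fetch("PerplexityBot", "https://testexample.com/secret-page")
    print(allowed)  # False: a well-behaved bot would stop here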

They then queried Perplexity AI with questions regarding these newly created domains. Surprisingly, Perplexity still returned detailed information about the exact content hosted on these restricted domains, despite the measures taken to block its crawlers.

In contrast to this behavior, the internet has established clear preferences for how ethical crawlers should operate. Good crawlers, acting in good faith, should:

  • Be transparent: Identify themselves honestly with a unique user agent, a declared list of IP ranges, or Web Bot Auth integration, and provide contact information for any issues.
  • Be well-behaved netizens: Avoid flooding sites with excessive traffic, scraping sensitive data, or using stealth tactics to evade detection.
  • Serve a clear purpose: Clearly define and make accessible the purpose of the bot, whether it be powering a voice assistant, checking product prices, or enhancing website accessibility.
  • Use separate bots for separate activities: Perform distinct activities from distinct bots, which allows site owners to decide which activities to permit without making an all-or-nothing choice.
  • Follow the rules: Respect signals from websites like robots.txt, adhere to rate limits, and do not bypass security protections (a minimal sketch of such a crawler follows this list).
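
Putting those norms together, a crawler that behaves this way might look like the following sketch; the bot name, contact URL, and rate limit are illustrative values, not any real bot's:

    import time
    import urllib.request
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    # Illustrative identity: honest, unique, with a contact URL for site owners.
    USER_AGENT = "ExampleAnswerBot/1.0 (+https://example.org/bot-info)"

    def polite_fetch(url, delay_seconds=2.0):
        """Fetch url only if robots.txt permits it, identifying honestly."""
        parts = urlparse(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

        # Follow the rules: consult robots.txt before touching any page.
        parser = RobotFileParser()
        parser.set_url(robots_url)
        parser.read()
        if not parser.can_fetch(USER_AGENT, url):
            return None  # the site said no; a good crawler stops here

        # Be a well-behaved netizen: self-imposed delay instead of flooding.
        time.sleep(delay_seconds)

        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read()
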
According to the Cloudflare blog, OpenAI is an example of a leading AI company that follows these best practices. They provide clear information about their crawlers and detailed explanations of each crawler's purpose.

OpenAI respects robots.txt files and does not attempt to circumvent any directives or network-level blocks. Furthermore, the ChatGPT Agent signs HTTP requests using the newly proposed open standard, Web Bot Auth.
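
Web Bot Auth builds on the HTTP Message Signatures standard (RFC 9421): a bot signs parts of each request with a key it publishes, so sites can verify who is calling. The Python sketch below shows the general shape only; the covered components, key ID, and header construction are simplified illustrations, not a conformant implementation of the draft:

    import base64
    import time
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # A real bot would use a long-lived key whose public half is published
    # where site owners can verify it; here we simply generate one.
    private_key = Ed25519PrivateKey.generate()

    authority = "example.com"
    created = int(time.time())
    params = f'("@authority");created={created};keyid="example-bot-key";alg="ed25519"'

    # The signature base binds the covered request components to the parameters.
    signature_base = f'"@authority": {authority}\n"@signature-params": {params}'
    signature = private_key.sign(signature_base.encode())

    headers = {
        "User-Agent": "ExampleAnswerBot/1.0 (+https://example.org/bot-info)",
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{base64.b64encode(signature).decode()}:",
    }
    print(headers)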

Finally, Cloudflare has added signature matches for this stealth crawler to its managed rules that block AI crawling activity. These rules are available to all customers, including those on the free plan, allowing them to disallow AI training entirely through the managed robots.txt feature or the managed rule that blocks AI crawlers.

Consequently, every Cloudflare customer can selectively decide which declared AI crawlers can access their content based on their business objectives.
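
The managed robots.txt option works like any robots.txt, with Cloudflare serving the directives on the customer's behalf. Directives aimed at AI training crawlers take this general shape (GPTBot and CCBot are real, documented training crawlers, though the exact list Cloudflare serves may differ):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /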
