The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

Davriellelouna@lemmy.world · edit-2 1 day ago

The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

Electricd@lemmybefree.net · 3 hours ago

I don’t like cloudflare but it’s nice that they allow people to stop AI scrapping if they want to

Electricd@lemmybefree.net · edit-2 3 hours ago

They do have a point though. It would be great to let per-prompt searches go through, but not mass scrapping

I believe a lot of websites don’t want both though

threeganzi@sh.itjust.works · 2 minutes ago

Does it not need to be scraped to be indexed, assuming it’s semi-typical RAG stuff?

Amberskin@europe.pub · 6 hours ago

Uh, are they admitting they are trying to circumvent technological protections setup to restrict access to a system?

Isn’t that a literal computer crime?

dinckel@lemmy.world · 2 hours ago

No-no, see. When an AI-first company does it, it’s actually called courageous innovation. Crimes are for poor people

utopiah@lemmy.world · 3 hours ago

puts on evil hat CloudFlare should DRM their protection then DMCA Perplexity and other US based “AI” companies to oblivion. Side effect, might break the Internet.

Deflated0ne@lemmy.world · 3 hours ago

Worth it.

Jimmycrackcrack@lemmy.ml · 2 hours ago

Gee that’s a real removed it ain’t it perplexity?

BadlyTimedLuck@lemmy.world · 3 hours ago

deleted by creator

kreskin@lemmy.world · edit-2 9 hours ago

they cant get their ai to check a box that says “I am not a robot”? I’d think thatd be a first year comp sci student level task. And robots.txt files were basically always voluntary compliance anyway.

Dr. Moose@lemmy.world · 4 hours ago

Cloudflare actually fully fingerprints your browser and even sells that data. Thats your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities etc. It’s all tracked before the box even shows up, in fact the box is there to give the runtime more time to fingerprint you.

5gruel@lemmy.world · 4 hours ago

Recaptcha v2 does way more than check if the box was checked.

https://stackoverflow.com/a/27299487

TheGrandNagus@lemmy.world · 11 hours ago

Can’t believe I’ve lived to see Cloudflare be the good guys

Dr. Moose@lemmy.world · 4 hours ago

It’s insane that anyone would side with Cloudflare here. To this day I cant visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely and has been for months now.

Cloudflare is the biggest cancer on the web, fucking burn it.

Dremor@lemmy.world · 3 hours ago

Linux and Firefox here. No problem at all with Cloudflare, despite having more or less as much privacy preserving add-on as possible. I even spoof my user agent to the latest Firefox ESR on Linux.

Something’s muat be wrong with your setup.

Dr. Moose@lemmy.world · 3 hours ago

Thats not how it works. Cf uses thousands of variables to estimate a trust score and block people so just because it works for you doesn’t mean it works.

Dremor@lemmy.world · edit-2 2 hours ago

Same goes the other way. It’s not because it doesn’t work for you that it should go away.

That technology has its uses, and Cloudflare is probably aware that there are still some false positive, and probably is working on it as we write.

The decision is for the website owner to take, taking into consideration the advantages of filtering out a majority of bots and the disadvantages of loosing some legitimate traffic because of false positives. If you get Cloudflare challenge, chances are that he chosed that the former vastly outclass the later.

Now there are some self-hosted alternatives, like Anubis, but business clients prefer SaaS like Cloudflare to having to maintain their own software. Once again it is their choices and liberty to do so.

Dr. Moose@lemmy.world · 1 hour ago

lmao imagine shilling for corporate Cloudflare like this. Also false positive vs false negative are fundamentally not equal.

Cloudflare is probably aware that there are still some false positive, and probably is working on it as we write.

The main issue with Cloudflare is that it’s mostly bullshit. It does not report any stats to the admins on how many users were rejected or any false positive rates and happily put’s everyone under “evil bot” umbrella. So people from low trust score environments like Linux or IPs from poorer countries are under significant disadvantage and left without a voice.

I’m literally a security dev working with Cloudflare anti-bot myself (not by choice). It’s a useful tool for corporate but a really fucking bad one for the health of the web, much worse than any LLM agent or crawler, period.

dodos@lemmy.world · 4 hours ago

I’m on Linux with Firefox and have never had that issue before (particularly nexusmods which I use regularly). Something else is probably wrong with your setup.

jaemo@sh.itjust.works · 3 hours ago

Thirded. All three (Linux, FF, nexus)

ZERO ISSUES.

Yeller_king@reddthat.com · 4 hours ago

In my case, it’s usually the VPN.

Dr. Moose@lemmy.world · 3 hours ago

“Wrong with my setup” - thats not how internet works.

I’m based in south east asia and often work on the road so IP rating probably is the final crutch in my fingerprint score.

Either way this should be no way acceptible.

baronofclubs@lemmy.world · 4 hours ago

omg ur a hacker

Did you mean Edge on Windows? 'Cause if so, welcome in!

Wispy2891@lemmy.world · edit-2 12 hours ago

Here comes the ridiculous offer to buy Google chrome with money they don’t have: easy delicious scraping directly from the user source

kittenzrulz123@lemmy.blahaj.zone · 15 hours ago

tibi@lemmy.world · 21 hours ago

You could say they are… Perplexed.

Kissaki@feddit.org · edit-2 23 hours ago

Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.

So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?

ubergeek@lemmy.today · 4 hours ago

And I’m assuming if the robots.txt state their UserAgent isn’t allowed to crawl, it obeys it, right? :P

Kissaki@feddit.org · 2 hours ago

No, as per the article, their argumentation is that they are not web crawlers generating an index, they are user-action-triggered agents working live for the user.

ubergeek@lemmy.today · 1 hour ago

Except, it’s not a live user hitting 10 sights all the same time, trying to crawl the entire site… Live users cannot do that.

That said, if my robots.txt forbids them from hitting my site, as a proxy, they obey that, right?

Dr. Moose@lemmy.world · 4 hours ago

Its not up to the hoster to decide whom to serve content. Web is intended to be user agent agnostic.

lime!@feddit.nu · 20 hours ago

yeah it’s almost like there as already a system for this in place

WolfLink@sh.itjust.works · 24 hours ago

This is a nice CloudFlare ad

pyre@lemmy.world · 21 hours ago

yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

oppy1984@lemdro.id · 6 hours ago

I’m out of the loop, what’s wrong with cloud flare?

ubergeek@lemmy.today · 4 hours ago

Centralization, mostly, but also their hands-off approach to most fascist content.

int32@lemmy.dbzer0.com · 20 hours ago

DEATH TO CLOUDFLARE!

pressanykeynow@lemmy.world · 12 hours ago

That would be terrible for a lot of people as they are the only company providing such services that doesn’t charge for traffic.

int32@lemmy.dbzer0.com · edit-2 10 hours ago

They can use web.archive.org as a cdn(I do that to cloudflare websites). But honestly, cloudflare or not, the internet is broken.

pressanykeynow@lemmy.world · 7 hours ago

Can you explain please? How can I use archive.org as a cdn for my website?

NotASharkInAManSuit@lemmy.world · 23 hours ago

That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

Leon@pawb.social · 23 hours ago

I’m still holding out for Stephen Hawking to mail out Demon Summoning programs.

poopkins@lemmy.world · edit-2 3 hours ago

I’ve developed my own agent for assisting me with researching a topic I’m passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I’m a human using a web browser. (For my network requests, I’ve defined my own user agent.)

So I use that as a signal that the website doesn’t want automated tools scraping their data. That’s fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.

I completely understand where Perplexity is coming from, but at scale, implementations like ~~this~~ Perplexity’s are awful for the web.

(Edited for clarity)

IphtashuFitz@lemmy.world · 6 hours ago

I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.

We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.

poopkins@lemmy.world · 3 hours ago

What I meant with “things like this are awful for the web,” I meant that automation through AI is awful for the web. It takes away from the original content creators without any attribution and hits their bottom line.

My story was supposed to be one about responsible AI, but somehow I screwed that up in my summary.

The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

Perplexity Says Cloudflare Is Blocking Legitimate AI Assistants