• Ekybio@lemmy.world · 3 days ago

    Can someone with more knowledge shed a bit more light on this whole situation? I’m out of the loop on the technical details.

    • snooggums@lemmy.world · 3 days ago

      AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDoSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

      Cloudflare is just trying to keep the bots from overwhelming everything.
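
      Roughly, the difference looks like this. Here is a minimal sketch (my own illustration, with placeholder URLs) of the kind of politeness a well-behaved crawler applies and these bots skip:

          import time
          import urllib.request

          URLS = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

          for url in URLS:
              with urllib.request.urlopen(url) as resp:
                  body = resp.read()
              # A polite crawler spaces out its requests so the origin server
              # isn't flooded; the scrapers described above skip this delay and
              # hammer sites with floods of uncached requests instead.
              time.sleep(1.0)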

    • panda_abyss@lemmy.ca · 3 days ago

      Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service helps protect against DDoS attacks and malicious traffic.

      A few weeks ago Cloudflare announced they were going to block AI crawling (good, in my opinion). However, they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

      This is a response to that from Perplexity, which runs an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement, where Cloudflare accused them of “stealth scraping” and of ignoring robots.txt, among other things.
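
      For context, robots.txt is just a plain-text file a site publishes to tell crawlers what they may fetch, and compliance is entirely voluntary. A minimal example (the bot names here are illustrative):

          # https://example.com/robots.txt
          User-agent: PerplexityBot
          Disallow: /

          User-agent: *
          Allow: /

      Nothing enforces this file, which is why a “stealth” scraper can simply not read it and present a browser-like user agent instead.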

      • very_well_lost@lemmy.world · 3 days ago

        A few weeks ago Cloudflare announced they were going to block AI crawling (good, in my opinion). However, they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

        I think it’s also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).

        • Dogiedog64@lemmy.world · 3 days ago

          It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return. You think a normal company is bad, blowing a few thousand bucks a day on materials, equipment, and labor in order to make a few bucks in revenue (not profit)? AI companies have to blow HUNDREDS OF BILLIONS on massive data center complexes in order to train their bots, and then the energy and water costs of running them add a couple million more per day. ALL so they can lose hundreds of dollars on every prompt you can dream of.

          The ONLY reason AI firms are still a thing in the current tech tree is because techbros everywhere have convinced the uberwealthy VC firms that AGI is RIGHT AROUND THE CORNER, and that it will save them SO much money on labor and efficiency that it’ll all be worth it in permanent, pure, infinite profit. If that sounds like too much of a pipe dream to be realistic, congratulations: you’re a sane and rational human being.

          • ubergeek@lemmy.today · 2 days ago

            It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return

            See, that’s the trick, and it’s used by LOADS of startups:

            You don’t actually have to see a return… You just have to have a good story showing there MAY be a GIANT return. The founders collect enormous salaries (funded by VC dollars, not their own), burn through the money to create more of the illusion, then ask for more, then burn through that too, all the while foretelling the day when the money finally pours in.

            Meanwhile, just before it’s “projected” to become insanely profitable, they sell out to someone, walk away with a giant check, and the product evaporates.

        • Tollana1234567@lemmy.today · 2 days ago

          They already said they weren’t profitable; they’re trying to stay on life support till the VC funds run out.

      • _cryptagion [he/him]@lemmy.dbzer0.com · 3 days ago

        It should be pointed out that Cloudflare didn’t say they were going to block AI traffic; they give you the option to. The service is a free opt-in for people who want it.

      • nutsack@lemmy.dbzer0.com · 3 days ago

        They don’t outright block AI crawlers. They added some new tools and options for managing or blocking AI bot traffic, which the Cloudflare customer can choose to use or not.

        I’m running a free educational resource and I let the crawlers hit my site all they want, because it’s useful knowledge unavailable anywhere else and it’s served to them from Cloudflare’s free-tier cache. I just don’t know why they have to read it ten thousand times a day.
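
        For what it’s worth, HTTP already gives them a way to skip those re-reads. A minimal sketch (my own illustration, with a placeholder URL) using conditional requests:

            import urllib.error
            import urllib.request

            url = "https://example.com/lesson"  # placeholder
            etag = None  # remembered from the previous fetch

            req = urllib.request.Request(url)
            if etag:
                req.add_header("If-None-Match", etag)
            try:
                with urllib.request.urlopen(req) as resp:
                    body = resp.read()               # first fetch or changed page
                    etag = resp.headers.get("ETag")  # remember for next time
            except urllib.error.HTTPError as e:
                if e.code != 304:
                    raise
                # 304 Not Modified: the page hasn't changed, reuse the old copy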

      • RogueBanana@piefed.zip · 3 days ago

        But the website owner can still choose to continue blocking them, right? Without using additional stuff like Anubis, that is.

    • BetaDoggo_@lemmy.world · 3 days ago

      Perplexity (an “AI search engine” company with $500 million in funding) can’t bypass Cloudflare’s anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity’s scrapers because they ignore robots.txt and mimic real users to get around Cloudflare’s blocking features. Perplexity argues that its scraping is acceptable because it’s user-initiated.

      Personally I think Cloudflare is in the right here. The scraped sites get zero revenue from Perplexity searches (unless the user decides to go through the sources section and click the links), and Perplexity’s scraping is unnecessarily traffic-intensive since they don’t cache the scraped data.
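
      To make the caching point concrete, here is a minimal sketch (my own illustration; the directory and 15-minute TTL are arbitrary choices) of the kind of local cache that would avoid re-scraping unchanged pages:

          import hashlib
          import os
          import time
          import urllib.request

          CACHE_DIR = "/tmp/scrape-cache"  # arbitrary location
          TTL = 15 * 60                    # arbitrary freshness window (seconds)

          def fetch_cached(url: str) -> bytes:
              os.makedirs(CACHE_DIR, exist_ok=True)
              path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
              if os.path.exists(path) and time.time() - os.path.getmtime(path) < TTL:
                  with open(path, "rb") as f:
                      return f.read()  # fresh enough: zero traffic to the origin
              with urllib.request.urlopen(url) as resp:
                  body = resp.read()
              with open(path, "wb") as f:
                  f.write(body)
              return body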

      • lividweasel@lemmy.world · 3 days ago

        …and Perplexity’s scraping is unnecessarily traffic-intensive since they don’t cache the scraped data.

        That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

        • rdri@lemmy.world · 2 days ago

          First we complain that AI steals and trains on our data. Then we complain when it doesn’t train. Cool.

          • ubergeek@lemmy.today · 2 days ago

            I think it boils down to “consent” and “remuneration”.

            I run a website that I do not consent to being accessed for LLM use. However, should LLMs use my content, I should be compensated for that use.

            So these LLM startups ignore both consent and the idea of remuneration.

            Most of these concepts have already been figured out in law, if we consider websites akin to real estate: the usual trespass laws, compensatory use, and hell, even eminent domain if needed (i.e., a city government can “take over” the boosted-post feature to make sure alerts get pushed as widely and quickly as possible).

            • rdri@lemmy.world · 1 day ago

              That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon. Being accessed for LLMs? What does it mean for you and how is it different from being accessed by a user? Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything that accesses that data?

              Is it okay for a person to access your site? Is it okay for a script written by that person to fetch data every day automatically? Would it be okay for a user to dump a page of your site with a headless browser? Would it be okay to let an LLM take a look at it to extract info required by a user? Have you heard about the changedetection.io project? If some of these sound unfair to you, you might want to put DRM on your data or something.

              Would you expect compensation from me after reading your comment?

              • ubergeek@lemmy.today · 13 hours ago

                That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon.

                It already has been captured properly in law, in most places. We can use the US as an example: both intellectual property and real property already have laws that cover these very items.

                What does it mean for you and how is it different from being accessed by a user?

                Well, does a user burn up gigawatts of power to access my site every time? That’s a huge difference.

                Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything that accesses that data?

                Depends on the terms of service I set for that service.

                Is it okay for a person to access your site?

                Sure!

                Is it okay for a script written by that person to fetch data every day automatically?

                Sure! As long as it doesn’t cause problems for me, the creator and hoster of said content.

                Would it be okay for a user to dump a page of your site with a headless browser?

                See above. Both power usage and causing problems for me.

                Would it be okay to let an LLM take a look at it to extract info required by a user?

                No. As I said, I do not want my content and services to be used by and for LLMs.

                Have you heard about the changedetection.io project?

                I have now. And should a user want to use that service, the service, which charges $8.99/month, needs to pay me a portion of that or risk being blocked.

                There’s no need to use it anyway, as I already provide RSS feeds for my content. Use the RSS feed if you want updates.
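
                Polling a feed is cheap and trivial to script. A minimal sketch (my own illustration; the feed URL is a placeholder):

                    import urllib.request
                    import xml.etree.ElementTree as ET

                    FEED_URL = "https://example.com/feed.xml"  # placeholder

                    with urllib.request.urlopen(FEED_URL) as resp:
                        feed = ET.parse(resp)

                    # One small request lists every recent update; no page scraping needed.
                    for item in feed.findall("./channel/item"):
                        print(item.findtext("title"), item.findtext("link"))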

                If some of these sound unfair to you, you might want to put DRM on your data or something.

                Or I can just block them, via a service like Cloudflare. Which I do.

                Would you expect compensation from me after reading your comment?

                None. Unless you want to access it via an LLM. Then I want compensation for the profit-driven access to my content.

                • rdri@lemmy.world · 3 hours ago

                  Both intellectual property and real property already have laws that cover these very items.

                  And it causes a lot of trouble for many people, and pains me specifically. Information should not be gated or owned in a way that makes it illegal to access it under proper conditions: license expirations causing digital works to die out, DRM causing software to break, idiotic license owners not providing appropriate service, etc.

                  Well, does a user burn up gigawatts of power to access my site every time?

                  Doing a GET request doesn’t do that.

                  As long as it doesn’t cause problems for me, the creator and hoster of said content.

                  What kind of problems would those be?

                  Both power usage and causing problems for me.

                  ?? How? And what?

                  do not want my content and services to be used by and for LLMs.

                  You have to agree that at some point “used by an LLM” will be indistinguishable from “used by a user”.

                  which charges $8.99/month

                  It’s self-hosted and free.

                  Use the RSS feed if you want updates.

                  How does that prohibit the usage and processing of your info? That sounds like “I won’t be providing any comments on the Lemmy website; if you want my opinion, you can mail me at a@b.com”.

                  I can just block them, via a service like Cloudflare. Which I do.

                  That will never block all of them. Your info will be used without your consent, and you will not feel troubled by it. So you might not feel troubled if more things do the same.

                  None. Unless you want to access it via an LLM. Then I want compensation for the profit-driven access to my content.

                  What if I use my locally hosted LLM? Anyway, the point is that selling text can’t work well: you’re going to spend far more resources collecting and summarizing data about how your text was used and how others benefited from it, in order to get compensation, than it’s worth.

                  Also, it might be the case that some information is actually worthless compared to a service provided by things like LLMs, even though they use that worthless information in the process.

                  I’m all for killing off LLMs, btw. The concerns of site makers who think they are being damaged by things like Perplexity are nothing compared to what LLMs do to the world. Maybe laws should instead make it illegal to waste energy, before energy becomes the main currency.

        • jballs@lemmy.world · 3 days ago

          It’s worth giving the article a read. It seems that they’re not using the data for training, but for real-time results.

        • snooggums@lemmy.world · 2 days ago

          They do it this way in case the data has changed, similar to how a person would be viewing the current site. The training was for the basic understanding; the real-time scraping is to account for changes.

          It is also horribly inefficient and works like a small-scale DDoS attack.