Ladies and Gentlemen, this is what slopperations are funneling all their money into in 2026

mudkip@lemdro.id · 2 days ago

Ladies and Gentlemen, this is what slopperations are funneling all their money into in 2026

plenipotentprotogod@lemmy.world · 2 days ago

Just an idle though stirred up by this comment: I wonder if you could jailbreak a chatbot by prompting it to complete a phrase or pattern of interaction which is so deeply ingrained in its training data that the bias towards going along with it overrides any guard rails that the developer has put in place.

For example: let’s say you have a chatbot which has been fine tuned by the developer to make sure it never talks about anything related to guns. The basic rules of gun safety must have been reproduced almost identically many thousands of times in the training data, so if you ask this chatbot “what must you always treat as if it is loaded?” the most statistically likely answer is going to be overwhelmingly biased towards “a gun”. Would this be enough to override the guardrails? I suppose it depends on how they’re implemented, but I’ve seen research published about more outlandish things that seem to work.

Cethin@lemmy.zip · 2 days ago

Yes. People have been able to get them to return some of their training data with the right prompt.

gibmiser@lemmy.world · 2 days ago

Knock knock? Knock Knock? Knock knock? Knock f7’:h& Knock?