DO NOT talk about the goblins

not_IO@lemmy.blahaj.zone · edited 2 days ago

DO NOT talk about the goblins

Apepollo11@lemmy.world · 2 days ago

By “made out of tissue paper”, I assume you mean written in a list in English?

These lines were added to the agent instructions to address a specific weird behaviour that had been observed in Codex’s output. How would you have done it correctly?

Filter the output to remove all instances of raccoons? What if the project is actually about raccoons?
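To make the filtering problem concrete, here is a minimal sketch of what such an output filter might look like. The `BANNED` pattern and the `filter_output` helper are made up for illustration and are not anything Codex is known to use.

```python
import re

# Hypothetical blocklist: the pattern and the word it matches are assumptions.
BANNED = re.compile(r"\braccoons?\b", re.IGNORECASE)


def filter_output(text: str) -> str:
    """Drop every sentence that mentions a banned word."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if not BANNED.search(s))


# Off-topic mention is removed, as intended.
print(filter_output("The build passed. A raccoon approves. Ship it."))

# But a project that really is about raccoons loses its actual content.
print(filter_output("Raccoons are nocturnal. They eat almost anything."))
```

The filter has no idea whether the mention is relevant, so it strips the legitimate sentence in the second call just as readily as the stray one in the first.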

Run an adversarial LLM specifically to double-check and, if necessary, correct instances of raccoons? That uses twice the power, and the rule still needs to be defined in text.

Train a new model with an anti-raccoon bias? I’d be surprised if they didn’t for the next iteration, but it takes time.

The reality is that for something this daft, the immediate fix is this.

Biases against outputs that might encourage self-harm, murder, etc. are baked into the models during training nowadays. These guardrails are there in the neural network, not as text or instructions, but as part of the structure itself.

The plain-text agent instructions just give the different models a push in the direction that they want. Apparently it was mentioning raccoons in unexpected contexts, so for now they just told it not to anymore.