Jailbreaking is an inherent problem with LLMs that can never be solved. Any safeguard has to be less capable than the LLM it protects, else you can just target that. So there will always be a way to communicate with the LLM in a way that bypasses the safeguard.
It’s like trying to sanitise user input from SQL injections, except the database speaks every form of communication documented by humanity.
All this is to say, I’m glad I’m not responsible for any of these systems.
Jailbreaking is an inherent problem with LLMs that can never be solved. Any safeguard has to be less capable than the LLM it protects, else you can just target that. So there will always be a way to communicate with the LLM in a way that bypasses the safeguard.
It’s like trying to sanitise user input from SQL injections, except the database speaks every form of communication documented by humanity.
All this is to say, I’m glad I’m not responsible for any of these systems.