But you can trust the first model to produce the code you want, or at least establish a baseline for whether it works as expected. To return to the simple example of secure (sanitized) user input via a form: the human sets up the testing environment. All the human needs to do is write a script that reads the entered database entry and hashes the rest of the database / application state in memory, so any change outside that entry is detectable.
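A minimal sketch of what that human-written harness could look like, assuming an in-memory SQLite database and a model-generated handler passed in as `insert_user_entry` (a hypothetical name, not from any real library). The harness hashes every row except the one the form is allowed to touch, runs the handler, and checks that the rest of the state is unchanged:

```python
import hashlib
import sqlite3


def hash_state_excluding(conn, table, excluded_id):
    """Hash every row in `table` except the entry the form may modify."""
    digest = hashlib.sha256()
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE id != ? ORDER BY id", (excluded_id,)
    )
    for row in rows:
        digest.update(repr(row).encode())
    return digest.hexdigest()


def run_case(insert_user_entry, user_input, entry_id):
    """Return True if the handler changed nothing outside the permitted entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO entries (id, body) VALUES (?, ?)",
        [(1, "existing row"), (2, "another row")],
    )

    before = hash_state_excluding(conn, "entries", entry_id)
    insert_user_entry(conn, entry_id, user_input)  # model-generated code under test
    after = hash_state_excluding(conn, "entries", entry_id)

    return before == after
```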
It should be simple for the first model to use different languages and approaches, from strongly typed languages like Ada to YOLO implementations in Python.
The adversarial model’s job is to modify the state of the application or database outside of that entry. This should be possible against at least some of the first model’s implementations, unless they are already perfect.
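Continuing the sketch above, here is the kind of pair the adversarial step is looking for: a naive, string-concatenating handler the first model might plausibly produce, and a red-team input that tries to reach rows outside the permitted entry. The names and the attack string are purely illustrative.

```python
def naive_insert(conn, entry_id, user_input):
    # Vulnerable: builds SQL by string concatenation instead of parameters.
    conn.execute(
        f"UPDATE entries SET body = '{user_input}' WHERE id = {entry_id}"
    )


# Benign input only touches the allowed row, so the harness passes it.
print(run_case(naive_insert, "hello world", entry_id=2))          # True

# Injected input rewrites every row, so the hash of the excluded state
# changes and the harness flags the implementation as broken.
print(run_case(naive_insert, "pwned' WHERE 1=1 --", entry_id=2))  # False
```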
The idea is that with enough permutations of implementations, at different temperatures and with different input context, an almost infinite number of blue-team and red-team examples can be produced and iterated on for this one specific problem.
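A rough sketch of that permutation loop, under the assumption that `generate_implementation` and `generate_attack` are hypothetical stand-ins for whatever model API is actually used (they are not real library calls); the harness from the earlier sketch scores each blue-team / red-team pair:

```python
def sweep(prompt, temperatures=(0.2, 0.5, 0.8, 1.1)):
    """Collect pass/fail examples across sampling temperatures."""
    dataset = []
    for temp in temperatures:
        handler = generate_implementation(prompt, temperature=temp)  # blue team (hypothetical)
        attack = generate_attack(prompt, temperature=temp)           # red team (hypothetical)
        survived = run_case(handler, attack, entry_id=2)
        dataset.append({"temperature": temp, "attack": attack, "passed": survived})
    return dataset
```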
This approach is already being generalized to produce more high-quality software training data for LLMs than exists in the entire corpus of human output.
This is very hard to do with art or writing. Art is subjective; you cannot validate the output automatically or detect subtle variations so easily without context and opinion.
This is tangential to why machine learning works so well for weather data. We can objectively validate the output against historical data, but we can also create synthetic weather data using physically based models. It’s a different domain, but similar in principle.

Has there actually been evidence of Alexa or Google Home devices being used for government surveillance?