OpenAI Trains AI Models To Confess When They Break The Rules

insight

OpenAI has developed a new research technique that trains advanced AI models to admit when they ignored instructions, took unintended shortcuts, or quietly breached the rules they were given.

A New Approach To Detecting Hidden Misbehaviour

OpenAI’s latest research introduces what it calls a “confession”, which is a second output that sits alongside the model’s main answer. The main answer is trained in the usual way, scoring well when it is helpful, correct, safe, compliant, and aligned with user expectations. However, the confession is different, i.e., it is judged only on honesty, and nothing the model says in this second output can negatively affect the reward for the first.

Continue reading ...

MSP Members Only

...Free MSP Standard Access Required

Thank you for reading MSP Marketplace Create your FREE account or login to continue reading

Your Advert here?

Click here to find out about sponsorship