New Anthropic research: Building and evaluating alignment auditing agents.
We developed three AI agents to autonomously complete alignment auditing tasks.
In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
Posted on X
Ethan Mollick
emollick
The mitigating factor for the problems with AI benchmarks (errors, saturation, contamination) is that, despite these issues, they are all still fairly heavily correlated.
So if your AI does well on GPQA or MMLU or HLE it also tends to do well on other benchmarks & on vibes & real work.
Posted on X
Ethan Mollick
emollick
For better or worse, depending on your view of the future of AI, the only time the letters "AGI" appear in the new White House AI Action Plan is in the word "leveraging."
Posted on X
Min Choi
minchoi
Thoughts? 🤔
"If you talk to ChatGPT about your most sensitive stuff and there's a lawsuit, we could be required to produce that..." - Sam Altman, CEO of OpenAI