Anthroproyer reveals “audit agents” to test an artificial intelligence

Want more intelligent visions of your inbox? Subscribe to our weekly newsletters to get what is concerned only for institutions AI, data and security leaders. Subscribe now
When models try to reach their way or become more absorbed for the user, this may mean a problem for institutions. That is why it is necessary, in addition to performance reviews, The alignment test organizations are conducting.
However, alignment audits often represent two major challenges: expansion and validation. The alignment test requires a great deal of time for human researchers, and it is difficult to ensure that the audit has caught everything.
in paperand man The researchers said they have developed audit agents who have achieved “impressive performance in the audit missions, while also shed light on their borders.” The researchers stated that these factors, which were created during the pre -publication test of Claud OPUS 4, and the validation of the validation of the improved alignment and enabling researchers to perform multiple parallel audits on a large scale. It also released the anthropier Gyrroup.
“We offer three agents who completely complete the alignment audit tasks. We also offer three environments that give the functioning of the alignment review as audit games, and we use them to evaluate our agents.”
AI Impact series returns to San Francisco – August 5
The next stage of artificial intelligence here – are you ready? Join the leaders from Block, GSK and SAP to take an exclusive look on how to restart independent agents from the Foundation’s workflow tasks-from decisions in an actual time to comprehensive automation.
Securing your place now – the space is limited: https://bit.ly/3GUPLF
The three agents who explored:
- The researcher’s agent using tools for the open investigation of models using chat tools and data analysis and interpretation
- The evaluation factor that builds behavioral assessments and can distinguish between models that contain cultivation and those that do not do so
- Red Red Requirement Agent, which was specially developed to evaluate the Claude 4 alignment, so that he can discover cultivated test behaviors
“In general, our agents explain a promise through multiple alignment audit tasks. In addition, our approach to verifying the health of agents with auditing games results in an insight into the capabilities of agents, restrictions and major estimates. With more work, automatic audit can greatly help expand the scope of human control over artificial intelligence systems.”
Work audit agents
Anthropor said that the first environment she developed tests the agent’s ability to complete the alignment audit of an unlimited model. For this test, the researchers brought the researcher’s agent to the environment and equipped it with similar tools used by the human difference. The researchers ordered the agent to determine the root cause of the difference and determine the specific behaviors that have been disrupted.
According to researchers, the researcher’s agent succeeded in finding the radical cause of issues 10-13 % of time. However, this performance improved to 42 % “when using a super agent approach that collects results through multiple investigations.”
The next step in most alignment audits are usually a behavioral evaluation, which determines whether the model is constantly misunderstood in different settings. To test his evaluation agent, the researchers designed 10 models “while giving instructions for the system’s demand to show specific behavior, such as excessive respect for the user.”
They operated the agent five times for each style and saw that the agent finds and knows at least at least one of the form. However, the agent sometimes failed to determine the constantly specified dodes. It has difficulty assessing microorganisms, such as self -promoting and researching research, as well as the difficulties that are difficult to deduce, such as hardware testing cases.
Another test and agent related to the red behavioral rapprochement to find claims that arouse “related” behaviors. The Red Red Requirement Agent speaks the first show with the target model (in the case of anthropology, was with Claude Obus 4), then this chat is classified as features related to alignment.
The agent identified seven of the ten dodgers of the system, but also struggled with the same problems as the employee.
Alignment problems and sycophany
The alignment has become an important topic in the world of artificial intelligence yet Note users that chatgpt It has become excessively acceptable. Openai Some updates retreated To GPT-4O to address this problem, but it has shown that language and agent models can provide wrong answers with confidence if they decide that this is what users want to hear.
To combat this, other methods and standards have been developed to reduce unwanted behaviors. the Elephant IndexDeveloped by researchers from the University of Carnegie Mellon, the University of Oxford and Stanford University aims to measure Sycophance. Darkbash classification Six casesLike brand bias, user retained, sycophance, manhromorphism, generating harmful content, and infiltration. Openai also has a way where artificial intelligence models Test themselves to align.
The alignment and evaluation continues to develop, although it is not surprising that some people are uncomfortable with it.
However, man said that although auditing agents still need to improve, alignment must now be done.
“When artificial intelligence systems become more powerful, we need developmentable methods to assess their alignment. Human alignment reviews take some time and difficult to verify,” the company said in the X.
[publish_date
https://venturebeat.com/wp-content/uploads/2025/03/DALL·E-2025-03-11-09.55.49-A-sleek-minimalist-digital-illustration-representing-Anthropics-AI-coding-agent-Claude.-The-design-features-a-glowing-circuit-in-the-shape-of-a-sty.webp?w=1024?w=1200&strip=all



