'Bergeron: Combating Adversarial Attacks Through a Conscience-Based Alignment Framework'

by Noah Lloyd

April 4, 2024

“Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when models are deliberately attacked. These attacks can trick seemingly aligned models into giving manufacturing instructions for dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning.”

Find the paper and full list of authors at arXiv.

Dakuo Wang

Artificial Intelligence, Computer Science

Research Paper October 30, 2025
