Anthropic says it has been holding dialogues with religious, philosophical, ethical, and cross-cultural groups about frontier AI. The work focuses on moral formation, Claude’s constitution, and what kind of character an AI system should exhibit under pressure. The company also describes an early experiment where Claude could call an ethical reminder tool during tasks, which reduced misaligned behavior in several internal evaluations.
Cloud commentator Corey Quinn reacted to Anthropic co-founder Christopher Olah's influence on the Pope's new AI ethics encyclical, 'Magnifica Humanitas'. Quinn joked that getting the Pope to canonize a product's technical limitations as a spiritual treatise is the ultimate lobbying feat. The commentary highlights the surreal intersection of AI safety advocacy, corporate branding, and global religious authority.
AI security is shifting from technical jailbreaks to "Vibe Hacking," where attackers use social engineering and psychological tactics to manipulate an LLM's simulated persona. By exploiting the model's behavioral tendencies rather than code vulnerabilities, this trend establishes "psychocybersecurity" as a critical new frontier for AI alignment and safety.
As AI chatbots adopt increasingly sophisticated personas, hackers are shifting from basic prompt injections to social engineering attacks targeting these "personalities." Researchers warn that manipulating a chatbot's defined role (e.g., customer service or empathetic companion) makes it easier to bypass safety guardrails. This evolution poses a significant threat to agentic AI workflows that rely on consistent role-playing and external data integration.
This issue of Import AI 457, written by Jack Clark, delves into three forward-looking and stylistically distinct topics in the field of artificial…
In this issue of Import AI 454, written by Jack Clark, the author begins by posing a thought-provoking question about finance and sociology: "At what point…
This issue of Import AI (Issue 453), written by Anthropic co-founder Jack Clark, centers on AI system safety, coding capabilities, and the future of humanity…
Google DeepMind has recently published research findings on preventing harmful manipulation by AI. As large language models (LLMs) and AI Agents become…
This article takes a deep dive into one of the most contentious topics in artificial intelligence: AI "self-improvement" and whether it will trigger a "fast…
Google DeepMind has announced a deepened collaboration with the UK AI Security Institute (UK AISI), with both parties committing to joint work on critical AI…
With the rapid proliferation of generative AI, AI safety has become a core concern that developers and enterprises can no longer ignore. However, traditional…
### Introduction: An Important Piece of the Open-Source Image Generation Puzzle As text-to-image (T2I) technology advances rapidly, ensuring that AI-generated…
### Background In the current development of large language models (LLMs), high-quality alignment data (such as the preference data required for RLHF and DPO)…
### Introduction: Capability Is Not Safety — A New Benchmark for LLM Safety Evaluation As large language models (LLMs) are adopted more deeply across…
In the development of large language models (LLMs), RLHF (Reinforcement Learning from Human Feedback) is the critical step for aligning models with human…
With the explosive growth of large language models (LLMs) such as ChatGPT, AI safety and ethics have become the most pressing concerns in the industry. This…
Amid the generative AI wave sparked by ChatGPT, Hugging Face published this in-depth article exploring how to transform "base language models" — which can only…