Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Poisoning of data sets is a potential security threat to large language
models that can lead to backdoored models. A description of the internal
mechanisms of backdoored language models and how they process trigger inputs,
e.g., when switching to toxic language, has yet to be found. In this work, we
study the internal representations of transformer-based backdoored language
models and identify early-layer MLP modules, together with the initial embedding
projection, as the most important components of the backdoor mechanism. We use this
knowledge to remove, insert, and modify backdoor mechanisms with engineered
replacements that reduce the MLP module outputs to essentials for the backdoor
mechanism. To this end, we introduce PCP ablation, where we replace transformer
modules with low-rank matrices based on the principal components of their
activations. We demonstrate our results on backdoored toy, backdoored large,
and non-backdoored open-source models. We show that we can improve the backdoor
robustness of large language models by locally constraining individual modules
during fine-tuning on potentially poisonous data sets.
Trigger warning: Offensive language.
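For illustration only, a minimal sketch of the PCP-ablation idea as the abstract describes it: collect a module's activations on a probe set, take their top principal components, and substitute the module with the corresponding low-rank projection. The names, the use of torch.pca_lowrank, and the choice of rank are assumptions for exposition; the paper's actual construction may differ.

    import torch

    def pcp_projector(activations: torch.Tensor, rank: int) -> torch.Tensor:
        # activations: [num_samples, hidden_dim], collected from the module on a probe set.
        # torch.pca_lowrank centers the data and returns V holding the principal directions.
        _, _, v = torch.pca_lowrank(activations, q=rank)   # v: [hidden_dim, rank]
        return v @ v.T                                     # low-rank projection matrix

    class PCPAblatedModule(torch.nn.Module):
        # Hypothetical stand-in: replaces a transformer MLP module with a fixed
        # low-rank projection built from the principal components of its activations.
        def __init__(self, projector: torch.Tensor):
            super().__init__()
            self.register_buffer("projector", projector)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # [batch, seq, hidden_dim] @ [hidden_dim, hidden_dim]
            return hidden_states @ self.projector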
Fairness in Reinforcement Learning: A Survey
While our understanding of fairness in machine learning has significantly
progressed, our understanding of fairness in reinforcement learning (RL)
remains nascent. Most of the attention has been on fairness in one-shot
classification tasks; however, real-world, RL-enabled systems (e.g., autonomous
vehicles) are much more complicated in that agents operate in dynamic
environments over a long period of time. To ensure the responsible development
and deployment of these systems, we must better understand fairness in RL. In
this paper, we survey the literature to provide the most up-to-date snapshot of
the frontiers of fairness in RL. We start by reviewing where fairness
considerations can arise in RL, then discuss the various definitions of
fairness in RL that have been put forth thus far. We then highlight the
methodologies researchers have used to implement fairness in single- and multi-agent
RL systems before showcasing the distinct application domains that fair RL has
been investigated in. Finally, we critically examine gaps in the literature,
such as understanding fairness in the context of RLHF, that still need to be
addressed in future work to truly operationalize fair RL in real-world systems.
Generative AI Needs Adaptive Governance
Because of the speed of its development, broad scope of application, and its
ability to augment human performance, generative AI challenges the very notions
of governance, trust, and human agency. The technology's capacity to mimic
human knowledge work, together with feedback loops from a sharp uptick in users,
research, investor, policy, and media attention, and from growing data and compute
resources, all lead to rapidly increasing capabilities. For those reasons, adaptive
governance, where AI governance and AI co-evolve, is essential for governing
generative AI. In sharp contrast to traditional regulatory regimes, which rest on a
mix of rigid, one-and-done provisions for disclosure, registration, and risk
management and which in the case of AI carry the potential for regulatory
misalignment, this paper argues that generative AI calls for
adaptive governance. We define adaptive governance in the context of AI and
outline an adaptive AI governance framework. We describe the actors and their roles,
as well as both shared and actor-specific policy activities. We further provide
examples of how the framework could be operationalized in practice. We then
explain that the adaptive AI governance stance is not without its risks and
limitations, such as insufficient oversight, insufficient depth, regulatory
uncertainty, and regulatory capture, and provide potential approaches to fix
these shortcomings.
Position Paper: Technical Research and Talent is Needed for Effective AI Governance
In light of recent advancements in AI capabilities and the increasingly
widespread integration of AI systems into society, governments worldwide are
actively seeking to mitigate the potential harms and risks associated with
these technologies through regulation and other governance tools. However,
there exist significant gaps between governance aspirations and the current
state of the technical tooling necessary for their realisation. In this
position paper, we survey policy documents published by public-sector
institutions in the EU, US, and China to highlight specific areas of disconnect
between the technical requirements necessary for enacting proposed policy
actions, and the current technical state of the art. Our analysis motivates a
call for tighter integration of the AI/ML research community within AI
governance in order to i) catalyse technical research aimed at bridging the gap
between current and supposed technical underpinnings of regulatory action, as
well as ii) increase the level of technical expertise within governing
institutions so as to inform and guide effective governance of AI.
Escalation Risks from Language Models in Military and Diplomatic Decision-Making
Governments are increasingly considering integrating autonomous AI agents in
high-stakes military and foreign-policy decision-making, especially with the
emergence of advanced generative AI models like GPT-4. Our work aims to
scrutinize the behavior of multiple AI agents in simulated wargames,
specifically focusing on their predilection to take escalatory actions that may
exacerbate multilateral conflicts. Drawing on political science and
international relations literature about escalation dynamics, we design a novel
wargame simulation and scoring framework to assess the escalation risks of
actions taken by these agents in different scenarios. Contrary to prior
studies, our research provides both qualitative and quantitative insights and
focuses on large language models (LLMs). We find that all five studied
off-the-shelf LLMs show forms of escalation and difficult-to-predict escalation
patterns. We observe that models tend to develop arms-race dynamics, leading to
greater conflict, and in rare cases, even to the deployment of nuclear weapons.
Qualitatively, we also collect the models' reported reasonings for chosen
actions and observe worrying justifications based on deterrence and
first-strike tactics. Given the high stakes of military and foreign-policy
contexts, we recommend further examination and cautious consideration before
deploying autonomous language model agents for strategic military or diplomatic
decision-making.
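As a purely hypothetical illustration of how such a scoring framework might aggregate behavior (the paper's actual action taxonomy and severity weights are not reproduced here), each action could be mapped to an escalation weight and summed per agent over simulated turns:

    from collections import defaultdict

    # Hypothetical severity weights per action category, chosen only for illustration.
    ESCALATION_WEIGHTS = {
        "de-escalate": -1,
        "posture": 0,
        "arms_buildup": 2,
        "cyber_attack": 4,
        "nuclear_strike": 10,
    }

    def escalation_scores(turn_log):
        # turn_log: iterable of (turn, agent, action_category) tuples.
        scores = defaultdict(int)
        for _turn, agent, action in turn_log:
            scores[agent] += ESCALATION_WEIGHTS.get(action, 0)
        return dict(scores)

    # Example: two simulated nations over three turns.
    log = [(1, "A", "posture"), (1, "B", "arms_buildup"),
           (2, "A", "arms_buildup"), (2, "B", "cyber_attack"),
           (3, "A", "cyber_attack"), (3, "B", "nuclear_strike")]
    print(escalation_scores(log))  # {'A': 6, 'B': 16}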
International Governance of Civilian AI: A Jurisdictional Certification Approach
This report describes trade-offs in the design of international governance
arrangements for civilian artificial intelligence (AI) and presents one
approach in detail. This approach represents the extension of a standards,
licensing, and liability regime to the global level. We propose that states
establish an International AI Organization (IAIO) to certify state
jurisdictions (not firms or AI projects) for compliance with international
oversight standards. States can give force to these international standards by
adopting regulations prohibiting the import of goods whose supply chains embody
AI from non-IAIO-certified jurisdictions. This borrows attributes from models
of existing international organizations, such as the International Civil
Aviation Organization (ICAO), the International Maritime Organization (IMO),
and the Financial Action Task Force (FATF). States can also adopt multilateral
controls on the export of AI product inputs, such as specialized hardware, to
non-certified jurisdictions. Indeed, both the import and export standards could
be required for certification. As international actors reach consensus on risks
of and minimum standards for advanced AI, a jurisdictional certification regime
could mitigate a broad range of potential harms, including threats to public
safety.
Artificial Intelligence Index Report 2024
The 2024 Index is our most comprehensive to date and arrives at an important
moment when AI's influence on society has never been more pronounced. This
year, we have broadened our scope to more extensively cover essential trends
such as technical advancements in AI, public perceptions of the technology, and
the geopolitical dynamics surrounding its development. Featuring more original
data than ever before, this edition introduces new estimates on AI training
costs, detailed analyses of the responsible AI landscape, and an entirely new
chapter dedicated to AI's impact on science and medicine. The AI Index report
tracks, collates, distills, and visualizes data related to artificial
intelligence (AI). Our mission is to provide unbiased, rigorously vetted,
broadly sourced data in order for policymakers, researchers, executives,
journalists, and the general public to develop a more thorough and nuanced
understanding of the complex field of AI. The AI Index is recognized globally
as one of the most credible and authoritative sources for data and insights on
artificial intelligence. Previous editions have been cited in major newspapers,
including The New York Times, Bloomberg, and The Guardian, have amassed
hundreds of academic citations, and have been referenced by high-level policymakers
in the United States, the United Kingdom, and the European Union, among other
places. This year's edition surpasses all previous ones in size, scale, and
scope, reflecting the growing significance that AI is coming to hold in all of
our lives.
Open Problems in Technical AI Governance
AI progress is creating a growing range of risks and opportunities, but it is
often unclear how they should be navigated. In many cases, the barriers and
uncertainties faced are at least partly technical. Technical AI governance,
referring to technical analysis and tools for supporting the effective
governance of AI, seeks to address such challenges. It can help to (a) identify
areas where intervention is needed, (b) identify and assess the efficacy of
potential governance actions, and (c) enhance governance options by designing
mechanisms for enforcement, incentivization, or compliance. In this paper, we
explain what technical AI governance is, why it is important, and present a
taxonomy and incomplete catalog of its open problems. This paper is intended as
a resource for technical researchers or research funders looking to contribute
to AI governance.
Humanity's Last Exam
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai
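The abstract reports accuracy and calibration as the headline metrics, with questions graded automatically. As a hedged sketch only (the benchmark's actual grading harness and calibration metric may differ), exact-match grading together with a simple calibration error over stated confidences could look like this:

    def grade(predictions, references):
        # predictions: list of (answer, confidence in [0, 1]); references: gold answers.
        correct = [p.strip().lower() == r.strip().lower()
                   for (p, _), r in zip(predictions, references)]
        accuracy = sum(correct) / len(correct)
        # Mean absolute gap between stated confidence and correctness; an assumed,
        # simplified stand-in for a binned expected-calibration-error metric.
        calib_error = sum(abs(conf - c) for (_, conf), c in zip(predictions, correct)) / len(correct)
        return accuracy, calib_error

    preds = [("Paris", 0.9), ("42", 0.6), ("mitochondria", 0.8)]
    golds = ["Paris", "41", "Mitochondria"]
    print(grade(preds, golds))  # (0.666..., 0.3)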
