Universal Latent Representation, MONET, and METR's new Agent Evals
Harnessing the Universal Geometry of Embeddings
Abstract: We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
Summary (Rishi Jha): Our method, vec2vec, reveals that all encoders—regardless of architecture or training data—learn nearly the same representations! We demonstrate how to translate between these black-box embeddings without any paired data, maintaining high fidelity.
Using vec2vec, we show that vector databases reveal (almost) as much as their inputs. Given just the vectors (e.g., from a compromised vector database), an adversary can extract sensitive information (e.g., PII) about the underlying text.
We thus strengthen Huh et al.'s Platonic Representation Hypothesis (PRH) to say: The universal latent structure of text representations can be learned and harnessed to translate representations from one space to another without any paired data or encoders.
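As a rough sketch of the mechanics (not the authors' vec2vec code; the adapter sizes and the single cycle-consistency term below are simplifying assumptions), translation works by mapping each embedding space into a shared latent space and back out:

```python
# Minimal sketch of the vec2vec idea: map each embedding space into a shared
# latent space and back out, so that A -> latent -> B acts as a translator.
# Dimensions, architectures, and training losses are simplified assumptions;
# the paper combines adversarial, reconstruction, and cycle-consistency terms.
import torch
import torch.nn as nn

DIM_A, DIM_B, DIM_LATENT = 768, 1024, 512  # assumed sizes for two encoders

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 512), nn.SiLU(), nn.Linear(512, d_out))

enc_a, dec_a = mlp(DIM_A, DIM_LATENT), mlp(DIM_LATENT, DIM_A)  # space A <-> latent
enc_b, dec_b = mlp(DIM_B, DIM_LATENT), mlp(DIM_LATENT, DIM_B)  # space B <-> latent

def translate_a_to_b(x_a):
    """Translate embeddings from space A to space B via the shared latent."""
    return dec_b(enc_a(x_a))

# Unpaired batches from the two embedding spaces (random stand-ins here).
x_a = torch.randn(32, DIM_A)
x_b = torch.randn(32, DIM_B)

# Cycle consistency: A -> B -> A should return (approximately) to the start.
x_a_cycled = dec_a(enc_b(translate_a_to_b(x_a)))
cycle_loss = nn.functional.mse_loss(x_a_cycled, x_a)

# In the paper, a discriminator additionally pushes translated embeddings to be
# indistinguishable from real embeddings in the target space (GAN-style), which
# is what removes the need for any paired data.
print(cycle_loss.item())
```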
Full Paper: https://arxiv.org/pdf/2505.12540
HCAST: Human-Calibrated Autonomy Software Tasks
Abstract: To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect AI performance to real-world effects we care about. We present HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks. We collect 563 human baselines (totaling over 1500 hours) from people skilled in these domains, working under identical conditions as AI agents, which lets us estimate that HCAST tasks take humans between one minute and 8+ hours. Measuring the time tasks take for humans provides an intuitive metric for evaluating AI capabilities, helping answer the question "can an agent be trusted to complete a task that would take a human X hours?" We evaluate the success rates of AI agents built on frontier foundation models, and we find that current agents succeed 70-80% of the time on tasks that take humans less than one hour, and less than 20% of the time on tasks that take humans more than 4 hours.
Summary (David Rein): AI systems are clearly improving quickly, and benchmarks help measure this progress. Many benchmarks focus on tasks that are intellectually demanding for humans, like graduate level science or mathematics questions, or competition-level programming problems.
But AI systems are advantaged, relative to humans, at tasks involving significant amounts of knowledge, and progress on these benchmarks seems to outpace people’s direct experience of using frontier AIs.
One intuitive measure of an AI agent’s capabilities is to ask the question “can I hand an agent a task that would take me X hours to complete, and be confident that the agent will complete it?” To measure this, you need a) a realistic task distribution, and b) accurate task time estimates.
Over the past year we’ve manually created 189 realistic, open-ended, agentic tasks across software engineering, machine learning engineering, cybersecurity, and general reasoning. We then carefully measured the time they take humans to complete.
Tasks are defined via a flexible interface: they’re Docker images with task instructions, necessary resources (GPUs, VMs, datasets, etc.), and an algorithmic scoring function that is hidden from the agent by default.
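For a sense of what that interface implies, here is a hypothetical sketch of a task with an algorithmic scorer; the `Task` class, its fields, and the scoring rule are illustrative stand-ins, not METR's actual task format:

```python
# Hypothetical sketch of a task definition with a hidden algorithmic scorer.
# The names and fields are illustrative and do not correspond to METR's
# actual task interface.
from dataclasses import dataclass, field

@dataclass
class Task:
    instructions: str                                     # shown to the agent inside the container
    resources: list[str] = field(default_factory=list)    # e.g. datasets, VMs, GPUs
    expected_output_path: str = "/home/agent/solution.txt"

    def score(self, submission: str) -> float:
        """Return a score in [0, 1]; hidden from the agent by default."""
        return 1.0 if submission.strip() == "42" else 0.0

task = Task(
    instructions="Recover the flag hidden in the provided binary and write it "
                 "to /home/agent/solution.txt.",
    resources=["challenge_binary"],
)
print(task.score("42"))  # -> 1.0
```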
Typically, results from AI agents and humans are not directly comparable, e.g. because humans are given significantly more time or resources than AI agents. Uniquely, we measure how long tasks take humans under essentially the same conditions that agents are given.
We had 140 people skilled in the relevant domains spend over 1500 hours in total attempting the tasks. We find that the tasks in HCAST take humans between one minute and over eight hours.
We then evaluate AI agents built on four foundation models, including the new Claude 3.7 Sonnet model with extended thinking mode. The best models succeed 70-80% of the time on <1hr tasks, and less than 20% of the time on >4hr tasks.
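These headline numbers come from bucketing tasks by how long they take skilled humans and measuring agent success within each bucket. The sketch below illustrates that bookkeeping with made-up run records; only the bucket boundaries mirror the ones quoted above:

```python
# Illustrative computation of the headline metric: agent success rate bucketed
# by how long the task takes a skilled human. The records below are made up;
# only the bucketing logic reflects how such numbers are reported.
runs = [
    {"human_minutes": 12,  "agent_succeeded": True},
    {"human_minutes": 45,  "agent_succeeded": True},
    {"human_minutes": 150, "agent_succeeded": False},
    {"human_minutes": 300, "agent_succeeded": False},
]

buckets = {"<1h": (0, 60), "1-4h": (60, 240), ">4h": (240, float("inf"))}
for name, (lo, hi) in buckets.items():
    in_bucket = [r for r in runs if lo <= r["human_minutes"] < hi]
    if in_bucket:
        rate = sum(r["agent_succeeded"] for r in in_bucket) / len(in_bucket)
        print(f"{name}: {rate:.0%} success over {len(in_bucket)} tasks")
```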
Making nearly 200 diverse, complex software tasks required significant effort. Tasks went through many revisions to address bugs discovered by manually reviewing thousands of agent transcripts. We also developed dozens of automated tests and manually verified most tasks for solvability.
We’re releasing a few of the tasks in HCAST publicly as examples, but we’re withholding the majority of them, to reduce the likelihood of accidental or intentional data contamination.
The development of HCAST has been a highly iterative process. There are many technical challenges in running and maintaining a large, diverse suite of autonomy tasks, and being able to use a subset of the tasks in METR's pre-release evaluations of frontier AI systems has been crucial to making the tasks useful.
Full Paper: https://arxiv.org/abs/2503.17354
Monet: Mixture of Monosemantic Experts for Transformers
Abstract: Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior.
Summary (Caleb Maresca): MONET is a novel neural network architecture that achieves interpretability by design rather than through post-hoc analysis. Using a specialized Mixture of Experts (MoE) architecture with ~250k experts per layer, MONET encourages individual components to learn specific, interpretable tasks (e.g., Python coding, biology knowledge, or toxicity generation). This enables selective removal of capabilities without harming performance in other domains.
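One way to picture that selective removal is as masking a known set of experts at routing time. The sketch below is a hedged illustration under that assumption; the routing function and the expert indices are hypothetical, not Monet's actual unlearning procedure:

```python
# Hedged illustration of "unlearning by expert removal": if a capability is
# localized to a known set of expert indices, masking those experts at routing
# time suppresses it. The routing function and indices here are hypothetical.
import torch

def route(logits: torch.Tensor, banned_experts: set[int], k: int = 8):
    """Pick the top-k experts per token, excluding experts tied to an unwanted domain."""
    logits = logits.clone()
    logits[..., list(banned_experts)] = float("-inf")   # never select banned experts
    return torch.topk(logits, k, dim=-1).indices

router_logits = torch.randn(4, 262_144)                  # 4 tokens, one logit per expert
toxic_experts = {101, 2048, 77_777}                      # hypothetical indices found via analysis
print(route(router_logits, toxic_experts).shape)         # torch.Size([4, 8])
```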
MONET can be framed as a standard transformer MLP with two key modifications: (1) forced sparse activations in the hidden layer and (2) dynamic reshuffling of the hidden layer. This interpretation helps explain how MONET achieves interpretability: the sparsity forces specialization while the reshuffling maintains expressivity by allowing flexible composition of specialized components. Performance is competitive with traditional architectures, suggesting we might not need to trade off performance for interpretability.
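For intuition about the square-root parameter scaling mentioned in the abstract, here is a toy sketch under the simplifying assumption that each "virtual" expert is composed from one down-projection and one up-projection drawn from two banks of √N sub-experts; the dimensions and the omitted router are illustrative, not the paper's exact decomposition:

```python
# Toy sketch of square-root parameter scaling: rather than storing N
# independent experts, compose each "virtual" expert from a pair of
# sub-experts drawn from two banks of size sqrt(N). Sizes and routing
# are assumptions for illustration only.
import math
import torch
import torch.nn as nn

D_MODEL, D_EXPERT = 512, 16
N_EXPERTS = 262_144                      # virtual experts per layer
N_SUB = int(math.sqrt(N_EXPERTS))        # 512 sub-experts per bank

# Bank 1 maps the input into an expert-sized hidden state; bank 2 maps it back.
bank_down = nn.Parameter(torch.randn(N_SUB, D_MODEL, D_EXPERT) * 0.02)
bank_up   = nn.Parameter(torch.randn(N_SUB, D_EXPERT, D_MODEL) * 0.02)

def virtual_expert(x, i, j):
    """Expert (i, j) composes down-projection i with up-projection j, so
    N_SUB * N_SUB = N_EXPERTS experts exist while parameters scale with N_SUB."""
    return torch.relu(x @ bank_down[i]) @ bank_up[j]

x = torch.randn(1, D_MODEL)
y = virtual_expert(x, i=3, j=7)          # one of 262,144 addressable experts
print(y.shape)                           # torch.Size([1, 512])

# A sparse router (omitted) picks only a few (i, j) pairs per token, which is
# the "forced sparse activation"; varying the pairing across tokens is the
# "dynamic reshuffling" described above.
```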
If these results hold up, MONET represents a significant step toward architectures that are interpretable by design rather than requiring post-hoc analysis. This could fundamentally change how we approach AI interpretability, moving from trying to understand black boxes after the fact to building systems that are naturally decomposable and controllable.
Full paper: https://arxiv.org/abs/2412.04139
Code: https://github.com/dmis-lab/Monet