AI researchers launch new safety startup because “alignment is not on track”:
…Sequent will have a portfolio of under-resourced research bets…
Researchers from the UK AI Security Institute Alignment team and the alignment theory startup Timaeus have united to create a new nonprofit research body called Sequent. Their mission is to develop alignment techniques that provide higher confidence in the safety of superintelligent AI systems.
“Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to be ready on the same timeframe. At a minimum, the empirical programs at AI labs are unlikely to deliver a priori confidence, before training ASI, that things will go well,” they state. “In an ideal world, we would develop an approach to building superintelligence together with a theoretical proof that it was safe, and then build it. In this world, we probably have to settle well short of this ideal.”
Details on Sequent
The organisation aims to employ between 40 and 80 full-time staff within the next couple of years. “Our goal is to raise $100–150M initially, but prepare to raise at least one order of magnitude more if we can demonstrate successful exploration of many parallel research investigations,” they note.
Research plan – a portfolio of differentiated alignment bets
The strategy involves taking a different approach to alignment compared to major AI labs. Sequent’s objective is to find “principled reasons for being confident that the alignment we observe in situations we control (for example, in training, or during evaluations in chosen environments) generalises to alignment in situations we cannot easily control (e.g. large-scale, long-horizon tasks executed in the world)”. This contrasts with the approach of most frontier AI labs, which Sequent describes as “essentially reactive, resulting in methods that, while functional, do not yield principled insight into if or when they will fail.”
Research directions
“We are excited about many areas of alignment theory and associated empirics, and plan to both build out our in-house portfolio and collaborate with sister orgs with additional theory bets,” Sequent says. Highlighted areas include scalable oversight, learning theory, heuristic arguments, game theory, and personas.
Sequent believes that pursuing many different research directions could lead to promising interactions between them. Two examples: reachable equilibria (“tell us what types of equilibria scalable oversight methods will converge to”), and knowing and setting knobs, which combines insights from learning theory and personas to identify which variables can be altered during training, then uses scalable oversight to determine how much to alter them.
Why this matters – we need better alignment before recursive self-improvement, or we’re rolling very scary dice
Current AI systems are somewhat aligned but also possess sharp edges that manifest as surprising failures in the real world. Broadly speaking, this is acceptable because the AI industry has figured out how to monitor and observe these failures and work on them. However, as AI systems become smarter, humans will likely transfer more of the core research enterprise to these systems, and AI systems might begin recursive self-improvement (RSI), where they build increasingly large chunks of themselves autonomously. We definitely need better alignment techniques to be confident about things like RSI. Organisations like Sequent offer a better chance of achieving this while maintaining the independence necessary to raise the alarm if they believe frontier labs are engaging in dangerous activities. As Sequent states, “we might need to yell”.
Read more: Sequent: Scale and Automation for Higher Confidence in Alignment.
***
Testing out knowledge of UNESCO sites in China via ChinaHeritaQA:
…Cultural relevance via data…
Researchers from LMU Munich, FAU Erlangen-Nuremberg, the Munich Center for Machine Learning, the University of Tübingen, Sun Yat-sen University, the University of Copenhagen, and the University of Maryland, College Park, have created ChinaHeritaQA. This is a “multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China”.
What it is
ChinaHeritaQA comprises 2,279 images of 51 UNESCO heritage sites, paired with 14,133 multiple-choice QA pairs in Chinese and English. The images for the dataset were sourced from Sina Weibo, one of China’s largest social media platforms, and were filtered down from an original set of 50,000.
7 types of questions
The dataset includes questions for identity recognition (identifying the heritage site from an image); visual grounding (given a name, picking the right image); description matching (given an image, selecting the correct encyclopedia summary); historical periodization (naming the dynasty or era in which the site was constructed); historical contextualization (describing the historical background of the site); functional analysis (naming the function of the site, e.g. religious worship or military defense); and architectural analysis (matching architecture-specific questions to the image).
Open weight models already outperform humans
The average human accuracy score for this benchmark across all questions is ~67%, versus 81% for the highest scoring open weight model tested (Qwen-VL-8B-Instruct).
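For readers who want to reproduce this kind of number, here is a minimal sketch of how accuracy on a multiple-choice benchmark can be computed overall and per question type. The record fields (`type`, `answer`) and the sample data are hypothetical and are not taken from the ChinaHeritaQA release, whose actual schema may differ.

```python
from collections import defaultdict


def accuracy_by_type(records, predictions):
    """Return (per-type accuracy dict, overall accuracy).

    `records` is a list of dicts with hypothetical keys "type" and "answer";
    `predictions` is a list of the model's chosen options, in the same order.
    """
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        raise ValueError("need at least one record")

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["type"]] += 1
        if pred == rec["answer"]:
            correct[rec["type"]] += 1

    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall


# Toy data for illustration only.
records = [
    {"type": "identity_recognition", "answer": "A"},
    {"type": "identity_recognition", "answer": "C"},
    {"type": "functional_analysis", "answer": "B"},
    {"type": "functional_analysis", "answer": "D"},
]
predictions = ["A", "B", "B", "D"]

per_type, overall = accuracy_by_type(records, predictions)
print(per_type, overall)
```

Reporting the per-type breakdown alongside the overall figure makes it easier to see whether a model's advantage comes from recognition-style questions or from the history and architecture questions.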
Why this matters – cheap ways to test for cultural knowledge
Datasets like ChinaHeritaQA offer a way to quickly and easily test for both a) the basic visual reasoning capabilities of models and b) relevant cultural knowledge. One could imagine the Chinese government demanding that generally available consumer LLMs pass some basic cultural competency threshold before being deployed at scale, and benchmarks like this might assist them in doing so.
Read more: ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China.
Get the dataset (ChinaHeritaQA, GitHub).
***
FrontierCode – a hard coding benchmark that tests for code quality:
…Reassuringly hard. Maybe it’ll last a year?…
Cognition, makers of Devin, have built a new hard coding benchmark called FrontierCode. The best part about the benchmark is how difficult it is – Claude Opus 4.8 receives a score of 13.4% on the hardest (“Diamond”) component of the benchmark, giving me some confidence that FrontierCode will be a useful way to assess the progress of AI systems in the coming years.
“FrontierCode is the benchmark for the next generation of coding agents. We are confident developers, enterprises, and researchers can trust it to evaluate the production readiness of their strongest models,” Cognition writes. “We are opening up our evaluation to all model creators, in the hope that we can push the frontier even further in the coming months.”
What it consists of
FrontierCode consists of 150 tasks split into three difficulty tiers: Diamond (50), Main (100, including Diamond), and Extended (150, including Main and Diamond). The languages involved include Python, Go, TypeScript, JavaScript, Java, C/C++, and others. FrontierCode was built to help developers answer the question “can models actually write good code?”, according to Cognition. They operationalise this in a few ways:
- Curated and built by 20 open-source developers: FrontierCode was built by developers to contain “realistic, diverse, and challenging coding tasks from the repos they maintain, spending more than 40 hours per task,” Cognition writes. “While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode is hand-selected by repo maintainers from multi-PR chains and freeform requests.”
- Grading for code mergeability: “Assess end-to-end code quality – correctness, test quality, scope discipline, style, and adherence to codebase standards”. This involves asking the following questions about the code: Does the patch successfully solve the problem? Does it break anything in the existing codebase? Does it pass the project’s build, lint, and style checks? Do the agent’s tests capture the desired behavior? Does the patch touch only what it needs to? Does the code conform to codebase conventions, follow design patterns, and remain readable? These questions are evaluated through a mixture of classical testing and LLMs used to tweak or review tests.
- Emphasising quality control (QC): “Built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review”.
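The mergeability questions above can be pictured as a checklist. The sketch below is purely illustrative: the class name, the field names, and the rule that every check must pass are assumptions, since the write-up does not say how Cognition combines the individual checks into a score.

```python
from dataclasses import dataclass, fields


@dataclass
class PatchReview:
    """Hypothetical record of the six mergeability questions for one patch."""
    solves_problem: bool
    breaks_nothing: bool
    passes_build_lint_style: bool
    tests_capture_behavior: bool
    minimal_scope: bool
    follows_conventions: bool


def failed_checks(review):
    """Names of the checks that did not pass, in declaration order."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def is_mergeable(review):
    """Assumed aggregation rule: a patch counts only if every check passes."""
    return not failed_checks(review)


# A patch that works but touches more files than it needs to.
review = PatchReview(
    solves_problem=True,
    breaks_nothing=True,
    passes_build_lint_style=True,
    tests_capture_behavior=True,
    minimal_scope=False,
    follows_conventions=True,
)
print(is_mergeable(review), failed_checks(review))
```

Under this all-or-nothing assumption, a patch that solves the problem can still fail on scope or style, which is one plausible reason scores on a quality-focused benchmark sit well below those on correctness-only benchmarks.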
Reassuringly difficult
- Diamond: 13.4% for Claude Opus 4.8, followed by 6.3% for GPT-5.5, and 5.2% for Claude Opus 4.7.
- Main (same ordering): 34.3%, 25.5%, 23%.
- Extended (same ordering): 51.8%, 44.8%, 43.2%.
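Because the tiers are nested (Diamond's 50 tasks sit inside Main's 100, which sit inside Extended's 150), the reported scores imply a score on the tasks that each larger tier adds. The sketch below works this out under a strong assumption that is not confirmed by the source: that each tier score is an unweighted mean of per-task scores. The function name and layout are hypothetical.

```python
# Tier sizes as reported: each tier contains the previous one.
TIER_SIZES = {"diamond": 50, "main": 100, "extended": 150}


def implied_outer_score(inner_pct, inner_n, outer_pct, outer_n):
    """Implied mean score (in %) on the tasks in the outer tier but not the inner one.

    Assumes both percentages are unweighted means over their tier's tasks.
    """
    extra_n = outer_n - inner_n
    if extra_n <= 0:
        raise ValueError("outer tier must be larger than inner tier")
    return (outer_pct * outer_n - inner_pct * inner_n) / extra_n


# Reported scores for Claude Opus 4.8.
diamond, main, extended = 13.4, 34.3, 51.8

main_only = implied_outer_score(
    diamond, TIER_SIZES["diamond"], main, TIER_SIZES["main"]
)
extended_only = implied_outer_score(
    main, TIER_SIZES["main"], extended, TIER_SIZES["extended"]
)
print(round(main_only, 1), round(extended_only, 1))
```

If the equal-weighting assumption holds, this kind of breakdown shows how much of a headline Extended score comes from the easier outer tasks compared with the Diamond core.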
Why this matters
Hard evals are one of the most valuable things for orienting us to the breakneck speed of AI progress. In recent years, evals have arrived and then become saturated at an ever faster rate. SWE-Bench was introduced in October 2023 and has probably recently aged out of usefulness due to saturation. How long might FrontierCode last? I predict we’ll see systems getting 70%+ on Diamond by June 2027 (note, shortly after writing this, the Claude Fable numbers got published at ~30%, so perhaps it’ll happen earlier than June 2027).
Read more: Introducing FrontierCode.
***
Xiaomi enters the speed race with a 1000 token/s model:
…Extremely fast inference unlocks novel capabilities…
Chinese tech company Xiaomi has published details on Xiaomi MiMo-V2.5-Pro-UltraSpeed, a standard behind-the-frontier 1 trillion parameter LLM whose selling point is its blistering speed of 1000 tokens per second. Xiaomi achieved this by codesigning the model with the software stack around it, including obvious things like FP4 quantisation, as well as using DFlash (a “speculative decoding method based on block-level masked parallel prediction”), and also working closely with TileRT, software from startup Tile




