Humanity’s Last Exam for Forecasting?
Cassi’s Performance on ForecastBench and Its Implications

Introduction
ForecastBench is a rolling benchmark that generates questions about the future, collects probabilistic forecasts from humans and models, then scores them as the underlying events resolve. It cannot be gamed. There is no teaching to the test (‘benchmaxxing’) when the questions are about the future. It is as honest a test of AI capabilities as exists.
Cassi has been the top- or second-placed AI at forecasting the future since our first results were released in January, trading places and tying with xAI for the lead. We are performing well above the average human forecaster, essentially at the level of a superforecaster. We are close to matching the weighted median superforecaster score – at which point we could defensibly claim to be superhuman at prediction. Here, we look a little more closely at the results and what they might mean.
Everything Is Prediction
Every board paper, box note, underwriting memo, investment thesis and policy paper contains a forecast. Sometimes it is explicit (a revenue number). Often it is vague (a risk described as “unlikely”). Other times it is smuggled in as an unstated assumption (an argument that a given policy will achieve a given outcome, with not a probability in sight). Either way, the organisation is making an explicit or implicit prediction: if it does ‘this’ rather than ‘that’, its revealed thinking is that the odds and/or returns are better for ‘this’ than for ‘that’. This is a fully generalisable feature of all organisations.
ForecastBench is one of the few places where bets are publicly audited. Within most organisations, they never are: the stated or implied predictions on which decisions rest are never tested, nor revisited, and the feedback loop is never closed.
This post does three things. First, it summarises Cassi’s current position on ForecastBench. Second, it explains why the “human benchmark” people now cite is a very high bar, and why that matters to senior decision makers. Third, it translates tournament performance into practical value for finance, insurance and government.
1. What Cassi has achieved on ForecastBench
On the Tournament leaderboard, Cassi’s model entry (ensemble_2_crowdadj) currently scores 0.102 on ForecastBench’s difficulty-adjusted Brier metric. Today (20 February 2026) that places it joint second overall, behind the superforecaster median (0.086), and tied for first among AI systems alongside xAI’s Grok 4.20 (Preview).
This is a living score. As of 20 February 2026, 955 dataset questions and 170 market questions have resolved and are in the calculation; around 3,000 dataset questions and 330 market questions are still pending. Dataset questions are asked at eight horizons, from 7 days out to 10 years, but so far only the shortest horizons (roughly up to one month) have come due. Some of today’s questions will not resolve until the mid-2030s. We should expect the leaderboard to change as further questions resolve and new models are entered, including our own.
Nevertheless, over the past month Cassi has had a defensible claim to be the best in the world, joint best, or second only to xAI at machine prediction, and very close to superhuman performance.
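For readers unfamiliar with the metric: a Brier score is simply the mean squared error between stated probabilities and 0/1 outcomes, so lower is better. ForecastBench applies a difficulty adjustment on top of this which is specific to the benchmark and not reproduced here; the sketch below, with hypothetical forecasts and outcomes, shows only the underlying unadjusted score.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes.
    0.0 is perfect; always answering 0.5 scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical example: three questions, forecast probabilities vs outcomes.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```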
How ForecastBench works, in plain English
ForecastBench is a live benchmark of forecasting accuracy for humans and AI systems. Every two weeks it generates new questions, then scores forecasts as those questions resolve.
Two question types:
• Dataset questions: automatically generated from real-world time series (ACLED, DBnomics, FRED, Yahoo! Finance, Wikipedia). Each dataset question is asked at eight horizons, from 7 days out to 10 years.
• Market questions: drawn from prediction platforms. Each market question has one resolution date.
What do the questions look like?
Dataset examples (generated by the ForecastBench team)
• Economic: Will securities held by US Federal Reserve Banks be higher on the resolution date than on the forecast date?
• Economic: Will the European Central Bank’s deposit facility rate be higher on the resolution date than on the forecast date?
• Climate: Will the daily average temperature at Rennes Saint-Jacques Airport be higher on the resolution date than on the forecast date?
• Conflict: Will protests in Sri Lanka in the 30 days before the resolution date exceed the average level over the year before the forecast date?
• Finance/Business: Will Pfizer’s closing share price be higher on the resolution date than on the forecast date?
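The dataset examples above all share the same mechanical form: compare a series’ value on the resolution date with its value on the forecast date. A minimal sketch of that resolution logic, using a hypothetical series rather than any of ForecastBench’s actual data pipelines:

```python
def resolves_yes(series, forecast_date, resolution_date):
    """Resolve YES if the value on the resolution date exceeds
    the value on the forecast date."""
    return series[resolution_date] > series[forecast_date]

# Hypothetical closing prices keyed by date, for illustration only.
series = {"2026-02-20": 101.3, "2026-03-20": 102.9}
print(resolves_yes(series, "2026-02-20", "2026-03-20"))  # True
```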
Market examples (from prediction platforms)
• Economy: Will gold close at $3,200 or more at the end of 2025?
• Sport: Will a legal sub-two-hour marathon be run before 31 December 2025?
• Technology: Will AI have a trillion-dollar-plus impact by the end of 2025?
• Health: Will the number of deaths from antibiotic-resistant infections per year double by 31 December 2025?
2. The goalposts have moved: this is now ‘Humanity’s Last Exam’ for Forecasting
Normally, we would be talking about when AIs might outperform most humans at forecasting. But the truth is, they already do. The usual goalpost-shifting in AI debates means we are not talking about when models exceed most humans, nor even when they exceed the best humans at forecasting – superforecasters – but when they exceed the weighted superforecaster median. Most organisations would, on the evidence, be better off using our forecasts today than relying on those they currently generate. Only an organisation that specifically cultivated, recruited or consulted superforecasters, scored their predictions and fed the results back, and had systematic practices for aggregating their forecasts could claim otherwise.
ForecastBench’s headline human reference point is the median forecast of superforecasters: a group selected for consistently above-average performance, with the median then taken across their predictions. This is useful as a target, but it is not representative of most forecasts made by most people in most organisations most of the time. It is the best humanity can produce. It is humanity’s last exam for forecasting.
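A sketch of what that aggregation looks like. The probabilities below are hypothetical, and this shows a simple unweighted median; the ‘weighted median’ referenced above would additionally weight forecasters, which we do not attempt here.

```python
from statistics import median

# question id -> probabilities submitted by individual superforecasters
forecasts = {
    "q1": [0.82, 0.90, 0.78, 0.85],
    "q2": [0.10, 0.25, 0.15, 0.12],
}

# The human reference point is the median probability per question.
median_forecast = {q: median(ps) for q, ps in forecasts.items()}
print(median_forecast)  # {'q1': 0.835, 'q2': 0.135}
```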
Nor are the ‘community’ forecasts, which LLMs have surpassed, representative of most people in most organisations. Those who participate in forecasting tournaments are a self-selecting, unrepresentative sample in the first place. They also get feedback on their forecasts, which we know improves performance. Even in fields like intelligence, where we used to preach ‘no insight without foresight’, very few professionals get any feedback on their forecasting performance. If LLMs are outperforming self-selected forecasters who get regular feedback, they are surely also surpassing professionals whose forecasts never receive such feedback.
But in one sense, no competition can capture the main advantage of AI-based forecasting: given rough parity in skill, AI forecasting dominates all human forecasters in breadth, depth and volume. AI can generate, quickly and accurately, more or less as many forecasts as you need or want. It takes longer to read the rationale for a Cassi forecast than it does to generate one in the first place.
In October 2025, the Forecasting Research Institute noted that linear projections suggested AI would match or exceed human forecasters in November 2026. The Metaculus median forecast suggests LLMs will exceed human forecasters in mid-June 2027. The way we support and make decisions is set to change dramatically. The rewards for those who see this first will be significant.
3. Beyond tournaments: what this means for business
How much is being better able to predict the future worth to your organisation?
If your organisation could improve its probabilistic accuracy by even a modest margin, what would that mean for capital allocation, procurement, operational effectiveness, pricing, hiring, inventory, policy success? A small edge in calibration compounds.
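What does a ‘small edge’ compounding look like? A toy simulation under stated assumptions: two decision makers face the same stream of opportunities, estimate the probability of success with different amounts of noise, and act only when the odds look favourable. Every parameter here is hypothetical, chosen only to illustrate the shape of the effect.

```python
import random

def simulate(noise, n=10_000, seed=0):
    """Total payoff from n go/no-go decisions: act when the estimated
    probability of success exceeds 0.5; win +1 or lose -1 when acting."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        p = rng.uniform(0.2, 0.8)                            # true probability
        estimate = min(max(p + rng.gauss(0, noise), 0), 1)   # noisy judgement
        if estimate > 0.5:
            total += 1 if rng.random() < p else -1
    return total

print(simulate(noise=0.15))  # noisier estimates: bad bets taken, good ones missed
print(simulate(noise=0.05))  # sharper estimates: typically a much higher total
```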
We founded Cassi because we think that eventually all organisations will adopt such methods: those that move later will be forced to by the success of those that moved earlier, if they survive long enough.
Senior decision makers who can see that ‘everything is prediction’ should also see the opportunity, and the risk of non-adoption. Most do not have a superforecaster bench. They have busy experts, stretched teams and risk committees that meet monthly at best. At some point, someone will ask why they didn’t adopt, or at least experiment with, more effective methods.
Beginning now secures an immediate advantage; it also prepares an organisation for the world of consistently superhuman prediction that is coming.
If you work in defence, finance or insurance, you are likely aware of how central prediction is to your decision-making. It is how you price risk, allocate capital and protect the country, your fellows and the balance sheet. A small improvement in probabilistic accuracy can compound into fewer military blunders or mispriced policies, better hedges, and earlier warnings on emerging exposures. But decisions in all industries and all areas of life are if/then predictions too. The disruption, risk and opportunity will be widespread.
Three practical applications stand out.
Financial, corporate and regulatory risk
On the one hand, more accurate forecasts allow more profitable opportunities to be found and exploited; on the other, they enable much more efficient risk management and mitigation. No more red-amber-green ratings based on crude heuristics, but a rich set of calibrated forecasts, constantly and automatically updated.
Strategy, tactics and decisions
All strategic and tactical decisions are, in essence, gambles: planning and decision-making under conditions of uncertainty, or ‘thinking in bets’. More accurate forecasting refines those odds, so commanders have a better sense of the risks and rewards they are running and of whether a given option is coherent under the assumptions being held. Better forecasting reduces the chances of being surprised and increases the likelihood of surprising an adversary; surprise is often said to be among the most tactically decisive factors in warfare.
Resource allocation
Resource allocation is forecasting in disguise. Headcount plans, insurance purchases and capital allocations all assume a future and imply beliefs about where risks and rewards lie. More accurate forecasting helps organisations get these trade-offs right and allocate resources where they will do most to generate those rewards or mitigate those risks. A forecasting system is most valuable when it is wired into thresholds that trigger action, as sketched below.
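A minimal sketch of that wiring. The thresholds and actions are hypothetical placeholders; the point is only that a calibrated probability can map directly onto a pre-agreed response.

```python
# (probability floor, action) pairs, checked from most to least severe.
THRESHOLDS = [
    (0.70, "escalate to risk committee"),
    (0.40, "increase monitoring"),
]

def action_for(probability):
    """Return the first action whose threshold the forecast crosses."""
    for floor, action in THRESHOLDS:
        if probability >= floor:
            return action
    return "no action"

print(action_for(0.85))  # escalate to risk committee
print(action_for(0.55))  # increase monitoring
```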
Everything is Prediction.
https://cassi-ai.com


