Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

TII Launches Falcon Reasoning: Best 7B AI Model Globally, Also Outperforms Larger Models

Introducing Falcon H1R 7B

We’re excited to unveil Falcon H1R 7B, a decoder-only large language model developed by the Technology Innovation Institute (TII) in Abu Dhabi. Building upon the robust foundation of the Falcon-H1 Base model, Falcon H1R 7B takes a major leap forward in reasoning capabilities.

Despite its modest 7‑billion‑parameter size, Falcon H1R 7B matches or outperforms state‑of‑the‑art reasoning models that are 2–7× larger, and it does so consistently across a wide range of reasoning‑intensive benchmarks, demonstrating exceptional parameter efficiency.

Its performance stems from a carefully curated training set and a two‑stage pipeline of efficient supervised fine‑tuning followed by RL scaling.

Falcon H1R 7B’s design rests on three key axes of reasoning efficiency: speed, token efficiency, and accuracy, which together set the “3‑D limits” of performance. By integrating Deep Think with Confidence (DeepConf) during test‑time scaling, the model achieves state‑of‑the‑art efficiency, delivering substantial accuracy gains while generating fewer tokens than competing models.


Training recipe

Falcon H1R 7B’s training regimen is a two‑stage, data‑driven pipeline designed to maximize reasoning quality.

  • Cold‑start supervised fine‑tuning (SFT): Starting from the Falcon‑H1‑7B backbone, we train on curated datasets that contain step‑by‑step, long‑form reasoning traces across three domains: mathematics, coding, and science. We additionally include non‑reasoning domains such as chat, tool‑calling, and safety. Difficulty‑aware filtering is applied during SFT to prioritize challenging examples, and training targets extremely long responses (up to 48k tokens).
  • Reinforcement learning with GRPO: The SFT checkpoint is further refined with the GRPO algorithm. Rewards are given for correct reasoning chains, encouraging the model to generate high‑quality, diverse outputs while staying within the token budget; the RL stage balances exploration and exploitation under this constraint (a sketch of the group‑relative advantage computation behind GRPO is shown after this list).
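
As a rough, self‑contained illustration of the idea (not the actual Falcon training code), the sketch below assumes a simple binary correctness reward and shows the group‑relative advantage that GRPO uses in place of a learned value baseline; all names and numbers are illustrative.

```python
import numpy as np

def correctness_reward(answer: str, reference: str) -> float:
    """Toy reward: 1.0 if the final answer matches the reference, else 0.0.
    (The real reward would also account for format and token-budget limits.)"""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO replaces a learned value baseline with a group-relative one:
    each sampled completion's advantage is its reward standardized against
    the other completions drawn for the same prompt."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled reasoning traces for one prompt, 2 of which are correct.
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # correct traces receive positive advantage
```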

Model’s Capabilities

The bar plot below compares Falcon H1R 7B’s performance on math, code & agentic, and general benchmarks against the leading 7B‑to‑47B models.

  • Math: Falcon H1R 7B leads (73.96 %) by a wide margin, beating the next best (Apriel 1.5 15B at 69.32 %) and outpacing all larger baselines such as Qwen3‑32B (63.66 %) and Nemotron H 47B (49.72 %).
  • Code & Agentic: Falcon H1R 7B has the highest score in this group (33.95 %), ahead of Qwen3‑32B (33.40 %) and Apriel 1.5 (31.60 %).
  • General: Falcon H1R 7B remains highly competitive (49.48 %), sitting just below Apriel 1.5 (53.10 %) and Phi 4 Reasoning Plus 14B (51.18 %).

Math Benchmarks

Falcon H1R 7B delivers top‑tier math results across a spectrum of difficulty levels, all while staying at only 7B parameters.

Benchmark | Falcon H1R 7B | Next best
AIME‑24 | 88.1 % | Apriel 1.5 15B – 86.2 %
AIME‑25 | 83.1 % | Apriel 1.5 15B – 80.0 %
HMMT‑25 | 64.9 % | Apriel 1.5 15B – 61.0 %
AMO-Bench | 36.3 % | DeepSeek R1‑0528 Qwen3‑8B – 23.3 %

Code & agentic Benchmarks

Falcon H1R 7B delivers solid reasoning across a spectrum of code and agentic challenges.

Benchmark | Falcon H1R 7B | Relative standing
LCB v6 | 68.6 % | Highest of all models – outperforms even the 32B Qwen3 by ~7 pp
SciCode (sub-problem) | 28.3 % | Best among <8B models
TB Hard | 4.96 % | Second best (Apriel 1.5 15B at 9.9 %) and beats the 8B/32B Qwen3 models

General Benchmarks

Falcon H1R 7B proves its versatility across a broad set of general‑purpose tasks, consistently matching or surpassing larger competitors while staying at only 7B parameters.

Benchmark | Falcon H1R 7B | Relative standing
GPQA‑D | 61.3 % | On par with other 8B models (Qwen3‑8B 61.2 %, DeepSeek 61.4 %)
MMLU‑Pro | 72.1 % | Outperforms all 8B rivals and comes close to the 14/32B cohort
HLE | 11.1 % | Slightly behind Apriel 1.5 15B and beats every other 8B/32B variant
IFBench | 53.4 % | Second best after Apriel (55.8 %) and outpaces all 8B models; demonstrates robust instruction‑following at a compact scale

Inference

Here we benchmark Falcon H1R 7B’s token throughput per GPU against Qwen3 8B under realistic test‑time scaling workloads.

Falcon H1R 7B outperforms Qwen3 8B across the board, especially as batch size grows. In the typical test‑time scaling case (512 → 32k), Falcon reaches roughly 1,000 tokens/s/GPU at batch 32 and ≈ 1,500 at batch 64, nearly double Qwen3’s rates. The advantage widens further for longer inputs (8k → 16k), where Falcon again delivers ≈ 1,800 tokens/s/GPU while Qwen3 stays below 900. The hybrid Transformer–Mamba backbone is the key to this superior scaling and memory efficiency.
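
For readers who want to reproduce this kind of measurement, a minimal throughput probe might look like the sketch below. It assumes vLLM as the serving engine (with a build that supports the hybrid Falcon‑H1 architecture); the model id, prompt, batch size, and generation length are placeholders rather than the exact benchmark setup.

```python
import time
from vllm import LLM, SamplingParams  # one possible serving engine

# Placeholder model id and workload shape; swap in the actual checkpoint.
llm = LLM(model="tiiuae/Falcon-H1-7B-Instruct")
params = SamplingParams(temperature=0.6, max_tokens=32_768)

def throughput(prompts: list[str]) -> float:
    """Return generated tokens per second for one batch on this GPU."""
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed

batch = ["Solve: what is the sum of the first 100 positive integers?"] * 32
print(f"{throughput(batch):,.0f} tokens/s")
```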

Test-time scaling

Test‑time scaling (TTS) boosts a model’s reasoning by running many parallel solution chains and aggregating the best answer, unlocking latent capability without extra training. In Falcon H1R 7B we employ Deep Think with Confidence (DeepConf), a lightweight, confidence‑aware filtering method that dynamically discards low‑quality reasoning traces during or after generation. DeepConf leverages the model’s own next‑token confidence scores to identify and prune noisy traces, requiring no additional training or hyper‑parameter tuning.
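
As a rough, self‑contained illustration of the idea (not TII’s implementation), the sketch below scores each sampled trace by its mean token log‑probability, keeps only the most confident fraction, and majority‑votes over the survivors; the confidence measure and the keep fraction are simplifying assumptions for the example.

```python
from collections import Counter

def trace_confidence(token_logprobs: list[float]) -> float:
    """Confidence proxy: mean next-token log-probability over the trace.
    DeepConf-style filtering derives such scores from the model's own logits."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def deepconf_vote(traces: list[dict], keep_fraction: float = 0.5) -> str:
    """Offline variant of the idea: keep the most confident fraction of traces,
    then take a majority vote over the surviving final answers.
    Each trace is {"answer": str, "token_logprobs": [float, ...]}."""
    ranked = sorted(traces, key=lambda t: trace_confidence(t["token_logprobs"]),
                    reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    votes = Counter(t["answer"] for t in kept)
    return votes.most_common(1)[0][0]

# Example: three sampled traces; the low-confidence outlier is pruned.
traces = [
    {"answer": "42", "token_logprobs": [-0.1, -0.2, -0.1]},
    {"answer": "42", "token_logprobs": [-0.3, -0.2, -0.4]},
    {"answer": "7",  "token_logprobs": [-2.5, -3.1, -2.8]},
]
print(deepconf_vote(traces))  # -> 42
```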

Falcon H1R 7B thrives at high batch sizes and is token‑efficient, generating fewer tokens per inference for a given accuracy level, making TTS especially effective and positioning the model on a new Pareto frontier of performance vs. inference compute.
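
To make “Pareto frontier of performance vs. inference compute” concrete, the small helper below keeps only the (tokens, accuracy) points that no other point beats on both axes; the models and numbers in the example are made up for illustration and are not the reported results.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Each point is (model, tokens_generated, accuracy). A model is on the
    frontier if no other model reaches higher accuracy with fewer tokens."""
    frontier = []
    for name, tokens, acc in points:
        dominated = any(t <= tokens and a >= acc and (t < tokens or a > acc)
                        for n, t, a in points if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (tokens in millions, accuracy in %) — illustrative only.
points = [("model_a", 300, 90.0), ("model_b", 95, 96.0), ("model_c", 150, 92.0)]
print(pareto_frontier(points))  # -> ['model_b']
```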

The grid below shows how many tokens were generated for a given accuracy. Falcon H1R 7B sits on the Pareto frontier of low cost, high performance:

  • AIME 24 / 25 – 96.7 % accuracy with <100 M tokens, beating every other 8B model and matching the best 14/32B systems.
  • AMO-Bench – 35.9 % accuracy with just 217 M tokens, surpassing every other model.

Falcon H1R 7B demonstrates that a 7 billion‑parameter model can rival larger peers in reasoning tasks while delivering efficient inference, making it an attractive choice for developers and researchers alike.

To learn more, please refer to the original article.
