Nous Research has introduced NousCoder-14B, a competitive olympiad programming model post-trained from Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from August 1, 2024 to May 1, 2025, the model reaches a Pass@1 accuracy of 67.87 percent. This is 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark. The research team trained the model on 24,000 verifiable coding problems for four days using 48 B200 GPUs and released the weights on Hugging Face under the Apache 2.0 license.


Benchmark focus and what Pass@1 means
LiveCodeBench v6 is designed for competitive programming evaluation. The test split used here contains 454 problems. The training set follows the same recipe as the DeepCoder-14B project from Agentica and Together AI: it combines problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench problems created before July 31, 2024.
The benchmark includes only competitive-programming-style tasks. For each problem, a solution must respect strict time and memory limits and pass a large set of hidden input/output tests. Pass@1 is the percentage of problems for which the first generated program passes all tests, including the time and memory constraints.
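As a minimal sketch of the metric, assume each problem's first sample has already been judged against its hidden tests. The function name and the `results` list below are illustrative, not benchmark output.

```python
def pass_at_1(first_attempt_passed):
    """Percentage of problems whose first generated program passed every test."""
    if not first_attempt_passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical verdicts for four problems: three solved on the first attempt.
results = [True, False, True, True]
score = pass_at_1(results)
```

A program that produces correct output but breaks a time or memory limit would count as `False` here, matching the definition above.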


Dataset construction for execution-based RL
All datasets used for training consist of verifiable code generation problems. Each problem has a reference implementation and many test cases. The training set contains 24,000 problems drawn from:
- TACO Verified
- PrimeIntellect SYNTHETIC-1
- LiveCodeBench problems created before July 31, 2024
The test set is LiveCodeBench v6 and contains 454 problems from August 1, 2024 to May 1, 2025.
Every problem is a complete competitive programming task with a problem statement, input format, output format, and test cases. This setup matters for RL because it yields a binary reward signal that is cheap to compute once the code has run.
RL environment using Atropos and Modal
The RL environment is built using the Atropos framework. NousCoder-14B is prompted with the standard LiveCodeBench prompt format and generates Python code for each problem. Each rollout receives a scalar reward that depends on the test case results:
- Reward 1 if the generated code passes all test cases for that problem.
- Reward -1 if the code outputs a wrong answer, exceeds the 15 second time limit, or exceeds the 4 GB memory limit in any test case.
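The reward rule above can be sketched as a small function. The `CaseResult` record and its field names are hypothetical, and reading 4 GB as 4 * 1024**3 bytes is an assumption. Only the +1/-1 scheme and the two limits come from the article.

```python
from dataclasses import dataclass

TIME_LIMIT_S = 15.0                 # per-test time limit stated in the article
MEMORY_LIMIT_BYTES = 4 * 1024**3    # 4 GB limit; exact byte count is an assumption


@dataclass
class CaseResult:
    """Outcome of running one test case (hypothetical record layout)."""
    correct: bool
    seconds: float
    peak_memory_bytes: int


def rollout_reward(results):
    """Return +1.0 only if every test case passes within both limits, else -1.0."""
    if not results:
        raise ValueError("a rollout needs at least one test case result")
    for r in results:
        if (not r.correct
                or r.seconds > TIME_LIMIT_S
                or r.peak_memory_bytes > MEMORY_LIMIT_BYTES):
            return -1.0
    return 1.0
```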
To run untrusted code securely and at scale, the team uses Modal as an autoscaled sandbox. In the main design, which the research team describes as the setting actually used, the system launches one Modal container per rollout. Each container runs all test cases for its rollout. This keeps training compute and verification compute separate and keeps the RL loop stable.
The research team also pipelines inference and verification. When an inference worker finishes a generation, it sends the completion to a Modal verifier and immediately starts a new generation. With many inference workers and a fixed pool of Modal containers, this design keeps the training loop bound by inference compute rather than by verification.
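The hand-off pattern can be illustrated with a small `asyncio` sketch. This is not the team's Atropos or Modal code. `generate` and `verify` are hypothetical stand-ins that sleep instead of calling a model or a sandbox, and the worker count is arbitrary.

```python
import asyncio


async def generate(prompt):
    await asyncio.sleep(0.01)      # stand-in for model inference
    return f"solution for {prompt}"


async def verify(completion):
    await asyncio.sleep(0.03)      # stand-in for sandboxed test execution
    return 1.0                     # pretend every rollout passes


async def inference_worker(prompts, pending):
    while True:
        try:
            prompt = prompts.get_nowait()
        except asyncio.QueueEmpty:
            return
        completion = await generate(prompt)
        # Hand the completion off without awaiting the verdict, so the worker
        # can start its next generation while verification runs.
        pending.append(asyncio.create_task(verify(completion)))


async def run_pipeline(prompt_list, num_workers=2):
    prompts = asyncio.Queue()
    for p in prompt_list:
        prompts.put_nowait(p)
    pending = []
    workers = [inference_worker(prompts, pending) for _ in range(num_workers)]
    await asyncio.gather(*workers)
    return await asyncio.gather(*pending)   # collect rewards once all are verified


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["p1", "p2", "p3"])))
```

Because verification tasks are scheduled rather than awaited inside the worker, slow verification does not stall generation. That is the property the article attributes to the pipelined design.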
The team discusses three verification parallelization strategies: one container per problem, one container per rollout, and one container per test case. They ultimately avoid the per-test-case setting because of container startup overhead. Instead, each container evaluates many test cases and runs a small set of the hardest test cases first. If any of these fail, the system can stop verification early.
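A hedged sketch of the hardest-first, early-exit idea follows. The article does not say how difficulty is estimated, so the `difficulty` score and the `run_test` callback here are hypothetical.

```python
def verify_rollout(run_test, test_cases, difficulty):
    """Run test cases hardest-first and stop at the first failure.

    run_test(case) -> bool and difficulty(case) -> number are caller-supplied
    (hypothetical) callbacks. Returns (all_passed, number_of_tests_executed).
    """
    ordered = sorted(test_cases, key=difficulty, reverse=True)
    executed = 0
    for case in ordered:
        executed += 1
        if not run_test(case):
            return False, executed   # early exit: remaining tests are skipped
    return True, executed


# Illustrative data: the hardest case fails, so only one test is executed.
cases = [
    {"id": 1, "difficulty": 1, "ok": True},
    {"id": 2, "difficulty": 9, "ok": False},
    {"id": 3, "difficulty": 5, "ok": True},
]
passed, ran = verify_rollout(lambda c: c["ok"], cases, lambda c: c["difficulty"])
```

Since the reward is binary, a single failure already fixes the reward at -1, so skipping the remaining tests saves sandbox time without changing the reward.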
GRPO objectives: DAPO, GSPO, and GSPO+
NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value model. On top of GRPO, the research team tests three objectives: Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.
All three objectives share the same definition of advantage. The advantage of each rollout is its reward normalized by the mean and standard deviation of the rewards within its group. DAPO applies importance weighting and clipping at the token level and introduces three main changes relative to GRPO:
- A clip-higher rule that increases exploration of low-probability tokens
- A token-level policy gradient loss that gives equal weight to each token
- Dynamic sampling, which drops groups that are all correct or all incorrect because their advantages are all zero
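The group-normalized advantage, the dynamic sampling filter, and an asymmetric clip can be sketched in a few lines. The epsilon term, the use of the population standard deviation, and the clip bounds of 0.2 and 0.28 are assumptions for illustration. The article does not state the values used for NousCoder-14B.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)        # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling: drop groups whose rewards are all identical,
    since every advantage in such a group is zero."""
    return len(set(rewards)) > 1


def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate with a larger upper bound (clip-higher).
    The default bounds are illustrative assumptions."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


mixed = [1.0, -1.0, 1.0, -1.0]      # hypothetical group: two passes, two failures
advs = group_advantages(mixed)      # roughly [+1, -1, +1, -1]
```

With binary rewards of +1 and -1, a group where every rollout passes (or every rollout fails) has zero standard deviation and contributes no gradient, which is why such groups are filtered out.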
GSPO moves the importance weighting to the sequence level. It defines a sequence importance ratio that aggregates the token ratios over the entire program. GSPO+ keeps the sequence-level correction but rescales the gradient so that tokens are weighted equally regardless of sequence length.
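One way to form a sequence-level ratio is the length-normalized geometric mean of the per-token ratios, as in the GSPO formulation. Whether NousCoder-14B's implementation matches this exactly is an assumption, and the GSPO+ rescaling is not reproduced here because the article gives no formula for it.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """Geometric mean of per-token probability ratios for one sequence.

    Inputs are per-token log-probabilities of the same sampled tokens under
    the current policy and the policy that generated the rollout.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    log_ratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Hypothetical two-token sequence where the new policy is more confident.
ratio = sequence_importance_ratio([-1.0, -1.0], [-1.5, -1.5])
```

Working in log space and dividing by the length keeps the ratio in a comparable range for short and long programs, which matters when completions can run to tens of thousands of tokens.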
On LiveCodeBench v6, the differences between these objectives are small. At a context length of 81,920 tokens, DAPO reaches 67.87 percent Pass@1, while GSPO and GSPO+ reach 66.26 percent and 66.52 percent. At 40,960 tokens, all three objectives cluster around 63 percent Pass@1.
Iterative context extension and overlong filtering
Qwen3-14B supports long contexts, and training follows an iterative context extension schedule. The team first trains the model with a 32k context window, then continues training at the maximum Qwen3-14B context window of 40k. At each stage, they select the checkpoint with the best LiveCodeBench score at 40k context and then use YaRN context extension at evaluation time to reach 80k tokens (81,920 tokens).
A key trick is overlong filtering. If a generated program exceeds the maximum context window, its advantage is set to zero. This removes the rollout from the gradient signal rather than penalizing it. The researchers report that this avoids pushing the model toward shorter solutions for purely optimization-related reasons and helps maintain quality when the context length is scaled up at test time.
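The masking step can be sketched as follows. The function name and the example numbers are illustrative. Only the rule itself (zero advantage for rollouts that exceed the context window) comes from the article.

```python
def apply_overlong_filter(advantages, lengths, max_context_tokens):
    """Zero the advantage of any rollout longer than the context window,
    so it contributes no gradient instead of being penalized."""
    if len(advantages) != len(lengths):
        raise ValueError("need one length per advantage")
    return [
        0.0 if n > max_context_tokens else a
        for a, n in zip(advantages, lengths)
    ]


# Hypothetical group of three rollouts at the 40,960-token training stage.
filtered = apply_overlong_filter([0.5, -1.0, 1.2], [1000, 50000, 30000], 40960)
```

Here the second rollout overflowed the window, so its negative advantage is dropped rather than used to push the policy toward shorter outputs.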
Key takeaways
- NousCoder-14B, a Qwen3-14B-based competitive programming model trained with execution-based RL, reaches 67.87 percent Pass@1 on LiveCodeBench v6, 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark.
- The model was trained on 24,000 verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench problems created before July 31, 2024, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from August 1, 2024 to May 1, 2025.
- The RL setup uses Atropos and runs generated Python solutions in sandboxed containers. It assigns a simple reward of 1 for passing all test cases and -1 for any failure or resource limit violation, and it pipelines inference and verification so that they run asynchronously.
- The team compared three Group Relative Policy Optimization objectives for long-context code RL: DAPO, GSPO, and GSPO+. All three operate on group-normalized rewards and perform similarly, with DAPO reaching the best Pass@1 at the longest context of 81,920 tokens.
- Training uses iterative context extension, first at 32k tokens and then at 40k tokens, with YaRN-based extension to 81,920 tokens at evaluation time. It also applies overlong rollout filtering for stability, and the project ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.
