
Llama 3.1 Nemotron Ultra 253B v1

NVIDIA

A 253-billion-parameter derivative of Meta Llama 3.1 405B Instruct, developed by NVIDIA using Neural Architecture Search (NAS) and vertical compression. It underwent multi-phase post-training (SFT for math, code, reasoning, chat, and tool calling; RL with GRPO) to improve reasoning and instruction following. The model is optimized to balance accuracy and efficiency on NVIDIA GPUs and supports a 128k-token context window.
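For orientation, here is a minimal, hedged inference sketch using the Hugging Face transformers API. The repository ID nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 and the system-prompt toggle for reasoning mode are assumptions that should be verified against NVIDIA's official model card; running the full 253B model also requires a multi-GPU node.

```python
# Minimal inference sketch, not an official example. Assumptions to verify
# against NVIDIA's model card: the Hugging Face repo ID below and the
# "detailed thinking on"/"off" system prompt used to toggle reasoning mode.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 253B parameters: requires a multi-GPU node
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "detailed thinking on"},  # assumed reasoning toggle
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```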

Key specifications

Parameters
253.0B
Context
-
Release date
April 7, 2025
Average score
79.2%

Timeline

Key dates in the model's history
Announced
April 7, 2025
Last updated
July 19, 2025
Today
August 31, 2025

Technical specifications

Parameters
253.0B
Training tokens
-
Knowledge cutoff
December 1, 2023
Family
-
Capabilities
Multimodality

Benchmark results

Model performance across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA
Pass@1, Reasoning (Self-reported). This metric measures whether the model reaches the correct answer on its first attempt while also producing the chain of reasoning that leads to it: both answer correctness and the soundness of the reasoning are scored, with no self-correction or retries allowed. Its main limitations are the subjectivity of judging reasoning quality, the possibility of penalizing alternative but equally valid solution paths, and undervaluing models that improve their answers iteratively. A minimal computation sketch follows this entry.
76.0%
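As a concrete illustration of the Pass@1 definition above, the sketch below computes the metric as the fraction of problems whose single first-attempt answer matches the reference. The `pass_at_1` helper and the exact-match comparison are illustrative assumptions; real benchmarks such as GPQA, AIME, and MATH-500 apply their own answer-extraction and grading rules.

```python
# Illustrative sketch of Pass@1: the share of problems whose single
# first-attempt answer matches the reference answer.
def pass_at_1(attempts: list[tuple[str, str]]) -> float:
    """attempts: (model_answer, reference_answer) pairs, one attempt per problem."""
    if not attempts:
        return 0.0
    correct = sum(answer.strip() == reference.strip() for answer, reference in attempts)
    return correct / len(attempts)

# Example: 3 of 4 first attempts correct -> 0.75
print(pass_at_1([("14", "14"), ("42", "42"), ("7", "9"), ("x = 3", "x = 3")]))
```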

Other benchmarks

Specialized benchmarks
AIME 2025
Pass@1, Reasoning (Self-reported)
72.5%
BFCL v2
Score, Reasoning (Self-reported)
74.1%
IFEval
Strict Accuracy, Reasoning (Self-reported)
89.5%
LiveCodeBench
Pass@1, Reasoning (Self-reported)
66.3%
MATH-500
Pass@1, Reasoning (Self-reported)
97.0%

License and metadata

License
llama_3_1_community_license
Announcement date
April 7, 2025
Last updated
July 19, 2025