GPT OSS 20B

Multimodal

OpenAI

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers results comparable to OpenAI o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory, which makes it well suited to on-device use, local inference, or rapid iteration without costly infrastructure. Both models also perform strongly on tool use, few-shot function calling, CoT reasoning (as seen in results on the Tau-Bench agentic evaluation suite), and HealthBench (even outperforming proprietary models such as OpenAI o1 and GPT-4o).
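As a rough sanity check on the 16 GB figure, the memory taken by the weights alone can be estimated from the parameter count and the number of bits stored per parameter. The sketch below is illustrative only: the bit widths are assumptions rather than figures from this page, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 20.0e9  # parameter count as listed on this page

# Bit widths below are illustrative assumptions, not values stated by the source.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.0f} GB")
```

Under these assumptions, only a representation of roughly 4 bits per parameter leaves headroom within 16 GB.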

Key specifications

Parameters

20.0B

Context

131.0K

Release date

August 5, 2025

Average score

37.8%
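The average score is consistent with an unweighted mean of the nine self-reported benchmark values listed in the benchmark results section of this page. The check below is a sketch under that assumption; the page itself does not state how the average is computed, and the dictionary keys are shorthand labels chosen for this example.

```python
from statistics import mean

# Self-reported scores (percent) copied from the benchmark section of this page.
scores = {
    "MMLU (no tools)": 85.3,
    "GPQA Diamond (no tools)": 71.5,
    "Codeforces Elo (with tools)": 25.2,
    "Codeforces Elo (no tools)": 22.3,
    "Humanity's Last Exam (with tools)": 17.3,
    "Humanity's Last Exam (no tools)": 10.9,
    "HealthBench": 42.5,
    "HealthBench Hard": 10.8,
    "TAU-bench Retail": 54.8,
}

average = mean(scores.values())
print(f"{average:.1f}%")  # prints 37.8%
```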


Timeline

Key dates in the model's history

Announcement / Last update

August 5, 2025

Today

August 31, 2025

Technical specifications

Parameters

20.0B

Training tokens

Knowledge cutoff

Family

Capabilities

Multimodality, ZeroEval

Pricing and availability

Input (per 1M tokens)

$0.10

Output (per 1M tokens)

$0.50

Max input tokens

131.0K

Max output tokens

30.0K

Supported features

Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
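The listed prices and token limits can be combined into a simple per-request cost estimate. This is a minimal sketch: the function name is hypothetical, and the limits are taken literally from the rounded figures above (131.0K and 30.0K), which may differ from the exact limits a provider enforces.

```python
INPUT_PRICE_PER_M = 0.10    # USD per 1M input tokens, as listed above
OUTPUT_PRICE_PER_M = 0.50   # USD per 1M output tokens, as listed above
MAX_INPUT_TOKENS = 131_000  # assumption: "131.0K" read literally
MAX_OUTPUT_TOKENS = 30_000  # assumption: "30.0K" read literally


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    if input_tokens > MAX_INPUT_TOKENS:
        raise ValueError("input exceeds the listed maximum")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the listed maximum")
    cost = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return cost / 1_000_000


# Example: a 100,000-token prompt with a 10,000-token completion.
print(f"${request_cost_usd(100_000, 10_000):.4f}")
```

At these rates an output token costs five times as much as an input token, so long completions dominate the bill sooner than long prompts do.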

Benchmark results

Model performance on various tests and benchmarks

General knowledge

Tests of general knowledge and understanding

MMLU benchmark

No tools • Self-reported

85.3%

Reasoning

Logical reasoning and analysis

GPQA

Diamond (no tools) • Self-reported

71.5%

Other tests

Specialized benchmarks

Codeforces Competition code

Elo (with tools) • Self-reported

AI: An LLM capable of reasoning and of using appropriate tools to solve tasks. It is tested in competitive scenarios using the Elo rating system. The evaluation runs a large set of matchups between models on problems of varying difficulty across datasets.

To quantify overall model performance, Elo ratings are used. These are typically used to rank two-player games such as chess and Go, where each player (here, each model) has an Elo score. The difference between two Elo scores predicts the chance that one model beats the other on a given problem.

In traditional Elo, when model A and model B go head-to-head, a model wins if it gets the correct answer while the other gets an incorrect answer. If both models get the same outcome (both correct or both incorrect), the result is a tie.

Elo scores are computed with a logistic model in which the probability of model A (with rating r_A) winning against model B (with rating r_B) is:

P(A beats B) = 1 / (1 + 10^((r_B - r_A) / 400))

After each matchup, the Elo ratings are updated based on the expected and actual outcomes. The magnitude of the update is controlled by a K-factor, set to 4 based on tuning to the data.

25.2%
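The Elo description above can be sketched in a few lines. This is a minimal illustration of the standard Elo formula and update rule with K = 4 as stated in the description. The function names and the convention of scoring a tie as 0.5 are assumptions for this example, not details confirmed by the source.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B: 1 / (1 + 10^((r_B - r_A) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_ratings(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """Return updated (r_A, r_B) after one matchup.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie
    (the tie value is the usual Elo convention, assumed here).
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Two equally rated models; A answers correctly and B does not.
print(update_ratings(1500.0, 1500.0, 1.0))  # (1502.0, 1498.0)
```

With a K-factor this small, a single matchup moves a rating by at most 4 points, so rankings only shift after many matchups.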

Codeforces Competition code

Elo (no tools) • Self-reported

22.3%

Humanity's Last Exam

Accuracy (with tools) • Self-reported

AI: When answering advanced mathematics questions, whether as an assistant or in reasoning mode, I use the appropriate tools available to me. These include Python for computation and for checking reasoning (especially on probability problems), Sage for symbolic computation, and Wolfram Alpha for verifying answers or getting computational help. I approach a problem step by step: first understanding it and breaking it into components, then applying the relevant mathematical tools, and finally verifying my solution with alternative methods where possible.

17.3%

Humanity's Last Exam

Accuracy (no tools) • Self-reported

10.9%

HealthBench - Realistic health conversations

Score • Self-reported

42.5%

HealthBench Hard - Challenging health conversations

Score • Self-reported

AI: A model that can solve a problem will typically reach the correct result. But what about models that cannot fully solve a given problem, or that made a mistake along the way? A simple binary metric that only checks whether the final answer is right gives little insight into how the model is reasoning or where it goes wrong. The Score metric aims to address this limitation by evaluating not just whether the final answer is correct, but how well the model reasoned throughout its solution attempt. This gives a more nuanced view of model performance and helps identify specific weaknesses in reasoning.

10.8%

TAU-bench Retail benchmark

Function calling • Self-reported

AI: Function calling is a key method for integrating language models with external tools and services. In function calling, a language model parses a request, determines that a specific external function should be invoked, and formats the necessary parameters for that function in a structured format (typically JSON).

The implementation varies across models and platforms. For example, OpenAI's API allows developers to define function schemas that the model can reference, while open-source implementations like LangChain provide frameworks to handle the execution of identified functions.

When a model employs function calling, it typically:

1. Recognizes when a request requires external computation or data
2. Selects the appropriate function based on the need
3. Structures the required arguments correctly
4. Generates proper syntax for the function call

This capability turns language models from pure text generators into systems that can trigger specific actions in software applications, query databases, or interact with external APIs. The model does not execute the function itself; it identifies when a function should be called and prepares the call.

Function calling is particularly valuable for:

- Retrieving real-time information
- Performing calculations
- Executing database operations
- Interfacing with external services
- Controlling application features

Correctly deciding when a function call is needed (rather than handling the request directly) and formatting the required parameters is a form of reasoning that bridges natural language understanding and programmatic execution.

54.8%
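The flow described above (the model emits a structured call, and the application validates and executes it) can be sketched without any particular SDK. Everything named here is hypothetical: the `get_order_status` tool, its schema, and the JSON shape of `model_output` are illustrative assumptions, not the format used by TAU-bench or by any specific API.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical tool; a real one would query an order database."""
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names to a callable and a JSON-Schema-style description.
TOOLS = {
    "get_order_status": {
        "function": get_order_status,
        "schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
}


def dispatch(model_output: str) -> dict:
    """Parse a model-produced call, check required arguments, run the tool."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]  # raises KeyError for an unknown tool
    args = call["arguments"]
    missing = [key for key in tool["schema"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return tool["function"](**args)


# Stand-in for what a model might emit; the model itself never runs the function.
model_output = '{"name": "get_order_status", "arguments": {"order_id": "A123"}}'
print(dispatch(model_output))
```

Keeping validation in the application rather than trusting the model's output means a malformed or incomplete call fails before any external action is taken.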

License and metadata

License

Apache 2.0

Announcement date

August 5, 2025

Last update

August 5, 2025

Similar models

o4-mini

GPT-4o

GPT-4o mini

GPT-4.1

GPT-5 nano

GPT-4

GPT-4o

o3