Source: https://openai.com/research/gpt-4
A new and powerful tool: GPT-4
We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.
Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance—something we view as critical for safety.
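The claim about predicting training performance refers to extrapolating from much smaller runs, as described in the technical report. As a rough illustration of that idea only, the sketch below fits a power law with an irreducible-loss term to hypothetical (compute, loss) points from small runs and extrapolates to full scale; the functional form and every number here are assumptions, not OpenAI's actual fit or data.

```python
# Minimal sketch of loss extrapolation from small training runs.
# The form L(C) = a * C**(-b) + c and all data points are hypothetical
# illustrations, not OpenAI's actual scaling-law fit.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_compute(compute, a, b, irreducible):
    """Power law in compute with an irreducible-loss floor."""
    return a * compute ** (-b) + irreducible

# Hypothetical (normalized compute, final loss) pairs from small runs.
compute = np.array([1e-6, 1e-5, 1e-4, 1e-3, 1e-2])
loss = np.array([4.10, 3.35, 2.80, 2.42, 2.15])

params, _ = curve_fit(loss_vs_compute, compute, loss, p0=(1.0, 0.1, 1.5), maxfev=10000)
predicted = loss_vs_compute(1.0, *params)  # extrapolate to the full (normalized) run
print(f"predicted final loss at full compute: {predicted:.2f}")
```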
We are releasing GPT-4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we’re collaborating closely with a single partner to start. We’re also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.
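For text-only use, GPT-4 sits behind the same chat completions interface as earlier chat models. The following is a minimal sketch, assuming the `openai` Python package of this era (the `ChatCompletion` interface) and an `OPENAI_API_KEY` set in the environment; API access itself was gated by the waitlist mentioned above.

```python
# Minimal sketch of calling GPT-4's text capability, assuming the openai
# Python package of this era (ChatCompletion interface) and an API key in
# the OPENAI_API_KEY environment variable. Access required the waitlist.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what a multimodal model is in one sentence."},
    ],
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])
```

OpenAI Evals, by contrast, is a separate open-source repository for writing and running automated evaluations against models served through this same API.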
Capabilities
In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.
To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly-available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022–2023 editions of practice exams. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.
| Simulated exams | GPT-4 | GPT-4 (no vision) | GPT-3.5 |
| --- | --- | --- | --- |
| Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 | 298 / 400 | 213 / 400 |
| LSAT | 163 | 161 | 149 |
| SAT Evidence-Based Reading & Writing | 710 / 800 | 710 / 800 | 670 / 800 |
| SAT Math | 700 / 800 | 690 / 800 | 590 / 800 |
| Graduate Record Examination (GRE) Quantitative | 163 / 170 | 157 / 170 | 147 / 170 |
| Graduate Record Examination (GRE) Verbal | 169 / 170 | 165 / 170 | 154 / 170 |
| Graduate Record Examination (GRE) Writing | 4 / 6 | 4 / 6 | 4 / 6 |
| USABO Semifinal Exam 2020 | 87 / 150 | 87 / 150 | 43 / 150 |
| USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60 |
| Medical Knowledge Self-Assessment Program | 75% | 75% | 53% |
| Codeforces Rating | 392 | 392 | 260 |
| AP Art History | 5 | 5 | 5 |
| AP Biology | 5 | 5 | 4 |
| AP Calculus BC | 4 | 4 | 1 |
We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models which may include benchmark-specific crafting or additional training protocols:
| Benchmark | GPT-4 | GPT-3.5 | LM SOTA | SOTA |
| --- | --- | --- | --- | --- |
| MMLU | 86.4% | 70.0% | 70.7% | 75.2% |
| HellaSwag | 95.3% | 85.5% | 84.2% | 85.6% |
| AI2 Reasoning Challenge (ARC) | 96.3% | 85.2% | 85.2% | 86.5% |
| WinoGrande | 87.5% | 81.6% | 85.1% | 85.1% |
| HumanEval | 67.0% | 48.1% | 26.2% | 65.8% |
| DROP (f1 score) | 80.9 | 64.1 | 70.8 | 88.4 |
Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, we translated the MMLU benchmark—a suite of 14,000 multiple-choice problems spanning 57 subjects—into a variety of languages using Azure Translate (see Appendix). In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili:
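Concretely, the experiment amounts to machine-translating each multiple-choice item and scoring the model's answer against the original key. The sketch below is illustrative only: `translate` is a hypothetical stand-in for an Azure Translate call, and the prompt format and exact-match scoring are assumptions rather than the setup used in the technical report.

```python
# Illustrative sketch of scoring one translated MMLU item. `translate` is a
# hypothetical placeholder for a machine-translation call (e.g. Azure
# Translate); the prompt format and exact-match scoring are assumptions.
import openai

def translate(text: str, target_lang: str) -> str:
    """Placeholder: call a translation service and return the translated text."""
    raise NotImplementedError

def score_translated_item(question: str, choices: list[str], answer: str, lang: str) -> bool:
    q = translate(question, lang)
    opts = [translate(c, lang) for c in choices]
    prompt = q + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip("ABCD", opts)
    ) + "\nAnswer with a single letter."
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = resp["choices"][0]["message"]["content"].strip().upper()
    return reply.startswith(answer.upper())
```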
We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming. We are also using it to assist humans in evaluating AI outputs, starting the second phase in our alignment strategy.
Visual inputs
GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images. Over a range of domains—including documents with text and photographs, diagrams, or screenshots—GPT-4 exhibits similar capabilities as it does on text-only inputs. Furthermore, it can be augmented with test-time techniques that were developed for text-only language models, including few-shot and chain-of-thought prompting. Image inputs are still a research preview and not publicly available.
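Image input was a research preview when this was published, so no public request format existed yet. Purely for illustration, the sketch below shows what an interspersed text-and-image prompt can look like, using the content-array schema that the public chat completions API later adopted for vision models; it is an assumption about the interface, not the format used at the time.

```python
# Sketch of an interspersed text-and-image prompt. Image input was a research
# preview when this post was published; the content-array schema below is the
# one the public chat completions API later adopted for vision models, shown
# only to illustrate the shape of a mixed text-and-image input.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is unusual about this chart?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Answer step by step."},
        ],
    }
]
# Once image input is broadly available, the same chat completions call used
# for text-only prompts would accept a messages payload of this shape.
```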