Tuesday, 14 March 2023

A new and powerful tool: GPT-4

 Source: https://openai.com/research/gpt-4



GPT-4



March 14, 2023

More resources


  • View system card

  • Try on ChatGPT Plus

  • Join API waitlist

  • Rewatch developer demo livestream

  • Contribute to OpenAI Evals

We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance—something we view as critical for safety.
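The post does not spell out how that prediction was made. A common approach, sketched below purely as an illustration rather than OpenAI's actual methodology, is to fit a power law with an irreducible term to the final loss of much smaller training runs and extrapolate it to the target compute budget. All numbers in the snippet are made up.

```python
# Minimal sketch of loss extrapolation from small runs, NOT OpenAI's actual
# methodology: fit L(C) = a * C**(-b) + c to the final loss of small training
# runs and extrapolate to the large run's compute budget.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    return a * compute ** (-b) + c

# Hypothetical (made-up) data: compute in PF-days, observed final loss.
compute = np.array([1e0, 1e1, 1e2, 1e3, 1e4])
loss = np.array([3.9, 3.1, 2.6, 2.25, 2.0])

params, _ = curve_fit(power_law, compute, loss, p0=[2.0, 0.1, 1.5], maxfev=10000)

target_compute = 1e6  # compute budget of the big run (illustrative number)
print(f"predicted final loss at {target_compute:.0e} PF-days: "
      f"{power_law(target_compute, *params):.3f}")
```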

We are releasing GPT-4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we’re collaborating closely with a single partner to start. We’re also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.
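The snippet below is a minimal, illustrative stand-in for the kind of check Evals automates: grade two models on the same samples and compare accuracy. It is not the Evals framework's own API; it assumes the pre-1.0 `openai` Python package (the `openai.ChatCompletion` interface available at the time), an `OPENAI_API_KEY` in the environment, and invented sample prompts.

```python
# Illustrative accuracy check in the spirit of OpenAI Evals -- not the Evals
# framework's actual API. Assumes the pre-1.0 `openai` package.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical eval samples: prompt plus the single accepted answer.
SAMPLES = [
    {"input": "What is 17 * 24? Answer with the number only.", "ideal": "408"},
    {"input": "Name the capital of Latvia. Answer with one word.", "ideal": "Riga"},
]

def run_eval(model: str) -> float:
    """Return exact-match accuracy of `model` on SAMPLES."""
    correct = 0
    for sample in SAMPLES:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": sample["input"]}],
            temperature=0,
        )
        answer = resp["choices"][0]["message"]["content"].strip()
        correct += int(answer == sample["ideal"])
    return correct / len(SAMPLES)

for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, run_eval(model))
```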

Capabilities

In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.

To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022–2023 editions of practice exams. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.

Simulated exams | GPT-4 (estimated percentile) | GPT-4, no vision (estimated percentile) | GPT-3.5 (estimated percentile)
Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th)
LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th)
SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th)
SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th) | 590 / 800 (~70th)
Graduate Record Examination (GRE) Quantitative | 163 / 170 (~80th) | 157 / 170 (~62nd) | 147 / 170 (~25th)
Graduate Record Examination (GRE) Verbal | 169 / 170 (~99th) | 165 / 170 (~96th) | 154 / 170 (~63rd)
Graduate Record Examination (GRE) Writing | 4 / 6 (~54th) | 4 / 6 (~54th) | 4 / 6 (~54th)
USABO Semifinal Exam 2020 | 87 / 150 (99th–100th) | 87 / 150 (99th–100th) | 43 / 150 (31st–33rd)
USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60
Medical Knowledge Self-Assessment Program | 75% | 75% | 53%
Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th)
AP Art History | 5 (86th–100th) | 5 (86th–100th) | 5 (86th–100th)
AP Biology | 5 (85th–100th) | 5 (85th–100th) | 4 (62nd–85th)
AP Calculus BC | 4 (43rd–59th) | 4 (43rd–59th) | 1 (0th–7th)

We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models which may include benchmark-specific crafting or additional training protocols:

Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM, evaluated few-shot) | SOTA (best external model, includes benchmark-specific training)
MMLU: multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% | 75.2%
HellaSwag: commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% | 85.6%
AI2 Reasoning Challenge (ARC): grade-school multiple-choice science questions, challenge set | 96.3% (25-shot) | 85.2% (25-shot) | 84.2% | 85.6%
WinoGrande: commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 84.2% | 85.6%
HumanEval: Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% | 65.8%
DROP (F1 score): reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 | 88.4
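The "few-shot" settings in the table refer to how many solved examples are placed in the prompt before the item being graded. As a rough illustration of that protocol (not the exact harness behind the numbers above), a k-shot prompt can be assembled like this; the demonstration questions are invented:

```python
# Illustrative k-shot prompt construction, roughly how settings like "5-shot"
# in the table are produced; not the exact evaluation harness used here.
from typing import List, Dict

def build_few_shot_prompt(demos: List[Dict[str, str]], question: str) -> str:
    """Concatenate k solved examples, then the unsolved question."""
    parts = []
    for demo in demos:
        parts.append(f"Q: {demo['question']}\nA: {demo['answer']}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Hypothetical demonstrations (a 2-shot example for brevity).
demos = [
    {"question": "Which gas do plants absorb for photosynthesis?", "answer": "Carbon dioxide"},
    {"question": "What is the powerhouse of the cell?", "answer": "The mitochondrion"},
]
print(build_few_shot_prompt(demos, "Which planet is known as the Red Planet?"))
```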

Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, we translated the MMLU benchmark—a suite of 14,000 multiple-choice problems spanning 57 subjects—into a variety of languages using Azure Translate (see Appendix). In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili.
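As a rough sketch of that translation step (not the actual pipeline used for these results), the Azure Translator Text REST API (v3.0) can batch-translate benchmark items; the key, region, and sample question below are placeholders, and the details should be checked against Azure's documentation.

```python
# Minimal sketch of translating benchmark questions with the Azure Translator
# Text REST API (v3.0). Not the pipeline behind the numbers above; the key and
# region values are placeholders read from the environment.
import os
import requests

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
HEADERS = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_TRANSLATOR_KEY"],
    "Ocp-Apim-Subscription-Region": os.environ.get("AZURE_TRANSLATOR_REGION", "westus2"),
    "Content-Type": "application/json",
}

def translate(texts, target_lang="lv"):
    """Translate a batch of strings into `target_lang` (e.g. 'lv' for Latvian)."""
    params = {"api-version": "3.0", "to": target_lang}
    body = [{"Text": t} for t in texts]
    resp = requests.post(ENDPOINT, params=params, headers=HEADERS, json=body)
    resp.raise_for_status()
    return [item["translations"][0]["text"] for item in resp.json()]

question = "Which of the following is a prime number? (A) 21 (B) 33 (C) 37 (D) 49"
print(translate([question], target_lang="cy"))  # 'cy' = Welsh
```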

We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming. We are also using it to assist humans in evaluating AI outputs, starting the second phase in our alignment strategy.

Visual inputs

GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images. Over a range of domains—including documents with text and photographs, diagrams, or screenshots—GPT-4 exhibits similar capabilities as it does on text-only inputs. Furthermore, it can be augmented with test-time techniques that were developed for text-only language models, including few-shot and chain-of-thought prompting. Image inputs are still a research preview and not publicly available.
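Because image input was still a research preview, the request format was not public at the time; the snippet below is purely an illustration of how an interspersed text-and-image prompt with a chain-of-thought instruction might be represented as a data structure, not the GPT-4 image API.

```python
# Purely illustrative: one way to represent an interspersed text-and-image
# prompt with a chain-of-thought instruction. NOT the actual GPT-4 image API;
# the URL below is a placeholder.
prompt_parts = [
    {"type": "text",  "text": "Here is a chart from a quarterly report:"},
    {"type": "image", "source": "https://example.com/revenue_chart.png"},
    {"type": "text",  "text": "What is the approximate year-over-year growth? "
                              "Think step by step before giving the final number."},
]

# A text-only harness would flatten this to a single string; a multimodal
# harness would pass the image bytes alongside the text segments.
flattened = "\n".join(p["text"] for p in prompt_parts if p["type"] == "text")
print(flattened)
```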
