Wednesday 22 March 2023

How good are AI “Answering Engines” really?

 Source: https://blog.kagi.com/kagi-ai-search#aitest

When implementing a feature of this nature, it is crucial to establish the level of accuracy that users can anticipate. This can be accomplished by constructing a test dataset of challenging and complex questions that would typically require human investigation but can be answered with certainty using the web. It is important to note that AI answering engines aim to streamline exactly this kind of work for the user. To that end, we have developed a dataset of ‘hard’ questions, drawing on the most challenging questions we could source from the Natural Questions dataset, Twitter, and Reddit.

The questions included in the dataset range in difficulty, starting from easy and becoming progressively more challenging. We plan to release the dataset with the next update of the test results in 6 months. Some of the questions can be answered “from memory,” but many require access to the web (we wanted a good mix). Here are a few sample questions from the dataset:

  • “Easy” questions like “Who is known as the father of Texas?” - 15 / 15 AI providers got this right (only four other questions in the dataset were answered correctly by all providers).
  • Trick questions like “During world cup 2022, Argentina lost to France by how many points?” - 8 / 15 AI providers were not fooled by this and got it right.
  • Hard questions like “What is the name of Joe Biden’s wife’s mother?” - 5 / 15 AI providers got this right.
  • Very hard questions like “Which of these compute the same thing: Fourier Transform on real functions, Fast Fourier Transform, Quantum Fourier Transform, Discrete Fourier Transform?” that only one provider got right (thanks to @noop_noob for suggesting this question on Twitter); see the quick check after this list.
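
For the curious: the Fast Fourier Transform is simply a fast algorithm for evaluating the Discrete Fourier Transform, so those two compute identical values, while the Fourier transform on real functions and the Quantum Fourier Transform are different objects. The NumPy snippet below is only an illustration of that equivalence and is not part of the test itself.

    # Quick check (illustration only, not part of the test dataset):
    # the Fast Fourier Transform is just a fast algorithm for the
    # Discrete Fourier Transform, so both compute identical values.
    import numpy as np

    x = np.random.rand(8)                       # arbitrary real-valued input
    n = len(x)
    k = np.arange(n)

    # Naive O(n^2) DFT, straight from the definition.
    dft = np.exp(-2j * np.pi * np.outer(k, k) / n) @ x

    fft = np.fft.fft(x)                         # NumPy's FFT implementation

    print(np.allclose(dft, fft))                # -> True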

In addition to testing Kagi AI’s capabilities, we also sought to assess the performance of every other “answering engine” available for our testing purposes. These included Bing, Neeva, You.com, Perplexity.ai, ChatGPT 3.5 and 4, Bard, Google Assistant (mobile app), Lexii.ai, Friday.page, Komo.ai, Phind.com, Poe.com, and Brave Search. It is worth noting that all providers except ChatGPT have access to the internet, which enhances their ability to provide accurate answers. Because Google’s Bard was not yet officially available when we started, we opted to test the Google Assistant mobile app, which was considered state-of-the-art in question answering on the web just a few months ago. Update 3/21: We now include Bard results.

To conduct the test, we asked each engine the same set of 56 questions and recorded whether the correct answer appeared in the response. The “Answered %” column is the number of questions answered correctly, expressed as a percentage (e.g., 75% means that 42 out of 56 questions were answered correctly).
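
As a rough sketch of the bookkeeping, the snippet below shows how the per-engine figures in the table were tallied; the helpers ask_engine and judge_correct are hypothetical placeholders, since in practice a human asked each engine and graded the responses by hand.

    # Minimal sketch of the scoring described above. `ask_engine` and
    # `judge_correct` are hypothetical placeholders; in practice a human
    # asked each engine every question and graded the responses by hand.

    def score_engine(engine, questions, ask_engine, judge_correct):
        """Return (correct_count, accuracy_percent) for one answering engine."""
        correct = sum(
            1 for q in questions if judge_correct(q, ask_engine(engine, q))
        )
        return correct, 100.0 * correct / len(questions)

    # Sanity check of one percentage from the table below:
    print(f"{100.0 * 42 / 56:.1f}%")            # -> 75.0%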

And now the results.

Answering engine                 Questions answered   Answered %
-------------------------------  ------------------   ----------
Human with a search engine [1]   56                       100.0%
-------------------------------  ------------------   ----------
Phind                            44                        78.6%
Kagi                             43                        76.8%
You                              42                        75.0%
Google Bard                      41                        73.2%
Bing Chat                        41                        73.2%
ChatGPT 4                        41                        73.2%
Perplexity                       40                        71.4%
Lexii                            38                        67.9%
Komo                             37                        66.1%
Poe (Sage)                       37                        66.1%
Friday.page                      37                        66.1%
ChatGPT 3.5                      36                        64.3%
Neeva                            31                        55.4%
Google Assistant (mobile app)    27                        48.2%
Brave Search                     19                        33.9%

AI answering engines’ accuracy on the “hard questions” dataset, March 21, 2023 (updated with Bard)

[1] The test was not timed, and this particular human wanted to make sure they were right.

Disclaimer: Take these results with a grain of salt. We saw a lot of diversity in the style of answers, and correct answers were often mixed with wrong context, which made objective scoring challenging. Still, the relative strengths should generally hold on any diverse set of questions.

Our findings revealed that the top-performing AI engines achieved an accuracy of approximately 75% on these questions, meaning users can rely on state-of-the-art AI to answer roughly three out of four of them. On the questions they missed, these engines either declined to answer or produced a convincing but inaccurate answer.

ChatGPT 4 showed improvement over ChatGPT 3.5 and came close to the best answering engines despite having no internet access. This suggests that access to the web provided only a marginal advantage to the others and that answering engines still have a lot of room to improve.

On the other hand, three providers (Neeva, Google Assistant, and Brave Search), all of which have internet access, performed worse than ChatGPT 3.5, which does not.

Additionally, it is noteworthy that the previous state-of-the-art AI, Google Assistant, was outperformed by almost every competitor, many of which are relatively small companies. This speaks to the remarkable democratization of the ability to answer questions on the web, enabled by the recent advancements in AI.

The main limitation of the top answering engines at this time seems to be the quality of the underlying ‘zero-shot’ search results available for the verbatim queries. When humans perform the same task, they search multiple times, adjusting the query as needed until they are satisfied with the answer. Such an approach has yet to be implemented in any of the tested answering engines. In addition, the search results returned could be optimized for use by answering engines, which is currently not the case.
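
A minimal sketch of what such an iterative loop might look like is shown below; the helpers search, generate_answer, is_confident, and refine_query are placeholders for illustration and do not correspond to any tested provider’s actual implementation.

    # Hypothetical sketch of an iterative "search, answer, refine" loop.
    # None of the helper functions correspond to a real provider's API;
    # they are placeholders used purely for illustration.

    def answer_with_refinement(question, search, generate_answer,
                               is_confident, refine_query, max_rounds=3):
        """Re-query the search engine until the generated answer looks solid."""
        query = question
        answer = None
        for _ in range(max_rounds):
            results = search(query)                      # zero-shot web search
            answer = generate_answer(question, results)  # answer grounded in results
            if is_confident(question, answer, results):  # self-check step
                break
            query = refine_query(question, results)      # reformulate and retry
        return answer                                    # best effort after max_rounds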

In general, we are cautiously optimistic about Kagi’s present abilities, but we also see a lot of opportunities to improve. We plan to update the test results and release the questions in 6 months to track the progress made by the field.
