Wednesday, 4 October 2023

List of academic search engines that use Large Language Models for generative answers and some factors to consider when using them

Source: https://musingsaboutlibrarianship.blogspot.com/2023/09/list-of-academic-search-engines-that.html


List of academic search engines that use Large Language Models for generative answers

This is a non-comprehensive list of academic search engines that use generative AI (almost always Large Language Models) to generate direct answers on top of a list of relevant results, typically using Retrieval Augmented Generation (RAG) techniques. We expect a lot more to appear!

This technique involves grounding the generated answer by using a retriever to find text chunks or sentences (also known as context) that may answer the question.
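To make the idea concrete, here is a minimal sketch of the retrieve-then-generate flow. Everything in it is an assumption for illustration: the word-overlap retriever, the prompt wording and the sample chunks are made up, and none of the tools listed here is known to work exactly this way. The call to an actual language model is left out; the sketch stops at the prompt that would be sent.

```python
import re
from typing import List


def words(text: str) -> set:
    """Lowercase a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever (an assumption): rank chunks by words shared with the question."""
    q_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Number the retrieved contexts so the model can cite them as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the numbered contexts below, "
        "and cite them like [1].\n\n"
        f"Contexts:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up text chunks standing in for sentences pulled from indexed papers.
chunks = [
    "Open access articles received more citations in this sample.",
    "The study examined peer review turnaround times.",
    "No citation advantage was found after controlling for journal.",
]
question = "Is there an open access citation advantage?"
contexts = retrieve(question, chunks)
prompt = build_prompt(question, contexts)
print(prompt)
```

Because the model is told to answer only from the numbered contexts, any citation it produces points back to a chunk that really exists in the index.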

Besides generating direct answers with citations, it seems to me that this new class of search engine often, but not always:

a) Uses semantic search (as opposed to lexical search)

b) Uses the ability of Large Language Models to extract information from papers, such as "method", "limitations", "region", and displays it in a literature review matrix format

For more, see my recording: The possible impact of AI on search and discovery (July 2023)

The table below was last updated on 28 Sept 2023.

Name	Sources	LLM used	Upload your own PDF?	Produces literature review matrix?	Other features
Elicit.com/old.elicit.org	Semantic Scholar	OpenAI GPT models & other open-source LLMs	Yes	Yes	List of concept search
Consensus	Semantic Scholar	GPT4 for summaries	No	No, has Consensus meter
scite.ai assistant	Open Scholarly metadata and citation statements from selected partners	"We use a variety of Language models depending on situation." GPT3.5 (generally), GPT4 (enterprise client), Claude instant (fallback)	No	No	Summaries include text from citation statements; many options to control what is being cited
scispace	Unknown	Unknown	Yes	Yes
Zeta alpha (R&D in AI)	Mostly Computer Science content	OpenAI GPT models	No	NA	Ability to turn semantic/neural search on/off; document visualization map showing semantic similarity, with autogenerated cluster labels
Core-GPT / technical paper (unreleased?)	CORE	GPT4	No	No
Scopus.ai (closed beta)	Scopus index	?	No	No	Graphical representation to see connections between keywords
Dimensions AI assistant (closed beta)	Dimensions index	Dimensions General Sci-Bert and OpenAI's ChatGPT	No	NA	Provides TLDR

Technical aspects to consider

What is the source used for the search engine?

A lot of these tools currently use Semantic Scholar, OpenAlex, arXiv, etc., which are essentially open scholarly metadata and open access full-text sources. Open scholarly metadata is quite comprehensive; however, using only open access full text may lead to unknown biases.

Scite.ai probably has the biggest advantage here, given that it also has some paywalled full text (technically citation statements only) from publisher partners.

That said, you cannot assume that just because the source includes full-text it is being used for extraction.

For example, Dimensions and Elicit, which do have access to full text, do not appear to be currently using it for direct answers. For technical or perhaps legal reasons, their direct answers are extracted only from abstracts. This is unlike Scite assistant, which does cite text beyond abstracts.

Elicit does seem to use the available (open access) full text to generate the literature review matrix.

Are there ways for users to check/verify accuracy of the generated direct answer, or extracted information in the literature review matrix?

RAG-type systems ensure that the citations made are always "real" citations found in their search index. However, there is no guarantee that the generated statement is supported by the citation.

In my view, a basic feature such systems should have is an easy way to check the accuracy of the generated answers.

When a sentence is followed by a citation, typically the whole paper isn't being cited. The system grounds its answer in a sentence or two from the paper. The best systems, like Elicit or scite assistant, make it easy to see which extracted sentences/contexts were used to support the answer. This can be done via mouseover (scite assistant) or with highlights (Elicit).

How accurate are the generated direct answers and/or extracted information in the literature review matrix in general?

Features that allow users to check and verify answers are great, but it is even better if the system can provide scores that give users a sense of how reliable the results generally are over a large number of examples.

One way to measure such citation accuracy is via citation precision and recall scores. However, such scores only measure whether the citation given supports the generated statement; they do not measure whether the generated statements actually answer the question!
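As a rough illustration, here is a sketch of how such scores could be computed from hand-labelled judgments. The definitions are a simplified assumption on my part: citation precision is the share of citations that support the statement they are attached to, and citation recall is the share of generated statements that have at least one supporting citation. The judgments below are made up.

```python
from typing import List


def citation_precision(judgments: List[List[bool]]) -> float:
    """Share of all citations that actually support their statement."""
    all_citations = [ok for cites in judgments for ok in cites]
    return sum(all_citations) / len(all_citations) if all_citations else 0.0


def citation_recall(judgments: List[List[bool]]) -> float:
    """Share of statements backed by at least one supporting citation."""
    if not judgments:
        return 0.0
    supported = sum(1 for cites in judgments if any(cites))
    return supported / len(judgments)


# One inner list per generated statement; each bool is a (hypothetical) human
# judgment of whether that citation supports the statement.
judgments = [
    [True, True, False],  # statement 1: three citations, two supportive
    [True],               # statement 2: one supportive citation
    [],                   # statement 3: no citation at all
    [False],              # statement 4: one citation that does not support it
]

precision = citation_precision(judgments)
recall = citation_recall(judgments)
print(f"citation precision = {precision:.2f}, citation recall = {recall:.2f}")
```

Note that neither number looks at the question itself, which is exactly the gap described above.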

A more complete solution is based on the ragas framework, which measures four aspects of the generated answer.

The first two relate to the generation part of the pipeline:

Faithfulness - measures how consistent the generated answer is with the retrieved contexts. This is done by checking whether the claims in the generated answer can be deduced from the context.
Answer Relevancy - measures whether the generated answer tries to address the question. This does not actually check whether the answer is factually correct (which is checked by faithfulness). There might be a tradeoff between the first two.

The other two relate to the retrieval part of the pipeline, i.e. they measure how good the retrieval is:

Context Precision - looks at whether the retriever consistently finds contexts that are relevant to the answer, so that most of the citations retrieved are relevant.
Context Recall - the converse of context precision: is the system able to retrieve most of the contexts that might answer the question?
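To illustrate the two retrieval measures, here is a simplified set-based sketch. It is an assumption for illustration only: it presumes we already know which contexts are relevant (the identifiers below are made up), whereas ragas itself estimates these scores with the help of a language model rather than with fixed labels.

```python
def context_precision(retrieved: set, relevant: set) -> float:
    """Share of retrieved contexts that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def context_recall(retrieved: set, relevant: set) -> float:
    """Share of relevant contexts that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical context identifiers for a single question.
retrieved = {"c1", "c2", "c3", "c4"}  # what the retriever returned
relevant = {"c1", "c2", "c5"}         # what could actually answer the question

p = context_precision(retrieved, relevant)  # 2 of the 4 retrieved are relevant
r = context_recall(retrieved, relevant)     # 2 of the 3 relevant were retrieved
print(f"context precision = {p:.2f}, context recall = {r:.2f}")
```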

The final score could be a harmonic mean of all four scores.
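A quick sketch of why a harmonic mean is a sensible choice here: unlike the ordinary average, it is pulled down sharply by a single weak component. The four score values are made up for illustration.

```python
from statistics import harmonic_mean, mean

# Hypothetical scores for one system; only context recall is weak.
scores = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.9,
    "context_precision": 0.9,
    "context_recall": 0.3,
}

values = list(scores.values())
overall = harmonic_mean(values)  # 4 / (1/0.9 + 1/0.9 + 1/0.9 + 1/0.3)
average = mean(values)           # plain arithmetic mean, for comparison

print(f"harmonic mean = {overall:.2f}, arithmetic mean = {average:.2f}")
```

With these numbers the harmonic mean is noticeably lower than the arithmetic mean, so a system cannot hide poor retrieval behind good generation scores.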

It would be good if systems could publish these stats so that users have a sense of their reliability, though at the time of writing none of the academic search systems has released such evaluations.

How generative AI features are integrated into the search and how this affects how you should search

We are still in the very early days of search + generative AI. It's unclear how such features will be integrated into search.

There are also dozens of ways to do RAG/generative AI + search, either at inference time or even at the pretraining stage.

How does the query get converted to match the retrieved contexts? Some examples:

It could just do a simple type of keyword matching
It could prompt the language model to come up with a search strategy, which is then used
It could convert the query into an embedding and match it against pre-indexed embeddings of documents/text
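The last option can be sketched as follows. This is a toy under stated assumptions: the `embed` function just counts words so that the example runs without any extra libraries, whereas real systems would use a learned dense embedding model, which is what makes the matching semantic rather than lexical. The documents are made up.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding (an assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents, embedded once ahead of time (the "pre-indexed" part).
documents = [
    "Citation advantage of open access articles",
    "Peer review turnaround times in chemistry journals",
    "Deep learning for protein folding",
]
index = [(doc, embed(doc)) for doc in documents]


def search(query: str, top_k: int = 1) -> List[str]:
    """Embed the query at search time and return the closest documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


results = search("open access citation advantage")
print(results)
```

The shape of the pipeline is the point here: documents are embedded once at indexing time, the query is embedded at search time, and ranking is done by vector similarity.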

How do you combine the retrieved contexts with the LLM (Large Language Model)?

How it is implemented can lead to different optimal ways of searching.

For example, say you are looking for papers on whether there is an open access citation advantage. Should you search like...

1. Keyword Style - Open Access citation advantage

2. Natural Language style - Is there an Open Access citation advantage?

3. Prompt engineering style - You are a top researcher in the subject of Scholarly communication. Write a 500 word essay on the evidence around Open Access citation advantage with references

Not all of these styles will work equally well (or at all) on every system, even among those based on RAG. For example, Elicit works for 1 and 2 but not 3, while scite assistant works for all three.

Other additional features

As shown in the table above, there are other nice features. The ability to upload PDFs for extraction, to make up for the limitations of the tool's index, is clearly highly desirable.

Scite assistant currently provides dozens of options to control how answers are generated, which is also an interesting direction. For example, you can specify that the citations must come from a certain topic, journal or even an individual set of papers you choose.

Other Non-technical factors

The usual non-technical factors when choosing which system to use apply, of course. These include user privacy (is the system training on your queries?), the sustainability of the system (what is its business model?), etc.

A (non-comprehensive) list of general web search engines that use LLMs to generate answers

Bing Chat
Perplexity.ai
You.com

Side note: Some systems are chatbots that decide to search only when necessary, as opposed to Elicit and Scispace, which are search engines that always search.

A (non-comprehensive) list of ChatGPT plugins that search academic papers - requires ChatGPT Plus (default is Bing Chat)

Note that a lot of them cover only arXiv or, at best, open access papers or metadata.
