Sunday, 22 August 2021

The Invisible Citation Commons

Source: https://commonplace.knowledgefutures.org/pub/w88y7brs/release/3

Introduction

Scholarly knowledge relies on citations. Discovering and acknowledging prior work is fundamental to knowing what has been done before, synthesizing the state of the field, and identifying spaces for new research.

Despite being so crucial, citations, the pieces of metadata that serve as references to works, are often ignored in discussions of types of open knowledge. Historically, citations and their cross-references have been laboriously collected (by hand, then by computational techniques) in bibliographic indexes. These indexes, once published in serial print volumes, are today generally offered as web-based, paywalled subscription products by scholarly entities or for-profit companies. Indexing and abstracting services are big business: Web of Science, an academic journal and proceedings index that has existed since the 1960s, can cost subscribing libraries hundreds of thousands of dollars a year. Web of Science is currently owned by Clarivate, which in 2020 had revenues of 1.2 billion USD [1] from its portfolio of analytics and intellectual property management tools that monetize the research process.

Indexes like Web of Science collect and annotate references with subject information, mine published works for their citations, and provide tools to help researchers discover and analyze those citations. Because these indexes are subscription products, access to them, much like access to subscription-based journals, is typically limited to affiliates of subscribing libraries. The introduction of Google Scholar in 2004 changed how a generation of researchers work by providing easy and unpaywalled access to an interdisciplinary database of citations derived from the web. However, Google Scholar isn’t transparent about its processes, doesn’t provide openly licensed or downloadable data, and includes citations that often suffer from missing information and poor disambiguation.

In recent years, there has been a push to openly license citation metadata to better enable large-scale analyses and discoverability of scholarly work. The “Initiative for Open Citations” (I4OC) [2], launched in 2017, has led the way in helping publishers share citations to their works under a public domain CC0 license. As of early 2021, over a billion citations from one scholarly article to another are collected in public domain databases, a major shift from just a few years earlier [3]. These open databases provide the backbone for new discovery tools, and are used by academics training artificial intelligence tools. Open corpora like the Microsoft Academic Graph are themselves widely cited [4]. However, Microsoft Academic Graph will be shuttered in 2021. Whatever their importance, citation projects like these depend on continued funding and support from their hosts, and longevity is not always guaranteed.

Wikidata and WikiCite

Wikidata is a freely licensed and editable online database of linked data, with 94 million items as of June 2021 [5]. Like its sister project Wikipedia, it has a vibrant multilingual volunteer community that develops and maintains it, and is supported by the non-profit Wikimedia Foundation. Wikidata also includes bibliographic metadata: as of June 2021, nearly 40 million items on Wikidata represented publications, accounting for 43% of all items [6]. These are a combination of semi-automated uploads of citations from other open databases, items about notable publications that have their own Wikipedia articles, and items added manually by editors. Wikidata is also attractive for libraries, archives, and cultural institutions that want to make their metadata more openly available and reusable, and there are several ongoing projects to incorporate Wikidata into library and archival cataloging processes and connect Wikidata to new open knowledgebases [7].

Wikidata items can also be created about the authors, institutions, publishers, journals, and ideas related to citations, which creates a rich network of queryable information. Wikidata items about publications can include identifiers from a vast number of other catalogs and indexes, such as national library catalogs and authority files. For example, the Wikidata item about On the Origin of Species links to the Wikidata items for Darwin and the concept of “natural selection,” but also includes 22 other national and international library catalog identifiers, as well as linking to the 73 Wikipedia articles in various language editions that exist about the work (and, because the book is in the public domain, the full text on Wikisource in six languages). Wikidata serves as a hub, including identifiers for the same work from, say, Project Gutenberg, the National Library of France, and the French Wikipedia, providing a way to map connections and coverage among these diverse entities.
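The hub idea can be illustrated with a small in-memory sketch. This is not how Wikidata stores data; it only shows the principle that one item holding several external identifiers lets you translate between catalogs and measure their coverage. The catalog names and every identifier below are made up for illustration.

```python
from collections import Counter

# Hypothetical hub records: each item maps a catalog name to that catalog's
# identifier for the same work. All names and identifiers are invented.
ITEMS = {
    "item-1": {"gutenberg": "G-100", "bnf": "B-200", "frwiki": "Titre A"},
    "item-2": {"bnf": "B-201"},
    "item-3": {"gutenberg": "G-102", "frwiki": "Titre C"},
}


def crosswalk(items, source, source_id, target):
    """Translate an identifier from one catalog to another via the hub item.

    Returns None when no item carries the source identifier, or when the
    matching item has no identifier in the target catalog.
    """
    for external_ids in items.values():
        if external_ids.get(source) == source_id:
            return external_ids.get(target)
    return None


def coverage(items):
    """Count how many hub items carry an identifier from each catalog."""
    return Counter(catalog for ids in items.values() for catalog in ids)
```

Calling `crosswalk(ITEMS, "gutenberg", "G-100", "bnf")` looks up the item that carries the first identifier and returns its counterpart in the second catalog, while `coverage(ITEMS)` shows which catalogs are missing works that the others hold.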

WikiCite is a collective name for the volunteer community and projects focused on improving the representation of open bibliographic metadata on Wikidata and the other Wikimedia projects. WikiCite provides a home for participants from a broad range of geographies and professions (librarians, developers, GLAM practitioners, data modelers, ethnographers, and Wikidatians) who are interested in improving the citation practices and infrastructure for free knowledge. From 2016 to 2021, the WikiCite project was funded by several grants to the Wikimedia Foundation, most recently from the Alfred P. Sloan Foundation, which supported a series of four community conferences and funding for innovative technical and outreach projects [8].

This focus on bibliographic metadata in Wikidata has led to a rich ecosystem of tools developed by volunteers to assist in uploading, editing, and analyzing these records. One such tool is Scholia [9], which creates visual scholarly profiles based on Wikidata records. Viewing a heavily cited author, such as Jennifer Doudna, winner of the 2020 Nobel Prize in Chemistry, shows the power that can come with visualizing citations to scholarly works. The Wikidata item for Doudna gives us biographical information such as awards received. Viewing Doudna’s author record in Scholia, however, provides a list of associated publications by year, a map and word cloud of topics, an interactive diagram of co-authors, and a list of citing authors, all based on citations in Wikidata. Scholia and related tools provide a possible open alternative to the expensive and proprietary scholarly metrics tools that are currently sold by major companies like Elsevier and Clarivate.
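To give a sense of what "based on citations in Wikidata" can mean in practice, here is a minimal sketch of a SPARQL query that lists an author's works ranked by how many other Wikidata items cite them. It is not Scholia's actual query. The function name and template are illustrative assumptions; the Wikidata properties used are P50 (author) and P2860 (cites work). The function only builds the query text, which could then be pasted into the Wikidata Query Service.

```python
import re

# Query text for the Wikidata Query Service, where the wd:, wdt:, wikibase:
# and bd: prefixes are predefined. Doubled braces are literal braces.
QUERY_TEMPLATE = """\
SELECT ?work ?workLabel (COUNT(?citing) AS ?citations) WHERE {{
  ?work wdt:P50 wd:{author_qid} .            # P50: author
  OPTIONAL {{ ?citing wdt:P2860 ?work . }}   # P2860: cites work
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?work ?workLabel
ORDER BY DESC(?citations)
LIMIT {limit}
"""


def build_author_citation_query(author_qid, limit=20):
    """Return SPARQL text ranking one author's works by incoming citations.

    `author_qid` is the Wikidata item ID of the author (for example "Q42");
    this helper is a hypothetical illustration, not part of any library.
    """
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return QUERY_TEMPLATE.format(author_qid=author_qid, limit=limit)
```

For example, `build_author_citation_query("Q42", limit=5)` produces a query for the item Q42, used here only as a placeholder ID. One caveat: many publication items record authors as plain name strings (P2093) rather than links to author items, so a query like this sees only the works whose authors have been disambiguated, which is one reason profiles built on this data are incomplete.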

Unsolved challenges and future directions for open citations

Unlike other open citation databases, Wikidata, like Wikipedia, relies on a dedicated and highly skilled global group of volunteer maintainers and editors. Though the Wikimedia Foundation provides a stable platform for Wikidata, with a long-term commitment to preservation and availability, stewarding this collection of data means continuing to develop and support the editor community and making it possible for new editors and entities to contribute. The openly editable model of Wikidata differs from traditional library catalogs or indexes, where editing is restricted to a small group of staff who also ensure quality and accuracy. In Wikidata, users of the data can also contribute both small fixes and large updates, but doing so requires learning complex new workflows and navigating Wikidata’s culture.

There are also technical challenges in representing citation metadata in Wikidata. Wikidata contains only a fraction of all open citations, which are themselves only part of all possible bibliographic metadata, so tools like Scholia draw on incomplete data. However, drastically expanding the number of items about publications (such as by importing the entire open citation corpus, which would double the current size of Wikidata) raises issues of scalability, both in terms of technical infrastructure and human curation ability. An open question in the WikiCite community is whether items about publications should remain in Wikidata, or become a separate interlinked knowledgebase that could be connected to the other Wikimedia projects. Starting a new initiative like this is a complex decision with both technical and social implications [10].

A related problem is how to make citations easily reusable within the Wikimedia projects and beyond. Citations form the backbone of Wikipedia articles. In an open collaborative environment where authorship is largely pseudonymous, Wikipedia articles rely on outside references for every factual claim. However, it is not yet possible to, for instance, easily add a reference to an article in Wikipedia and see how that reference is also used in other articles and language editions, trace a citation to a retracted article, or see whether the usage of a particular citation can be characterized [11]. The infrastructure provided by Wikidata, or by a new interlinked project focused only on bibliographic metadata, could make this possible. There are at least 29 million citations in the English Wikipedia alone [12]. Storing the citations that Wikipedia articles across 300 languages rely on as structured data would make them available for analysis and querying, which could help identify content gaps, fight misinformation, and lead to a much deeper understanding of “how we know what we know” on Wikipedia.
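As a rough illustration of what "citations as structured data" would enable, the sketch below keeps an index from a source identifier to every article and language edition that cites it, and uses that index to find articles relying on a retracted source. It is a toy model under stated assumptions, not an existing Wikimedia system: the class, its methods, the DOIs, and the article titles are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """One place a source is cited: a language edition and an article title."""
    language: str
    article: str


class CitationIndex:
    """Toy index from a source identifier to the articles that cite it."""

    def __init__(self):
        self._usages = defaultdict(set)
        self._retracted = set()

    def add_usage(self, source_id, language, article):
        self._usages[source_id].add(Usage(language, article))

    def mark_retracted(self, source_id):
        self._retracted.add(source_id)

    def usages(self, source_id):
        """All recorded usages of one source, sorted by language and title."""
        found = self._usages.get(source_id, ())
        return sorted(found, key=lambda u: (u.language, u.article))

    def articles_citing_retracted(self):
        """Map each retracted source that is cited to the articles citing it."""
        return {
            source_id: self.usages(source_id)
            for source_id in sorted(self._retracted)
            if source_id in self._usages
        }


# Made-up sample data: placeholder DOIs and article titles.
index = CitationIndex()
index.add_usage("doi:10.0000/example.1", "en", "Example article")
index.add_usage("doi:10.0000/example.1", "fr", "Article d'exemple")
index.add_usage("doi:10.0000/example.2", "en", "Example article")
index.mark_retracted("doi:10.0000/example.1")
```

With this structure, `index.usages("doi:10.0000/example.1")` answers "where else is this reference used?" across language editions, and `index.articles_citing_retracted()` lists the articles that would need review. Keying on a shared identifier is the design choice that matters here: references stored as free text inside each article cannot be joined this way.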
