Saturday, 12 January 2019

Altmetrics for all

Source: https://medium.com/thunken/altmetrics-for-all-4b9039249850

Jul 3, 2018

Cobaltmetrics was first released in January 2018, just in time for PIDapalooza in Girona. Five months and 78 million documents later, it is time to reflect on where we stand in the altmetrics movement and what we want to push for.

Are altmetrics providers alt- enough?

Altmetrics, for the uninitiated, were designed to complement traditional, journal-to-journal citation metrics and provide the scientific community with better proxies for scientific impact.

When Jason Priem coined the term in 2010, diversity was part of the message.

Diversity of measures has been and remains the main goal of the altmetrics movement. But what about the underlying data? What do you gain by changing the statistic if the sample remains biased?

Our work on altmetrics stemmed from a simple observation: existing altmetrics providers are not alt- enough. While they process significant amounts of data, they operate on a very specific subset of the global scientific production:

Target languages: only about five percent of the world population speaks English as a first language, yet altmetrics providers tend to ignore content in languages other than English, or support only a handful of languages. For example, Altmetric only monitors Wikipedia in English, Finnish, and Swedish, and PlumX Metrics recently took its first step on the path to multilingualism by adding Wikipedia in Portuguese and Spanish. Advances in natural language processing, however, make massively multilingual text mining more efficient than ever and, whenever identifiers or URLs are used, extracting citations becomes mostly language-independent. In any case, algorithmic complexity cannot be used as an excuse for a lack of linguistic diversity in scientometrics. Anglo-centrism is prejudicial to science (see this note by Vladimir Lazarev & Serhii Nazarovets for another recent example) and we must fight it.
Target documents: a recent study by Martin Klein et al. has found that preprints are largely indistinguishable from the versions that appear in academic journals (you can even compare the preprint and the final version of this study, so meta), yet existing altmetrics providers treat non-peer-reviewed documents as second-class citizens. Altmetric, for example, can only merge citations for a publication and the corresponding preprint if the preprint was not assigned a DOI. The value of preprints is now recognized, and that case will soon be closed. But, moving forward, will we need to have the same conversation about every new type of document? It is not up to altmetrics providers to decide what is citable. What about patents, trademarks, clinical trials, or law articles? What about non-textual digital objects like datasets, software, videos, etc.?

Credit where credit is due: altmetrics providers like Altmetric, ImpactStory, and Plum Analytics are great projects. They have profoundly changed the way all stakeholders in science and research policy think of scientific impact, and they have paved the way for new efforts. But diversity is good, and we think we can do even better by joining forces on different challenges.

How is Cobaltmetrics different?

Projects like Altmetric, ImpactStory, and Plum Analytics focus on the -metrics side of altmetrics: they provide scores, rankings, and badges. Cobaltmetrics, on the other hand, focuses on the alt- part of altmetrics: by gathering data about alt-citations using alt-identifiers in alt-documents written in alt-languages, we aim to reduce selection effects and address the lack of diversity in scientometrics.

Our mission is to provide data. We provide citation data that is clean, stable, and reproducible, and can thus be used not only to compute impact metrics, but also to build knowledge graphs, train statistical models, test recommendation systems, and do a lot of other things. Therefore, Cobaltmetrics is more similar to Crossref Event Data than to any other altmetrics project. To quote Joe Wass’ thoughts on Crossref Event Data:

I should make clear that [we] are not in the business of making metrics. The services that we were compared to, Altmetric.com and Plum Analytics, collect the same kind of data, but their ultimate aim is metrics. Our purpose begins and ends with collecting this underlying data so that anyone can analyze it. It could be used to make metrics, but could be used for a lot more other purposes besides.

With Cobaltmetrics, we strive to go deeper than other altmetrics providers:

We collect citations and backlinks from documents in all languages, and we have validated the methodology by crawling data from Wikipedia in more than 180 languages.
We collect citations and backlinks to all types of documents and digital objects, including but not limited to scientific publications, books, patents, trademarks, clinical trials, financial statements, security vulnerabilities, social media posts, software, videos, etc.
We collect citations and backlinks to both canonical and non-canonical URIs. End-users cannot be expected to know whether a given identifier is persistent, or whether a given URI is canonical. Citations can also be hidden behind shortened URLs, and different databases will use different identifiers for the same document. We want our users to be able to copy any URI into Cobaltmetrics and defer to us for the heavy lifting.

We currently index 78 million documents and 55 million citations and backlinks extracted from data made available by Hypothesis, StackExchange (all sites), and Wikimedia (all projects and languages). We mine data in 180+ languages, we unroll shortened URLs from 175+ shorteners, we crack open URLs to extract persistent identifiers, and we convert between 50+ types of identifiers.
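
To make "cracking open URLs" a little more concrete, here is a minimal Python sketch of how persistent identifiers might be pulled out of ordinary URLs. It is an illustration only, not the Cobaltmetrics implementation: the function name extract_identifier, the three regular expressions, and the short list of supported hosts are all assumptions made for this example, and a real system would need far more patterns and much more careful validation.

```python
import re
from urllib.parse import urlparse, unquote

# Hypothetical patterns for illustration; real-world coverage needs many more.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")
ARXIV_RE = re.compile(r"^/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$")
PMID_RE = re.compile(r"^/(?:pubmed/)?(\d+)/?$")


def extract_identifier(url):
    """Return a (scheme, value) pair for a recognized identifier, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = unquote(parsed.path)

    if host == "arxiv.org":
        match = ARXIV_RE.match(path)
        if match:
            # The version suffix (v1, v2, ...) is dropped on purpose.
            return ("arxiv", match.group(1))
    if host in ("ncbi.nlm.nih.gov", "pubmed.ncbi.nlm.nih.gov"):
        match = PMID_RE.match(path)
        if match:
            return ("pmid", match.group(1))
    # DOIs can appear in the path of resolvers and of many publisher sites.
    match = DOI_RE.search(path)
    if match:
        # DOI names are case-insensitive, so lowercase them for comparison.
        return ("doi", match.group(0).lower())
    return None
```

With a function like this, a resolver link, a publisher landing page, and a link to a specific preprint version can all be reduced to the same (scheme, value) pair before citations are counted. Shortened URLs would first have to be unrolled by following their redirects, which this sketch leaves out.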

Our search engine is powered by a knowledge base that already includes more than 7 billion groups of identifiers known to be equivalent. It includes data from trusted sources like Wikidata and PubMed, but also linked data made available by publishers and content creators. The knowledge base is used to automagically enrich your queries. Try it out!
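
One common way to represent "groups of identifiers known to be equivalent" is a disjoint-set (union-find) structure, sketched below in Python. This is an assumption about how such a knowledge base could be modeled, not a description of how Cobaltmetrics stores its data; the class name IdentifierGroups, its methods, and the identifiers in the usage note are all made up for the example.

```python
class IdentifierGroups:
    """Disjoint-set (union-find) structure over identifier strings."""

    def __init__(self):
        # Maps each identifier to its parent; a root is its own parent.
        self._parent = {}

    def _find(self, item):
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[item] != root:
            self._parent[item], item = root, self._parent[item]
        return root

    def add_equivalence(self, a, b):
        """Record that identifiers a and b refer to the same object."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def expand(self, identifier):
        """Return every identifier known to be equivalent to the given one."""
        if identifier not in self._parent:
            return {identifier}
        root = self._find(identifier)
        return {item for item in list(self._parent) if self._find(item) == root}


# Usage with made-up identifiers: a query for one is enriched with the others.
groups = IdentifierGroups()
groups.add_equivalence("doi:10.1234/example", "pmid:0000001")
groups.add_equivalence("pmid:0000001", "https://example.org/article/1")
enriched_query = groups.expand("https://example.org/article/1")
```

Here enriched_query contains all three identifiers, so a search that starts from the landing page URL would also match citations that use the DOI or the PMID. The linear scan in expand is fine for a sketch but would not scale to billions of groups; a production system would keep an index from each group to its members.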

Where do we go from here?

Cobaltmetrics is still very young, and so is Thunken. We have already opened our public API for the sake of openness and transparency. Although we have put limits and quotas on API requests to prevent abuse and ensure availability, we are committed to providing a free plan for general and reasonable usage.

We also decided early on to release early and release often, so here is a list of the next big features:

New data sources: not everything that counts can be counted, and not everything that can be counted counts, but we have seen that altmetrics are a sampling game, and there are many types of digital objects that we want to track.
Stable releases: reproducibility is a potential issue with web-based services, and we need to guarantee that any two users who submit the same query to Cobaltmetrics on the same day will retrieve the same results. Rather than updating every source on a rolling basis, we plan to build and release a stable index and a stable knowledge base every month.
Self-reporting table: the NISO-sponsored working group on altmetrics recommends that altmetrics providers release a self-reporting table on data quality (cf. NISO RP-25-201X-3). We are working on a tool that, before each release, inspects our code and our data to document how the data was aggregated, how it can be accessed, and how quality is monitored.

Stay tuned.

Interested in learning more about Cobaltmetrics? Try it out, check out the public API, join our newsletter, and reach out at contact@thunken.com!
