Friday, 8 May 2015

Metadata in Scientific Publication - News - MyScienceWork


Metadata in Scientific Publication

General Public
Just as a dictionary defines words, metadata is data that describes
digital or physical objects. To understand its utility, compare metadata
with the labels used in Ancient Greece to describe the contents of
papyri, which were piled onto shelves in large numbers. The label
attached to each papyrus gave a quick overview of its content without
anyone having to take it out of the pile or unroll it. Such a system was
efficient in Ancient Greece, but today, given the large amount of
digital data available, classification systems need to be far more
efficient.
This article is a translation of “Les métadonnées de la publication scientifique”. It was translated from French into English by Mayte Perea López.


Using metadata for better identification and classification
We have already discussed the case of metadata in the music industry in a separate series of articles.
On this blog, the subject we are particularly interested in is
scientific publication. As the central tool for disseminating the
knowledge produced by research, the scientific article is also at the
heart of significant commercial stakes tied to its distribution and
archiving. Articles exist primarily to be exchanged and shared, and to
achieve this they need to be indexed and stored in archives and various
computer systems.
To foster sharing and make different systems interoperable,
bibliographical standards had to be developed. Most of the digital
metadata used today took their inspiration from the referencing methods
and cataloguing standards that existed long before the digital era. Each
document available in a library had to be described on a bibliographical
card, with fields such as title, author, number of pages and discipline,
so that it could be easily identified and located. To meet these needs,
a large number of cataloguing standards were created (for instance the
Dewey Decimal System, MARC-21, Unimarc, etc.), but they remain, in part,
mutually incompatible.
Defining generic bibliographical standards that are independent of any
scientific discipline makes it possible to standardize the metadata
associated with scientific publications and to broaden the possibilities
for sharing them. In 1995, an international workgroup called the Dublin
Core Metadata Initiative (DCMI), made up of professionals from fields
such as library and information science, computer science, tagging, the
museum community, and others, established a set of generic metadata to
describe digital resources (videos, images, books, websites, etc.). The
Dublin Core describes each resource using the following 15 optional
fields: Title, Creator/Author, Subject, Description, Publisher,
Contributor, Date, Type, Format, Identifier, Source, Language, Relation,
Coverage, Rights. There are other, much more complex standards, for
example MARCXML or JATS, which is used by PubMed and implemented by the U.S. National Library of Medicine.
However, the standards defined by the Dublin Core are by far the most
commonly used. Content producers are encouraged to use these standards
to describe their products. Metadata is not intended for direct use by
humans; it is not visible to the user, but it supports services that
process scientific documents, such as specialized search engines. The
semantic web encompasses all the practices and standards that aim to
enrich the initial data with semantic metadata, producing files that are
better suited to new uses
(see Leading the Web to its Full Potential with the Semantic Web).
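To make the 15 Dublin Core fields listed above more concrete, here is a minimal sketch in Python of how a record could be serialized to XML. The function name to_oai_dc and the sample values are assumptions made for illustration; the two namespace URIs are the ones commonly used for simple Dublin Core and for its oai_dc container.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# The 15 elements of simple Dublin Core, lowercase as they appear in XML.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(record):
    """Serialize a dict of Dublin Core fields to an XML string.

    Hypothetical helper: every field is optional and may be repeated,
    so a value can be a single string or a list of strings.
    """
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in record.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record; only a few of the 15 fields are filled in.
article = {
    "title": "Metadata in Scientific Publication",
    "publisher": "MyScienceWork",
    "date": "2015-05-08",
    "language": "en",
    "subject": ["metadata", "Open Access"],
}
xml_record = to_oai_dc(article)
print(xml_record)
```

Because all fields are optional, a valid record can be very sparse, which is part of what makes the format easy to adopt and also part of why records from different sources vary so much in quality.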
Metadata for Open Access Scientific Publishing

The standards introduced by the Dublin Core represent an important step
forward in unifying the formats used to share descriptive data about
digital resources. If every new format is defined to meet some specific
needs, the question that naturally arises is the following: in the field
of scientific publishing, what are these specific needs?
Scientists and other people using their publications need, among other
things, to quickly find the articles dealing with a given subject of
study and the corresponding authors. The authors’ institution or
research laboratory, as well as the related rights and the release date,
are also potentially useful information. Although the majority of
publishers have already adopted the Dublin Core format, the way fields
are filled in can still vary from one source to another. For example, an
author’s family name and given name can be written in several different
formats (such as “Family Name, G.N.” or “Given Name FAMILY NAME”).
Bibliographical management software is one of the most telling examples
of the centralized use of scientific articles coming from various
publishers. But to obtain a consistent metadata base, it is sometimes
necessary to correct the errors manually.
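As an illustration of the author-name problem described above, the following Python sketch maps the two layouts quoted in the text to a common (family name, given name) pair. The function normalize_author and its rules are assumptions chosen for this example, not a method taken from any particular reference manager; names it cannot parse are returned unchanged so that they can be corrected manually.

```python
def _tidy(family):
    """Convert an all-uppercase family name to title case, else leave it alone."""
    return family.title() if family.isupper() else family


def normalize_author(name):
    """Return (family, given) for the two name layouts quoted in the text.

    Assumed rules, for illustration only:
    - "Family Name, G.N.":      the comma separates family from given name
    - "Given Name FAMILY NAME": trailing all-uppercase words are the family name
    Anything else comes back as (name, "") to be flagged for manual review.
    """
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        family, given = (part.strip() for part in name.split(",", 1))
        return _tidy(family), given
    words = name.split(" ")
    # Walk backwards while the words are fully uppercase (the family name).
    split_at = len(words)
    while split_at > 0 and words[split_at - 1].isupper():
        split_at -= 1
    if 0 < split_at < len(words):
        return _tidy(" ".join(words[split_at:])), " ".join(words[:split_at])
    return name, ""


for raw in ("Curie, M.", "Marie CURIE", "Marie Curie"):
    print(raw, "->", normalize_author(raw))
```

Even with such rules, "Curie, M." and "Marie CURIE" only agree on the family name, since an initial cannot be expanded automatically. This is one reason why manual correction remains necessary.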

Publishing scientific articles in Open Access allows for greater
visibility, as dissemination is free and requires nothing more than an
Internet connection. The Open Archives Initiative (OAI), whose aim is to
promote Open Access through the development of interoperability
standards, implemented the OAI-PMH protocol (Open Archives Initiative
Protocol for Metadata Harvesting), which facilitates the exchange of
information between repositories (archives of scientific publications)
and service providers. Service providers are all the institutions that
make it possible to use the collected metadata, for example search
engines like Google Scholar or websites such as the social network MyScienceWork.
The OAI-PMH protocol runs over HTTP: a harvester sends requests to
article repositories to collect the metadata of scientific documents
and possibly to download the text files. It is therefore possible for
anyone to “harvest” (that is to say, collect) the metadata of the
contents of Open Access repositories like PubMed, arXiv or HAL. Several directories (DOAJ, ROAR)
list thousands of Open Access repositories. This system makes it
possible to access significant databases in a rather short amount of
time. In most cases, the metadata provided through OAI-PMH is defined
according to the Dublin Core. It is worth noting that Wikipedia is one
of the repositories that offer access to their data through OAI-PMH.
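The harvesting step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the base URL https://repository.example.org/oai is hypothetical, the sample response is trimmed to the parts used here, and the helper names list_records_url and parse_titles are invented for this example. The ListRecords verb and the oai_dc metadata prefix are part of the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL of an OAI-PMH ListRecords request (a plain HTTP GET)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return f"{base_url}?{urlencode(params)}"


def parse_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


# Trimmed, hand-written sample of what a repository might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2015-05-08</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Metadata in Scientific Publication</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(parse_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL over HTTP and keep requesting further pages while the response contains a resumptionToken; that loop is left out here to keep the sketch short and runnable offline.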
Using metadata for scientific data sharing?

The Dublin Core standards are quite simple. They can describe scientific
publications regardless of the discipline concerned. For data sharing,
also known as Open Data, the diversity of formats creates additional
difficulties that make the definition of universal standards much more
complex.
Open Data is a concept that is gaining momentum within our institutions
and governments. Open Data in science could dramatically change the way
research currently functions. Indeed, if all the raw data used by
scientists were freely accessible, all the actors of society, provided
that they have the necessary capacity and knowledge, could potentially
conduct research on the same data. The scientific community, in general,
would benefit from this sharing. It would simplify the implementation of
collaborative work to solve complex problems (see the example of the
collaborative mathematics project Polymath).
It would open access to scientific data to a group of people who are
today excluded from this system (see the example of the Eurogenes
project). Finally, it would strongly favor transparency in the
scientific research process. Naturally, it is important to take into
account the fact that competition between teams and laboratories limits
the practice of opening data.
Scientific data is obviously very heterogeneous depending on the
discipline and the subject being studied. Today, there are no universal
standards to represent scientific data, and such standards will be very
difficult to implement. Even the definition of scientific data has not
yet been clearly established. Standardizing the formats used, first
within one discipline and then across multidisciplinary fields, would
enable progress to be made towards that end. The future may well hold
new scientific practices, thanks to the progressive release of metadata
and scientific data.

Many thanks to Iana Atanassova for reviewing this article.

Find out more:
Scientific data must be managed publicly (in French)
