Wednesday, 23 January 2019

Why our citation practices make no sense

Source: https://musingsaboutlibrarianship.blogspot.com/2019/01/why-our-citation-practices-make-no-sense.html


People outside academia are often struck by how odd things are in academia.

The most frequently cited example is how researchers rush to give away their work (often including the rights to the IP) to publishers for nothing, allowing publishers to make millions from it while the researchers are left to do all the intellectual labour.

Of course, there is some method in the madness, since we know why researchers act this way: they are buying prestige by publishing in a big-name journal. We also understand who is invested in the status quo and why.

But this blog post is not about this.

Rather, I'm pondering how wasteful and illogical our citing practices are. Every year, thousands of hours are devoted by everyone from students to researchers to copy editors employed by publishers to beating citations into shape. Is this really necessary? Can we make it easier?

Why are there thousands of citation styles?

Nobody is against consistency in referencing, of course, but do we really need thousands of citation styles (8.5k according to the CSL style repository)?

All this creates confusion and costs.

Think of the researchers who have to reformat their references every time their paper is rejected and they need to resubmit to another journal with its own unique style. Given the low acceptance rates of top journals and the desire of researchers to try the top journals first, a typical article may be submitted to more than one journal before it is accepted. Even if researchers don't format their references properly, publishers employ copy editors to clean up the references in accepted manuscripts.

The fact that there are thousands of styles has other, less obvious costs. In the past decade there have been dozens of projects that try to parse text references into structured data, e.g. ParsCit. The structured data can then be used for various functions, from finding an appropriate copy via link resolvers to importing into reference managers.

FreeCite a tool by Brown University Library inspired by ParsCit

The fact that there are thousands of citation styles rather than dozens has made this task difficult even with the latest generation of citation parsers based on machine learning techniques.
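To make the difficulty concrete, here is a minimal Python sketch. It is not how ParsCit or any production parser works; it is just a hand-written pattern for one invented APA-like layout, applied to two made-up reference strings. The names APA_LIKE and parse_reference, and both references, are my own illustrations.

```python
import re

# Hypothetical pattern for ONE layout only:
# Authors (Year). Title. Journal, Volume, Pages.
APA_LIKE = re.compile(
    r"^(?P<authors>.+?) \((?P<year>\d{4})\)\. "
    r"(?P<title>.+?)\. "
    r"(?P<journal>.+?), (?P<volume>\d+), (?P<pages>\d+-\d+)\.$"
)


def parse_reference(text):
    """Return a dict of fields, or None when the layout is not recognised."""
    match = APA_LIKE.match(text)
    return match.groupdict() if match else None


# Two invented strings describing the same (made-up) article.
apa_style = ("Doe, J., & Roe, R. (2015). An invented article title. "
             "Journal of Examples, 12, 34-56.")
abbreviated_style = "J. Doe and R. Roe, J. Ex. 12, 34 (2015)."

parsed = parse_reference(apa_style)            # fields are recovered
unparsed = parse_reference(abbreviated_style)  # None: layout not covered
```

Every additional style needs either another pattern like this or a model trained on enough examples of it, which is why thousands of styles are so much harder to handle than dozens.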

For instance, take the recent work by Crossref to match reference strings deposited by publishers and insert DOIs into them. See also this series of blog posts.

Crossref work on matching reference strings to DOIs

Even though Crossref has managed to improve the recall of reference matching, they still can't do this perfectly because there are so many styles out there, and they note that performance varies with the citation style used. In fact, they note that matching rates are much worse for certain styles in Chemistry or Physics that do not require titles or that abbreviate citation elements such as journal titles.

Most rules in citation styles are outdated, arbitrary and make little sense

If you have ever helped university freshmen in writing courses, one of the things you notice is how much anguish they have over citation styles and how many requests for help this generates. A recent study I did found that, among various skills, confidence in how to cite was actually lower after the writing module!

Part of the reason they are so obsessed with citation style, I suspect, is that it is an area where there seems to be a "right" and a "wrong" answer, and one where it is easy to see whether they have done well or badly (marks lost for not following the citation style are clearly indicated). This is as opposed to less well-defined skills like "evaluating sources" or "scoping", where marks lost (or rather not gained) are less salient and less easily attributed to a specific mistake.

It doesn't help that a lot of students merely google and land on a site like the famous Purdue OWL, or use a librarian-created LibGuide that covers only roughly what should be done. Students then get anxious when they inevitably run into cases not covered by those guides to, say, APA 6th.

Even if they do go through the official APA 6th guide, which I understand is pretty comprehensive compared to other styles, they spend hours worrying about, say, who (if anyone) to credit as the author of a webpage they are citing. The surprising answer is often nobody! The problem, of course, is that it's not always clear cut, even to me, when you should attribute the organization behind a website and when you should leave out the author. And I bet that when a lot of lecturers who mark these things see a student leave out an author, they will just assume it was a mistake and take marks off...

And if you think citation styles are easy or that they don't really affect grades, look at this tweet from an extremely accomplished academic librarian (among other accomplishments, Lisa is an ex-ACRL President).

But does all this hand-wringing over punctuation in citation styles really do anything?

If you ask freshmen who have gone through library IL classes why they need to cite properly, they will give the typical answer (often found on the librarian's "Why cite" slide): it is to give credit, to allow readers to find the sources, etc.

All this is well and good, but does it really justify all the uncertainty and doubt that students suffer when they try to cite something that doesn't fit nicely into the guidelines?

Moreover, even in clear-cut cases, is there really a reason to obsess over complicated rules about minor punctuation marks? Why do you sometimes need to italicize or quote titles in MLA and sometimes not? Why do you sometimes need to add et al. and sometimes not, depending on various conditions? Is it really critical to add the location (city) of the publisher, as this tweet asks?

Call me naive, but if you have the main citation elements that cover "who created it", "when was it created", "what it is called" and "where it can be found", that's more than enough.

Clearly, when you look at the citation styles, many rules don't seem to have much logic or reason to them beyond "this is the way to do it". At best there might have been a reason in the past that has since been superseded by the current online environment.

Also, if there is a logical reason for any of this, why do the 8,000+ styles all disagree on these details? While I can believe some disciplines would place different emphasis on different citation elements, I can't imagine this justifies thousands of styles.

When you think about it, many of these citing styles were designed for the print era, and much of their logic no longer applies to an online world, particularly one that is increasingly open access. Here's a simple example: there are some styles in Chemistry, Physics, etc. where everything is abbreviated or titles are omitted to save space. This of course makes no sense today and actually makes things harder if you want to follow the references.

Speaking of following the references, that's one of the major reasons for citing, right? In today's online environment, including a DOI would be the easiest way to help a reader do that. I wouldn't go so far as to say we should just add a DOI and stop citing other reference elements, because someone might mess up the DOI and some redundancy is good, but still you would think most citation styles would mandate DOIs.

But you would be wrong: currently only 41.2% of styles require DOIs (APA and MLA only very recently required or recommended DOIs), and even this 41.2% figure is somewhat inflated.

The effect of inconsistency - how much is bad?

The APA 6th Style guide (p 181) defends the importance of consistency in references as follows.

"Consistency in referencing style is important, especially in light of evolving technologies in database indexing such as automatic indexing by database crawlers. These computer programs use algorithms to capture data from primary article as well as the article reference list."

They then state "If reference elements are out of order or incomplete, the algorithm might not recognise them, lowering the likelihood that the reference will be captured for indexing".

I would imagine these algorithms are citation parsers, so an interesting question is how much inconsistency you can get away with and still have these parsers work. Sure, I can imagine missing or messing up an article title making a big difference, but what about forgetting to italicize? Lower case versus upper case? Putting an additional dot between elements or not? Are our algorithms that frail?

I haven't looked for studies that specifically examine this case, where the citation elements are mostly there (as opposed to outright missing or wrong) but are "inconsistent" in minor punctuation marks, but the Crossref study referenced above might be helpful.

In one earlier study they took a random sample of 2,500 items in Crossref, generated reference strings in 11 styles (including APA, MLA, Chicago author-date) and then tried to simulate noise.

Simulating a dirty dataset by Crossref for matching of DOIs

Surprisingly, the new search-based method they tested is quite robust even when these strings are mutated quite badly, with the top method scoring a precision of 0.99 and a recall of 0.79.
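For readers less used to these two measures, a short Python sketch of what they mean may help. The formulas are the standard definitions; the counts are invented by me purely so that they reproduce the reported ratios, and they are not taken from the Crossref study.

```python
def precision(true_positives, false_positives):
    """Share of the links the matcher returned that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of the references that should be linked that were linked correctly."""
    return true_positives / (true_positives + false_negatives)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Invented counts: 1,000 references that truly have a DOI; the matcher
# links 790 of them correctly, links 8 to the wrong DOI and misses 210.
p = precision(true_positives=790, false_positives=8)
r = recall(true_positives=790, false_negatives=210)

# Combining the two figures reported above into a single score.
combined = f1(0.99, 0.79)
```

In other words, a precision of 0.99 means that about 1 in 100 returned links is wrong, while a recall of 0.79 means that roughly a fifth of the references that could have been linked are left unmatched.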

I wish they had broken up the data further (for example, the title-scrambled degraded style and the degraded styles seem to be unrealistic examples) and just given recall and precision stats for the known style + random noise ones, but still the results are suggestive.

Some caveats: this study only deals with one specific task in citation parsing, linking to DOIs, so it won't work for materials without DOIs. Also, this method uses a "search-based" technique as opposed to an older field-based approach (where the system calculates a similarity score on various important fields like title and author); one might suspect a field-based approach would be more sensitive to variance.
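A rough Python sketch of the search-based idea: treat the whole reference string as a bag of words and score candidate records by overlap, without first splitting the string into fields. Plain token overlap is my own stand-in here and is not Crossref's actual scoring; the records, DOIs, field list and the 0.8 threshold are all invented for illustration.

```python
import re

# Which (hypothetical) metadata fields are searched against.
SEARCH_FIELDS = ("author", "year", "title", "journal")


def tokens(text):
    """Lower-cased alphanumeric tokens; punctuation and case are ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(reference, record):
    """Share of the record's tokens that also occur in the reference string."""
    record_tokens = tokens(" ".join(str(record[f]) for f in SEARCH_FIELDS))
    return len(record_tokens & tokens(reference)) / len(record_tokens)


def best_match(reference, records, threshold=0.8):
    """Return the best-scoring record, or None if nothing clears the threshold."""
    scored = [(overlap_score(reference, rec), rec) for rec in records]
    score, record = max(scored, key=lambda pair: pair[0])
    return record if score >= threshold else None


records = [
    {"author": "Doe", "year": 2015, "title": "An invented article title",
     "journal": "Journal of Examples", "doi": "10.9999/example.1"},
    {"author": "Roe", "year": 2016, "title": "A different invented title",
     "journal": "Annals of Samples", "doi": "10.9999/example.2"},
]

# The same made-up reference in two differently punctuated layouts.
ref_a = "Doe, J. (2015). An invented article title. Journal of Examples, 12, 34-56."
ref_b = "DOE J, 2015, 'An Invented Article Title', JOURNAL OF EXAMPLES 12: 34-56"

match_a = best_match(ref_a, records)
match_b = best_match(ref_b, records)
no_match = best_match("Unrelated string", records)
```

Because nothing here depends on where the commas, quotes or capitals fall, both layouts land on the same record. A field-based matcher would first have to decide which part of the string is the title and which the journal, and that is the step where style differences bite.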

Are citation style makers dismissive of reference managers?

Of course, we live in the computer age, and time-saving software like reference managers (e.g. Zotero) exists. Shockingly, though, Sebastian Karcher and Philipp Zumstein (who contribute to the Zotero project) argue that "influential style guides such as the Chicago Manual (14.13) and APA (which makes no mention of automated references at all) are dismissive of such software".

In the fascinating article "Citation Styles: History, Practice, and Future", they trace the history of citation styles and the rise and decline of three main classes of citation styles: "Note style", "Numbered style" and "Author-date style".

They go beyond a mere narration of the history and also include an analysis of styles in the CSL citation styles repository by discipline and type, but it is the final chapter, where they write about the future of citation styles, that made me sit up.

They write about the need to standardise on a few established citation styles rather than creating home-grown ones, which often tend to be inconsistent and unclear in their instructions. I fully agree. I remember once having students who were worried sick because they were given a nameless citation style, with only a few scant examples to follow, for their thesis. They spent almost more time worrying about that than actually doing the research.

Sadly, they predict things might become even worse, with the citation style landscape becoming more diverse. But what about automated reference managers like EndNote, Zotero and Mendeley, you say? This is where it gets shocking.

"Currently, influential style guides such as the Chicago Manual (14.13) and APA (which makes no mention of automated references at all) are dismissive of such software, in spite of its growing popularity among academics. Style guides and publishers can and should help, rather than belittle, these efforts. Most importantly, they should refrain from imposing rules that are virtually impossible to automate."

They go on to give an example from APA, which makes the "use of journal issue numbers dependent on whether a journal is continuously paginated per volume." Since there is no way to tell whether a journal is continuously paginated (no database exists with this information), this aspect of APA can't be automated. Maybe I'm missing something, but this is yet another one of the arcane rules that exist for no real reason I can think of.
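To see where the automation breaks down, here is a small Python sketch of that rule as I understand it. The function name and the volume(issue) output layout are my own simplifications, not text from the APA manual.

```python
def volume_issue(volume, issue, continuously_paginated):
    """Format the volume/issue part of a journal reference.

    continuously_paginated must be True, False, or None when unknown.
    A reference manager normally has only volume and issue in its metadata,
    so in practice this third argument is the piece nobody can supply.
    """
    if continuously_paginated is None:
        raise ValueError("pagination scheme unknown: cannot decide "
                         "whether to print the issue number")
    if continuously_paginated:
        return str(volume)          # issue number is dropped
    return f"{volume}({issue})"     # issue number is kept


with_flag = volume_issue(12, 3, continuously_paginated=False)
```

The formatting itself is trivial. The trouble is that the deciding input is a property of the journal that is not recorded in ordinary citation metadata.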

Reading all this gives me the overall impression that the people in charge of citation styles, if not openly hostile to reference managers, are at the very least ignoring the issue, leading to a lot of time wasted by both students and researchers, particularly for manuscripts that are resubmitted to multiple journals with differing citation formats. Granted, the journal employs staff to do editing and cleanup work after the manuscript is accepted, but why the additional effort and expense?

I was struck by a comment by Jodi Schneider recently at Crossref Live 2018, where she pointed out that publishers now spend time helping researchers check references and formatting but don't help them check whether a reference has been retracted! Surely publishers have better things to do?

References are converted from structured references to plain text and back again during journal submissions

I was reminded further of this absurdity when I read that Scholarcy, a tool that can extract references from PDFs, was used by BMJ Publishing Group to convert back-file PDFs into structured references (XML probably?). Granted, this was for legacy PDFs, but it made me think.

Scholarcy Reference Extraction API

Today, many manuscript submission systems accept documents in Word or PDF, and I would guess this is still the most common way to submit. The thing is, quite a lot of authors now use reference managers like EndNote. When you think about it, the manuscript goes through a process in which authors create citations in a structured format using reference managers, then strip that information and submit the references in plain text, after which, once the manuscript is accepted, publishers convert them back again to a structured XML format....

In fact, it seems to me that if references were submitted in structured format during the submission and peer review phase, one could more easily do useful things like check for retractions, or do citation chaining/analysis to automatically identify suitable reviewers, etc.

I have been reading about new services for publishers like UNSILO that use machine learning and AI to help publishers improve the screening and evaluation of submitted manuscripts. Why not collect as much structured data as possible, rather than rely on extraction techniques which may not be as accurate?

I'm perhaps naive here, but it seems more efficient for the citations/references to be submitted in a structured format such as RIS or BibTeX in the first place. Here is where it gets odd: some journals allow you to submit manuscripts in LaTeX, but it seems some of them specifically ask you not to submit the .bib files (structured references) but only the processed bibliography (.bbl)!

As an aside, I was looking at the recent reply from Elsevier regarding the revolt that led to the mass resignation of the editorial board of its Journal of Informetrics.
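A small Python sketch of why the structured form is the one worth keeping. The record below is invented, its field names only loosely follow BibTeX, and the two layouts are simplified approximations of an author-date and a numbered style rather than faithful implementations of any official guide.

```python
# Invented record; field names are assumptions loosely based on BibTeX.
record = {
    "authors": ["Doe, J.", "Roe, R."],
    "year": 2015,
    "title": "An invented article title",
    "journal": "Journal of Examples",
    "volume": 12,
    "pages": "34-56",
    "doi": "10.9999/example.1",
}


def apa_like(r):
    """Author-date layout, loosely modelled on APA."""
    *rest, last = r["authors"]
    names = ", ".join(rest) + ", & " + last if rest else last
    return (f"{names} ({r['year']}). {r['title']}. "
            f"{r['journal']}, {r['volume']}, {r['pages']}. "
            f"https://doi.org/{r['doi']}")


def numbered_like(r, number):
    """Numbered layout without a title, as in some chemistry-style lists."""
    names = "; ".join(r["authors"])
    return (f"[{number}] {names} {r['journal']} "
            f"{r['year']}, {r['volume']}, {r['pages']}.")


for_journal_a = apa_like(record)
for_journal_b = numbered_like(record, 1)
```

Going from the record to either string takes a few lines. Going back from the second string to the record is impossible without a lookup, because the title and DOI are simply gone. That lossy direction is the one publishers currently pay to reverse.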

Part of letter from Elsevier commenting on mass resignation of the editorial board at its Journal of Informetrics.

It seems one of the demands from the editorial board was that Elsevier make the references it deposits in Crossref open like most major publishers have done.

The reply is instructive: Elsevier bemoans the fact that references are received in various styles and "more importantly in natural language", hence they have invested "significantly" in citation extraction tech.

Leaving aside the fact that the real reason Elsevier refuses to do this is probably that it would strengthen competitors to its Scopus product, both commercial (e.g. Dimensions) and open ones, if we take this response at face value, it does seem odd that Elsevier and other publishers are not mandating saner practices like submitting citations in RIS/BibTeX.

Why this state of affairs? Can we do better?

The amazing thing is that, unlike in scholarly publishing, where you can see why one party wants to keep the status quo, the current state of affairs seems counterproductive for everyone.

I can't think of anyone who benefits from this. Sure, academic librarians get a lot more research consultations from students who want them to "go through" their references to ensure that they are 100% correct, just in case. But I'm pretty sure most academic librarians would prefer to forgo all this and help with more interesting and important aspects of research, such as discussing how scholarship is a discussion, rather than going through the mechanics and rules of citations.

Perhaps people behind reference managers would benefit from this? Maybe, but as we have established they don't work as well as they should.

The only group (and it is a tiny one) that I can imagine benefiting from this state of affairs for sure is the people in charge of the style guides.

In this APA blog post, one of them suggests there is a reason for so many styles: it is simply for signalling purposes, such that knowing how to use a style "marks its user as a member of a specific culture". I'm afraid I'm not sympathetic to this argument. Plenty of undergraduates can do a passable APA style; does that really mark them as members of the psychology culture?

All in all I don't get it.

A better method?

Just before I pressed the publish button, Todd Carpenter referred me to this very instructive piece he wrote in 2014, Why Are Publishers and Editors Wasting Time Formatting Citations? My blog post covers most of the same ground as his: the inefficiency of thousands of styles, why we should be encouraging the use of reference managers and the submission of references in structured format, etc.

But I'm struck most by his proposal.

The idea is this

"When authors are submitting references, why doesn’t the community simply send in a reference that is submitted like this:

"

This is simple, elegant and efficient. By using permanent IDs (such as ORCIDs and DOIs) as much as possible, we can benefit a lot from machine-readable data and linked data techniques. Of course, in reality you might want some redundancy with strings in case someone messes up the DOI, etc.

Getting everyone to agree is of course the tough part. Again, this blog post by APA paints a nice picture of how getting everyone to agree on a style leads to arguments over the most minute points, like the use of periods, abbreviations vs spelling out, capitals, etc.

All this strikes me as rule making and rule following for no good reason. After all, it seems to me that the main purpose of a reference is to cover "Who created the reference", "When was the reference created", "What is this reference called" and finally "Where you can find this reference". Do we really need to worry so much about things like periods, commas and letter case?

Acknowledgements: This article has been years in the making and has been influenced by discussions online at various forums such as LSW and on Twitter. As mentioned, it was highly influenced by Sebastian Karcher and Philipp Zumstein's article on citation styles and, most recently, Todd Carpenter's Why Are Publishers and Editors Wasting Time Formatting Citations?
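The redundancy point can be sketched in a few lines of Python. To be clear, this is not Todd Carpenter's example or format; the field names, the REGISTRY dictionary (a stand-in for whatever service resolves DOIs to metadata) and the DOIs themselves are all invented by me.

```python
# Hypothetical stand-in for a DOI lookup service; a real system would
# query a resolver instead of a hard-coded dictionary.
REGISTRY = {
    "10.9999/example.1": {"first_author": "Doe", "year": 2015},
    "10.9999/example.2": {"first_author": "Roe", "year": 2016},
}


def check_reference(ref):
    """Compare the redundant strings in a reference with what its DOI resolves to."""
    record = REGISTRY.get(ref["doi"])
    if record is None:
        return "unknown DOI"
    if (record["first_author"] != ref["first_author"]
            or record["year"] != ref["year"]):
        return "DOI does not match the redundant fields"
    return "ok"


good = {"doi": "10.9999/example.1", "first_author": "Doe", "year": 2015}
# Same reference, but the last digit of the DOI was mistyped.
mistyped = {"doi": "10.9999/example.2", "first_author": "Doe", "year": 2015}

good_status = check_reference(good)
mistyped_status = check_reference(mistyped)
```

With the identifier alone, the mistyped DOI would silently point at the wrong paper. Two short redundant strings are enough to catch it, and neither of them needs to be formatted in any particular style.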

Posted 1 week ago by Aaron Tay
