Sunday 20 January 2019

The effect of inconsistency - how much is bad?

Source: http://musingsaboutlibrarianship.blogspot.com/2019/01/why-our-citation-practices-make-no-sense.html

The APA 6th Style guide (p 181) defends the importance of consistency in references as follows.
 "Consistency in referencing style is important, especially in light of evolving technologies in database indexing such as automatic indexing by database crawlers. These computer programs use algorithms to capture data from primary article as well as the article reference list."
They then state "If reference elements are out of order or incomplete, the algorithm might not recognise them, lowering the likelihood that the reference will be captured for indexing".

I would imagine these algorithms are citation parsers, so an interesting question is how much inconsistency you can get away with and still have these citation parsers work. Sure, I can imagine a missing or garbled article title making a big difference, but what about forgetting to italicize? Lower case versus upper case? Putting an additional dot between elements or not? Are our algorithms that frail?

I haven't looked at studies that specifically examine this case, where the citation elements are mostly there (as opposed to outright missing or wrong) but are "inconsistent" in minor punctuation marks, but again the Crossref study referenced above might be helpful.

In one earlier study they took a random sample of 2,500 items in Crossref, generated reference strings in 11 styles (including APA, MLA, Chicago author-date) and then tried to simulate noise.

Simulating dirty dataset by Crossref for matching of dois
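To make the idea of "simulating noise" concrete, here is a minimal sketch of how one might mutate a formatted reference string. This is my own illustration and not Crossref's actual procedure: the function name `add_noise`, the three mutation types and the `rate` parameter are all assumptions, and the sample reference is an invented placeholder.

```python
import random


def add_noise(reference: str, rng: random.Random, rate: float = 0.05) -> str:
    """Return a copy of `reference` with random character-level damage.

    Each character is, with probability `rate`, deleted, duplicated,
    or swapped with the character that follows it.
    """
    chars = list(reference)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend([c, c])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 2
                continue
        # No mutation (or a swap requested on the last character).
        out.append(c)
        i += 1
    return "".join(out)


# Invented placeholder reference, not a real publication.
clean = "Author, A. (2019). An example title. Journal of Examples, 1(2), 3-4."
noisy = add_noise(clean, random.Random(42), rate=0.1)
print(noisy)
```

A matcher can then be run on both `clean` and `noisy` versions to see how quickly its results degrade as `rate` goes up.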

Surprisingly, the new search-based method they tested is quite robust even when these strings are mutated quite badly, with the top method scoring a precision of 0.99 and a recall of 0.79.
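For readers who don't work with these two measures often, here is a small sketch of how precision and recall would be computed for a DOI-matching task. The function name, the data layout and the DOIs are all made up for illustration; the study's own evaluation code may well differ.

```python
def precision_recall(true_links: dict, predicted_links: dict):
    """Compute (precision, recall) for reference-to-DOI matching.

    true_links:      reference id -> the correct DOI
    predicted_links: reference id -> the DOI the matcher returned
                     (a reference the matcher gave up on is simply absent)
    """
    correct = sum(
        1 for ref_id, doi in predicted_links.items()
        if true_links.get(ref_id) == doi
    )
    precision = correct / len(predicted_links) if predicted_links else 0.0
    recall = correct / len(true_links) if true_links else 0.0
    return precision, recall


# Hypothetical data: four references, the matcher links three of them, all correctly.
truth = {"r1": "10.1234/a", "r2": "10.1234/b", "r3": "10.1234/c", "r4": "10.1234/d"}
predicted = {"r1": "10.1234/a", "r2": "10.1234/b", "r3": "10.1234/c"}
print(precision_recall(truth, predicted))  # (1.0, 0.75)
```

Read this way, a precision of 0.99 with a recall of 0.79 means the matcher almost never links a reference to the wrong DOI, but it leaves roughly a fifth of the references unlinked.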

I wish they had broken up the data further (for example, the title scrambled, degraded style and degraded styles cases seem to be unrealistic examples) and just given recall and precision stats for the known style + random noise ones, but still the results are suggestive.
Some caveats: this study only deals with one specific task in citation parsing, linking to DOIs, so it won't work for non-DOI materials. Also, the method proposed is a "search-based" technique as opposed to an older field-based approach (where the system tries to calculate a similarity score on various important fields like title and author). One might suspect a field-based approach would be more sensitive to variance.
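To illustrate why I suspect a field-based approach is more fragile, here is a rough sketch of field-based scoring using Python's standard `difflib`. The field names, the weights and the sample records are my own assumptions, not taken from any real matcher. The point is only that the score depends on the reference first being parsed into the right fields.

```python
from difflib import SequenceMatcher

# Assumed weights for illustration only; a real system would tune these.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; 0 if either side is empty."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_score(parsed: dict, candidate: dict) -> float:
    """Weighted sum of per-field similarities between a parsed reference and a candidate record."""
    return sum(
        weight * similarity(parsed.get(field, ""), candidate.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


# Invented records, not real publications.
candidate = {"title": "An example title", "author": "Author, A.", "year": "2019"}
well_parsed = {"title": "An example title", "author": "Author, A.", "year": "2019"}
# Same information, but the parser put the title into the author field.
badly_parsed = {"title": "", "author": "Author, A. An example title", "year": "2019"}

print(field_score(well_parsed, candidate))
print(field_score(badly_parsed, candidate))
```

In this toy setup, a reference whose elements are all present but land in the wrong fields scores far lower than the correctly parsed one, which is the kind of sensitivity to inconsistency the APA guide seems to have in mind.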

Are citation style makers dismissive of reference managers?

Of course, we live in the computer age and time-saving software like reference managers (e.g. Zotero) exists, but shockingly, Sebastian Karcher and Philipp Zumstein (who contribute to the Zotero project) argue that "influential style guides such as the Chicago Manual (14.13) and APA (which makes no mention of automated references at all) are dismissive of such software."

In the fascinating article "Citation Styles: History, Practice, and Future" they trace the history of citation styles and the rise and decline of 3 main classes of citation styles, "Note style", "Numbered style" and "Author-date style".

They go beyond a mere narration of the history and also include an analysis of styles in the CSL citation styles repository by discipline and type, but it is the final chapter, where they write about the future of citation styles, that made me sit up.

They write on the need to standardise on a few standard citation styles rather than having everyone create their own, which often tend to be inconsistent and unclear in their instructions. I fully agree. I remember once I had students worried sick because they were given a nameless citation style, with a few scant examples as a sample to follow for their thesis. So they almost spent more time worrying about that than actually doing the research.

Sadly, they predict things might become even worse, with the citation style landscape becoming more diverse. But what about automated reference managers like EndNote, Zotero and Mendeley, you say? This is where it got shocking.

"Currently, influential style guides such as the Chicago Manual (14.13) and APA (which makes no mention of automated references at all) are dismissive of such software, in spite of its growing popularity among academics. Style guides and publishers can and should help, rather than belittle, these efforts. Most importantly, they should refrain from imposing rules that are virtually impossible to automate."

They go on to give an example from APA, which makes the "use of journal issue numbers dependent on whether a journal is continuously paginated per volume." Since there is no way to tell if a journal is continuously paginated (no database exists with this information), this aspect of APA can't be automated. Maybe I'm missing something, but this is yet another one of the arcane rules that exist for no real reason I can think of.
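Here is a small sketch of what that rule looks like from the point of view of someone writing a reference manager. It reflects my reading of the rule as described above (show the issue number only when each issue restarts its page numbering); the function name and the `paginated_by_issue` flag are hypothetical.

```python
from typing import Optional


def apa6_volume_issue(volume: str, issue: Optional[str],
                      paginated_by_issue: Optional[bool]) -> str:
    """Format the volume/issue part of a journal reference.

    paginated_by_issue:
        True  -> every issue restarts at page 1, so the issue number is shown
        False -> pages run continuously through the volume, so it is omitted
        None  -> unknown, which is the normal case for software
    """
    if paginated_by_issue is None:
        # The metadata a reference manager receives (volume, issue, pages)
        # does not say how the journal paginates, so no correct answer exists.
        raise ValueError("cannot apply the rule: pagination scheme unknown")
    if paginated_by_issue and issue:
        return f"{volume}({issue})"
    return volume


print(apa6_volume_issue("12", "3", True))   # 12(3)
print(apa6_volume_issue("12", "3", False))  # 12
```

The awkward part is the third argument: volume and issue come with the reference metadata, but the pagination flag has to come from somewhere else, and as noted above there is no database to look it up in.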

Reading all this gives me the overall impression that the people in charge of citation styles are, if not openly hostile to reference managers, at the very least ignoring the issue. This leads to a lot of time wasted by both students and researchers, particularly for manuscripts that are resubmitted to multiple journals with differing citation formats. Granted, the journal employs staff to do editing and cleanup work after the manuscript is accepted, but why the additional effort and expense?

I was struck by a comment by Jodi Schneider recently at Crossref Live 2018 where she pointed out that publishers now spend time helping researchers check references and formatting but don't help them check if a reference has been retracted! Surely publishers have better things to do?

References are converted from structured references to plain text and back again during journal submissions

I was reminded further of this absurdity when I read that Scholarcy, a tool that can extract references from PDFs, was used by BMJ Publishing Group to convert back-file PDFs into structured references (XML probably?). Granted, this was for legacy PDFs, but it made me think.





Today, many manuscript submission systems accept documents in Word or PDF, and I would guess that is still the most common way to submit. The thing is, quite a lot of authors now use reference managers like EndNote. When you think about it, the manuscript goes through a process in which authors create citations in a structured format using reference managers, strip that information and submit the references as plain text, after which, once the manuscript is accepted, publishers convert them back again to a structured XML format.

In fact, it seems to me that if references were submitted in structured format during the submission and peer review phase, one could more easily do useful things like check for retractions, do citation chains/analysis to automatically identify suitable reviewers etc.
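As a sketch of how simple a retraction check becomes once references arrive with DOIs attached, consider the following. The function names and the DOIs are invented, and I am assuming the publisher already has some list of retracted DOIs to compare against; where that list comes from is outside this sketch.

```python
from typing import Dict, List, Set

DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common URL or 'doi:' prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def flag_retracted(references: List[Dict[str, str]],
                   retracted_dois: Set[str]) -> List[Dict[str, str]]:
    """Return the references whose DOI appears in the set of retracted DOIs."""
    retracted = {normalize_doi(d) for d in retracted_dois}
    return [
        ref for ref in references
        if ref.get("doi") and normalize_doi(ref["doi"]) in retracted
    ]


# Invented example data.
refs = [
    {"title": "A sound paper", "doi": "10.1234/good.1"},
    {"title": "A withdrawn paper", "doi": "https://doi.org/10.1234/BAD.2"},
    {"title": "A book with no DOI"},
]
print(flag_retracted(refs, {"10.1234/bad.2"}))
```

With plain-text references, the same check first needs a parsing and matching step of the kind discussed earlier, with all the uncertainty that comes with it.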

I have been reading about new services for publishers like UNSILO that use machine learning and AI to help publishers improve the screening and evaluation of manuscript submissions. Why not collect as much structured data as possible, rather than rely on extraction techniques which may not be as accurate?

I'm perhaps naive here, but it seems more efficient for the citations/references to be submitted in a structured format such as RIS or BibTeX in the first place. Here is where it gets odd: some journals allow you to submit manuscripts in LaTeX, but it seems some of them specifically ask you not to submit the .bib files (structured references) but only the processed bibliography (.bbl)!
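For anyone who hasn't looked inside one of these files, here is a minimal sketch that writes a journal article record out as RIS. The dictionary keys and the sample record are my own invention and only a handful of common RIS tags are covered, so treat it as an illustration rather than a complete exporter.

```python
# Mapping from (assumed) record keys to common RIS tags.
RIS_FIELDS = (
    ("PY", "year"),
    ("TI", "title"),
    ("JO", "journal"),
    ("VL", "volume"),
    ("IS", "issue"),
    ("SP", "start_page"),
    ("EP", "end_page"),
    ("DO", "doi"),
)


def to_ris(ref: dict) -> str:
    """Serialize one journal-article record as RIS text."""
    lines = ["TY  - JOUR"]          # record type: journal article
    for author in ref.get("authors", []):
        lines.append(f"AU  - {author}")
    for tag, key in RIS_FIELDS:
        if ref.get(key):
            lines.append(f"{tag}  - {ref[key]}")
    lines.append("ER  - ")          # end of record
    return "\n".join(lines)


# Invented placeholder record, not a real publication.
record = {
    "authors": ["Author, A.", "Writer, B."],
    "year": "2019",
    "title": "An example title",
    "journal": "Journal of Examples",
    "volume": "1",
    "issue": "2",
    "start_page": "3",
    "end_page": "4",
    "doi": "10.1234/example",
}
print(to_ris(record))
```

Every element stays labelled, so whoever receives the file never has to guess which part is the title and which is the journal, and can render it in whatever house style they like.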

As an aside, I was looking at the recent reply from Elsevier regarding the revolt that caused the mass resignation of the editorial board at its Journal of Informetrics.

Part of letter from Elsevier commenting on mass resignation of the editorial board at its Journal of Informetrics.

It seems one of the demands from the editorial board was that Elsevier make the references it deposits in Crossref open like most major publishers have done.

The reply is instructive where Elsevier bemoans the fact references are received in various styles and "more importantly in natural language", hence they have invested "significantly" in citation extraction tech.

Leaving aside the fact that the real reason Elsevier refuses to do this is probably that it would strengthen competitors to its Scopus product, both commercial (e.g. Dimensions) and open ones, if we take this response at face value, it does seem odd that Elsevier and other publishers are not mandating saner practices like submitting citations in RIS/BibTeX.

Why this state of affairs? Can we do better?

The amazing thing about this is that, unlike in the case of scholarly publishing, where you can see why one party wants to keep the status quo, the current state of affairs seems counterproductive for everyone.

I can't think of anyone who benefits from this. Sure, academic librarians get a lot more research consultations from students who want them to "go through" their references to ensure that they are 100% correct just in case. But I'm pretty sure most academic librarians would prefer to forgo all this and help out with more interesting and important aspects of research, such as discussing how scholarship is a discussion, rather than going through the mechanics and rules of citations.

Perhaps people behind reference managers would benefit from this? Maybe, but as we have established they don't work as well as they should.

The only group (and it is a tiny one) that I can imagine benefits from this state of affairs for sure is the people in charge of the style guides.

In this APA blog post, one of them suggests there is a reason for so many styles: it is simply for signalling purposes, such that knowing how to use a style "marks its user as a member of a specific culture". I'm afraid I'm not sympathetic to this argument. Plenty of undergraduates can do a passable APA style; does that really mark them as members of the psychology culture?

All in all I don't get it.


A better method?


Just before I pressed the publish button, Todd Carpenter referred me to this very instructive piece he wrote in 2014, Why Are Publishers and Editors Wasting Time Formatting Citations? My blog post covers most of the same ground as his: the inefficiency of thousands of styles, why we should be encouraging reference manager use and the submission of references in structured format, etc.

But I'm struck most by his proposal.

The idea is this

"When authors are submitting references, why doesn’t the community simply send in a reference that is submitted like this:


"

This is simple, elegant and efficient. By using permanent IDs (such as ORCIDs and DOIs) as much as possible, we can benefit a lot from machine-readable data and linked data techniques. Of course, in reality you might want some redundancy with strings in case someone messes up the DOI etc.

Getting everyone to agree is of course the tough part. Again, this blog post by APA paints a nice picture of how getting everyone to agree on a style will lead to arguments on the most minute points like use of periods, abbreviations vs spelling out, capitals etc.

All this strikes me as rule making and rule following for no good reason. After all, it seems to me that the main purpose of a reference is to cover "Who created the reference", "When was the reference created", "What is this reference called" and finally "Where you can find this reference". Do we really need to worry so much about things like periods, commas and case sensitivity?




Acknowledgements: This article has been years in the making and has been influenced by discussions online at various forums such as LSW and on Twitter. As mentioned, it was highly influenced by Sebastian Karcher and Philipp Zumstein's article on citation styles and, most recently, Todd Carpenter's Why Are Publishers and Editors Wasting Time Formatting Citations?