VALREC (Validating Repository Contents) Project Outline

1. Background

The process of self-archiving in institutional repositories has, in the past, been hampered by research worries about publisher's policies regarding copyright and pre-publication. Thanks largely to registries like SHERPA/ROMEO[1], which demonstrate the majority support of publishers for the use of institutional repositories, these issues have been substantially addressed. However, it is relatively unusual for a publisher to permit the author to deposit the publisher's PDF, i.e. the final output of the refereeing and copy-editing process complete with house style and corporate branding. In other words, a distinction remains between the author's final version and the publisher's final version of the same work. For research purposes, this distinction is not significant, although the publisher's version may be easier to read (better formatted, fewer errors of language, style and consistency) and may have extra features (such as hyperlinked bibliographies).

Experience from the JISC IRRA project[2] has shown that there remains a crucial distinction between these two sources of research material: trust. The publisher, perceived as a disinterested third party, is seen as a more trustworthy source of material that the author's institutional repository. The author may either be careless (in providing an out-dated or inaccurate version of the published article) or, in exceptional circumstances, devious (by manipulating the article to make it look more significant than the peer-reviewing process established).

Additionally, the JISC Linking UK Repositories study[3] showed that repository managers would welcome a means to relate pre-prints, post-prints and the final published version in order better to formalise and monitor the scholarly record of the institution and that the entrepreneurial community, building services on the top of repository content, needs to understand this relationship in order to develop trusted products. This proposal, therefore, is firmly based upon the results of investigations already funded by JISC and is clearly in line with the needs of repository managers.

2. Aims and objectives

2.1 The problem with researchers' use of repositories

A great deal of money and effort is being expended in the development of repositories in the UK but there is, at present, a reluctance among the research community to fully engage with the process of depositing their scholarly research articles into their institutional or other repository, and indeed to searching those same repositories as part of their normal workflow. Even if repositories can be filled through the use of mandates, it is much less possible to mandate researchers to use them as a source of primary information. Based on market research that has been done among researchers, it is clear that a key obstacle to them placing their wholehearted trust in the scholarly output that may be found in repositories is the uncertainty about whether the material they find in a repository is the 'final' or 'accredited' (i.e. published) version.

Although the publisher per se does not add any scientific or scholarly information to the author's final version, it is typically the publisher's version that is accepted as being the 'final' version of a research article. Researchers are sometimes wary of relying on pre-prints in case there have been substantive changes between the pre-print and the final published version. Researchers may also be wary of trusting an author's post-print in case the publisher's copy-editing process has eliminated some significant flaw in wording. In the case of research assessment, there is the further anxiety that the author's post-print may have been open to tampering in order to boost its apparent significance. Until the research community has a means to judge the worth of research articles in repositories in terms of their similarity to the final version, in most subject areas they are set to remain an inferior option to sources of material that have been validated by the publisher and delivered from the publisher's web site. We believe this will act as a brake on the development and acceptance of repositories as a resource of first choice for researchers and is, therefore, an issue that must be tackled directly as a matter of some urgency.

2.2 A proposed solution

The key objective of this project is to develop and test a service designed to validate the scholarly research articles found in repositories. The system of validation will perform three essential tasks:

This system will inject a significant degree of confidence into the process of using repositories and will benefit a range of key stakeholders:

Figure 1: Mock-up of a certificate

Figure 1 shows a mock-up of the form that a certificate might take when presented to a user: it shows a thumbnail of the title page of each version, together with its official URI, DOI or URL and a checksum to guarantee the veracity of each file. Following this is a list of the individual editing differences between the two files, and a high level overview that describes the categories into which the edits fall (minor typographical changes, grammatical corrections and differences of layout). The above mock-up is based on genuine data, extracted by a script from two actual PDF versions of a single paper[4] (the first from the author's repository[5], the second from the publisher's site[6]). Figure 1 is merely a human-readable form of a digitally signed XML certificate; appropriate W3C standards and technical recommendations for the transfer of signed digital artefacts (e.g. XML Digital Signatures and WSTrust) will be deployed to provide confidence in the veracity of the certified statement.

Building on the work of the EU Provenance consortium,[7] we will ensure that the certification process is transparent, verifiable and open to auditing.




[4] Moreau, L., Dickman, P., and Jones, R. 2005. Birrell's distributed reference listing revisited. ACM Trans. Program. Lang. Syst. 27, 6 (Nov. 2005), 1344-1395



[7] In provenance terms, the proposed PDF certification process can be described as a post hoc recovery of process documentation