Architecture

Nuno Macedo edited this page Dec 12, 2016 · 61 revisions

PTCRISync Architecture

Overview

(WIP)

  • Synchronizes at profile level
  • Suggests updates to the local CRIS profile
  • Updates remote ORCID works sourced by the CRIS
  • Persistent notion of "synched"
  • Runs export, then import

ORCID Service

ORCID is a community-based service that aims to provide a registry of unique researcher identifiers (an ORCID iD) and a method of linking research outputs to these identifiers, based on data collected from external sources. Since an ORCID user profile is populated by these different external sources automatically, a user profile typically contains different works that actually describe the same research output (possibly containing different or even contradictory meta-data). The distinctive feature of the ORCID service is that, to ease the management of the profile, works that describe the same output are grouped together, showing only the preferred one in the web interface overview (more on this process below).

Data model

The ORCID data model is described by an XSD schema. This data model supports the registration of various relevant research activities, including research outputs, funding projects and professional career. As of PTCRISync v0.2, version 2.0rc2 of the ORCID data model is supported. Moreover, PTCRISync v0.2 handles solely the synchronization of research productions (i.e., ORCID works). The synchronization of the remaining activities is [work in progress]. Thus, for the purpose of PTCRISync, an ORCID user profile is simply a set of works.

The ORCID schema supports both the definition of (full) works and work summaries. The latter contain information regarding external identifiers, titles, production type and publication dates. The former contain additional meta-data like contributors, venue information and a short description. The synchronization algorithms will often rely solely on work summaries to avoid retrieving full works, which would take a toll on performance (see below).

ORCID uses putcodes to uniquely identify activities within a user profile. These are automatically generated for newly created works, and can be used to update existing works.

(Privacy)

External identifiers, production matching and groups

The main conceptual difference between ORCID and typical CRIS services is that ORCID groups together works that are considered to represent the same production. The grouping mechanism is quite simple, and just assumes that two works are similar if, and only if, they share an external identifier or there is another work that is similar to both. Essentially, this recursive definition considers two works to be similar if, and only if, they share directly or indirectly (via transitivity) some external identifier. External identifiers are standard identifiers (e.g., DOI codes) that are assumed to uniquely identify productions. As of version 2.0 of the ORCID schema, these groups are now intrinsic to the ORCID data model.
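The grouping rule described above (works are similar when they share an external identifier directly or transitively) can be illustrated with a small union-find sketch. This is only an illustration of the rule as stated in the text, not ORCID's actual implementation; the function name `group_works`, the input shape (putcode mapped to a set of identifier strings) and the sample identifiers are assumptions made for the example.

```python
from collections import defaultdict


def group_works(works):
    """Group works that share, directly or transitively, an external identifier.

    `works` maps a putcode to the set of external identifiers of that work
    (a hypothetical input shape). Returns a sorted list of sorted putcode lists.
    """
    parent = {putcode: putcode for putcode in works}

    def find(x):
        # Follow parent links up to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # external identifier -> first putcode seen carrying it
    for putcode, ids in works.items():
        for ext_id in ids:
            if ext_id in first_owner:
                # Shared identifier: merge the two groups.
                parent[find(putcode)] = find(first_owner[ext_id])
            else:
                first_owner[ext_id] = putcode

    groups = defaultdict(list)
    for putcode in works:
        groups[find(putcode)].append(putcode)
    return sorted(sorted(group) for group in groups.values())


# Made-up identifiers: works 1 and 3 share nothing directly,
# but both share an identifier with work 2.
works = {
    1: {"doi:10.1/a"},
    2: {"doi:10.1/a", "eid:2-s2.0-1"},
    3: {"eid:2-s2.0-1"},
    4: {"doi:10.1/b"},
}
print(group_works(works))  # [[1, 2, 3], [4]]
```

Works 1 and 3 end up in the same group only because work 2 bridges them, which is the transitivity mentioned in the text.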

Each group is composed of a set of work summaries and a set of external identifiers, aggregating the external identifiers from all the works in the group. The collection of work summaries within a group is ordered: the first work of the collection is considered to be the one preferred by the user. To retrieve the complete meta-data of a work, the user must first find its putcode (the internal identifier) and then explicitly request it (more on the ORCID API below).

The ORCID service forbids works from the same source to share external identifiers: if an external source tries to add a work with an external identifier that is shared with a previous work owned by that source, a conflict error is returned. The ORCID team also expects most of the works in its database to have at least one external identifier associated, so the API forces every work that is introduced to have some external identifier assigned (even if it is an identifier that only makes sense for the external source). As of 2.0rc2, this is still not enforced in the user web interface.

PTCRISync builds on this notion of matching local CRIS productions with remote ORCID works based solely on these external identifiers. The set of supported identifiers is the same as the one supported by ORCID. The PTCRISync synchronization procedures rely on this matching both to identify local works that need to be updated and to identify productions that are yet to be represented locally.
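The matching described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PTCRISync algorithm itself: `match_profiles`, its input shapes and the sample identifiers are hypothetical. It reports which local productions could receive new identifiers and which remote groups have no local counterpart.

```python
def match_profiles(productions, groups):
    """Match local productions against remote ORCID groups by external identifier.

    `productions` maps a local key to its set of external identifiers.
    `groups` is a list of identifier sets, one per ORCID group.
    Returns (updates, unmatched): `updates` maps a local key to the remote
    identifiers it lacks; `unmatched` lists the indices of groups that share
    no identifier with any local production.
    """
    updates = {}
    unmatched = []
    for index, group_ids in enumerate(groups):
        matched = False
        for key, local_ids in productions.items():
            if local_ids & group_ids:  # at least one shared identifier
                matched = True
                missing = group_ids - local_ids
                if missing:
                    updates.setdefault(key, set()).update(missing)
        if not matched:
            unmatched.append(index)
    return updates, unmatched


productions = {"p1": {"doi:10.1/a"}, "p2": {"doi:10.1/b"}}
groups = [{"doi:10.1/a", "handle:1"}, {"doi:10.1/c"}]
updates, unmatched = match_profiles(productions, groups)
print(updates)    # {'p1': {'handle:1'}}
print(unmatched)  # [1]
```

No title, author or date comparison is involved: the match depends only on identifier overlap, as the text states.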

ORCID Member API

ORCID provides two distinct APIs: one that is public and another that is reserved for members. The public API allows any user or service to read the public profile of a user, while the Member API allows clients to add and remove information from the user’s ORCID profile. It also allows reading information from the user’s profile that the user has set as semi-private, unlike the public API that reads only public items. As of version 2.0 of the ORCID API, the Member API allows services to add, update and delete works. Interested services can register for Member credentials here.

The Member API relies on a 3-legged OAuth authentication protocol between the ORCID service, the user and the Member client (i.e., the interested CRIS service). Once the CRIS service is registered as a Member, the following "dance" is performed to access the data from a user's ORCID profile: get an authorization code from the user, use the authorization code to request an access token from the ORCID service, and use the token to perform calls to the Member API. Authorization codes are requested for specific scopes that restrict the permissions of the client. These scopes cover permissions to read activities, update activities and update the user's biographic information. As of version 0.2, PTCRISync does not require permissions to update biographic information.

Communication with the ORCID service is performed through its RESTful API on the ORCID iD of a user (resource /[orcid_id] of the request URL). For the purpose of PTCRISync 0.2, we focus on API calls over works (i.e., resource /works of the request URL). GET requests can be used either to retrieve every work summary (if called on /[orcid_id]/works) or, once the putcode of a particular activity is known, to retrieve the complete work record (if called on /[orcid_id]/works/[putcode]). POST requests can be used to create new works (/[orcid_id]/works), while PUT and DELETE requests can be used to update and delete a particular work, respectively (/[orcid_id]/works/[putcode]).
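The mapping between operations and requests described above can be summarized in a small helper. This is a sketch that only mirrors the resource paths given in the text; the function name `works_request` and the operation labels are hypothetical, and the base URL, API version prefix, headers and authentication are deliberately left out. The iD used in the example is the sample ORCID iD that ORCID uses for documentation purposes.

```python
def works_request(orcid_id, operation, putcode=None):
    """Return the (HTTP method, resource path) pair for an operation on works.

    Operation labels ("list", "create", "read", "update", "delete") are
    names chosen for this sketch, not part of the ORCID API.
    """
    collection = f"/{orcid_id}/works"
    if operation == "list":      # every work summary of the profile
        return ("GET", collection)
    if operation == "create":    # new work; ORCID generates the putcode
        return ("POST", collection)
    methods = {"read": "GET", "update": "PUT", "delete": "DELETE"}
    if operation not in methods:
        raise ValueError(f"unknown operation: {operation}")
    if putcode is None:
        raise ValueError(f"operation {operation!r} requires a putcode")
    return (methods[operation], f"{collection}/{putcode}")


sample_id = "0000-0002-1825-0097"
print(works_request(sample_id, "list"))
print(works_request(sample_id, "update", putcode=1234))
```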

Works added by a member service will automatically have their source set to that member; a fundamental constraint is that a member service may only update and delete works whose source is itself. The service is also unable to directly modify the set of preferred works selected by the user, although that may happen indirectly if a preferred work is deleted from the ORCID profile, or if a new work unifies two groups of similar works, which may only have one preferred work. Due to this fact, the impact of an update is not the same as a delete/create sequence of operations. In the latter case, it is not always clear how the new preferred work of the group is selected by the ORCID service (i.e., how the grouped works are ordered).

Requirements for CRIS services

(WIP)

Typically, the profile of a user in a PTCRIS service consists of sets of different research activities, although this report is concerned only with research productions. It is also assumed that this profile contains the ORCID iD of the user. The data contained in each production also varies with the CRIS service. For the synchronization framework, a production is assumed to contain at least a key, an internal identifier that uniquely identifies it within a CRIS user profile; a (possibly empty) set of external identifiers; and a boolean field indicating whether it is selected to be exported to ORCID. A production may only be selected to be exported if it contains at least one external identifier (in order to follow the ORCID guidelines to avoid works without external identifiers), and if two productions share external identifiers, only one of them may be selected (due to the restriction on unique external identifiers from the same source). Maintaining these constraints may not be a trivial affair: for example, when an external identifier is added to a production, two productions that previously shared no external identifiers may now share some. There are several valid alternatives for enforcing these constraints: for example, ask the user which of the productions should no longer be exported, or simply deselect all conflicting productions from being exported. A production will typically also contain varied additional meta-data such as its title, publication year, publication type, authors, etc.
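The two export constraints above can be checked with a short routine. The sketch below assumes a minimal `Production` record with the three fields named in the text; the class, the field names and the function `export_conflicts` are hypothetical and each service would adapt them to its own data model.

```python
from dataclasses import dataclass, field


@dataclass
class Production:
    key: str                                        # internal identifier
    external_ids: set = field(default_factory=set)  # possibly empty
    exported: bool = False                          # selected for export?


def export_conflicts(productions):
    """Return the sorted keys of exported productions that break a constraint.

    A production is reported if it is selected for export without any external
    identifier, or if it shares an identifier with another exported production.
    """
    offending = set()
    owner = {}  # external identifier -> key of first exported production using it
    for production in productions:
        if not production.exported:
            continue
        if not production.external_ids:
            offending.add(production.key)
        for ext_id in production.external_ids:
            if ext_id in owner:
                offending.update({owner[ext_id], production.key})
            else:
                owner[ext_id] = production.key
    return sorted(offending)


profile = [
    Production("a", {"doi:10.1/x"}, exported=True),
    Production("b", {"doi:10.1/x", "handle:1"}, exported=True),
    Production("c", set(), exported=True),
    Production("d", {"doi:10.1/x"}, exported=False),
]
print(export_conflicts(profile))  # ['a', 'b', 'c']
```

A service following the "deselect all conflicting productions" alternative could clear the export flag of every key returned; one that prefers to ask the user could present the same list instead.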

As will be presented in the next section, the synchronization framework is semi-automatic and notification-based. Thus, each service will be required to support two kinds of notifications in a user profile: creation notifications, to alert the user that a new production has been found in ORCID; and modification notifications, to alert the user that new external identifiers for an existing production have been found. The latter are particularly useful for propagating external identifiers between different PTCRIS services, in particular from open access repositories, which provide handles for research outputs, to academic CV management services, such as DeGóis. The notification mechanism relies on the keys of modification notifications to point to the production that is to be modified, while the keys of creation notifications are unique and must not overlap with those of the existing productions, as they result in new productions if accepted. Notifications also have a set of external identifiers, which must not be empty for modification notifications, since their goal is precisely to propagate newly found external identifiers. Additionally, a creation notification contains the meta-data associated with the new production. The shape of the meta-data supported by the PTCRIS service may not be the same as the one supported by ORCID and it may not be trivial to convert one format to the other. However, as will be shown below, the synchronization procedure is oblivious to the meta-data information and relies only on external identifiers to match research outputs. Thus, no specific conversion procedure from one format to the other will be imposed, leaving to each PTCRIS service the decision of how to do so, which should be clarified as precisely as possible. In this report, this is abstracted by assuming that the constructor of creation notifications takes into consideration an ORCID work from which the meta-data is extracted. 
If the notification system is to be used for other purposes in the PTCRIS service, this data model must be adapted accordingly.

In this report no specific implementation for PTCRIS services is imposed, as different services may wish to pursue different approaches. Thus, the synchronization procedures presented below take as input a user profile, and are not concerned with how it was obtained nor how it can be incorporated back into the PTCRIS service database. Nonetheless, the PTCRIS services are assumed to at least implement the following methods over a profile: resetNotifications(), that deletes all existing notifications; addCreation(ids : Set(ExternalId), work : Work), that adds a creation notification with a given set of external identifiers and meta-data extracted from a given ORCID work; and addModification(key : String, ids : Set(ExternalId)), that adds a modification notification for a given production (identified by its key) with a given set of external identifiers. Since the role of the synchronization framework is solely to manage the notifications in the user’s profile, the synchronizer is not allowed to modify the productions nor the selection of productions to be exported. The service is also assumed to implement a constructor newWork(ids : Set(ExternalId), production : Production) : Work to create a new ORCID work with a given set of external identifiers and with meta-data extracted from a given PTCRIS production.
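As an illustration of the three notification methods listed above, the following is a minimal in-memory sketch. It is not the PTCRISync interface: the class names, the snake_case method names, the generated key format ("new-N") and the use of a plain dictionary for the ORCID work are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    kind: str                  # "creation" or "modification"
    key: str                   # production key, or a fresh key for creations
    external_ids: frozenset
    metadata: dict = field(default_factory=dict)


class InMemoryProfile:
    """Toy profile exposing the three notification methods assumed by the text."""

    def __init__(self, orcid_id, production_keys):
        self.orcid_id = orcid_id
        self.production_keys = set(production_keys)
        self.notifications = []
        self._counter = 0

    def reset_notifications(self):
        self.notifications.clear()

    def add_creation(self, ids, work):
        # Creation keys must not overlap with keys of existing productions.
        self._counter += 1
        key = f"new-{self._counter}"
        while key in self.production_keys:
            self._counter += 1
            key = f"new-{self._counter}"
        self.notifications.append(
            Notification("creation", key, frozenset(ids), dict(work)))
        return key

    def add_modification(self, key, ids):
        if key not in self.production_keys:
            raise KeyError(f"no production with key {key!r}")
        if not ids:
            raise ValueError("a modification notification needs identifiers")
        self.notifications.append(
            Notification("modification", key, frozenset(ids)))


profile = InMemoryProfile("0000-0002-1825-0097", {"p1", "new-1"})
created_key = profile.add_creation({"doi:10.1/a"}, {"title": "Some title"})
profile.add_modification("p1", {"handle:1"})
print(created_key, len(profile.notifications))  # new-2 2
```

Note that none of the methods touch the productions themselves, in line with the restriction that the synchronizer only manages notifications.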

Synchronization procedures

(WIP)

This section specifies the synchronization procedures that allow a PTCRIS service to keep a user’s profile consistent with the one at ORCID. There are two modular procedures that can be used to synchronize the profiles according to two different modes:

  • Import: this mode aims to harvest new research outputs from ORCID, namely new publications and new UIDs of known publications. The general principle is that every UID in an ORCID profile should be harvested. The synchronization procedure supporting this mode is semi-automatic, based on a notification system, allowing the user to select which outputs or UIDs he wishes to add to his PTCRIS profile.
  • Export: this mode is targeted at PTCRIS services that intend to be ORCID sources and export their productions to ORCID, ensuring that other PTCRIS services can harvest them. The general principle is that every production selected to be exported in the PTCRIS profile should be inserted as a new work in the ORCID profile and then automatically kept up-to-date.

These modes are supported by separate synchronization procedures, named IMPORT(p : PTCRIS) and EXPORT(p : PTCRIS).

The IMPORT procedure does not change the ORCID user profile, managing only the notifications of the input PTCRIS profile p. This semi-automatic approach provides the user with valuable information while still allowing him to control the updates that are effectively applied to the profile. The choice of a notification-based, semi-automatic approach is due to the fact that the ORCID user profile may contain erroneous information (for example, erroneous meta-data); as such, we avoid propagating such errors to the PTCRIS profile, giving the user the opportunity to clean up his ORCID profile beforehand (for example, deleting incorrect works or creating new versions with corrected meta-data).

The EXPORT procedure does not change the input PTCRIS user profile p and manages only works on the ORCID profile whose source is the PTCRIS service. 
The ORCID profile is updated through its API, given the ORCID iD stored in p.

A PTCRIS service may choose to implement only one of the modes (e.g., RCAAP is only concerned with exporting outputs, while the SARIs are concerned with harvesting outputs) or both (e.g., the CV management system DeGóis). In the latter case, EXPORT must be executed prior to IMPORT, since running the EXPORT procedure may change the grouping of works (see Scenario 12). This combined procedure is denoted by:

SYNC(p : PTCRIS) = EXPORT(p); IMPORT(p)

The consistency ensured by both modes is precisely stated in the companion formal specification (with a precise set of constraints that instantiates the above general principles), and the synchronization procedures were designed to satisfy several “well-behavedness” properties concerning such consistency. The most important of those is correctness, namely ensuring that after running the synchronization procedures the user profiles in ORCID and in the PTCRIS service are consistent according to the specification. Another important “well-behavedness” law is stability, ensuring that if the synchronization procedures are run on already consistent profiles the result is the same (modulo differences in the internal identifiers). Having stable synchronization procedures ensures that there is no need to explicitly check the consistency to determine whether they should be run, since running the synchronizers will not affect profiles that are already consistent. In fact, the checking procedures have approximately the same complexity as the synchronizers, and thus no significant performance gains would be achieved by running them beforehand. Doing so could even degrade performance if the user profiles happen to be inconsistent, since both the check and the synchronization would then be run.

Each service is free to choose when to run these synchronization procedures, as long as inconsistencies in the profiles are eventually resolved within a reasonable delay. 
One possible choice would be to run them periodically in the specified order in batch mode, thus avoiding possible delays that can negatively affect the user interface. Premium ORCID members could also trigger the synchronization based on Webhooks Change Notifications from ORCID, by registering to be notified when a user profile changes. Another sensible choice would be to run IMPORT at the beginning of a user session and EXPORT at the end. This ensures that the visible parts of the profiles are consistent when the user is logged out, but that whenever he logs in again the correct notifications are shown. We believe that invoking the synchronization procedures every time the user performs an edit within a session may be counterproductive, as new notifications might keep popping up and confuse the user. As in distributed systems, the goal of the synchronization framework is to ensure eventual consistency and not necessarily real-time strong consistency among all services.
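The composition SYNC(p) = EXPORT(p); IMPORT(p) and the stability property discussed in this section can be sketched as follows. The two procedures are passed in as plain functions, and the stubs used in the example are purely hypothetical stand-ins (a real EXPORT would call the ORCID Member API). The check compares profiles for exact equality, so it ignores the "modulo internal identifiers" allowance mentioned in the text.

```python
import copy


def sync(profile, export_procedure, import_procedure):
    """Run EXPORT before IMPORT, since EXPORT may change the grouping of works."""
    export_procedure(profile)
    import_procedure(profile)
    return profile


def is_stable(profile, export_procedure, import_procedure):
    """Check that running SYNC a second time leaves the profile unchanged."""
    once = sync(copy.deepcopy(profile), export_procedure, import_procedure)
    twice = sync(copy.deepcopy(once), export_procedure, import_procedure)
    return once == twice


def stub_export(profile):
    pass  # placeholder: nothing to export in this toy example


def stub_import(profile):
    # Rebuild the notifications from scratch on every run (hypothetical data).
    remote_ids = {"doi:10.1/a", "handle:1"}
    profile["notifications"] = sorted(remote_ids - set(profile["known_ids"]))


def appending_import(profile):
    # Counter-example: piles up duplicates on every run, so it is not stable.
    profile["notifications"].append("handle:1")


profile = {"known_ids": ["doi:10.1/a"], "notifications": []}
print(is_stable(profile, stub_export, stub_import))       # True
print(is_stable(profile, stub_export, appending_import))  # False
```

The counter-example shows why an import that accumulates notifications instead of recomputing them would violate stability.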

Group merging

The main conceptual difference between ORCID and typical CRIS services is that ORCID automatically groups productions that are considered the same into a single group. Two productions are considered the same if they share an external identifier, and this relation is transitive. In the ORCID web interface, the user is able to select which work of the group is preferred, which is the one that will be publicly displayed in the profile.

To synchronize CRIS profiles with ORCID profiles, ORCID work groups must be merged into single productions. Due to the central role of the external identifiers in ORCID and PTCRISync, the merging of a group proceeds as follows:

  • collect every external identifier from every work that comprises the group
  • collect the remaining meta-data from the work of the group selected as preferred by the user
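The two merging steps above can be sketched as a single function. The sketch assumes that each work summary is a dictionary with an `external_ids` set plus arbitrary meta-data fields, and that the first summary in the list is the preferred one (as stated earlier for ORCID groups); the function name and field names are hypothetical.

```python
def merge_group(summaries):
    """Merge an ordered, non-empty group of work summaries into one production.

    The first summary is taken to be the user's preferred work.
    """
    if not summaries:
        raise ValueError("a group contains at least one work")
    preferred = summaries[0]
    # Step 2: remaining meta-data comes from the preferred work only.
    merged = {k: v for k, v in preferred.items() if k != "external_ids"}
    # Step 1: external identifiers are collected from every work in the group.
    merged["external_ids"] = set().union(*(s["external_ids"] for s in summaries))
    return merged


group = [
    {"title": "Preferred title", "type": "journal-article",
     "external_ids": {"doi:10.1/a"}},
    {"title": "Other title", "type": "journal-article",
     "external_ids": {"doi:10.1/a", "handle:1"}},
]
merged = merge_group(group)
print(merged["title"])                 # Preferred title
print(sorted(merged["external_ids"]))  # ['doi:10.1/a', 'handle:1']
```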

Quality criteria

Every PTCRISync procedure relies on a set of quality criteria over the productions that are to be synchronized, both for the remote ORCID works that are to be imported and for the local CRIS productions that are to be exported.

To improve the performance of the procedures, these criteria are defined solely over the work summaries returned by the ORCID API, and not over the full works (which would require additional calls to the API).

To pass the quality criteria, a work must have:

  • at least one external identifier assigned
  • the title
  • the work type
  • the publication year (unless the work is a data set or research technique)
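The criteria listed above can be expressed as a simple predicate over a work summary. This is a sketch under assumptions: the summary is modeled as a dictionary with hypothetical field names, and the type labels "data-set" and "research-technique" are assumed spellings for the two exempted work types.

```python
# Assumed labels for the work types exempt from the publication year check.
YEAR_OPTIONAL_TYPES = {"data-set", "research-technique"}


def passes_quality(summary):
    """Return True if a work summary satisfies the quality criteria."""
    if not summary.get("external_ids"):   # at least one external identifier
        return False
    if not summary.get("title"):          # the title
        return False
    work_type = summary.get("type")
    if not work_type:                     # the work type
        return False
    if work_type not in YEAR_OPTIONAL_TYPES and not summary.get("publication_year"):
        return False                      # the publication year, unless exempt
    return True


article = {"external_ids": {"doi:10.1/a"}, "title": "T",
           "type": "journal-article", "publication_year": 2016}
dataset = {"external_ids": {"doi:10.1/d"}, "title": "D", "type": "data-set"}
undated = {"external_ids": {"doi:10.1/u"}, "title": "U", "type": "journal-article"}

print(passes_quality(article), passes_quality(dataset), passes_quality(undated))
# True True False
```

Since only summary fields are inspected, the check needs no extra API calls for full works.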