What’s in a URI? Part I: The School of Salamanca, the Semantic Web and Scholarly Referencing


Starting from experiences of the philosophical and legal-historical project “The School of Salamanca. A digital collection of sources and a dictionary of its juridical-political language”, this article discusses an experimental approach to the Semantic Web.1 It lists both affirmative reasons and skeptical doubts related to this field in general and to its relevance for the project in particular. While for us the general question has not been settled yet, we decided early on to discuss it in terms of a concrete implementation, and hence the article will also describe preliminary goals and their implementation along with practical and technical issues that we have had to deal with.

In the process, we have encountered a few difficult questions that (as far as we could determine) involve arguably systematic tensions between key technologies and traditional scholarly customs. The most important one concerns referencing and citation. In the following, I will describe a referencing scheme that we have implemented. It attempts to combine a canonical citation scheme, some technologies known primarily from semantic web contexts and a permalink system. Besides the details of our particular technical approach and the very abstract considerations about risks and benefits of the semantic web, I will point out some considerable advantages of our approach that are worth pursuing independently of a full-blown semantic web offering.

This emphasis on the details and advantages of our referencing scheme is necessary because only then will it be possible to discuss its shortcomings with regard to persistent identification in a balanced way. But in order not to overload the present article, this latter discussion will be postponed to an upcoming follow-up blog post.

[toc]

I. The project

The project “The School of Salamanca. A digital collection of sources and a dictionary of its juridical-political language” is a long-term project (altogether 18 years, at the moment in its fourth year) of the Academy of Sciences and of Literature, Mainz, aimed at improving the possibilities of studying the discourse of the 16th- and 17th-century theologians, jurists and philosophers of the so-called School of Salamanca. This discourse had (besides other theological, philosophical and jurisprudential aspects) a strong focus on political and jurisprudential questions that came to be of seminal importance for the development of Western political and juridical thought. Yet the discourse as such, its internal structure and differentiation, its development and its limits are still under-investigated.2 Hence, even the very identification of the School of Salamanca is a contested matter: Is it a “School” and if so, in what sense (what is a “School”, to begin with)? Who does or does not belong to it, and on which grounds? etc. The project determined ca. 115 representative texts to be edited that have been published by authors of the School of Salamanca or in its wider context, concern politics and jurisprudence, and were widely received at the time. These texts are in both Latin and Spanish, range from several dozen to several thousand pages and come from a variety of genres, academic and pragmatic alike. One of the main goals of the project is to provide free access to those texts, i.e. to offer an online reading view and a powerful full-text search, and to lay open their interconnections via hyperlinks. As its other main goal, the project aims to complement the texts with a dictionary of a number of key terms, the development of which is traced throughout the discourse of the School. In general, the project thus has editorial and philological interests as well as systematic conceptual/theoretical ones.
While one can never know the interests that such resources generate and attract, the intended main audience is the community of legal and political philosophers and historians.

The digital edition of the texts is based on TEI/XML files representing the individual works.3 Since those files easily weigh in with several MB, rendering them on-the-fly for the web application’s reading view would cause intolerable delays for the user, and since even pre-rendered HTML versions of the data are still too heavy, we have decided to pre-render each TEI file as a series of shorter HTML fragments that can then be incrementally loaded in the background while the user is already reading the first one (so-called “lazy loading”). The reading view also allows the display of the corresponding facsimile scan images alongside the text and some operations via a context menu provided on a per-paragraph basis (e.g. browser bookmarking, printing or pdf export, or retrieval of some characteristic bits of information4). This context menu shall also provide a citable reference string for the current section for scholars to use in their own discussions of the texts provided in the digital edition.

II. Semantic web: promises and provisos

Besides the reading view, and reflecting the project’s interest in exploring the relations and interconnections of the discourse, we are testing and evaluating ways of offering the information as semantic data: Some biographical information about the authors, some bibliographical and structural information about the works, and some information about their respective citation profile can reasonably be rendered and offered as RDF data. The approach is not to produce any information for this specifically, but rather to extract and expose information that is present in the TEI files of works and dictionary articles anyway. This should open additional ways of interacting with our data to external scholars and projects. But it should also enable us ourselves to query our data for salient information more flexibly and more intelligently than we could do using full-text search only. We could, for example, explore relations between personal networks and discursive phenomena by asking questions like “Throughout our 115 texts, show me the section headings of those sections in which students of Francisco de Vitoria, the founder of the School of Salamanca, are mentioned”; “Of those members of the School of Salamanca that are mentioned in the same sections, show me their religious order (and also show me that of the author of the respective work containing those sections)”, etc. In addition, it could be interesting to combine the information we provide ourselves with that of other scholarly projects, like a biographical database about scholastic authors (Schmutz 2008), a project about the networks of early modern Jesuit science (Mrozik 2016) or digital library projects of the relevant time and parallel discourses (Agenjo 2012; Sytsma 2010), once such information is offered in interoperable ways as well.

On the other hand, outside of digital humanities circles there are strong doubts about the value of the semantic web to academic scholarship. Introductions to it usually refer to the DBpedia project (DBPedia 2016) as providing one of the largest datasets available, and an undeniably cross-sectional one at that. Thus, they regularly trigger a set of reservations scholars have with regard to Wikipedia as an academic resource. Since similar reservations even surface when the scholars are invited to refer to information that authority databases, such as the “Gemeinsame Normdatei” (gnd) of the German National Library, provide, the issues are probably more general.5 I cannot rely on systematic investigations of this complex, but in my experience the uneasiness that befalls scholars who are asked to integrate popular resources like Wikipedia, citizen science projects or “#alt-ac” initiatives originating in archives, libraries, think tanks, nonprofits, museums, etc.6 is linked to worries about the hard-to-assess quality of those resources and about how the scholars’ own results and reputation might be tainted and held liable for mistakes in information that they effectively have no control over. In fact, every scholar probably knows some other scholar whom she considers methodologically negligent, and refrains from citing their works, no matter how affirming for her own position they might be. Others may be perceived as unqualified or even “dangerous”. It is of course quite possible to work even with doubtful information, but then the scholar usually holds that she is capable of (implicitly) factoring in the caveats she holds in this regard.
In any case, the establishment of trust and reputation (of “authority”) in one’s own scientific community is delicate, oftentimes informal or implicit work, consisting in part of symbolic practices (such as positioning oneself with regard to the cultural and linguistic academic context that is considered to be the currently “leading” one by balancing the references to sources in particular languages, or reporting on sources the revisability of which for other readers may practically be non-existent).7 It seems quite reasonable for scholars then not to let technical feasibilities and supplies determine their practices and to resist handing over the control of research processes to algorithmic mechanisms, in particular when these are known to be ignorant of the processes’ “symbolic” and “political” dimension. In order to gain any traction in these academic contexts, linked data strategies should, in my opinion, explicitly address issues such as the handling of quality control, the management of trust, the assessment of reputation and the assumption of liability. I will come back to this point again.

A noteworthy particular form of the mentioned general uneasiness is related to the (relative) necessity for a semantic web strategy to commit to standardized vocabularies or ontologies. Of course it is possible to define the vocabulary used in one’s own dataset oneself, but the whole point of linked data is that identifiers and categories are shared across many datasets by many providers. After all, this is what facilitates the analysis of information originating from different providers and the (more or less) easy analyzability of one’s own data for users that are not already very familiar with the project. On the other hand, the fact that the vocabulary is normally defined and managed outside of the particular project often makes its expressiveness questionable: It may be too coarse to describe the information available in the project and thus lead to loss of information and even to ambiguous statements; or it may be too specific, suggesting misleading interpretations of the bits of information that it encodes. Consider e.g. the lack of a distinction between the notions of Work, Expression, Manifestation and Item in the popular BIBO vocabulary (D’Arcus/Giasson 2009; Shotton 2011); the broad definition of the seemingly precise rel:enemyOf property (cf. Davis 2004), or the peculiar notion of the sor:inCahootsWith property (cf. Calzada-Prado 2015); the applicability to any Catholic of the TEI element <affiliation> with regard to the Roman Catholic Church (TEI 2016, section “13.3.2.1 Personal Characteristics”); or the cultural relativity of “spouse” that serves as an example of a name of a relation between two people in the TEI guidelines (TEI 2016, section “13.3.2.3 Personal Relationships”). In many contexts, these notions may pose no problems and be fitting, but in other contexts, a project may feel that its particular interpretations and knowledge of the domain would be misrepresented by the available vocabularies.
However, the line between legitimate concerns of the described type and the so-called “NIH syndrome” fallacy (“Not invented here”, the tendency to ignore externally originated knowledge) is hard to draw. Hence, it seems important that, as a central aspect of their data provisioning activities, projects discuss possibilities both of extending existing vocabularies and of generalizing their own notions to allow for cross-domain intelligibility. Our project’s take on this and the previous question will be described below.

III. From TEI to RDF

a. TEI

As mentioned above, our approach in providing linked data relies on exploiting information that is already present in the TEI/XML files of our collection of sources and of dictionary articles. The latter comprise small biographical articles about the authors, the former contain bibliographical meta-information as well as structural and other information incorporated in the texts, including instances where other authors of the School of Salamanca and cross-references to their works are mentioned. The biographical articles make use of TEI elements from the namesdates module to express information about e.g. birthdate and -place of our authors, religious orders, education and academic career, sometimes about colleagues and associates, professors to the lectures of whom they have listened or students who have been present in the courses of the respective author.8 The works, on the other hand, make use of <persName>, <bibl>, <title> and structural elements such as <note>, <paragraph>, (typed) <div>, <head>, and they often rely on the @ref attribute and a private URI scheme (cf. TEI 2016, section “16.2.3 Using Abbreviated Pointers“) to associate such elements with authors (in the case of the <persName> element) or with (passages of) works (in the case of <title> elements9).

b. Evaluation of vocabularies/ontologies

In order to generate a mapping of the information that is implicit in our TEI files to some RDF vocabulary, I have investigated a couple of vocabularies, and in some cases there were competing alternatives. I have applied three general criteria (on an intuitive rather than systematic scale). The first is the popularity of the vocabulary, i.e. how easily can our data blend in with other datasets? Can other researchers use our data without having to translate it beforehand? Besides general impressions resulting from web searches and reading up on the vocabularies and on other projects’ decisions, one approach to assess the popularity of a vocabulary is to look it up at prefix.cc (Cyganiak 2013), LODstats (LODstats 2012) and similar pages. While these may not give a complete picture, let alone one applicable to the particular domain one is interested in, they can still deliver some indication. Second, is the granularity of the vocabulary fit to receive the information contained in the markup or would we incur loss of information? For example, if we usually recorded the attitude of a citation, i.e. whether it is critical or affirmative (we do not), we would not want to lose this information in the translation to linked data. If the vocabulary for expressing citations did not differentiate between such attitudes, we should look for an alternative that does. Finally, the third criterion is the correspondence between the definition of the vocabulary’s terms and the interpretations entertained by the project. If our biographical articles contain TEI <relation> elements expressing the fact that two of our authors have been opponents in a trial of the Spanish Inquisition, the antagonistOf property of the relations vocabulary could be adequate.
However, if this vocabulary’s use of the term referred rather to the narratological elements protagonist/antagonist, it would be less so — perhaps the actual definition of the term (“A person who opposes and contends against this person”) is still inconclusive and one would even have to consider the vocabulary’s usage in other projects. Likewise, it is debatable whether the apprenticeTo / mentorOf properties can adequately reflect the relation between 16th century students and their professors in a Spanish university. (In some cases, it might be feasible to encode one piece of information in two vocabularies. A work mentioning another work can be expressed both with the cites property of the CiTO vocabulary and with the mentions property of the schema.org/book vocabulary. If extraction is automatic and technical limits allow for it, there is nothing in the way of encoding this in both ways.)

A more pragmatic question is: how easily would other researchers recognize our data to be relevant to their concerns, and how easily can we recognize the vocabulary as adequate to our data? Even though the friend-of-a-friend and social network perspective is somewhat unfamiliar to scholars in the history of ideas and legal philosophy, people (who in the end have to define queries that computers can then solve) tend to know what a foaf:person is more often than what a dc:agent or even a crm:e21_person is (cf. Pattuelli 2012). And vice versa, how easily can we determine, based on our data, and without thorough study of hundreds of pages of vocabulary documentation or even formal specification, which property we can use to mirror the semantic content of, say, a TEI <orgName> element? Does a <persName> element represent a differentiatedPerson or an undifferentiatedPerson, and do we have to choose one of those two in the gnd vocabulary? And if yes, can we establish an automatic mapping or would the vocabulary/data force us to decide on a case-by-case basis?

c. Mapping and producing the RDF dataset

I have built a mapping of our TEI encoded information to RDF vocabularies, relying mostly on the Semantic Publishing and Referencing suite of ontologies (the SPAR ontologies FaBIO, CiTO, DoCO; cf. Peroni 2014), and on the foaf, bio and relationship vocabularies (Brickley/Miller 2014, Davis/Galbraith 2003, Davis 2004): I have linked both work and author information to the respective page of our web application via the rdfs:seeAlso property. For works, I have mapped structural information such as TEI <frontmatter>, <volume>, <division> or <note> elements to the corresponding Document Components Ontology (DoCO) classes. I have associated headings with labels of the respective structural components and kept track of the components containing the respective entity. TEI pagebreaks are mapped to pages of the FRBR-aligned Bibliographic Ontology (FaBIO), and associated both with their position in the edition’s reading view and with their facsimile scan image via rdfs:seeAlso properties. I have used the Citation Typing Ontology’s CitationAct class to be able to distinguish between citations in the context of an actual literal quote from the referenced work on the one hand (cito:includesQuotationFrom) and those where a work is mentioned but no literal quote is given (cito:citesAsRelated). For each of these instances, I also record the referenced work (cito:hasCitedEntity) and the document component containing the reference (cito:hasCitingEntity). For biographical data, the Birth and Death classes of the BIO vocabulary can associate our authors (bio:principal), a date (dcterms:date) and a place (bio:place) with the respective event. I have used the foaf:made property to associate authors and works, and am exploring possibilities of making systematic use of TEI <relationship> elements, mapping them to relationship ontology classes and of associating our authors with entries in authority databases via the owl:sameAs property (on the difficulties and reservations, see above). 
Finally I am probing possibilities of mapping the TEI <orgName> elements to the W3C’s organization ontology’s memberOf property (cf. W3C 2014) — if these elements are children of <affiliation>/<education>/<occupation> elements in a certain context, that is.
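
To make the shape of the resulting assertions more tangible, here is a minimal sketch in Python that prints the triples for one single citation in N-Triples syntax. It is not taken from the project's code: the helper names, the cited work "W0004" and the "#cit1" suffix for the citation act are made up for illustration, and only the class and property names come from the description above.

```python
# Namespace URIs of the vocabularies involved.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CITO = "http://purl.org/spar/cito/"
ID = "http://id.salamanca.school/"


def ntriple(subject, predicate, obj):
    """Serialize one assertion with three URI terms as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def citation_triples(act, citing, cited):
    """Describe one citation act: its type, the citing component, the cited work."""
    return [
        ntriple(act, RDF + "type", CITO + "CitationAct"),
        ntriple(act, CITO + "hasCitingEntity", citing),
        ntriple(act, CITO + "hasCitedEntity", cited),
    ]


citing = ID + "works.W0013:vol2.2.article1"  # document component containing the reference
cited = ID + "works.W0004"                   # hypothetical cited work
act = citing + "#cit1"                       # hypothetical identifier for the citation act

triples = citation_triples(act, citing, cited)
print("\n".join(triples))
```

Every further piece of information (a heading, a page, a birth event) adds a few more lines of this kind, which is why even a handful of works quickly yields a large number of triples.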

For processing reasons, I express this mapping in an XML file where subject, predicate and object of the target RDF assertions can be given as literal values or as XPaths. Especially this latter technique is heavily used, so that a part of the mapping looks a bit like this:

[xml]
<subject type="uri" prepend="http://id.salamanca.school/works.{/@work}:">
//div[$repeatIndex]/@xml:id
</subject>
<predicate prefix="rdfs">seeAlso</predicate>
<object type="uri" prepend="http://www.salamanca.school/">
//sal:node[@type eq 'div'][$repeatIndex]/sal:crumbtrail/a/@href
</object>
[/xml]

This is not quite the real code,10 but you can see how the predicate is defined as rdfs:seeAlso (line 4) and how subject and object entities are generated by traversing XPath expressions (lines 2 and 6) in TEI files (in the case of the subject), respectively in a project specific index file (in the case of the object) where we keep track of which XML element ended up in which of the html fragments that make up the complete work (we need this information to construct the final URL for the reading view that is supposed to open at the correct position in the text after all).
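
What evaluating such a statement amounts to can be imitated in a few lines of Python. The following is only a sketch of the idea, not the xTriples implementation: the two XML snippets are toy stand-ins for a TEI file and for our index file (without namespaces, to keep things short), and the function name is invented.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under this expanded name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Toy stand-ins for a TEI file and the project-specific index file (both invented).
TEI = """<text>
  <div xml:id="Vol02Lect02"/>
  <div xml:id="Vol02Lect02Art01"/>
</text>"""

INDEX = """<index>
  <node type="div"><crumbtrail><a href="en/work.html?frag=0018_Vol02Lect02&amp;wid=W0013"/></crumbtrail></node>
  <node type="div"><crumbtrail><a href="en/work.html?frag=0018_Vol02Lect02&amp;wid=W0013#Vol02Lect02Art01"/></crumbtrail></node>
</index>"""


def see_also_triples(tei_xml, index_xml, work_id):
    """Pair the n-th <div> of the TEI file with the n-th 'div' node of the index."""
    divs = ET.fromstring(tei_xml).iter("div")
    nodes = ET.fromstring(index_xml).findall(".//node[@type='div']")
    triples = []
    for div, node in zip(divs, nodes):  # zip() plays the role of $repeatIndex
        subject = f"http://id.salamanca.school/works.{work_id}:{div.get(XML_ID)}"
        target = "http://www.salamanca.school/" + node.find("crumbtrail/a").get("href")
        triples.append((subject, "rdfs:seeAlso", target))
    return triples


for triple in see_also_triples(TEI, INDEX, "W0013"):
    print(triple)
```

The sketch assumes that both files list the sections in the same order, which is also what the shared repeat index in the mapping presupposes.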

This mapping is then taken as a configuration file for the generic RESTful xTriples webservice that has been developed by Torsten Schrade at the Digital Academy in Mainz (Schrade 2015). This service takes a configuration file consisting roughly of statements like those in the example, retrieves the resources necessary to evaluate the XPath expressions and generates an RDF dataset in a variety of possible output formats. In our case, the configuration for extracting information out of works has 800 lines, that for extracting information out of biographical articles has 320 lines. Running this on just four works produced 110,000 RDF triples in a recent test run. We keep these as RDF/XML files in our XML database and additionally feed them into a Sesame server, to be able to query them. Since runtime is a serious problem (the mentioned test ran for more than 20 hours), I am using a local, patched instance of the xTriples service and continuously try to improve it, e.g. by introducing a caching mechanism.

IV. Publishing the data

Producing the dataset may be one challenge, but it is only half the task. For once the dataset is ready, questions of how to present it pose themselves (if not earlier): Where and how shall it be accessible? How can researchers, and how can software learn about it?

a. Entity dereferencing and content negotiation

One of the tenets of the semantic web is the usage of HTTP URIs as references for entities (cf. Berners-Lee 2006, Thompson 2006, SWEO 2008). This means that, if I want to refer to pages, volumes, chapters or paragraphs as entities that I am describing in the dataset, I have to come up with an HTTP address for each one of them, and preferably one that is resolvable: If an external agent, a researcher, search bot or research tool wants to retrieve the information we have about such an entity, it should be possible to just take the identifier, send a request to this very address, and get useful information from the server listening there. Since we aim to provide the users of our reading view with references to paragraphs and sections of the works, too, combining the two referencing schemes suggests itself. Hence we have thought a lot about a scheme for such URIs that would be stable, unequivocal, practical to use for human readers of our online edition and capable of addressing all the entities that our RDF dataset cares about.

Another fundamental idea of the semantic web is the distinction between entities and their (various) representation(s). Just as a work is an abstract concept that may be expressed in several texts of different languages, so a chapter is, first of all, an abstract concept, that may be represented by the text it contains (coded e.g. either in XML, in HTML or as plaintext), by the collection of images that the text occupies in a certain edition of the work, or even just by metadata such as the chapter’s title, its subject matter, its position in the structure of the work, the subsections, footnotes or quotations it contains etc.11 Also, it is impossible to pipe an abstract concept or a physical entity such as a page, i.e. a sheet of paper, or even one of our authors themselves, through the wire even though the user/agent/client literally might have asked us to do so. Hence, we cannot help but answer with one of several possible representations of the entity that was requested, and the agent might better be served by delivering some relevant bits of the Salamanca RDF dataset, or a scan image of a page in a book, or a snippet of plaintext, or the web application’s reading view of a certain work at a certain position. As soon as you are capable of offering all these different representations, you have to figure out how to differentiate between them. Hence I have implemented and established different services providing the various representations of the entities, that

  • listen at some “talking” server address like “<http://tei.salamanca.school>”, “<http://data.salamanca.school>” or “<http://www.salamanca.school>”, waiting to deliver TEI/XML, RDF or HTML data, respectively,12
  • can take requests for particular resources in a uniform way,
  • parse the exact resource requested,
  • and, if possible, deliver the resource’s representation that is indicated by the subdomain name.

But on the other hand, we certainly also want to keep the notion that all these different representations are representations of the same entity. But if, in order to maintain this latter idea, you refrain from giving different addresses to these different representations, how can you tell which of them it is appropriate to respond with? I have thus also implemented a content negotiation service listening at a generic “<http://id.salamanca.school>” server address that does more or less the same, except it factors in a ranking of acceptable media type preferences that the client can reveal when it sends the request, thus determining the best representation of the requested resource, and forwarding the client to the most appropriate representation, i.e. to one of the servers described in the previous paragraph. For example, if a graphics viewer and an internet browser request a “page” at this generic service and do so advertising the media types they can deal with (any of several compatible image formats in the first case, and a text/html page in the second), the server can deliver a scan image to the first client and the text that is on the page to the second.
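
The core of such a negotiation step can be sketched as follows. This is a simplified illustration and not our actual service: the table assigning media types to the three servers is my assumption, image formats are left out because no address for an image service has been named here, and the function names are invented.

```python
# Assumed mapping of media types to the representation-specific services.
REPRESENTATIONS = {
    "text/html": "http://www.salamanca.school",
    "application/rdf+xml": "http://data.salamanca.school",
    "application/tei+xml": "http://tei.salamanca.school",
}


def parse_accept(header):
    """Turn an Accept header into (media type, q) pairs, best first."""
    preferences = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((fields[0].lower(), quality))
    # sorted() is stable, so equally ranked types keep the order of the header
    return sorted(preferences, key=lambda pref: -pref[1])


def negotiate(accept_header, resource, default="text/html"):
    """Return the address the generic service would forward the client to."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in REPRESENTATIONS:
            return f"{REPRESENTATIONS[media_type]}/{resource}"
        if media_type == "*/*":
            break  # anything goes, so fall back to the default
    return f"{REPRESENTATIONS[default]}/{resource}"


print(negotiate("application/rdf+xml, text/html;q=0.5", "works.W0013:vol2.p51"))
```

A browser that prefers text/html is thus sent to the reading view, while a linked data client asking for application/rdf+xml ends up at the data service, both starting from the very same identifier.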

With this setup, it is possible to address the parts of the texts in a transparent, uniform and practical (brief) way, i.e. what comes after the server name should have the same meaning in all the different services; it should be easy for machines to parse, but it should also be easy for humans to understand, remember, copy and use. Since many of the texts we are editing are classical texts in their respective discipline, we have taken inspiration from the Canonical Text Services’ URN scheme for addressing works and parts of works of classical antiquity (Blackwell/Smith 2014, CTS 2016). For the reasons described above, we are not using URN (as the CTS suggest) but HTTP URI addresses, and (currently) we are not making use of the full potential of the scheme (such as referencing single words or even single characters in a (sub-)passage of a work). But so far we have adopted the general idea: First, we specify a document (by means of a collection, e.g. “works”, “authors”, “lemmata” etc. and an identifier within that collection, e.g. “W0013”, “A0100”, separated by a period). After the specification of the document, and separated by a colon, comes the specification of the passage within the work as a period-delimited hierarchy of structural components, e.g. “1.3.42” for the 42nd paragraph of the third chapter of the first lecture or the 42nd subsection of the third question of the first part, depending on how the document at hand uses and labels its structural components. For a couple of entities, we depart from this scheme, though: Volumes are given with a “vol” prefixed to the volume number, “titlepage”, “frontmatter” and “backmatter” make for genuine eponymous sections (giving e.g. “vol1.frontmatter.2.2” for the second paragraph of the second section of the frontmatter), TEI milestones take into account their @unit and @n attributes (in our case, we often have “article” units, giving e.g. “article16”), notes are prefixed with “n”, pages with “p”, and entries of dictionaries and indices have their lemma upper-cased and are prefixed with “entry”, e.g. “entryABBAS”. It is important, however, not to formulate too many such special rules, so that the scheme remains recognizable and reproducible (in mente) as such. For one of the advantages of such a canonical citation scheme is that the scholar may build a working reference on her own. When she knows for instance that a certain discussion is to be found in the first article of the second lecture of the second volume of Vitoria’s Relectiones, she only needs to know the number of the work in the context of our edition (W0013) and can confidently refer to “<http://id.salamanca.school/works.W0013:vol2.2.article1>”, or to “<http://id.salamanca.school/works.W0013:vol2.p51>” if she knows that it is on page 51 of the second volume.
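
Because the scheme is this regular, taking such a reference apart or putting one together needs very little code. The following sketch is an illustration only; the names `Reference`, `parse_reference` and `build_reference` are invented for this example and do not refer to anything in our web application.

```python
from typing import NamedTuple
from urllib.parse import urlparse

BASE = "http://id.salamanca.school/"


class Reference(NamedTuple):
    collection: str  # e.g. "works", "authors", "lemmata"
    document: str    # e.g. "W0013"
    passage: tuple   # e.g. ("vol2", "2", "article1"); empty for a whole document


def parse_reference(uri):
    """Split a reference into collection, document and passage hierarchy."""
    path = urlparse(uri).path.lstrip("/")
    document_part, _, passage_part = path.partition(":")
    collection, separator, document = document_part.partition(".")
    if not separator or not collection or not document:
        raise ValueError(f"not a collection.identifier reference: {uri!r}")
    passage = tuple(passage_part.split(".")) if passage_part else ()
    return Reference(collection, document, passage)


def build_reference(collection, document, passage=()):
    """Assemble a reference from its parts, as a scholar would do by hand."""
    uri = f"{BASE}{collection}.{document}"
    if passage:
        uri += ":" + ".".join(passage)
    return uri


print(parse_reference("http://id.salamanca.school/works.W0013:vol2.2.article1"))
print(build_reference("works", "W0013", ("vol2", "p51")))
```

Note that the sketch only checks the overall shape of a reference; whether a given passage actually exists in the work can only be decided by the server.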

In the case of the HTML representation, i.e. of our reading view, this redirection is particularly useful. It allows masking some bits of information that are present in the final URI, but are somewhat irritating in most interactions of users with the reading view: This includes the name of an html page providing the template in which the current resource is rendered (i.e. “work.html”), the currently selected language of the web application’s user interface as a path element or as a parameter, and the internal name of the html fragment where the targeted node resides after the whole document has been split in parts in order to decrease loading times. E.g. the passage mentioned above, http://id.salamanca.school/works.W0013:vol2.2.article1 is resolved in the background to http://www.salamanca.school/es/work.html?frag=0018_Vol02Lect02&wid=W0013#Vol02Lect02Art01, http://www.salamanca.school/de/work.html?frag=0018_Vol02Lect02&wid=W0013#Vol02Lect02Art01 or to http://www.salamanca.school/en/work.html?frag=0018_Vol02Lect02&wid=W0013#Vol02Lect02Art01 (depending on the language preferences), making for an address that is really quite cumbersome to use. But also in the case of the linked data representation, it allows hiding the infrastructure that is behind the services and that would normally be reflected in path components or in calling a stored XQuery procedure named “extract.xql”; it also allows being format agnostic, i.e. to leave the decision of whether to retrieve an RDF/XML, a turtle or some other encoding of the data up to the negotiation between the server and the (then current) client.
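
The resolution step for the reading view can be pictured roughly like this. Again, this is a sketch under assumptions: the dictionary stands in for the project-specific index file mentioned earlier, its single entry merely reproduces the example above, and the function name is invented.

```python
from urllib.parse import urlencode

# Stand-in for the index: (work, canonical passage) -> (html fragment, anchor in it).
FRAGMENT_INDEX = {
    ("W0013", "vol2.2.article1"): ("0018_Vol02Lect02", "Vol02Lect02Art01"),
}


def reading_view_url(work_id, passage, language="en"):
    """Build the technical reading view address hidden behind a generic link."""
    fragment, anchor = FRAGMENT_INDEX[(work_id, passage)]
    query = urlencode({"frag": fragment, "wid": work_id})
    return f"http://www.salamanca.school/{language}/work.html?{query}#{anchor}"


print(reading_view_url("W0013", "vol2.2.article1", language="es"))
```

If the splitting of a work into fragments changes at some point, only the index has to be regenerated; the generic link that readers have cited stays the same.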

Hence we are encouraging the use of these generic links in as many places as possible, also internally: They are offered in the context menu of sections of the works, they are used in the cross-references within (e.g. table of contents) and between works (e.g. citations).13 They are used by the web application’s image viewer to retrieve the scan images of the pages, and they are of course used as identifiers for the entities in our RDF dataset. Moreover, since these links have masked almost all technical contingencies that would be bound to change at some point, we can commit to keeping these URIs functional over a long time period, making them permalinks.

b. Other linked data services: Dumps, SPARQL, LDF

Up to now I have described the setup of identifiers and how they are dereferenceable URIs. But in addition to resolving and visiting such entities, there are also other popular ways of publishing and consuming such an RDF dataset: The most popular ones are dump downloads of the complete dataset and SPARQL endpoints. We have postponed the former but we have fed our data into a Sesame server kindly provided by the Digital Academy in Mainz, so as to be able to experiment with querying our dataset (and linked sets) with SPARQL.14

SPARQL is a language for querying a knowledge base, and it is more tailored to a collection of assertions than SQL, the comparable well-known query language for relational databases.15 Roughly speaking, you define filters for your assertions and you define which pieces of the resulting set of assertions should be returned. For instance, if you ask the SPARQL endpoint to select (i.e. return) “?a ?b ?c” without any restricting filter, every way of filling the variables at the subject, predicate and object positions will be returned, in other words, the complete knowledge set consisting of all the assertions available. If you ask to “SELECT ?a WHERE { ?a rdf:type 'work' . }”, only assertions that have a predicate of “rdf:type” and the literal string “work” as an object remain in the result set, and of those, only the subject, i.e. the entity URI, is going to be returned. Variables can also be referenced inside the filter expression only: The query “SELECT ?a WHERE { ?x rdf:type 'work' . ?x rdfs:label ?a . }” will build a result set from assertions describing some entities as being of type ‘work’, and assertions where those same entities (note how the “?x” occurs in both parts of the filter expression) are assigned labels; and then it will return all those labels.
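
The way the shared variable ties the two parts of the filter together can be imitated with a toy matcher in Python. This is only meant to illustrate the semantics of the last query; it is not how a SPARQL engine is implemented, and the four assertions as well as the "ex:" identifiers are invented.

```python
# A tiny, made-up knowledge base of (subject, predicate, object) assertions.
TRIPLES = [
    ("ex:w1", "rdf:type", "work"),
    ("ex:w1", "rdfs:label", "Relectiones"),
    ("ex:a1", "rdf:type", "person"),
    ("ex:a1", "rdfs:label", "Francisco de Vitoria"),
]


def match(pattern, triple, binding):
    """Extend the variable binding so that pattern fits triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was given by an earlier pattern.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def select(variable, patterns, triples):
    """Return the values of one variable over all ways of satisfying every pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return [binding[variable] for binding in bindings]


query = [("?x", "rdf:type", "work"), ("?x", "rdfs:label", "?a")]
print(select("?a", query, TRIPLES))
```

The label of the person is not returned, because "?x" has already been bound to the work by the first pattern and cannot be bound to a different entity by the second.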

Such expressions can be nested, some mathematical or other functions can be added and you can even ask the server to include assertions from other knowledge bases, provided these in turn offer a SPARQL endpoint as well (this is called “federated queries”). This makes the language, but also the underlying knowledge very powerful — provided the assertions in fact do refer to one another. For example, it should technically be possible to ask “Which authors are cited in works of scholars who have lived in dioceses the bishops of which have been influenced by Francisco de Vitoria, the founder of the School of Salamanca?”, taking into account information from our collection of works, from biographical articles, and from external sources like geographical or ecclesiological databases. The main reason why this is not fully demonstrable is that projects that establish and provide such information are missing, are reluctant to publish their data as linked data, or they do provide linked data, but no SPARQL endpoint that would allow their data to be queried in this way. Unfortunately, this is the case for some important institutions in the humanities, such as the Consortium of European Research Libraries (CERL) or the Online Computer Library Center (OCLC), a globally active IT service provider for libraries, providing e.g. the well-known WorldCat service.

One reason for this reluctance may be that executing SPARQL queries can easily become expensive in terms of time and resource use. This is because the server has to do all the combinatorial work that is required by the filter expression (and maybe it has to wait for some data to be returned by an external service before it has to do even more combinatorial work). One approach to deal with this conundrum is to put some more burden on the client: It receives a SPARQL query from the user and translates the filter expression to a query that relies on a combination of “triple pattern expressions”, a simpler type of query. Then it can send these simpler queries to the server(s), receive relevant (albeit too large) sets of assertions from it (or them) and do the combinatorial logic client-side. While the server is thus still responsible for delivering result sets for (possibly any) combinations of subject, predicate and/or object value, this is far less complex than a full SPARQL service and it is a finite effort, the results of which are even cacheable. On the other hand, it relies on a specific intelligence on the client software’s side. The Linked Data Fragments project (LDF 2013) investigates and develops technologies in this direction. On its homepage, there are both server and client libraries and prototypes in several languages available, and while it is too early to determine the sustainability of the project and its software, we are following its development closely and are experimenting with providing an LDF server for our data.
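The division of labour described here can be sketched in a few lines of Python. This is only an illustration of the idea and does not use the actual Linked Data Fragments software or its interfaces: the function `fragment` stands in for a server that answers nothing but single triple patterns (and whose answers can therefore be cached), while `labels_of_type` plays the client that combines two such answers itself. All names and data are assumptions made for the example.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy data standing in for a remote dataset (identifiers are invented).
_DATA: Tuple[Triple, ...] = (
    ("ex:W0013", "rdf:type", "work"),
    ("ex:W0013", "rdfs:label", "Relectiones"),
    ("ex:P0001", "rdf:type", "person"),
    ("ex:P0001", "rdfs:label", "Francisco de Vitoria"),
)


@lru_cache(maxsize=None)
def fragment(s: Optional[str], p: Optional[str], o: Optional[str]) -> Tuple[Triple, ...]:
    """'Server' side: answer one triple pattern; None acts as a wildcard.

    Identical requests always yield identical answers, so they are cacheable.
    """
    return tuple(
        triple
        for triple in _DATA
        if all(want is None or want == got for want, got in zip((s, p, o), triple))
    )


def labels_of_type(type_name: str) -> List[str]:
    """'Client' side: do the join over two simple requests locally."""
    labels: List[str] = []
    for subject, _, _ in fragment(None, "rdf:type", type_name):
        for _, _, label in fragment(subject, "rdfs:label", None):
            labels.append(label)
    return labels
```

The trade-off is visible even at this scale: the server function contains no join logic at all, but the client has to issue one request per matching entity and may receive more assertions than it finally needs.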

c. Describing the dataset and the services: VoID

One task that up to now has not been discussed is the description and advertisement of the dataset and the associated services. While a prose description of the dataset and API, say on a documentation page of the project’s web presence, is certainly desirable, the description features elements that a client should be able to process automatically. For example, harvesting bots need to be able to tell whether or not the data is published under a license that allows its use in the desired way. The association of the copyright holding authors and/or institution with the dataset, or just with parts of it, should be transparent from every point in the dataset (maybe different parts of the set fall under different licenses, so this should be expressible as well). Also, interested parties should be able to get a quick impression of the vocabularies and ontologies that the data use, of some subject matter identifiers that describe it, or of the volume of data, i.e. the number of triples.

Finally, as discussed above, the assessment of trust and confidence is a crucial factor in academic practices, and it is facilitated by an explicit description of quality assurance measures, information about the provenance of information and about partners collaborating in generating the data. (It is an open question to which extent this can be processed automatically, but explicitly formulating such factors at least allows for an evaluation by a human reader and interpreter and as such is at least a confidence-building measure.)

We have implemented this using the Vocabulary of Interlinked Datasets (VoID). VoID is the result of several years of work in the W3C’s Semantic Web Interest Group, which provides a vocabulary definition and a description of how to use it (SWIG 2011). It works on the basic notion of a dataset class to which, using popular vocabularies such as Dublin Core or FOAF, properties like creators and other contributors, copyright and license, project homepage and creation date can be attributed. The notion of dataset is crucial, especially insofar as boundaries of datasets in the linked open data “cloud” are notoriously hard to discern, giving rise once more to the scholarly reservations that have been mentioned. Here, datasets can be identified (and delimited) by making the space of URI values explicit, by pointing to endpoints for SPARQL searches, and by listing example URIs for single resources.

VoID also defines subsets of a dataset; in particular, it defines a linkset class that is a special subset, consisting just of relations of the present dataset to other datasets. This makes it possible to articulate collaborating partners and other contributors, associating them with a subset of the dataset; it also makes it possible to specify different levels of trust and confidence associated with different subsets of the data. While this is a very important option, we have not yet investigated adequate ontologies and vocabularies, but will surely turn to this soon.

Here are some salient excerpts of the void.ttl file describing our dataset:

[code]
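# Prefix declarations (the default namespace is assumed to be the file itself)
@prefix : <#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix void: <http://rdfs.org/ns/void#> .

# The description document itself
<> a void:DatasetDescription ;
    dcterms:title "A VoID description of the School of Salamanca Dataset" .

# Our dataset: URI space, an example resource and the SPARQL endpoint
:Salamanca a void:Dataset ;
    void:uriSpace "http://id.salamanca.school/" ;
    void:exampleResource <http://id.salamanca.school/works.W0013> ;
    void:sparqlEndpoint <http://t.spatialhumanities.de/openrdf-workbench/repositories/svsal> .

# The links from our dataset to the GND authority file
:Salamanca_GND a void:Linkset ;
    void:target :Salamanca ;
    void:target :GND ;
    void:linkPredicate owl:sameAs .

:GND a void:Dataset ;
    foaf:homepage <http://d-nb.info/gnd> ;
    dcterms:title "Gemeinsame Normdatei" ;
    void:exampleResource <http://d-nb.info/gnd/118594893> .

:Salamanca_Project a foaf:Project ;
    rdfs:label "Projekt ‘Die Schule von Salamanca. Eine digitale Quellensammlung und ein Wörterbuch ihrer juridisch-politischen Sprache’"@de ;
    foaf:homepage <http://www.salamanca.school/> ;
    foaf:mailbox <mailto:info@salamanca.adwmainz.de> .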
[/code]

A final aspect to consider is how to deploy this file so that users and agents can find the information: We provide the void.ttl file at the root of our data service (<http://data.salamanca.school/void.ttl#Salamanca>) and have a redirection to it from the “/.well-known/void” path (which is standardized and known by clients, so that it is a default location that they can look to first). Thirdly, we link to it in every single resource RDF file, which starts with:

[xml]
<rdf:Description rdf:about="">
  <void:inDataset rdf:resource="void.ttl#Salamanca"/>
</rdf:Description>
[/xml]

Consider for example <http://data.salamanca.school/W0013.rdf>, a file containing information about the work we internally identify as work number 13, the Relectiones by Francisco de Vitoria: The empty subject of the assertion at the beginning of the file (and, by the way, also at the beginning of the void.ttl file) means that the predicate and object of this assertion apply to the current document itself, the W0013.rdf file in this case. Thus, every document in which semantic information is published contains a declaration that it is part of a dataset which is, in its totality, described in the void.ttl file. All three approaches are suggestions found in the interest group’s documentation.16
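How an agent might follow these pointers can be sketched as follows. The Python snippet below is an assumption-laden illustration, not a description of any existing client: it parses a small RDF/XML document of the shape shown above (the surrounding `rdf:RDF` element and its namespace declarations are supplied here so that the sample is complete), resolves the relative `void:inDataset` reference against the URL of the document, and also computes the “/.well-known/void” fallback location. The function names are hypothetical, and `xml:base` declarations are deliberately ignored.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOID_NS = "http://rdfs.org/ns/void#"

# Sample document; the rdf:RDF wrapper is added so that the snippet parses.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:void="http://rdfs.org/ns/void#">
  <rdf:Description rdf:about="">
    <void:inDataset rdf:resource="void.ttl#Salamanca"/>
  </rdf:Description>
</rdf:RDF>"""


def dataset_uris(rdf_xml: str, document_url: str) -> List[str]:
    """Return absolute URIs of all datasets the document claims to be part of."""
    root = ET.fromstring(rdf_xml)
    found: List[str] = []
    for node in root.iter(f"{{{VOID_NS}}}inDataset"):
        reference = node.get(f"{{{RDF_NS}}}resource")
        if reference is not None:
            # Relative references are resolved against the document's own URL.
            found.append(urljoin(document_url, reference))
    return found


def well_known_void(document_url: str) -> str:
    """Return the default location to try when a document carries no pointer."""
    return urljoin(document_url, "/.well-known/void")
```

For the document URL `http://data.salamanca.school/W0013.rdf`, the first function resolves the relative reference to the dataset URI given in the text, and the second yields the well-known path on the same host.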

V. Open questions and conclusion

In this last section of the first post, I want to list some difficult issues that we have encountered and summarize some of our aims and the corresponding affordances that the approach and implementation described here can provide. The issues are of different types, some are conceptual and some are rather social or cultural. They do cast serious doubts on the scholarly integrity and justifiability of the semantic web strategy, however, and are so critical that, unless they can be resolved, the project is not going to implement it as an official, supported offer. What we have done so far, and what we are going to continue doing, is keeping an eye on developments, exploring the techniques in general and, of course, ways of coping with the issues identified here in particular.

a. Conceptual issues

One of the more practical (and most likely solvable) issues is the modelling of the temporal dimension of some of our data (cf. Mynarz 2013, Raymond/Abdallah 2007). This is most obviously desirable for biographical and relationship data, such as the affiliation of an author with a university or with a religious order, which has beginning and ending dates. Unlike TEI relationship tags, the tripartite subject-predicate-object structure of RDF triples cannot simply take up additional attributes for such dates, which means that the representation of the knowledge that is incorporated in our TEI files needs to take some detour. One of the typical approaches in such cases is to translate the relation into beginning and ending events that are then rendered as entities in their own right. These event entities can then be the subject of a whole set of assertions, predicating the class of the event, the involved persons and their roles, beginning and ending dates, location information etc. The disadvantage of this strategy is the added complexity and effort that such a creation of virtual entities and the flattening to the tripartite structure entails.17 This is only partly remedied by the automatic creation of triples, since it concerns not only the creation but also the querying of the data: It is not easily foreseeable how complex the queries would have to become if users wanted to exploit such information, and whether or not this complexity is too discouraging.
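The detour via event entities, and the price paid for it at query time, can be illustrated with a small Python sketch. It is not taken from our xTriples configuration; the `ex:` vocabulary (`ex:AffiliationBegin`, `ex:agent`, `ex:organisation`, `ex:date`), the identifiers and the dates are placeholders invented for this example, and a real implementation would rather draw on an existing vocabulary such as the Event Ontology.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def local_name(uri: str) -> str:
    """Return the part after the prefix, e.g. 'person1' for 'ex:person1'."""
    return uri.rsplit(":", 1)[-1]


def affiliation_as_events(person: str, organisation: str,
                          start: Optional[str], end: Optional[str] = None) -> List[Triple]:
    """Translate one dated affiliation into begin/end event entities."""
    triples: List[Triple] = []
    for kind, date in (("begin", start), ("end", end)):
        if date is None:
            continue
        # Each event becomes an entity of its own (hypothetical naming scheme).
        event = f"ex:event-{local_name(person)}-{local_name(organisation)}-{kind}"
        triples += [
            (event, "rdf:type", f"ex:Affiliation{kind.capitalize()}"),
            (event, "ex:agent", person),
            (event, "ex:organisation", organisation),
            (event, "ex:date", date),
        ]
    return triples


def begin_date(triples: List[Triple], person: str, organisation: str) -> Optional[str]:
    """Find the begin date again: three conditions must hold for one event."""
    def subjects(predicate: str, obj: str) -> set:
        return {s for s, p, o in triples if p == predicate and o == obj}

    events = (subjects("rdf:type", "ex:AffiliationBegin")
              & subjects("ex:agent", person)
              & subjects("ex:organisation", organisation))
    for s, p, o in triples:
        if s in events and p == "ex:date":
            return o
    return None
```

A single TEI relation with two date attributes thus turns into eight assertions, and the question “since when?” requires combining four patterns instead of reading one attribute, which is the kind of complexity the paragraph above worries about.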

A second issue has been mentioned already in the section about dataset descriptions: While it is somewhat clear where information about confidence and trust, about partners and provenance, quality control policies and pledges can be recorded, it is not at all obvious how such information should be modeled — after all, it concerns aspects that usually remain informal and implicit in scholarly practice — and whether the project can count on it being adequately processed and taken into account by the consumers of our data.

The third issue concerns the topic of the upcoming blog post: The permalinks identifying passages of works as described above currently do not provide versioning information. While the CTS standard defines a position for such information (as a further, period-separated element of the work specification), it is not easy to see what such information should be referring to: Our single basic data “source” are the TEI XML files, from which HTML and RDF versions are automatically generated. XML and RDF files are stored in a version control system (VCS), but our web application and database do not have access to the version history information. (Currently, the HTML versions are not even archived in the VCS and are treated like temporary files.) It remains to be discussed and decided whether a CTS-analogous indication that includes some version information would be adequate or whether we should aim for proper persistent identifiers like DOI, Handle or comparable systems; and, further, which representation such a (possibly format-agnostic) version tag should refer to and how that representation could be retrieved when requested. Indicating the revision of the XML files seems natural, but would it be consistent to retrieve and deliver an old version of the XML file when the corresponding old HTML file has probably been deleted already? Would it be preferable to retrieve the old XML file and re-generate an HTML version (using the current HTML rendering functions)? Or are we forced to store HTML files in the version control system as well?

b. Social or cultural issues

Some of the obstacles we have encountered indicate that efforts of persuasion may yet be profitable after all. There are several projects that offer information analogous to, parallel to, or comparable to the information our project is establishing. Unsurprisingly, they are interested in collaboration and crossreferencing to varying degrees — both in terms of general willingness (or capacity) and in terms of technological involvement. In my experience, scholars are quite willing to consider if crossreferencing, integration of heterogeneous data and the corresponding technological investment and commitment could produce benefit for their own project; but this would presuppose some more explanation of the different approaches, their benefit and the required effort, a persuasion that is up to those who see a possible benefit for themselves and want to realize it.

On a related note, I think it is necessary in such discussions to be transparent with regard to and to reflect about the intended, desired and perceived liabilities, responsibilities and forms of credit and acknowledgement that are involved in the respective collaborative initiatives and their results. How should such aspects be determined and how should they be announced?

On the other hand, there is also a good deal of critical self-reflection necessary: the reservations with regard to externally defined and externally controlled vocabularies and data assertions in my opinion need to be articulated more clearly and their reasons weighed carefully. After all, it is a central notion of scientific and scholarly integrity to critically reflect, explain and justify (or relativise) one’s methodological convictions, and in my opinion it is fairly clear that the more traditional methods implicitly make presuppositions that, in the end, are not too different from those aspects of new technologies and their workings that raise such strong doubts. A question that any project should regularly pose to itself is “Do we manage, or do we even aspire to rethink our stated goals, reflect on scholarly responsibility and liability, and on the nature and the consequences of the collaborative character of any scholarly work in our domain?”

c. Summary of some important aims and features

While the actual semantic web strategy is not yet something that we can commit to as a project, in the process of exploring it, we have learned a couple of things that are valuable to us independently of the semantic web and of linked data. Some aspects and efforts that we will maintain in any case are the use of permalinks (as described above), content redirection and URI addressability of details of our texts. Thus, as a conclusion, I will highlight some goals that have crystallized and affordances that we have developed that are side benefits of sorts, but that, in my view, remain considerable.

What we want, and what we can do with the approaches and instruments described above, is the following:

  • In the reading view, scholars are able to grab a URL, i.e. an immediately resolvable URI reference to (parts of) the works, in a practical way, where practical means more specifically the following:
    • we provide reasonably short URLs for each paragraph, section, note, page etc.
    • we provide “talking” references analogous to canonical citation customs, thereby allowing the scholar to eventually recognize some of the structural context of the reference even without actually looking it up
    • those URLs are independent of the technology providing the service and are guaranteed to be maintained permanently

In the realm of Open Data more specifically, while we are continuously pursuing efforts, as a project we are not yet fully convinced of the commitments and liabilities that an involvement in semantic web activities would entail, so that for now the project refrains from putting any emphasis on the interlinking with external data. A second caveat concerns some cases of commitment to external vocabularies or ontologies. An example would be the “antagonist” relation mentioned above. In cases of doubt like this, we will refrain from expressing any such information unless we are sure of the adequacy of the expression. This does not mean, however, that we are not going to make uncontroversial assertions (e.g. bibliographical information using Dublin Core vocabulary) available in RDF representations. Some benefits that we can draw in this realm without incurring any of the mentioned imponderabilities are:

  • Apart from the reading view, the described scheme of URI/URLs, together with the content negotiation mechanism and the RDF offerings, facilitates automatic retrieval and digital analysis of the information, where retrieval and analysis of information means more specifically the following:
    • programs can “talk about” and request information about (parts of) the works (paragraphs, sections, notes, pages etc. can be identified as particular and individual entities each with their own URI)
    • the information returned upon such requests includes chapters’ headings, pages’ scan images, the reading view renditions of the entities, bibliographical metadata about the work as a whole etc.
    • it also includes structural information about the text as a whole, i.e. which marginal notes belong to which paragraph, which paragraphs in turn belong to which chapters etc.
    • finally, it includes information about relations to other works or persons that are articulated in the texts, i.e. about citations, persons being mentioned etc.
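What “programs requesting information” amounts to in practice can be sketched as follows: the same permalink is requested several times, and only the Accept header tells the server which representation is wanted. The snippet merely builds the requests and does not send them. Which media types our server actually honours is not specified in this article, so the mapping in `FORMATS` is an assumption, as is the helper name `build_request`.

```python
from urllib.request import Request

# Assumed mapping of representations to media types (illustrative only).
FORMATS = {
    "html": "text/html",
    "rdf": "application/rdf+xml",
    "tei": "application/tei+xml",
}


def build_request(uri: str, fmt: str) -> Request:
    """Prepare a GET request for `uri`, asking for one representation."""
    return Request(uri, headers={"Accept": FORMATS[fmt]})


# The same identifier, three different wishes:
WORK = "http://id.salamanca.school/works.W0013"
REQUESTS = {fmt: build_request(WORK, fmt) for fmt in FORMATS}
```

Passing one of these objects to `urllib.request.urlopen` would perform the actual retrieval; the point of the sketch is only that the identifier stays the same while the requested format varies.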

Apart from the controversial points that have been described and that will continue to be a matter of discussion, there are some questions that are best understood as perspectives for future development: Their resolution cannot be drawn from theory, specifications or from technological solutions, but will have to result from experiences with running the described services and from feedback given by scholars and users:

  • Which services and which data formats are expected, which are useful? (Is there demand for text excerpts being provided in plaintext, XML, PDF or even ebook formats? Is there demand for tables of contents or lists of mentioned entities provided by section (where applicable) — and if so, should such lists be offered in JSON-LD, RDF/XML, TEI/XML or even CSV?)
  • How public should those services be? We can clearly use them for feeding data into our own web application, or for querying and exploring the data in our “lab”, but is that something that we can and want to offer to the general public indiscriminately (say, without possibilities of introducing and explaining our usage and our interpretations before granting access)?
  • How should we advertise and describe these offerings in general? Should we invest efforts in a prose description on the project’s web presence or rather on refining the formalized VoID description mentioned above? What information about the services is necessary for them to be of any use to their intended audience or to the general public, and what information do we as a project want to push to the consumers of the data?

VI. Up next: Cool URIs don’t change… whereas entities might

The preceding discussion should have portrayed a conceptual and technical scheme of referencing that we have implemented and the considerable advantages that have emerged. One notorious issue has been treated only very briefly in the preceding arguments, however: the stability of the information referred to in the above scheme. While the point of permalinks may be considered to be the persistence of the linking, i.e. the avoidance of references that cannot be resolved (where it is not clear why no resource can be reached at the requested location, nor even if there ever was one in the first place), scholarly practice also relies on assumptions of the persistence of the content that is being referenced. It is vital to the confidence with which scholars refer to other scholars’ assertions in their own publications that (a) the information being referenced cannot, after the act of referencing, change into something that the scholar would have addressed altogether differently, if at all. And it is crucial for the reproducibility of scholarly developments that (b) it is possible to retrieve the particular version that a scholar has worked with and that she has indicated in her reference. Since digital resources and digital results of research may be subject to change and even to continuous development in more dynamic ways than traditional publications, a versioning system seems necessary to manage current and older versions of a resource, and this versioning system must somehow be connected to the referencing scheme.

How to maintain the advantages outlined above and at the same time provide such versioning information is a difficult question that will be discussed in the next post.

Literature

Agenjo Bullón, Xavier (2012),
“Introducción: la Biblioteca Virtual de la Escuela de Salamanca y Linked Open Data”, published in 2012, retrieved 7 Nov 2016 from http://dx.doi.org/10.18558/FIL.
Alarcon, Rosa/Wilde, Erik/Bellido, Jesus (2010),
“Hypermedia-Driven RESTful Service Composition”, in: E.M. Maximilien et al. (eds): ICSOC 2010 Workshops, LNCS 6568, Berlin: Springer, pp. 111–120, retrieved 12 Nov 2016 from http://dl.acm.org/citation.cfm?id=1987698.
Berners-Lee, Tim (2006),
“Linked Data”, Date: 2006-07-27, last change: $Date: 2009/06/18, retrieved 7 Nov 2016 from https://www.w3.org/DesignIssues/LinkedData.html.
Blackwell, Christopher / Smith, Neel (2014),
“An overview of the CTS URN notation”, published 2014, retrieved 11 Nov 2016 from http://www.homermultitext.org/hmt-docs/cite/cts-urn-overview.html.
Bourdieu, Pierre (1984),
Homo Academicus. Paris: Minuit, 1984.
Brickley, Dan / Miller, Libby (2014),
“FOAF vocabulary specification 0.99”, retrieved 12 Nov 2016 from http://xmlns.com/foaf/spec/.
Calzada-Prado, F. Javier (2015),
“SORON: Social Relationships ONtology (Ontology)”, retrieved 12 Nov 2016 from http://purl.org/net/soron.
Cronin, Blaise (2005),
The Hand of Science: Academic Writing and Its Rewards. Lanham: The Scarecrow Press, 2005.
CTS (2016),
The Canonical Text Service (CTS), last changed 2016, retrieved 11 Nov 2016 from http://cite-architecture.github.io/cts/.
Cyganiak, Richard (2013),
“namespace lookup for RDF developers”, retrieved 12 Nov 2016 from http://prefix.cc/popular/all.
D’Arcus, Bruce / Giasson, Frédérick (2009),
Bibliographic Ontology Specification, rev 1.3, published 4 Nov 2009, retrieved 7 Nov 2016 from http://bibliontology.com/.
Davis, Ian/Galbraith, David (2003),
BIO: A vocabulary for biographical information, first issued 7 Mar 2003, last change on 14 Jun 2011, retrieved 12 Nov 2016 from http://vocab.org/bio/.
Davis, Ian (2004),
RELATIONSHIP: A vocabulary for describing relationships between people, first issued on 11 Feb 2004, last change on 19 Apr 2010, retrieved 7 Nov 2016 from http://vocab.org/relationship/. (Used in 3 projects listed in LODstats, in contrast to “Agrelon” and “Soron”, used in 0 projects listed there, all as of 7 Nov 2016.)
DBPedia (2016),
“DBpedia. Towards a Public Data Infrastructure for a Large, Multilingual, Semantic Knowledge Graph”, retrieved 7 Nov 2016 from http://wiki.dbpedia.org/.
Duve, Thomas/Lutz-Bachmann, Matthias/Birr, Christiane/Niederberger, Andreas (2013),
“Die Schule von Salamanca: eine digitale Quellensammlung und ein Wörterbuch ihrer juristisch-politischen Sprache. Zu Grundanliegen und Struktur eines Forschungsvorhabens”, SvSal Working Paper No. 2013-01, urn:nbn:de:hebis:30:3-324011, retrieved 7 Nov 2016 from http://publikationen.ub.uni-frankfurt.de/files/32402/SvSal_WP_2014-01.pdf. An English version of this text is available as “The School of Salamanca: a digital collection of sources and a dictionary of its juridical-political language. The basic objectives and structure of a research project”, SvSal Working Paper No. 2014-01, urn:nbn:de:hebis:30:3-324023, retrieved 11 Nov 2016 from http://publikationen.ub.uni-frankfurt.de/files/32402/SvSal_WP_2014-01.pdf.
Jamali, Hamid R. et al. (2014),
“How scholars implement trust in their reading, citing and publishing activities: Geographical differences”, in: Library & Information Science Research, Volume 36, Issues 3–4, October 2014, pp. 192–202, retrieved 7 Nov 2016, from http://dx.doi.org/10.1016/j.lisr.2014.08.002.
LDF (2013),
Linked Data Fragments. Query the Web of data on Web-scale by moving intelligence from servers to clients, first published in 2013, retrieved 11 Nov 2016 from http://linkeddatafragments.org/.
Lincoln, Matthew (2015),
“Using SPARQL to access Linked Open Data”, published 24 Nov 2015, retrieved 11 Nov 2016 from http://programminghistorian.org/lessons/graph-databases-and-SPARQL.
LODstats (2012),
LODStats: a statement-stream-based approach for gathering comprehensive statistics about RDF datasets, retrieved 12 Nov 2016 from http://stats.lod2.eu/.
McRoberts, Mo (2016),
“Inside Acropolis. A guide to the Research & Education Space for contributors and developers”, retrieved 12 Nov 2016 from https://bbcarchdev.github.io/inside-acropolis/.
Mrozik, Dagmar (2016),
The Jesuit Science Network. Retrieved 7 Nov 2016 from http://jesuitscience.net/.
Mynarz, Jindřich (2013),
“Capturing temporal dimension of linked data”, published 2013, retrieved 11 Nov 2016 from http://blog.mynarz.net/2013/07/capturing-temporal-dimension-of-linked.html.
Nicholas, David et al. (2014),
“Trust and Authority in Scholarly Communications in the Light of the Digital Transition: setting the scene for a major study”, in: Learned Publishing, Volume 27, Issue 2, April 2014, pp. 121–134, retrieved 7 Nov 2016, from https://doi.org/10.1087/20140206.
Nowviskie, Bethany (2011),
“announcing #Alt-Academy”, published Jun 22nd, 2011, retrieved 7 Nov 2016, from http://nowviskie.org/2011/announcing-alt-academy/.
Pattuelli, Cristina M. (2012),
“FOAF in the Archive: Linking Networks of Information with Networks of People.” Final Report to OCLC. Retrieved 12 Nov 2016 from http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf.
Peroni, Silvio (2014),
“The Semantic Publishing and Referencing Ontologies”, in: Id., Semantic Web Technologies and Legal Scholarly Publishing. Cham, Switzerland: Springer, pp. 121-193, retrieved 12 Nov 2016 from http://dx.doi.org/10.1007/978-3-319-04777-5_5. Open Access at http://speroni.web.cs.unibo.it/publications/peroni-2014-semantic-publishing-referencing.pdf.
Raymond, Yves / Abdallah, Samer (2007),
The Event Ontology, published 25 Oct 2007, retrieved 12 Nov 2016 from http://motools.sourceforge.net/event/event.html.
Schmutz, Jacob (2008),
“Scholasticon. Ressources en ligne pour l’étude de la scolastique moderne (1500-1800): auteurs, sources, institutions”, published in 2008, retrieved 7 Nov 2016 from http://scholasticon.ish-lyon.cnrs.fr/Presentation/index_fr.php.
Schrade, Torsten (2015),
XTriples. A generic webservice to extract RDF statements from XML resources, published 30 Apr 2015, last changed on 13 Mar 2016, retrieved 7 Nov 2016 from http://xtriples.spatialhumanities.de/.
Shotton, David (2011),
“Comparison of BIBO and FaBIO”, published June 29, 2011, retrieved 7 Nov 2016 from https://opencitations.wordpress.com/2011/06/29/comparison-of-bibo-and-fabio/.
SWEO (2008),
“Cool URIs for the Semantic Web”, W3C Semantic Web Education and Outreach Interest Group Note 03 December 2008, retrieved 11 Nov 2016 from https://www.w3.org/TR/cooluris/.
SWIG (2011),
“Describing Linked Datasets with the VoID Vocabulary”, W3C Semantic Web Interest Group Note 03 March 2011, retrieved 9 Nov 2016 from https://www.w3.org/TR/void/.
Sytsma, David (2010),
Post-Reformation Digital Library. Published in 2010, retrieved 7 Nov 2016 from http://www.prdl.org/.
TEI (2015),
“ogrophy elements should be in att.canonical”, Ticket #1414, opened in 2015, retrieved 7 Nov 2016 from https://github.com/TEIC/TEI/issues/1414.
TEI (2016),
TEI Guidelines Version 3.0.0. Last updated on 29th March 2016, retrieved 7 Nov 2016 from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/.
Thompson, Henry S. (2006),
“Identity, URIs and the Semantic Web”, published 13 October 2006, retrieved 11 Nov 2016 from http://www.ltg.ed.ac.uk/~ht/eSI_URIs.html.
Verborgh, Ruben (2014),
Serendipitous Web Applications through Semantic Hypermedia (PhD thesis). Ghent University, Ghent, Belgium. Retrieved on 12 Nov 2016 from https://ruben.verborgh.org/phd/.
W3C (2013),
“SPARQL 1.1 Query Language. W3C Recommendation 21 March 2013”, retrieved 11 Nov 2016 from https://www.w3.org/TR/sparql11-query/.
W3C (2014),
“The Organization Ontology. W3C Recommendation 16 January 2014”, retrieved 12 Nov 2016 from https://www.w3.org/TR/vocab-org/.
Weaver, Ryan/Pelham, Leanna (2014),
“A Homepage for your API?”, retrieved 12 Nov 2016 from https://knpuniversity.com/screencast/rest-ep2/api-homepage.
Wozniak, Thomas/Nemitz, Jürgen/Rohwedder, Uwe (eds.) (2015),
Wikipedia und Geschichtswissenschaft. Berlin, Boston: De Gruyter Oldenbourg, 2015. Retrieved 7 Nov. 2016, from http://www.degruyter.com/view/product/433564.

[cite]


  1. I would like to acknowledge the help and insight that many collaborators have provided: As members of the project “The School of Salamanca”, Ingo Caesar, Christiane Birr, Thomas Duve and Matthias Lutz-Bachmann have spent much time discussing the perspectives, issues and possible solutions with me. Torsten Schrade from the Digital Academy in Mainz has not only provided us with the xTriples service that has become such a central element of the strategy, but has also readily helped me with good advice on both the conceptual and the more technical aspects. The fact that I often write “we” in the following reflects this; I can hardly claim the insights, developments and implementations as my very own. On the other hand, the responsibility for any mistakes lies of course solely on my part. I have presented preliminary versions of this at the workshop “Historische Semantik und Semantic Web” of the Working Group “elektronisches Publizieren” of the Union of German Academies in Heidelberg 2015, at the DH2016 conference in Kraków and, as a poster, at DHd2016 in Leipzig; I want to thank all discussants for their critical questions and for their helpful suggestions. 
  2. For a more thorough introduction to the relevance of the School of Salamanca and to the particular problems that the project seeks to address, see the project’s presentation in Duve et al. 2013.
  3. In some cases, these TEI files in turn comprise several TEI subfiles, say, for the volumes of a work, which are included via XInclude directives. 
  4. Not all of these functions are already available. In particular the last bit, display of information per paragraph or section, is currently being tested. It displays the citations, and the mentioned persons and places that the queried section of the text includes, along with their respective counts, in a small window. 
  5. For recent and thorough discussions of various aspects of the role of Wikipedia in academia, see the contributions to Wozniak et al. 2015.
  6. For the term and the sector (if it is one) called “#alt-ac”, see e.g. Nowviskie 2011.
  7. This is not something to be discussed here in detail, and it is in fact not something I have investigated thoroughly at all, but I guess there should be more than enough literature on the evolution of knowledge as a social product to buttress the (rather humble) point I have been making. For a start, literature as diverse as the following can be mentioned: Bourdieu 1984; Cronin 2005; Nicholas et al. 2014; Jamali et al. 2014.
  8. Having established contact with related projects that already offer such information as linked data (Agenjo 2012) or are in the process of evaluating possibilities to do so (Schmutz 2008; Mrozik 2016), prospects of eventual cooperation and mutual integration of data make linked data an all the more interesting field. 
  9. Since a reference often mentions author, work title and a particular passage thereof, it is a somewhat unhappy situation that the whole reference, marked up with the <bibl> element, cannot technically refer to a particular passage in one of the other works. This is because the TEI scheme does not allow the @ref attribute on a <bibl> element. It does allow this in the <title> element, but then again the title does not refer to the specific passage. The Text Encoding Initiative has recognized this demand and has taken measures to allow the attribute also in the <bibl> element in a future revision of the guidelines/schema. (Cf. TEI 2015) At the moment we are keeping some of this information in the <bibl>’s @sortKey attribute and using the actual @ref attributes without passage information. 
  10. Work-in-progress versions of the actual mapping (that is, configuration) files for works and persons can be downloaded at http://files.salamanca.school/svsal-xtriples-work.xml and http://files.salamanca.school/svsal-xtriples-person.xml. Besides (many) statement definitions like the one from the example, only with much more complicated XPath expressions, the configuration also has to include the vocabulary declarations and the definition of the “current” resource that is to be treated as the context of the XPath expressions. 
  11. Apart from the subject matter, this metadata is contained in our dataset. 
  12. The provision of facsimile images and plain text renditions of the text is functional but, as of now, does not yet follow this scheme: the server “http://facs.salamanca.school” does not use our URI scheme (described below) yet, and the plaintext export is available only under “http://api.salamanca.school/txt/”. Both services will soon be made to comply with the scheme described in this article. 
  13. What is not settled yet is the question of whether or how to integrate this scheme with the TEI’s provisions for canonical referencing (cf. TEI 2016, section “16.2.5 Canonical References”). 
  14. Sesame is now Eclipse RDF4J, but at the moment, we are still going with an old version. At some point, upgrading should be reconsidered, as should using alternative storage backends or an alternative solution altogether. However, as long as the endpoint — which is just another name for interface — keeps the same address and uses the same SPARQL standard, such a move should be transparent to users when it happens. 
  15. Cf. W3C 2013. See also a SPARQL introduction for humanists at Lincoln 2015.
  16. What is left to do — besides formulating trust and confidence matters — is launching a “service homepage” so that requests without particular resources do return useful information nonetheless (cf. Weaver/Pelham 2014). This should include a lookup service that can then be declared in the void.ttl file as well (cf. McRoberts 2016). Finally, a hypermedia scheme should be developed to allow clients to navigate the dataset more autonomously (cf. for example Verborgh 2014, Alarcon et al. 2011). 
  17. While in general, so-called reification is frowned upon in LOD circles, this applies mostly to “statement reification”, i.e. not stating assertions directly but creating “statement” resources to which subject, predicate, object and other information are then assigned. This leads to added volume and complexity of data, and it diminishes the “assertiveness” of the knowledge represented. (Statements are not “made”, but rather, as it were, “talked about”.) Creating resources for events, on the other hand, is a case of “relation reification”, which is common and often accepted as a good compromise. It asserts the propositions more directly and does not lead to as big an increase in the number of statements as statement reification does. For a comparison and further alternatives, see again Mynarz 2013.
