(English) TEI XML to Zenodo service published: Automatic depositing the project’s TEI files at a long-term archive

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

The idea: automatic depositing the project’s TEI files at Zenodo

We have been using the github-zenodo integration for a while already with our source code releases. This allows us to deposit our code, update the deposit with new releases and get a persistent identifier for each of the versions. Since we are facing similar requirements for our TEI XML files, I have investigated how we could take profit of this or a similar mechanism. The crucial difference is this: The integration as it is makes deposits from releases/snapshots of the whole github repository, i.e. of all the files that are in the version control system. This is good for software, where all the files depend on each other and make sense only in the context of an encompassing application. But for our TEI sources, it would be better to have deposits for individual files (and persistent identifiers for them) rather than for the collection as a whole.

So I have developed a «TEI2Zenodo» service (in the following just «t2z») that can take care of uploading our files to zenodo. The idea is that a project or an institution that regularly wants to commit TEI XML files to long-term archival can host an instance of it and do its uploads via this instance. I have used it to upload 16 of our source TEI files automatically from our github repository.

Continuar leyendo «(English) TEI XML to Zenodo service published: Automatic depositing the project’s TEI files at a long-term archive»

(English) How to Get to «Seuilla»: Dealing with Early Modern Orthography in a Search Engine

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

One of the central tools for working with the texts in the School of Salamanca‘s digital collection of sources is our digital edition’s search functionality, which is supposed to provide an easy-to-use means for querying single texts or the whole corpus for search terms. Speaking from a technical perspective, the quality of this search functionality – or, more precisely, the search engine that it is powered by – depends on several resources that it needs to be provided with, such as high-quality dictionaries for the languages in the corpus as well as a suitable setup for the search engine’s ‘indexing’ of the texts. As we recently encountered a demand to adjust the functioning of our search engine in a rather fundamental way due to some specific orthographic properties of the texts we are dealing with, I would like to use this opportunity to give a rough (and, hopefully, not too technical) overview of the functioning of our search before going into detail about the problem at hand.

In the case of our digital edition, the search functionality is currently based on a Sphinxsearch engine that provides our web application with search results in the following way: whenever a user triggers a search (say, for the Latin term «lex»), the web application sends a request to the Sphinxsearch engine (that is, a data server independent from the web application) asking for all text snippets that include the search term, or a term derived from its lemma (such as «leges» or «legum»), or even – if they are available in the engine’s dictionaries – translated forms of the term (hence, in this case, we would also find the Spanish term «ley»). These snippets, which are returned by the search server after it has gone through its indexes, are then displayed in the search results view in our webpage. Of course, this works only if the search server has already indexed these snippets beforehand (where ‘indexing’ generally means the process of storing and making the text snippets searchable), so that it can quickly fetch them on demand. Therefore, whenever a new work is being published in our digital collection, text snippets for the respective work are created through the web application, and a nightly indexing routine of the search server regularly requests all available snippets from the web app, thus making sure that the snippets it holds are updated on a day-to-day basis. As one might already guess with regards to the prevalence of the term ‘index(ing)’ in this paragraph, a suitable configuration of the indexing routine is of central importance for a well-functioning search engine, and it is precisely the configuration of the indexing which we recently had to revise fundamentally due to a problem that we were hinted at by users of the search function.

In our Latin and Spanish texts, a frequent orthographic pattern is the interchangeability of certain characters like «u» or «i» such that a word may occur in different forms throughout texts from the same period, or even within the same text, without varying in its meaning. To take a concrete example, it is not unusual in our texts to find a word like «Sevilla» (the city in Spain) also as «Seuilla» or, potentially, even as «Sebilla». In this example, the interchangeability of characters applies to the characters «v», «u», and «b», but it may also involve other frequently-used character pairings such as «i» and «j» (e.g., «ius» vs. «jus»), as well as some more rarely and, for the modern reader, more unexpectedly occurring pairings such as «j» and «x» (e.g., «dejar» vs. «dexar»).

The problem with such interchangeable characters, from a search engine’s point of view, is that the engine does not know about their equivalence and thus cannot yield the whole set of search results for certain queries. For example, if a user searches for the above-mentioned «Sevilla» in its modern spelling, the engine will only return results in which the exact character string «Sevilla» occurs, but not strings like «Seuilla» or «Sebilla».

There are, however, certain ways to cope with this problem. One quick (and, from a system administrator’s perspective, quite lazy) solution would be to advise the user to be aware of this problem and to anticipate it in her search queries. For instance, our search engine allows for the use of ‘regular expressions’ in queries, which makes it possible to replace certain characters by so called ‘wildcards’: in this way, all possible forms of «Sevilla» could be found by using a query string such as «Se?illa», where the «?» question mark stands for any character. As this «solution» would demand quite a lot of anticipation from the user, who may not be aware of all the pitfalls of early modern orthography, it is not a very good option. In general, we want to facilitate the usage of our digital tools for the user as much as possible, and this certainly includes the search functionality as a central component of the toolset.

A second possible solution, then, would be to extend the dictionaries of word forms that are used to index the text snippets in such a way that for any word form containing one or more interchangeable characters we also add all its different forms to the respective dictionary. Unfortunately, this is practically unfeasable in a manual and «scholarly controlled» way, since our dictionaries of word forms currently hold between 600,000 (Spanish) and over 2,000,000 (Latin) word forms (and counting). Any endeavour to manually add word forms – even if one would only stick to important ones, such as names – would always represent an open-ended task. The only comprehensive and (ideally) complete solution thus seems to be an automatic replacing of word forms, and fortunately, this is where the Sphinxsearch engine offers quite handy configuration options.

In the configuration for the indexing routine, Sphinxsearch allows for the definition of certain directives that can be used to enhance, speed up, or tweak the indexing (and, thus, the searchability of words). In particular, there is a charset_table directive that can be used for stating which symbols should be relevant for the index, and this directive also makes it possible to map characters to other characters. Until now, for example, the character mapping configuration read as follows:

charset_type = utf-8
charset_table = 0..9, A..Z->a..z, a..z, U+C0..U+FF, U+100..U+17F, U+1C4..U+233, U+370..U+3E1, U+590..U+5FF, U+400..U+7FF, U+500..U+8FF

Here, we first declare the «utf-8» encoding (which is one of the prevalent encodings for the nowadays ubiquitous Unicode character set) as the encoding to be applied for indexing text snippets. The charset_table field then determines which characters are relevant for the indexing routine, and, if required, which of those characters shall be regarded as equivalent to other characters. In this setting, for example, we have stated that we consider digits between 0 and 9 as relevant (0..9), and that all capitalized characters are to be treated as lowercased characters (A..Z->a..z) such that a query for the term «Sevilla» will also find the lowercased form «sevilla» (of course, we also need to declare these lowercased characters as relevant for indexing through the , a..z, segment). Furthermore, we have extended the set of relevant symbols by including the Latin Extended A character block (that is, Unicode symbols in the range between codepoints «U+0100» and «U+017F»: U+100..U+17F) and certain parts of other Unicode blocks such as Latin Extended B, Greek and Coptic, etc.

Now, the solution to our problem lies in the possibility of mapping characters to other characters. Similar to the above-mentioned way of mapping capitalized characters to lowercased characters, we can map a character such as «u» to a completely different character such as «v», thereby evoking the desired effects of character normalization:

charset_table = 0..9, A..I->a..i,J->i,j->i,K..T->k..t,U->v,u->v,V..Z->v..z,a..i,k..t,v..z,U+C0..U+FF, U+100..U+17F, U+1C4..U+233, U+370..U+3E1,U+590..U+5FF, U+400..U+7FF, U+500..U+8FF

What simply was A..Z->a..z (the mapping of all capitalized characters «A-Z» to their lowercased equivalents) in the previous configuration has now become slightly more complicated:


Here, we first map all capitalized characters «A..I» to their lowercased equivalents, but then we map the capitalized and lowercased «J»/»j» to the lowercased «i» (not «j»). Equally, after mapping all characters between «K..T» to their equivalents «k..t», we apply a mapping for «U»/»u», which are both to be treated as «v» by the indexing routine. Like this, we effectively declare «J»/»j» and «i» as well as «U»/»u» and «v» as equivalent characters, thus normalizing j/i and u/v for any search query. Note that the mapping/replacement of characters stated here has effect only for the «internal» indexing of the snippets in the search engine, but does not actually transform the text snippets displayed to the user in the end by any means.

For the time being, we have refrained from defining further normalizations such as with «v» and «b», or «j» and «x», since it is not yet fully clear to us what (unexpected) effects such all-encompassing character normalizations might have, especially in «exotic» use cases that we cannot even imagine. In particular, ‘transitive’ normalization effects are something that we still need to experiment with: If «j» and «i» are equivalent and «j» and «x» are equivalent, then «i» and «x» are also equivalent, although «i» and «x» might not have been used in such an interchangeable manner in early modern texts. Therefore, we very much welcome any further suggestions – also beyond the character pairings mentioned here – that we can use for enhancing our search engine further.

(Deutsch) Snapshot-Praktikum bei der Schule von Salamanca

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.


Die Akademie der Wissenschaften und der Literatur, Mainz, bietet allen StudentInnen, AbsolventInnen und Promovierenden der Uni Mainz die Möglichkeit zu einem Snapshot-Praktikum: Eine Woche lang kann man den Forschungsalltag im Projekt «Die Schule von Salamanca: Eine digitale Quellensammlung» live miterleben.


Im August haben wir Laura Kopp von der Historischen Fakultät bei uns begrüßen dürfen. Editionsarbeit, Digital Humanities, Aufbau und Pflege einer Webanwendung für Forscher, die Arbeit in einem internationalen und interdisziplinären Forscherkreis, fachliche Gesprächsrunden über historische Fragen, viel Spaß und Spundekäs – wir haben nichts ausgelassen!

(Alemán) iiif in der Schule von Salamanca

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

Das iiif Consortium und die weitere iiif Community haben Standard-Protokolle definiert, wie man sie in der Darstellung von Bildressourcen benötigt. Die Protokolle sind als Beschreibungen von «Schnittstellen» formuliert, d.h. es wird beschrieben, unter welcher Adresse, mit Hilfe welcher Parameter der Dienst eine bestimmte Funktion anbieten soll. In dieser Weise gibt es Beschreibungen von Zoom-/Rotations-/Ausschnitts-/Format-Konversions- und ähnlichen Diensten in der iiif image API, die aktuell in Version 2.1.1 vorliegt. Ferner Beschreibungen von Zugangsmanagement- und Authentifikationsservices sowie von Such-Funktionen in der Authentication API bzw. der Search API, beide jüngeren Datums und erst in der Version 1.0 vorliegend. Beschreibungen von Video- und Audio-Daten (z.B. für die in diesem Fall hinzukommenden Zeit-Indices der Ressourcen) sind in Vorbereitung.

Continuar leyendo «(Alemán) iiif in der Schule von Salamanca»

(English) Code Release and Open-Source Development of the ‘School of Salamanca’ Web Application

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

On 1 March 2018, we have released the code of the web application of the ‘School of Salamanca’ project’s digital edition (https://salamanca.school) as free and open source software (under the MIT license) on GitHub: https://github.com/digicademy/svsal, where the development process and the versioning of our web application takes place exclusively from this date onward. The publication of our web application’s code represents the first major code release of the project; other parts of our digital infrastructure and the research data will be published separately.

The web application, now having reached version 1.0, has been developed since 2014 and, more precisely, consists of an eXist-db application package. While this package can be downloaded and deployed in any eXist-db (version 3.6+) instance, it must be mentioned that, in order to function correctly, the web application draws upon the integration with further, external services: for example, an iiif-conformant image server (image and presentation APIs) allowing for the incorporation of facsimile images in the reading views of our works, or a SphinxSearch server providing lemmatized and cross-language search results for the texts. Notwithstanding these current caveats in portability of the application package, and although there still remains much to be achieved with regards to the functionality of our web application, the code underlying central features of the application (such as the endless-scrolling segmentation of texts in the reading view, the content negotiation-based URI-linking of texts and text segments, and others) is fully available now and can be utilized or serve as an example for similar projects, for instance. For a more extensive and detailed description of current features, caveats, and provisos of the application please refer to: https://github.com/digicademy/svsal.

The software is tagged with a DOI so that it can be cited: https://doi.org/10.5281/zenodo.1186521

(Inglés) What’s in a URI? Part I: The School of Salamanca, the Semantic Web and Scholarly Referencing

Starting from experiences of the the philosophical and legal-historical project «The School of Salamanca. A digital collection of sources and a dictionary of its juridical-political language», this article discusses an experimental approach to the Semantic Web.1 It lists both affirmative reasons and skeptical doubts related to this field in general and to its relevance for the project in particular. While for us the general question has not been settled yet, we have decided early on to discuss it in terms of a concrete implementation, and hence the article will also describe preliminary goals and their implementation along with practical and technical issues that we have had to deal with.

In the process, we have encountered a few difficult questions that — as far as we could determine — involve (arguably) systematic tensions between key technologies and traditional scholarly customs. The most important one concerns referencing and citation. In the following, I will describe a referencing scheme that we have implemented. It attempts to combine a canonical citation scheme, some technologies known primarily from semantic web contexts and a permalink system. Besides the details of our particular technical approach and the very abstract considerations about risks and benefits of the semantic web, I will point out some considerable advantages of our approach that are worthwhile pursuing independently of a full-blown semantic web offering.

Continuar leyendo «(Inglés) What’s in a URI? Part I: The School of Salamanca, the Semantic Web and Scholarly Referencing»


Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

Wir laden sehr herzlich ein zur Einreichung von Beiträgen zu einem Diskussionsforum der Zeitschrift RG RECHTSGESCHICHTE – LEGAL HISTORY mit dem Titel «Die geisteswissenschaftliche Perspektive: Welche Forschungsergebnisse lassen Digital Humanities erwarten?/With the Eyes of a Humanities Scholar: What Results Can We Expect from Digital Humanities?» Dieses Diskussionsforum wird durch das Projekt für die Zeitschrift RG RECHTSGESCHICHTE – LEGAL HISTORY organisiert.

Die Beiträge sollen dezidiert die geisteswissenschaftliche Perspektive einnehmen und die bereits gemachten Erfahrungen in der forschungspraktischen Arbeit mit dem Instrumentarium der DH thematisieren. Im Mittelpunkt steht die Interaktion zwischen geisteswissenschaftlichen Erkenntnisinteressen und digitalen Tools: Wie verändern digitale Möglichkeiten Forschungsinteresse und Methode? Continuar leyendo «(Deutsch) CfP: Forum RG RECHTSGESCHICHTE – LEGAL HISTORY 24 (2016)»

(Deutsch) Technischer Workshop des Projekts – Das SvSal-Team veranstaltet am 29./30. Oktober 2015 einen Workshop zur Technik des Projekts.

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

Wir profitieren enorm davon, dass sich durch das eXist HTML-Templating schnell und elegant ein wirklich ansehnlicher und funktionaler Web-Auftritt realisieren lässt. Zugleich sind wir auf einige nicht-triviale Probleme gestoßen (wie z.B. die Notwendigkeit, umfangreiche Dokumente im Voraus als viele HTML-Fragmente zu rendern und erst nach und nach in den Browser zu laden), für die wir inzwischen Lösungen parat haben. Sowohl das eine als auch das andere Motiv dürfte für andere Projekte interessant sein, die eine Realisierung ihrer Edition mit eXist erwägen.
Daher haben wir beschlossen, bei einem informellen Workshop die «Haube aufzumachen» und Interessenten unsere Architektur, unsere Lösungen und unseren Entwicklungsprozess zu erläutern – nicht mittels großartig vorbereiteter Präsentationen, sondern in den Programmen und im Code selbst. Continuar leyendo «(Deutsch) Technischer Workshop des Projekts – Das SvSal-Team veranstaltet am 29./30. Oktober 2015 einen Workshop zur Technik des Projekts.»

(Deutsch) Historische Semantik und Semantic Web – AG «Elektronisches Publizieren», Heidelberger Akademie der Wissenschaften, 14. bis 16. September 2015

Por el bien del que está mirando, el contenido se muestra abajo en el idioma alternativo. You may click the link to switch the active language.

Workshop der Heidelberger Akademie der Wissenschaften und der Union der deutschen Akademien der Wissenschaften

Die AG «Elektronisches Publizieren» richtet vom 14. bis 16. September 2015 einen Workshop zum Thema «Historische Semantik und Semantic Web» aus. Akademie-Vorhaben, die im Bereich der Digital Humanities angesiedelt sind, werden im Rahmen verschiedener Sektionen vorgestellt und diskutiert: