Polina Solonets and Maxim Kupreyev, members of the project team, took part in the poster session of the DARIAH Annual Event 2023, held in Budapest from June 6th to June 9th. This year's conference topic was ‘Cultural Heritage Data as Humanities Research Data?’. Polina and Maxim introduced their approach to sustainable workflow organisation for a large-scale edition and presented a poster entitled ‘Sustainable Practices for the Large-Scale TEI Editions at the School of Salamanca Text Collection’.
The project “The School of Salamanca” is creating a freely accessible online collection of texts produced in the intellectual centre of the Spanish monarchy during the 16th and 17th centuries. Currently, 33 of a total of 116 works have completed the production cycle, which includes TEI XML encoding, HTML export for online access and full-text search, IIIF presentation APIs, and PDF and RDF export. The development of a sustainable workflow for the project has been influenced by the sheer size of our textual collection and by its unique features. In preparing editions of Early Modern Latin and Spanish texts, it is crucial to take into account their inherent instability, i.e. heterogeneous structures and variations in orthography, typography, and punctuation. Our editorial principles were therefore shaped by the need to trace and reproduce the development steps at any given moment; to reuse tools independently of the context and of individual texts; to scale complex processing tasks; to perform constant data quality checks; and to document the requirements and the results.
Salamanca’s workflow shares common ground with both Waterfall and Agile development techniques. The concept of the pipeline, inherent to Waterfall, is at the centre of Salamanca’s editorial technique: the production of the edition consists of a number of steps executed sequentially, where the output of each stage serves as the input for the next one.
Digitization → Transcription in TEI TITE → Structural annotation → TEI transformation → Manual and automatic corrections → HTML, PDF, IIIF, and RDF generation.
The advantage of such a predefined sequence is its reproducibility: each part of the editorial process can be restored at any time. It makes individual work steps traceable and enables comprehensive documentation in the form of program code and editorial guidelines.
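The sequential principle can be illustrated with a minimal Python sketch. It is not the project's actual code: the stage functions (`transcribe`, `annotate`, `to_tei`), the helper `run_pipeline`, and the sample input are hypothetical stand-ins. The sketch only shows how each stage consumes the output of the previous one, and how keeping every intermediate result makes each step traceable.

```python
from typing import Callable, List, Tuple

# A stage is a (name, function) pair; every function maps text to text.
Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(source: str, stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run the stages in order and return (stage name, output) for each step."""
    trace = []
    current = source
    for name, step in stages:
        current = step(current)        # output of one stage feeds the next
        trace.append((name, current))  # keep the intermediate result
    return trace


# Hypothetical stand-ins for the real, far more complex processing steps.
def transcribe(scan: str) -> str:
    return scan.strip()


def annotate(text: str) -> str:
    return f"<p>{text}</p>"


def to_tei(fragment: str) -> str:
    return f"<TEI><text><body>{fragment}</body></text></TEI>"


STAGES: List[Stage] = [
    ("transcription", transcribe),
    ("structural annotation", annotate),
    ("TEI transformation", to_tei),
]

trace = run_pipeline("  Lorem ipsum  ", STAGES)
for name, output in trace:
    print(f"{name}: {output}")
```

Because the stage list is fixed and every intermediate output is recorded, any step can in principle be re-run from the stored result of the step before it.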
In Agile, the software product is built in small chunks, and each development cycle includes feature clarification, design, coding, testing, and deployment. For this purpose, Agile integrates the software development, testing, and operations teams in a single collaborative, iterative process. In Salamanca’s adoption of Agile practices, each of the above-mentioned stages comprises the definition of requirements, development, and quality assurance. This also means that each stage of the production of the digital edition delivers a part of the overall product features, which can be accessed and disseminated.
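As a rough illustration of a stage that bundles a requirement, a processing step, and a quality check, consider the following Python sketch. All names (`Stage`, `run_stage`, `wrap_paragraph`, `check_well_formed`) are hypothetical and chosen for this example only; the well-formedness test stands in for the project's much richer QA routines.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    requirement: str                   # what the stage must deliver
    run: Callable[[str], str]          # the processing step itself
    check: Callable[[str], List[str]]  # QA: returns a list of problems


def run_stage(stage: Stage, data: str) -> str:
    """Run one stage and release its result only if the QA check passes."""
    result = stage.run(data)
    problems = stage.check(result)
    if problems:
        raise ValueError(f"{stage.name} failed QA: {'; '.join(problems)}")
    return result


# Hypothetical processing step: wrap plain text in a paragraph element.
def wrap_paragraph(text: str) -> str:
    return f"<p>{text}</p>"


# Hypothetical QA check: the output must be well-formed XML.
def check_well_formed(xml_text: str) -> List[str]:
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    return []


annotation = Stage(
    name="structural annotation",
    requirement="output is well-formed XML",
    run=wrap_paragraph,
    check=check_well_formed,
)

print(run_stage(annotation, "plain text"))
```

An input such as `"a < b"` would produce markup that is not well-formed, so `run_stage` would raise a `ValueError` instead of passing the defective result on to later stages.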
For example, the QA routines after step 1 (digitization of the print originals) allow us to publish the IIIF presentation manifests even before the TEI transcription starts. The IIIF manifests are later enriched with additional data pertaining to chapters and pagination. The same applies to PDF generation: it was initially intended to be one of the export methods, located at the end of the workflow. Yet, when implemented, it exposed a number of semantic and structural inconsistencies in the source XML. We therefore decided to use PDF generation earlier in the pipeline, as a diagnostic tool and data quality service. In our talk we will show how the sustainable editorial workflow, adapted for the processing of large-scale textual sources, translates into the delivery of high-quality data.