Cuneiform Artefacts of Iraq in Context (CAIC) - Keilschriftartefakte Mesopotamiens

From the Clay Tablet to the Data Set

Enrique Jiménez and Fabian Simonjetz

CAIC seamlessly connects antiquity with cyberspace. The project uses cutting-edge technologies to process some of humanity’s oldest textual evidence, collaborating with leading digital cuneiform initiatives around the world. Together we have an ambitious goal: to establish a universal, modular platform for the cataloguing, documentation and edition of cuneiform texts, bringing together existing tools and corpora and adding significant new functionalities.

Connecting and improving existing platforms

The world of digital cuneiform studies has so far been an archipelago of many islands and islets, which the CAIC platform aims to bring together into one landmass. Our goal is, on the one hand, to collaborate with all existing major international projects in digital ancient Near Eastern studies, such as the French Archibab, the Spanish BDTNS and the international initiatives CDLI and Oracc, and, on the other hand, to contribute to the interoperability of all these insular initiatives. The Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities, which hosts our platform and all research data on its enormously powerful server infrastructure, is essential for this. Researchers worldwide will be able to create sub-projects in CAIC and benefit from all the system’s functions. In this way, we aim to advance not only networking within Ancient Near Eastern Studies, but also interdisciplinary collaboration.

Complex data, structured format

The CAIC team uses transliteration rules to produce transcriptions from the cuneiform texts and expand them with additional information. For the documentation of the tablets, it is not enough to save only the “text itself”. Because the three-dimensional tablets are inscribed on several sides, information on layout and distribution of signs is important. We also encode data such as place of discovery, epoch, language, as well as linguistic annotations and references to associated fragments. Of course, the CAIC team also translates the painstakingly transliterated texts in full and annotates them philologically and historically.

All this is stored in a complex data format. After all, data can only be meaningfully used for digital research if it is stored in a form that allows it to be searched, sorted and filtered. When the data is transferred to the database, each piece is given a unique ID and an entry containing the edition, linguistic information (e.g. morphology), meta-information (e.g. find location, dating, archival context), photos and statistical data. For this we use JavaScript Object Notation (JSON), a structured data format built on key-value pairs. The keys are fixed labels such as “period” or “transliteration”; the values can be numbers for dates, temporal categories such as “Old Babylonian”, or more complex entries such as bibliographic references. The challenge is to define keys and values in such a way that as many properties as possible can be represented, which means that the abstract data model must be constantly adapted and extended. One advantage of JSON is that it can be easily converted into other formats, such as TEI P5 XML, so that every user can use the material for their own purposes. Backups of the database are stored on the LRZ server several times a day to guard against data loss.
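
To make the idea of key-value records concrete, here is a minimal sketch in Python. It is not the actual CAIC data model: only the keys “period” and “transliteration” are named in the text, while every other key, the IDs and the sample values are invented for illustration. The XML produced is a simplified, TEI-flavoured fragment, not a valid TEI P5 document.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records. Only "period" and "transliteration" are key names
# taken from the article; all other keys and all values are assumptions.
RECORDS_JSON = """
[
  {"id": "example.1", "period": "Old Babylonian", "language": "Akkadian",
   "provenance": "Uruk", "transliteration": ["a-na be-li2-ia", "qi2-bi2-ma"]},
  {"id": "example.2", "period": "Neo-Assyrian", "language": "Akkadian",
   "provenance": "Nineveh", "transliteration": ["a-na LUGAL be-li2-ia"]}
]
"""


def filter_records(records, **criteria):
    """Return the records whose values equal every key=value criterion."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


def to_tei_like_xml(record):
    """Serialise one record as a small TEI-flavoured fragment (one <l> per line)."""
    div = ET.Element("div", {"type": "edition", "n": record["id"]})
    for number, line in enumerate(record["transliteration"], start=1):
        line_element = ET.SubElement(div, "l", {"n": str(number)})
        line_element.text = line
    return ET.tostring(div, encoding="unicode")


records = json.loads(RECORDS_JSON)
old_babylonian = filter_records(records, period="Old Babylonian")
xml_fragment = to_tei_like_xml(old_babylonian[0])
print(xml_fragment)
```

Because every property sits under a fixed key, the same records can be filtered by period, language or provenance without any change to the data itself, and a converter only needs to know which keys to read.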

Solutions for a constantly growing text corpus

The corpus of cuneiform languages (mainly Sumerian and Akkadian) grows by many thousands of words every year. We need to create tools that accommodate this growth. The CAIC platform already supports lemmatisation for Akkadian texts, linking each word of a text to its dictionary form. The dynamic concordance that this makes possible is well suited to a corpus that is still growing rapidly. A central goal of the first project phase is the development of lemmatisation for Sumerian texts. This task currently occupies Walther Sallaberger in particular, who can build on his many years of experience with a Sumerian glossary.
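
The link between lemmatisation and a dynamic concordance can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the class `Concordance`, the text IDs and the two sample lines are hypothetical, and each token is simply a pair of written form and dictionary form.

```python
from collections import defaultdict

# Hypothetical lemmatised texts: every token is (written form, dictionary form).
LEMMATISED_TEXTS = {
    "text.A": [[("šar-ru", "šarru"), ("il-li-ik", "alāku")]],
    "text.B": [[("bi-it", "bītu"), ("šar-ri", "šarru")]],
}


class Concordance:
    """Index from dictionary form to every attestation in the corpus."""

    def __init__(self):
        self._index = defaultdict(list)

    def add_text(self, text_id, lines):
        """Register a new text; the concordance grows without being rebuilt."""
        for line_number, tokens in enumerate(lines, start=1):
            context = " ".join(form for form, _ in tokens)
            for form, lemma in tokens:
                self._index[lemma].append((text_id, line_number, form, context))

    def lookup(self, lemma):
        """Return all attestations of a lemma as (text, line, form, context)."""
        return list(self._index.get(lemma, []))

    def lemmas(self):
        """Return all lemmas attested so far, i.e. the headwords of a glossary."""
        return sorted(self._index)


concordance = Concordance()
for text_id, lines in LEMMATISED_TEXTS.items():
    concordance.add_text(text_id, lines)

for attestation in concordance.lookup("šarru"):
    print(attestation)
```

Since attestations are keyed by dictionary form, differently spelled or inflected forms of the same word end up under one entry, and each newly edited text extends the index as soon as it is added.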

Lemmatisation allows the automatic generation of glossaries, which can in turn be linked to material on etymology or even to thematic collections. It will be particularly important for us to link to the online entries of the direct predecessor project at the BAdW, the Reallexikon der Assyriologie und Vorderasiatischen Archäologie.

From maps to clay tablets to editions

Our platform will not only offer the photographs and editions generated by CAIC for each clay tablet, but will also visualise them in their find context. This takes into account the diversity of well-documented provenances that characterises the Iraq Museum’s collections. To do this, we are building on the Ancient Records of Middle Eastern Polities (ARMEP) visualisation tool developed by Karen Radner and Jamie Novotny with the Humanities IT Group at LMU, which allows cuneiform texts to be queried from a map interface. Users will be able to travel virtually to any Iraqi site and consult the documents found there in their chronology and archival context. In the city of Uruk, for example, one will be able to see in which libraries manuscripts of the adventures of its legendary king Gilgamesh were kept. A mouse click will then lead to the translation, and anyone who wants to can immediately access the entire edition.

Tim Berners-Lee, the inventor of the World Wide Web, has said: “Data is a precious thing and will last longer than the systems themselves.” This is also true of our platform, which will inevitably one day become obsolete. But thanks to its open data structure, the editions and other information we generate with the tools described will permanently shape scholarship and general knowledge of the rich textual treasures of ancient Iraq. For CAIC, with its platform and its data, deliberately addresses not only experts but also the general public.

A palaeographic revolution

Within the project electronic Babylonian Literature (eBL), funded by the Alexander von Humboldt Foundation and led by Enrique Jiménez, a groundbreaking tool has been developed that links the photographs directly to the transcriptions of the clay tablets depicted, character by character.

This tool is central to the CAIC platform. What is new is that all the data generated flows directly into the project’s dynamically growing list of signs, which thus documents the forms of each character across the millennia in which cuneiform script was used. That clay tablets can be dated by character shape (“palaeography”) has long been known, but until now a comprehensive repertoire of these shapes has been lacking, severely hampering research. CAIC will change that. Whenever the sign forms are queried, direct access to all the contextual information of the tablets in the Iraq Museum is also provided. It is therefore no exaggeration to say that this will revolutionise cuneiform research.
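
How character-by-character annotations can feed a growing sign list is sketched below. The structure is an assumption made for illustration and does not describe the eBL or CAIC tool itself: the `SignAnnotation` fields, the tablet IDs, the periods assigned to them and the pixel coordinates are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SignAnnotation:
    """One sign on a photograph, linked to its place in the transliteration."""
    tablet_id: str
    period: str
    sign: str       # sign name in the sign list
    reading: str    # reading used in the transliteration
    box: tuple      # (x, y, width, height) in pixels on the photograph


def build_sign_list(annotations):
    """Group annotated sign forms by sign name and then by period."""
    sign_list = defaultdict(lambda: defaultdict(list))
    for annotation in annotations:
        sign_list[annotation.sign][annotation.period].append(annotation)
    return sign_list


# Invented sample annotations.
ANNOTATIONS = [
    SignAnnotation("tablet.1", "Old Babylonian", "AN", "an", (10, 12, 40, 38)),
    SignAnnotation("tablet.2", "Neo-Assyrian", "AN", "dingir", (55, 80, 36, 34)),
    SignAnnotation("tablet.2", "Neo-Assyrian", "LUGAL", "lugal", (95, 80, 60, 36)),
]

sign_list = build_sign_list(ANNOTATIONS)
for period, forms in sign_list["AN"].items():
    print(period, [(form.tablet_id, form.box) for form in forms])
```

Each stored form keeps its tablet ID, so a query for a sign shape leads straight back to the tablet and its contextual information, and the grouping by period is what allows shapes to be compared across time.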

The data will also be used to train artificial intelligence to recognise characters in photographs. One day, AI will also be able to read entire clay tablets. Since there are far too few people in the world who can read cuneiform, this will not take work away from anyone, but rather solve a serious structural problem that has resulted in far too much material remaining unprocessed for decades.