Download the latest version here.
The Open Corpus of Asimjeeg Datooga is an annotated audio-visual corpus of recordings of the Asimjeeg variety of the Datooga language, spoken in northern Tanzania. This corpus consists of audio and video recordings with time-aligned text data for use with the ELAN software developed at the Max Planck Institute for Psycholinguistics. For more information about ELAN, please visit the following website: https://tla.mpi.nl/tools/tla-tools/elan/elan-description/
It may also be useful to install the Praat acoustic analysis software: http://www.fon.hum.uva.nl/praat/
Asimjeeg Datooga is a language variety of the Southern Nilotic family. All of the varieties of Datooga are spoken exclusively in Tanzania, whereas most of the other Southern Nilotic languages are spoken in Kenya (the one exception being Akie, which is spoken by hunter-gatherers in Tanzania). Many of the Datooga varieties are closely related and mutually intelligible. Asimjeeg Datooga, also referred to as Isimjeega, is often not easily understood by speakers of other varieties of Datooga, although speakers of Asimjeeg claim that they can understand the other varieties.
Organization of the Corpus
The corpus is divided into three primary types of audio-visual data: (1) wordlists, (2) elicitation, and (3) texts. Wordlist data consists of recordings of minimally inflected citation forms (similar to the words you might find in a dictionary). Elicitation consists of recordings of nominal and verbal paradigms and other patterned grammatical constructions or linguistic features that were specifically sought out. Texts consist of recordings of speakers talking freely, either on their own or with other speakers, and can include narratives, discussions, explanations, etc.
Development of the Corpus
The Corpus of Asimjeeg Datooga began during a documentation project sponsored by the Endangered Languages Documentation Programme (ELDP) that started in 2015. The corpus is still very much in progress, and many errors and inconsistencies have yet to be corrected. The goal is to eventually have a fully annotated and tagged corpus that is accessible to researchers of many different backgrounds and interests.
How to Download the Corpus
You can download individual files or entire folders. If you download an entire folder, it will be downloaded as a .ZIP file, which will save you bandwidth and time. The shared Dropbox folder link brings you to the main OCAD folder. To download a folder, click on that folder to enter it and then click “Download” in the top right-hand corner. To download individual files, click on the files. Make sure to download the .WAV, .EAF, and .PFSX files for each recording that you would like to use. You can download all of OCAD simply by clicking “Download” while in the main folder (where you start).
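After downloading, it can be worth checking that every recording has all three file types. The following is a minimal Python sketch of such a check. It assumes that the .WAV, .EAF, and .PFSX files for a recording share the same base name and sit in the same folder, which may not hold for every part of the corpus; the function name is only illustrative.

```python
from pathlib import Path

# The three file types the corpus description asks you to download per recording.
REQUIRED_EXTENSIONS = {".wav", ".eaf", ".pfsx"}


def find_incomplete_recordings(folder):
    """Return {recording path without extension: sorted list of missing extensions}.

    Assumption: files belonging to one recording share a base name and a folder.
    """
    found = {}
    for path in Path(folder).rglob("*"):
        suffix = path.suffix.lower()  # treat .WAV and .wav alike
        if path.is_file() and suffix in REQUIRED_EXTENSIONS:
            found.setdefault(path.with_suffix(""), set()).add(suffix)
    return {
        str(stem): sorted(REQUIRED_EXTENSIONS - extensions)
        for stem, extensions in found.items()
        if extensions != REQUIRED_EXTENSIONS
    }
```

Calling find_incomplete_recordings on the folder where you unpacked the download returns an empty dictionary if nothing is missing.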
How to Use the Corpus
The corpus is designed to be useful for those who want to use regular expressions to search for certain types of linguistic patterns (in glosses, in content language data, or in both) across all of the files of the corpus or a subset of them. This is enabled by the search functions of ELAN and is complemented by Praat. After searching for a given pattern, you can open an example of that pattern in ELAN and then open that same audio selection in Praat, giving you instant access to detailed phonetic analysis tools.
If you would like to look at an individual recording, simply open the .EAF file using ELAN. If the audio is not properly linked to the .EAF file, click on Edit > Linked Files… and locate the .WAV file manually to link it.
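If the audio does not load, it can help to see which media paths the .EAF file is actually pointing to. An .EAF file is XML, and its header stores the linked media in MEDIA_DESCRIPTOR elements. The sketch below reads those paths with Python's standard library; the function name is only illustrative, and relinking itself is still best done through ELAN as described above.

```python
import xml.etree.ElementTree as ET


def linked_media(eaf_path):
    """Return (MEDIA_URL, RELATIVE_MEDIA_URL) pairs stored in an .EAF header.

    Either value is None if the attribute is absent from the file.
    """
    root = ET.parse(eaf_path).getroot()
    return [
        (descriptor.get("MEDIA_URL"), descriptor.get("RELATIVE_MEDIA_URL"))
        for descriptor in root.iter("MEDIA_DESCRIPTOR")
    ]
```

If the paths returned do not match where the .WAV file is on your computer, that is the likely reason ELAN cannot find the audio.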
To search across multiple files, open ELAN and click Search > Structured Search Multiple EAF… > Define Domain > New Domain. Select all of the files you would like to search across (you can also define subsets of the corpus, e.g. wordlists, or all files of a certain speaker, etc.). Click OK, and give a name to the domain you have defined. For a basic search, click the tab Single Layer Search and choose a tier type or name in the drop-down menu on the right. Click on the drop-down menu that says “substring match” and choose “regular expression.” Type the data you would like to search for in the white text box and then click Find.
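For those who prefer to script a search, roughly the same kind of regular expression search can be sketched outside ELAN. The example below is not part of the corpus or of ELAN; it is a minimal Python sketch that relies only on the XML layout of .EAF files (TIER elements containing ANNOTATION_VALUE elements). The function names are illustrative, and any tier name you pass in has to be one that actually exists in the files you are searching.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def search_eaf(eaf_path, pattern, tier_id=None):
    """Return (tier ID, annotation text) pairs whose text matches the regex.

    If tier_id is given, only the tier with that exact TIER_ID is searched.
    """
    regex = re.compile(pattern)
    hits = []
    root = ET.parse(eaf_path).getroot()
    for tier in root.iter("TIER"):
        if tier_id is not None and tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""  # empty annotations have no text node
            if regex.search(text):
                hits.append((tier.get("TIER_ID"), text))
    return hits


def search_corpus(folder, pattern, tier_id=None):
    """Search every .EAF file under folder; return {file path: list of hits}."""
    results = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".eaf":
            hits = search_eaf(path, pattern, tier_id)
            if hits:
                results[str(path)] = hits
    return results
```

Passing a subfolder instead of the main folder plays the same role as defining a smaller domain in ELAN. Unlike ELAN's search, this sketch does not report time alignments, so it is best used to locate files and annotations that you then open in ELAN.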
Restrictions on use of the corpus / Creative Commons License
The OCAD corpus, including all audio files, annotations, and metadata contained within, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. This license allows you to share the corpus in any medium or format, and to remix, transform, and build upon it. You must give appropriate credit when sharing the material, including a link to the CC license, and you must indicate if any changes have been made. You may not use the material for commercial purposes, and if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
For further information about this license, please read the following: