1.   The C-ORAL-ROM corpus

C-ORAL-ROM is a multilingual spontaneous speech corpus that comprises four Romance languages: Italian, French, Portuguese and Spanish (Cresti & Moneglia 2005). In our work we have used the Spanish sub-corpus, which contains around 300,000 spoken words. From a sociolinguistic point of view, speakers are characterized by their age, gender, place of birth, educational level and profession. From a textual point of view, the corpus is divided into the parts shown in Table 1.

Informal (150,000 words)
    Familiar (113,000)
        Monologs                       33,000
        Dialogs/Convers.               80,000
    Public (37,000)
        Monologs                        6,000
        Dialogs/Convers.               31,000
Formal (150,000 words)
    Formal in natural context          65,000
    Formal on the media                60,000
    Telephone conversations            25,000
Table 1: Distribution of words in C-ORAL-ROM.

Table 1 shows that the main division is balanced between formal and informal speech. Informal speech is divided into speech in a familiar/private context and speech in a public context. Each of these two groups is further classified into monologs, dialogs and conversations with three or more speakers. Formal speech is divided into speech in a natural context and speech on the media. The former includes political speeches, political debates, preaching, teaching, professional expositions, conferences, speech in business contexts and speech in legal contexts. Speech on the media includes news, sports, interviews, meteorology, science, reports and talk shows. Telephone conversations, although placed under the formal speech category in C-ORAL-ROM, have very particular features and are more similar to informal speech than to formal speech.
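The balance described above can be checked with a few lines of code. The following is only an illustrative sketch: the nested dictionary and the helper `total_words` are our own hypothetical names, not part of the C-ORAL-ROM distribution, and the figures are the rounded word counts copied from Table 1.

```python
# Rounded word counts copied from Table 1.
# The nested-dict layout is an assumption made for this illustration.
DISTRIBUTION = {
    "Informal": {
        "Familiar": {"Monologs": 33_000, "Dialogs/Convers.": 80_000},
        "Public": {"Monologs": 6_000, "Dialogs/Convers.": 31_000},
    },
    "Formal": {
        "Formal in natural context": 65_000,
        "Formal on the media": 60_000,
        "Telephone conversations": 25_000,
    },
}


def total_words(node):
    """Sum the word counts of a category, recursing into its subcategories."""
    if isinstance(node, int):
        return node
    return sum(total_words(child) for child in node.values())


for name, node in DISTRIBUTION.items():
    print(f"{name}: {total_words(node):,} words")
print(f"Whole sub-corpus: {total_words(DISTRIBUTION):,} words")
```

With the counts of Table 1, both top-level categories add up to 150,000 words, which matches the total of around 300,000 words given for the Spanish sub-corpus.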

Each of the four language corpora is built to the same design, using identical sampling techniques and the same transcription format. In addition, each corpus is presented in a multimedia format that allows simultaneous access to the aligned acoustic signal and text transcription. Each text consists of a header, with complete socio-contextual information about the recording, and an orthographic transcription. Speech acts and morphosyntactic information are tagged.

