Pragmatic analysis of man-machine interactions in a spontaneous speech corpus (2)

1. The C-ORAL-ROM corpus

C-ORAL-ROM is a multilingual corpus of spontaneous speech that covers four Romance languages: Italian, French, Portuguese and Spanish (Cresti & Moneglia 2005). In our work we have used the Spanish sub-corpus, which contains around 300,000 spoken words. From a sociolinguistic point of view, speakers are characterized by age, gender, place of birth, educational level and profession. From a textual point of view, the corpus is divided into the parts shown in Table 1.

Informal (150,000 words)
  Familiar (113,000)
    Monologs                    33,000
    Dialogs/Convers.            80,000
  Public (37,000)
    Monologs                     6,000
    Dialogs/Convers.            31,000
Formal (150,000 words)
  Formal in natural context     65,000
  Formal on the media           60,000
  Telephone conversations       25,000

Table 1: Distribution of words in C-ORAL-ROM.

Table 1 shows that the main division is balanced between formal and informal speech. Informal speech is divided into speech in a familiar/private context and speech in a public context. Each of these two groups is further classified into monologues, dialogs and conversations with three or more speakers. Formal speech is divided into speech in a natural context and speech on the media. The former includes political speeches, political debates, preaching, teaching, professional expositions, conferences, speech in business contexts and speech in legal contexts. Speech on the media includes news, sports, interviews, meteorology, science, reports and talk shows. Telephone conversations, although initially placed under the formal speech category in C-ORAL-ROM, have very particular features and are more similar to informal speech than to formal speech.
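The hierarchy in Table 1 can be written down as a small nested structure, which makes it easy to check that the parts add up to the balanced totals described above. The following is only a minimal sketch: the figures are the approximate word counts from Table 1, while the dictionary layout and the helper name `total_words` are illustrative assumptions and not part of the C-ORAL-ROM distribution.

```python
# Approximate word counts taken from Table 1 (Spanish sub-corpus).
# The nested-dict layout and the name `total_words` are hypothetical,
# chosen only for illustration.
DISTRIBUTION = {
    "Informal": {
        "Familiar": {"Monologs": 33_000, "Dialogs/Convers.": 80_000},
        "Public": {"Monologs": 6_000, "Dialogs/Convers.": 31_000},
    },
    "Formal": {
        "Formal in natural context": 65_000,
        "Formal on the media": 60_000,
        "Telephone conversations": 25_000,
    },
}


def total_words(node):
    """Recursively sum the word counts below a node of the hierarchy."""
    if isinstance(node, int):
        return node
    return sum(total_words(child) for child in node.values())


if __name__ == "__main__":
    # Print the subtotal for each top-level register, then the grand total.
    for register, parts in DISTRIBUTION.items():
        print(register, total_words(parts))
    print("Total", total_words(DISTRIBUTION))
```

With the figures from Table 1, the informal and formal halves each come to 150,000 words, which matches the stated total of around 300,000.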

Each corpus is built to the same design, using identical sampling techniques and the same transcription format. In addition, each corpus is presented in a multimedia format that allows simultaneous access to the aligned acoustic signal and text transcription. Texts consist of a header, with complete socio-contextual information about the recording, and an orthographic transcription. Speech acts and morphosyntactic information are tagged.
