Theoretical Framework
Studies in Pragmatics have shown that communication is carried out at the inferential level, i.e., it is no longer regarded as a simple process of encoding and decoding information, but as a process in which the speaker's interpretation of the world is encoded, transmitted, decoded and finally reinterpreted by the interlocutor. According to this model, each linguistic utterance carries an argumentative charge responsible for raising certain inferences in the interlocutor. Discourse markers play an important role in guiding these inferences; tasks concerning their identification and annotation are therefore indispensable for language processing if the underlying meaning of each utterance is to be fully understood.
The theoretical framework adopted proposes a correspondence between the cognitive and socio-cultural level, on the one hand, and the linguistic levels, on the other. Following this premise, a type of reasoning is reflected through a discursive operation, and this operation is linguistically encoded through discourse markers. However, the linguistic encoding of discourse markers varies from one language to another, since each language adopts different morphological, syntactic or lexical strategies to express the discursive operation.
In other words, according to the theoretical framework implemented in the PRAGMATEXT annotation model, a discourse marker implies an inference, which is considered a universal cognitive phenomenon. However, this phenomenon is modelled differently by each group of users and expressed through different grammatical strategies in each language; it is thus materially realized in different linguistic forms.
Given these cognitive and linguistic facts, a computational approach dealing with discourse markers should take into consideration the following theoretical challenges:
1) The lack of consensus on what counts as a discourse marker and what does not and, consequently, on the definitive set of discourse markers in a given language.
2) The categorical (grammatical) ambiguity; the same form can be a discourse marker in some contexts and an adjective, an adverb or part of a whole phrase in others. For example, the Spanish “bueno” (well) is a discourse marker in some cases and an adjective in others. In the same way, the Arabic “حسناً” (well) can be a discourse marker or an adjective. The English “well” is a similar case: it can be a discourse marker or an adverb.
3) The syntactic ambiguity, i.e., some markers operate both at the sentence level and at the phrase level. Examples are the English conjunction “and” and its equivalents “y” in Spanish and “و” in Arabic.
4) The discursive ambiguity; one discourse marker can have different discursive functions. For example, the Arabic “كما” and its English equivalent “as” can both play the role of a marker of concretion. However, the Arabic marker can also be a co-argumentation marker, while the English one can also be a marker of cause.
5) Idiomatic expressions, i.e., when two discourse markers appear together, a decision has to be taken on whether to treat them as two separate discourse markers or as a single multi-word discursive unit. Examples are the English “as” when it appears in “as well as” or “as to” and, in Spanish, “pero si” in “pero si yo no he sido” (roughly, “but it wasn't me”).
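The decision described in the fifth challenge can be illustrated with a minimal sketch that prefers the longest multi-word unit when one is available. The marker inventories and the function name `segment_markers` are hypothetical and chosen only for illustration; they are not the inventory or the procedure of the annotation model, and the spans returned are only candidates that would still need disambiguation.

```python
# Hypothetical inventories, for illustration only (not the model's actual marker set).
MULTIWORD_MARKERS = {
    ("as", "well", "as"),
    ("as", "to"),
}
SINGLE_MARKERS = {"as", "well", "and"}


def segment_markers(tokens):
    """Return candidate marker spans as (start, end, text), longest match first."""
    lowered = [t.lower() for t in tokens]
    max_len = max(len(m) for m in MULTIWORD_MARKERS)
    spans = []
    i = 0
    while i < len(lowered):
        matched = False
        # Try the longest multi-word unit first, then shorter ones.
        for n in range(max_len, 1, -1):
            candidate = tuple(lowered[i:i + n])
            if candidate in MULTIWORD_MARKERS:
                spans.append((i, i + n, " ".join(candidate)))
                i += n
                matched = True
                break
        if not matched:
            # Fall back to a single-word candidate marker.
            if lowered[i] in SINGLE_MARKERS:
                spans.append((i, i + 1, lowered[i]))
            i += 1
    return spans
```

Under this choice, “as well as” is recorded as one unit rather than as three separate candidates; the opposite policy (always splitting) would simply skip the multi-word lookup.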
For the first challenge, more time and work need to be devoted to the study of discourse markers in different types of corpora and in different languages, in both the spoken and the written register. Accordingly, the present work is an initiative and a step towards this goal.
Resolving the grammatical and syntactic ambiguity, which represent the second and third challenges, requires disambiguation methods based either on contextual rules or on statistical methods. For the latter approach, however, the main obstacle is the lack of resources annotated with both POS tags and discourse markers. For the present study, contextual disambiguation rules that use information from the POS tagging were applied only to the Spanish and English corpora. Once a Spanish discourse marker is identified, we search the Arabic corpora for an equivalent marker.
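As an illustration of what a contextual disambiguation rule over POS-tagged text might look like, the following minimal sketch handles the “well” / “bueno” case mentioned above. The rule itself (an interjection tag, or sentence-initial position followed by a comma), the Universal-POS-style tag names and the function name `is_discourse_marker` are assumptions made for this example; they are not the rules actually applied in the study.

```python
AMBIGUOUS_FORMS = {"well", "bueno"}  # illustrative subset only


def is_discourse_marker(tagged, i):
    """Hypothetical contextual rule for the token at position i.

    tagged: list of (token, pos) pairs for one sentence; the tag names
    (INTJ, ADV, ADJ, ...) follow a Universal-POS-style scheme, which is
    an assumption of this sketch.
    """
    token, pos = tagged[i]
    if token.lower() not in AMBIGUOUS_FORMS:
        return False
    # An interjection tag is taken as direct evidence of marker use.
    if pos == "INTJ":
        return True
    # Otherwise require sentence-initial position followed by a comma.
    sentence_initial = i == 0
    followed_by_comma = i + 1 < len(tagged) and tagged[i + 1][0] == ","
    return sentence_initial and followed_by_comma
```

For instance, with the input `[("Well", "ADV"), (",", "PUNCT"), ("I", "PRON"), ("agree", "VERB")]` the rule accepts position 0 as a marker, whereas the sentence-final “well” in “She sings well” is rejected and left as an adverb.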
Regarding the discursive ambiguity and the discourse markers formed from more than one marker, the attitude of theoretical linguists differs from that of computational linguists. Theoretical linguists opt for increasing the number of discursive values assigned to each marker. Computational linguists and computer scientists, on the other hand, prefer a well-defined and stable set of discourse markers applicable to a wider range of domains and registers.
Considering the above-mentioned challenges, the proposed annotation model aims to reach a compromise between theory and computational practice by decreasing the ambiguity without, at the same time, losing the details provided by the Pragmatic theories. The following section describes the annotation model.