Table of contents: Constructing facts in linguistics: the place of corpora

Journal Langages Mir@bel
Issue no. 171, September 2008
Issue title Constructing facts in linguistics: the place of corpora
Full text online Available on the internet
  • Introduction: reflections on the place of corpora in linguistics - Marcel Cori, Sophie David, Jacqueline Léon p. 5-11 open access
  • At the sources of “Corpus Linguistics”: Firth and the London School - Jacqueline Léon p. 12-33 open access, with English abstract
    The aim of this paper is to investigate the sources of Corpus Linguistics in order to shed light on the current distinction between the “corpus-based” and “corpus-driven” trends, led respectively by Geoffrey Leech and John Sinclair. Although Corpus Linguistics originates from the London School founded by Daniel Jones and John Rupert Firth in the 1950s, and from the British empirical tradition, the two trends have interpreted Firth's pioneering notions differently: context of situation, restricted languages, meaning by collocation, lexicogrammar, use, corpus and texts. In this paper, we examine how these notions were involved in the making of early pre-computerized and computerized corpora of the English language in the 1960s.
  • Remarks on the use of corpora in morphology - Bernard Fradin, Georgette Dal, Natalia Grabar, Stéphanie Lignon, Fiammetta Namer, Delphine Tribout, Pierre Zweigenbaum p. 34-59 open access, with English abstract
    The aim of this article is threefold. Firstly, it recalls how access to large-scale digitized data resources has qualitatively changed the way morphology is done. Secondly, it shows that raw data extracted from such resources cannot be used as such and require in-depth preparation in order to be properly exploited; it also shows that the procedures involved in such preparation have to be made carefully explicit, since they have a strong impact on the results of queries submitted to sorted data. Finally, it argues that using large-scale digitized data does not allow us to disregard speakers' judgements concerning grammaticality, which remain indispensable whenever one needs to determine whether a linguistic form is acceptable or not, but that it puts such judgements in perspective insofar as they have to be compared with what is attested in the data.
  • Corpora and data in post-lexical phonology: form and status - Elisabeth Delais-Roussarie p. 60-76 open access, with English abstract
    This paper aims to: 1. explain why data are essential for studying phonological phenomena at the post-lexical level, be they segmental, such as liaison and “enchaînement”, or suprasegmental, such as accentuation or intonation. This results mainly from the fact that the analysis of these phenomena must take into account various types of information that may come into conflict; 2. show how data are designed in this field of research. In doing so, we emphasize the form and the content of the data used; 3. argue that data constructed by the linguist and recorded in specific contexts, on the one hand, and oral corpora, on the other, are complementary in developing a linguistic analysis, each having its own advantages and limitations.
  • Some problems observed in compiling dictionaries from corpora - Alexander Geyken p. 77-94 open access, with English abstract
    This work investigates the quantitative and qualitative criteria that govern the construction of electronic corpora in the context of compiling or updating dictionaries. In particular, the concepts of balanced and opportunistic corpora are addressed. It is shown that there are interesting linguistic phenomena that are not present in the largest balanced corpora currently available. Opportunistic corpora are many times bigger owing to the availability of large quantities of electronic newspaper text. However, different studies conducted, e.g., on gender distribution or on archaisms show that the results vary considerably depending on the size and the sampling of the corpora. Hence, frequency is no longer a reliable criterion, which poses a problem for opportunistic corpora with regard to their objectivity.
  • From automatic processing methods to corpus-based linguistics - Marcel Cori p. 95-110 open access, with English abstract
    This paper deals with the links between the development of stochastic methods in Natural Language Processing (NLP) and the attempts to refound linguistics on the basis of corpora. First, we define robust NLP, in opposition to “theoretical” Computational Linguistics, by specifying its criteria and methods. Stochastic methods can be defined as probabilistic methods using statistics on corpora. Second, we note that the users of these methods tend to confuse NLP models with models of language representation; in other words, they tend to confuse NLP with research in linguistics. Finally, we show that this stance is close to “instrumentalism”, to which Popper was opposed.
  • Do corpora found a new linguistics? - Marcel Cori, Sophie David p. 111-129 open access, with English abstract
    This paper attempts to reconstruct the positions of a trend that recommends corpus-based approaches as the foundation of a new linguistics. Two sets of arguments have been put forward: critical arguments, which target Chomskyan linguistics and the way it handles data, and positive arguments, which concern the possibility of discovering new facts, improving analyses, and taking into account fields and domains considered to have been previously ignored. Although some of the criticisms do point to real difficulties in linguistic research, the use of corpora does not fundamentally change the researcher's position.