Corpus-based research on young learner writing

Project owner

Western Norway University of Applied Sciences

Project categories

Applied Research

Project period

January 2015 - June 2018

Project summary

Learner corpus research is a young but rapidly growing field at the intersection of corpus linguistics and research on second language acquisition. As Granger (2014: xxi) writes, “computer learner corpus (CLC) research is still in its infancy”. As such, new avenues of research are constantly being discovered and explored. Most studies so far have concentrated on advanced or upper-intermediate learner language, such as academic writing. The present study, by contrast, examines the written production of young Norwegian learners of English. We also have access to a reference corpus of texts written by British children of the same age.

So far, the research group has focused on cohesive ties. The term cohesion refers to “the property of connectedness that characterizes a text in contrast to a mere sequence of words” (Mahlberg 2009). Cohesive ties in writing maintain unity across a sequence of sentences and ease interpretation for the reader. In the research literature, cohesive ties have long been recognized as important linguistic devices that contribute to “good” writing (Halliday & Hasan 1976). In Halliday and Hasan’s (1976) seminal work Cohesion in English, text connectedness is described in terms of reference, substitution, ellipsis, conjunction and lexical cohesion. Looking at these aspects of cohesion, the research literature has identified a number of features in the writing of EFL/ESL learners that diverge from those found in native speaker writing (e.g. Liu and Braine 2005, Palmer 1999, Zhang 2000). The current research group has so far focused on the category of reference. We have looked in particular at the expression of personal and demonstrative reference, as well as the way Norwegian learners of English exploit reference to achieve various narrative effects.

Given their impact on the clarity and comprehensibility of writing, cohesive devices should play an important role in English language teaching at all levels. Nevertheless, cohesion does not receive enough attention in traditional language teaching. Mahlberg (2009), citing Cook (1989), observes that “cohesion between sentences is too easily seen as an aspect of language use to be developed after the ability to handle grammar and words within sentences.” In our experience, this is true of the Norwegian classroom as well, where cohesive devices are treated only sporadically, if at all. We hope that by charting the tendencies and challenges in the use of cohesive ties that this group of learners faces, we can draw attention to these linguistic devices and to the importance of treating them more systematically in the EFL/ESL classroom.

While members of the research group are still working on cohesive ties, we also aim to investigate other linguistic and narrative aspects of young learner writing, including fluency, accuracy, syntactic complexity and narrative skills.


The data for this project is drawn from the CORYL corpus, an electronic corpus of young learner language developed at the University of Bergen and only recently made available for research. The corpus is based on writing scripts provided by Norwegian pupils in the 7th, 10th and 11th grades for the National tests of English writing, 2004-5. The corpus is still under development, and 10th grade texts will be added at some stage. At present, the corpus comprises 42,749 words (69,322 tokens). The test scripts have been evaluated on a scale based on the Common European Framework (CEFR).

In their handbook for language teachers, Hasselgreen et al. (2011) explain how the CEFR scale (originally designed for adults) was adapted for young learners in the AYLLIT (Assessment of Young Learner Literacy) project. The test scripts have been tagged for errors using sic tags, and corpus searches can be restricted according to error type. In addition, searches can be restricted to sub-corpora defined by “age”, “gender”, “CEFR-grade” and “year”. It is also possible to search for concordances and collocations, and to generate word lists.
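The two most basic search functions mentioned above, word lists and concordances, can be illustrated with a minimal sketch. It does not use the CORYL search interface; the tokenizer, the function names and the sample sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens, n=5):
    """Return the n most frequent word forms with their raw frequencies."""
    return Counter(tokens).most_common(n)


def concordance(tokens, node, width=3):
    """Return keyword-in-context lines as (left context, node word, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, not taken from the corpus.
sample = "The boy saw a dog. He liked the dog and the dog liked him."
tokens = tokenize(sample)

print(word_list(tokens, n=3))
for left, node, right in concordance(tokens, "he", width=2):
    print(f"{left:>20} [{node}] {right}")
```

A concordance of a personal pronoun such as "he" makes it possible to inspect each instance together with its surrounding context, which is the starting point for deciding what the pronoun refers to.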

The reference corpus used in the study is the Oxford Children's Writing Corpus (Oxford University Press, 2014). This corpus consists of 118,000 short stories (about 500 words each) written by UK children aged 4 to 14 as part of a short story competition organized by Oxford University Press and the BBC in 2014. The corpus comprises 58,961,819 tokens and 50,883,923 words. It can be divided into searchable sub-corpora according to the writer’s age, gender, and region.

The broad aim of the research project is to investigate the writing of young Norwegian learners of English by applying corpus linguistic tools.

So far the research group has focused on the following issues:

  • How is reference, as a cohesive tie, used in the writing of young Norwegian learners?
    • Can patterns of overuse, underuse and misuse be identified in the learner writing?
  • Are referential ties used to achieve narrative effects, and if so, how?
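One common way of approaching the question of overuse and underuse is to compare normalized frequencies in the learner corpus and the reference corpus, and to test the difference with a log-likelihood statistic. The sketch below illustrates this general technique under stated assumptions; it is not necessarily the procedure used in the project. The corpus sizes are the token counts given above, while the frequency counts for the search item are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million tokens."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood statistic for a single search item."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Corpus sizes in tokens, as reported in the corpus descriptions.
LEARNER_TOKENS = 69_322
REFERENCE_TOKENS = 58_961_819

# Hypothetical counts for one demonstrative, invented for illustration only.
learner_count = 150
reference_count = 90_000

learner_rate = per_million(learner_count, LEARNER_TOKENS)
reference_rate = per_million(reference_count, REFERENCE_TOKENS)
ll = log_likelihood(learner_count, LEARNER_TOKENS, reference_count, REFERENCE_TOKENS)

direction = "overuse" if learner_rate > reference_rate else "underuse"
print(f"learner: {learner_rate:.0f} pmw, reference: {reference_rate:.0f} pmw")
print(f"log-likelihood: {ll:.2f} ({direction} relative to the reference corpus)")
```

A log-likelihood value above 3.84 corresponds to p < 0.05 with one degree of freedom. Because the two corpora differ greatly in size, normalized frequencies rather than raw counts have to be compared. Misuse cannot be identified from frequencies alone and requires inspection of the individual instances.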

Further research questions will be formulated as the project enters a new stage with a new member on board. The fact that the texts in the corpus have been evaluated on the CEFR scale also opens up the possibility of relating the use of cohesive devices to the pupils’ proficiency level.