Symposium on Language Assessment - How to Design
Adaptive diagnostic language
testing through the internet: the DIALANG
Peppi Taalas is a researcher at the Centre for Applied Language Studies, University of Jyvaskyla, Finland.
|Tel. +358-(0)14-603 526|
|Fax. +358-(0)14-603 521|
|Email : firstname.lastname@example.org|
|Personal web site : https://www.jyu.fi/~peta|
|Research interests: Educational change and educational technology in the sense of improving and enhancing the quality of learning and teaching. The "virtual classroom" : what is different, if anything, what is interaction in the new learning environments? (Definitely something that is NOT brought about by technology.)|
José J.E.A. Noijons is a staff member in the vocational department of CITO, Dutch National Institute for Educational Measurement, The Netherlands
|Tel. +31 26 3521 447|
|Fax. +31 26 3521 576|
|Research interests : Research into and development of all types of language tests. Currently development of computer adaptive tests of listening and reading, electronic item banking and testing on the Internet. Engaged in a number of international projects funded by the EU (Socrates and Leonardo da Vinci): DIALANG, the development of diagnostic language tests on the internet ; MULTINED, the development of a multimedia development tool for language courses ; V-LASSO, the development of internet-based adaptive language tests for intake purposes.|
Unfortunately, apart from Laurence Olivier's English / French contribution, the following presentations were in French. I will therefore not try to reproduce their content, but simply refer to the English descriptions that follow and, not least, to the opportunity to listen to the talks yourself on my recordings!
|Listen to Part 1
|Listen to Part 2 |
|Listen to Part 3 |
|Listen to Part 4|
As the eighties had just ended, Alderson (1991) had a prospective look on the coming decade and identified priorities which had to be addressed in second and foreign language assessment. Among these priorities was the development of new approaches and new techniques to assess learners' proficiency by means of educational technology.
Ten years later, where
are we? To what extent has the use of computers improved during the nineties,
and what challenges do we have to take up as we enter a new
millennium? More recently, Brown (1997) reminded us of the benefits of computer use as regards administration (individual, no time limit...),
psychometrics (computer accuracy, creation of an infinite number of
different tests...) and human considerations (less frustrating,
individualized feedback....). One may be surprised at Brown's research
agenda for the upcoming years which focuses only on issues in connection
with the development of adaptive tests in spite of problems related to the
use of these devices to assess a competence as complex as language
competence (Canale, 1986; Meunier, 1994). The design issues, rating
procedures and practical problems that he raised belong to the development
of instruments that originate in the early applications of item response
theory, with the concept of the "flexilevel" test proposed by Lord (1970).
Since adaptive tests seem to dominate the present debates on the use of computers for foreign language assessment, it is important to recall
briefly the theoretical bases of adaptive testing. Limitations will then
appear so that we shall suggest new uses of computers to assess language
Within the item response theory framework, once it has been shown that a
set of tasks can measure, to different degrees, the development of a
competence, it is possible to place on a common scale both tasks (according
to their difficulty, i.e., the level of competence that is required to
succeed) and learners (according to their level of development of the
competence). Tasks may then be stored in a bank, and a learner's level can be
determined from responses to a variable number of tasks. The test can then
be adapted to the learner by recalculating the competence level
after each new response in order to select the next most appropriate task.
By "appropriate task", we mean a task that is neither too easy nor too
difficult. As a result, the sequence of selected tasks is not necessarily
the same from one learner to the next.
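The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of any particular test: it assumes the one-parameter (Rasch) model, a standard normal prior on ability, a grid-based posterior-mean estimate, and a "closest difficulty" selection rule. The function names and the small item bank are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch model (an assumption here): probability of success on a task of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses):
    """Posterior-mean ability estimate on a grid from -4 to 4, with an N(0, 1) prior.

    responses: list of (difficulty, correct) pairs collected so far.
    """
    grid = [g / 10.0 for g in range(-40, 41)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalised prior density
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def next_task(theta, bank, used):
    """Select the unused task whose difficulty is closest to the current estimate."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return min(candidates, key=lambda i: abs(bank[i] - theta))

def run_adaptive_test(bank, answer, n_tasks):
    """Administer n_tasks tasks; answer(i) returns True or False for task i.

    Returns the final ability estimate and the order in which tasks were given.
    """
    theta, used, order, responses = 0.0, set(), [], []
    for _ in range(n_tasks):
        i = next_task(theta, bank, used)
        used.add(i)
        order.append(i)
        responses.append((bank[i], answer(i)))
        theta = estimate_ability(responses)  # recalculated after each response
    return theta, order

if __name__ == "__main__":
    bank = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical task difficulties
    # A simulated learner who succeeds on every task up to difficulty 0.5.
    theta, order = run_adaptive_test(bank, lambda i: bank[i] <= 0.5, 4)
    print("tasks administered:", order, "estimated level:", round(theta, 2))
```

Two learners who answer differently are routed through different tasks after the first one, which is the sense in which the sequence of tasks is not the same from one learner to the next.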
Since the foundations of the latent-trait theory from which item
response theory originated (Birnbaum, 1968), the psychometric literature has
grown with the development of increasingly sophisticated and complex
algorithms for estimating task characteristics or learners' ability, with
the development of measurement models that take into account the various types of answers and the nature of the competence to be assessed, and with the trialling of selection modes that take into account the content areas and the function of a test. However, these developments come at the expense of simplicity and are better suited to the needs of large-scale testing (sometimes with gigantic samples). On the other hand, computer aids are increasingly accessible and efficient for psychometric task analysis, bank construction and administration (on site or at a distance). It is important to underline that adaptiveness is still restricted, because learner characteristics such as learning style, interests and age are not given much attention.
At the same time, linear computerized assessment tools are also found. Although they are not merely a transposition of conventional paper-and-pencil tests, they are based on a classical measurement approach. Innovation lies more in the use of multimedia capabilities. This seems a sound approach, as adaptive testing is not always appropriate. Laurier (1998) distinguishes five major functions that must be taken into account in the use of computer capabilities.
1) Selection assessment - This may not be the most virtuous function but it is still fulfilled, mainly in educational systems which include programs that are accessed through a competition. Computers may be used, mostly for practical reasons, as this type of assessment often involves a norm-referenced measure. One could hardly justify the use of adaptive testing techniques only for selection purposes.
2) Placement assessment - Faced with heterogeneous levels, many
institutions use a placement tool to determine the general proficiency
level. Predictive validity is sought, and indirect measures that maximize the variance may be used. Since large differences between learners' levels are observed, adaptive testing techniques are particularly appropriate for this type of test.
3) Certification testing - This evaluation is often a high-stakes one for
the individuals. The tasks that are presented must be modelled on
real-life communication situations the learners may encounter. Since the cutoff score that is established usually corresponds to the expected level of proficiency in relation to the decision to be made, the development of an adaptive test is not always the most judicious choice (Wainer, 1993).
4) Formative assessment - Formative assessment is a continuous process in close relation with teaching. Diagnostic information is needed to regulate the learning process, i.e., modulate teaching and learning strategies according to the observed strengths and weaknesses. Rather than replicating real-life situations, it is important to form a judgment on the elements that compose the interactional abilities as described by Bachman (1991).
5) Achievement testing - This form of assessment is used at predetermined moments. It is important to sample the content in such a way that competence elements that have been presented in the class are adequately represented. Therefore the efficiency of adaptive testing procedures that select tasks according to their difficulty rather than their position in the syllabus is questionable.
It appears that in computerized testing and assessment, the development of an adaptive test is not always necessary and not even always desirable. In fact, adaptive testing seems to be particularly worthwhile for the construction of placement tools. Even for placement purposes, solutions that reduce the inherent limitations of current adaptive tests must be considered. For example, in the construction of the French CAPT, a computer adaptive placement test that we developed for English-speaking students learning French as a second language, we took advantage of recent advances in item response theory (Laurier, in press):
1) The unidimensionality assumption - Although the complexity of language competence does not necessarily imply the emergence of many traits that represent dimensions to be measured (McNamara, 1991; Dunkel, 1999), meeting the unidimensionality assumption cannot be taken for granted. For this reason, we chose to create five item banks rather than one. In addition, in order to take into account the high correlations between the comprehension questions that refer to a given passage, we applied the "testlet" technique (Wainer & Lewis, 1990) and analysed each passage as a single item (with k + 1 answer possibilities, where k stands for the
number of questions about a passage).
2) Limitation of dichotomous scoring - "Testlets" are analysed with a
graded response model. This model allows the design of tasks whose answers are not necessarily scored as simply correct or incorrect. Rating scales may be used, so that self-assessment sections can be integrated in an adaptive test. As far as the French CAPT is concerned, this potential was particularly useful to circumvent the twofold problem of oral expression: voice recognition and constructed response rating.
3) Number of experimental examinees - The number of experimental
administrations that are required is still a major problem in contexts other than large-scale testing. To avoid the use of a psychometric model that does not fit the nature of the tasks, efficient algorithms that produce accurate estimations must be available. Some estimation approaches that are implemented in recent software may help to reduce the number of examinees (Mislevy, 1986). Furthermore, multidimensional models, which probably best fit the nature of language competence, require a formidable number of examinees and are therefore generally inaccessible.
Apart from placement decisions, other ways to take advantage of computer capabilities must be considered (Burstein et al., 1996). Environments where multimedia improves authenticity can be imagined for certification testing. Admittedly, many authentic tasks that require a high literacy level are performed with computers. Think of the writer for whom a word processor is an indispensable writing tool, or the Internet surfer for whom reading tasks are performed on a computer screen. For these communication needs, it is obvious that the traditional paper-and-pencil environment puts the candidate in a situation that is very different from a real-life situation. In addition, the authenticity requirement may be met with assessment situations that simulate real-life situations. One does not have to use the heavy and imperfect technology involved in virtual reality
systems to create, with the computer, environments that incorporate the elements of a real-life situation and facilitate role-playing.
Multimedia leads the student to interact and provides a context for the execution of a task. It may then be easier to meet the requirements of interactivity and authenticity, which Bachman and Palmer (1996) add to the traditional requirements of validity, reliability and practicality.
The most interesting applications will probably arise from formative
assessment needs, but it will be necessary to free developers from adaptive testing principles that originate from item response theory. Indeed, this theory does not offer the flexibility needed to produce assessment tools based on a variety of tasks that are integrated in the pedagogical sequence and can regulate learning in an efficient way. We must count on computers as a guiding device for pedagogical intervention by combining information from various sources: performance on an exercise (communicative exercise or metalinguistic activity), self-assessment or diagnostic tests. Diagnostic
tests can be developed with the contribution of different approaches found in learner modelling to individualize learning (Giardina & Laurier, 1999): semantic networks, expert representations, mental model development, diagnostic approaches, measurement-oriented approaches. Within the last-mentioned approaches, it is necessary to explore some techniques based on the probability of combined occurrences of events, which aim at inferring the state of knowledge or, more specifically, the areas where the student should receive some kind of assistance. For example, in French, if the system observes recurrent use of intonation patterns for the interrogative form and syntactic problems with other interrogative structures, it may conclude
that use of "est-ce que" for questions is not acquired. A set of inference rules for diagnosis has to be devised. However, it is important to realize that the systems that have been designed so far in this perspective focus on acquisitions that are much simpler than those underlying the development of language competence, and are still out of teachers' reach.
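The "est-ce que" example can be written as a single inference rule over observed co-occurrences. The sketch below is purely illustrative: the strategy labels, the thresholds and the minimum sample size are assumptions made for the example, not values reported in the text, and a real system would need many such rules.

```python
from dataclasses import dataclass

@dataclass
class QuestionObservation:
    strategy: str      # hypothetical labels: "intonation", "est-ce que", "inversion"
    well_formed: bool  # whether the question was syntactically correct

def est_ce_que_not_acquired(observations, min_questions=5,
                            intonation_share=0.6, error_rate=0.5):
    """Return True, False, or None (not enough evidence).

    The rule fires when two events occur together:
    1. intonation questions are recurrent (share >= intonation_share), and
    2. the other interrogative structures attempted are frequently ill-formed
       (error rate >= error_rate).
    All thresholds are illustrative assumptions.
    """
    if len(observations) < min_questions:
        return None
    intonation = [o for o in observations if o.strategy == "intonation"]
    others = [o for o in observations if o.strategy != "intonation"]
    if not others:
        return None  # no attempt at other structures, so no evidence of syntactic problems
    share = len(intonation) / len(observations)
    errors = sum(1 for o in others if not o.well_formed) / len(others)
    return share >= intonation_share and errors >= error_rate

if __name__ == "__main__":
    sample = [QuestionObservation("intonation", True)] * 4 + \
             [QuestionObservation("inversion", False)] * 2
    print(est_ce_que_not_acquired(sample))
```

Keeping each rule as a small function with explicit thresholds is one way to leave the diagnosis open to modification by teachers, which is the concern raised in the next paragraph.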
Therefore, the challenge is twofold: to develop diagnostic tools that can be easily modified by teachers (who must be in charge of formative assessment), and to do so for a competence that is recognized as complex. However, this challenge can be taken up because, as we know, diagnostic test construction within formative assessment often involves the verification of isolated elements for a specific intervention. Therefore, replicating the conditions of an authentic use of the language is no longer a priority.
As far as achievement tests and, to a lesser extent, selection tests are
concerned, the application of adaptive testing procedures may jeopardize the content validity of a test. As a matter of fact, these tests must reflect the elements of a syllabus, so that particular procedures have to be used to ensure a proper representation of important elements of a program (Stocking & Swanson, 1993). Generally, the use of technology for this type of test is more appropriate during test construction and for the transmission of the results than for administration. Item analysis computer programs have been available for many years, and we now find
programs that can create item banks in order to produce conventional or computerized tests that are generally delivered in a linear way.
The computer allows the creation of databases which incorporate learners'
results within an educational programme and help to find the most
appropriate remedial measures.
By and large, we believe that the use of educational technology for foreign language testing and assessment is worthwhile from three different points of view, and we hope that research and development in the coming years will be oriented toward new means of taking advantage of this technology.
1) A computerized test is often more efficient than a conventional test - This advantage can be observed in different ways: immediate and economical transmission of the results, distance administration (via Internet), shorter and still reliable tests, test security, etc.
2) Computerization allows the creation of tasks that cannot be created in another way - This means new types of answers, a higher degree of
interactivity that may even allow changes in the test situation during the administration, integration of sound and images... This wealth should eventually improve validity, but we must be careful in this regard. It is particularly important to make sure there is no bias that favours those who are more comfortable with technology (Taylor et al., 1999; Coniam, 1999).
3) With a computerized test it is possible to individualize the pedagogical intervention - On the one hand, we must broaden the concept of adaptiveness in order to take into account not only the task difficulty but also the interests, the needs and the style of each learner. On the other hand, we have to know how to use assessment results to propose an individualized program.
Alderson, J. C. (1991). Language Testing in the 1990s: How far have we
come? How much further have we to go? In S. Anivan (Ed.), Current
Developments in Language Testing (pp 1-27). Singapore: RELC.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.
Bachman, L. F. & Palmer A. (1996). Language Testing in Practice. Oxford:
Oxford University Press.
Birnbaum, A. (1968). Some latent-trait models and their use in inferring
an examinee's ability. In F. M. Lord & M.R. Novick, Statistical Theories
of Mental Test Scores (pp 395-549). Reading, MA: Addison-Wesley.
Brown, J. D. (1997) Computers in language testing: Present research and some future directions. Language Technology 1 (1), 44-59
Burstein, J., Frase, L.T., Ginther, A. & Grant, L. (1996). Technologies
for language assessment. Annual Review of Applied Linguistics 16, 240-260.
Canale, M. (1986). The promise and threat of computerized adaptive
assessment. In C. W. Stansfield (Ed.), Technology and Language Testing (pp
29-46), Washington, DC: TESOL.
Coniam, D. (1999). Subject's reaction to computer-based tests. Journal
of Educational Technology Systems 27 (3), 195-225.
Dunkel, P. A. (1999). Considerations in developing or using second/foreign
language proficiency computer adaptive tests. Language Technology 2 (2),
Giardina, M. & Laurier, M. (1999). Modélisation de l'apprenant et
interactivité. Revue des sciences de l'éducation XXV (1), 35-59.
Laurier, M. (1998). Méthodologie d'évaluation dans des contextes
d'apprentissage des langues assistés par des environnements informatiques
multimédias. Études de linguistique appliquée 100, 247-255.
Laurier, M. (in press). The Development of an Adaptive Test for
Placement in French. In M. Chalhoub-Deville (Ed.), Trends in Computer
Adaptive Testing. New York: Cambridge University Press.
Meunier, L. (1994). Computer adaptive language tests (CALT) offer a great
potential for functional testing. Yet why don't they? CALICO Journal 11
Mislevy, R.J. (1986). Bayes modal estimation in item response models.
Psychometrika 51 (2), 177-195.
McNamara, T. (1991). Test dimensionality: IRT analysis of an ESP
listening test. Language Testing 8 (2), 45-65.
Taylor, C., Kirsch, I., Jamieson, J. & Eignor, D. (1999). Examining the
relationship between computer familiarity and performance on computer-based
language tasks. Language Learning 49 (2), 219-273.
Wainer, H. (1993). Some practical considerations when converting a
linearly administered test to an adaptive format. Educational Measurement:
Issues and Practice 12 (1), 15-20.
Wainer, H. & Lewis, C. (1990). Toward a psychometrics for testlets.
Journal of Educational Measurement 27 (1), 1-14.
Current SLA research has shown the importance, on the one
hand, of the pragmatic aspect of communicative competence and, on the
other hand, of the major role of interlanguage and individual factors. The
testing tradition in foreign languages is not troubled by considerations
of this type. Computer-assisted evaluation offers interesting
possibilities today. However, we have to admit that this new technological
approach does not, for the moment, guarantee a new approach to language
assessment. On the contrary, one can observe the same drift as with CD-ROM
language programmes, i.e. a real didactic regression hidden by
technological attraction and marketing.
My report, supported by my experience in developing the French
"Diplôme de competence en langues", will present three essential aspects we
should keep in mind. The first two aspects are taken from the theoretical
framework designed by L. Bachman and A. Palmer (1996): the authenticity of
the task and the degree of interactivity of the task. The third aspect,
which is essential to me if one wants to really make a diagnosis of
language competence which makes sense to the learner, is interlanguage
dynamics. This point develops the impact criterion suggested by Bachman and
My objective will therefore be resolutely descriptive.
Université Marc Bloch Strasbourg
Département de linguistique appliquée et de didactique des langues
22 rue Descartes
F - 67084 STRASBOURG
The French "Evaluation diagnostique à l'entrée en seconde en anglais"
concerns three of the four language skills : oral comprehension, written
comprehension and written expression. What we are looking for is to know
how pupils use the knowledge and know-how they have acquired to succeed at
the lycée level.
Each skill has been analysed to make an inventory of related competencies.
Each competence itself was then translated into evaluation items.
Competencies represent, in a nonexhaustive way, the principal mental
operations a successful listener, reader or writer uses, often in an
This type of evaluation pursues two general objectives:
1. In itself, diagnostic evaluation must help the teacher to determine with
precision the nature of learning difficulties in order to consolidate the
learning process. It also enables the teacher to identify the points of
support he will use to design his lessons.
2. As far as follow-up and analysis of results are concerned, it allows
· the teacher to plan different learning paths and to devise
differentiated tasks which respect the significant stages in approaching
a document and in training for production; when pupils are unable to
overcome difficulties, they cannot continue the study in progress, do not
perceive any coherence and remain unable to transfer the strategies suggested.
· the pupil to become aware of his own mental process, to know what he is
supposed to do, to feel secure while becoming more autonomous, in short, to
see that he can improve.
Linguistic competence underlies each of the four basic skills. For example,
not being able to recognize a past tense can hinder comprehension ; on the
other hand, being able to recognize the past tense does not automatically
imply understanding the meaning of a message. It is therefore judicious to
integrate linguistic competence into written comprehension, written
expression and, to a lesser degree, oral comprehension, as is usually
the case in the language classroom. Indeed, only training carried out
within communicative activities can guarantee both perception of values and
appropriation of the forms which express them.
Pupils entering the lycée will have to make the most of the knowledge and
competencies acquired at the collège level. The French National Language
Curriculum sets out the following expectations:
* knowing how to work, knowing how to learn ;
* being curious ;
* becoming autonomous.
Reports reveal that few pupils entering the lycée show the attitude
that ensures success. The raison d'être of the modular lessons we have
implemented is to help them develop learning and methodological
competencies and a willingness to take risks.
The aim of diagnostic evaluation is therefore to make it possible, for the
pupil, to know himself better and, for the teacher, to obtain a fast and
reliable picture of his class in order to set up the best adapted
strategies. It is of little use if it is restricted to a simple language
check-up. It becomes efficient when it helps pupils to go beyond learning
Any diagnostic evaluation has two goals:
1) Make an inventory of the competencies - and incidentally knowledge -
mastered by pupils coming from different horizons.
Knowledge and competencies are not mutually exclusive but complementary. This
database can only concern language competencies, which it is essential to
master if one wants to make progress, i.e. to develop and
strengthen transferable knowledge. Transfer of competencies is one of the
major objectives of the lycée.
Evaluating the specific competencies necessary for good comprehension or
production of a written or oral message in English represents the essential
diagnosis needed to build up a lesson.
2) Use this database for remediation. Remediation can be based only on a
common interpretation, by the teacher and the pupil, of language errors.
The more an exercise integrates different competencies to be tested, the
easier it is to do. Unfortunately, errors also become more difficult to
interpret, and remediation becomes uncertain.
The choice we made, in order to facilitate remediation, was to split up
skills and competencies into identifiable items and to test pupils on the
most fundamental items.
This choice leads to two major consequences:
1) Using a different code for each type of error is useless. Coding is thus
Coding indicates if the objective is attained or not. Achieved objective
does not always mean 100 % success ; it is necessary to take into account
the right to make a mistake. A binary coding gives a caricatural picture of
This is why we introduced code 2, which indicates that, although the fixed
threshold has not been reached, the exercise contains a certain number
of correct answers. This code should avoid discouraging pupils who have
partially succeeded. Binary coding would have given them a "not achieved" mark.
In addition, a specific error code (code 6), available in a certain number
of exercises, makes it possible to analyse wrong answers more finely. Code 6
informs the teacher about the nature of a specific error and can also
indicate that it was probably due to a misunderstanding of the instruction.
This should provide a picture of the class that is closer to reality.
2) Classifying exercises according to the four basic skills is not relevant.
The same competencies, the same mental operations are concerned in written
or oral skills. It seems wiser to organize evaluation around competencies
(written or oral indifferently) rather than around formal linguistic
Our choice to evaluate pupils' basic mental operations explains why our
evaluation can move from written comprehension to oral comprehension, for
A diagnostic and formative evaluation
Diagnostic evaluation is also formative: it should help pupils become aware
of their own learning strategies, build and increase their knowledge and
Formative evaluation must help the teachers to have a better idea of the
various mental operations necessary for each skill and make it possible for
pupils to take into account their own learning strategies.
Such an approach emphasizes mental operations and allows pupils
· to become aware of the need for good learning strategies ;
· to adopt a personal approach in developing their own learning process,
which cannot be the reproduction of a single model.
Importance of individualization
The " classe de seconde " is the right moment for pupils to get a better
idea of the strategies they use when resolving a complex comprehension or
expression task. Evaluation represents the first step of this process.
Distinction between evaluation and training activities
For a better picture of each pupil, evaluation activities split up complex
tasks into simple ones, whereas teaching, although it can also rely on
simple tasks, is turned towards the achievement of complex tasks on a short
term basis. Differentiation of learning objectives, of tasks, of mediation
techniques is thus inevitable at any moment.
The exercises presented in the evaluation booklet should not be done in a
systematic way, in the hope of a better control of each skill. They are a
starting point for a new typology of learning exercises. Involving
pupils in complex communicative activities remains vital.
The oral production skill does not appear in the diagnostic booklet, but it can
be evaluated separately throughout the school year. Volume 3 of the booklet
gives some general indications for the evaluation of oral production
Second language acquisition follows universal stages of acquisition, and
interlanguage variation is constrained by specific aspects of language
These regularities form the basis for an observation-based linguistic
screening procedure which is administered on the computer. This procedure
makes it possible to assess a learner's level of language development and
his or her variational type on the basis of a very brief speech sample that
is elicited using specific communicative tasks.
Apart from demonstrating the procedure I will show the accelerating effect
and the naturalness of communicative tasks in data elicitation. I will also
demonstrate the reliability of the on-line assessments.
A further issue is the comparison of results obtained from linguistic
profiles with results obtained through proficiency rating. In several
studies we found a significant correlation between the two measures when
taken in the same sample.
Finally the diagnostic value of the procedure and its implication for
teaching practice will be outlined.
Contact details for further information:
Claude Springer (who organised the
conference) is a researcher at the Département de Linguistique
Appliquée et de didactique des langues de l'Université Marc Bloch Strasbourg. He is Chairman of the French Confederation for the Development of Applied Linguistics.
|Tel. 03 88 41 74 22|
|Fax. 03 88 41 59 06|
|Email : email@example.com|
|Web site : https://u2.u-strasbg.fr/dilanet/|
|Research interests : Assessment of language performance. Language acquisition : definition of interlanguage profiles. Plurilingualism and language learning : didactic problems related to content and language integrated classroom. Self-directed learning and ICT.|
|Book published : La didactique des langues face aux défis de la formation des adultes, Ophrys, 1996.|
Michel D. Laurier is Professor at the Faculté des sciences de l'éducation, University of Montréal, Canada. He is a researcher in second language assessment. He has been involved in the development of different types of language tests. He is also director of the "Centre de langues patrimoniales".
|Tel: (514) 343-7034|
|Fax: (514) 343-2497|
|Research interests : Research in curriculum evaluation ; CALL ; computer assisted language testing and problems related to authentic testing.|
Lyle Bachman is Professor
at the Department of TESL/Applied Linguistics and
Director, English as a Second Language Placement Examination, University of
California, Los Angeles, USA. Editor of Language Testing.
|Tel: 310 267-2085|
|Fax: 310 206-4118; 818 706-9816|
|Web site: https://www.humnet.ucla.edu/humnet/al/bachman/|
|Research interests : Empirical (quantitative and qualitative) studies into : the nature of language ability, factors that affect performance on language tests, relationships between individual characteristics and performance on language tests, relationships between various types of strategies (e.g. learning, cognitive, metacognitive) and performance on language tests.|
|Books published : Language Testing in Practice, Oxford University Press, 1996. |
Fundamental Considerations in Language Testing, Oxford University Press, 1990.
Manfred Pienemann is Professor of Applied
Linguistics at the University of
Paderborn, Germany. He recently returned to Germany after 10 years at the University of Sydney and 5 years at the Australian National University. In 1983, he founded the Language Acquisition Research Centre at the University of Sydney . During this time he developed a psycholinguistic theory of language acquisition.
|Tel: +49 5251 - 60 28 66|
|Research interests : Psycholinguistics, Language Acquisition Theory, Language Teaching Research and IT in Applied Linguistics. Research on psycholinguistic constraints on language teaching. M. Pienemann also developed a number of practical applications which follow from his theoretical and empirical work, including COALA, a software tool for computational linguistic analysis, and computational tools for linguistic profiling.|
|Books published: |
M. Meisel, H. Clahsen, M. Pienemann, "Deutsch als Zweitsprache. Der Spracherwerb ausländischer Arbeiter." Language Processing and Second Language Acquisition, J. Benjamins, 1998.