ABSTRACT
Our research team of middle
school teachers at a suburban middle school in Korea’s third largest city
was asked to participate in a “level based teaching” experiment to which
we had grave ethical and methodological objections. We decided instead
to leave the classes undivided, and develop materials which would allow
learners working in pairs a choice: one set of “low output” exercises that
had explicit models for imitation or very narrow variation, and another
“high output” set which opened up into more extensive controlled practice.
We then asked if we could account for the choices that learners made in
terms of their “ability,” measured by tests or wordcounts and c-unit counts
in a pairwork task. The answer was, mostly, no. For the purposes
of placement and materials design, high and low output levels are probably
better conceived as levels of activity than of “ability” -- properties
of situations and materials rather than learners.
PRESENTER
BIOGRAPHY
David Kellogg taught at
university level for a decade in China. He then worked in the EPIK
program for nearly 2 years at Kwancheon Middle School in Taegu. He
now teaches at Pusan National University of Education.
MATERIALS
1. INTRODUCTION: MIXED
ABILITY OR LEVEL BASED CLASSES?
Two poems, to begin with.
There is a classroom in Pusan
Where some kids can't talk
and some can
If you can't, you're a LOG
If you can, you're a HOG
And the High Output Generators
always try to generate the highest level of output with as many words and
as complex syntax and as rich a vocabulary as they possibly can.
There is a classroom in Taegu
Where some kids don't talk
and some do
If you do, you're a ….
If you don't, you're a …
HOGs say "How are you?",
LOGS say.…
Part of the fun of these
poems is the sound, and part is the story, but the main fun is an attempt
to REPRODUCE the HOGs and LOGs and not simply describe them.
Accordingly, part
of this research is a story. I’m going to tell the story of how we resisted
an attempt to divide classes in half and label the bottom half under-achievers,
and how we produced a book of ripping good yarns to supplement a very boring
middle school textbook.
Part of this research
is data: actual word output, test scores, and the relationship between
them. I’m also going to present data on whether or not high output and
high test scores predict the tendency of kids to tackle hard exercises,
and conversely.
But the main thing
I really want to do is to try to reproduce the research—so I’m going to
have you look at the hypotheses—consider whether they are plausible or
not. Then look at the data—match the data to the hypotheses. And finally
we’ll draw some conclusions. In this way, I hope to not simply describe
our research but recreate some of the excitement.
Every class, Prodromou remarks,
is a mixed ability class (Prodromou 1992: 7). But of course some classes
are more equal than others. For that matter, some abilities are more equal
than others.
What is to be done? Well,
according to the idea of “level based teaching”, we unmix the abilities,
and prepare separate materials for each group. My argument is that this
cannot be done and should not be attempted at least as far as output is
concerned. My argument is that output is better thought of as a level of
activity, rather than a level of ability.
But let me tell the
story from the beginning. Like so many things in Korea, I think it really
goes back to the Cold War. In 1960, partially in response to the Soviet
Sputnik/Vostok missions in space, the US was involved in a special science
education project. This meant setting up what Buell (1960) called “homogenous
groupings” in secondary schools, with “fast tracks” for good students.
The idea spread to
Korea. Ever since, budgets have been made available to pursue experimental research
in “level-based teaching”. Typically, classes are split into same-sized
groupings on the basis of test scores, and the “lower ability” class is
given “easy” material, while the “upper ability” class receives a richer
diet. The test results of the two classes are then compared with undivided
"mixed ability" control groups.
It is easy to see
why this kind of research (which our team dubbed “classroom horse-racing”)
yielded results we might call unequivocally ambiguous. No attempt was made
to control for variables such as teacher experience, teacher enthusiasm,
learner attendance, attendance at extra-curricular hakwon cram schools,
or previous English study.
Beretta and other writers
have pointed out the procedural and underlying methodological fallacies
of this kind of quasi-experimental comparative methodology (Alderson &
Beretta, 1992).
For our study, which
was to focus on output, there were problems with the way this sort of classroom
sweepstakes operationalized ability as well. The key dependent variable
was viewed as nothing more than mid-term and final examination scores,
from in-house tests of no particular external validity or consistency (they
are written by school staff on a rota basis). We strongly suspected that
this construct of ability--backwashed from discrete point multiple choice
examinations--would not accurately predict actual oral output. Even if
it did, there was no reason to expect the kind of growth that was expected
of us by the ministry. As Nunan has pointed out (Nunan 1998), real proficiency
is not so susceptible to linear growth.
We went into the project
fired with a reluctance that was not only procedural and conceptual, but
also ethical. We did not believe that ability was destiny; we did not even
believe it was real ability. We did believe, however, that “low ability”
could be a self-fulfilling prophecy.
PRELIMINARY STUDY: MAKING
A CASE AGAINST CLASS DIVISION
To make our case, we
collected test scores from first and second semesters of two English classes
of 48 kids each, and also set them a simple, chatty pairwork task,
which was videotaped and transcribed to measure their output
in words and correct c-units. The kids were requested to ask and answer
three questions which they were often exposed to in pre-class or the beginning
chat-up of a class, to wit:
How are you?
Tell me about your family.
What did you do yesterday?
Questions were not
included in the word or c-unit count because they were written out for
the students. We disregarded repetitions in the word count, but we included
any word token, appropriate or not. The definition of c-unit was any utterance
which could, using normal rules of elision and reference, be reconstructed
into a grammatical clause in context. C-units had to be appropriate and
grammatically well formed in order to be counted.
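The word count rule above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually followed (our counts were made by hand from transcripts). The function name `count_words` is hypothetical, and the sketch assumes that a "repetition" means an immediately repeated token, as in a stammer or false start. The c-unit count is not sketched, because judging appropriacy and well-formedness needs a human reader.

```python
import re


def count_words(utterance: str) -> int:
    """Count word tokens in a learner's reply, right or wrong.

    Assumption (not stated in the study): a "repetition" is a token
    repeated immediately, e.g. "I I I am fine" counts as three words.
    """
    # Normalise curly apostrophes so that "I’m" stays one token.
    text = utterance.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", text)

    count = 0
    previous = None
    for token in tokens:
        if token != previous:  # skip immediate repeats only
            count += 1
        previous = token
    return count


if __name__ == "__main__":
    print(count_words("I'm fine. Thank you. And you?"))
    print(count_words("I I I am am fine"))
```

Any token is counted whether or not it is appropriate, which matches the rule for the word count but not for the c-unit count.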
Note that at no time
did we attempt to manipulate the pairings in any way--we simply used the
pairs which naturally occurred in the seating arrangement in class (generally
based on the height of learners, though of course this variable does not
remain stable for long).
We had two reasons for our
“hands-off” policy towards the pairs. First of all, we sought to make our
research as non-invasive as possible, so as to interfere as little as possible
in the normal life of the class; in particular, we determined to do nothing
that would give the kids grounds for labelling certain learners under-achievers.
Secondly, we wished our results and the resulting materials to have external
validity, and we recognised that few Korean teachers would have time or
inclination to manipulate pairs in their classroom even if they had the
means to reliably diagnose individual learners.
So that gave us this data:
DATA:
1) End-term and mid-term test scores
2) Direct output measurements from the three-question pairwork task
WORD COUNT: number of words produced (right or wrong, not counting
repetitions)
CLAUSES OR C-UNITS: expressions that could be construed as clauses,
e.g. “(I’m) fine. (I) thank you. And (how are) you?” = three c-units.
These must be correct and appropriate to be counted.
We examined the data in the
light of the following hypotheses.
H1: Test scores are not stable
in our learners--they vary with test difficulty (because the tests are
written by different teachers with widely varying standards), and with
the development (not always linear) of the learners themselves. Therefore
the final test in the first semester would not accurately predict mid-term
test scores in the second semester, or test scores on a nation-wide standardised
test.
Do you think this hypothesis
will be proven true?
YES ………… NO ………..
H2: Direct output measurements
(that is, word counts and correct c-unit counts) are not predictable from
test scores--learners who are good at dealing with input in listening and
reading formats may still find output very difficult.
Do you think this hypothesis
will be proven true?
YES ………… NO ………..
H3: Output will increase
in quality and quantity when it is “pushed” (e.g. by the presence of a teacher
in the interview), and this increase will happen whether or not the learner
is a “high” or “low” scorer on tests.
Do you think this hypothesis
will be proven true?
YES ………… NO ………..
H1 DISPROVEN: TEST SCORES
ARE RELATIVELY STABLE
H1 was roundly disproved--we
found very high correlations between First and Second Semester test results
(.761** p < .01), and between both of these and the national test administered
to students on September 9th 1998 (.820** and .837**).
The second semester
mid-term, being mostly listening based, shows a weaker relationship with
the first semester mid-term, but there is remarkable consistency, demonstrating
how narrowly syllabus-based our apparently independently written exams
are. In fact, all of these exams were based, as were our subsequent materials,
on eight similar state-approved middle school text books sharing a common
grammatical syllabus and word list.
H2 PROVEN: TEST SCORES
DO NOT RELIABLY PREDICT OUTPUT LEVEL
H2 fared better. We
found a significant but really quite weak relationship between our direct
output measures and the various test scores. This allowed us to
argue that classes divided on the basis of test scores would not reflect
differences in output ability.
To see the sort of
practical problem we were up against, try dividing the scores on the following
scatterplot into two coherent classes, one “upper” and one “lower” in “ability”.
There does not appear to
be any natural seam along which the class will split. If we divide, as
our predecessors did, at the median test score of 61, we will find both
HOGS and LOGS in the resulting “low” class. But if we divide at the median
output score, around 10, we find both “high” and “low” test scorers in the
resulting class.
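The difficulty of finding a seam can be made concrete by cross-tabulating a median split on test score against a median split on output. The sketch below is illustrative only: the function name `median_split_table` is hypothetical, the scores are invented (chosen so that the medians fall near 61 and 10), and treating a score exactly at the median as “low” is our assumption.

```python
from statistics import median


def median_split_table(test_scores, output_scores):
    """Cross-tabulate learners by median split on two measures.

    Keys are (test band, output band). A score equal to the median
    is treated as "low" (an assumption made for this sketch).
    """
    test_cut = median(test_scores)
    output_cut = median(output_scores)
    table = {
        ("high", "high"): 0,
        ("high", "low"): 0,
        ("low", "high"): 0,
        ("low", "low"): 0,
    }
    for test, output in zip(test_scores, output_scores):
        test_band = "high" if test > test_cut else "low"
        output_band = "high" if output > output_cut else "low"
        table[(test_band, output_band)] += 1
    return table


# Invented scores for eight learners, for illustration only.
tests = [35, 48, 55, 60, 62, 70, 81, 90]
words = [12, 4, 15, 8, 6, 18, 9, 20]

if __name__ == "__main__":
    for bands, n in median_split_table(tests, words).items():
        print(bands, n)
```

If test scores and output measured the same “ability”, nearly all learners would fall in the high-high and low-low cells. The off-diagonal cells are the HOGS in the “low” class and the LOGS in the “high” class.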
This is reflected
by the Pearson correlation coefficients, which, although highly significant
thanks to our large sample, indicate a very weak relationship even using
a crude word count (.506** p < .01). When we add the dimension of quality,
and demand appropriacy and well-formedness in the c-unit count, the relationship
becomes weaker, not stronger (.405** p < .01).
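One way to read these coefficients is to square them: r = .506 means that test scores share only about 26% of their variance with word counts, and r = .405 only about 16% with correct c-unit counts. The sketch below shows the calculation in plain Python. The function names are hypothetical, and the sample lists in the usage lines are invented; only the two coefficients come from our data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def shared_variance(r):
    """Proportion of variance two measures share (r squared)."""
    return r * r


if __name__ == "__main__":
    # Invented scores, for illustration only.
    print(pearson_r([35, 48, 55, 60], [12, 4, 15, 8]))
    # Coefficients reported in the text.
    print(round(shared_variance(0.506), 3))
    print(round(shared_variance(0.405), 3))
```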
H3 DISPROVEN: TEACHER
PRESENCE PUSHES OUTPUT DOWN, NOT UP
H3, that teacher presence
would push output, yielded the most interesting, but also the most ambiguous,
results; interesting because counter-intuitive, but ambiguous because our
research had been designed to meet a probable administrative demand for
class division rather than rigorously control variables. To our surprise,
the pairwork interviews yielded significantly more words and also significantly
more correct c-units than the teacher-led, individual interviews.
We were a little suspicious
of this result and initially attributed it to the fact that the questions
in the teacher-led condition were sometimes slightly different, with the
teacher sometimes asking questions which, although easier, gave shorter
replies, in order to ensure that the LOGS were able to make some kind of
response. But we reconsidered this attribution in the light of the results
for one class. We were forced to administer the pairwork twice for one
class, because during administration of the first pairwork task, there
had been a (rather interventionist) teacher present in the room running
the video camera (whereas the other class had simply had the camera running,
fly-on-the-wall fashion). The questions and indeed the pairs were precisely
the same, but the second time around, with no teacher in the room, the
kids did substantially better.
Was this the practice
effect? Perhaps, but probably not entirely. First of all, the questions
were all extremely familiar ones from classwork anyway, and therefore the
first administration of the questions should have also had a practice effect.
Secondly, there is a notable difference in the way the kids appear on the
two videos. With the teacher present, they stare at the table and go through
the questions learner by learner, with each learner taking turns to ask
three questions in a row, as if the learners were taking turns being a
teacher.
Without the teacher
present, the learners tend to proceed question by question, with the first
learner asking the first question, the second responding and then repeating
the question or asking “You?” or “And you?” or even “How about you?” This
allows learners to crib from each other’s answers in an obvious way, and
produces something much more like real conversation, with the topics rather
than the roles of questioner and answerer being shared and alternating.
Compare (without the teacher present):
DO LOGS GAIN MORE FROM
PAIRWORK THAN HOGS?
So we speculated that
learners at the low end of the output scale would especially benefit from
the absence of the teacher, because the presence of the teacher inhibits
the stronger learner from helping the weaker. Perhaps the HOG considers
that it is really the teacher’s job to intervene, or, more likely, the
HOG considers this kind of help to his LOG partner a form of cheating on
exams and so refrains--so long as the teacher is present!
As you can see, the
data is a little hard on our speculation that LOGS would do exceptionally
well in closed pairs. Both HOGS and LOGS benefit from the teacher’s absence
by producing more words. But when we ask for accurate, appropriate c-units,
only the HOG can really do better without the teacher (perhaps because
he is taking over the teacher’s role?). At least when one looks at output
in its own right, independent of any posited benefit mediated by interaction
or input, it is the HOG who truly “pushes” his output in the direction
of accuracy as well as fluency.
SUMMING UP OUR CASE
We then summed up our
case. We argued that this kind of HOG-LOG interaction should be encouraged
and not prevented by erecting a classroom wall between HOGS and LOGS and
selecting their material for them. We proposed to see different levels
of output generation as different levels of ACTIVITY, not different levels
of ability. So we would design materials that separated the levels of ACTIVITY,
not the students. Learners, working in pairs, could choose between a) exercises,
which gave people fairly rigid models to imitate, and b) exercises, which
provided an extended, rather loose model, on which to improvise. We expected
that the a) exercises would be too easy for some, and b) exercises too
difficult for others, and the pushing and pulling in pairs would be good
for both HOGS and LOGS.
This argument was never
explicitly approved by the Municipal Ministry of Education or by the principal
of our school, but it won strong support (52 out of 56) from a poll of
teachers at our seminar, and it was therefore given tacit permission to
go ahead.
POST STUDY: DESIGNING
AND EVALUATING OUR MATERIALS
The result was a book,
and an accompanying CD ROM, entitled “A Cow’s Head and Other Tales” (the
title is a reference to a Korean folktale about a lazy LOG who is turned
into a cow and learns to work hard). Lesson 10, one of two lessons examined
in our post-study, can be found as an appendix to this article (without
the graphics, unfortunately).
We evaluated the material
using a portfolio of methods, both objective and subjective, both qualitative
and quantitative. Once a fortnight, classes were “co-taught”, and the co-teachers
evaluated the classes impressionistically. This often led to revision and
rewriting of the materials. Classes were occasionally videotaped. This
demonstrated to us that, despite our design, relatively little class time
was spent in closed pairs, as learners were very unused to this format
and breakdowns of discipline were widespread. Learners were polled as to
the materials’ “effectiveness” and interest. In general, learners felt that
our material was “more interesting” than the usual textbook fare, but not
as effective. This is not surprising, since learners tend to see effectiveness
through their parents’ eyes, in terms of the examinations, which, as we
have seen, are narrowly based on a state-approved textbook, and learners
often did not recognise the sentence patterns from the state-approved textbook
when we embedded them in “A Cow’s Head and Other Tales.”
POST-STUDY: MORE
HYPOTHESES
Since the material
was designed to produce ink on paper as well as words in the air, the teacher
using the material could see at a glance which exercises were being done,
or (as we did on two occasions) collect the material and make precise measurements.
We awarded one point for each b) exercise which was at least half complete,
regardless of accuracy--what we wanted to look at was the extent to which
the pairs were game for the challenge.
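The scoring rule is simple enough to state as a short function. This is a sketch only: the name `b_exercise_score` and the idea of recording each b) exercise as a completion fraction between 0 and 1 are our assumptions for illustration. In practice the worksheets were judged by eye.

```python
def b_exercise_score(completion_fractions, threshold=0.5):
    """Award one point per b) exercise that is at least half complete.

    completion_fractions: one number per b) exercise on a pair's
    worksheet, from 0.0 (untouched) to 1.0 (finished). Accuracy is
    deliberately ignored; only willingness to attempt is scored.
    """
    return sum(1 for fraction in completion_fractions if fraction >= threshold)


if __name__ == "__main__":
    # Invented worksheet: four b) exercises, two at least half done.
    print(b_exercise_score([1.0, 0.5, 0.4, 0.0]))
```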
One thing we wanted
to do was to empirically test an old bit of teaching wisdom--that unequal
pairs are better for introducing new material but equal pairs are better
for practising it. Since the “Let’s Talk” portion of the lesson we collected
data for was entirely for “Let’s Practice” (the language points having been
presented using the usual text book for more than four days previously),
we might expect more b) exercises would be done by homogeneous HOG pairings,
and fewer by pairs of mixed ability or homogeneous LOG pairs. So we approached
the data with the following hypotheses.
DATA:
1) End-term and mid-term test scores
2) Direct output measurements from the three-question pairwork task
WORD COUNT: number of words produced (right or wrong, not counting
repetitions)
CLAUSES OR C-UNITS: expressions that could be construed as clauses,
e.g. “(I’m) fine. (I) thank you. And (how are) you?” = three c-units.
These must be correct and appropriate to be counted.
3) Worksheets from the exercises indicating how many a) exercises and
how many b) exercises were done.
We examined the data in the
light of the following hypotheses.
H4: There will be no significant
difference between the two classes in the tendency to tackle the b) exercises,
as there was no significant difference in “level” of output (or, for that
matter, test score).
YES …………… NO ………..
H5: Pairs which consist of
HOGS (that is, learners who performed above the median in pairwork output
assessment tasks both before and after the project) will show a strong
tendency to tackle the b) exercises. LOG pairs will not.
YES …………… NO ………..
H6: HOG-LOG pairings will
tackle the b) exercises, but will complete rather fewer of them than unmixed
HOG pairs and rather more than unmixed LOG pairs. In other words, the mixed
pairings will behave like the sum of their parts.
YES …………… NO ………..
H4 DISPROVEN: CLASS MAKES
A DIFFERENCE IN EXERCISE CHOICE
Again, the data was
rather cruel to our first hypothesis. The following bar chart shows
at a glance that, although Class 2-10 was slightly better at producing
correct clauses in the initial pairwork task and there were no significant
differences in word counts or test scores, they are absolutely unwilling
as a class to tackle the b) exercises. Class 2-11, on the other hand, is
far more game for the difficult exercises.
How can this spectacular
difference be explained, since the two classes are so very similar in other
respects (we include the c-unit count to demonstrate this, but the other
measures show similar comparability)? Classes 2-10 and 2-11 were both co-taught
by a Korean teacher and a Native Speaker for Lessons 9 and 10. The Native
Speaker was the same for both classes. The Korean teachers were different,
although both were young women with roughly the same amount of teaching
experience and the same training. The Korean teacher for 2-10, however,
was noticeably passive in class and spoke very little, limiting herself
to supporting the Native Speaker with translations; the teacher for 2-11
was aggressively interventionist, often ignoring the Native Speaker entirely
and driving the material forward in Korean only, reserving English for
the pairwork itself. We drew two conclusions for our materials. Firstly,
the NON-native speaker teacher has the critical role, although it is an
indirect one, in managing closed pairwork. Secondly, output is never stable;
it is constantly influenced by other variables. HOGS and LOGS were clearly
not separate species which could be safely removed to separate habitats
without altering their behaviour. In fact, they were often not even separate
individuals.
H5 AND H6: A PROBLEM OF
LEAKY CATEGORIES
The fact that the groups
of indubitable HOGS (who scored above the median on both output measures)
and LOGS were small meant that the number of pairs we could examine
was even smaller: five double-HOG pairs, six double-LOG pairings, and
only three of the crucial, most interesting, heterogeneous HOG-LOG teams.
Another reason for the small number of heterogeneous groupings may have
been the very effect we were counting on--that high output attracts high
output from partners regardless of their level, and low output creates
low output in partners, regardless of their “level”--thus creating more
HH and LL than HL. Of course, had we known during the data gathering who
the HOGS and LOGs were, we could have deliberately manipulated the pairings
to increase the number of heterogeneous groupings, but that too might have
altered output by making people work together who normally wouldn’t have
done so. In retrospect, the non-invasive “double blind” approach--with the
identity of the HOGS and LOGS unknown to all parties--was probably best.
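The way the categories leak can be seen in a small sketch of the labelling. It assumes, following H5, that a HOG is a learner above the median on both the earlier and the later output measurement, that a LOG is at or below the median on both, and that everyone else is left unlabelled. The function names and the scores are invented for illustration.

```python
from statistics import median


def label_learners(first, second):
    """Label each learner "HOG", "LOG" or None (not consistent).

    first, second: dicts mapping learner -> output score on the
    earlier and later measurement.
    """
    cut_first = median(first.values())
    cut_second = median(second.values())
    labels = {}
    for learner in first:
        if learner not in second:
            continue  # no second measurement, so no label
        above_first = first[learner] > cut_first
        above_second = second[learner] > cut_second
        if above_first and above_second:
            labels[learner] = "HOG"
        elif not above_first and not above_second:
            labels[learner] = "LOG"
        else:
            labels[learner] = None  # switched sides between measurements
    return labels


def label_pair(pair, labels):
    """Return "HH", "LL" or "HL", or None if either learner is unlabelled."""
    a, b = (labels.get(learner) for learner in pair)
    if a is None or b is None:
        return None
    if a == b:
        return "HH" if a == "HOG" else "LL"
    return "HL"


# Invented word counts for six learners, for illustration only.
before = {"A": 20, "B": 18, "C": 5, "D": 4, "E": 15, "F": 6}
after = {"A": 22, "B": 17, "C": 6, "D": 3, "E": 5, "F": 16}

if __name__ == "__main__":
    labels = label_learners(before, after)
    for pair in [("A", "B"), ("C", "D"), ("A", "C"), ("E", "F")]:
        print(pair, label_pair(pair, labels))
```

Every learner who switches sides between measurements removes a pair from the analysis, which is one reason the usable pairs are so few.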
The bar graph shows
the average number of b) exercises attempted per pair each lesson--but
remember that the imposing middle bars represent a mean of only three heterogeneous
pairs.
So the double HOG pairs
tackle, on the average, more of the difficult exercises than the double
LOGS, but this result can hardly surprise. Rather more odd is the fact
that the HOG-LOG groupings behave, on the whole, rather better than the
sum of their parts, or at any rate much more like double HOGs than like
double LOGS. But perhaps this is a freakish result caused by the small
size of the HOG-LOG sample, which comprises, remember, only six learners
in three pairings.
Using all the data
lumped together, irrespective of pairings, we find that the tendency to
do the b) exercises is not closely predicted by any of our measures of
“ability”: test scores or output measurements (see correlations in the
appendix). In the same chart, you will note that the first output measurement
correlates rather weakly with the second measurement, at only about .584**
(significant at p < .01) for the word counts and only .236* (significant
at p < .05) for the correct c-unit count. This is further evidence that the
HOG is a frail creature, indeed, a creature of circumstance, and a product
of an environment which is easily manipulated, for worse and for better,
by the materials designer and the teacher.
EPILOGUE: A SIGNIFICANT
BUT NOT MEANINGFUL DIFFERENCE
This was not the sort
of conclusion expected by our funders or even by ourselves--they wanted
to see, after the six months of our project had elapsed, a significant
increase in test scores, or failing that, in our output measures. In the
event, no increase in test scores took place, though there was a significant
decrease in one class (2-10). In output, there was a small (but greater
than the Standard Error of the Mean) increase in word output in one class
only (2-11) in the six months from July 1998 to November 1998.
This lacklustre result
has many mitigating circumstances: the long summer holiday, the crowded
(nearly fifty student) classes, of which only one forty-five minute session
every fortnight could be devoted to our material, and the difficulty of
securing the cooperation of the teachers of these two classes, who felt
that the whole project simply added an extra burden to their already over-loaded
schedule.
But the main reason
we do not take it to heart is simply that we understand that there is a
tremendous difference between traditional measures of “ability”, based on
tests, and our output measurements, subject to all kinds of performance
variations. Even if we had accepted that our “output measurements” reflected
in a direct way language proficiency, with Nunan, we do not expect this
to increase in a linear way.
CONCLUSION: HOGDOM/LOGDOM
IS A STATE OF MIND
At first glance, our
results appear to be even more unequivocally ambiguous than those of the
“classroom horse-racing” studies which preceded us. We found that not only
do our output measures correlate rather poorly with test scores, they are
not very stable in themselves; a third of the children were not consistently
HOG or LOG over six months, and our c-unit count had a reliability coefficient
of only .236 (p < .05: see appendix two for other counts). When we look
at the choices that pairs of learners made, the only firm conclusion which
we can draw is the rather banal one that pairs of two “High Output Generators”
did more of our b) exercises than pairs made of two “Low Output Generators”.
We are not able to make any accurate predictions about what Mixed HOG-LOG
groups do.
Uncontrolled
factors are clearly rampant. Among these factors are probably the crucial
decisions which teachers and materials designers make, often in blithe
disregard of Second Language Acquisition theory: teacher-centredness, use
of mother tongue, choral work, and the motivational power of a ripping
good yarn. Thus there necessarily remains a huge gap--not a contradiction,
but nevertheless a gap--between what SLA theory says may be good for us
and what teachers and materials designers know will work. Our research,
and even the book we developed, are really a perverse attempt to build a bridge along
that gap (on the what-will-work bank of the abyss) rather than across it.
Research has a tendency
to shake the constructs that it is based on. This can be a very useful
surveying or even ground-clearing exercise, but it is rather troublesome
to those who have already erected substantial assets on those constructs.
The data was not kind to the hypotheses we formulated, and still less to
the categories of HOG and LOG we posited (fortunately, we posited these
post-facto and did not use them to form classes). But we emerge from it
with our faith in closed pairwork, even in crowded Korean classrooms, intact
and we find the materials we rather prematurely erected upon it still standing.
Our classrooms do not in fact chiefly consist of pure-bred HOGS or unmitigated
LOGS; instead, they are a Gordian knot of learners with many abilities
and many levels of ability not easily susceptible to disentanglement, whether
by researchers or by sword-wielding administrators. Then let Prodromou’s
remark that every classroom is a mixed ability classroom stand, not simply
as a statement of what is, but as a statement of what should be….
To conclude:
In classrooms from Seoul
to Cheju
You find learners who don’t
talk and do
Some are seen, others heard
In c-unit and word
But the difference is sometimes
just YOU!