Pusan Web Coverage of Pusan Kotesol Conference 1999

HOGS and LOGS:
Dealing with Different 'Levels' of Output

David Kellogg
Pusan National University

ABSTRACT
Our research team of middle school teachers at a suburban middle school in Korea’s third largest city was asked to participate in a “level based teaching” experiment to which we had grave ethical and methodological objections. We decided instead to leave the classes undivided, and develop materials which would allow learners working in pairs a choice: one set of “low output” exercises that had explicit models for imitation or very narrow variation, and another “high output” set which opened up into more extensive controlled practice. We then asked if we could account for the choices that learners made in terms of their “ability,” measured by tests or by word counts and c-unit counts in a pairwork task. The answer was, mostly, no. For the purposes of placement and materials design, high and low output levels are probably better conceived as levels of activity than of “ability” -- properties of situations and materials rather than learners.

PRESENTER BIOGRAPHY
David Kellogg taught at university level for a decade in China. He then worked in the EPIK program for nearly two years at Kwancheon Middle School in Taegu. He now teaches at Pusan National University of Education.

MATERIALS

1. INTRODUCTION: MIXED ABILITY OR LEVEL BASED CLASSES?
Two poems, to begin with.

There is a classroom in Pusan
Where some kids can't talk and some can
If you can't, you're a LOG
If you can, you're a HOG
And the High Output Generators always try to generate the highest level of output with as many words and as complex syntax and as rich a vocabulary as they possibly can.

There is a classroom in Taegu
Where some kids don't talk and some do
If you do, you're a ….
If you don't, you're a …
HOGs say "How are you?", LOGS say.…

Part of the fun of these poems is the sound, and part is the story, but the main fun is an attempt to REPRODUCE the HOGs and LOGs and not simply describe them.
Accordingly, part of this research is a story. I’m going to tell the story of how we resisted an attempt to divide classes in half and label the bottom half under-achievers, and how we produced a book of ripping good yarns to supplement a very boring middle school textbook.
Part of this research is data: actual word output, test scores, and the relationship between them. I’m also going to present data on whether or not high output and high test scores predict the tendency of kids to tackle hard exercises, and conversely.
But the main thing I really want to do is to try to reproduce the research—so I’m going to have you look at the hypotheses—consider whether they are plausible or not. Then look at the data—match the data to the hypotheses. And finally we’ll draw some conclusions. In this way, I hope to not simply describe our research but recreate some of the excitement.

Every class, Prodromou remarks, is a mixed ability class (Prodromou 1992: 7). But of course some classes are more equal than others. For that matter, some abilities are more equal than others.

What is to be done? Well, according to the idea of “level based teaching”, we unmix the abilities, and prepare separate materials for each group. My argument is that this cannot be done and should not be attempted at least as far as output is concerned. My argument is that output is better thought of as a level of activity, rather than a level of ability.

But let me tell the story from the beginning. Like so many things in Korea, I think it really goes back to the Cold War. In 1960, partially in response to the Soviet Sputnik/Vostok missions in space, the US was involved in a special science education project. This meant setting up what Buell (1960) called “homogenous groupings” in secondary schools, with “fast tracks” for good students.
The idea spread to Korea. Ever since, budgets have been made available to pursue experimental research in “level-based teaching”. Typically, classes are split into same-sized groupings on the basis of test scores, and the “lower ability” class is given “easy” material, while the “upper ability” class receives a richer diet. The test results of the two classes are then compared with undivided "mixed ability" control groups.
It is easy to see why this kind of research (which our team dubbed “classroom horse-racing”) yielded results we might call unequivocally ambiguous. No attempt was made to control for variables such as teacher experience, teacher enthusiasm, learner attendance, attendance of extra-curricular hakwon cram schools, or previous English study.
Beretta and other writers have pointed out the procedural and underlying methodological fallacies of this kind of quasi-experimental comparative methodology (Alderson & Beretta, 1992).
For our study, which was to focus on output, there were problems with the way this sort of classroom sweepstakes operationalized ability as well. The key dependent variable was viewed as nothing more than mid-term and final examination scores, from in-house tests of no particular external validity or consistency (they are written by school staff on a rota basis). We strongly suspected that this construct of ability--backwashed from discrete point multiple choice examinations--would not accurately predict actual oral output. Even if it did, there was no reason to expect the kind of growth that was expected of us by the ministry. As Nunan has pointed out (Nunan 1998), real proficiency is not so susceptible to linear growth.
We went into the project fired with a reluctance that was not only procedural and conceptual, but also ethical. We did not believe that ability was destiny; we did not even believe it was real ability. We did believe, however, that “low ability” could be a self-fulfilling prophecy.

PRELIMINARY STUDY: MAKING A CASE AGAINST CLASS DIVISION

To make our case, we collected test scores from first and second semesters of two English classes of 48 kids each, and also set them a simple, chatty pairwork task, which was videotaped and transcribed to measure their output in words and correct c-units. The kids were requested to ask and answer three questions which they were often exposed to in pre-class or the beginning chat-up of a class, to wit:

How are you?
Tell me about your family.
What did you do yesterday?

Questions were not included in the word or c-unit count because they were written out for the students. We disregarded repetitions in the word count, but we included any word token, appropriate or not. The definition of c-unit was any utterance which could, using normal rules of elision and reference, be reconstructed into a grammatical clause in context. C-units had to be appropriate and grammatically well formed in order to be counted.
Note that at no time did we attempt to manipulate the pairings in any way--we simply used the pairs which naturally occurred in the seating arrangement in class (generally based on the height of learners, though of course this variable does not remain stable for long).
We had two reasons for our “hands-off” policy towards the pairs. First of all, we sought to make our research as non-invasive as possible, so as to interfere as little as possible in the normal life of the class; in particular, we determined to do nothing that would give the kids grounds for labelling certain learners under-achievers. Secondly, we wished our results and the resulting materials to have external validity, and we recognised that few Korean teachers would have time or inclination to manipulate pairs in their classroom even if they had the means to reliably diagnose individual learners.

So that gave us this data:

DATA: 1) End-of-term and mid-term test scores
2) Direct output measurements from the three-question pairwork task
WORD COUNT: number of words produced (right or wrong, not counting repetitions)
CLAUSES OR C-UNITS: expressions that could be construed as clauses, e.g. “(I’m) fine. (I) thank you. And (how are) you?” = three c-units. These must be correct and appropriate to be counted.
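
The two output measures can be sketched in a few lines of Python. This is only an illustration, not the procedure the team actually used: the function names are hypothetical, "repetition" is assumed here to mean an immediately repeated word (the text does not say exactly how repetitions were identified), and the c-unit judgement is assumed to be made by a human coder, since correctness and appropriacy cannot be decided mechanically.

```python
import re

def count_words(utterance):
    """Count word tokens, right or wrong, skipping immediate repetitions.

    Assumption: a "repetition" is the same word said twice in a row.
    """
    tokens = re.findall(r"[a-z'’]+", utterance.lower())
    count = 0
    previous = None
    for token in tokens:
        if token != previous:
            count += 1
        previous = token
    return count

def count_c_units(coded_units):
    """Count c-units that a human coder marked correct and appropriate.

    coded_units is a list of (text, is_correct) pairs prepared by hand.
    """
    return sum(1 for _text, is_correct in coded_units if is_correct)

# Made-up answer to "How are you?" (the question itself is not counted).
answer = "I I am fine fine thank you"
coded = [("fine", True), ("thank you", True), ("and you", True)]
print(count_words(answer), count_c_units(coded))
```

Keeping the c-unit judgement as hand-coded input makes the division of labour explicit: the script only tallies, and the transcriber decides what counts.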

We examined the data in the light of the following hypotheses.

H1: Test scores are not stable in our learners--they vary with test difficulty (because the tests are written by different teachers with widely varying standards), and with the development (not always linear) of the learners themselves. Therefore the final test in the first semester would not accurately predict mid-term test scores in the second semester, or test scores on a nation-wide standardised test.

Do you think this hypothesis will be proven true?
YES ………… NO ………..

H2: Direct output measurements (that is, word counts and correct c-unit counts) are not predictable from test scores--learners who are good at dealing with input in listening and reading formats may still find output very difficult.

Do you think this hypothesis will be proven true?
YES ………… NO ………..

H3: Output will increase in quality and quantity when it is “pushed” (e.g. by the presence of a teacher in the interview), and this increase will happen whether or not the learner is a “high” or “low” scorer on tests.

Do you think this hypothesis will be proven true?
YES ………… NO ………..

H1 DISPROVEN: TEST SCORES ARE RELATIVELY STABLE

H1 was roundly disproved--we found very high correlations between First and Second Semester test results (.761** p < .01), and between both of these and the national test administered to students on September 9th 1998 (.820** and .837**).
The second semester mid-term, being mostly listening based, shows a weaker relationship with the first semester mid-term, but there is remarkable consistency, demonstrating how narrowly syllabus-based our apparently independently written exams are. In fact, all of these exams were based, as were our subsequent materials, on eight similar state-approved middle school text books sharing a common grammatical syllabus and word list.

H2 PROVEN: TEST SCORES DO NOT RELIABLY PREDICT OUTPUT LEVEL

H2 fared better. We found a significant but really quite weak relationship between our direct output measures and the various test scores. This allowed us to argue that classes divided on the basis of test scores would not reflect differences in output ability.
To see the sort of practical problem we were up against, try dividing the scores on the following scatterplot into two coherent classes, one “upper” and one “lower” in “ability”.
There does not appear to be any natural seam along which the class will split. If we divide, as our predecessors did, at the median test score of 61, we will find both HOGS and LOGS in the resulting “low” class. But if we divide at the median output score, around 10, we find both “high” and “low” test scorers in the resulting class.
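
The splitting problem can be made concrete with a small cross-tabulation. The sketch below is illustrative only: the function name and the four data points are invented, and a learner exactly at the median is assumed to fall in the lower group. If test scores and output sorted learners the same way, the two "mixed" cells would be empty.

```python
from statistics import median

def median_split_table(test_scores, outputs):
    """Cross-tabulate learners by a median split on each measure.

    Assumption: "high" and "HOG" mean strictly above the median.
    """
    test_median = median(test_scores)
    output_median = median(outputs)
    table = {
        ("high test", "HOG"): 0,
        ("high test", "LOG"): 0,
        ("low test", "HOG"): 0,
        ("low test", "LOG"): 0,
    }
    for test, output in zip(test_scores, outputs):
        row = "high test" if test > test_median else "low test"
        col = "HOG" if output > output_median else "LOG"
        table[(row, col)] += 1
    return table

# Invented scores for four learners (not the study's data).
tests = [40, 50, 70, 80]
words = [5, 15, 8, 20]
for cell, n in median_split_table(tests, words).items():
    print(cell, n)
```

Any learners counted in the ("low test", "HOG") or ("high test", "LOG") cells are exactly the ones a test-based class division would misplace with respect to output.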
This is reflected by the Pearson correlation coefficients, which, although highly significant thanks to our large sample, indicate a very weak relationship even using a crude word count (.506** p < .01). When we add the dimension of quality, and demand appropriacy and well-formedness in the c-unit count, the relationship becomes weaker, not stronger (.405** p < .01).
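
For readers who want to repeat this kind of check on their own class records, a Pearson coefficient can be computed with nothing but the standard library. This is a minimal sketch with a hypothetical function name and invented numbers; it does not reproduce our figures, and it does not compute the significance levels quoted above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)

# Invented test scores and word counts for five learners.
test_scores = [45, 52, 61, 70, 88]
word_counts = [12, 6, 10, 9, 15]
print(round(pearson_r(test_scores, word_counts), 3))
```

Squaring the coefficient gives the proportion of shared variance, which is one way to see why a statistically significant correlation can still be a weak basis for dividing a class.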

H3 DISPROVEN: TEACHER PRESENCE PUSHES OUTPUT DOWN, NOT UP

H3, that teacher presence would push output, yielded the most interesting, but also the most ambiguous, results; interesting because counter-intuitive, but ambiguous because our research had been designed to meet a probable administrative demand for class division rather than rigorously control variables. To our surprise, the pairwork interviews yielded significantly more words and also significantly more correct c-units than the teacher-led, individual interviews.
We were a little suspicious of this result and initially attributed it to the fact that the questions in the teacher-led condition were sometimes slightly different, with the teacher sometimes asking questions which, although easier, gave shorter replies, in order to ensure that the LOGS were able to make some kind of response. But we reconsidered this attribution in the light of the results for one class. We were forced to administer the pairwork twice for one class, because during administration of the first pairwork task, there had been a (rather interventionist) teacher present in the room running the video camera (whereas the other class had simply had the camera running, fly-on-the-wall fashion). The questions and indeed the pairs were precisely the same, but the second time around, with no teacher in the room, the kids did substantially better.
Was this the practice effect? Perhaps, but probably not entirely. First of all, the questions were all extremely familiar ones from classwork anyway, and therefore the first administration of the questions should have also had a practice effect. Secondly, there is a notable difference in the way the kids appear on the two videos. With the teacher present, they stare at the table and go through the questions learner by learner, with each learner taking turns to ask three questions in a row, as if the learners were taking turns being a teacher.
Without the teacher present, the learners tend to proceed question by question, with the first learner asking the first question, the second responding and then repeating the question or asking “You?” or “And you?” or even “How about you?” This allows learners to crib from each other’s answers in an obvious way, and produces something much more like real conversation, with the topics rather than the roles of questioner and answerer being shared and alternating. Compare (without the teacher present):

DO LOGS GAIN MORE FROM PAIRWORK THAN HOGS?

So we speculated that learners at the low end of the output scale would especially benefit from the absence of the teacher, because the presence of the teacher inhibits the stronger learner from helping the weaker. Perhaps the HOG considers that it is really the teacher’s job to intervene, or, more likely, the HOG considers this kind of help to his LOG partner a form of cheating on exams and so refrains--so long as the teacher is present!
As you can see, the data is a little hard on our speculation that LOGS would do exceptionally well in closed pairs. Both HOGS and LOGS benefit from the teacher’s absence by producing more words. But when we ask for accurate, appropriate c-units, only the HOG can really do better without the teacher (perhaps because he is taking over the teacher’s role?). At least when one looks at output in its own right, independent of any posited benefit mediated by interaction or input, it is the HOG who truly “pushes” his output in the direction of accuracy as well as fluency.

SUMMING UP OUR CASE

We then summed up our case. We argued that this kind of HOG-LOG interaction should be encouraged and not prevented by erecting a classroom wall between HOGS and LOGS and selecting their material for them. We proposed to see different levels of output generation as different levels of ACTIVITY, not different levels of ability. So we would design materials that separated the levels of ACTIVITY, not the students. Learners, working in pairs, could choose between a) exercises, which gave people fairly rigid models to imitate, and b) exercises, which provided an extended, rather loose model, on which to improvise. We expected that the a) exercises would be too easy for some, and b) exercises too difficult for others, and the pushing and pulling in pairs would be good for both HOGS and LOGS.
This argument was never explicitly approved by the Municipal Ministry of Education or by the principal of our school, but it won strong support (52 out of 56) from a poll of teachers at our seminar, and it was therefore given tacit permission to go ahead.

POST STUDY: DESIGNING AND EVALUATING OUR MATERIALS
The result was a book, and an accompanying CD ROM, entitled “A Cow’s Head and Other Tales” (the title is a reference to a Korean folktale about a lazy LOG who is turned into a cow and learns to work hard). Lesson 10, one of two lessons examined in our post-study, can be found as an appendix to this article (without the graphics, unfortunately).
We evaluated the material using a portfolio of methods, both objective and subjective, both qualitative and quantitative. Once a fortnight, classes were “co-taught”, and the co-teachers evaluated the classes impressionistically. This often led to revision and rewriting of the materials. Classes were occasionally videotaped. This demonstrated to us that, despite our design, relatively little class time was spent in closed pairs, as learners were very unused to this format and breakdowns of discipline were widespread. Learners were polled as to the materials’ “effectiveness” and interest. In general, learners felt that our material was “more interesting” than the usual textbook fare, but not as effective. This is not surprising, since learners tend to see effectiveness through their parents’ eyes, in terms of the examinations, which, as we have seen, are narrowly based on a state-approved textbook, and learners often did not recognise the sentence patterns from the state-approved textbook when we embedded them in “A Cow’s Head and Other Tales.”

POST-STUDY: MORE HYPOTHESES

Since the material was designed to produce ink on paper as well as words in the air, the teacher using the material could see at a glance which exercises were being done, or (as we did on two occasions) collect the material and make precise measurements. We awarded one point for each b) exercise which was at least half complete, regardless of accuracy--what we wanted to look at was the extent to which the pairs were game for the challenge.
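
The scoring rule is simple enough to state as code. In this sketch the function name and the idea of recording each b) exercise as a completed fraction between 0 and 1 are assumptions made for illustration; in practice we judged "at least half complete" by eye from the collected worksheets.

```python
def b_exercise_score(completion_fractions):
    """One point per b) exercise that is at least half complete.

    Accuracy is deliberately ignored: the score measures willingness
    to attempt the harder exercises, not success at them.
    """
    return sum(1 for fraction in completion_fractions if fraction >= 0.5)

# Invented worksheet for one pair: four b) exercises in the lesson.
pair_worksheet = [0.0, 0.5, 0.9, 0.4]
print(b_exercise_score(pair_worksheet))  # prints 2
```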
One thing we wanted to do was to empirically test an old bit of teaching wisdom--that unequal pairs are better for introducing new material but equal pairs are better for practising it. Since the “Let’s Talk” portion of the lesson we collected data for was entirely for “Let’s Practice” (the language points having been presented using the usual text book for more than four days previously), we might expect more b) exercises would be done by homogeneous HOG pairings, and fewer by pairs of mixed ability or homogeneous LOG pairs. So we approached the data with the following hypotheses.

H4: There will be no significant difference between the two classes in the tendency to tackle the b) exercises, as there was no significant difference in “level” of output (or, for that matter, test score).

YES …………… NO ………..

H5: Pairs which consist of HOGS (that is, learners who performed above the median in pairwork output assessment tasks both before and after the project) will show a strong tendency to tackle the b) exercises. LOG pairs will not.

YES …………… NO ………..

H6: HOG-LOG pairings will tackle the b) exercises, but will complete rather fewer of them than unmixed HOG pairs and rather more than unmixed LOG pairs. In other words, the mixed pairings will behave like the sum of their parts.

YES …………… NO ………..

H4 DISPROVEN: CLASS MAKES A DIFFERENCE IN EXERCISE CHOICE

Again, the data was rather cruel to our first hypothesis. The following bar chart shows at a glance that, although Class 2-10 was slightly better at producing correct clauses in the initial pairwork task and there were no significant differences in word counts or test scores, they are absolutely unwilling as a class to tackle the b) exercises. Class 2-11, on the other hand, is far more game for the difficult exercises.
How can this spectacular difference be explained, since the two classes are so very similar in other respects (we include the c-unit count to demonstrate this, but the other measures show similar comparability)? Classes 2-10 and 2-11 were both co-taught by a Korean teacher and a Native Speaker for Lessons 9 and 10. The Native Speaker was the same for both classes. The Korean teachers were different, although both were young women with roughly the same amount of teaching experience and the same training. The Korean teacher for 2-10, however, was noticeably passive in class and spoke very little, limiting herself to supporting the Native Speaker with translations; the teacher for 2-11 was aggressively interventionist, often ignoring the Native Speaker entirely and driving the material forward in Korean only, reserving English for the pairwork itself. We drew two conclusions for our materials. Firstly, the NON-native speaker teacher has the critical role, although it is an indirect one, in managing closed pairwork. Secondly, output is never stable; it is constantly influenced by other variables. HOGS and LOGS were clearly not separate species which could be safely removed to separate habitats without altering their behaviour. In fact, they were often not even separate individuals.

H5 AND H6: A PROBLEM OF LEAKY CATEGORIES

The fact that the groups of indubitable HOGS (who scored above the median on both output measures) and LOGS were small meant that the number of pairs we could examine was even smaller: five double-HOG pairs, six double-LOG pairings, and only three of the crucial, most interesting, heterogeneous HOG-LOG teams. Another reason for the small number of heterogeneous groupings may have been the very effect we were counting on--that high output attracts high output from partners regardless of their level, and low output creates low output in partners, regardless of their “level”--thus creating more HH and LL than HL. Of course, had we known during the data gathering who the HOGS and LOGs were, we could have deliberately manipulated the pairings to increase the number of heterogeneous groupings, but that too might have altered output by making people work together who normally wouldn’t have done so. In retrospect, the non-invasive “double blind” approach--with the identity of the HOGS and LOGS unknown to all parties--was probably best.
The bar graph shows the average number of b) exercises attempted per pair each lesson, but remember that the imposing middle bars represent a mean of only three heterogeneous pairs.
So the double HOG pairs tackle, on the average, more of the difficult exercises than the double LOGS, but this result can hardly surprise. Rather more odd is the fact that the HOG-LOG groupings behave, on the whole, rather better than the sum of their parts, or at any rate much more like double HOGs than like double LOGS. But perhaps this is a freakish result caused by the small size of the HOG-LOG sample, which comprises, remember, only six learners in three pairings.
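
The bookkeeping behind these pair types can be sketched as follows. The function names and sample numbers are invented for illustration. A HOG is taken, as in H5, to be a learner above the median on both output measurements; a LOG is assumed here to be at or below the median on both; anyone else is left unlabelled, and any pair containing an unlabelled learner is dropped.

```python
from statistics import mean

def label_learner(first, second, first_median, second_median):
    """Return "H", "L", or None for a learner measured twice."""
    if first > first_median and second > second_median:
        return "H"
    if first <= first_median and second <= second_median:
        return "L"
    return None  # not consistently HOG or LOG

def pair_type(label_a, label_b):
    """Return "HH", "HL", "LL", or None if either learner is unlabelled."""
    if label_a is None or label_b is None:
        return None
    return "".join(sorted((label_a, label_b)))

def mean_b_by_pair_type(pairs):
    """pairs is a list of (label_a, label_b, b_score) tuples."""
    groups = {}
    for label_a, label_b, score in pairs:
        kind = pair_type(label_a, label_b)
        if kind is not None:
            groups.setdefault(kind, []).append(score)
    return {kind: mean(scores) for kind, scores in groups.items()}

# Invented pairs: two HH, one HL, and one pair with an unlabelled learner.
sample = [("H", "H", 4), ("H", "H", 2), ("L", "H", 3), ("L", None, 5)]
print(mean_b_by_pair_type(sample))
```

Dropping the unlabelled learners is what shrinks the sample so sharply: every inconsistent learner removes a whole pair from the comparison.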

Using all the data lumped together, irrespective of pairings, we find that the tendency to do the b) exercises is not closely predicted by any of our measures of “ability”: test scores or output measurements (see correlations in the appendix). In the same chart, you will note that the first output measurement correlates rather weakly with the second measurement, at only about .584** (significant at p < .01) for the word counts and only .236* (significant at p < .05) for the correct c-unit count. This is further evidence that the HOG is a frail creature, indeed, a creature of circumstance, and a product of an environment which is easily manipulated, for worse and for better, by the materials designer and the teacher.

EPILOGUE: A SIGNIFICANT BUT NOT MEANINGFUL DIFFERENCE

This was not the sort of conclusion expected by our funders or even by ourselves--they wanted to see, after the six months of our project had elapsed, a significant increase in test scores, or failing that, in our output measures. In the event, no increase in test scores took place, though there was a significant decrease in one class (2-10). In output, there was a small (but greater than the Standard Error of the Mean) increase in word output in one class only (2-11) in the six months from July 1998 to November 1998.
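
The comparison with the Standard Error of the Mean can be sketched like this. The text does not say which measurement's standard error was used, so the sketch assumes the larger of the two (before and after); the function names and the numbers are invented, and this is a rough yardstick rather than a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev

def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def increase_exceeds_sem(before, after):
    """True if the gain in the mean is larger than either standard error.

    Assumption: the gain is compared with the larger of the two SEMs.
    """
    gain = mean(after) - mean(before)
    return gain > max(sem(before), sem(after))

# Invented word counts for one small group, measured twice.
first_measurement = [8, 10, 12, 9, 11]
second_measurement = [10, 12, 13, 11, 14]
print(increase_exceeds_sem(first_measurement, second_measurement))
```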

This lacklustre result has many mitigating circumstances: the long summer holiday, the crowded (nearly fifty student) classes, of which only one forty-five minute session every fortnight could be devoted to our material, and the difficulty of securing the cooperation of the teachers of these two classes, who felt that the whole project simply added an extra burden to their already over-loaded schedule.
But the main reason we do not take it to heart is simply that we understand that there is a tremendous difference between traditional measures of “ability”, based on tests, and our output measurements, subject to all kinds of performance variations. Even if we had accepted that our “output measurements” reflected language proficiency in a direct way, we, like Nunan, would not expect proficiency to increase in a linear way.

CONCLUSION: HOGDOM/LOGDOM IS A STATE OF MIND

At first glance, our results appear to be even more unequivocally ambiguous than those of the “classroom horse-racing” studies which preceded us. We found that not only do our output measures correlate rather poorly with test scores, they are not very stable in themselves; a third of the children were not consistently HOG or LOG over six months, and our c-unit count had a reliability coefficient of only .236 (p < .05: see appendix two for other counts). When we look at the choices that pairs of learners made, the only firm conclusion which we can draw is the rather banal one that pairs of two “High Output Generators” did more of our b) exercises than pairs made of two “Low Output Generators”. We are not able to make any accurate predictions about what Mixed HOG-LOG groups do.
Uncontrolled factors are clearly rampant. Among these factors are probably the crucial decisions which teachers and materials designers make, often in blithe disregard of Second Language Acquisition theory: teacher-centredness, use of mother tongue, choral work, and the motivational power of a ripping good yarn. Thus the huge gap--not a contradiction, but nevertheless a gap--between what SLA theory says may be good for us and what teachers and materials designers know will work necessarily remains. Our research and even the book we developed are really a perverse attempt to build a bridge along that gap (on the what-will-work bank of the abyss) rather than across it.
Research has a tendency to shake the constructs that it is based on. This can be a very useful surveying or even ground-clearing exercise, but it is rather troublesome to those who have already erected substantial assets on those constructs. The data was not kind to the hypotheses we formulated, and still less to the categories of HOG and LOG we posited (fortunately, we posited these post-facto and did not use them to form classes). But we emerge from it with our faith in closed pairwork, even in crowded Korean classrooms, intact, and we find the materials we rather prematurely erected upon it still standing. Our classrooms do not in fact chiefly consist of pure-bred HOGS or unmitigated LOGS; instead, they are a Gordian knot of learners with many abilities and many levels of ability not easily susceptible to disentanglement, whether by researchers or by sword-wielding administrators. Then let Prodromou’s remark that every classroom is a mixed ability classroom stand, not simply as a statement of what is, but as a statement of what should be….

To conclude:

In classrooms from Seoul to Cheju
You find learners who don’t talk and do
Some are seen, others heard
In c-unit and word
But the difference is sometimes just YOU!