Normalizing Grades Across TA Sections

by Julianne

’Tis the season to file your grades, and in that holiday spirit, I present to you my tried-n-true method for normalizing scores for different TA’s in a large class.

The problem is this: When you teach a 300-person class, you typically run it with a single lecturer and multiple TA’s handling sections. The students do labs and problem sets, which are graded by the TA’s. However, not all TA’s are equally benevolent when it comes to grading, which can lead some sections to have lower scores than they should. On the other hand, not all sections are equally on the ball, so maybe their low scores are exactly what they deserve. So, how do you tell the difference between a TA who graded more harshly than average and a TA who was stuck with somewhat dimwitted students?

The key is to use the exams, which are taken by all the students. Presumably, a student who does well on the exam is sharp enough to have done well on their problem sets and labs. Thus, if that student has a lower than expected section score, then there is a chance that the TA took away more points than the average TA would have. So, the trick is to make a plot of each student’s section grade divided by their exam grade, ranked by exam grade. When you do so, you’ll find a well-defined sequence which goes towards 1 at the high end (i.e., those top-notch students who ace everything). I tend to make the section work easier than the exams, so for me this ratio goes to larger values for lower exam scores, but another instructor who gives tough assignments but puffball exams might find the opposite. There is a scattering below the main sequence, due primarily to students who did not turn in all their assignments. TA’s who readily accept late assignments tend not to have this tail.
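As a rough illustration of the plot described above, here is a minimal Python sketch that computes the section-to-exam ratio for each student and orders the results by exam score. The `Student` record, its field names, and the convention of expressing both scores as fractions between 0 and 1 are assumptions made for this example, not details taken from the post. Students with an exam score of zero are skipped here, since the ratio is undefined for them.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """Hypothetical record; the field names are illustrative assumptions."""
    name: str
    ta: str
    section_score: float  # fraction of section points earned, 0 to 1
    exam_score: float     # fraction of exam points earned, 0 to 1


def section_to_exam_ratios(students):
    """Return (exam_score, ratio, ta) tuples, ordered by exam score."""
    rows = [
        (s.exam_score, s.section_score / s.exam_score, s.ta)
        for s in students
        if s.exam_score > 0  # ratio is undefined for a zero exam score
    ]
    return sorted(rows)


if __name__ == "__main__":
    roster = [
        Student("A", "ta1", 0.9, 0.9),
        Student("B", "ta2", 0.75, 0.5),
        Student("C", "ta1", 0.0, 0.0),
    ]
    for exam, ratio, ta in section_to_exam_ratios(roster):
        print(f"{ta}: exam={exam:.2f} ratio={ratio:.2f}")
```

Plotting the ratio against exam score, with a different marker for each TA, would then give a picture along the lines of the one described here.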

So, if you have a TA who takes off more points than average, you’ll find that all the points for their sections lie below the main trend, particularly at the low-scoring end, where lots of points were taken off. You can see the effect whether the TA’s sections were dimmer or brighter than average. The plot on the left shows the section-to-exam ratio for the class as a whole (open circles) and one particular TA (solid circles), whose points clearly fall below the mean trend:

[Figures: sec_exam_no_tweak.jpg (left), sec_exam_tweak.jpg (right)]

You can then easily scale the points taken off by some factor to correct them back to the typical TA. Note that it’s critical to do this scaling on the points taken off, not the points earned, because you don’t want to penalize the students who got everything correct; presumably they’d get everything right no matter how tough the TA was. The plot on the right shows what happens to that same TA’s scores after rescaling the points taken off to reflect the kinder-and-gentler average TA. As you can see, it jumps right back onto the mean line.
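To make the arithmetic concrete, here is a minimal Python sketch of rescaling the points taken off rather than the points earned. The function name, its parameters, and the clamping to the valid score range are assumptions for illustration; the post does not say how the scale factor is chosen, so it is treated here as an input picked by the instructor (for instance, by eye from the plot).

```python
def rescale_points_lost(earned, possible, factor):
    """Scale the points taken off, not the points earned.

    factor < 1 softens a harsh grader; factor > 1 toughens a lenient one.
    The factor is an assumed input; this sketch does not estimate it.
    """
    if possible <= 0:
        raise ValueError("possible must be positive")
    lost = possible - earned
    adjusted = possible - factor * lost
    # Keep the adjusted score within the valid range of scores.
    return max(0.0, min(float(possible), adjusted))


if __name__ == "__main__":
    # A perfect score is unchanged, whatever the factor.
    print(rescale_points_lost(100, 100, 0.5))
    # A student who lost 40 points now loses only 20.
    print(rescale_points_lost(60, 100, 0.5))
```

Because only the deduction is multiplied, a student with nothing taken off stays at full marks, which is the property the paragraph above asks for.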

Pretty cool, huh? You may now go back to your regularly scheduled grading.

December 14th, 2007 4:44 PM

33 Responses to “Normalizing Grades Across TA Sections”

  1. eric gisse Says:

    Sucks to be the guy parked at (0,0).

  2. mollishka Says:

    Some classes prevent this problem by having TAs rotate which section they grade the problem sets for each week.

  3. Julianne Says:

    Usually, the kid at 0,0 doesn’t give a crap. They were typically so out of it that they couldn’t be bothered to drop the class.

    This quarter I had a kid show up to take both exams. However, he’d never been to section. Or turned anything in. In a class where the exams were only 50% of the grade. What could he have been thinking?!?! (Answer: not much, but I suppose hope springs eternal)

  4. improbable Says:

    I’d be much more likely to normalise down to the harshest TA’s grades than up to the softest. This would help dull the urge to go and strangle half the students for making me read such nonsense… have you not had much grading to do? Or do you have some secret weapon for not losing your temper while doing it?

  5. Julianne Says:

    I just normalize to the average TA. My benevolence or malevolence comes out when translating scores to grades.

    I’m prone less to throttling than to sad head shaking. But, I don’t do much grading at the undergrad level, since I tend to run larger lecture classes with 1-6 TA’s.

  6. David Nataf Says:

    Does anyone else find the post kind of elitist?
    Kids who do less well are not necessarily less “sharp” or more “dimwitted”. There are all sorts of problems that could afflict an undergrad, such as personal problems. Some people also don’t care for classwork and prefer exams. Exams and classwork test different skills, and it seems the statistical scheme listed here could be skewed by a large group of friends of similar mentality taking a course together. Another case would be if one section developed a more genuine culture of copying on assignments (I don’t know what kind of assignments were used here). Group sociology goes a long way. But aside from that, I’m surprised Professor Delcanton wrote what she did, and I know a few people who, if they had seen something similar happen in undergrad, would have filed a complaint.

  7. Brad Holden Says:

    I am struck by the spread. Back in the last century, when I last TAed, I never remember seeing a big dispersion on a TA by TA basis.

    That said, grading was usually done in such broad bins that small offsets would get lost in the final assigned grades.

    The really striking case was a class where the final grade distribution was naturally multi-peaked. The professor just drew lines for A, B, C, etc.

  8. Zeno Says:

    David Nataf: Does anyone else find the post kind of elitist?

    Well, I don’t. It’s simply an account of an elegant solution to the problem of ensuring that all students are treated uniformly despite potential differences in the way they are evaluated by different TAs. Very nice. Undergraduate afflictions such as personal problems are really beside the point: Those are best dealt with individually and Julianne was not discussing how she might handle them.

    As for students who “don’t care” for classwork, perhaps they’re in the wrong class.

    I’m also not too worried about groups of friends who work together. That’s usually a good thing, but if they develop a “genuine culture of copying,” I think anyone who tried to coast on the efforts of others would find that scheme unraveling at exam time — a self-correcting situation.

  9. Jason Dick Says:

    Here in the UC Davis physics department, it seems we always set it up so that one TA gets one problem or one quiz, in order that any normalizing problems are just swept under the rug. At least, that’s been the case for the undergrad sections I’ve graded.

  10. Julianne Says:

    Well, in retrospect “dimwitted” is definitely more for humorous snarkiness than a correct appraisal of how I see some of these students. So, apologies if it comes off as offensive. On the other hand, there is nothing elitist in the statement of fact that some subsets of 30 kids from a 300 person class can perform statistically worse than the class as a whole, while others do better. I have individual TA’s who run two back to back sections where one group is engaged with the material, interactive, and diligent, while the other can’t be bothered to turn work in, or even seem remotely awake when they do bother to show up. It’s a range. The students who perform poorly have a variety of reasons why, and at a large state school, those reasons range from complete immaturity to shouldering tremendous financial and personal burdens. So, how I view the students who perform poorly depends a lot on other factors that I know about them, but that I wouldn’t get into on a quick post about a useful tidbit of instructional methodology.

  11. Julianne Says:

    With regard to the rotation of grading as an alternate solution, what you lose is a TA’s ability to get into the minds of the particular students they’re working with in section. When you always grade the same group of kids’ work, you come up with a better understanding of what they’re getting and not getting. This varies from section to section, as different TA’s tend to shore up different student misapprehensions, so while one section might be weak on the HR diagram, another might have issues with the Doppler shift.

  12. stand Says:

    My first thought when I glanced at the graphs was, “what an unusual set of H-R diagrams…where are the Red Giants?” Then I read the title.

  13. Brad Holden Says:

    Re 12.

    The students are all young, no post main-sequence evolution yet. And, in astronomically correct terms, the brighter ones are in the upper left….

  14. Harold Says:

    Hi Julianne,

    I think this is one of the best ideas I’ve seen for normalizing across TA sections. But what if the TA is teaching a lab, and some students find labs more engaging and do better there than on their exams?
    There seems to me something fishy about having their exams influence their section grades. They should be separate – yet I can’t find a cure for normalizing across TA sections.
    The way they do it at my school is that 35% get A’s, 35% get B’s and the rest get the rest of the grades. I don’t like this at all, since I know more students deserve A’s.

  15. Clark Says:

    I propose that we use the terms “K- and M-type” to refer to those students who do not shine as brightly as the others.

  16. thomas Says:

    personally, i don’t think there is a way to fairly grade ppls, because different teachers have different ideas about what they want to reward. obviously if you just wanted to reward knowledge then simple tests would probably suffice. if you wanted to reward creativity, you would give easy tests and hard hw’s, and maybe projects. if you wanted to reward personal growth, you could compare score on a pre-test to score on a final, but then people who just want an A without caring about your class will do the best ’cause they’ll intentionally fail the pretest. if you wanted to reward going to class, you would grade people on whether or not they came to class…

    most teachers want to reward both knowledge and personal growth, so they’re already conflicted. the goal of an introductory science class is both to teach some results (the earth is round, it goes around the sun, a plot of Doppler shift against distance shows that things that are far away are moving away from us…) and to teach some philosophy of science (we build theories to explain facts and make predictions regarding future observations). some students will come to a class already knowing most of the material; perhaps they should have simply taken a test and been given credit for their knowledge, and then have taken a more interesting class? others may have come in not knowing anything but after studying carefully they could get the material, perhaps they deserve better grades than the kids who kinda know stuff from their engineering major roommate talking and are pretty sharp sometimes, but otherwise play halo on their xboxes all day? but talking about student archetypes isn’t really a good idea because the boundary conditions are so complicated that it’s hard to come up with all possible solutions that your arbitrary student will be a linear combination of.

    ultimately, grading is a hard question, and while grading to optimize the number of students who end up joining your field may be a scientific question, most universities just want you to teach a completely random collection of students some tidbits from your field, which may conflict with your desire to both challenge and give an A to students who seem like they want to join your field (whence you might get profs like the dude who came up to me and shook my hand when i was hanging out with some physics majors discussing physics and physics homework in the physics lounge at my school… but when i identified as a math major told me ‘i dont tlk 2 math majors’).

    i guess whats most interesting about the grading question is that almost every science teacher has a strong opinion on how&why. i wonder if they evaluate the grades that past teachers have given their new students based on their own criteria on how&why. when all you see on someone’s gradesheet is: B B+ A C, and it’s not annotated by ‘i think students should be graded on %criterion%’, how useful is that? what happens when you get a student who got a D because his professor refused to grade his homework when he answered the questions correctly, but found the answers using a technique that wasn’t gone over in class or in the book? also, higher level classes are often graded much easier than lower level classes, but not always, so you can’t count on that factor.

    so. grading is hard, and the grades are then fairly meaningless. As a student, i just go to school just to try to learn something, because it’s easier to find things out when you have a teacher there to tell you stuff or ask questions of than just by reading books + wikipedia, and i don’t even care about my grades. another extreme position is that of some of my friends who are there to qualify for some other school and need A’s but don’t give a damn about what they’re studying.

    and i’m going to be pretty depressed when some bureaucrat looks at someone else’s non-meaningful qualifications and selects them over me. because of the well-known cognitive bias favoring meaningless, easy to access information, whence we get processors designed for marketing like the pentium 4 and seagate falling in line with the rest of the industry in falsifying SMART checks to make their drives look good. or people buy sprite instead of their grocery store’s lemon-lime soda even because sprite is twice as expensive, and more expensive obviously means better; besides coke is a trusted name in soft drinks (if carl went to harvard, and carlos went to umass boston, obviously carl is the better student). and because of the need for bureaucrats to cover their asses and avoid doing things they may need to justify in the future (nobody ever got fired for buying ibm…).

    so: grading is hard. grades are pretty meaningless and may even be harmful, unless you were wanting to reward grade-grubbers at the expense of serious students. maybe all classes should just be pass/fail on the grounds that either you learned enough to pass or you didn’t?

    ‘i am spartacus’ capitalization to let everybody know im an undergrad with a learning disability. in my case it’s asperger’s syndrome, and describing what that means is pretty tangential to this discussion of grades. i just want people, when they say that it isn’t and can’t be their goal to grade everyone perfectly, to know who they may be allowing to slip through the cracks. i understand its hard to give people the grades they deserve, but it’s also necessary in a way that i think that many professors aren’t really comfortable with assuming they even think about it.

  17. Julianne Says:

    Harold — I think the engagement in labs vs test scores is what produces the scatter in the relationship, so as long as that’s not systematic between sections, it should be fine.

    And I too hate the fixed-percentage curve. I work it by giving them a “minimum grade contract”. If they reach some percentage threshold, I guarantee them a given grade or better.

  18. Ian Paul Freeley Says:

    David,
    Only the most dimwitted students would complain about normalizing section grades.

  19. thomas Says:

    >>18
    “Only the most dimwitted students would complain about normalizing section grades.”

    Actually, the students who end up complaining about grading styles aren’t the most “dimwitted”, but the most opportunistic.

    The ones you thought were “dimwitted” might be attempted opportunists who just can’t hack it. Or maybe they really wanted to learn but don’t have any talent. Maybe they’re in the wrong place for the wrong reason. Ultimately, you can’t save them, all you can do is try to teach them something and grade them based on whether or not they learned something or know the material… or whatever other criterion you’ve chosen. At any rate, they’re probably accepting whatever grade you give them, and going to their quiet place to sulk, and/or find something else to be interested in than your class.

    It has been my experience that the students who talk about grading the most are the ones who get good grades for no reason. Getting a better than deserved grade is almost as bad for a person as getting a worse than deserved grade.

  20. Julianne Says:

    Thomas — I definitely sympathize with your concerns. We all have to accept that there will never be a consistent uniform standard for grading. So, you have to use grades as only a part of evaluating the potential of a student. Those of us who serve on admission committees and fellowship committees do this all the time, judging patterns of grades (start weak, finish strong?), letters of recommendations, actual accomplishments, interviews, and personal statements. The grades fit into an overall picture of a student, but are not the only means of judging.

  21. david nataf Says:

    Ian Paul Freeley,
    I never complained much about my grades in undergrad, once it came in, it came in, and then that story was over.
    I was just objecting to what I perceived as an assumption of a strong correlation between GPA and intelligence. There’s definitely a correlation… I was just there… but there’s a lot of scatter about the mean. Among the things that mattered a lot, for example, was a person’s exam schedule. Sometimes exams were well-placed, sometimes not. I remember when I had challenging, 3 hour exams, I’d be too tired for anything else that day. Fortunately, I never had 2 exams in one day. And I hope if that had happened, the professor would not have thought less of me.
    The point I raised was addressed properly by the original poster, though some of the responses disturb me.

  22. Orin Says:

    Is there any observed correlation between student performance and the day/time of the TA section? Perhaps better students choose early classes while the worse sleep in, or it may be the other way around — students who choose the early classes got last choice because they waited too long to pick classes. Either way, there’s got to be some sort of correlation there…

  23. Robert Says:

    Renormalising scores is definitely a must but I think any fixed algorithm can at best be a decent hack. I think the reason all these clever schemes eventually fail is that scores, and grades even more so, only have the structure of an ordered set (let’s hope at least that); any further structure (affine, linear or full blown field) is typically artificial and introduced because grades are often written in terms of numbers, as these tend to be people’s favourite examples of ordered sets.

    I strongly believe in the distinction along the lines of “excellent”, “very good”, “fair”, “just so” and “fail”. And everybody can tell an excellent student when one sees one. A “good” student is obviously better than a “just so” student. But that’s about what you can say about grades. Thus I think ideally the situation is as described in comment #7, each section has multi-peaked distributions and that tells you which grades the students get.

    Already arithmetic averages of grades/scores are dodgy, it’s not clear that “very good” is really the middle of “excellent” and “fair”. I have heard people argue for example that one should compute geometric or harmonic averages but not very convincingly. Arithmetic averages require an affine structure, the others something else, and usually this is not given. Julianne above argues that one should renormalise the points deducted rather than the ones given. This is already an appreciation of this fact. But this is only good for good students. If somebody writes only very little on a problem you are typically giving points for what you find and do not deduct points for what’s missing.

    Last week on the news I heard a report that some grades (was some business news, some customer satisfaction measure) had improved since last year: In Germany grades are usually 1 to 6 with 1 being very good, 2 good, 3 satisfactory, 4 sufficient, 5 not sufficient and 6 complete fail. The report said that the average grade had improved from 2.5 to 2.0. OK, they computed averages, whatever those mean. But then some spokesperson said that the improvement 2.5 to 2.0 was a “20% improvement”. I couldn’t believe my ears. How math illiterate can you be? What is 20% of a grade supposed to mean???

    I fear that multiplying points (deducted or given) by a renormalisation constant is not much better. It only works in a small window of grades/points.

    Ideally, your sections are large enough that you have sufficient statistics and you do a histogram in each section, identify the peaks and give grades accordingly.

  24. Ian Paul Freeley Says:

    Orin,
    I was always surprised there wasn’t much correlation between time of day and grade. I suspect that most actual learning happens outside the classroom, so when they go to class has a minimal impact on how well they do in the class. Where it does show up is in the student evaluations, they really hate anyone who tries to talk to them in the morning.

  25. thomas Says:

    >23

    Robert, do you believe in those five categories because we recognize five letter grades? Maybe there can be seven categories, like in Newton’s partition of the rainbow. Or perhaps only three, like in the Norse partition. That’s how Professor Jackson grades his advanced math classes: you were active in the class and knew your stuff, and get an A, or you weren’t all that active, were out of it for long periods, and didn’t seem like you knew too much, so you get a B, or you gave up halfway through, don’t know the material and get an F.

    You are of course completely right about arithmetic means assuming affine structure. Naturally, assuming each of the assignments you wrote is of equal difficulty to a student who was first exposed to the material in your class, using the arithmetic mean to determine their score is the logical way to reward effort that leads to understanding. On the assumption that you get about the same kinds of students in each of your sections, normalizing sections is probably a good idea. Throwing away the outliers by using the median score to normalize sections is probably even better.

    One thing I’m always interested in whenever normalizing grades comes up is, how do you avoid punishing the kids who take time out of their schedules to help other students? A simple normalization method (fitting scores to some kind of curve) would lower their scores by as much as it raises the scores of the people they help- making schooling pointlessly competitive. Students should be encouraged to present material to each other, because teaching is just about the best way to learn.

  26. ts Says:

    A very experienced instructor that I worked for a couple times as a TA kept telling me that he has never seen much variation in grading across different TAs. From what I see, I know the attitudes toward grading among TAs do vary significantly (some try to reflect giving constructive feedback in the way they “penalize,” while others just make up a scheme that allows them to grade at a maximal efficiency, etc.), so I cannot see how the variation is not significant. Yet I mostly had to agree when the instructor told me such a variation would be so small that students can just overcome by studying only a bit harder than the typical undergrads of nowadays.

  27. Julianne Says:

    ts — In a class with 5-6 TA’s, I typically only have to adjust 1, or 0. So operationally, I don’t see a lot of variation. There’s probably more variation when the labs or problem sets are more open ended and free form.

  28. TomR Says:

    A fun solution to an old problem. And, as I remember from my TA days, playing with grade statistics is a lot more fun than the actual grading!

    Hmmmm…so, you want to measure the luminosity of 300 objects, and you have uncalibrated visible light measurements made by a bunch of independent amateur (but skilled) astronomers. You then measure UV brightness for all the objects, and use that to calibrate the visible measurements. Legit? Seems so, unless there’s some systematic variance in the visible/UV ratio.

    By analogy, this system rewards students with low homework/test ratios if there’s any correlation between that and section assignment. Would you think of that as a second order effect? One could make the argument the homework/test ratio measures ‘responsibility,’ which is as much a parameter of individual students as ‘intelligence.’

    Interestingly, this should let you test hypotheses about one section being brighter than others…and finally answer that long-standing question of whether the brighter (or more responsible) students take the morning sections.

    Robert–have you ever seen those multiple peaks in practice? When I TA’ed classes (in economics), I’d always predict that grades would have bimodal distributions, and I was always wrong!

  29. Belizean Says:

    Julianne,

    When I taught my 300-student course last spring, we solved the TA problem by having a single TA teach all 6 sections (all in one day). We had graders to help him. No partial credit, so no ambiguity in grading.

    Where we had a problem, that I never solved, is with our quizzes. The only way not to give the students in later sections an advantage was to make 6 different versions of each quiz, one for each section. This was a huge pain and in hindsight a dumb idea.

    After wasting so much energy on making 6 equivalent yet different versions of the same quiz each week, I was totally disinclined to run stats on the class to check for inequities due, for example, to the 1:00 pm section having 5 more hours to study than the 8:00 am section. Or for any cheating resulting from info about the quiz topics leaking from the early sections to the later ones.

    You’re a saint in my view for taking the trouble to check for inequities and to adjust grades accordingly. Me? I’m way too harried by other responsibilities and just too lazy to give a damn.

  30. Robert Says:

    No, I do not expect five different categories (here in Germany we use a grade system with six grades with two fail grades), I just meant you can usually tell if one student is much better than the other or if they are of roughly equal ability. Then I would make cuts such that there are as few borderline cases as possible.

    All the courses I have taught or graded so far had at most 30 students, not enough statistics to see a clear bi- or even multi-modal distribution. But I have seen excellent students who still stick out clearly from the good ones. And I have seen students who should have definitely revised their choice of subject…

  31. Michael Says:

    Julianne, you say that for a 5-6 TA course, 1 or 0 sections are normalized. I’m curious how many students have grades changed due to normalization? Even if it’s a small (

  32. Michael Says:

    oops:


    Even if it’s a small (less than 0.5), it seems to me that it may be worth it anyway. This is because there is little to lose by taking some attempt to normalize. On the other hand, if you did not bother, a student might complain and have a point.

  33. thomas Says:

    > This is because there is little to lose by taking some attempt to normalize.
    unless the students in one section were actually better students

    > On the other hand, if you did not bother, a student might complain and have
    > a point.
    cover your ass, then. Why should you let random students decide how you grade things? If anyone, teachers need to be independent thinkers.