Evaluating students: the halo effect

by Chris Bertram on March 28, 2012

In the thread on community colleges (which morphed into a discussion of more general education and management issues), someone mentioned Kahneman on the “halo effect” in grading (or marking) student work. _Thinking, Fast and Slow_ has been on my to-read pile since Christmas, but I got it down from the shelf to read the relevant pages. Kahneman:

bq. Early in my career as a professor, I graded students’ essay exams in the conventional way. I would pick up one test booklet at a time and read all the students’ essays in immediate succession, grading them as I went. I would then compute the total and go on to the next student. I eventually noticed that my evaluations of the essays in each booklet were strikingly homogeneous. I began to suspect that my grading exhibited a halo effect, and that the first question I scored had a disproportionate effect on the overall grade. The mechanism was simple: if I had given a high score to the first essay, I gave the student the benefit of the doubt whenever I encountered a vague or ambiguous statement later on. This seemed reasonable … I had told the students that the two essays had equal weight, but that was not true: the first one had a much greater impact on the final grade than the second. This was unacceptable. (p. 83)

Kahneman then switched to reading all the different students’ answers to each question. This often left him feeling uncomfortable: his confidence in his judgement was undermined when he later discovered that his responses to the same student’s work were all over the place. Nevertheless, he is convinced that his new procedure, which, as he puts it, “decorrelates error”, is superior.

I’m sure he’s right about that and that his revised procedure is better: I intend to adopt it. Some off-the-cuff thoughts though: (1) I imagine some halo effect persists and that one’s judgement of an immediately subsequent answer to the same question in consecutive booklets or scripts is influenced by the preceding one; (2) reading answers to the same question over and over again can be even more tedious than marking usually is. I think it would be even better to switch at random through the piles; (3) (and this may get covered in the book) the fact that sequence matters because of halo effects strikes me as a big problem for Bayesians. What your beliefs about something end up being can just be the result of the sequence in which you encounter the evidence. If right (and it’s not my department) then that ought to be a major strike against Bayesianism.

{ 83 comments }

1 Jamie 03.28.12 at 4:32 pm: Chris, Bayesians do not (typically) believe that everybody always *does* update by conditionalizing. They only believe that it’s *rational* to update by conditionalizing.

A huge part of Kahneman’s book (and a huge part of his life) is devoted to showing how far ordinary people depart from rational apportionment of credence, so if you are going to count each of these as a major strike against Bayesianism, then Bayesianism is going to feel like me standing up to bat against Pedro Martinez in 2000.
2 dsquared 03.28.12 at 4:34 pm: What your beliefs about something end up being can just be the result of the sequence in which you encounter the evidence. If right (and it’s not my department) then that ought to be a major strike against Bayesianism.

I’d guess that a true believer Bayesian would say that it’s a major strike against the ability of human beings to reason, and that everyone should be more Bayesian?
3 ajay 03.28.12 at 4:43 pm: I imagine some halo effect persists and that one’s judgement of an immediately subsequent answer to the same question in consecutive booklets or script is influenced by the preceding one

Ah, but which way? If I read Niamh’s answer to question 1, and it’s really good, am I going to overrate or underrate dsquared’s answer to the same question when I read it immediately afterwards? Because both sound plausible…

I think it would be even better to switch at random through the piles

Yes indeed.
4 Chris Bertram 03.28.12 at 4:48 pm: Thanks Jamie and dsquared. As I say, not my department, but you are quite right to point to the distinction between the description of (defective) cognitive processes and the normative claims made by Bayesians. So probably not helpful of me to have put that third point in.

Still IF it is the case that the sequence in which an ideal Bayesian reasoner encountered the evidence would have an effect on that reasoner’s final beliefs (having gone through the evidence) then that would seem (to me) to be a problem (an ideally Bayesian Sherlock Holmes and his doppelganger encountering clues in a different order and coming to different conclusions from one another). But maybe that isn’t a possibility. I’d be interested to know.
5 Cosma Shalizi 03.28.12 at 4:55 pm: What Jamie and dsquared said.

For that matter, a certain amount of a halo effect is in fact available to the Bayesian. Suppose that the quality of the essays should be independent, given the knowledge and skill of the student. Each essay read then (on average) makes the Bayesian’s distribution for that knowledge and skill more concentrated. Pretty generally, the width of the credible interval shrinks like 1/sqrt(n), so the first essay does a lot more to narrow it down than the second, the second than the third, etc. (Said differently, the information gained about the student is on the order of (log n)/2 bits.)
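The diminishing returns described here can be sketched numerically. This is only an illustration under assumptions that are not in the comment: a conjugate normal model in which the student's ability has a normal prior and each essay score is an independent noisy measurement of that ability. The function name and the default standard deviations are hypothetical.

```python
import math

def posterior_sd(n_essays, prior_sd=10.0, noise_sd=10.0):
    """Posterior standard deviation of a student's ability after n essays.

    Assumed model (not from the thread): ability ~ N(mu0, prior_sd^2) and
    each essay score ~ N(ability, noise_sd^2), independent given ability.
    Precisions (inverse variances) add under this model.
    """
    precision = 1.0 / prior_sd**2 + n_essays / noise_sd**2
    return math.sqrt(1.0 / precision)

# How much each successive essay narrows the marker's uncertainty.
previous = posterior_sd(0)
for n in range(1, 6):
    current = posterior_sd(n)
    print(f"essay {n}: posterior sd {current:.2f}, narrowed by {previous - current:.2f}")
    previous = current
```

Under these assumptions the first essay narrows the interval more than the second, the second more than the third, and so on, which is the legitimate part of a "halo" that a Bayesian marker would show.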

If the essays are not supposed to be conditionally independent, then of course the order could matter. (E.g., the first essay is written while the student is fresh and energetic, so it gives a more precise measurement of their knowledge, but the harder they worked on that one, the more tired they are for the others, leading to noisier measurements.)

(This should not be taken as an endorsement of Bayesianism.)
6 Billikin 03.28.12 at 4:56 pm: “(3) (and this may get covered in the book) the fact that sequence matters because of halo effects strikes me as a big problem for Bayesians. What your beliefs about something end up being can just be the result of the sequence in which you encounter the evidence.”

Then you are not a Bayesian. :)

However, you are human. Beliefs tend to persist for too long in the face of new evidence, and old ways of thinking can hinder us from even forming better beliefs.
7 otto 03.28.12 at 5:02 pm: I suppose that many students are aware that their essays are likely read and graded in order. In that case, they often painstakingly explain something in essay number 1, and then, if the point is also relevant in essay number 2, do so more cursorily the second time, so that they can get on with new material, because they expect the prof to have just read the previous one.
8 Chris Bertram 03.28.12 at 5:05 pm: otto: Yes that’s true. And is, potentially, a problem for my plan to implement the Kahneman method.
9 Western Dave 03.28.12 at 5:07 pm: Or you could just make a rubric. My approach to blue book essays when I taught college was three reads. An initial sort into top middle bottom. A second read where I put the comments on the essays and settled them into actual grades. A third review from worst to first to see if there were any outliers, grades shifted from pencil mark to pen mark and put in gradebook. This is way too slow for hs teaching and I now use rubrics. These aren’t always x= so many points but could be an “A essay contains: a clear, argumentative thesis that answers the question and provides an overall framework for organizing the answer without listing categories of analysis. Ample evidence drawn from at least three different sources used in the class and from the beginning, middle and end of the course. A shred of original thinking beyond that provided by the professor in the course. Reasonable clarity in sentence structure and style for a blue book exam (while spelling doesn’t count, and I am tolerant of run-ons in a timed writing situation, I am not a mind reader and can only grade what is on the page – if it doesn’t make sense, your grade will suffer).” And so on. While it’s a pain to construct the rubric and distribute it beforehand, it helps you write a better exam and demystifies what it is you expect of students. Rubrics aren’t perfect, and you have to give yourself permission to reward that student who goes off-rubric in surprising (good) ways while punishing students who go off-rubric in surprising (bad) ways.
10 Tom Hurka 03.28.12 at 5:10 pm: I was also struck by that part of the Kahneman, and it certainly fits my experience. I think I’m strongly affected by a student’s first answer on an exam, and will likewise switch to a different order of marking next time. The same problem arises in essay marking if you’ve already formed an impression of a student’s ability, e.g. from seminar contributions, and then grade her essay. I guess the ideal would be to have essays submitted or at least read anonymously. Hard to do, though, when you’ve talked to some students about their essays in advance.

Another part of Kahneman I loved: He describes an experiment where they make up fake proverbs with internal rhymes, e.g. “Woes unite foes” and “A fault confessed is half redressed,” and then make up versions of them without the rhymes, e.g. “Woes unite enemies” and “A fault admitted to is half redressed.” When they ask subjects how insightful the statements are, the rhyming ones are rated as much more insightful. Try that on your English Lit colleagues, or at least those who think literature conveys profound knowledge.
11 Billikin 03.28.12 at 5:13 pm: I think that I can illustrate this pretty simply. Suppose that the probability of a hypothesis, H, given evidence, E, is P(H|E). Then further evidence, F, is learned. That makes a new probability, P(H|E & F). Now let’s reverse the order, starting with evidence, F. Then our original probability is P(H|F). After we learn E we have a new probability, P(H|F & E). But F & E is the same as E & F, so P(H|E & F) = P(H|F & E). The order in which we learn evidence does not matter (in theory).
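The argument can be checked on a toy example. The sketch below assumes a made-up joint distribution over three yes/no propositions (H, E, F); the weights and the helper names `condition` and `prob_h` are hypothetical. It conditions on E and then F, and on F and then E, and compares the resulting probability of H.

```python
from itertools import product

def condition(joint, index, value):
    """Keep only outcomes whose entry at `index` equals `value`, then renormalise."""
    kept = {outcome: p for outcome, p in joint.items() if outcome[index] == value}
    total = sum(kept.values())
    return {outcome: p / total for outcome, p in kept.items()}

def prob_h(joint):
    """Probability that H (position 0 of each outcome) is true."""
    return sum(p for outcome, p in joint.items() if outcome[0])

# Hypothetical joint distribution over (H, E, F); the weights are invented.
weights = [5, 1, 2, 4, 3, 6, 2, 7]
outcomes = list(product([True, False], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

e_then_f = prob_h(condition(condition(joint, 1, True), 2, True))
f_then_e = prob_h(condition(condition(joint, 2, True), 1, True))
print(e_then_f, f_then_e)
```

Both routes end up at P(H|E & F), up to floating-point rounding, because conditioning only ever discards the outcomes that the evidence rules out.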
12 Trey 03.28.12 at 5:20 pm: I use Kahneman’s grading method not to avoid halo effects but to make grading more efficient; grading the same question repeatedly increases the speed at which I can evaluate the response. However, as you note in the original post, it does tend to become tiresome and students whose responses are graded at the tail end of the process tend to get more cursory comments. To counteract this, I randomize the order of students on a per-question basis.
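A per-question shuffle of this kind might look like the following sketch. The function name, the student identifiers and the optional `seed` argument are all hypothetical; the only idea taken from the comment is that every question gets its own independently shuffled order of students.

```python
import random

def grading_schedule(students, questions, seed=None):
    """Return (question, student) pairs: one question at a time,
    with the students reshuffled independently for each question."""
    rng = random.Random(seed)  # pass a seed only to reproduce an order
    schedule = []
    for question in questions:
        order = list(students)   # copy so the caller's list is untouched
        rng.shuffle(order)
        schedule.extend((question, student) for student in order)
    return schedule

for question, student in grading_schedule(["Ann", "Bo", "Cy"], ["Q1", "Q2"]):
    print(question, student)
```

No student is then consistently read first or last, so tiredness and drift in standards are spread across the class instead of landing on the same scripts each time.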
13 bemused 03.28.12 at 5:25 pm: I’d think most exam takers would assume that points written about in question x would not have to be addressed again in x+1. Does random question reading imply that the exam taker will have to adopt the lawyers’ technique of saying something like “the answer provided in question x is included herein as though written out in full”?

Regarding rubrics, my kids’ k-8 school believed in using rubrics as part of the learning experience. I wrote a web application that their teacher in 7-8 grade used to involve a class in a discussion to create the rubric, then make it available on the school website for their class to document the assignment. Obviously it wasn’t a democracy and the teacher controlled what was admitted into the rubric, but the kids did come up with facets for the rubrics that reflected a surprisingly good understanding of what was expected of them, and how they felt the “point spread” should reflect steps of success on meeting each facet. (The format of the rubric allowed for a statement about 1-3-5 point achievement on each facet. The teacher could adjust the weighting.) Kids seldom complained about their grades arrived at through these rubrics.
14 Marc 03.28.12 at 5:51 pm: I have a trial grading scheme. Before I assign grades I read through a sample to see whether I’ve captured the sense of the class for any given question and adjust accordingly. I then grade through each question; I usually divide them into manageable stacks and start the process with different piles for each question. This tends to produce pretty consistent marks.
15 dsquared 03.28.12 at 6:06 pm: I think that order could matter to a Bayesian, or at least that the version of Bayesianism under which order doesn’t matter has the (to me, at least as unattractive) characteristic that a Bayesian can never believe anything to be impossible.[1] Bayes’ Rule is basically multiplicative, and this ensures that order doesn’t matter, but it also ensures that if you start off with non-zero subjective probability about something, you are never going to get zero subjective probability about it.

Conversely, if you have zero subjective probability on something, Bayes’ Rule will mean that you never have non-zero. So if you make some adjustment to Bayesianism which allows there to be an information set that drives your prior to zero, then order will matter (fairly easy proof – if the minimum information set that will drive your prior to zero is M, and there is a subject on which the available information is M and -M, then you will have different beliefs if you get the information in the order M then -M rather than -M then M).

[1] More strictly, can never believe anything to be impossible because of evidence to that effect; he can believe all sorts of things are impossible if he has a dogmatic prior.
16 psycholinguist 03.28.12 at 6:10 pm: I think I take issue with Chris Bertram’s statement here, if I’m reading it correctly as describing the halo effect as a defect in cognition:

” but you are quite right to point to the distinction between the description of (defective) cognitive processes and the normative claims made by Bayesians.”

Why is the halo effect a defective process? I would argue that it isn’t – Assuming an unbiased grading process, I would bet that there is more within-question grade correlation of a student’s graded answers than between-student correlation of grades. As you note, this can be problematic when each point counts, but as a general heuristic for the real world, it has a lot of utility – I can use this to make predictions about future performance, etc. If I was going to be completely honest, I have had an overworked, over-deadline semester or two where I’ve used some subset of essay answers as proxy for the test grade – on a random blind scoring of a sample of the full exam, I found the percentage I assigned based on my subset of answers, and the percentage assigned for the whole test, to be really, really close.
17 psycholinguist 03.28.12 at 6:17 pm: After reading my own comment, here’s a simple solution, and really cuts down on grading – Assign say a 5 question essay exam, and only grade one question per student. Students are informed ahead of time that it is a double-blind procedure, so neither of you know which question will be selected for grading. Generate a random number for each student and select that essay – that eliminates both within-student halo effects, and “same-question contamination” from one student to another. And I get to watch the ballgame that night.
18 Bloix 03.28.12 at 6:18 pm: I have written here and elsewhere before on the scandal that is college grading. This post is yet another example of the lack of professional standards and quality control in the process. It’s especially horrible in light of the fact that in many courses, there are only two exams and one paper, so arbitrary grading will not average out for any given student.

I was a TA for a brief period before I dropped out of grad school, and I found myself grading papers without any instruction in how to do so. I don’t doubt I fell victim to this halo effect. Another effect I noticed was that after I’d graded 20% or so of the exams, I realized that my average was either too high or too low, and I had an inclination either to downgrade or upgrade the next few to bring the average back into line. I tried to fix that by scanning all the exams quickly and sorting them into tentative grade piles. Even so, it was very hard to avoid grading later exams without paying some attention to the distribution of grades I’d already assigned.

Because no one teaches teachers how to grade, there is no doubt that many or most of them are making errors that result in the assignment of arbitrary grades, having to do with where an exam falls in a pile, or which question was assigned first, or any number of other arbitrary considerations.

These could all be fixed with the expenditure of a fair amount of money and effort, but the result would be grading procedures that take more time.

And it really is a scandal. People are paying hundreds of thousands of dollars in order to obtain a scrap of paper showing the grades that professors give them, and what they receive is the result of a process that wouldn’t be acceptable in quality-control terms for a CPA audit of a sandwich shop.
19 Chris Bertram 03.28.12 at 6:33 pm: _This post is yet another example of the lack of professional standards and quality control in the process. _

Whoah there Bloix!

The post doesn’t even address questions of quality control. You’ve barged in here with a set of US-derived assumptions about how things are done.

In the UK (where I teach), exam scripts will often be marked by two people who have to agree. If they can’t, a third marker may arbitrate. (Sometimes this is replaced by a system of moderation where the second-marker looks at a sample of scripts). There’s an external examiner who checks a sample of the department’s marking. In some places (such as the Open University) there are elaborate statistical methods to bring different markers into line. Criteria for marking are also discussed and published (though there’s some scepticism about how good these are). I’m very far from taking the view that the UK system (or systems) are perfect. They’re not. But you’re not entitled to write

“a process that wouldn’t be acceptable in quality-control terms for a CPA audit of a sandwich shop.”

In relation to a post that I, a UK-based academic, write about marking.
20 Nate Roberts 03.28.12 at 6:53 pm: I have always followed Kahneman’s revised method. But as I get near the end of the pile I also cycle back to the first few exams to make sure my grading standards haven’t drifted. Here’s how I make it easy. I don’t use blue books, but have students write their answers on ordinary letter-sized paper. I then scan all their exams into a single pdf using a document feed scanner. As I go through I bookmark each answer to each question, and I arrange my bookmarks so that all the top scoring answers are together in one cluster, the next ranked answers are in another cluster, and so on. Using bookmarks I can instantly toggle back and forth between different students’ answers to the same questions. When I discover discrepancies arising in my grading standard between beginning and end (which I nearly ALWAYS do) I can very easily re-calibrate by moving the bookmarks around. By the end I have a very high degree of confidence in the internal consistency of the grades I award, and I really don’t think my method takes more time. In fact, I suspect it saves a lot of time (though I suppose if I were willing to just go blithely through without ever worrying about inconsistencies, that would save even more time).

The other thing I do, which my students find very useful, is that I select one or two of the best answers to each question, and with the students’ permission I extract their answers (with names trimmed off) and post a composite pdf of all the best answers on the course web site. With real examples of what answers received full credit to refer to, I have never had a grade complaint on a written exam. Again, this may take a little extra time, but I think the time is more than made up for by the time saved in entertaining students’ questions and complaints about their grades. (And I do not indulge in grade inflation, at least not in comparison to prevailing norms at my university, though these are, admittedly, considerably inflated from the days when C+ was the middle of the curve.)
21 Kenny Easwaran 03.28.12 at 6:56 pm: I actually find that it’s more pleasant to do one question at a time – I feel accomplished when I finish one question for all the students, and the individual reads go a little faster as I go on. Somehow the focused repetition feels less repetitive than the longer-period repetition involved in grading each student cyclically.
22 BelgianObserver 03.28.12 at 6:57 pm: I agree with psycholinguist in #16. Performance by a student on 2 essay questions shouldn’t be independent. It’s not a coin-toss. The test is (or should be) measuring ability. So, if someone does better on question 1, why is it a “bias” to expect them to do better on question 2? Seems more like a rational prediction. If I see a sports star sink a basket, I don’t predict he and I are equally likely to make the next shot.
23 otto 03.28.12 at 7:55 pm: “In the UK (where I teach), exam scripts will often be marked by two people who have to agree. If they can’t a third marker may arbitrate. (Sometimes this is replaced by a system of moderation where the second-marker looks at a sample of scripts). There’s an external examiner who checks a sample of the department’s marking. In some places (such as the Open University) there are elaborate statistical methods to bring different markers into line. Criteria for marking are also discussed and published (though there’s some scepticism about how good these are). I’m very far from taking the view that the UK system (or systems) are perfect.”

Slightly off-topic, but having taught in a few different places, I don’t think these systems of double-grading etc have much to be said for them. Usually an enormous timesuck for almost no added value. And, because the work of grading is so much more onerous in double-grading systems, it creates an incentive to ask students for less written work. YMMV, of course.
24 Neville Morley 03.28.12 at 8:01 pm: I’ve followed the Revised Kahneman method for exam scripts quite happily for years without having thought about it – it started as a reaction against certain colleagues who not only marked each student’s entire script one after another but then insisted on arguing about whether it was e.g. ‘a decent 2.1 performance overall’ apparently independently of the marks they’d allocated to individual questions. Which is very silly indeed, and a complete waste of time.

As regards the residual halo effect, I’m not sure this is such a problem; isn’t it reasonable enough to judge one answer to a question in relation to the previous answer to the same question? The problem comes in if I’ve given (say) 60 to the first essay, so that the range of marks for the next, slightly inferior essay is then pretty limited, but that is a problem only if I refuse to countenance revising the first mark if that seems reasonable in the light of the second essay.
25 g 03.28.12 at 8:11 pm: _The test is (or should be) measuring ability. So, if someone does better on question 1, why is it a “bias” to expect them to do better on question 2? Seems more like a rational prediction. If I see a sports star sink a basket, I don’t predict he and I are equally likely to make the next shot._

A wagerer could bias a bet on the next shot based on the outcome of the previous shot. However, the referee must not judge the shots differently. “He made the previous shot, so I’ll give him two points for this shot that missed but almost went in.”

An instructor grading essays should be an impartial referee.
26 otto 03.28.12 at 8:31 pm: “insisted on arguing about whether it was e.g. ‘a decent 2.1 performance overall’ apparently independently of the marks they’d allocated to individual questions.”

These sorts of discussions are just one of the many ways in which joint grading of scripts and discussing them at meetings introduce, rather than reduce, error in students’ grades.
27 Barry 03.28.12 at 8:42 pm: “But F & E is the same as E & F, so P(H|E & F) = P(H|F & E)”

The statement before the clause is only correct if order does indeed not matter.
28 Barry 03.28.12 at 8:47 pm: dsquared @15: go read Andrew Gelmann’s blog; he covers things like this.

In short, the model matters, and the prior matters (which in Bayesian statistics can be subjected to sensitivity analyses).

As for not necessarily being able to determine a zero posterior probability, see the joke about the engineer, the mathematician, the beautiful woman, and a rule about only being able to cross half of the remaining distance at a time :)
29 Marc 03.28.12 at 8:49 pm: @22: the odd thing is that grades really are strongly correlated. Multiple choice grades track well with essay and short answer formats. This is even more obvious if there is a curve; no matter what Fred does he is a B- and no matter what Sally does she is a C+. And this holds true even for machine-graded multiple choice quizzes.

This does suggest that you probably don’t need the more labor-intensive exercises if the goal is assessment. Learning is a different matter, of course.
30 Khan 03.28.12 at 9:04 pm: So far, this discussion has focused on grading in the liberal arts. FWIW, on the other side of the fence in math/science/engineering, Kahneman’s revised method is used almost universally in cases where partial credit is ambiguous – e.g., word problems, complex diagrams, multi-step problems, etc. This surprised me at first; you’d think the revised method would be most favored in liberal arts, not vice-versa. I think it’s because STEM problems lend themselves to grading rubrics better than essay questions, yet have so many possible semi-correct answers that actually creating a rubric ahead of time is extremely tedious. So, we follow the revised method, and create a rubric on the fly. In retrospect, it’s a naturally occurring phenomenon – even most undergrad homework graders, without any instructions, quickly land on this method.
31 Chris Bertram 03.28.12 at 9:04 pm: Otto, I’d be more sympathetic to that view if I didn’t have long experience of wildly divergent marks, 40/70 disagreements and the like. Essay judgements in the humanities can be highly fallible and idiosyncratic. You need a check to protect students from that.
32 Tim Wilkinson 03.28.12 at 9:09 pm: dsquared – well, ‘impossible’ doesn’t really come into it, does it? ‘Certain (ly false)’ would be the operative concept I think.

And I have an inkling that Bayesianism-with-certainty is ‘impossible’ in the Arrovian sense. Since Bayes’s theorem involves division, Bayesianism with certainty (from P(p)=1 you immediately get P(-p)=0, so either 1 and 0 are admissible values or neither is) would throw up div/0! errors. But the Bayesian, being a thoroughgoing empiricist, shouldn’t actually do utter certainty anyway. (But humans functionally speaking do, therefore – or, to equivocate on “can’t”, because – they can’t be comprehensive Bayesians. Fixing Bayesian credences to 1 or 0 in empirical contexts does cause dogmatic errors in reasoning, so sucks to be us.)
33 Tim Wilkinson 03.28.12 at 9:17 pm: psycholinguist/Belgian Observer: but this is to confuse observation (or ‘information’) with inferred, probabilistic, credence. At some stage there does need to be some kind of distinction (or hierarchy) like that, however messy it may be to make it (Quinean holism presents a problem, not a solution). At least I remember being convinced of this in the past. For some reason like: otherwise we end up, very roughly speaking, building infinite nests of P() functions – matrices of P() functions indeed – around every proposition. Or some kind of analogue of Russell’s paradox. (The Naked Gun paradox – an 80% chance of rain but only a 20% chance of that). Perhaps I should have some coffee and try to get that clearer.

On this view, while the observation of the quality of the first answer may be relevant to what we should expect ex ante to observe in the second answer, that expectation is irrelevant to our observation of the second question. I’m a bit rusty on this stuff but Susan Haack’s foundherentism seemed to me to be on the right lines, fwiw.

(Just looked back at that and it’s probably all wrong and certainly not very clear, but maybe the CT-thread dialectic will help there. If it takes off, I may revisit some of this stuff.)

One other thing – not sure that viewing the matter as (to address only the discrete case) ‘what is the set of probability/quantity pairs, where the quantity to be predicted is the student’s knowledge’ is the best way of looking at it. If it were, markers should – I tentatively suppose – take into account all they know about the student, possibly more or less entirely disregarding the actual answers provided in the exam. Marking papers has a procedural aspect, the marker being required to mark the paper based only on what is in the paper, and perhaps plausibly by some kind of extension to mark each answer based only on what is in that answer.

Revising previous marks as suggested by Neville Morley might help to avoid a ratcheting up (or down) of marks by this kind of double-counting, but might also require quite a lot of back-and-forth to get everything mutually adjusted to a stable state.
34 P.D. 03.28.12 at 9:25 pm: I grade all of one question before moving onto the second, and I’m surprised that this seems like a revelation!
As Trey (@12) recommends, I randomize the order of exams in the stack for each question. Nothing utterly thorough, but some serious shuffling. This mitigates any effect from being at the top or bottom of the stack, and also may mitigate the problem of halo effects for being just after an especially good or bad exam.
35 dsquared 03.28.12 at 9:28 pm: dsquared @15: go read Andrew Gelmann’s blog; he covers things like this.

Barry, please don’t be patronising. And not just to me, you’ve been doing it quite a lot recently.
36 js. 03.28.12 at 9:32 pm: I use Kahneman’s grading method not to avoid halo effects but to make grading more efficient; grading the same question repeatedly increases the speed at which I can evaluate the response.

Seconded. Though I would say, …not primarily to avoid the halo effect….
37 Cosma Shalizi 03.28.12 at 9:55 pm: A Bayesian agent can certainly give 0 probability (or probability density) to a hypothesis with positive prior probability, if that hypothesis strictly rules out certain events. Remember p(H=h|X=x) is proportional to p(X=x|H=h)p(H=h), so if a particular h says what I actually observed, x, was impossible, out it goes.
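A minimal sketch of that proportionality, with two made-up hypotheses and a made-up 50/50 prior (none of these numbers come from the thread). The helper `bayes_update` is hypothetical; it simply multiplies prior by likelihood and renormalises.

```python
def bayes_update(prior, likelihood):
    """Posterior is proportional to likelihood times prior, for discrete hypotheses."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("every hypothesis says the observation was impossible")
    return {h: value / total for h, value in unnormalised.items()}

# Two assumed hypotheses about how X is distributed, equally likely a priori.
prior = {"uniform on [0, 1]": 0.5, "uniform on [0, 3]": 0.5}

# Density each hypothesis assigns to the observation x = 2.
likelihood = {"uniform on [0, 1]": 0.0, "uniform on [0, 3]": 1 / 3}

posterior = bayes_update(prior, likelihood)
print(posterior)
```

The first hypothesis started with positive prior probability and ends at exactly zero, because it assigned zero density to what was actually observed.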

It is however true that a Bayesian agent can only ever learn something it has always already believed.
38 Soru 03.28.12 at 10:02 pm: If something starts off as not impossible, I don’t see how a finite amount of statistical evidence can make it infinitely improbable. And if there is one or more pieces of evidence that does make it impossible, then surely it will remain so whatever the order the other evidence is looked at. If not, then things are inconsistent.
39 dsquared 03.28.12 at 10:04 pm: Ahh good point; the paradigm case being that when something happens, it falsifies the hypothesis that it didn’t. But how do you know that p(x|h)=0? Nicholas Taleb would surely say you can’t be in that position for non-trivial h.
40 Cosma Shalizi 03.28.12 at 10:28 pm: 38: Consider the hypothesis that X is uniformly distributed on [0,1], confronted with the observation that x=2.

39: well, this is why we have a more complicated and interesting theory of hypothesis testing than “two words: modus tollens”, no? You can even use it to check Bayesian models, though the tests do not make sense to strict Bayesians.
41 Henry 03.28.12 at 11:02 pm: bq. Barry, please don’t be patronising. And not just to me, you’ve been doing it quite a lot recently.

Seconded. You comment quite a lot, but your actual contribution to argument is at best moderate. I would strongly recommend that you think about only commenting where you have something substantive (i.e. not just snark, or policing what you consider to be the acceptable boundaries of discourse) to say. As it is, I suspect that you drive out more good conversation than you encourage.
42 psycholinguist 03.28.12 at 11:58 pm: So, as I’m reading through the responses, I’m struck by what seems to me the promotion of a larger scale halo effect. If, as many of you have explicitly stated, you “adjust” a previous score after gathering experience from the larger group of essays, then scores that go from the C+ to the B- have had the halo benefit, haven’t they? After all, your original judgement probably hasn’t changed about the content of the work, just your judgement about the work as compared to other examples, and so some responses benefit from the surrounding context, and some are likely penalized.
43 Daniel 03.29.12 at 12:25 am: #40: but generally, any theory that is capable of assigning p=0 to a hypothesis on the basis of evidence is going to have the characteristic that the order in which evidence arrives matters? Unless it is a theory in which p=0 isn’t a special value, in which case I don’t think it can be described as Bayesian.
44 Cosma Shalizi 03.29.12 at 12:44 am: 43: I don’t see why. As a merely statistical example, consider the problem where X1, X2, … are independently and identically distributed, uniformly on some interval, and the problem is to determine what the interval is. For a Bayesian agent, the posterior probabilities after seeing X1=a, X2=b, …, X_{n-1}=y, Xn=z will be the same as after seeing X1=z, X2=y, … X_{n-1}=b, Xn=a, or any permutation thereof; nonetheless, some intervals which had positive prior belief will now have been definitely ruled out.

I am willing to accept that this is a trivial problem in Taleb’s sense.
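
The order-independence claim in comment 44 can be checked on a toy version of the problem. The sketch below is only illustrative: it assumes a small discrete family of hypotheses "X is uniform on [0, theta]" for theta in 1..4, a flat prior, and three made-up observations; the names `update` and `posterior` are hypothetical helpers, not anything from the thread.

```python
from fractions import Fraction
from itertools import permutations


def update(prior, x):
    """One Bayesian update over hypotheses 'X ~ Uniform[0, theta]'."""
    unnormalised = {}
    for theta, p in prior.items():
        # The density is 1/theta inside the interval and 0 outside it,
        # so an observation outside [0, theta] rules theta out completely.
        likelihood = Fraction(1, theta) if 0 <= x <= theta else Fraction(0)
        unnormalised[theta] = likelihood * p
    total = sum(unnormalised.values())
    return {theta: w / total for theta, w in unnormalised.items()}


def posterior(prior, observations):
    """Apply the updates one observation at a time, in the order given."""
    belief = dict(prior)
    for x in observations:
        belief = update(belief, x)
    return belief


# Assumed toy setup: flat prior over four candidate intervals.
prior = {theta: Fraction(1, 4) for theta in (1, 2, 3, 4)}
data = (Fraction(1, 2), Fraction(5, 2), Fraction(3, 2))

# Posterior for every possible ordering of the same three observations.
posteriors = [posterior(prior, order) for order in permutations(data)]
```

Under these assumptions every ordering ends with the same posterior: the intervals [0, 1] and [0, 2] are ruled out by the observation 5/2, whenever it arrives, and the remaining weight is split between [0, 3] and [0, 4] in the same proportions.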
45 QB 03.29.12 at 12:45 am: It is however true that a Bayesian agent can only ever learn something it has always already believed.

You mean “always already believed to be possible“, that is, assigned nonzero prior probability to… no?
46 Watson Ladd 03.29.12 at 1:02 am: There seems to be a lot of confusion about what Bayesians actually believe. I’ll answer about my beliefs, which might not apply to all Bayesians.

First, Bayesians do not believe that zero and one are legitimate priors for things that are conceivable. This is a feature not a bug: on what basis would you believe that something that is logically possible is impossible? Might it not be possible that your evidence is wrong? Cosma’s example works: the evidence conditioned on the prior has zero probability, so the posterior that the sample is uniformly distributed is zero.

Secondly the Bayesian updating formula is the only correct updating formula. Why? First, because we believe in the axioms of probability as given by Kolmogorov. From this it follows that the Bayesian updating formula is the only correct one through a fairly standard argument. Lastly, there are certain strong epistemological statements about what distributions represent, and with them a theory that states that Bayesians cannot argue for long.

There are deep problems however. The first is that distributions do not admit a probability measure. This is technical, but essentially there are in some sense “too many” distributions to think of probability. One way of emerging from this morass is to note that we usually have lots of information that leads us to a prior on distributions that is reasonable. (Don’t ask where priors come from)

The second deep problem applies to the issue of insufficient evidence and priors designed to give largely the same results, except for anomalies. But this affects all statistical inference: it’s the grue problem, and using a prior weighted to simplicity only pushes it back.
47 Barry 03.29.12 at 1:08 am: Me: “dsquared @15: go read Andrew Gelmann’s blog; he covers things like this.”

dsquared: “Barry, please don’t be patronising. And not just to me, you’ve been doing it quite a lot recently.”

Henry: “Seconded. You comment quite a lot, but your actual contribution to argument is at best moderate. I would strongly recommend that you think about only commenting where you have something substantive (i.e. not just snark, or policing what you consider to be the acceptable boundaries of discourse) to say. As it is, I suspect that you drive out more good conversation than you encourage.”

Actually, I wasn’t trying to be. Perhaps I should rephrase it as that a naive view of Bayesian statistics will lead to naive conclusions about Bayesian statistics. [please note that ‘a naive view of X leads to naive conclusions’ is something which factors into my life more than it should]

As Cosma has thankfully pointed out, it’s quite possible to come up with things which can be discarded 100%. However, even that’s not necessary; a posterior can assign infinitesimally small weight to a given value for an item (which, of course, can still be a problem when looking at functions which blow up at certain values).

BTW, the joke, which I had assumed that anybody dealing with math or engineering had heard, is that an engineer and a mathematician (both male, this joke is probably from the 1950’s) were suddenly placed in a room with a, ah, ‘friendly’ woman and a wooden chair. A voice says that they can approach the woman, but only if they cover half of the remaining distance each time. If they violate that rule [lightning bolt from nowhere turns the chair to ashes].

The mathematician just sits down on the floor, while the engineer starts towards the woman. The mathematician says, ‘you know you’ll never get there’. The engineer smiles and says ‘I’ll get close enough for practical purposes’. The application to Bayesian statistics and assigning [close to] zero weight in a posterior to certain values is obvious.
48 bxg 03.29.12 at 2:03 am: @dsquared “…has the (to me, at least as unattractive) characteristic that a Bayesian can never believe anything to be impossible.”

Since you want to take off the table evidence that is just logically incompatible with the hypothesis (in which case Bayes does what you want anyway, as subsequent posts including yours do concede) I’m just very curious about what is making you unhappy here. You think some “X” is initially possible, and you see a mass of evidence that is probabilistically implausible and yet still technically consistent with X – your complaint seems to be that Bayes still gives >zero posterior to X? The posterior may become trivially small but if I read you correctly that’s still a problem for you – ? I can see if you would like to reach a point where inference says X is “practically” impossible (but: (a) this would involve some sort of knob where you dial in what practicality means to you, and (b) given that knob Bayes has no problem giving you what you want). But am I right that you want an inference method that says: “no, X is now definitively false”. Can you tell us about some other reasoning method that you are happier with in this respect?

N.b. if pressed, I’m pretty sure there is nothing, nothing at all, I believe to be absolutely utterly – with no equivocation of any degree however small – impossible. Not even tautologies. Maybe that’s why I don’t see the downside of an inference method that can’t get to this absolute certainty either.
49 Matt 03.29.12 at 2:12 am: N.b. if pressed, I’m pretty sure there is nothing, nothing at all, I believe to be absolutely utterly – with no equivocation of any degree however small – impossible.

You’re not sure if someone could produce a complete list of prime numbers? Or exactly represent the square root of 3 as a ratio of two whole numbers?
50 J. Goard 03.29.12 at 2:19 am: Still IF it is the case that the sequence in which an ideal Bayesian reasoner encountered the evidence would have an effect on that reasoner’s final beliefs […]

Is one assumption here that grades are meant to be “final beliefs” about students’ relative abilities in the subject (which itself assumes that ability can be reasonably collapsed into a single dimension, perhaps more reasonable for some course topics than others), rather than, say, the outcome of a sufficiently fair “tournament” involving an interaction of ability and chance?
51 bxg 03.29.12 at 3:22 am: > You’re not sure if someone could produce a complete list of prime numbers? Or exactly represent the square root of 3 as a ratio of two whole numbers?

No, not completely sure. My (minuscule) doubt here centers on my own cognitive abilities. Yes I’ve seen proofs that these are impossible, and they are simple proofs, using mathematical/logical steps that seem beyond reproach. And I think I’m mathematically sophisticated so my broad agreement goes way beyond “I’ve seen proofs” – I could argue any of these points twelve different convincing ways.

And I cannot conceive of a situation where I would act as if either of your claims were wrong, or even in doubt. I don’t see the utility function that could lead to this.

But no, I am not _sure_. I know too much about the cognitive traps that we as humans fall into (and that’s even excluding a god-like being that can manipulate human beliefs for obscure ends … which I also cannot say is absolutely impossible) to be confident with total certainty: not nearly 1, not 1.0 – 10^-googolplex, but absolute certainty.
52 bxg 03.29.12 at 3:40 am: > You’re not sure if someone could produce a complete list of prime numbers? Or exactly represent the square root of 3 as a ratio of two whole numbers?

I should add: of course, I’m – for all conceivable _purposes_ – sure of these claims (no finite list of primes is possible, sqrt(3) is irrational). My – really sincere – question is why it isn’t good enough that Bayes will get you to position #1 “the posterior on that is just so ridiculously tiny, I basically don’t even need to know your utility function since it’s just not worth talking about” in the face of adverse evidence. dsquared seems to be unhappy that we never get to position #2 that says: “No. Simply no. False. Not false for all practical purposes, but false. Utterly disproven.” That’s a subtle difference. But I reiterate that for me, trying to introspect appropriately, I don’t believe I’ve ever been at point #2 vs #1. Perhaps because I view the difference as amazingly uninteresting. But if pressed I think #1 is always more accurate than #2 up to and including mathematical “facts” and, yes, including tautologies. You may know there are people who argue against the valid truth of “A or -A”? Yes, they have got to be wrong, or they are interpreting meaning differently than intended, or…? They can’t be understanding the question and its premises, right? Right with no possible doubt however minuscule?????
53 ChrisTS 03.29.12 at 5:04 am: Trey 03.28.12 at 5:20 pm

“I use Kahneman’s grading method not to avoid halo effects but to make grading more efficient; grading the same question repeatedly increases the speed at which I can evaluate the response. However, as you note in the original post, it does tend to become tiresome and students whose responses are graded at the tail end of the process tend to get more cursory comments. To counteract this, I randomize the order of students on a per-question basis.”

I do something like this, and I do think it helps. In fact, I find reading all the number P essays together far less tiresome than reading one complete exam after another.
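
The per-question randomisation described in the quoted comment is easy to mechanise. What follows is a minimal sketch, not anyone's actual procedure: the function name `marking_schedule`, the student labels and the optional `seed` parameter are all assumptions made for illustration.

```python
import random


def marking_schedule(students, questions, seed=None):
    """Return (question, student) pairs: one question at a time,
    with the student order reshuffled independently for each question."""
    rng = random.Random(seed)  # seed only makes a given run reproducible
    schedule = []
    for question in questions:
        order = list(students)
        rng.shuffle(order)  # fresh random order for this question
        schedule.extend((question, student) for student in order)
    return schedule


# Hypothetical class of three students answering two questions.
for question, student in marking_schedule(["ann", "bob", "cy"], ["Q1", "Q2"]):
    print(question, student)
```

Marking in this order keeps all answers to one question together while spreading any end-of-pile fatigue across different students on each question.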
54 ChrisTS 03.29.12 at 5:12 am: Marc 03.28.12 at 8:49 pm

” the odd thing is that grades really are strongly correlated. Multiple choice grades track well with essay and short answer formats. This is even more obvious if there is a curve; no matter what Fred does he is a B- and no matter what Sally does she is a C+. And this holds true even for machine-graded multiple choice quizzes.”

I’m struck by this claim. I have found, at least in introductory courses, that students’ grades vary widely depending on the format. I don’t know if this is in some way peculiar to philosophy (I would not think so), but I wonder if there are disciplinary asymmetries.
55 dsquared 03.29.12 at 5:54 am: Barry, I have no idea what kind of reasoning led you to the conclusion that the way to improve your contributions to the debate was to start making sex jokes.

Cosma #43: I think the Talebian response to your version of the problem would be to ask whether one was prepared to admit the existence of some piece of evidence B that made you believe that there was a chance that X1 = a … were not valid observations of the underlying X1 … If such a B could exist, then obviously you can’t ever rule anything out; if you are capable of getting into a cognitive state where you’ve actually ruled something out, then it matters very much whether you find out about B before or after you see the Xns.

[ie, Nick sees a white swan and therefore asserts “some swans are white” with certainty. The next day, he sees something that looks like a white swan, but is actually a black swan that someone has dipped in flour. But he still believes “some swans are white” with certainty, because he’s the kind of Bayesian that can believe p=0 on the basis of evidence. Naz, on the other hand, saw the fake white swan yesterday, and so when he sees the white swan today, he still believes that “no swans are white” has p>0]

BXG thinks this isn’t a problem; that like the halo effect in grading it’s evidence that we are all wrong and should be more Bayesian, but I think it’s unattractive to have to admit when pushed that there is a nonzero probability, however tiny, that the world is full of invisible pink unicorns, or that we are all brains in vats or whatever.
56 Chris Bertram 03.29.12 at 7:20 am: #42 No, not all ex post adjustment in the light of marks represents a “halo effect”. A “halo effect” is a psychological mechanism that introduces bias; adjustment in the light of the statistical properties of a large collection of scripts is usually bias-reducing. The Open University, which has a very large number of independent markers for their units, examining an enormous number of scripts, routinely engages in this kind of statistical adjustment to ensure that students don’t get advantaged or penalised because of which marker they happened to get. And quite right too.

More generally, you and B.O. have argued that there’s nothing wrong with the halo effect. But, assuming that the purpose of grading the script is to produce an accurate assessment of the student’s performance, there’s everything wrong with a procedure that produces a different answer depending on something as arbitrary as the order you mark a student’s answers in. That’s the point here.
57 magistra 03.29.12 at 7:48 am: Troll mode on: But surely the halo effect has one vital function: it’s more efficient? You don’t need to consider in detail what the student has actually written for a particular question, you just go for the general impression of whether they’re a good student or not. And that saves a lot of the marker’s time.

What I find odd is that the thread about community colleges was largely about how it takes people an excessive length of time to do tasks such as marking. And then it’s followed by a discussion of how you can mark most effectively that introduces potentially time-consuming additions like randomisation and going back and rechecking your marking for consistency. That’s undermining all the lecturers who grade by throwing the papers up in the air on a staircase and seeing what step they land on. And they’re probably single parents as well!
58 dsquared 03.29.12 at 8:18 am: Troll mode on: But surely the halo effect has one vital function: it’s more efficient? You don’t need to consider in detail what the student has actually written for a particular question, you just go for the general impression of whether they’re a good student or not.

I think that what it does suggest is that you shouldn’t have too many questions on an exam, because the halo effect and similar issues mean that the benefit of allowing the student extra chances to shine is not as great as one might think. And that it might make sense to have a greater number of shorter exams rather than one big one at the end. Better living through science!
59 magistra 03.29.12 at 9:13 am: And that it might make sense to have a greater number of shorter exams rather than one big one at the end.

But there are administrative costs in time to doing that – every time you set an exam, you have to liaise with surfer marking dude. It’s far more efficient to send him an exam to mark once or twice a year than every week. (It may be a less effective form of education, but hey, this isn’t Oxbridge).
60 Daniel 03.29.12 at 9:17 am: Yes there are trade-offs. I’m not sure what point you are making though – it seems to be something along the lines that people who are concerned about efficient allocation of limited resources can’t possibly be concerned about quality. But that would be such an obviously invalid point that I’m presumably misunderstanding you.
61 Katherine 03.29.12 at 9:41 am: Would it matter if there was a halo effect as long as all the students get the benefit of it? Doesn’t everyone write their best essay first?
62 ajay 03.29.12 at 9:44 am: It may be a less effective form of education, but hey, this isn’t Oxbridge

…where, in fact, they are AFAIK generally still running the “one big exam at the end of the year” model.
63 ajay 03.29.12 at 9:45 am: 61: very good point.
64 Chris Bertram 03.29.12 at 10:01 am: @Katherine … Looking at an old spreadsheet on which I’d recorded some marks, I found that the first question answered was the best or best= in only 50% of the cases. So that’s some empirical evidence to the contrary.
65 Barry 03.29.12 at 11:53 am: dsquared 03.29.12 at 8:18 am

” I think that what it does suggest is that you shouldn’t have too many questions on an exam, because the halo effect and similar issues mean that the benefit of allowing the student extra chances to shine is not as great as one might think. And that it might make sense to have a greater number of shorter exams rather than one big one at the end. Better living through science!”

I think that what it suggests is that a larger number of questions doesn’t give increasing information at the rate that one would think, if one didn’t know about the halo effect. That’s different.
66 J. Otto Pohl 03.29.12 at 12:36 pm: The final exam here is worth 70% of the grade and consists of three essay questions for each class. It used to be 100% and since I am grading mid terms now I wish it was still 100% of the grade. I grade the tests blind so a good midterm result does not necessarily mean a better final grade. But, I am not sure why more and smaller assignments are considered better than fewer larger ones. If anything they add unnecessary grading work that detracts from more worthwhile things.
67 psycholinguist 03.29.12 at 1:24 pm: #56 – I would disagree – to a cognitive psychologist, there is nothing special about the halo effect other than it has a cool name – it is simply referring to one possible outcome of the much more general process of top-down or theory driven perception (Kahneman & Tversky called it framing).
Perception is never a one-way street, where the ultimate percept is simply the “pure” resulting product of all the perceptual processes that happened along the way (that would be the Bayesian view I think). Perception is shaped by prior experience and context, and it should be. You note that:
“that the purpose of grading the script is to produce an accurate assessment of the student’s performance, there’s everything wrong with a procedure that produces a different answer depending on something as arbitrary as the order you mark a student’s answers in. That’s the point here.”

So, I have a class of one single student, Johnny, and I give his essay a C. Now suppose there are 20 students in the class, and Johnny’s same essay answer, when weighted against the others, now results in a B. Now suppose that my best student (Jill) was out that day, and I don’t have the benefit of her answer as part of the context to evaluate Johnny’s essay, and his essay, when evaluated in light of the others, gets a B+. That seems to be exactly what is happening when grades are adjusted as the prof gains more experience with the class answers as a whole – the perception of an answer has changed from time one to time two because of that experience. How is that any less arbitrary?

Suppose you receive an essay answer from a class consist
68 Chris Bertram 03.29.12 at 1:36 pm: Sorry psycholinguist, but your toy example is utterly beside the point. The kind of statistical adjustments done by the OU (for example) are on samples of hundreds (or even thousands) of scripts, so “Jill being absent that day” doesn’t cut it as a counterexample.
69 psycholinguist 03.29.12 at 1:51 pm: #68 I would agree, but the OU has such a large database of examples that it is approaching an accurate representation of the population as a whole – that’s just really good sampling. Is that really the case for most of our classes? I’m not trying to troll here – this is a real issue and one that any of us who want to be as fair and accurate as possible with our students must come to terms with. But what I resist is the idea that there is a “context free” assessment that exists for a particular paper to be discovered if we could just eliminate the bias. Anyway, I appreciate the topic, and it is a good reminder to all of us to be systematic in grading.
70 bianca steele 03.29.12 at 1:58 pm: @Matt, bjk:
There’s someone out there arguing we have to give up the truth of an infinity of primes because it can’t be proved statistically? Really?
71 Barry 03.29.12 at 2:03 pm: “#68 I would agree, but the OU has such a large database of examples that it is approaching an accurate representation of the population as a whole – that’s just really good sampling. Is that really the case for most our classes? ”

For classes with only a few graders, this would be a problem, and the model would be based on something simpler, like a mean adjustment for graders, based on them each grading a random selection of papers. Alternatively, one could have them grade the same (small) random sample to start, and then run a model to estimate adjustment factors.

The first is what OU is probably doing (at a guess), unless they deliberately send each grader a baseline set to grade, to determine an individual rating factor for each grader. I’d recommend the second approach, but it would be more of a cost, since OU would be paying for some amount of work for which they couldn’t directly bill.
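
The second approach in comment 71, where every grader marks the same small baseline set, can be sketched with the simplest possible model: a constant offset per grader. This is an assumption made for illustration, not a description of what the OU or anyone else does; the function names and all marks below are invented.

```python
from statistics import mean


def grader_offsets(baseline_marks):
    """baseline_marks maps each grader to their marks on the SAME baseline
    scripts. A grader's offset is how far their mean mark sits above (+) or
    below (-) the mean over all graders."""
    overall = mean(m for marks in baseline_marks.values() for m in marks)
    return {g: mean(marks) - overall for g, marks in baseline_marks.items()}


def adjust(mark, grader, offsets):
    """Remove a grader's estimated leniency or severity from a raw mark."""
    return mark - offsets[grader]


# Invented baseline: both graders marked the same three scripts.
baseline = {"A": [60, 70, 80], "B": [50, 60, 70]}
offsets = grader_offsets(baseline)

# A raw 62 from the harsher grader B is moved up by B's estimated offset.
adjusted = adjust(62, "B", offsets)
```

A constant offset only corrects for overall leniency or severity. It says nothing about graders who spread marks more or less widely, which would need a richer model and a larger baseline set.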
72 Barry 03.29.12 at 2:07 pm: “But what I resist is the idea that there is a “context free” assessment that exists for a particular paper to be discovered if we could just eliminate the bias.”

If you think of actually eliminating the bias, that is a really hard goal. However, if you think of getting the bias down to a small level, it’s doable.

” Anyway, I appreciate the topic, and It is a good reminder to all of us to be systematic in grading.”

I think that the takeaway is that if you’re not working at eliminating biases in grading, then you’re biased, and that customary methods should be evaluated to see if they really work.
73 primedprimate 03.29.12 at 4:17 pm: If we view tests as not just a way to evaluate students, but also a way through which we teach them and as a way through which we evaluate ourselves, then a longer test may make more sense regardless of how much (or how little) grade-relevant information we glean from subsequent answers.

By teach, I mean two things: force the students to think through concepts in a timed setting and provide feedback that helps them understand how they erred in their answers. For the former, a long test with a randomly graded answer is efficient.

For the latter, a long test works better and grading one student at a time works better because then it is easier to understand exactly how the student is erring and there is less redundancy in the feedback provided. The halo effect here could work to the detriment of students who have done a splendid job on their first answer because I will be less likely to make extensive comments on ambiguously written subsequent answers if I keep giving them the benefit of the doubt.

I also use tests to evaluate myself – grading one question at a time is very helpful because that makes it easier to spot the same mistakes being repeated by different students and that tells me where I could do better. This is one reason why short answer questions are so much more useful than multiple choice questions (which I think are great if efficient evaluation were our only goal).
74 bxg 03.29.12 at 7:20 pm: > There’s someone out there arguing we have to give up the truth of an infinity of primes because it can’t be proved statistically? Really?

Has this “someone”’s comment been deleted? This seems a crazy, almost nonsensical, claim – so silly, indeed, that it’s probably better if we can also see whoever-it-is you-are-criticizing’s original words before thinking too hard about whether something so out there could possibly be defensible.
75 bianca steele 03.29.12 at 7:49 pm: bxg:
Did I offend you by misspelling your initials? Sorry about that.

I fully agree with you that someone who claimed that even by the rules of mathematics, it could be an open question whether the number of primes is infinite, would have misunderstood something very severely (if that’s what you were arguing). Yet, if you happened to find yourself in a room filled with people who believed maybe such a claim was possible, this “You may know there are people who argue against the valid truth of “A or -A”? Yes, they have got to be wrong, or they are interpreting meaning differently than intended, or…? They can’t be understanding the question and its premises, right?” might not quite do the trick.
76 leederick 03.29.12 at 9:31 pm: I think people underestimate CPAs. Before signing a set of accounts they’ll routinely do things like re-perform the calculations which transform invoices into expenses – and insist on a wholesale reworking of the accounts if sampled items show a material difference to posted results. External examiners in my experience wouldn’t re-mark papers blind and insist on wholesale remarking if they find error. They’ll look at papers that are already marked and see if the grades seem okay, and maybe adjust some results where they don’t. Financial audit methodology is really quite formidable, I’m not sure you can assume UK academics will have invented a better system without examining what CPAs actually do.
77 bxg 03.29.12 at 11:29 pm: > bxg: Did I offend you by misspelling your initials? Sorry about that.

No offense. But how can you interpret my earlier comments as questioning the infinitude of primes _because there is no statistical proof_? That’s not fair; I didn’t say that and I don’t see how you can twist it this way. First, I didn’t say that you need to give up anything (you believe what you want to believe, I won’t object), and second I certainly didn’t mean to imply that statistical “proof” is relevant one way or another (it’s not, IMO).

I, myself, and I am not trying to say others are obliged to follow, have some doubt as to everything including tautologies and established mathematical facts. You can of course even laugh at me but I’d rather you not misrepresent me. To be clear, I have no _practical_ doubt at all – but my comments are in response to d^2 who sees (what I do not see) a qualitative distinction between (a) absolute certainty vs (b) belief that something is false with only minuscule probability. I offer myself as an example of someone who – even if the distinction is even meaningful – never (*) believes (a) other than as a shorthand for (b). Not even for quasi-elementary mathematical facts.

(*) I’m not entirely sure of this.
78 bianca steele 03.30.12 at 12:57 am: I’m not going to speak for Matt, and I don’t think there are necessarily only two possibilities (proof by logic from first principles (roughly), and statistical proof by induction), but it did seem to me he probably understood you as setting up those two alternatives as the only ones and choosing the latter. And your response seemed to me to agree with (what seemed to be) his guess.

You’ve said that you (knowing how mathematics works and having examined the work yourself in detail – and agreeing that the work in question is “simple”) are unsure whether the proofs we currently have, of the infinity of primes, were carried out correctly. Not because in a thousand years something like non-Euclidean geometry might have to be accommodated. But because human beings make mistakes. And apparently this is relevant somehow to Bayesian reasoning from empirical data. Though you don’t go into much detail about the connection between Bayesian reasoning and the rest of your post. I don’t think it is unreasonable to assign a relatively high probability to the possibility that your doubt about these proofs has something to do with discomfort – based in what, I don’t know – with arguments from premises that aren’t derived empirically.

My mind is still working on how we might decide the set of primes isn’t infinite. Will we decide that each prime number corresponds to a thing in the world, and that there are really only 1024 of those? I mean, I am willing to suppose that Euclid himself thought prime numbers were like gnomes and that he was proving there were an infinite number of Hellenistic gnomes in the world. But I can’t begin to imagine a turn of events, beginning from that discovery, that would result in everybody deciding to abandon all the parts of number theory that failed to correspond to the objectivity of gnomes, and also deciding that the gnome theorists were the ones who truly understood Euclid all along.

But I still don’t see how this makes Bayesian inference applicable to the infinity of primes. Or even to the proofs of the infinity of primes. So your posts simply puzzle me.
79 bianca steele 03.30.12 at 1:11 pm: Did I do that? Sorry. It’s just that I think “all the mathematicians may simply have gone wrong on infinite primes” is on a par with “the Prince of Wales may not in fact speak English” and “today might not really be Friday.” I’m much more willing to consider the existence of gnomes than that this particular proof is in error.
80 Western Dave 03.30.12 at 7:58 pm: Psycholinguist: If you did that ask 5 questions grade 1 bullshit to me as an undergraduate I would have punched your lights out and our disciplinary process would have found me not guilty (and I went to a Quaker school). Unless you expect your students to do only 1/5th of the assigned reading and lectures (somehow I don’t think you bother to run discussion sections), all you are doing is saying “look how much work I can make you do and how little I care about it! Now excuse me, I have more important things to do!”
81 Frances Woolley 04.01.12 at 12:23 pm: The reaction to this post shows how valuable it is for academics to talk and think about how we teach – which, whatever one thinks about the earthshattering import of one’s own research, is what most academics are paid to do.

Having marked thousands of exams on a question-by-question basis, a couple of responses (I haven’t read all of the discussion, so others may have raised these points already).

“one’s judgement of an immediately subsequent answer to the same question in consecutive booklets or script is influenced by the preceding one;” – yes, but this effect can be dispersed across papers by shuffling the exams between markings, an exercise that occurs quite naturally for the less tidy among us.

“reading answers to the same question over and over again can be even more tedious than marking usually is.” I don’t find it so. By reading responses to the same question over and over again you find yourself thinking really deeply about it, and seeing patterns in the types of mistakes students make, all of which helps your own understanding. And any greater tedium is more than compensated for by the greater speed – it’s much faster, because you can hold a single simple rubric in your mind.

It’s also easier to spot people who submit identical answers to questions.
82 Bloix 04.01.12 at 9:25 pm: #76 – I’m the one who brought up CPA’s. CPA’s are careful about their results because if they are incorrect they may lose their jobs or be sued. No professor, so far as I know, has ever lost a job because he or she gave a C to a paper that deserved a B.

#19 – Chris Bertram – you’re right, I have no familiarity with grading in the UK. Perhaps things are better where you are.

It is my belief, based on the personal experience of myself, my friends, and my children, that in the US grades are left entirely to the discretion of the individual professor or instructor, with perhaps the limiting factor that the final results should fit roughly within a pre-established distribution. Professors, instructors, and TA’s are given no training or guidance in evaluation of student work and other than the distribution, no effort is made to ensure consistency as between different teachers. In some schools (e.g., law school) exams are often identified by number to prevent gender or race discrimination. Other than these very loose constraints, there is no effort to prevent arbitrary or unfair results that may result from bias, laziness, or systematic error.

The result is the creation of a university-wide database of information that would be rejected by any statistician (or economist, sociologist, etc) in the university as being utterly unreliable. And yet it seemingly doesn’t occur to anyone to devote any time or money to the application of the knowledge created in the university to the most important data set created by the university – the numerical evaluation of the students, who are the ostensible reason the institution exists, and who pay its bills.

Why is that?

Perhaps I’m wrong about this. If so, someone on this blog surely can correct me.
83 Katherine 04.02.12 at 5:10 pm: “In some schools (e.g., law school) exams are often identified by number to prevent gender or race discrimination.”

Okay, I’m actually quite shocked. Do you mean to say that there are universities that don’t do this as a general rule? When marking things that actually count for real life marks (rather than just essays that are marked and reviewed to let people know how they are doing)?

Comments on this entry are closed.
