Leiter Reports: A Philosophy Blog

News and views about philosophy, the academic profession, academic freedom, intellectual culture, and other topics. The world’s most popular philosophy blog, since 2003.

The NRC Releases Its “Methodology” for Ranking Graduate Programs (Ranking to Follow…)

IHE has a useful story and summary of the 200-page (!) document. What made the earlier NRC reports (1982, 1995) useful was that they included systematic surveys of experts in different disciplines evaluating program faculty and training of students. That is no more. According to the IHE article, the worry about expert evaluation was that "Many people assume departments at outstanding universities must be outstanding as a result, even if that's not the case, or people who associate certain stellar researchers with a department may not know that they have retired."  Dare I observe that there is a pretty simple solution to these problems: ask experts to evaluate faculty lists, not university names, and make sure the faculty lists are current and exclude those who are retired, dead, not really teaching, etc.

Instead of the peer evaluations that made the prior NRC reports so important, programs will now be evaluated using 21 different variables–many different in kind from each other (see below)–and all weighted differently.  Here are the variables being utilized (I wish I were making this up, but, really, I'm not!):

The 21 Program Characteristics Listed in the Faculty Questionnaire.

Faculty characteristics
i. Number of publications per faculty member
ii. Number of citations per publication (for non-humanities fields)
iii. Percent of faculty holding grants
iv. Involvement in interdisciplinary work
v. Racial/ethnic diversity of program faculty (only non-Asian minorities count)
vi. Gender diversity of program faculty
vii. Reception by peers of a faculty member’s work as measured by honors and awards
 
Student characteristics
i. Median GRE scores of entering students
ii. Percentage of students receiving full financial support
iii. Percentage of students with portable fellowships
iv. Number of student publications and presentations (not used)
v. Racial/ethnic diversity of the student population (only non-Asian minorities count)
vi. Gender diversity of the student population
vii. A high percentage of international students
 
Program characteristics
i. Average number of Ph.D.’s granted in last five years
ii. Percentage of entering students who complete a doctoral degree in a given time (6 years for non-humanities, 8 years for humanities)
iii. Time to degree
iv. Placement of students after graduation (percent in either positions or postdoctoral fellowships in academia)
v. Percentage of students with individual work space
vi. Percentage of health insurance premiums covered by institution or program
vii. Number of student support activities provided by the institution or program

The weightings to be used in the case of philosophy programs are not yet public–the weightings were determined in each case by a survey of people in the field. No doubt many of these individual measures will be illuminating, but the idea of aggregating them in order to say that "Ivy University is in the 5-15 cluster" will produce a meaningless, 'nonsense' number: what does it mean to say Ivy University is somewhere between 5th and 15th based on some aggregation of the number of publications per faculty member, the number of international students, the number of non-Asian minority faculty, and the number of student support activities? Who would care about such an aggregation? What is most distressing is that the NRC has eliminated any meaningful measure of faculty quality, relying on factors that have no qualitative dimension (e.g., publications per faculty member) and proxies for quality like grants and honors, some of which are certainly probative (e.g., Guggenheim or NEH Fellowships), others of which will just reinforce traditional hierarchies because of their insular and self-reinforcing nature (e.g., American Academy of Arts & Sciences membership).
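To make concrete what this kind of aggregation amounts to mechanically, here is a minimal illustrative sketch: a handful of unlike variables are standardized and combined with weights into a single composite score. The variable names, weights, and numbers are invented for illustration only; they are not the NRC's actual inputs or weightings.

```python
# Illustrative sketch only: a weighted aggregation of unlike program variables
# into a single composite score. All names, weights, and numbers are invented;
# they do not reproduce the NRC's actual variables or weightings.
import statistics

programs = {
    "Ivy University":   {"pubs_per_faculty": 3.2, "pct_intl_students": 40, "support_activities": 12},
    "State University": {"pubs_per_faculty": 2.8, "pct_intl_students": 25, "support_activities": 18},
    "Tech Institute":   {"pubs_per_faculty": 2.0, "pct_intl_students": 55, "support_activities": 9},
}
weights = {"pubs_per_faculty": 0.6, "pct_intl_students": 0.1, "support_activities": 0.3}

def standardize(values):
    """Convert raw values to z-scores so variables in different units can be summed."""
    mean, sd = statistics.mean(values), statistics.pstdev(values) or 1.0
    return [(v - mean) / sd for v in values]

names = list(programs)
composite = {name: 0.0 for name in names}
for var, w in weights.items():
    for name, z in zip(names, standardize([programs[n][var] for n in names])):
        composite[name] += w * z

# The output is exactly the kind of single "nonsense number" criticized above.
for name, score in sorted(composite.items(), key=lambda kv: -kv[1]):
    print(f"{name}: composite {score:+.2f}")
```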

And then, of course, there is the delay issue.  Most of the data collection on faculty took place over three years ago.  Among those who would have been included for philosophy at UT Austin, for example, are Robert Kane [now retired], me, and Robert C. Solomon [now deceased].  Chicago's evaluation will presumably include William Wimsatt (now retired), John Haugeland (retiring next year), and Charles Larmore (left for Brown).  One Ohio State department reports that more than 20% of the faculty is new since the time they submitted the faculty questionnaires to the NRC, while nearly 20% of the faculty at OSU then have either left or retired.  There will obviously be substantial variation in how much these changes in faculty rosters over the last 3-4 years matter, but in some cases, they will be very significant.     

In any case, I would be most interested to hear what philosophers think of the variables the NRC is using and also what they think of the idea of an aggregation of such variables. Non-anonymous comments preferred, though you must at least submit a valid e-mail address; submit comments only once, as they may take a while to appear.


26 responses to “The NRC Releases Its “Methodology” for Ranking Graduate Programs (Ranking to Follow…)”

  1. Brian, exactly right–the results of the NRC are no longer interesting in themselves, but only for political purposes. In philosophy, this direction for the NRC should make the PGR (and citation-based measures) more important to administrators who actually pay attention to the details of a ranking system. We waited a ridiculous amount of time to hear what the NRC has to say, and now we know: they are just blowing smoke…

  2. Mark van Roojen

    None of these measures seem very useful. Number of publications matters, if you control for quality, but it just can't matter if you don't, and if the summary is correct they don't. Awards are just too rare to be used as a measure of faculty quality, at least if you use awards that matter. (Similarly with portable grad fellowships.) OTOH, if you use awards that don't matter, you'll get results that don't reflect anything that matters.

    Maybe worse, I really don't see how GRE scores of grad students are a measure of program quality. First off, there is real controversy about their value, with some prominent people thinking them virtually worthless. That is going to lead to different admissions policies generating different scores on this measure without necessarily reflecting anything about the quality of the program. Second, average GRE scores might be an indirect reputational measure of applicant opinion, since places considered better will probably get a somewhat higher mean or median applicant GRE. But I'm not sure that the reputation of a department with applicants (as opposed to other people with more background knowledge) is a very good measure of program quality.

    There's more to say about the rest, but I'm sure it will be said.

  3. I don't think there is much more to say. Your comments are spot on, Brian. I only add that the near elimination of qualitative factors is especially punishing for junior faculty, whose work can often be quite good (having been brewing for six years or so) but is only just beginning to appear in print. So these measures seem blind to the promise of new faculty.

    Of course, when choosing a graduate program students will not want to gamble too much on "promise." Still, I think the point holds.

  4. It's worth remembering that lots of individually weak pieces of evidence, combined appropriately, often give very good evidence. The main challenge in this case is that it's unclear how to weight different features, and it's hard to test how well a weighting tracks some coherent scale of "quality". There are some useful things to do, though: for instance, you can measure how well different subsets of features predict one another. It looks like the NRC did at least one test along these lines. But of course I don't know how good the resulting model is.

    One really good idea the NRC has is generating a range of rankings. The PGR could easily incorporate this idea, by using randomly sampled subsets of the survey data to generate many rankings. The resulting "fuzzy rankings" vividly show some of the uncertainties involved.
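    One way to implement the "fuzzy rankings" idea is to resample the pool of evaluator ratings many times and record the range of ranks each program receives. The sketch below assumes a tiny, invented set of survey scores; it is not actual PGR data.

```python
# Minimal sketch of "fuzzy rankings": bootstrap-resample each program's
# evaluator ratings and record the spread of ranks across replicates.
# The ratings below are invented for illustration, not real survey data.
import random

scores = {
    "Program A": [4.5, 4.0, 4.5, 5.0, 4.0],
    "Program B": [4.0, 4.5, 3.5, 4.5, 4.0],
    "Program C": [3.0, 3.5, 3.0, 4.0, 3.5],
}

def resampled_ranking(data):
    """Rank programs (1 = best) by the mean of a with-replacement resample."""
    means = {p: sum(random.choices(r, k=len(r))) / len(r) for p, r in data.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    return {p: i + 1 for i, p in enumerate(ordered)}

rank_draws = {p: [] for p in scores}
for _ in range(1000):                      # 1000 bootstrap replicates
    for program, rank in resampled_ranking(scores).items():
        rank_draws[program].append(rank)

for program, ranks in rank_draws.items():
    print(f"{program}: rank range {min(ranks)}-{max(ranks)}")
```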

  5. What's perhaps unfortunate is that these factors are all, or at least mostly, things that grad students would like to know about departments, and that could, on the margins, play an important role in making choices between departments. But each factor would have different weights for different students, so if they are somehow lumped into a single score (as seems to be the plan) any potential value will be lost. If the raw data were made available then I can see how this might have some value to prospective grad students. (This value is lessened, of course, by the fact that there might well be high levels of fluctuation in many of the factors measured from year to year.) But put together into one score I don't see what use it is, even abstracting from the (important) fact that it's a dubious way to rank academic quality.

  6. Mark, there was a big study done at the University of Minnesota that found that GREs are moderately correlated with graduate student success. The study was a meta-analysis conducted on the results obtained by nearly every other study of the topic. I haven't read the study and I don't know if philosophy is different from other disciplines in this regard, but I'd say it looks like they are valuable.

    But I agree with Brian and others that many of the things measured are irrelevant.

  7. Mark van Roojen

    Martin,

    I know there is some correlation with success, though I'm pretty sure that it is less than the correlation for previous grades, and also that there are studies indicating that women with lower GRE scores (or perhaps SAT scores – I'm working from memory) have as good academic success as men with somewhat higher scores, a fact which I believe some testing companies acknowledge. I also know that a test-wise person can do much better than chance looking at just the answers without the questions. (This was the lore when I worked writing such questions as a summer job during grad school, and I note I can do this with many multiple-guess tests myself.) So I'm on the "not very probative when it comes to doing philosophy" end of the spectrum when it comes to such scores.

    But that wasn't so much my point. My point was that average scores of admitted students were going to be determined by two factors primarily — (1) the attitude of department members to those scores when admitting students, and (2) the reputation of departments among applicants (on the theory that having more applicants with higher scores correlates with reputation among applicants). The first of these has no discernible correlation with quality. The second might. But if you are going to use a factor that indirectly employs reputational evidence, why not at least use reputation among people in a better position to know than undergraduates or recent undergraduates applying to grad school?

    Going back to the first factor, if a bunch of very good faculty think scores don't matter and disregard them, while another bunch doesn't and so uses them in admissions, the second group will do better than the first on this measure, other things equal. But by hypothesis they are not better faculty, and I suspect it won't be a better department.

    So whatever you think of scores as a measure of chance for academic success among applicants, you shouldn't use it to measure program quality. Or so I would argue.

  8. Sorry if the answer to this is obvious or well known, but why don't Asian minorities count for the racial/ethnic diversity variables?

  9. Christopher Gauker

    In the Inside Higher Ed article, the NRC committee chair, Ostriker, is cited as having said that reputational surveys may be biased toward departments that are part of otherwise strong universities. Brian begins his post by claiming that the PGR is not subject to that criticism. So it seems to me very legitimate and not at all off-topic to challenge Brian's defense of PGR on that score.

    It is very legitimate to doubt the PGR results on the grounds that they may be biased toward departments that are part of top universities or toward departments that have traditionally been strong. That one can find departments highly ranked in the PGR that belong to average universities is no defense; they might have done even better without the bias. I find Brian's "simple solution", namely, giving evaluators lists of faculties without university names attached, to be no remedy at all. Anyone looking at one of the lists will immediately think, "Oh, that's Nebraska", or whatever. Whatever biases might be induced by seeing the printed text will surely be induced just as well by the conscious thought.

    As for the utility of the various factors: The report explains that the factors for each field were assigned on the basis of research. Faculty in that field were polled and the results based on that poll were checked against the ratings that those same faculty provided for a sample of programs (p. 17). Publication record always turned out to be the most important factor. The account of the methodology gives it at least the appearance of sophistication. Whether it is really any good I cannot say.

    The delay between data-collection and publication of results is of course a serious concern. Another concern I have had is about the data-collection. In the humanities, the data were culled from CVs. Those of us in the biz have no problem parsing a normal CV. But I have found that people outside the biz sometimes have no idea what they're looking at. ("Monograph? What's a monograph?")

  10. I agree with Matt Lister that many of these factors are important to potential grads making these decisions. In particular, I think both qualitative and quantitative information on placement is as or more important than information about faculty reputations (as in the PGR and the old NRC reports). But I also agree that attempting to put all this information together into a single score is a bit silly, as are some of the inclusions and omissions in the list of factors.

  11. My guess is that people don't consider Asians a disadvantaged minority in North America, at least in the relevant respects, on the grounds that they already do very well in our universities.

    Or maybe it's just that their legitimate grievances (consider the history of Chinese railroad workers in Canada, more recent violence – much of it perpetrated by other "disadvantaged minorities" and apparently therefore okay – against Korean-owned businesses in the US, still more recent violence against Sikhs in the wake of 9/11) aren't as well-publicized.

  12. Every problem is also an opportunity. If someone statistically-minded (Nate Silver?) feels so inclined, it would be interesting to run a regression analysis on the NRC questionnaire data and see which fields, if any, are good predictors of the PGR score. (And yes, obviously there's not much you can do about faculty turnover.)
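    As a rough sketch of what such a regression might look like, assuming one row per program with NRC-style variables and a PGR score (every number below is an invented placeholder, not real NRC or PGR data):

```python
# Rough sketch of the suggested regression: fit PGR scores as a linear
# function of NRC-style questionnaire variables and inspect the coefficients.
# The arrays are invented placeholders, not real NRC or PGR data.
import numpy as np

# One row per program: pubs per faculty, % international students, % fully funded
X = np.array([
    [3.2, 40, 90],
    [2.8, 25, 80],
    [1.5, 30, 60],
    [2.1, 35, 70],
    [2.6, 20, 95],
])
y = np.array([4.6, 4.1, 2.9, 3.4, 4.0])   # PGR mean scores (invented)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coefs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("intercept:", round(coefs[0], 3))
print("per-variable coefficients:", np.round(coefs[1:], 3))
```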

  13. Kenny Easwaran

    I suspect that Asian minorities aren't counted because in many fields, Asians are well-enough represented that students in any department will have potential role models that are Asian, and will see clearly that Asians are not at all unusual as members of the field. On the other hand, black and Hispanic students will, in many departments and almost all disciplines, often find themselves implicitly confronted with the thought that people like them can't be accepted by the field. Of course, in philosophy it seems that Asians may well be just as affected by the lack of representation and role models. Perhaps a more useful measure would have looked at which minorities do as a matter of fact tend to be underrepresented in the particular discipline at hand.

    Of course, the fact that Asians are well-represented in math (for instance) doesn't mean that their representation in a given department is irrelevant as far as students are concerned. A large math department with very few Asian faculty members may be a sign of a problematic social environment that students could have very good reason to avoid. From my potentially naive perspective, it's only the factors of the availability of role models and seeing people that look visibly like oneself as potential members of the field that seem to be assuaged by the fact that in many disciplines Asians tend not to be underrepresented.

    But of course, it's extremely hard to quantify just how important this measure is – it seems very easy to underestimate or overestimate. And as Brian Leiter says, this may be incommensurate with many of the other measures involved as well, to the extent that various other factors are in fact relevant.

  14. I of course agree with Matt Brown about the value of job placement information, but, as they say in the investment business, past returns are no guarantee of future performance, which is especially true when there are significant changes in faculty rosters. That problem is compounded here by the delay factor. In essence, the NRC is going to be utilizing placement results based on students who were picking programs in the mid-to-late 1990s.

    I find Christopher Gauker's initial remarks puzzling in the extreme. First, I was not giving a defense of the PGR, which no longer needs one, but remarking on the flimsiness of Professor Ostriker's reasons for why the NRC was no longer undertaking peer surveys, as it did in 1995 and 1982. Second, it is hard to see how someone can "assume" anything about the quality of a faculty based on a university name when they are confronted with the actual list of faculty. I've heard repeatedly from evaluators over the years how helpful it is to be forced to consider a faculty list, and not just a university name. Chris's hypothesis that evaluators simply stop considering the faculty list once they figure out the university, and just make a judgment based on their impression of the university, seems to me a bit extravagant, and not borne out by the actual results.

    I do wish the NRC were undertaking a survey of expert opinion, as they have in the past. This would make for a useful point of comparison with the PGR surveys. As it is, and as David Wallace notes, it may be possible to look for correlations between the new NRC proxies for faculty excellence and the results of the 2006 PGR surveys, which would be closest in time to the data on which the NRC will rely.

    In any case, I'd like to keep the focus of this discussion on the variables the NRC is using this time around. Thanks.

  15. Christopher Gauker says “The report explains that the factors for each field were assigned on the basis of research. Faculty in that field were polled and the results based on that poll were checked against the ratings that those same faculty provided for a sample of programs (p. 17). Publication record always turned out to be the most important factor.” This is highly misleading, since publication record turned out to be the most important factor ONLY COMPARED to the other factors they used, like number of student activities (e.g. orientation, student association) and the percentage of international students, to pick just two of the many factors they measured that seem very, very weakly correlated with program quality.

    They also considered whether “a faculty member’s work is CURRENTLY supported by an extramural grant or contract,” where ‘currently’ refers to the time of filling in the form. This would be significant in the sciences but misleading, and far inferior in the case of philosophy to the question they did not ask: whether the faculty member had ever been supported by a grant from a major foundation. The one factor which is relevant to “Program Quality” and where their measurement actually used judgments made by a human being is their own ranking of the value of awards and honors. As far as publications go, they made no attempt to distinguish a discussion note in a fourth-rate journal from a substantive article in a first-rate journal. And they made no distinction (as far as their documents indicate) among an internal university grant, a grant from a local private foundation, and a grant from NEH. In the sciences they considered citations, a useful measure, but they did not use citations in philosophy.

    The upshot will be a skewed ranking. I wonder whether the judgment-free criteria they used would predict genuine quality better than, say, the number of philosophy books in the library or the departmental coffee budget. If a national body released a ranking of, say, the medical quality of hospitals on the basis of such weak criteria there would be an outcry.

  16. Christopher Gauker

    As I said, Ned, for each field the weights were assigned to the various factors on the basis of research on that field. We do not yet know the results of that research. In some fields some factors received a weight of 0 (p. 49). I do not know how you can be so sure that the results of this research will be wrong. Certainly if the results turn out to be very far from your priors you will be entitled to think that something must have gone wrong. Maybe if we looked closely at the method of calculating the weights we could see that it is unreliable.

    Ned also disparages the NRC method for measuring publication productivity. Brian does not want this discussion to devolve into another debate about the PGR, and I would like to respect his wishes. However, the issue here is comparative. Which of several imperfect methods is best? And both Brian and Ned are making such comparisons. So I hope that I will be allowed to reply to Ned as follows: One kind of reputational survey has evaluators come up with a number based on a list of names of people about most of whom they know very little. Counting up lines on a CV is pretty crude, but I don't know why it would be worse than that. (I say "one kind" of reputational survey, because a reputational survey could be designed to ensure that evaluators were evaluating only people about whom they had a reasonable level of knowledge.)

    There is no factual basis for the statement that PGR evaluators are merely assessing "a list of names of people about most of whom they know very little." PGR evaluators have the option of not evaluating a faculty where this is genuinely the case, an option often exercised. What appears to be most common is that evaluators know a lot about some faculty, and less about some others, and form an opinion based on an inference from those whose work they know. The survey then aggregates many different philosophers' partial but extensive knowledge, which is precisely the point.

    Obviously anyone who thinks raw productivity is a superior measure to expert evaluation will want to pay attention to that aspect of the NRC. Universities themselves always express a preference for expert evaluation, of course, when they appoint external review committees of experts, rather than doing page counts, to assess their units.

  18. Christopher, what you call "research" consists of two things: (1) simply asking people which of the 21 characteristics Brian listed above are "the most important to program quality", (2) investigating which of the 21 characteristics best predict faculty members' RANKING OF PROGRAMS. That is, one of the two "validations" of the weightings they used (and I would argue the more significant one) is a PGR-like ranking. So if you don't like the PGR, you should not expect much from a survey whose validation is PGR-like. Except, they don't report having used a list of the faculty and certainly not an updated list. Since they are at pains to list anything they did that adds objectivity, I would guess that they presented the names of the departments, the very methodology that famously got us high rankings of the Princeton Law School. I should add that in asking people which of the 21 characteristics are "the most important to program quality", they did not distinguish between indicators of program quality and determinants of program quality. For example, percentage of students with portable fellowships might be an indicator of quality, but not a determinant of it. And health insurance might be a determinant or at least something good, but not much of an indicator, especially since health insurance is usually or always a university rather than a departmental matter.

  19. Christopher Gauker

    Ned: The "simply" in your statement of (1) seems to imply some criticism, but I don't think you mean to criticize the practice of deriving rankings on the basis of polling. They had a complicated method of deriving "direct weights" from the faculty answers on questionnaires. Then they compared actual ratings to ratings predicted on the basis of the direct weights. That does not reduce the method to one of simply asking people to rate departments, which seems to be your main point. For one thing, the actual ratings were merely used to modify the direct weights. But I have to admit that I am not in a position to understand exactly what they did. (For those of you who have not dipped into this yet, the detailed account of the calculation of weights starts on p. 33. To see what you're up against, just check out the monstrous flow chart on p. 34.) The questionnaire in the appendix (pp. 161-162) shows that in collecting ratings of departments they provided, for each department, a link to a list of the graduate faculty in that department, which respondents had to go to in order to do the rating. If health insurance turns out to get better than a negligible weight, that will be a significant deviation from my priors too.

    Brian: In the report for 04-06 you included for each rated department the number of evaluators who evaluated it (http://www.philosophicalgourmet.com/2006/overall.asp). There were 266 evaluators, and every department in the top 50 received at least 181 evaluations, though most received more than 200. That means that the average number of evaluations per evaluator is at least 34 ((50 × 181)/266). I don't think many of your evaluators can claim to be well acquainted with the accomplishments of the majority of the members of 34 departments. The idea, as you here affirm, was that the aggregation of many partially informed evaluations might provide a good measure of objective quality. And that might be right, but — and this was my point from the start — it will not eliminate the influence of widely shared biases, such as a bias toward departments that have traditionally been strong or toward departments in top universities.

  20. Professor Gauker writes, regarding the methods used in the PGR:

    "it will not eliminate the influence of widely shared biases, such as a bias toward departments that have traditionally been strong or toward departments in top universities."

    I am unmoved by this concern for a few reasons. First, I'm not sure what evidence there is for the existence of the bias you specify. It's just as reasonable to think that some evaluators, if they are biased at all, are biased AGAINST traditionally strong departments and top universities. Second, I'm not sure what evidence there is for thinking that such a bias, if it exists, is widely shared.

    Even if we grant that there is such a bias, and that it is widely spread among philosophers, that still isn't necessarily a problem for the PGR. The philosophers who are actually called on to fill out survey forms (see http://www.philosophicalgourmet.com/reportdesc.asp) constitute a special group. This is an elite team of philosophical talent and experience. (This suggests to me that they certainly would be able to meaningfully evaluate an average of 34 faculty lists.)

    I suppose it would be rash to say outright that we should just trust this group because they are less likely to be biased. But consider a sports analogy. If we surveyed all the living members of the Basketball Hall of Fame and asked them to rank the 50 greatest players of all time, whatever they came up with would be a pretty good indicator. There might be biases toward historically strong teams or big media markets, but there would be good reason to take the results very seriously.

  21. I fear the Gauker-Leiter exchange has gotten a bit repetitive at this point. Ned Block (or others) are welcome to reply to the other issues raised in the Block-Gauker exchange, and further comments on the NRC variables are welcome.

    I am not criticizing using polling to derive rankings. That is what the PGR does and I (like a lot of other people) think the PGR ratings, despite flaws, are good enough to be of considerable utility. However, it is already clear that the NRC procedures are so poor as to make their results specious and misleading. It is important to be clear on what the NRC did. The 21 variables that they used (e.g. number of faculty publications, whether students have health insurance, percentage of international students) were not arrived at by any kind of research. They were a result of deliberation within the NRC committee. There were two methods of “weighting” the 21 variables (20 in the case of the humanities, since citation data in the humanities was inadequate) and the results of the two methods were averaged. The first method was just asking people about their views. These were the “direct weights”, and the method used was NOT complicated or hard to understand. They asked people which characteristics were "the most important to program quality". (The one minor complication was that they asked this question within each of 3 categories and then asked people a similar question about the categories themselves.) My point against this procedure is that in asking people which characteristics were “the most important to program quality” they did not distinguish between indicators of (i.e. factors that give information about) program quality and features that themselves constitute a kind of program quality even if not very relevant to the all-important intellectual quality of the program. I mentioned that the percentage of students with portable fellowships (Mellons, NSFs) is an indicator or assay but not itself a kind of program quality. I mentioned health insurance as something that is itself a kind of quality but not much of an indicator. This is an absolutely crucial confusion since (at least I would argue) the use of many of the variables, including the 5 measures of diversity (of 20 variables), could not be justified on the basis that they are significant indicators of the intellectual quality of the program. The whole rationale of assembling a lot of weak indicators into a stronger measure is undermined if the question asked is straightforwardly ambiguous in this obvious way. In addition, it is a repeated result in experimental psychology that people who can make an expert evaluation (e.g. a doctor evaluating symptoms or a faculty member evaluating a graduate application) are not usually very good at saying what the factors are that justify the evaluation or which is more important. Thus academics may know high-quality work when they see it but not what justifies the evaluation.

    The second method of weighting the variables is the one that is hard to understand. But in overview it is simple. They ascertained which weights made the variables correlate best with a PGR-like ranking. Here my point is that those who object to the PGR ranking (e.g. Christopher Gauker) can hardly have high expectations of a ranking to the extent that it is derived from a PGR-like ranking. A further point is that the NRC has not told us HOW WELL their weighted variables predict the PGR-like ranking. And they are not revealing the PGR-like ranking itself. I think this is an astonishing piece of intellectual acrobatics. They got PGR-like rankings from their respondents but won’t tell us about what they were because they involve mere opinions about the quality of philosophy programs, but they think well enough of those opinions to use them to choose the weights of the variables they used. They actually describe themselves as using the PGR-like rating to weight “objective variables” so as to “imitate, to the extent achievable, the judgment criteria of the initially surveyed faculty” [i.e. the PGR-like rating]. The chair of the NRC committee (Ostriker) and one of the members (Kuh) cite a joint 2003 book as having shown how to do this. Christopher Gauker is right that for each evaluated program, they did provide a link to a page that provided information about the program. One thing I like about their PGR-like ranking (the one they aren’t going to tell us the results of because it isn’t “objective” enough) is that they asked for both a rating and how familiar the rater was with each program. This is something the PGR could learn from (though at the cost of complicating the process).

    The NRC methodology is so technical (see Ostriker & Kuh, 2003) that most readers will lack the expertise to understand it. The key fact to focus on, though, is that, eyeballing the list of 20 variables they used, only 4 have much prima facie plausibility as indicators of intellectual quality (as opposed to other kinds of goodness) in a philosophy program. Those are the number of publications of faculty, percent of faculty holding grants, faculty awards, and GRE of students. The first of these is rendered uninteresting because no attempt is made to evaluate the substantiveness of the publication or the quality of the venue. One point for articles, five for books; that’s it. The percent of faculty holding grants is a matter of faculty holding grants at the very moment they filled in the form, rendering it uninteresting. GREs…well, we all have our own opinion of the value of GREs. The award variable seems to me the most significant, although as Brian pointed out, some awards just perpetuate traditional hierarchies. Putting it all together, no matter how wonderfully advanced their method of weighting these variables is, it is unlikely that the information is there in those variables to make a highly significant predictor of actual intellectual quality of philosophy programs. And when you add the fact that it is already 3 years out of date, the word ‘pathetic’ comes to mind.
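    For readers who want the two weighting routes described above in miniature, here is a sketch under stated assumptions: "direct" weights elicited from an importance survey, "regression" weights fit so that the weighted variables best predict a reputational (PGR-like) rating, and a plain average of the two (a simplification of the report's actual combination, which note 26 on p. 17 complicates). All data are invented.

```python
# Sketch of the two weighting routes, with invented numbers.
# Route 1: "direct" weights from a survey of how important each variable is.
# Route 2: "regression" weights fit so weighted variables predict a
# reputational (PGR-like) rating. The combination here is a plain average,
# which simplifies what the report itself describes.
import numpy as np

variables = ["pubs_per_faculty", "pct_with_grants", "awards_per_faculty", "median_gre"]

# One row per program (invented, roughly standardized values).
X = np.array([
    [ 1.2,  0.8,  1.5,  0.9],
    [ 0.4,  0.1,  0.2,  0.3],
    [-0.6, -0.4, -0.8, -0.2],
    [-1.0, -0.5, -0.9, -1.0],
    [ 0.1,  0.0,  0.1,  0.2],
    [ 0.3,  0.5,  0.0,  0.4],
])
reputational = np.array([4.5, 3.6, 2.8, 2.2, 3.3, 3.4])   # PGR-like ratings (invented)

direct_weights = np.array([0.4, 0.2, 0.3, 0.1])            # from the importance survey (invented)

# Least-squares fit of the reputational rating on the variables (plus intercept).
A = np.column_stack([np.ones(len(X)), X])
regression_weights = np.linalg.lstsq(A, reputational, rcond=None)[0][1:]

combined = (direct_weights + regression_weights) / 2
scores = X @ combined
for name, w_d, w_r in zip(variables, direct_weights, np.round(regression_weights, 2)):
    print(f"{name}: direct {w_d}, regression {w_r}")
print("program composite scores:", np.round(scores, 2))
```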

  23. Christopher Gauker

    As Ned says, some of the factors they asked about should not get much or any weight. And we can expect that they will not. The report provides a worked-out example using the field of Economics. In Economics, eight of the 20 factors received weights of 0 (columns 5 and 6, p. 20). These include all of the diversity measures that Ned objects to (except for Percent Female Students which, disturbingly, got a negative weight).

    Ned objects to mixing determinants and indicators. He thinks that confuses the question for respondents. Maybe, but do you also mean, Ned, that there is some other problem? In any case, the economists filtered out most of the determinants (and some indicators), with a couple of possible exceptions. Percent of first-year students with full support is probably a determinant rather than an indicator, but it surely indicates something about program quality. Number of student activities offered is probably more a determinant, but it also indicates something.

    There are other indicators that get a nonnegligible weight on the Economics list but which do not make it onto Ned's list. For example, number of degrees awarded. If Philosophy gets that result too, then I won't know why Ned will be so sure that doesn't matter when the philosophy respondents said that it did. Ned characterizes the things that matter as those that indicate intellectual quality as opposed to other kinds of goodness, but I don't think the NRC survey was intended to leave out of account those other kinds of goodness.

    Ned accuses, thus: "They got PGR-like rankings from their respondents but won’t tell us about what they were because they involve mere opinions." Why do you think they're hiding anything, Ned? This is a methodology document. One should not expect to have that data in this document. It's too early for such accusations.

    Incidentally, I did not say that the poll questions about the importance of factors were complicated. I said that the method of turning answers to those questions into the "direct weights" was complicated. The procedure is described on pp. 35-36, and 38. It's complicated. Regarding the combining of direct and regression-based weights, they do say at one point that they averaged them, but that's not actually what they did (p. 17, note 26).

    On the other side, one might make a better case against the NRC's use of publication data than Ned does. Number of books and articles could be (as far as I know) a good indicator of relative quality, quite apart from the quality of those books and articles, but only if there are significant differences between the faculties on this score. But I don't know that this variable really varies enough. (Can any of you see any correction for this in the methodology?)

    I have set myself up here as a critic of the PGR, and that's how Ned now labels me. I do think that the PGR is liable to be affected by biases. Introspection will not detect biases, but I think anyone who tries to do these evaluations should detect a great deal of uncertainty that biases might exploit. And when I am faced with a hostile reaction to trying to get any other perspective, I am tempted to dig in my heels. But the fact is that whenever I have had a chance to participate as an evaluator (in the early years, and, since then, whenever Cincinnati has been in the running), I have happily done so and have put a lot of effort into my ratings. I do that because I do think PGR provides a valuable service, and the results are informative. Even if the rankings themselves are tainted, a department's movement in rank from one report to the next is informative. And the specialty rankings are far less subject to my doubts.

  24. Christopher Gauker writes: "I do think PGR provides a valuable service, and the results are informative. Even if the rankings themselves are tainted, a department's movement in rank from one report to the next is informative. And the specialty rankings are far less subject to my doubts."

    It would be possible (if you think the specialty rankings are more reliable than the overall rankings) to construct an overall ranking from the specialty rankings, plugging in the numerical scores from the various specialty rankings. Of course, you'd have to weight the various specialties (a 3.5 in Metaphysics ought to carry more weight than a 3.5 in Philosophy of Mathematics), and any such weighting of areas would be **extremely** contentious. Still, the results would be interesting–my guess is that, on any reasonable set of weightings, the results would come out about the same as the current overall rankings, but I don't really know.
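    A back-of-the-envelope version of this construction, assuming made-up specialty scores and made-up area weights (choosing real weights is, as noted, the contentious part):

```python
# Toy construction of an overall score from specialty scores via area weights.
# Specialties, weights, and scores are made up; only the arithmetic is the point.
specialty_weights = {"Metaphysics": 2.0, "Ethics": 2.0, "Philosophy of Mathematics": 1.0}

departments = {
    "Dept X": {"Metaphysics": 4.0, "Ethics": 3.5, "Philosophy of Mathematics": 4.5},
    "Dept Y": {"Metaphysics": 3.5, "Ethics": 4.5, "Philosophy of Mathematics": 3.0},
    "Dept Z": {"Metaphysics": 3.0, "Ethics": 3.0, "Philosophy of Mathematics": 5.0},
}

def overall(scores):
    """Weighted mean of a department's specialty scores."""
    total = sum(specialty_weights.values())
    return sum(specialty_weights[s] * v for s, v in scores.items()) / total

for dept, scores in sorted(departments.items(), key=lambda kv: -overall(kv[1])):
    print(f"{dept}: weighted overall {overall(scores):.2f}")
```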

  25. Fritz Warfield

    "Still, the results would be interesting–my guess is that, on any reasonable set of weightings, the results would come out about the same as the current overall rankings, but I don't really know."

    Everyone I know who has tried to do anything like what Tim O'Keefe suggests has found significant variation between the overall rankings and the various attempted constructions of new overall rankings from weighted specialty rankings.

    I favor a proposal that I believe Keith DeRose was first to make: do the overall rankings only after providing evaluators with the collected results of specialty rankings. Almost everyone agrees the specialty rankings are on firmer ground than the overall rankings (experts evaluating strength solely in areas of special expertise helps).

    But this obviously isn't the place for that discussion.

    One point on the topic of this discussion. As far as I can tell, the NRC methodology and input suffer from all (or almost all) of the criticisms above and many more.

  26. The rationale for the NRC procedure is this: a ranking based on informed expert judgments about the quality of programs (like the PGR) will reflect both genuine merit and biases of the respondents. Ostriker & Kuh (2003) propose to eliminate or reduce the biases as follows: you take a set of objective factors that are relevant to program quality and weight the various factors so as to maximize prediction of the PGR-like judgments. You don’t expect or even want to predict the PGR-like judgments perfectly, because you want to eliminate the biases. That is, what you want is to find weights that have the effect of “predicting” the good component of the informed expert judgments (PGR-like rankings), leaving out information that reflects the biases. The success of this rather typical example of quantitative social science depends, of course, on the data on which it is based. If you include the wrong elements in the data, you will end up modeling the very biases you are trying to avoid, and if you fail to include enough merit-reflecting data, you will fail to model what is good in the informed expert judgments. The NRC procedure looks to be subject to the first problem and is certainly subject to the second in spades. To take an example of the first problem, suppose you include as part of your data whether a department is in a university that is Ivy vs non-Ivy, rich vs poor, private vs public, coastal vs not, heavily male vs mixed gender. If one of the biases in the PGR-like ranking is one of favoring the Ivy, the rich, the private, the coastal and the male, the Ostriker & Kuh procedure would yield results that are biased—perhaps more biased than the informed expert opinions are supposed to be. Why more biased? Just as Ostriker & Kuh hope to get measures that are more merit-based than the PGR-like rankings by leaving out sources of bias, they run the risk of producing more biased rankings if they cannot find “objective” measures of the relevant kind of merit. (This is not a mere theoretical problem. See below.) Many of the 20 factors included in the NRC model obviously reflect wealth. Here are some examples.

    Percentage of students with individual work space
    Percentage of health insurance premiums covered by institution or program
    Number of student support activities provided by the institution or program
    Percentage of students receiving full financial support
    However, a trickier category is the measures that reflect both wealth and merit, like average GRE scores of entering students. Of course wealth is genuinely correlated with merit, but if you are trying to factor out biases (which is the whole rationale for the Ostriker & Kuh procedure) it is problematic to include measures to the extent that they vary with wealth. In some cases—Christopher Gauker mentions the negative weighting of percentage of female students by the economists—such bias will be obvious in the data, but the more insidious problem is measures that correlate with both wealth and merit.

    I have been talking about the inclusion of variables that correlate with biases. A worse problem in my view is failure to include the kind of substantive information that is actually relevant to intellectual quality. As I mentioned earlier, the publication variable is too crude to be of much use, and the grant variable is geared towards the sciences, not towards the humanities. (They should have asked faculty whether they have ever had a grant, and they should have categorized the granting agencies so as to give less credit for less meritocratic grants.) The variable that has the most prima facie force (behind citations, which they couldn’t use) is the variable measuring awards and honors, and that is because actual human beings who know the score ranked the various kinds of awards and honors. Overall, the information needed for good rankings of intellectual quality just isn’t there. And without information that reflects the good aspect of PGR-like rankings, there is a real worry that the rankings will be more biased than PGR-like rankings.

    The main disagreement between Christopher Gauker and me is that he favors a wait-and-see attitude, whereas I think we can already see that the NRC survey is hopeless. His view makes sense with respect to the first of the two problems I mentioned above. If the philosophers, like the economists, give zero weight to health insurance and workspace, they will have avoided the possibility of those factors biasing the results toward wealth. However, Christopher’s case fails on the second problem. No result of the NRC survey would show that they have included factors that highly reflect real intellectual quality—because they haven’t. What if the NRC result correlates highly with the PGR? I doubt very much that it will happen, but if it did, that would show something wrong with the PGR, not something right with the NRC. As many have noted, there may be a tendency in the PGR for raters to rate more departments than they really know about. If the NRC were to correlate very well with the PGR, that would suggest that this problem is more serious than we think it is. But remedies would be obvious. Tim O’Keefe suggests that we could average the specialty rankings, perhaps weighted by importance of field. (If you think the PGR overall rankings are biased in this way, do the calculation!!) Another option would be the one the NRC used for the PGR-like rankings that they made but show no sign of releasing: asking for judgments of familiarity with each department as well as a rating.

    Christopher mentions that I have focused my criticisms of the NRC on its utility as a ranking of intellectual quality of departments, but the NRC intends a ranking of quality generally, not just intellectual quality. I am sure the NRC committee knows full well that the results will be interpreted mainly as a ranking of intellectual quality of programs. This fact is apparent in the economists' responses, which seem to involve a reaction against factors that are not pretty directly measures of intellectual quality, narrowly construed. The weights of the variables that best predicted the economists’ PGR-like rankings were not significantly different from zero for workspace and health insurance and some measures of diversity (minority and female faculty, minority students), and, disturbingly, they assigned a negative weight to percent female students. (What this means is that, other factors being equal, the economists ranked departments higher if they had fewer female students.) In a way even more surprising for a field that involves so much interdisciplinary work, they gave zero weight to interdisciplinary faculty.

    A second point about how the rankings will be interpreted: in philosophy at least, students heading for graduate school tend to visit the departments they are considering, and departments in philosophy make a lot of information public (in part thanks to the PGR). They can pretty easily get an idea of the following variables used by the NRC without the NRC report:

    Racial/ethnic diversity of program faculty (only non-Asian minorities count)
    Gender diversity of program faculty
    Percentage of students receiving full financial support
    Racial/ethnic diversity of the student population (only non-Asian minorities count)
    Gender diversity of the student population
    A high percentage of international students
    Average number of Ph.D.’s granted in last five years
    Placement of students after graduation (percent in either positions or postdoctoral fellowships in academia)
    Percentage of students with individual work space
    Percentage of health insurance premiums covered by institution or program
    Number of student support activities provided by the institution or program

    What students need help with is evaluating the intellectual quality of the program and the effectiveness of its graduate teaching and advising. The NRC report will be of even less use for the latter than the former. Their main relevant variable, in my view, is the percentage of PhDs in the period 2001-2005 (!) who got some sort of academic position. The NRC data do not distinguish, as far as I could tell, between a 1-year job in a community college, a postdoc, and a tenure-track job in a research university. This is typical of why their data are of little use. For most philosophy departments, applicants can see data that are not only much more recent but also include what jobs the PhDs actually got, and those jobs can be evaluated using the PGR! What everyone really expects from a ranking in philosophy is first and foremost a ranking of intellectual quality, but the NRC report will provide us only with a misleading, low-grade proxy for that.
