In Defense of Intelligence Tests
Critics of intelligence testing have recently called for nothing less than its abolition in schools and elsewhere. More and more often, their calls are being answered in kind by courts, legislatures, and spokesmen in the media. The outcry against intelligence testing arises because many well-intentioned people believe that the tests victimize the culturally disadvantaged portion of the population, the portion least able to tolerate further disadvantage. However, as good as the intentions are, it is likely that intelligence tests are a net benefit to the disadvantaged and that their abolition would be precisely the wrong change for them. To see why, we must look closely at the facts about testing.
If we are to examine intelligence testing here, we should first consider the idea of testing more broadly, to see how ordinary an idea it is. To test is to sample and to predict. We test a frail-looking chair or a newly frozen pond before committing ourselves fully. With an exploratory poke, we hope to measure the strength of the chair or the ice well enough to predict later performance. Doctors sample heart beats, body temperature, blood composition, and so on. From these small soundings, they are expected to predict what would be found deeper inside us, and to forecast our health until the next check-up. We allow a doctor to take an ounce of blood rather than insisting on a gallon, gladly trusting in the predictive power of a small sample.
All it takes to create a test is a measurement known to correlate with whatever it is we would like to predict. An intelligence test is, to begin with, merely a measurable sampling of a person’s behavior lasting anywhere from about 20 to 120 minutes. From this sampling, we should be able to predict other behavior of the sort that is thought of as intellectual. As tests in general go, intelligence tests are unremarkable in principle.
Unremarkable as they are, they have been engulfed for a decade in an unequaled controversy within the academic world and beyond it. A number of state and municipal authorities have now banned routine intelligence tests in public schools, and it is not unlikely that others will do the same. Hundreds of court decisions and executive orders at every level have prohibited or restricted the use of intelligence tests for screening job candidates or for selecting and tracking students. The public manifestations of controversy mean far more than the personal ones, but the personal ones show its intensity. A few academics who are publicly identified as advocates of intelligence testing have been physically assaulted on lecture platforms, have had their property damaged, their own and their families’ lives threatened, and have been accused of the grossest defects of character.
As a result, many experts on intelligence testing resist becoming publicly identified, especially if they agree with the unpopular views. Since those who hold the popular views are not similarly deterred, the public debate is predictably biased. It takes special pains to get a balanced and reasonably complete account of intelligence testing into the public forum where it might inform public policy.
Intense as it has been, the controversy may now intensify again because of the publication of Arthur R. Jensen’s new book, Bias in Mental Testing.1 Jensen, a professor of educational psychology at the University of California at Berkeley, has been the major figure in the intelligence controversy of the 1970’s. He was thrust into the role after his article, “How Much Can We Boost IQ and Scholastic Achievement?,” was published in the Harvard Educational Review in 1969. Jensen’s answer to the question in the title was that since intelligence contributed to the variation in the performance of schoolchildren and since intelligence has genetic determiners, individual differences in scholastic achievement were bound to be relatively hard, though not necessarily impossible, to improve by simple remedies. The article bristled with over 100 pages of analysis of data and so must be oversimplified in a one-sentence summary, but its impact was as startling as a clap of thunder, for a variety of reasons.
The prime reason was that the article raised the possibility of racial differences in mental capacity. More than 90 per cent of it, however, dealt not with race but with individual differences in school performance and test scores. Jensen’s review was nearly exhaustive, more nearly so than other summaries then available on the subject. Findings on the heritability of intelligence, for example, that had little impact individually were impressive when they began piling up. The strength of the evidence as a whole had not often been noted, particularly by the audience of educators to whom the article was addressed. Several streams of research converged to support the possibility that compensatory-education programs were disappointing because they were bucking the heritability of intelligence. In practical terms, significant heritability translates into significant individual differences in learning from a given educational curriculum.
By suggesting that genetic factors were affecting educational outcomes, Jensen’s paper brought two usually separate bodies of evidence together, which was a source both of strength and of vulnerability. Its strength was its novelty. Its vulnerability was that few experts had broad enough training to testify to Jensen’s overall accuracy. Unfriendly critics could point to occasional minor lapses and get away with insinuating much worse. Not many readers of the Harvard Educational Review or the wider audience of the spreading controversy knew the methods and findings of quantitative genetics, a new and still developing branch of biology that relies heavily on complex statistics to determine how much of the variation in a trait is hereditary and how much is not. Many of Jensen’s critics did not realize that it was possible, using the new mathematical procedures, to estimate within limits the relative contributions of environment and genes to psychological measurements like IQ or scholastic achievement scores. Quantitative genetics applied to socially significant human characteristics still has many technical hurdles to surmount, which Jensen and others may have been underestimating, but the very enterprise shocked many people who did not easily think of human society in biological terms.
Not only was it a shock, it was unwelcome. Biology seemed more often than not to undermine political and moral ideals. Biology creates inequalities, not just the physical inequalities of size, strength, speed, and durability but also psychological variations in cognitive capacities, emotional tendencies, and motivational appetites. Biology sets limits, though perhaps broad ones, on how much or how little altruism, fidelity, aggression, fairness, compassion, piety, and lawfulness a society composed of the creatures we call human can reasonably count on. Political or moral theories that promise to erase inequalities of intellect or character or to enhance compassion and lawfulness while eliminating avarice and aggression are challenged by the hypothesis of biological determination. Not that those issues are settled, but their appearance on the intellectual horizon signaled the inevitable blending of the social sciences with biology. As threatening as Darwinian theory seemed when it challenged Scripture in the 19th century, the new challenge cut closer to the quick—to personal qualities, to the structure of family life, and to the institutions of society. Social Darwinism had, of course, been known before, but now both the social sciences and biology have matured to a point where their union bears fruit: actual data on practical social concerns. Jensen’s analysis of compensatory education was the first significant example to be picked up by the public press, hence the most shocking.
The intelligence controversy is, then, part of a larger upheaval in social science, one that could be called the naturalization of social science. The upheaval will no doubt rumble on at least until there are enough social scientists—economists, sociologists, and political scientists no less than psychologists—with training in biology, and vice versa for biologists, so that it will no longer seem indecent to bring the two disciplines together. But the intelligence controversy is not just an argument about the relative contribution of the genes. Intelligence testing is embroiled in four separate quarrels, only one of which involves biology directly. Jensen’s new book, in fact, deals with the other three, which are: first, do test scores really measure intelligence, and how can such a question be answered? Second, do test scores really predict anything important or are they just an arbitrary credential used by the haves to lock out the have nots? Third, what kinds of biases are programmed into the tests? In this article, I will survey all four questions, starting with the heritability of IQ scores.
Inheritance of Intelligence
The human traits that everyone presumes to be genetic have, by and large, not been proved to be so. Not much formal evidence has been gathered about, say, how tall people grow, whether their noses curve up or down, whether they have long legs and short waists or vice versa. For such visible, intuitively measurable features, family correlations are self-evident. We can see traits running in families. When the Kennedys recently came to Boston for the dedication of the Presidential Library, we saw on the evening news informal evidence for the heritability of a broad jaw and wavy brown hair.
If people wore their IQ’s like the numbers on a football jersey, we would see this trait too running in families. Parents correlate with their children, and also with their grandchildren but not as much. First cousins correlate more than second cousins, who correlate more than third cousins, and so on to the limit of what anyone has bothered to measure so far. Half-siblings, with just one parent in common, correlate less than full siblings; identical twins, who share all their genes, correlate more than fraternal twins, who share about 50 per cent of their genes, no more than ordinary brothers and sisters do.
It could be, and indeed has been, argued that these correlations really show shared environments, not shared genes. On the average, parents share more of the environment with their children than grandparents share with their grandchildren. First cousins have more of a shared environment than second cousins, full siblings more than half-siblings, and so on. Possibly, it is suggested, identical twins share more of the environment than fraternal twins. As far as it goes, the argument is fine, but the available data go beyond it. We know, for example, that adopted children correlate in IQ more highly with natural parents than with adopting parents, unless placement in the adoptive home involves a direct attempt to match the child to the adopting parents in probable IQ. Identical twins whose families mistakenly think they are fraternal are more correlated than fraternal twins whose families mistakenly think they are identical. Identical twins growing up separated in foster homes are more correlated than fraternal twins growing up together with their own parents.
Not only do the correlations follow the path of blood relationship in general, they usually follow it numerically. The size of the correlations among parents and children, siblings, cousins from the first to the nth degree, twins, and so on is specified by the level of heritability, which can range from 0 to 100 per cent for any measurable trait. A given level of heritability goes with a certain pattern of correlations among relatives, although various complexities need to be taken into account in the calculations. For IQ, judging from the evidence as a whole, the level of heritability is unlikely to be less than 50 per cent and the most comprehensive modern estimates suggest a range between 60 per cent and 75 per cent in the United States. This is somewhat lower than the 80 per cent that Jensen cited in 1969, for reasons that I will comment on below, but it is a firmer estimate based on more and better data than his earlier one.
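The logic of the paragraph above can be sketched numerically. Under the simplest purely additive model, the expected correlation between relatives is just the heritability multiplied by the fraction of genes they share. This is a deliberate simplification: it ignores assortative mating, dominance, and shared environment, all of which complicate real estimates. The heritability figure of 0.65 below is hypothetical, chosen from the middle of the 60-to-75-per-cent range cited above.

```python
# Toy sketch of the simplest additive-genetic model of kinship
# correlations.  Relatedness values are the standard fractions of
# genes shared; the model omits assortative mating, dominance, and
# shared environment, so real correlations will differ.
RELATEDNESS = {
    "identical twins": 1.0,
    "full siblings": 0.5,
    "parent-child": 0.5,
    "half-siblings": 0.25,
    "first cousins": 0.125,
}

def expected_correlation(relation, heritability=0.65):
    """Predicted IQ correlation under a purely additive model."""
    return heritability * RELATEDNESS[relation]

for relation in RELATEDNESS:
    print(f"{relation}: {expected_correlation(relation):.3f}")
```

On these assumptions the predicted ordering matches the one the text reports: identical twins most alike, then full siblings and parent-child pairs, then half-siblings, then cousins of increasing degree.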
Although people do not wear IQ’s on their shirts, they nevertheless get a message across to their friends. In America now, husbands and wives correlate in IQ about as much as, if not slightly more than, brothers and sisters—about 50 per cent or just above it. Married couples correlate only about 30 per cent in height (this figure discounts the average sex difference in height). The tendency to match IQ’s itself causes increased heritability, for it makes the scores within a family more similar, while increasing the differences between families. Assortative mating, which is the technical term, produces more extreme IQ’s, both high and low, than if mating were random for intelligence. Willy-nilly, modern society produces selective mating for intelligence.
The evidence for heritability has been accumulating since the Frenchman Alfred Binet started testing intelligence at the turn of the century. In fact, the evidence started accumulating before quantitative genetics emerged from evolutionary biology. Testing and genetics developed along parallel lines, although an occasional worker tried to bring them together. For example, Cyril Burt in England worked on the English standardization of the Binet test and also on the statistical procedures for estimating heritability, publishing in every decade of the century from the first to the eighth. Burt, it now turns out, was probably more voluminous than we knew in his lifetime; he apparently published under various pseudonyms when it suited him to have another party enter a scholarly exchange.
This discovery about one of testing’s major early figures, which came to light in the London Times in 1976, has been followed by other revelations and pseudo-revelations of irregularities in Burt’s work. I stopped counting when the New York Times published its fifth article about Burt in a few months. The newspapers did have a story, though they did not always put it in perspective and it probably did not deserve the repetition. Burt in his later years (he lived to eighty-eight and was publishing to the very end) may have fabricated data. His fabrications (if that is what they were) usually confirmed what he had, in fact, actually observed and published earlier, when he had the resources and vigor to test substantial numbers of people. His data, with this uncertainty now established, should no longer be used in serious estimates of heritability, as they were until quite recently. But leaving them out changes little, for Burt’s data are, first, a small fraction of the total evidence and, second, they generally agree remarkably well with other people’s data. Even when Burt’s answer to a specific question was the earliest published, so that he had no source to copy, it has still been approximately confirmed by other workers. His work can no longer be relied on without independent verification, but it is false and unjust to characterize it as a hoax, as the press often does, echoing an altogether too ungenerous judgment by a few psychologists.
Burt and other researchers typically found slightly higher heritabilities for IQ scores in England than in the United States. Heritability is inherently statistical; it expresses, for a given population, the percentage of a trait’s variation caused by genetic differences. Since it is a percentage, a ratio of one number to another, it expresses both the genetic and non-genetic sources of variation in a trait. Suppose my neighbor and I each grow squash in our gardens, using seeds from the same packet. The plants in each garden may resemble each other more than they resemble the plants in the other because of differences in soil, drainage, and the styles of the gardeners. Within each garden, relatively more of the variation is likely to be genetic than in the two gardens considered as a whole, and this would show up as a higher heritability in each garden taken individually. Moreover, if I am a less consistent gardener or if my garden setting is more varied, heritability may be lower in my garden than in my neighbor’s, for the squash’s size, abundance, or anything else I might measure. Similarly, if England is environmentally more homogeneous for intelligence than the United States, then, other things being equal, the heritability would be higher there than here. Other data besides Burt’s suggest that this may be the case by perhaps 10 per cent or so, or it may have been when Burt and others collected their data, since heritabilities can change as environmental circumstances change. In 1969, using large doses of English data and the limited methods of analysis then available, Jensen estimated the heritability of IQ about 5 per cent to 15 per cent higher than seems correct for the United States now.
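The garden analogy reduces to simple arithmetic: heritability is the genetic share of total variance, so holding genetic variance fixed while shrinking environmental variance drives the ratio up. The variance figures below are invented purely to illustrate the ratio, not drawn from any study.

```python
def heritability(genetic_var, environmental_var):
    """Fraction of total variation attributable to genetic differences."""
    return genetic_var / (genetic_var + environmental_var)

# Seeds from the same packet, so the same genetic variance, in two
# settings (the numbers are made up to illustrate the ratio).
# Within one consistently tended garden, environmental variance is small:
within_one_garden = heritability(genetic_var=4.0, environmental_var=1.0)

# Pooling two differently tended gardens adds environmental variance,
# so the same genetic differences account for a smaller share:
across_both_gardens = heritability(genetic_var=4.0, environmental_var=4.0)

print(within_one_garden, across_both_gardens)  # 0.8 versus 0.5
```

The same arithmetic underlies the England-versus-United States comparison: a more homogeneous environment lowers the denominator and so raises the heritability, with no change at all in the genes.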
Significant heritability in a trait means that it is relatively resistant to changes in environment. Thus, insensitivity to environmental change is itself evidence for genetic control, albeit indirect. The upheavals of war sometimes provide such evidence. The test scores of those who grew up during the catastrophic disruptions of World War II show remarkable stability. For example, intelligence tests of thousands of nineteen-year-old native-born men in Holland in 1963 produced virtually the same average and distribution of scores as those of other samples of Dutch men. The significance of the finding is that the selected group had been conceived and born in famine-stricken districts of Holland during the winter of 1944-45, when daily caloric intakes had dropped to about one-third the normal ration. This indicates that test scores are unaffected by severe malnutrition of mothers during pregnancy, although it does not rule out other sorts of dietary effects on mental development.
Another example of stability in test scores is a study of intellectual development in Warsaw a generation after World War II. The war destroyed about 70 per cent of the city, and afterward the Polish government razed the remaining slums. Strict egalitarianism guided the rebuilding of residential housing, intermixing families of manual workers, non-manual workers, and professionals. The study examined over 200 schools in Warsaw and found considerable variation in quality as measured by teacher training and to a lesser extent by other indices. In spite of official policy, the schools with the most highly trained teachers tended to cluster in the city’s center. Virtually all the children born in 1963 and still living in Warsaw in 1974 were tested. Their scores on intelligence tests varied about as widely as they do, for example, in London, where the egalitarian impulse has not yet randomized social classes across neighborhoods. The average IQ in Warsaw was about 109, substantially higher than the norm of 100, which the authors of the study attribute to the flow of highly trained people, with their IQ’s, toward the capital. (In the United Kingdom, too, it has been shown that London is the region with the highest IQ, but it is only about 102, according to one estimate.) In Warsaw, neither the quality of school nor the location of the residential area correlated with children’s scores, but their parents’ education and occupational status did, and did so about as much as they do in other regions of the world, such as the United States and Western Europe, where comparable data are available. Thus, a generation of homogenized residential patterns leaves unchanged the transmission of intellectual capacity from one generation to another and also the gap in IQ scores between the offspring of different occupational classes. This resistance to social control would be expected of a heritable trait.
Jensen’s new book omits any question of heritability, for in the short run it is immaterial. In the short run, the questions about intelligence testing concern its meaning, its biases, and its predictive value, and these arise within a single generation, without considering transmission of test scores to the next generation or worrying about attempts to change them. In the long run, however, heritability is crucial, not just for intelligence but for any mental or physical trait that matters socially. Substantial heritabilities are like large unseen rocks in the stream of social life, shaping and at times distorting the effects of institutions and laws and defining private relationships. Heritability causes individuals, and the groups they are part of, to respond differently to common environments. This is obvious if we are thinking about people growing to different heights though sharing a diet and other physical conditions. It is less obvious but more worth understanding if it is performance in school we are concerned with. If we want education to fit everyone, it will have to be tailored to individual differences. And so it may be for all the other points of contact between us and our environment. In a society of legal and moral obligations, economic opportunities, hedonistic enticements, and emotional risks, biology may be setting limits and creating individual differences. No significant aspect of human conduct is yet known to be free of substantial heritability; the best case for substantial heritability is the case for intelligence-test scores.
What Do Intelligence Tests Measure?
But why should we care about those test scores at all? Critics of testing sometimes deny that the tests measure intelligence. This point of view has been expressed in a documentary film shown on network television (CBS’s The IQ Myth) and has been repeated in any number of books, articles, and commentaries by generally responsible people. This outpouring has emboldened government agencies around the country to ban the tests in public schools. The denial that tests actually measure intelligence is, in short, not an uncommon view in one form or another. It is, therefore, worth examining more closely, and Jensen’s new book does so.
Two points need to be noted from the start. First, “intelligence” is not yet a concept that has been pinned down by a scientific discipline, the way “force” has been by physics. When we consider whether tests measure intelligence, we can only mean whether the scores correlate with what people generally understand by the word. Since people are unlikely to mean precisely the same thing, the correlation can only be approximate. But if there is any sort of correlation, then as a practical matter the test measures intelligence to some extent, however obliquely. Second, as a one-dimensional scale, the IQ cannot fully represent a multidimensional conception of intelligence. There are, consequently, many measurement procedures that yield not a single number like an IQ, but multiple scores, such as the verbal and quantitative tests used for college entrance. The objective measurement of intelligence can be made as multidimensional as the data and the occasion warrant.
Scores on well-standardized tests correlate highly with each other, which is as it should be. They do so even when the items are superficially utterly different. Let’s look at a sampling of items, for their diversity is often forgotten by the critics. Most of these examples are described at greater length by Jensen. A block-design test requires us to copy a red-and-white design using small cubical blocks whose faces are painted red, white, or half-and-half. The scores this test yields correlate with scores on vocabulary tests, sentence-completion tests, or the number of digits we can recite backward, having heard them once. Scores on any of these tests correlate with our speed and accuracy in seeing a figure or a word in a picture that has been blotted out by random patches. Tests involving nothing but visual inferences from geometrical designs correlate with tests involving nothing but word usage. All the foregoing tests correlate with our ability to extract syntax from sentences using nonsense words, such as the following item (one of Jensen’s examples):
Pick the correct choice for the blank: A gelish lob relled perfully. I grolled the — meglessly. (a) gelish (b) lob (c) relled (d) perfully.
Many people quickly sense the noun-ness of “lob,” but some people do not.
The scores also correlate with tests that require reasoning with numbers or that ask us to figure out the family relationships among relatives in a family described in simple terms, or to find the odd or missing features in a series of distorted or mutilated pictures of familiar objects, or to count the cubes in a perspective view of a stack of cubes, or to say whether or not two drawings of a stack of blocks are two views of the same stack.
The foregoing further correlate with tests of common and uncommon knowledge, including such questions as: how many days are there in a week? or, who wrote Antigone? or, what is the boiling point of water? The people who answer the hard questions tend to answer all the easier ones too, and conversely for people who cannot answer the easy questions. On the whole, the same people can or cannot distinguish the meaning of similar but common words, such as “produce” and “yield” or “triumph” and “victory.” The scores also correlate with scores on a series of finger mazes graded in difficulty.
The discovery that items as different as these result in correlated scores was no small finding about the nature of the human intellect, and this brief survey barely suggests the full range. Only those who do not know the variety of tests can believe that they are limited to specialized vocabulary or knowledge, to particular cognitive skills, or that they are explicitly geared to any subculture in our society. A deeper and more abstract solution is needed for the puzzle of what links together so varied an assortment. Some of the people who have thought about the puzzle say that the common trait involves extracting relationships from minimal data and applying the results to new data. The many different tests are different ways to tap the process. Vocabulary or word-usage tests sample the accumulated harvest of the underlying ability, for we learn most of language from everyday contexts, not from the dictionary or by rote memorization. How well we extract subtle relationships from scant data has much to do with our command of language. Tests of inference or perceptual analysis confront directly the ability to extract and apply relationships. The correlated scores show the underlying common capacity or capacities, which is what makes the test an intelligence test rather than merely a vocabulary or a spatial-analysis test.
The scores correlate, but not perfectly. The more similar two tests are, the more their scores correlate, but sometimes tests that seem quite different produce closely corresponding scores. The pattern of correlations permits an estimate of how much the scores depend on mental factors common to all the tests, which is often called “g” (for general) in the jargon of testing. A person with a high g will profit from it whenever it is called upon by a given test. Some tests that depend heavily on g are verbal analogies (e.g., fish is to scales as dog is to: lap; skin; fur; bite), spatial reasoning, number- or letter-series completion, and verbal distinctions. In addition, analysis uncovers other factors besides g—for example, verbal factors that draw verbal scores closer to each other than they would be on the basis of g alone, or spatial-orientation factors, perceptual speed factors, and so on. It is thus possible to design test batteries that measure g or other factors in some proportion for a particular practical or scientific purpose.
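The extraction of g from a pattern of correlations can be sketched with a toy computation. The correlation matrix below is hypothetical, chosen only to mimic the uniformly positive correlations the text describes; the first principal component of such a matrix is a crude stand-in for g, and each test's loading on it shows how heavily that test draws on the common factor.

```python
def g_loadings(R, iters=500):
    """Loadings of each test on the first principal component of a
    correlation matrix R, found by power iteration (a rough proxy
    for the g extracted by full factor analysis)."""
    n = len(R)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the largest eigenvalue.
    eig = sum(v[i] * R[i][j] * v[j] for i in range(n) for j in range(n))
    return [x * eig ** 0.5 for x in v]

# Hypothetical correlations among four tests (illustrative numbers):
# vocabulary, verbal analogies, block design, digit span.
R = [
    [1.00, 0.70, 0.45, 0.40],
    [0.70, 1.00, 0.50, 0.45],
    [0.45, 0.50, 1.00, 0.35],
    [0.40, 0.45, 0.35, 1.00],
]

loadings = g_loadings(R)
print([round(x, 2) for x in loadings])
```

Every loading comes out positive, reflecting the all-positive correlations, and the two verbal tests load a bit more heavily than digit span because they correlate more strongly with everything else. The residual correlations left over after removing this general component are what reveal the narrower verbal, spatial, and speed factors mentioned above.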
Do the scores correspond to people’s impressions of intelligence? Various kinds of evidence say yes, more or less. Teachers’ ratings of pupils’ intelligence correlate with test scores, as do pupils’ ratings of each other. Though substantial, the correlations are not perfect, so each of us is free to decide whether we trust the tests or the subjective ratings more. Test scores correlate with measures of social acceptance; also with being picked earlier when athletic or other sorts of teams are chosen and with being selected as a leader by classmates. Even without knowing the scores, parents tend to favor their higher-scoring children. Test scores correlate with adults’ ratings of each other’s potential for occupational success. African bushmen’s ratings of each other’s intelligence correlate with scores on a test that measures g especially well. Even if you believe that approval and high regard by parents, teachers, and peers foster high scores, rather than vice versa, the data show that scores measure something like intelligence or a related quality in leaders, co-workers, friends, playmates, and offspring.
What Do Test Scores Predict?
Intelligence tests endure in the face of unpopularity because they correlate with so many different kinds of things. We should start with school because success in school is what the tests were originally designed to predict. No fact about a six- or seven-year-old child predicts later success in school as well as an intelligence-test score. Not the parents’ education, or their socioeconomic standing, their attitude toward school, or even their IQ’s, correlate as well as the child’s own score. The predictive power of the IQ is not overwhelming, leaving room for the other factors to have some effect, but, on the average, the other factors are small in comparison. Test scores predict school grades about as well whether the test is given at the beginning or at the end of the grading period. This finding argues against the popular claim that the correlation between IQ and grades results from teachers’ grading so as to match the children’s IQ’s. Other findings also argue against it. IQ’s predict objective measures of scholastic achievement better than they predict grades. This is not because objective achievement tests are the same as IQ tests, for they are not. Grades, more than achievement scores, reflect the differences in teachers’ grading standards, prejudices, and competences, which weaken their correlation with IQ. The difference between grades and achievement scores is the opposite of what it would be if the correlation with IQ resulted from teachers’ expectations or self-fulfilling prophecies for their students. If teachers’ expectations make a contribution, as some people believe they do, it must be relatively small on the average.
Not all subjects correlate with IQ equally. Even within a subject, topics do not correlate equally. Mathematics, science, and English generally correlate more than non-academic subjects like typing, woodworking, or learning to play a musical instrument. Within a subject, topics that require mostly time and memory correlate less than topics that require abstraction or mental manipulation, the thought problems rather than the rote problems. Repeating a list of just-heard numbers is less correlated with IQ than repeating them in reverse order. The former requires memory alone; the latter, memory plus mental work. If teachers equalize the time spent on a subject by their pupils, achievement scores may vary less among the students, but they correlate even more with IQ than when the students are left to their own devices. Reading scores correlate with IQ, but reading can be subdivided into two components, translation from text to speech and comprehension: IQ correlates more with the latter than the former. Most children who read poorly also have problems with tests of oral comprehension, involving no reading. But dyslexics, who may actually be having trouble with the translation from text to speech, often have high verbal IQ scores, if the test does not require extensive reading. The speed of learning something new may correlate with IQ, but once it is learned and routinized, the correlations may decrease or vanish. Typing, driving a car, doing the multiplication table, reciting the alphabet may be examples. That is why some items turn up on IQ tests at one age level, while they are being learned, but disappear later, after they have been mastered. IQ may then predict the speed of learning but not the final level of performance for relatively simple intellectual activities. If an activity continues to demand abstraction or cannot be routinized, the IQ probably correlates with both the speed of learning and the final level.
Intelligence-test scores correlate with scholastic work from the first grade on. In college, and even more so in graduate schools, the correlation begins to shrink, although it does not vanish entirely. When a law school admits only people with IQ’s in the top 2 or 3 per cent of the population, it has filtered out most of the correlation between IQ and performance. This does not mean that intelligence is irrelevant to the study of law, as the law school would quickly discover if it started admitting candidates from the full range of scores. With only the top 3 per cent in attendance, performance correlates with other attributes, mainly interest and sitzfleisch. Nevertheless, even in selective colleges and professional schools, intelligence-test scores still retain a trace of predictiveness, though not as powerful as in the almost unselected population of large elementary or high schools.
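The shrinking correlation in selective schools is a purely statistical effect, restriction of range: admitting only the top sliver of scores throws away most of the variation that carries the correlation. A minimal simulation sketches the effect (the population correlation of about 0.6 and the top-3-per-cent cutoff are illustrative assumptions, not figures from the text):

```python
import random
import statistics

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Simulate IQ ~ N(100, 15) and a performance measure built to
# correlate about 0.6 with it in the full population.
n = 100_000
iq = [random.gauss(100, 15) for _ in range(n)]
perf = [0.6 * (q - 100) / 15 + 0.8 * random.gauss(0, 1) for q in iq]

r_all = pearson(iq, perf)

# Keep only the top ~3 per cent of IQ scores, roughly the
# filter a selective law school applies.
cutoff = sorted(iq)[int(0.97 * n)]
top = [(q, p) for q, p in zip(iq, perf) if q >= cutoff]
r_top = pearson([q for q, _ in top], [p for _, p in top])

print(f"correlation, full population: {r_all:.2f}")
print(f"correlation, top 3% only:     {r_top:.2f}")
```

The restricted correlation comes out far below the population value even though the underlying relationship between score and performance is unchanged, which is the essay’s point: low correlation inside a highly selected group does not mean intelligence is irrelevant to the work.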
College admission often depends on a mixture of grades, scores on achievement and aptitude tests, subjective assessments, even geography and demography. Colleges go through this complex procedure partly because their goals are more subtle and diverse than just to produce the largest number of A grades in next year’s freshman class. They are, indeed, not merely picking individuals but trying to create a student body whose characteristics may concern many people—the students themselves, alumni, faculty, administration, and, lately, the federal government and the general public. But if colleges were trying to maximize academic performance alone, their best single predictor would probably be scholastic-achievement scores. Intelligence-test scores would predict also, but not as well because they are less sensitive to the other personal qualities that contribute to success in school. But achievement scores and intelligence scores are themselves so highly correlated that the differences in prediction are negligible. It may nevertheless seem fairer to reward the student who has worked hard for high achievement scores in spite of a not so high IQ rather than the converse, the high-IQ underachiever. Not only fairer, it is good practical policy, for overachievers are generally better risks than underachievers. Without intelligence-test scores (or a close correlate like scholastic-aptitude tests), distinctions like this would be hard to draw.
The correlates of intelligence reach beyond school, as sociologists and economists are showing. Consider socioeconomic status (SES), the modern counterpart of class. People can rate occupations subjectively for status or prestige and the result closely resembles any one of a variety of objective classifications of occupational status. Any of these scales correlates with IQ. Jobs rated high are typically filled by people with high IQ’s, and so on through the scale. The stratification of society into a gradient of prestige approximately parallels the gradient of measured intelligence. A full account of intellectual meritocracy is too technical and at present too much in contention to attempt here, but several qualifications can be noted. The correlation between IQ and SES is only moderate, lower than the correlation between IQ and success in school. But we should distinguish two correlations involving IQ and status. The one just mentioned is between individual scores and individual SES. The other correlation is between the average IQ in an occupation and the average status of the occupation. Taking several hundred occupational categories and the average IQ of the people in those categories gives quite high correlations, near the limits imposed by the measurements themselves.
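The gap between the two correlations, individual scores against individual status versus occupation averages against occupation status, is the ordinary effect of aggregation: averaging over many workers in an occupation washes out individual noise. A small simulation (every number in it is hypothetical) shows how a moderate individual correlation can coexist with a near-perfect correlation of averages:

```python
import random
import statistics

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 300 hypothetical occupations, each with a status rank and 50 workers.
# A worker's IQ tracks the occupation's status only loosely: the
# status signal is a few IQ points, buried under wide individual noise.
statuses, iqs, occ_status, occ_mean_iq = [], [], [], []
for status in range(300):
    workers = [100 + 0.05 * (status - 150) + random.gauss(0, 13)
               for _ in range(50)]
    statuses += [status] * 50
    iqs += workers
    occ_status.append(status)
    occ_mean_iq.append(statistics.mean(workers))

r_individual = pearson(statuses, iqs)           # person by person
r_aggregate = pearson(occ_status, occ_mean_iq)  # occupation averages

print(f"individual-level correlation:   {r_individual:.2f}")
print(f"occupation-average correlation: {r_aggregate:.2f}")
```

The averages land near the measurement ceiling while the individual correlation stays moderate, matching the pattern the paragraph describes.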
The correlation between individual IQ and social status is diluted by arbitrary prejudices like those against women or blacks and also by individual circumstances. A young person might be a stock clerk, on the way to promotion to shop foreman or plant manager. Someone else might have just opened a motel with inherited capital, and might be two years away from bankruptcy and a job as a salesman in a clothing store. At any moment, streams of people are moving through jobs and levels of status, and the correlation with individual test scores cannot register where they are headed. On the average, however, they tend to be headed toward a level highly correlated with intelligence scores. The conceptual demands of many occupations are not always recognized as such because they may not seem intellectual. In a serious poker game, the player who thinks it is all a matter of luck will in time lose to the one who knows how subtle the game is. As a gambling man, I would wager that the difference in players correlates with IQ, barring differences in experience.
Economists prefer income to SES as a measure, but they see the same picture, broadly speaking. Income, too, correlates with intelligence-test scores, though slightly less than SES.
Sometimes it is said that the fundamental relationship is between income or SES and not IQ but educational level. Society does, in fact, often require people to find a niche via education. Even legislation gets involved in some of the higher occupations, requiring graduation from professional schools or certification by examination for doctors, lawyers, accountants, and others. At the same time, the most consequential jobs, which usually pay well and carry high status, are often hard to routinize and demand abstraction and inference. These are the very traits predicted by test scores, whether for school work or real work. What people do in school shows their intelligence, but also other traits that count at work, such as dependability or adaptability or willingness to work for socially approved rewards. School funnels into its records a blend of personal traits that affect our dealings with people afterward, including a large component measured by test scores.
Apparently the point is easy to miss. For example, a new book by Christopher Jencks and his associates (Who Gets Ahead?) presents convincing evidence for the network of correlations connecting test scores, school success, occupational standing, and earnings. But then, the conclusion says:
We have shown . . . that a nontrivial fraction of background’s effect on success derives from the fact that background affects cognitive skills. But it is not clear that cognitive skills are, or should be, synonymous with “merit.” A large vocabulary seems to help a man get through school, and getting through school clearly helps him enter a high-status occupation and earn more money than most men do. But this does not prove that a man “needs” a large vocabulary in order to perform competently in most highly paid jobs. Furthermore, even if such jobs do demand a large vocabulary, it does not follow that this is either technically inevitable or morally desirable. The same logic applies to educational attainment. Educational credentials are essential for obtaining some lucrative jobs. But it does not follow that educational credentials ought to be essential for these jobs. If what we want is competence, for example, we might be better off dispensing with academic credentials and setting up on-the-job selection procedures for identifying incompetents.
What this argument neglects is that if “on-the-job selection procedures” can predict performance, especially at the higher occupational levels, they will correlate with both intelligence-test scores and school records. No one has yet invented, or even proposed, a test that could predict competence in demanding occupations yet not correlate with both school record and IQ.
The predictive power of test scores is sometimes credited to the family’s social background. At the extremes this is certainly correct. A Rothschild or a Rockefeller does not compete on an even footing with the children of migrant laborers, either in school or at work. The different home backgrounds no doubt affect IQ and eventual social status. But the question is not whether such things happen, it is how much of a contribution they make overall to the correlation between SES and IQ. Numerous kinds of evidence show that social background cannot be the whole story and may be relatively minor in the United States at present, when averaged over the entire population.
For example, occupational differences between brothers cannot be explained by differences in SES background, which they share, especially if they are close or equal in age. Nevertheless, brothers sometimes land at different occupational levels. (Sisters have not figured in most sociological studies like this.) Thus something more than social background must be involved. The wheel of fortune may spin mysteriously, but the evidence says that the intelligence-test-score difference between brothers predicts (though not entirely) the difference in their progress up or down the socioeconomic ladder. Other studies confirm the point by showing that brothers rise or fall from the family’s socioeconomic level in some proportion to their test scores. That we all know exceptions to this conclusion does not alter its validity in general.
Though data on adopted children are scarce, it seems that their IQ’s correlate more highly with the SES or school records of their natural parents than of the adopting parents. The children need not know, or even know of, let alone live with, their natural parents for the correlation to hold. This shows that a person’s socioeconomic and educational fate depends in part on traits like those measured by an IQ test and that the traits pass to their offspring more via the genes than by the home environment. Here, I hasten to add that the correlations are low, though statistically significant, leaving considerable room in principle for adopting parents to contribute to their foster children’s scores and other traits.
Occasionally accounts in the popular press report a failure to find much correlation between IQ and performance on the job, which then serves as an argument against using the scores or any correlate of them to select employees. Judges often rely on this argument in job-discrimination cases. But the lack of correlation means different things for different occupations. Performance may not correlate much with IQ in occupations for which most of the population has already been screened out. We know, in general, that the higher an occupational category’s status index, the smaller the variation of IQ’s within it. Physicians and lawyers and large-corporation executives are probably more homogeneous in intelligence than house painters and short-order cooks. Purely on statistical grounds, correlations between IQ and performance within an occupation should diminish toward the top of the occupational ladder. For occupations at the other end of the ladder, low correlations between performance and IQ may reflect the routinized character of the job. Between the extremes must be the selective movement of IQ up and down the scale of performance that yields the known correlation with status. We may or may not admire the people who rise toward the higher rungs of the ladder, but intelligence is often a factor. On the whole, it can be shown that random placement of IQ’s in occupations, if it were conceivable, would exact a great loss in gross national product, let alone in the quality and safety of life. The differences among people in mental capacity are so large that no society can long afford the indulgence of ignoring them, as the Chinese have recently decided after disastrously experimenting with egalitarianism in schools and occupations. In the United States at present, it has been estimated that more stringent objective selection than we use now would yield significant economic gains, but this is not a matter to be decided on purely economic grounds.
(This point comes up again in the next section, on test bias.)
Test predictions reach even beyond conventional occupations. For example, the criminal population clusters around an average IQ about 10 points below the general average of 100. We are fascinated by stories about wily criminals, who may not often get caught and have their IQ’s measured, but by far most crimes are impulsive acts, the outcome of general attitudes rather than long-range planning or cautious execution. The intellectual component of most crimes is fairly small, compared to the demands of legitimate work. As far as we can tell, the population of criminals in general would not test much higher in IQ than the population of apprehended criminals. Test scores are a better predictor of criminal behavior than socioeconomic status, and if measures of personality are combined with those of intelligence, prediction is even better. The higher crime rate among men than women arises from other attributes than intelligence, such as personality and social role. The widely touted XYY chromosomal abnormality in a few men does contribute to the risk of criminal behavior, but only slightly and mainly because it lowers the IQ a few points.
How, if the foregoing is a fair summary of the data, can the predictiveness of IQ for school and beyond remain in doubt? That good question deserves an answer. Here are six reasons for the continuing dispute about test scores as predictors:
1. Tradition. Standardized mental testing is recent, but explanations of human behavior and society are ancient. Cultural inertia resists a new account of significant events in our lives, such as success or failure or the prospects for our children.
2. Professionalism. A narrower version of 1. IQ is psychology’s creation, and, as the new kid on the block, is not always kindly received by the older disciplines: economists, sociologists, political scientists, and philosophers, each of which has its own, non-psychological way of dealing with human variation.
3. Alternatives. IQ is not the best predictor of many sociological or economic variables at some levels of analysis. Schooling, for example, predicts social status better, but we miss something if we do not also note that IQ is the best predictor of school performance. The best predictors of criminal behavior are sex and age, not IQ, but within an age and sex grouping, IQ is probably the best predictor. IQ is often the silent partner in standard sociological and economic explanations of many aspects of society.
4. Unpopularity. Since IQ is relatively resistant to change, it is a bit frightening to believe that it matters. The criterion for belief in the importance of IQ is therefore set higher than the criterion for belief in its unimportance. This is a reasonable defense against unwelcome news, so long as it does not harden into an absolute unwillingness to accept evidence.
5. Insulation. Because the IQ correlates with schooling, status, income, and many other less global aspects of our lives, we move in a social world already stratified for test scores. Our friends, spouses, children, colleagues, competitors, and neighbors are not randomly spread across the range of scores. More often than not, their scores approximate ours. We are relatively insulated from scores in other parts of the range. Consequently, we underestimate intellectual differences among people and their social importance.
6. Politics. Marxists and egalitarians dislike evidence that individual differences can be socially significant. If the evidence looks strong, they may seek other ways to counter it, such as by trying to discredit those who publicize the evidence.
Are Intelligence Tests Biased?
The received opinion is that intelligence tests are biased against people from lower socioeconomic levels and minority groups. To evaluate the opinion, we should start by defining “bias” more sharply than it usually is. Tests are not biased simply because some people get higher scores than others, any more than yardsticks are biased because they show some people to be taller than others. Bias means that the measurement represents different things for different people, which is not the same as simply showing a difference between people. For example, bias in intelligence testing suggests that a test score for a member of group A would be associated with better performance in school than is the same score for a member of group B. People in group A could reasonably protest that the test is biased against them if it underpredicts their performance.
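This predictive definition of bias can be checked directly: fit a separate regression of performance on test score for each group and compare what the two lines predict at the same score. A minimal sketch, with invented group means and an invented grading rule, of what an unbiased test looks like under that check:

```python
import random
import statistics

random.seed(2)

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# One common rule links score to school performance for everyone;
# the groups differ only in their score distributions, not in the rule.
def simulate(mean_score, n=5000):
    scores = [random.gauss(mean_score, 15) for _ in range(n)]
    grades = [0.04 * s + random.gauss(0, 0.5) for s in scores]
    return scores, grades

scores_a, grades_a = simulate(100)
scores_b, grades_b = simulate(90)

slope_a, icept_a = fit_line(scores_a, grades_a)
slope_b, icept_b = fit_line(scores_b, grades_b)

# Unbiased in the predictive sense: the same score forecasts about
# the same performance for either group, so the lines nearly coincide.
pred_a = slope_a * 95 + icept_a
pred_b = slope_b * 95 + icept_b
print(f"predicted grade at score 95: group A {pred_a:.2f}, group B {pred_b:.2f}")
```

A biased test, by contrast, would show the regression line for one group sitting above the other, i.e., a given score systematically underpredicting that group’s performance.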
Or bias may be shown if certain items in a test specifically favor members of one group rather than another. If people in group A are privy to information not available to those in group B, then items drawing on the information are biased against group B. The differences in their scores on these items would not correspond to a difference in intelligence as usually understood. The extreme form of this kind of bias would be for group A to get an advance copy of the test plus its answers while group B does not. More commonly this kind of bias is thought to apply to items on intelligence tests that call on specific knowledge, such as knowing the boiling point of water.
Jensen’s new book explores both of these kinds of bias, plus several other lesser varieties, in great detail. No comparably exhaustive, meticulous treatment is available and so once again his work will be the point of departure for other scholars. His conclusions, which I will summarize below, have nevertheless been foreshadowed by his own and other earlier work. It is unlikely that significantly different conclusions can be supported by the available data.
There is, however, one kind of bias that, by its very nature, is not readily evaluated empirically. Perhaps bias in our society suppresses not only test scores but also the measures of performance in school and on the job. This is, in effect, to suppose that bias in testing parallels prejudices in school and elsewhere. The matching suppressions would simply push a group’s average below the population averages for both test scores and the correlates of test scores, without leaving a clear trace in the data. This is a hard argument to refute or to defend, for it is by definition invisible in the data. Nevertheless, it apparently is the right story to tell about some immigrant populations of the past, struggling with the vagaries of English on tests, in school, and at work, as well as with less obvious aspects of intellectual and cultural assimilation. They may have seemed in the past to be stuck at the lower rungs of the socioeconomic ladder because of low intelligence or low test scores. Then, they gradually merged into the mainstream population, no longer a depressed or oppressed minority. But, as Jensen notes, tests have generally been good to immigrant populations or their children. The ascent up the ladder has typically been anticipated by rising test scores, placing many of the children of immigrants, if not the immigrants themselves, within reach of higher education and all that goes with it. At this point we simply cannot say if something similar is to be the story of today’s depressed groups, for it may or may not be. But banning intelligence tests is hardly the right remedy for this kind of bias, wherever it is happening, for the tests can be a relatively quick escape from suppression, as they have been in the past.
Test bias is no longer just an academic question. For example, Judge Robert F. Peckham recently decided, in a federal district court in California, that the use of intelligence-test scores violates constitutional guarantees because the tests are biased against black schoolchildren. The prime evidence for bias was the disproportionate frequency of black children selected for special classes on the basis of test scores. The New York Times cited Judge Peckham’s figures: “Although black students made up only 27.5 per cent of the student population, blacks accounted for 62 per cent of the students in classes for the mentally retarded.” This case exemplifies court cases by the hundreds in the last decade because it deals with bias in testing in relation to ethnic-group membership. Judge Peckham might have had other grounds besides bias in testing for banning the special classes. It seems to me, at any rate, that children are stigmatized needlessly by being designated as retarded when they require special education. Even so, Judge Peckham chose test bias as the basis of his decision and he cited disproportionate numbers of children as evidence.
Discussing group differences in test scores is intensely uncomfortable, like a breach of etiquette or worse. Well-meaning people usually avoid the issue, which may have been the right attitude when a person’s ethnic background had no legal standing and the goal was to make it have no social standing either. But the situation has changed, even though social equity remains the goal, for ethnic politics and jurisprudence are upon us. The government and the courts have become entangled in the delicate web of individual and group differences. The time has come to look closely at the facts and to see what they suggest about the possibilities for ethnic fairness in our society.
The first fact to know is that individuals often vary more than the groups they belong to. Within each race, for example, is found the full range of IQ scores. Millions of American blacks exceed the average white score; millions of whites score below the average black. All other known psychological traits, intellectual and otherwise, that vary at all vary within the races and overlap between the races. To treat members of a group uniformly or to expect them to behave uniformly violates the most salient psychological fact we know, which is the uniqueness of individuals. Individuality is far too high a price to pay for the benefits of ethnic politics. Indeed, racism itself starts with the denial of individuality. Happily, the evidence strongly affirms the fact that individual differences are vastly more important than group differences. Given the overlapping distributions of scores, and assuming for the sake of argument a color-blind society, blacks and whites should be found on every rung of the social ladder, although not necessarily in the same proportions.
Disproportionate frequencies cannot by themselves show bias, as other court decisions besides Judge Peckham’s have noted. It is not bias that increases the risk of Tay-Sachs disease for Jews or that suppresses the number of whites in the National Basketball Association. Disproportion does not even establish a presumption of bias in any empirical sense, though legally it may. Whenever a trait varies widely within groups, there is a statistical likelihood that it will vary on the average between groups too. Since most group differences in traits are small, socially insignificant, or both, we may not notice or care about them. When the differences are socially important, they strain our social institutions and clash with the underlying egalitarianism of our social tradition. Judge Peckham tried to relieve the strain by inferring bias in the tests that recorded the difference. But to demonstrate bias requires more analysis than seems to have figured in his decision. From what I have seen of the many recent cases involving claims of bias in objective tests, it would be an understatement to say that courts and legislatures do not always welcome or accept competent expert help, perhaps because it is not obvious to laymen that it is needed. But it is, if the decisions are not to do more harm than good.
Bias may be approached from various directions, as noted above. A practical approach involves the predictive function of testing. Does a score predict the same for a member of group A as it does for a member of group B? If not, the test is biased against the group whose performance is underpredicted. It is generally supposed that intelligence tests are biased in this sense against people from lower socioeconomic backgrounds, particularly those from the minority groups favored by affirmative-action programs. The evidence, which is now considerable in some areas, refutes the claim of predictive bias in almost all cases. In studies of primary, secondary, and college students, in the military, and in a large number of occupations, there is virtually no evidence of significant underprediction of performance by IQ scores, either for comparisons across SES within a race or across races. With only some minor qualifications that Jensen elaborates, a given test score has essentially the same predictive value for people throughout the socioeconomic scale, of whatever race.
It is also generally believed that test items drawing on specific cultural knowledge would distinguish between classes or races better than items that do not, such as the geometrical-inference tests described earlier. But this belief too is refuted by the data. Culturally specific items typically show smaller differences between blacks and whites than items that seem to be relatively free of cultural information. Blacks and whites differ less on verbal tests than on non-verbal tests, but the distinction is of minor quantitative significance. This can be contrasted with the evidence that the verbal portions of tests are, in fact, especially difficult for Mexican Americans and other people who speak English as a second language, such as some American Indians.
Jensen also summarizes the evidence bearing on other familiar claims of bias. He finds no convincing data that the race of the examiner makes any practical difference and a fair amount of data that it does not. Nor does he find evidence in different groups of significant differential effects of anxiety, test sophistication, and a variety of other factors that might have been suspected.
The hypothesis that tests are biased in prediction or assessment is therefore not supported by mounting evidence. But bias is only one question to consider in the use of tests as predictors. Another concerns predictiveness itself, namely, how good is it? The simpler or more mechanical a job is, the less we would expect intelligence-test scores to predict about performance. We noted earlier that test scores sometimes predict initial learning of an activity better than final performance. Each shift in an employee’s duties or change in conditions of work may temporarily boost the predictions of performance by test scores, but then prediction would subside as a new routine is established. Different jobs require more or less of this sort of adaptiveness, hence more or less correlation between performance and IQ. For relatively routine, unchanging jobs, it may or may not be unfair to use an IQ test (or its strong correlates, such as school record) to screen candidates, but it may be unwise as social policy. Any loss in productivity in these jobs by not using test scores or their correlates for selection may be only slight, hence more than offset by political gains. Thus, even if tests are not biased, we may decide against using them for selection when their predictive power is marginal.
In schools and for more complex occupations, the predictive power of tests is substantial and the consequences of interfering with the natural sorting of ability may be profound. If intelligence testing is to be banned, as many judges and legislators would have it, the result is not likely to be what these very people may expect or hope. The scores are firmly correlated with much of the rest of our behavior, and the correlates would persist after the scores vanished. There would still be differences in scholastic performance and in status and income. There would still be achievement scores, competency tests, occupational inventories, etc., which correlate with intelligence tests and perhaps predict later performance even better. There would still be correlations between parents and children and, for the time being at least, disproportions among people from different races and SES backgrounds in movement up and down the socioeconomic ladder.
Without intelligence tests, the main practical difference would be in schools, where there would be less chance of spotting the gifted child whose achievement is lagging or the average child who is failing. Perhaps they are children from homes that fail to inspire caring about schoolwork. Perhaps they have emotional problems or pedagogical problems. More often than not, the children will be from deprived or disrupted homes. Without intelligence-test scores, these children are more likely to slip through school with their capacities unrecognized and unappreciated. History suggests that, imperfect as intelligence testing is, it can see through the veneer of cultural advantage or disadvantage better than anything else yet devised while still predicting later competence in school and beyond. When a district in England, for example, stopped using intelligence tests to select children for college-preparatory secondary schools, the proportion of children from disadvantaged backgrounds fell. For the sake of the culturally disadvantaged, we should not be banning the tests but improving their diagnostic power and using them more intelligently.
Because intelligence-test scores correlate with so much that people care about, the controversy leaves virtually no one untouched or indifferent. The complexities of the subject are compounded by the emotions it awakens. People who usually keep good track of things seem to lose their way remarkably quickly when intelligence testing comes up. It may, therefore, not be out of place to conclude with a simple guide through the thicket—first the don’ts, then the dos:
Do not blame the tests for the differences t
1 Macmillan, 800 pp., $29.95.