Cocaine Floods the Playground
We are now ready to move on to some more interesting statistical issues, with another story from an emotive area: an article in The Times (London) in March 2006 headed "Cocaine Floods the Playground." "Use of the addictive drug by children doubles in a year," said the subheading. Was this true?
If you read the press release for the government survey on which the story is based, it reports "almost no change in patterns of drug use, drinking or smoking since 2000." But this was a government press release, and journalists are paid to investigate: perhaps the press release was hiding something, to cover up for government failures. The Telegraph also ran the "cocaine use doubles" story, and so did the Mirror. Did the journalists find the news themselves, buried in the report?
You can download the full document online. It’s a survey of nine thousand children, aged eleven to fifteen, in 305 schools. The three-page summary said, again, that there was no change in prevalence of drug use. If you look at the full report, you will find the raw data tables: when asked whether they had used cocaine in the past year, 1 percent said yes in 2004, and 2 percent said yes in 2005.
So the newspapers were right: it doubled? No. Almost all the figures given were 1 percent or 2 percent. They’d all been rounded off. Civil servants are very helpful when you ring them up. The actual figures were 1.4 percent for 2004 and 1.9 percent for 2005, not 1 percent and 2 percent. So cocaine use hadn’t doubled at all. But people were still eager to defend this story; cocaine use, after all, had increased, yes?
No. What we now have is a relative risk increase of 35.7 percent, or an absolute risk increase of 0.5 percentage points. If we use the real numbers, out of nine thousand kids we have about forty-five more saying yes to the question "Did you take cocaine in the past year?"
Presented with a small increase like this, you have to ask: Is it statistically significant? I did the math, and the answer is yes, it is: you get a p-value of less than 0.05. What does "statistically significant" mean? It's just a way of expressing how likely it is that you would have gotten a result like this purely by chance, if in reality nothing had changed. Sometimes you might throw heads five times in a row, with a completely normal coin, especially if you kept tossing it for long enough. Imagine a jar of 980 blue marbles and 20 red ones, all mixed up: every now and then, albeit rarely, picking blindfolded, you might pull out three red ones in a row, just by chance. The standard cutoff point for statistical significance is a p-value of 0.05, which is just another way of saying, "If there were really no difference, and I did this experiment a hundred times, I'd expect a spurious positive result on about five occasions, just by chance."
To go back to our concrete example of the kids in the playground, let’s imagine that there was definitely no difference in cocaine use, but you conducted the same survey a hundred times. You might get a difference like the one we have seen here, just by chance, just because you randomly happened to pick up more of the kids who had taken cocaine this time around. But you would expect this to happen less than five times out of your hundred surveys.
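For readers who want to check the arithmetic, here is a minimal sketch of the naive significance test described above. The 1.4 percent and 1.9 percent figures come from the survey; the assumption that roughly nine thousand children answered in each year is mine, for illustration only:

```python
# A sketch of the naive two-proportion z-test on the unrounded survey
# figures, treating every child as an independent coin toss -- exactly
# the assumption the next few paragraphs demolish.
from math import sqrt, erfc

n1 = n2 = 9000          # assumed respondents per survey year
p1, p2 = 0.014, 0.019   # "used cocaine in the past year", 2004 vs. 2005

print(round(p1 * 100), round(p2 * 100))  # 1 2 -- the rounding that "doubled"

abs_increase = p2 - p1              # absolute risk increase
rel_increase = (p2 - p1) / p1       # relative risk increase

pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = abs_increase / se
p_value = erfc(z / sqrt(2))         # two-sided p-value

print(f"absolute increase: {abs_increase:.1%}")  # 0.5%
print(f"relative increase: {rel_increase:.1%}")  # 35.7%
print(f"naive p-value:     {p_value:.4f}")       # ~0.009, below 0.05
```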
So we have a risk increase of 35.7 percent, which seems at face value to be statistically significant; but it is an isolated figure. To "data mine," taking it out of its real-world context and saying it is significant, is misleading. The statistical test for significance assumes that every data point is independent, but here the data is "clustered," as statisticians say. These are not independent data points; they are real children, in 305 schools. They hang out together; they copy one another; they buy drugs from one another; there are crazes, epidemics, group interactions.
The increase of forty-five kids taking cocaine could have been a massive epidemic of cocaine use in one school, or a few groups of a dozen kids in a few different schools, or miniepidemics in a handful of schools. Or forty-five kids independently sourcing and consuming cocaine alone without their friends, which seems pretty unlikely to me.
This immediately makes our increase less statistically significant. The small increase of 0.5 percentage points was significant only because it came from a large sample of nine thousand data points—like nine thousand tosses of a coin—and the one thing almost everyone knows about studies like this is that a bigger sample size means the results are probably more significant. But if they're not independent data points, then you have to treat it, in some respects, like a smaller sample, so the results become less significant. As statisticians would say, you must "correct for clustering." This is done with clever math that makes everyone's head hurt. All you need to know is that the reasons why you must correct for clustering are transparent, obvious, and easy, as we have just seen (in fact, as with many implements, knowing when to use a statistical tool is a different and equally important skill from understanding how it is built). When you correct for clustering, you greatly reduce the significance of the results. Will our increase in cocaine use, already down from "doubled" to "35.7 percent," even survive?
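Before answering that question, here is a minimal sketch of the simplest version of that clever math, the "design effect." The intraclass correlation below is an invented illustrative value, not a figure from the survey; the point is only to show how clustering shrinks the effective sample size:

```python
# A sketch of correcting for clustering via Kish's design effect:
# children in the same school are not independent observations.
from math import sqrt, erfc

n, schools = 9000, 305
m = n / schools            # average cluster size, ~30 children per school
icc = 0.05                 # ASSUMED within-school correlation, for illustration

design_effect = 1 + (m - 1) * icc   # ~2.4
n_eff = n / design_effect           # effective sample size, ~3,700 per wave

p1, p2 = 0.014, 0.019
pooled = (p1 + p2) / 2
se = sqrt(pooled * (1 - pooled) * (2 / n_eff))
z = (p2 - p1) / se
p_value = erfc(z / sqrt(2))

print(f"corrected p-value: {p_value:.3f}")  # ~0.09 -- no longer significant
```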
No. Because there is a final problem with this data: there is so much of it to choose from. There are dozens of data points in the report: on solvents, cigarettes, ketamine, cannabis, and so on. It is standard practice in research that we only accept a finding as significant if it has a p-value of 0.05 or less. But as we said, a p-value of 0.05 means that for every hundred comparisons you do, five will be positive by chance alone. From this report you could have done dozens of comparisons, and some of them would indeed have shown increases in usage—but by chance alone, and the cocaine figure could be one of those. If you roll a pair of dice often enough, you will get a double six three times in a row on many occasions. This is why statisticians do a “correction for multiple comparisons,” a correction for “rolling the dice” lots of times. This, like correcting for clustering, is particularly brutal on the data and often reduces the significance of findings dramatically.
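Here is a minimal sketch of both halves of that argument: how often pure chance hands you a "significant" result when you make many comparisons, and how the Bonferroni correction raises the bar. The figure of thirty comparisons is an assumption for illustration, not a count from the report:

```python
# A sketch of the multiple-comparisons problem and the Bonferroni fix.
alpha = 0.05
k = 30   # ASSUMED number of year-on-year comparisons in the report

# If nothing at all changed, the chance of at least one spurious
# "significant" finding across k independent tests:
print(f"P(>=1 false positive): {1 - (1 - alpha) ** k:.0%}")  # ~79%

# Bonferroni: each individual comparison must clear a stricter bar.
threshold = alpha / k
print(f"corrected threshold:   {threshold:.4f}")  # ~0.0017

# The naive cocaine p-value (~0.009, before even correcting for
# clustering) fails this stricter test.
print(0.009 < threshold)  # False
```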
Data dredging is a dangerous profession. You could—at face value, knowing nothing about how stats works—have said that this government report showed a significant increase of 35.7 percent in cocaine use. But the stats nerds who compiled it knew about clustering and Bonferroni’s correction for multiple comparisons. They are not stupid; they do stats for a living.
That, presumably, is why they said quite clearly in their summary, in their press release, and in the full report that there was no change from 2004 to 2005. But the journalists did not want to believe this. They tried to reinterpret the data for themselves; they looked under the hood, and they thought they'd found the news. The increase went from 0.5 percentage points—a figure that might be a gradual trend, but could equally well be an entirely chance finding—to a front-page story in The Times (London) about cocaine use's doubling. You might not trust the press release, but if you don't know about numbers, then you take a big chance when you delve under the hood of a study to find a story.
OK, Back to an Easy One
There are also some perfectly simple ways to generate ridiculous statistics, and two common favorites are to select an unusual sample group of people and to ask them a stupid question. Let's say 70 percent of all women want Prince Charles to be told to stop interfering in public life. Oh, hang on—70 percent of all women who visit my website want Prince Charles to be told to stop interfering in public life. You can see where we're going. And of course, in surveys, if they are voluntary, there is something called selection bias: only the people who can be bothered to fill out the survey form will actually have a vote registered.
There was an excellent example of this in The Daily Telegraph in the last days of 2007. "Doctors Say No to Abortions in Their Surgeries" was the headline. "Family doctors are threatening a revolt against government plans to allow them to perform abortions in their surgeries, the Daily Telegraph can disclose." A revolt? "Four out of five doctors do not want to carry out terminations even though the idea is being tested in NHS [National Health Service] pilot schemes, a survey has revealed."
Where did these figures come from? A systematic survey of all doctors, with lots of chasing to catch the nonresponders? Telephoning them at work? A postal survey, at least? No. It was an online vote on a doctors’ chat site that produced this major news story. Here is the question and the options given:
“[Doctors] should carry out abortions in their surgeries”
Strongly agree, agree, don’t know, disagree, strongly disagree.
We should be clear: I myself do not fully understand this question. Is that “should” as in “should”? As in “ought to”? And in what circumstances? With extra training, time, and money? With extra systems in place for adverse outcomes? And remember, this is a website where doctors—bless them—go to moan. Are they just saying no because they’re grumbling about more work and low morale?
More than that, what exactly does “abortion” mean here? Looking at the comments in the chat forum, I can tell you that plenty of the doctors seemed to think it was about surgical abortions, not the relatively safe oral pill for termination of pregnancy. Doctors aren’t that bright, you see. Here are some quotes:
This is a preposterous idea. How can [doctors] ever carry out abortions in their own surgeries. What if there was a major complication like uterine and bowel perforation?
[Doctor’s] surgeries are the places par excellence where infective disorders present. The idea of undertaking there any sort of sterile procedure involving an abdominal organ is anathema.
The only way it would or rather should happen is if [doctor] practices have a surgical day care facility as part of their premises which is staffed by appropriately trained staff, i.e. theater staff, anesthetist and gynecologist…any surgical operation is not without its risks, and presumably [we] will undergo gynecological surgical training in order to perform.
What are we all going on about? Let’s all carry out abortions in our surgeries, living rooms, kitchens, garages, corner shops, you know, just like in the old days.
And here’s my favorite:
I think that the question is poorly worded and I hope that [the doctors' website] do[es] not release the results of this poll to The Daily Telegraph.
Beating You Up
It would be wrong to assume that the kinds of oversights we’ve covered so far are limited to the lower echelons of society, like doctors and journalists. Some of the most sobering examples come from the very top.
In 2006, after a major British government report, the media reported that one murder a week is committed by someone with psychiatric problems. Psychiatrists should do better, the newspapers told us, and prevent more of these murders. All of us would agree, I'm sure, with any sensible measure to improve risk management and reduce violence, and it's always timely to have a public debate about the ethics of detaining psychiatric patients (although in the name of fairness I'd like to see preventive detention discussed for all other potentially risky groups too—like alcoholics, the repeatedly violent, people who have abused staff in the job center, and so on).
But to engage in this discussion, you need to understand the math of predicting very rare events. Let’s take a very concrete example and look at the HIV test. What features of any diagnostic procedure do we measure in order to judge how useful it might be? Statisticians would say the blood test for HIV has a very high “sensitivity,” at 0.999. That means that if you do have the virus, there is a 99.9 percent chance that the blood test will be positive. They would also say the test has a high “specificity” of 0.9999, so if you are not infected, there is a 99.99 percent chance that the test will be negative. What a smashing blood test.
But if you look at it from the perspective of the person being tested, the math gets slightly counterintuitive. Because weirdly, the meaning, the predictive value, of an individual’s positive or negative test is changed in different situations, depending on the background rarity of the event that the test is trying to detect. The rarer the event in your population, the worse your test becomes, even though it is the same test.
This is easier to understand with concrete figures. Let's say the HIV infection rate among high-risk men in a particular area is 1.5 percent. We use our excellent blood test on ten thousand of these men, and we can expect 151 positive blood results overall: 150 will be our truly HIV-positive men, who will get true positive blood tests, and 1 will be the single false positive we could expect from testing nearly ten thousand HIV-negative men with a test that is wrong one time in ten thousand. So, if you get a positive HIV blood test result, in these circumstances your chances of being truly HIV positive are 150 out of 151. It's a highly predictive test.
Let’s now use the same test where the background HIV infection rate in the population is about one in ten thousand. If we test ten thousand people, we can expect two positive blood results overall: one from the person who really is HIV positive, and one false positive that we could expect, again, from having ten thousand HIV-negative men tested with a test that is wrong one time in ten thousand.
Suddenly, when the background rate of an event is rare, even our previously brilliant blood test becomes a bit rubbish. For the two men with a positive HIV blood test result, in this population where only one in ten thousand has HIV, it’s only fifty-fifty odds on whether they really are HIV positive.
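The whole argument is just Bayes' theorem, and it fits in a few lines of Python. A minimal sketch, using the sensitivity and specificity quoted above (the function name is mine):

```python
# How the predictive value of the same test changes with the background
# prevalence of the condition it is looking for.
def positive_predictive_value(prevalence, sensitivity=0.999, specificity=0.9999):
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# High-risk population, 1.5% infected: a positive result is ~99.3%
# reliable -- the 150-out-of-151 from the worked example above.
print(f"{positive_predictive_value(0.015):.1%}")   # 99.3%

# General population, 1 in 10,000 infected: the same test is now a
# coin flip, as the second worked example shows.
print(f"{positive_predictive_value(0.0001):.1%}")  # 50.0%
```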