What does it mean to say that something causes 16% of cancers?

By Ed Yong | May 10, 2012 9:00 am

A few days ago, news reports claimed that 16 per cent of cancers around the world were caused by infections. This isn’t an especially new or controversial statement, as there’s clear evidence that some viruses, bacteria and parasites can cause cancer (think HPV, which we now have a vaccine against). It’s not inaccurate either. The paper that triggered the reports did indeed conclude that “of the 12.7 million new cancer cases that occurred in 2008, the population attributable fraction (PAF) for infectious agents was 16·1%”.

But for me, the reports aggravated an old itch. I used to work at a cancer charity. We’d get frequent requests for such numbers (e.g. how many cancers are caused by tobacco?). However, whenever such reports actually came out, we got a lot of confused questions and comments. The problem is that many (most?) people have no idea what it actually means to say that X% of cancers are caused by something, where those numbers come from, or how they should be used.

Formally, these numbers – the population attributable fractions (PAFs) – represent the proportion of cases of a disease that could be avoided if something linked to the disease (a risk factor) was avoided. So, in this case, we’re saying that if no one caught HPV or any other cancer-causing infection, then 16.1% of cancers would never happen. That’s around 2 million cases attributable to these causes.
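That last conversion is just simple arithmetic: the 2 million figure is the PAF applied to the total number of new cases. A quick sketch using the paper’s own headline numbers:

```python
# The ~2 million figure is just the PAF applied to total incidence,
# using the paper's own headline numbers.
new_cases_2008 = 12_700_000   # new cancer cases worldwide in 2008
paf_infection = 0.161         # 16.1% attributable to infectious agents

attributable_cases = new_cases_2008 * paf_infection
print(round(attributable_cases))  # 2044700, i.e. roughly 2 million
```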

From answering enquiries and talking to people, I reckon that your average reader believes that we get these numbers because keen scientists examined lots of medical records, and did actual tallies. We used to get questions like “How do you know they didn’t get cancer because of something else?” and “What, did they actually count the people who got cancer because of [insert risk factor here]?”

No, they didn’t. Those numbers are not counts.

Those 2 million cases don’t correspond to actual specific people. I can’t tell you their names.

Instead, PAFs are the results of statistical models that mash together a lot of data from previous studies, along with many assumptions.

At a basic level, the models need a handful of ingredients. You need to know how common the risk factor is – so, for example, what proportion of cancer patients carry the relevant infections? You need to know how big the effect is – if someone is infected, their risk of cancer goes up by how many times? If you have these two figures, you can calculate a PAF as a percentage. If you also know the incidence of a cancer in a certain population during a certain year, you can convert that percentage into a number of cases.

There’s always a certain degree of subjectivity. Consider the size of the effect – different studies will produce different estimates, and the value you choose to put into the model has a big influence on the numbers that come out. And people who do these analyses will typically draw their data from dozens if not hundreds of sources.

In the infection example, some sources are studies that compare cancer rates among people with or without the infections. Others measure proteins or antibodies in blood samples to see who is infected. Some are international registries of varying quality. The new infection paper alone combines data from over 50 papers and sources, and some of these are themselves analyses of many earlier papers. Bung these all into one statistical pot, simmer gently with assumptions and educated guesses, and voila – you have your numbers.

This is not to say that these methods aren’t sound (they are) or that these analyses aren’t valuable (they can tell public health workers about the scale of different challenges). But it’s important to understand what’s actually been done, because it shows us why PAFs can be so easily misconstrued.

The numbers aren’t about assigning blame.

For a start, PAFs don’t necessarily add up. Many causes of cancer interact with one another. For example, being very fat and being very inactive can both increase the risk of cancer, but they are obviously linked. You can’t calculate the PAFs for different causes of cancer, and bung them all into a nice pie chart, because the slices of the pie will overlap.

Cancers are also complex diseases. Individual tumours arise because of a number of different genetic mutations that build up over the years, potentially due to different causes. You can’t take a single patient and assign them to a “radiation” or “infection” or “smoking” bucket. Those 16.1% of cancers that are linked to infections may also have other “causes”. Cancer is more like poverty (caused by a number of events throughout one’s life, some inherited and some not) rather than malaria (caused by a very specific infection delivered via mosquito).

You can’t find trends by comparing PAFs across different studies. 

The latest paper tells us that 16.1% of cancers are attributable to infections. In 2006, a similar analysis concluded that 17.8% of cancers are attributable to infections. And in 1997, yet another study put the figure at 15.6%. If you didn’t know how the numbers were derived, you might think: Aha! A trend! The number of infection-related cancers was on the rise but then it went down again.

That’s wrong. All these studies relied on slightly different methods and different sets of data. The fact that the numbers vary tells us nothing about whether the problem of infection-related cancers has got ‘better’ or ‘worse’. (In this case, the estimates are actually pretty close, which is reassuring. I have seen ones that vary more wildly. Try looking for the number of cancers caused by alcohol or poor diets, if you want some examples).

Unfortunately, we have this tricky habit of seeing narratives even when there aren’t any. Journalists do this all the time. A typical interview would go like this: “So, you’re saying infections cause 16.1% of cancers, but a few years ago, you said they cause 17.8% of cancers.” And then, the best-case scenario would be: “So, why did it go down?” And the worst-case one: “Scientists are always changing their minds. How can we trust you if you can’t get a simple thing like this right?”

The numbers are hard to compare, and obscure crucial information.

Executives and policy-makers love PAFs, and they especially love comparing them across different risk factors. They are nice, solid numbers that make for strong bullet points and eye-grabbing PowerPoint slides. They have a nasty habit of becoming influential well beyond their actual scientific value. I have seen them used as the arbiters of decisions, lined up on a single graphic that supposedly illustrates the magnitude of different problems. But of course, they do no such thing.

For a start, the PAF model relies on a strong assumption of causality. You’re implying that the risk factor you’re studying clearly causes the disease in question. That’s warranted in some cases, including many of the infections discussed in the new paper. In others… well, not so much.

Here’s an example. I could do two sets of calculations using exactly the same methods and tell you how many cases of cancer were attributable to radon gas, or not eating enough fruit and vegetables. A casual passer-by might compare the two, look at which number was bigger, and draw conclusions about which risk factor was more important. But this would completely obscure the fact that there is very strong evidence that radon gas causes cancer, but only tenuous evidence that a lack of fruit and vegetables does. Comparing the two numbers makes absolutely no sense.

There are other subtle questions you might also need to ask if you were going to commit money to a campaign, or call for policy changes, or define your strategy. How easily could you actually alter exposure to a risk factor? Does the risk factor cause cancers that have no screening programmes, or that are particularly hard to treat? Is it becoming more of a problem? PAFs obscure all of these issues. That would be fine if they were used appropriately, with due caution and caveats. But from experience, they’re not.

What PAFs are good for

They’re basically a way of saying that a problem is this big (I hold my hands about an inch apart), or that it’s this big (they’re a foot apart now) or THIS big (stretched out to the sides). They’re our best guess based on the best available data. In the case of infections, the message is that they cause more cancers than people might expect.

Used carefully, I have no real problem with PAFs, but I think that they’re blunt instruments, often wielded clumsily. We could do a much better job at communicating what they actually mean, and how they are derived. I’d be happier if we quoted ranges based on confidence intervals. I’d be even happier if we stopped presenting them to one decimal place – that imbues them with a rigour that I honestly don’t think they deserve. And if, whenever we talked about PAFs, we liberally used the suffix “-ish”? Well, I’d be this happy.


Comments (18)

  1. Thanks for this excellent explanation. It would make the basis of an excellent case study for all journalism and policy students–and even mature practitioners.

  2. Siri Carpenter

    This is a valuable explanation of a statistical reference that is too often thrown around without real understanding — sometimes with harmless results, sometimes not.

    Also excellent to read the word “bung” twice in one post.

  3. “Bung” is fun to say and write. I’m just sitting here saying it to myself now. This is why I don’t work in an office.

  4. Tony Mach

    Thanks for the article, but isn’t a risk factor something different than a cause?

    You write:
    Many causes of cancer interact with one another. For example, being very fat and being very inactive can both increase the risk of cancer, but they are obviously linked. You can’t calculate the PAFs for different causes of cancer, and bung them all into a nice pie chart, because the slices of the pie will overlap.
    How easily could you actually alter exposure to a risk factor? Does the risk factor cause cancers that have no screening programmes, or that are particularly hard to treat?

    While being very fat is a clear risk factor for cancer, it is not an established cause, AFAIK. It could very well be that being very fat and getting cancer is caused by some (unknown or poorly understood) third factor – say, nutrition?

  5. Given that with many of these things you can’t do RCTs, you have to infer causality from lots of converging lines of evidence. If I say “Bradford-Hill criteria”, does that ring a bell? If not, here: http://en.wikipedia.org/wiki/Bradford-Hill_criteria

    Obviously, the line between “risk factor” and “cause” is thus a bit subjective, but I think obesity is as established a “cause” as we have. There’s plenty of consistent evidence from lots of large cohort studies with millions of people (many of which control for the effects of nutrition), large effects on the risk of several cancers, solid biological mechanisms, evidence for reversibility (falling risk in people who undergo radical weight loss surgery) and more. You’d have to bend over backwards to argue that the relationship isn’t causal, or invoke some sort of mysterious unknown third factor. Nutrition ain’t it – conversely, the evidence linking specific nutrients (or diet more broadly) to cancer risk is much weaker, despite *hundreds* if not thousands of studies.

  6. I think part of your problem is a general issue with the reporting of statistics. Charles Seife called it “disestimation” in Proofiness – implying a far more exact number than your measure suggests.
    As you say, reporting a range would more accurately convey the inherent uncertainty in these sorts of models, and perhaps also soften the interpretations of causality.

  7. Emma Croager

    Thank you Ed for your excellent explanation. Essential reading for anyone having to explain what statistics like this really mean.

  8. Steve Pratt

    Ed, thanks for this, I will be citing this widely.

    One small point of disagreement. You say “Comparing the two numbers makes absolutely no sense”, then go on in the following paragraph to qualify this statement with a non-exhaustive list of caveats. And conclude the post with a nice real-world application. I baulk at your use of the absolute: “…absolutely no sense…”, but then everyone’s a critic.

    All this said, I was hoping for a list of the 2 million names of people who have succumbed to virus-caused cancer. TBH, I’m a bit disappointed in you. 😉

  9. Kate McKiernan

    Another huge problem in estimating cancer risks and rates is the signal-to-noise problem. The background rate of cancer in the population is around 20%-ish*. Making measurements of the influence of any particular risk factor is difficult, and making precise ones is intellectually bankrupt.

    *20% is the figure we used in my grad-level Radiation Biology class; I don’t have anything citable handy at the moment

  10. I worked for a while on the British Household Panel Survey (BHPS), which the Institute for Social and Economic Research (ISER) at the University of Essex has run since 1991. One of their big issues — something that they award PhDs in, actually — was survey methodology. How do you ensure that long timescale longitudinal studies are actually comparable? Can you be sure that the pattern of old households falling out and new households coming in to the study over time isn’t indicative of some socioeconomic trend which you’re failing to take account of? I’m now working for a thyroid cancer diagnostic company, and again, designing the follow-up clinical study to our original validation study is quite tricky, for different reasons. It’s not an area I’m actively involved in any more, but it certainly brings back memories of long discussions at ISER!

  11. @Steve – I’m just waiting so I can list them all alphabetically. I have standards, you know.

  12. mfumbesi

    Great read as always (lawd I sound like a stuck LP record).

  13. Ed,

    The definition you give of PAF, “the proportion of cases of a disease that could be avoided if something linked to the disease (a risk factor) was avoided,” could also be dissected. The PAF is often calculated from observational studies, and it reflects observations of things that happened to be observed. It doesn’t measure what “could be avoided,” though it might suggest a possibility of avoiding disease exists.

    For example, the stomach cancer rate is higher for individuals with H. pylori infection than for those without it, hence there is a positive PAF for the association. That doesn’t mean, however, that avoiding H. pylori infection will reduce stomach cancer. Maybe it will, maybe it won’t. The PAF only measures the association. To what extent these infections are causes of cancer, as opposed to markers for, say, immunological defects that allow cancers to progress, requires additional research. (I don’t know the research on H. pylori and stomach cancer, and I can’t access the Lancet Oncology article to see how the authors evaluate the association in their calculation. It might be that we know a lot in this particular case, and there might be good reason to believe that the PAF represents what you say it does about “avoidance,” but I don’t know.)

    Not directly related to PAF, but also worth noting in any discussion of public health statistics, is the fact that everyone dies eventually of something, and that biology and public health are complex matters that are hard to reduce to headline-size quips. If we eliminate one or two particular infection-related cancers in developing countries (at some cost that wasn’t spent on other measures, and with the resulting change in terms of who dies of what when), have we improved the world? Maybe, maybe not. Learning as much as we can about the science of health is a great thing – understanding what we’ve learned and deciding what to do with it is a lot harder.

  14. Great comment, Steve. Thanks. And I very much agree.

  15. anon

    Just want to point out the HPV vaccine does not protect against all oncogenic (cancer-causing) HPV types, only the most dangerous. About 30% of cervical cancers are caused by types which are not protected against by the current generation of the vaccine. It’s still a major breakthrough, but we don’t have a vaccine that will prevent all HPV-caused cervical cancer.

  16. dmitryb

    so if PAF is mostly useful for measuring a magnitude of a problem (“They’re basically a way of saying that a problem is this big”), why is it being reported with such a precision (to the first decimal point as in “16·1%”)?

  17. John Barrett

    I have always wondered how these medical statistics are derived. They seem to result in the sort of syllogisms where eventually one starts to conclude:

    “Peas, I bet 95% of all people who have cancer eat peas. Therefore peas must give you cancer.”

    So many and varied are health scares now that they lose their potency, especially for sceptics or cynics like me who are willing to engage in such reductio ad absurdum. I now believe almost nothing that I read or hear about medical statistics and so-called “studies”.
