Can A Computer Measure Your Mood? (CAT Part 3)

By Neuroskeptic | March 12, 2014 5:41 pm

In part 1 and part 2 of this series, I examined the story of the Computerized Adaptive Test – Depression Inventory (CAT-DI).

Touted as a revolutionary new way of measuring depression, the CAT-DI is a kind of computerized questionnaire that assesses depressive symptoms by asking a series of questions about how the user is feeling. Unlike a standard questionnaire, however, the CAT-DI is adaptive: it picks which question to ask next based on the user’s previous responses.
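
For readers who haven’t met adaptive testing before, here’s a toy sketch of the core idea – a generic item-response-theory item selector I’ve written purely for illustration, with made-up items and parameters, not the CAT-DI’s actual (unpublished) algorithm:

```python
import numpy as np

# Hypothetical item bank: each item has a discrimination (a) and difficulty (b)
# under a two-parameter logistic (2PL) IRT model. These values are invented.
ITEM_BANK = {
    "feels hopeless":    (1.8, 0.5),
    "trouble sleeping":  (1.1, -0.3),
    "loss of interest":  (2.0, 0.2),
    "thoughts of death": (1.5, 1.4),
}

def prob_endorse(theta, a, b):
    """Probability of endorsing an item, given current severity estimate theta (2PL)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at severity theta."""
    p = prob_endorse(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta, asked):
    """Pick the not-yet-asked item that is most informative at the current estimate."""
    remaining = {k: v for k, v in ITEM_BANK.items() if k not in asked}
    return max(remaining, key=lambda k: item_information(theta, *remaining[k]))

# After a couple of answers suggesting moderate severity (theta ~ 0.4),
# the algorithm asks whichever remaining question is most informative there:
print(next_item(theta=0.4, asked={"trouble sleeping"}))
```

In a real CAT the severity estimate itself is also updated after each answer, and the test stops once that estimate is precise enough – so different users end up answering different, and usually fewer, questions.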

The CAT-DI’s creators have said that the commercial release of the product (and related CATs) is under consideration. They’ve formed a company, Adaptive Testing Technologies (ATT). This commercial aspect has led to fierce controversy over the past few weeks, with accusations of conflicts of interest against some very senior figures in American psychiatry. It was this aspect of the story that I focused on previously.

Now, I’m finally going to delve into the statistics to find out: does it really work?

The CAT-DI was revealed in a 2012 paper by Robert Gibbons and colleagues in the prestigious Archives of General Psychiatry. In this article (which has been previously criticized), the authors, after introducing the theoretical background of the method and describing its development, compared the CAT-DI against three other depression questionnaires: the HAMD, the PHQ-9, and the CES-D. These are all widely used, old-fashioned pen-and-paper scales.

Gibbons et al examined the ability of each of these four measures to distinguish between three groups of people: those diagnosed with no depression, with minor depression, or with major depression. An ideal depression scale ought to give, respectively, low, medium and high scores for these three different groups.

The importance of this comparison can hardly be overstated. It asks the question: is the CAT-DI any better than what we already have? What, if anything, does the new kid bring to the party? And this is the only head-to-head comparison of the CAT-DI’s performance in the paper.

However, remarkably, Gibbons et al give almost no details about these crucial results. This is all they say about it in the Results section:

In general, the distribution of scores [on the traditional questionnaires] among the diagnostic categories [no depression, minor, major] showed greater overlap (ie, less diagnostic specificity particularly for no depression vs minor depression), greater variability, and greater skewness, for these other scales relative to the CAT-DI

I did a double-take when I realized that this was all we’re given. ‘In general’? No p-values? No confidence intervals? No numbers of any kind (except for some descriptive stats for the CAT-DI group only)? ‘In general’, one would expect those things in a scientific paper.

The data from the four measures are presented purely in the form of some graphs (their Figure 2, reproduced below). An ideal depression test would have a tight spread within each category (small blue bars) and clearly higher scores at higher severity (right bar higher than middle bar, and middle higher than left).

My ‘general’ impression from eyeballing the graphs is that the CAT-DI is only slightly better than the other questionnaires, if at all. In particular the humble CES-D (bottom right), which dates to 1977, seems to me to have performed just as well as the fancy new contender – ‘in general’.

But I don’t like generalities. So (for want of any better way!) I measured the height of the blue bars (in pixels) on Figure 2, and thus estimated the degree of overlap between the 10th–90th percentiles of the distributions for the CAT-DI vs the CES-D (the central 80 percent of each distribution being what the bars indicate).
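
For anyone who wants to check the arithmetic, the calculation itself is trivial: treat each blue bar as an interval from the 10th to the 90th percentile, and express the overlap of two adjacent bars as a fraction of each bar’s length. A minimal Python sketch – the endpoint values below are placeholders, not my actual pixel measurements:

```python
def overlap_fractions(bar_a, bar_b):
    """Overlap of two percentile bars (lo, hi), as a fraction of each bar's length."""
    lo = max(bar_a[0], bar_b[0])
    hi = min(bar_a[1], bar_b[1])
    overlap = max(0.0, hi - lo)
    return overlap / (bar_a[1] - bar_a[0]), overlap / (bar_b[1] - bar_b[0])

# Placeholder 10th-90th percentile ranges (in pixels) for the 'none' and 'minor' bars:
none_bar, minor_bar = (10.0, 46.0), (29.0, 70.0)
print(overlap_fractions(none_bar, minor_bar))  # -> (fraction of 'none', fraction of 'minor')
```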

For the CAT-DI, the overlap between the ‘none’ and ‘minor’ bars was 47.2% of the ‘none’ spread and 62.5% of the ‘minor’; for the ‘minor’–‘major’ overlap, it was 80.0% of the minor and 64% of the major. For the CES-D, the corresponding overlaps were 48.5%, 63.4%, 76.9% and 62.5% – almost identical.

Overall proportional overlap – which I defined as the total of the two overlaps between adjacent bars, divided by the total length of the three bars – was identical to within the margin of error (i.e. one pixel), but for what it’s worth, the CES-D was marginally better (a ratio of 0.397 vs 0.399).
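
In code terms, that summary statistic is just the summed adjacent overlaps divided by the summed bar lengths – again, the pixel positions below are placeholders for illustration, not my real measurements:

```python
def overall_proportional_overlap(bars):
    """(Sum of overlaps between adjacent bars) / (sum of all bar lengths)."""
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    total_overlap = sum(overlap(a, b) for a, b in zip(bars, bars[1:]))
    total_length = sum(hi - lo for lo, hi in bars)
    return total_overlap / total_length

# Placeholder (lo, hi) pixel positions for the 'none', 'minor' and 'major' bars:
print(overall_proportional_overlap([(10, 46), (29, 70), (58, 95)]))
```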

This is an… unorthodox approach to psychometrics, I’ll be the first to admit, but it’s the best I could do given the (lack of) information provided in the paper, and I feel it’s more rigorous than just saying ‘in general’.

But there’s a deeper issue. Even assuming that the CAT-DI were better than those three others, would that mean everyone would need to start using it? Or might there be an easier way to get the same level of performance?

Quite possibly there might. Back in 2000, Reise and Henson were developing a CAT for personality testing. They found that the CAT performed very well, but they also found that a minimalistic non-computerized questionnaire (made up of the test items that were most highly correlated with the total score in the calibration dataset) did equally well. The fancy adaptive algorithm was actually unnecessary!
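
To spell out what that ‘best-item’ short form means in practice: in a calibration sample you correlate each item with the total of the remaining items, and keep the top handful. A rough Python sketch, using simulated data purely for illustration (Reise and Henson’s actual selection details may differ):

```python
import numpy as np

def best_items(responses, k):
    """Pick the k items with the highest corrected item-total correlation.

    responses: (n_people, n_items) array of item scores from a calibration sample.
    """
    n_items = responses.shape[1]
    totals = responses.sum(axis=1)
    corrs = []
    for j in range(n_items):
        rest = totals - responses[:, j]          # total score excluding item j
        corrs.append(np.corrcoef(responses[:, j], rest)[0, 1])
    return np.argsort(corrs)[::-1][:k]           # indices of the top-k items

# Simulated calibration data: 500 people x 20 items, each scored 0-3
rng = np.random.default_rng(0)
fake_responses = rng.integers(0, 4, size=(500, 20))
print(best_items(fake_responses, k=5))
```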

In 2010, Reise went on to study a CAT for measuring depression. This time around, the CAT did do slightly better than a best-item questionnaire, but the authors concluded that the CAT provided only “marginally” superior efficiency. Furthermore, these researchers also found that a simple ‘branching’ procedure – essentially, one of those if-you-answer-yes-please-go-to-question-8 rules – was even better, and basically just as good as the CAT.
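
The ‘branching’ alternative is even less exotic to picture: a screener item routes you to one of two short follow-up blocks. Something like this sketch, where the items and cutoff are invented by me for illustration rather than taken from the Reise paper:

```python
def branching_questionnaire(answer_fn):
    """A toy two-stage branching rule: one screener item decides which block follows.

    answer_fn(question) should return an integer score (e.g. 0-3) for that question.
    """
    score = answer_fn("Over the last two weeks, have you felt down or hopeless?")
    if score >= 2:                      # screener endorsed -> ask the severe block
        follow_up = ["thoughts of death", "feeling worthless", "unable to function"]
    else:                               # otherwise -> ask the milder block
        follow_up = ["low energy", "trouble concentrating", "poor sleep"]
    return score + sum(answer_fn(q) for q in follow_up)

# Quick check with canned answers: the screener scores 3, everything else scores 1.
canned = {"Over the last two weeks, have you felt down or hopeless?": 3}
print(branching_questionnaire(lambda q: canned.get(q, 1)))
```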

Have Gibbons et al read this cautionary tale of the limits of CATs? They should have, given that one of them, Paul Pilkonis, helped to write it. Yet they omitted to examine a best-item comparison scale in their paper.

In summary, I’m not convinced that the CAT-DI is a more effective and useful way of measuring depression than the available alternatives. That’s not to say it isn’t better, and I do find the idea of computer adaptive testing a fascinating one. But in my view there’s just not enough data in the Gibbons et al 2012 paper to tell us whether their new product offers any added value.

Gibbons et al seemed to acknowledge this lacuna, promising that:

We will explore the extent to which [the bifactor model] translates to gains in measurement precision, reliability, and validity in a future statistical article.

That was over two years ago. The ‘statistical article’ has yet to appear, to my knowledge, and I’m not quite sure why the Archives peer reviewers didn’t simply require the authors to include those statistics in the original paper. Promissory notes are not – ‘in general’ – worth much in science.

Gibbons RD, Weiss DJ, Pilkonis PA, Frank E, Moore T, Kim JB, & Kupfer DJ (2012). Development of a computerized adaptive test for depression. Archives of General Psychiatry, 69(11), 1104-1112. PMID: 23117634

  • Bernard Carroll

    Yes, these authors do go in for a lot of hand waving, while withholding proper analyses. They don’t hold back, though, on making irrationally exuberant forward-looking statements designed to create a positive buzz for their new business.

    Their anxiety scale, recently published in the American Journal of Psychiatry, is the weakest of all. Basically, it is just like a crude thermometer for unspecified anxiety. It doesn’t align well with clinical anxiety diagnoses for positive case identification in primary care (too many false positives, and it only considered generalized anxiety disorder); they did not bother to do field testing; they didn’t bother to check test-retest reliability; and they way overstated the potential to correctly identify key anxiety phenotypes important for epidemiologic or molecular genetic studies. And we should not overlook that NIMH has poured millions of dollars into these pitiful projects.

    As a matter of fact, their scale for anxiety did a lousy job
    distinguishing depression from anxiety. Their suggested solution? Pay us twice to run a second CAT scale for depression as well as the anxiety CAT scale! Now that’s what I call chutzpah!

  • I.G.

    I am a Masters-level student in counselling psychology. From the training I am currently receiving, I can confirm that the lack of available psychometric analysis makes this test look suspicious, and that clinicians should avoid using a test without being confident of its psychometric properties.

    However, I also wanted to highlight something I was not fully aware of until I started graduate training to become a clinician:
    1. There are a large number of tests out there that are supposed to help in diagnosing mental disorders, many of them with questionable or controversial results.
    2. Psychological assessment should never, ever be made on the basis of a single test. For a clinician to do so is a sin arguably worse than significance fishing for a researcher; it is a complete disregard for the validity of the assessment process. Psychological assessment should include a thoughtful integration, over time, of as many sources of information as possible and practical, including the personal, social, and environmental context of an individual’s life; their history and symptoms; and their own interpretation of all these factors. Information from any test is merely a single piece of this larger puzzle, which requires clinical judgment to put together – and it should never be the deciding piece at that. Basically, tests are generally used as screeners to identify areas to explore and focus the assessment process. Actually relying on them for diagnosis is completely invalid.

    • Neuroskeptic

      Thanks for the comment. This is a very good point – if you want to understand someone’s mental state then questionnaires and other ‘tests’ should be just one part of your approach.

  • RogerSweeny

    “That was over two years ago. The ‘statistical article’ has yet to appear, to my knowledge, and I’m not quite sure why the Archives peer reviewers didn’t simply require the authors to include those statistics in the original paper.”

    Since jobs, money, and prestige go to people with more publications, requiring someone to combine two papers into one probably seemed very, very mean.

    • Neuroskeptic

      Maybe, but all of the authors on Gibbons et al are veteran players who hardly need one more paper to pad out their CVs. The senior author Kupfer was chair of the DSM-5 committee, after all!

      I suspect that their seniority was the reason the reviewers let the lack of details slide. Far from taking pity on the authors, they were overawed by them.

