Putting tests to the test: how to evaluate a diagnostic test
Updated: Apr 3
You may have noticed lately that we are inundated with new diagnostic tests: COVID tests, cancer screening tests, genetic tests, genealogy tests, food sensitivity tests, wearable devices (Fitbit, Apple Watch), microbiome tests, continuous glucose monitors. There are several reasons for this:
New discoveries in medicine; for example, the human genome, and microbiome
Reduction in costs; for example, genome sequencing, imaging, wearable devices
Deregulation of diagnostic tests: tests can be sold without being validated, no longer require a doctor's prescription, and can be sold direct-to-consumer
Diagnostic testing is profitable for both industry and doctors; $60 billion yearly revenue in the US
Unsurprisingly, one of the most common patient inquiries I get in my practice is a request for a new diagnostic test. The patient's argument is, "I know the test is new, but what's the harm? The more information the better." However, there is a problem with this line of reasoning. A bad diagnostic test is worse than no test at all. This is because the diagnostic test is the cornerstone of medicine. From the diagnosis everything else follows. A bad diagnostic test can lead to false positives, further testing, more invasive testing, and a cascade of bad decisions, bad treatments, and bad outcomes. Not to mention the anxiety and psychological impact of being told something is seriously wrong with you. Indeed, we need to think harder and be more careful before we order a test.
The diagnostic test is the cornerstone of medicine. From the diagnosis everything else follows.
Lessons from the past
Seventeenth-century Dutch artists, like Jan Steen, painted insightful scenes of everyday life. A common theme was 'The Doctor's Visit', a critique of seventeenth-century diagnostic medicine. In Steen's painting of the Doctor's Visit below, he depicts a swooning lady. The doctor is trying to diagnose her condition using the technology of the day. He is taking her pulse, and next to him, an attendant is carrying a flask of urine. He will use the urine to perform uroscopy. The color, smell, and yes, even the taste of the urine will help him diagnose her medical problem. A "clear pale lemon color leaning toward off-white, having a cloud on its surface", for instance, indicated pregnancy.
A closer look at the foot of the chair reveals a bowl with a burned ribbon. This was another pregnancy test of the day. A ribbon was soaked in the patient's urine and burned. If the smell nauseated her, she was pregnant. However, the boy playing with the bow-and-arrow in the foreground (resembling Cupid) hints at the real diagnosis. The doctor has it all wrong. The lady is suffering from love sickness.
Have we made any progress in medicine? Are our contemporary diagnostic and screening tests any better? The answer is yes. The reason: the scientific method. The scientific method has demanded that we put tests to the test. The process can be summarized with three core questions:
1. Is the test accurate?
The first question to answer about a test is whether it is accurate. Simply put, the less accurate a test, the more false positives and false negatives it produces. Evaluating accuracy is straightforward. We use something called a diagnostic accuracy study, in which the test in question is compared with a 'gold standard'.
Let me provide an example. Dr. Ning Fanggang, a prominent surgeon in Beijing, is skeptical of the 'pulse diagnosis' used by practitioners of Traditional Chinese Medicine (TCM). He believes this test is not accurate, and is offering a cash reward to anyone who can accurately diagnose pregnancy in a woman by feeling her pulse alone. Intuitively, pregnancy should be easy for a TCM practitioner to determine; for all the subtle and miraculous things TCM claims to accomplish, diagnosing a major physiologic change like pregnancy should be a piece of cake.
His diagnostic accuracy test is rather simple. Forty women are chosen; some are pregnant and some are not. Pregnancy is first determined by the 'gold standard' - a blood test, an ultrasound, and the belly starting to show. Then, a blindfolded TCM practitioner assesses the pregnancy status of the 40 women by feeling their pulses alone. The more women they call correctly, the more accurate the test. Random guessing would lead to about 50% accuracy (20 out of 40 correct). To qualify for the reward, Dr. Ning is asking for 80% (32 out of 40 correct) or greater. Thus far, one TCM practitioner has failed the challenge and there have been no other takers.
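The 80% bar is not arbitrary; a little probability arithmetic shows it reliably filters out luck. The sketch below is my own illustration (not part of Dr. Ning's challenge) of the chance that a blindfolded guesser reaches 32 or more correct out of 40 by coin-flipping alone:

```python
from math import comb

n = 40          # women examined
threshold = 32  # correct calls needed to win the reward (80%)

# Probability of exactly k correct answers by pure guessing is C(n, k) / 2^n;
# sum the tail from the threshold upward.
p_pass_by_luck = sum(comb(n, k) for k in range(threshold, n + 1)) / 2**n

print(f"Chance of passing by guessing: {p_pass_by_luck:.5%}")
# About 0.009% -- roughly 1 in 11,000 -- so passing would be real evidence
```

In other words, a practitioner who cleared the bar could not plausibly dismiss the result as chance.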
It is important to recognize that without a diagnostic accuracy study, a test could be as useless as tasting urine. Because of this, the FDA began regulating tests in order to protect you. If a new test passes a diagnostic accuracy study, it can be FDA approved. However, there are some loopholes. First, there is nothing stopping a doctor from ordering a test that attained approval for a specific disease or condition and using it to diagnose a disease or condition it was not studied for. Second, tests without a diagnostic accuracy study can be sold by laboratories as long as it is disclosed that the test is an LDT (Laboratory Developed Test). In a paper from 2015, the FDA documented 20 case studies of LDTs proven to be inaccurate but still sold by laboratories, including tests for Lyme disease, mercury, autism, fibromyalgia, cancer screening, and genetic screening. The clinical decisions made as a consequence of these inaccurate tests led to direct patient harm. For example, a false positive Lyme disease test led to 6 months of unnecessary intravenous antibiotics. The FDA wants to clamp down on LDTs but is coming up against resistance from industry. Scientific American has published an informative article on this topic.
A test without a diagnostic accuracy study could be as good as tasting urine.
How can an LDT be recognized? First, LDTs are usually not covered by insurance, and the patient will be asked to pay out of pocket. Second, the report is required to include a disclaimer that the test is not FDA approved. Assuming the laboratory is compliant with this rule (often it is not), the disclaimer is usually found in very small print at the bottom of the report.
2. Is the test ordered in the appropriate patient?
In addition to determining the accuracy of a test, a test must be ordered in the appropriate patient. In fact, when testing for a rare disease or testing a patient unlikely to have the disease, even a highly accurate test is going to produce a lot of false positives.
This is a critical concept in medical testing, so let me demonstrate the point with an example. Suppose you have a friend who was diagnosed with Cushing's Disease, a rare condition in which too much cortisol is produced, causing obesity, diabetes, and hypertension. You have none of these symptoms, but you wish to "be proactive" and rule it out, thinking, "what's the harm?"
You are informed there is a test for Cushing’s that is 99% accurate. You take the test and it comes back positive. What is the likelihood it is a true positive? (In statistics we call this the PPV or positive predictive value.)
The answer is: 0.1%. Yes, despite the test reading "POSITIVE", there is only a one in a thousand chance you actually have the disease. Surprised? This is because your baseline risk for Cushing's Disease is one in a hundred thousand (the prevalence of the disease in the US population). A test for Cushing's that is 99% accurate only brings you two orders of magnitude closer to having the disease: from one in a hundred thousand to one in a thousand. Yet, an official-looking document says in bold print 'POSITIVE FOR CUSHING'S DISEASE'. Hopefully you can see how misleading this can be for doctors and patients alike. Ultimately, these false positives lead to a cascade of unnecessary medical testing, biopsies, and potentially dangerous interventions. Not to mention the psychological trauma.
On the other hand, if the test turned out to be negative, there is a 99.99999% chance it is a true negative, which means you really do not have Cushing's Disease. But we didn't need a test to tell us that a patient without symptoms doesn't have a rare disease, did we? The take-home message is that testing for rare conditions in healthy patients produces an unacceptable number of false positive results, even when the test itself is accurate.
Testing for rare conditions in healthy patients produces an unacceptable number of false positive results, even when the test itself is accurate.
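The 0.1% figure falls straight out of Bayes' theorem. Here is a short sketch of the arithmetic, assuming (as above) a test with 99% sensitivity and 99% specificity and a disease prevalence of one in a hundred thousand:

```python
prevalence = 1 / 100_000   # baseline risk of the rare disease
sensitivity = 0.99         # P(test positive | disease present)
specificity = 0.99         # P(test negative | disease absent)

# Split a hypothetical tested population into the four possible outcomes
true_pos  = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
true_neg  = specificity * (1 - prevalence)
false_neg = (1 - sensitivity) * prevalence

ppv = true_pos / (true_pos + false_pos)   # chance a positive result is real
npv = true_neg / (true_neg + false_neg)   # chance a negative result is real

print(f"PPV: {ppv:.4%}")   # about 0.1% -- nearly every positive is false
print(f"NPV: {npv:.7%}")   # about 99.99999% -- but you already knew that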
3. Is the test useful?
A diagnostic test can be both accurate and ordered in the appropriate patient, yet still not be useful. For a test to be useful, it must lead to an intervention, and the intervention must lead to a better health outcome. For example, Alzheimer's disease has very few effective treatments at present. This raises the question, "do you want to know you have Alzheimer's disease if you can't do much about it?" - a question not easily answered by most of us.
When it comes to cancer, we use screening tests to find cancer earlier so we can intervene earlier. We have long assumed that earlier is always better when it comes to cancer. However, preventive oncology is rife with accurate screening tests for cancer that turn out to be useless.
How do we know this? The usefulness of a test can be determined by a clinical utility study. The clinical utility study is simple in theory: it closely resembles a randomized controlled trial of a drug. A large group of subjects is randomly split in two. One group, "the diagnostic test group", is tested with the diagnostic test, and the other group, "the control group", is not. Positive test results in the diagnostic test group are acted upon and treated. The groups are followed for several years, and if the patient outcomes are superior in the diagnostic test group, the test is deemed to have clinical utility.
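To see how such a trial can find "more cancers, treated earlier" and yet "no fewer deaths", here is a back-of-the-envelope expected-count sketch. Every rate below is an illustrative assumption of mine, not a figure from any real trial:

```python
n = 100_000              # subjects per trial arm
p_aggressive = 0.002     # assumed rate of lethal, progressive cancer
p_indolent = 0.02        # assumed rate of non-progressive cancer
cure_if_screened = 0.2   # assumed fraction of lethal cancers caught in time

# Control arm: only aggressive cancers ever surface clinically
cancers_control = n * p_aggressive                           # ~200 found
deaths_control = n * p_aggressive                            # ~200 deaths

# Screened arm: indolent cancers are found and treated too (overdiagnosis)
cancers_screened = n * (p_aggressive + p_indolent)           # ~2,200 found
deaths_screened = n * p_aggressive * (1 - cure_if_screened)  # ~160 deaths

lives_saved = deaths_control - deaths_screened               # ~40
treated_per_life_saved = cancers_screened / lives_saved      # ~55

print(f"Extra cancers found by screening: {cancers_screened - cancers_control:.0f}")
print(f"Patients treated per life saved: {treated_per_life_saved:.0f}")
```

Under these made-up numbers, screening finds eleven times as many cancers, yet most of those treated could never have benefited, so dozens of patients are treated for every life saved. Real trials measure exactly this gap.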
The PSA blood test for the diagnosis of prostate cancer is a prime example of a test that looked great until we did the clinical utility studies. The test was developed in the 1980's and at first glance seemed to be a great test. It is relatively accurate and finds prostate cancer well before it is detected on a rectal exam. Furthermore, prostate cancer is a common and deadly disease, and there are effective interventions (surgery and radiation). This is why there were widespread recommendations to screen all men over 50.
However, it all changed in 2009 after two large clinical utility trials were published: the PLCO trial in the US with 75,000 men, and the ERSPC trial in Europe with 162,000 men. In these trials many more prostate cancers were found in the PSA-screened group - as expected - and these cancers were treated, and treated early. However, there was only a very small improvement in the primary outcome - death from prostate cancer. In other words, despite finding more cancers and treating them earlier, the screened group died from prostate cancer at about the same rate as the control group. To put this in numbers: for every 50 treated cases of prostate cancer, only one patient's life was saved. The biopsies, surgeries, radiation treatments, proctitis, impotence, incontinence, and anxiety generated in the other 49 patients were all for naught. This leads us to the conclusion that a PSA test is far more likely to ruin your life than save your life.
A PSA test is far more likely to ruin your life than save your life.
There are many reasons for the apparent failure of PSA screening, the most important of which is non-progressive disease. From autopsy studies it appears that up to 50% of men in their seventies and eighties harbor prostate cancer but do not die from it. Why most prostate cancers behave this way while others kill is still unknown. Another potential explanation for the lack of improved outcomes is that early detection may not be early enough. The aggressive prostate cancers, the ones we really care about, may have already metastasized (spread) by the time they are detected by PSA. Although debate and disagreement regarding PSA screening continue, the American Cancer Society, the USPSTF, and even the American Urological Association have rescinded their recommendations for widespread screening. Currently, most of them say "discuss with your doctor whether screening is right for you".
It turns out that some of the tests that have been put through clinical utility studies, such as PSA for prostate cancer, ultrasound and CA-125 for ovarian cancer, and chest x-ray for lung cancer, have been a disappointment. Others, like mammography, colonoscopy, and CT of the lungs in smokers, can boast a significant reduction in cancer-related mortality. However, NO CANCER SCREENING TEST HAS EVER DEMONSTRATED A DECLINE IN OVERALL MORTALITY. Meaning that the effect of the screening test is so small, and there are so many other things that will kill you, it makes no difference in the grand scheme of your life.
No cancer screening test has ever demonstrated a decline in overall mortality.
So is earlier always better? Not necessarily. Sometimes earlier finds harmless forms of disease. Sometimes earlier is not early enough. And most of the time something else is going to kill you first.
The Popularity Paradox of screening
The paradox of screening tests is that the more false positives they generate, and the more benign the forms of the disease they pick up, the more popular they become. This is because they generate "survivor stories". For instance, a friend with a non-progressive form of prostate cancer tells another friend, "My doctor ordered this test, caught my disease early, and saved my life". What the popularity paradox means for you is that you must demand evidence that a test is useful. Don't use popularity, anecdotes, or FOMO (fear of missing out) as the driving force to get a new test.
We spent forty years and billions of dollars studying the accuracy and usefulness of one test, the PSA blood test, and we are still confused. Yet thousands of new tests enter the market every year, most of which have not been put through diagnostic accuracy studies, and none of which have been subjected to clinical utility studies. These are the tests promoted the most, and all claim to be useful. What is the likelihood they will prove useful after a gauntlet of accuracy and utility studies? Very low. We know this because the few tests that have proven useful sit atop a mountain of useless ones.
This doesn't leave us far off from our Dutch relatives of 400 years ago. We may find their diagnostic tests strange and laughable, yet many of our own modern diagnostic tests have not been objectively evaluated for accuracy or clinical utility. We've replaced the five senses with fancier machines, but 400 years from now someone will be laughing at us.
The take-home message is that bad tests far outnumber good ones, and bad tests can lead to physical or psychological harm. The only things standing between you and a bad test are reason, your pocketbook, and your insurance company (in this case the avarice of your insurance company aligns with the need for good evidence). Unfortunately, concierge medicine patients are not impeded by these financial obstacles. Therefore, if you or your doctor orders a test, especially a test not covered by insurance, you must bring reason to bear on the decision. Ask the following questions:
Has the accuracy of the test been formally determined?
Are you testing for a rare disease, or a disease for which you are at low risk?
Have clinical utility studies proven the test leads to better health outcomes?