Why not order lots of medical tests?
Some patients ask me what the harm is in ordering tests. Cost aside, they feel that ‘more information is better’. The counterpoint is that ‘bad information is worse than no information at all’ - let’s try to shed some light on this argument.
(If you want the quick version, skip to the summary at the bottom)
First, some diagnostic tests from the past...
Instead of painting religious scenes like their Catholic counterparts, seventeenth century Protestant artists painted insightful scenes of everyday life. ‘The Doctor’s visit’ was a popular scene to depict. Above is a painting by Dutch artist Jan Steen. The swooning lady in the chair is a patient being visited at home by her doctor. The doctor is trying to diagnose her condition using the technology of the day. He is taking her pulse, and next to him, an attendant is carrying a flask of urine. He will use the urine to perform uroscopy. The color, smell, and yes, even the taste of the urine, will help him diagnose her medical problem. A “clear pale lemon color leaning toward off-white, having a cloud on its surface”, for instance, indicated pregnancy.
A closer look at the foot of the chair reveals an adjacent bowl with a burned ribbon. This was another pregnancy test of the day. A ribbon was soaked in her urine and burned. If the smell nauseated her, she was pregnant.
The boy in the foreground playing with a bow and arrow, resembling Cupid, hints at the real diagnosis. The doctor has it all wrong. The lady is suffering from love sickness - an insightful critique of seventeenth century diagnostic medicine by Jan Steen.
Have we made any progress? Are our contemporary diagnostic and screening tests any better? Fortunately, the scientific method has given us tools to assess diagnostic tests. Let’s explore three critical questions that must be ascertained before a test is ordered.
1. Is the test accurate?
In order to determine if a test is useful, we first need to determine its accuracy. Simply put - the less accurate a test is, the more false positives and false negatives it produces. Evaluating accuracy is mostly straightforward.
For example, Dr. Ning Fanggang, a prominent surgeon in Beijing, is skeptical of the famous ‘pulse diagnosis’ used by practitioners of Traditional Chinese Medicine (TCM). He is offering a cash reward to any TCM practitioner who can accurately diagnose pregnancy in a woman by feeling her pulse alone. Intuitively, pregnancy should be easy for a TCM practitioner to determine; for all the subtle and miraculous things TCM claims to accomplish, diagnosing such a major physiologic change should be a piece of cake.
Dr. Ning tests their accuracy with a diagnostic accuracy study, in which a ‘gold standard’ is compared with the test in question - pulse diagnosis. Forty women are chosen; some are pregnant and some are not. Pregnancy is first determined by the ‘gold standard’ - a blood test, an ultrasound, and a visibly showing belly. Then, a blindfolded TCM practitioner assesses the pregnancy status of the 40 women by feeling their pulses alone. The accuracy of pulse diagnosis simply depends on the number of women he assesses correctly. Random guessing would yield 50% accuracy on average (20 out of 40 correct). Dr. Ning requires 80% (32 out of 40 correct) or greater to qualify for the reward. Thus far, one TCM practitioner has failed the challenge and there have been no other takers.
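Why set the bar at 80% rather than just above chance? A short Python sketch (my own illustration, not from the challenge itself) shows how unlikely it is to clear that bar by guessing alone:

```python
from math import comb

# Probability that pure guessing (p = 0.5 per woman) gets at least
# k women correct out of n = 40 - i.e., passes the bar by luck alone.
def p_at_least(k, n=40, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of hitting the 80% bar (32 of 40) by guessing: roughly 1 in 10,000.
print(p_at_least(32))
```

With 40 subjects, a guesser passes the 80% threshold about once in ten thousand attempts, so a pass would be strong evidence of real diagnostic skill - while a 50% bar would be cleared by a coin flip half the time.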
Like drugs, diagnostic and screening tests are regulated by the FDA. If a new test passes a diagnostic accuracy study, it can be FDA approved. However, there is a loophole: tests not formally studied can still be offered and sold by laboratories under the category of LDTs (Laboratory Developed Tests). Without a diagnostic accuracy study, an LDT is as good as shooting in the dark or burning a urine-soaked ribbon.
How can an LDT be recognized? It may be difficult, but there are clues: LDTs are usually not covered by insurance, so the patient will be asked to pay out of pocket. And when the results come back, reports are required to include a warning that the test is not FDA approved.
In a paper from 2015, the FDA documented 20 case studies of LDTs proven to be inaccurate but still sold by laboratories. These include tests for Lyme disease, mercury, autism, fibromyalgia, cancer screening, and genetic screening. The FDA wants to clamp down on LDTs but is coming up against resistance from industry. An informative article on this topic in Scientific American can be found here.
2. Is the test ordered in the right patient?
In addition to determining the accuracy of a test, we must make sure it is ordered in the appropriate patient. If we test for rare diseases in patients who are at low risk of disease, we will produce false positive results.
For instance, a patient has a friend recently diagnosed with Cushing’s Disease - a condition in which excess cortisol production causes obesity, diabetes, and hypertension. The patient himself has no signs or symptoms but wants the test regardless. He wants to “be proactive”, and asks “what’s the harm of a blood test anyway?”
To answer this question we need to know the prevalence of Cushing’s Disease in the general population. It is a rare condition - about one in a hundred thousand people in the US. Now let’s suppose we find a test for Cushing’s that is 99% accurate - much higher than our current tests. If the test is positive in our patient, what is the likelihood it is a true positive? This is known in statistics as the PPV, or positive predictive value.
Because of the very low prevalence of Cushing’s Disease, there is only about a 1 in 1,000 chance that a positive test result is actually a true positive in this patient. In other words, despite a positive result, there is an extremely low chance the patient actually has the disease. Yet, an official-looking document says in bold writing ‘POSITIVE FOR CUSHING’S DISEASE’. This will unfortunately lead to a cascade of unnecessary medical testing and potentially dangerous interventions.
If, on the other hand, the test turns out negative, there is a 99.99999% chance it is a true negative and the patient really does not have Cushing’s Disease. But we didn’t need the test to tell us that.
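The 1-in-1,000 figure follows directly from Bayes’ rule. Here is a minimal sketch in Python, assuming that “99% accurate” means 99% sensitivity and 99% specificity:

```python
# Bayes' rule for the Cushing's example.
# Assumption: "99% accurate" = 99% sensitivity and 99% specificity.
prevalence  = 1 / 100_000
sensitivity = 0.99   # P(test positive | disease)
specificity = 0.99   # P(test negative | no disease)

# Positive predictive value: of all positives, how many are real?
true_pos  = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

# Negative predictive value: of all negatives, how many are real?
true_neg  = specificity * (1 - prevalence)
false_neg = (1 - sensitivity) * prevalence
npv = true_neg / (true_neg + false_neg)

print(f"PPV = {ppv:.4%}")   # about 0.1% - a positive is almost surely false
print(f"NPV = {npv:.5%}")   # about 99.99999% - but we knew that before testing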
The take home message here is that testing for rare conditions, or testing healthy patients, produces an unacceptable number of false positive results - even with an accurate test.
3. Is the test useful?
A diagnostic test can be accurate and ordered in the correct patient but still not be useful. Ultimately, for a test to be useful it must lead to early treatment and better health outcomes. Outcomes can be measured in a clinical utility study. These studies are quite simple in theory and resemble a randomized controlled trial of a drug. A large group of subjects is randomly split in two: one group is tested (the diagnostic test group) and the other is not (the control group). Positive test results in the diagnostic test group are acted on and treated. The groups are followed for several years, and if patient outcomes are superior in the diagnostic test group, the test is deemed to have clinical utility.
The PSA blood test for the diagnosis of prostate cancer is a prime example of the need for clinical utility studies. Before such studies existed, there were widespread recommendations to screen all men over 50 with PSA. It made perfect sense: the PSA test is moderately accurate, and it tests for the most prevalent cancer in men. A positive test leads to a biopsy. If there are cancer cells on the biopsy, treatment will save a life. Cancer always kills, and finding it earlier is always better. Right?
After decades of follow up, two very large clinical utility trials finally yielded results in 2009: the PLCO trial in the US with 75,000 men, and the ERSPC trial in Europe with 162,000 men. As expected, many more cancers were found in the PSA-screened group than in the control group. These cancers were treated, and treated early; however, there was only a very small improvement in the primary outcome - prostate cancer related mortality. In other words, despite finding more cancers and treating them earlier, the screened group died from prostate cancer at about the same rate as the control group. To put it in numbers: for every 50 treated cases of prostate cancer, only one patient’s life was saved. The biopsies, surgeries, radiation treatments, proctitis, impotence, incontinence, and anxiety generated in the other 49 patients were all for naught.
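To see where a “cases treated per life saved” figure comes from, here is a sketch with hypothetical round numbers (illustration only - not the actual PLCO/ERSPC data):

```python
# Hypothetical round numbers for illustration only - NOT the actual trial data.
# Suppose screening finds 100 extra cancers per 10,000 men, and prostate
# cancer deaths drop from 30 to 28 per 10,000 men over the follow-up period.
extra_cases_treated = 100   # additional cancers found and treated per 10,000
deaths_control      = 30    # prostate cancer deaths per 10,000, unscreened
deaths_screened     = 28    # prostate cancer deaths per 10,000, screened

deaths_averted = deaths_control - deaths_screened     # 2 per 10,000
cases_per_life_saved = extra_cases_treated / deaths_averted
print(cases_per_life_saved)   # 50 treated cases for each life saved
```

The point of the arithmetic: even a real mortality benefit can be tiny relative to the number of men who undergo treatment, which is why the per-case harms loom so large.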
There are many reasons for the apparent failure of PSA screening, the most important of which is non-progressive disease. From autopsy studies, it appears that up to 50% of men in their seventies and eighties harbor prostate cancer but do not die from it. Why some prostate cancers behave this way while others kill is still unknown. Another potential explanation for the lack of improved outcomes is that early treatment may not be early enough: the aggressive prostate cancers may already have metastasized (spread) by the time they are found with PSA. Although debate and disagreement regarding PSA screening continue, the American Cancer Society, the USPSTF, and even the American Urological Association rescinded their recommendations for widespread screening. Currently, most of them say “discuss with your doctor whether screening is right for you”.
Despite the need, most tests in use today have not been subjected to large clinical utility studies. Such studies are very expensive, take years or decades to complete, and make it difficult to ensure subject compliance. Ironically, newly introduced tests, which certainly have not had the chance to undergo a long clinical utility study, are mostly presented with ‘enormous excitement’. The CancerSEEK blood test that has been all over the news this week is a prime example.
The few tests that have been put through clinical utility studies - such as PSA for prostate cancer, ultrasound and CA-125 for ovarian cancer, and chest x-ray for lung cancer - have been a disappointment. Mammography in women over 50, colonoscopy, and CT of the lungs in smokers are exceptions, boasting a significant reduction in cancer related mortality. However, it is very important to understand that NO CANCER SCREENING TEST HAS EVER RESULTED IN A DECLINE IN OVERALL MORTALITY. Meaning, the effect is small, and there are so many other things that will kill you, it’s like blowing in the wind.
The take home message is - sometimes earlier is not better. Sometimes earlier finds harmless forms of disease. Sometimes earlier is not early enough. And sometimes it’s something else that’s going to kill you.
This topic of overdiagnosis and overtreatment in cancer is covered extremely well in the following lectures, in which Otis Brawley, chief medical officer of the American Cancer Society, and Gilbert Welch of Dartmouth discuss overdiagnosis in mammography.
The Popularity Paradox
Doctors and patients get excited over new tests. Many of these tests produce false positive results or diagnose diseases that were never going to cause harm. Unaware of this, patients tell survivor stories: ‘I got this new test, my doctor caught the disease early, and it saved my life’. Paradoxically, the inaccurate or useless test becomes popular, which leads to more testing, more survivor stories, and so on. This is known as the popularity paradox of screening tests.
Summary of the three questions
There are some diagnostic tests proven to be useful, some that have not been, and others that may lead to dangerous interventions. It can be very difficult to distinguish good tests from bad tests. Moreover, bad tests outnumber good tests and are made more popular by the popularity paradox. The only things standing between you and ordering bad tests are reason, your pocketbook, and yes, your insurance company. In this case, the avarice of your insurance company and the need for good evidence happen to be aligned. Therefore, if you or your doctor order a test, especially an uncovered test, you must ask the following questions:
Has accuracy of the test been formally determined?
Are you testing for a rare disease or are you at low risk for the disease in question?
Have clinical utility studies proven the test leads to better health outcomes?
Folk medicine is replete with diagnostic tests that seem strange and laughable knowing what we know now. However, many of our modern diagnostic tests have not been objectively evaluated for accuracy and clinical utility, and when they are, there is a surprisingly high failure rate. We may have replaced sight, smell, touch, and taste with fancier machines - but some day someone may be laughing at us.