Evaluating Medical Claims
- May 2, 2020
"Creatine Increases Muscle Mass"
This is a claim made for the popular supplement creatine. Is it true? To evaluate it, we must put the claim into perspective. In our desperate pursuit of health, we generate a lot of claims. As a patient, you are only seeing the tip of an immense iceberg.
A Plethora of Claims
Creatine has dozens of other claims, including increased strength, improved cognition, and reduced depression. And creatine is just one of 90,000 supplements, each with dozens of its own claims. If we include lifestyle medicine (nutrition, exercise, sleep, etc.), pharmaceuticals, and surgery, there are literally millions of individual health claims.
A Simple Method to Evaluate Claims
As a patient and consumer of healthcare, it's important to understand that these claims can't all be true. In fact, most are untrue. In this post, we will give you a simple method that can be used to evaluate any claim. If you are interested in healthcare, you must first learn how to evaluate healthcare.

The Categories of Evidence
A claim is evaluated by examining the evidence behind it. The evidence in the field of medicine appears to be inaccessible to the layperson. It is not only highly technical but wrapped in its own insider language. Fortunately, medical evidence can be broken down into just a few broad categories, and evaluating a claim is as simple as identifying the category of evidence that supports it. Below are the categories. Please familiarize yourself with them:
In Vitro Experiments: Trying the intervention in a test tube ('in vitro' means 'in glass').
Animal Experiments: Trying the intervention in animals - 'rat studies'.
Mechanism of Action: Speculating how the intervention works: anti-inflammatory, increased circulation, etc.
Expert Opinion: In the absence of better evidence, the speculation of a single expert.
Anecdotes: A patient's informal trial of the intervention.
Case Series: A doctor's formal trial of the intervention on several patients.
Observational Studies: Trends in population data (e.g., people who consume processed meat have a higher rate of colon cancer).
Making an Intervention Look Like It Works
Although each one of these categories is vital in the medical discovery process, most of the time they fool us. They make an intervention look like it works when it does not. For example, we are fooled by anecdotes because patients get better from placebo effects and natural healing; we are fooled by test-tube experiments because it's easier to make an intervention work in a test tube than in an entire human; and we are fooled by observational studies because of confounding - association is not causation.
Cheap & Easy to Generate
These categories are not only misleading, but cheap and easy to generate. For example, ANY individual trying ANY intervention is ‘anecdotal evidence’. Millions of anecdotes are being generated this second. Likewise, there are millions of experts, millions of data sets, and millions of test-tube experiments.
Evidence That Proves Everything
With this combination of cheap, easy, and misleading, one can find evidence to support ANY claim. For example, if you look at the 90,000 supplements on the market, each and every one has an expert who supports it, dozens of testimonials (anecdotes), and a test-tube experiment demonstrating the 'mechanism'. This is indeed the greatest dilemma in our field: because our evidence can prove everything, it proves nothing.
The RCT: A New Category of Evidence
The field of medicine was in desperate need of a new category of evidence: a better test of a medical claim, engineered to account for all the flaws of the other categories of evidence - flaws that make an intervention look like it works when it does not. The result: the randomized, double-blind, placebo-controlled trial, or RCT for short. The first RCT was published in 1948, and currently there are 50,000 RCTs published per year.
What is an RCT?
The RCT is a very simple and direct test of an intervention. It is a trial in humans - not in test tubes or in animals. To avoid confounding, the subjects are randomly split into two comparable groups. The only difference between the groups is that one gets the intervention and the other does not. To prevent placebo effects and experimenter bias, the group not getting the intervention gets a 'placebo'. This 'blinds' everyone in the study. The subjects are then followed for a long period of time until REAL events happen, like death or heart attack, not just changes in biomarkers.
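To make the random split concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the made-up subjects, the 30% smoking rate, and the helper names `randomize` and `smoker_rate` do not come from any real trial. It only shows that shuffling people into two groups tends to spread a background factor such as smoking roughly evenly between them.

```python
import random


def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two equally sized groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def smoker_rate(group):
    """Fraction of a group that smokes."""
    return sum(1 for s in group if s["smoker"]) / len(group)


# Hypothetical population: 10,000 made-up subjects, about 30% of them smokers.
rng = random.Random(0)
subjects = [{"id": i, "smoker": rng.random() < 0.3} for i in range(10_000)]

intervention, placebo = randomize(subjects, seed=1)

# The two rates should come out close to each other, so smoking can no
# longer explain a difference in outcomes between the groups.
print(f"smokers in intervention group: {smoker_rate(intervention):.1%}")
print(f"smokers in placebo group:      {smoker_rate(placebo):.1%}")
```

The same balancing applies to factors nobody thought to measure, which is why random assignment guards against confounding.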
The Limitations of RCTs
There are four major limitations of RCTs:
Inadequate Funding: A well-done RCT can cost hundreds of millions of dollars.
Inadequate Blinding: Blinding is unfeasible for many interventions. How would you blind subjects in a trial for exercise or psychedelics?
Inadequate Duration: Testing healthy people for a prevention or longevity claim would require decades until events occur.
Inadequate Compliance: An issue in lifestyle medicine; for example, testing a vegan diet to prevent cancer would require several thousand people to stay compliant with that diet for decades.
Unable to Evaluate Most Claims
These limitations significantly restrict the number of claims that can be tested with an RCT, and leave us with millions of untested claims, supported instead by the other, less reliable categories of evidence.
Evaluating the Other Categories of Evidence
This raises the question: just how unreliable are the other categories of evidence? If somebody claims that a treatment works because "it works on rats", just how likely is it to work in humans? Out of a hundred claims based on rat studies, how many turned out to be true?
Evaluating how good the different categories of evidence are is the job of an exciting new field called meta-research, or research about research. It helps us evaluate the quality of the evidence that medical science generates.
Although most claims cannot be tested directly, we can use the RCT indirectly. By sifting through millions of claims and using the RCT as a gold standard, we can evaluate the performance of the other categories of evidence.
Meta-Research
The National Library of Medicine has been storing every published study in the field of biomedicine since the 1960s. There are now over 35 million papers in its database. Many of you are familiar with it, as it can be accessed online for free at PubMed. With this huge wealth of data to analyze, we can follow literally hundreds of thousands of medical claims from their inception to the RCT. To borrow an analogy from investing, this is like looking back at every startup company and determining what percentage made it.
Lessons From Meta-Research
Meta-research has generated several paradigm-shifting revelations. Their broad message is that we in medical science generate a lot of noise.
Everyone Is Right
In an analysis of 2,000,000 published studies, 96% had positive results. In other words, every time a scientist had an idea, 96% of the time they were right.
Academia’s Replication Crisis
Most of the animal studies and experiments in cells come from academia. When we try to replicate these experiments, less than 25% can be replicated. This is known as “the replication crisis” and is well documented in many fields of science.
Pharma’s Dismal Discovery Rate
Even when the preliminary research of animal studies and experiments in cells can be replicated, the likelihood of the intervention working in an RCT is still very low. How do we know this? As a regulated industry, the pharmaceutical industry is forced to use RCTs to confirm its ideas, and less than 1% of those ideas work. In Alzheimer's disease, for example, the industry found 140 drug candidates over the last 30 years. None of them worked in large RCTs. The industry spent $600 billion.
The big message here is that there is a lot of incentive to make the science look positive, to show that our interventions work. After all, no one ever won a Nobel Prize for showing that an intervention does NOT work. However, when we are forced to test our ideas with really good tests, most of them fail.
Why We Are Fooled
When a treatment appears to work in one person, such as a friend who improves after trying it, remember that up to 90% of common complaints get better on their own, and up to 35% of the response can be a placebo effect. When we look at large groups of people and spot trends, such as people who exercise having better health than those who do not, remember that the people who exercise also tend not to smoke, are wealthier, eat better, and are less likely to be obese. How do we know which factor(s) caused the better health? This problem is called confounding.
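Confounding is easier to see in a toy simulation. The Python sketch below is purely illustrative and every number in it is invented: wealth makes a simulated person both more likely to exercise and more likely to be healthy, while exercise itself is given no effect at all. Despite that, the raw comparison makes exercise look beneficial.

```python
import random

rng = random.Random(42)

# Each made-up person is a tuple: (wealthy, exercises, healthy).
people = []
for _ in range(100_000):
    wealthy = rng.random() < 0.5
    exercises = rng.random() < (0.7 if wealthy else 0.3)
    # Health depends ONLY on wealth here; exercise has no effect by design.
    healthy = rng.random() < (0.8 if wealthy else 0.5)
    people.append((wealthy, exercises, healthy))


def healthy_rate(rows):
    """Fraction of the given people who are healthy."""
    return sum(h for _, _, h in rows) / len(rows)


def gap(rows):
    """Healthy rate of exercisers minus healthy rate of non-exercisers."""
    exercisers = [r for r in rows if r[1]]
    non_exercisers = [r for r in rows if not r[1]]
    return healthy_rate(exercisers) - healthy_rate(non_exercisers)


overall_gap = gap(people)                               # everyone pooled
wealthy_gap = gap([p for p in people if p[0]])          # wealthy people only
poor_gap = gap([p for p in people if not p[0]])         # everyone else

print(f"pooled difference:         {overall_gap:+.3f}")
print(f"difference among wealthy:  {wealthy_gap:+.3f}")
print(f"difference among the rest: {poor_gap:+.3f}")
```

In the pooled data, exercisers look clearly healthier. Once people are compared only with others of the same wealth, the difference all but disappears, because wealth was driving both exercise and health. In real observational data we rarely know every such hidden factor, which is what makes these studies easy to misread.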
The Hierarchy of Evidence
From the above findings, we can establish a hierarchy of evidence, or an evidence pyramid (see diagram below). The pyramid shape highlights that most medical claims are based on the weakest and easiest kinds of evidence to produce, anecdotes and expert opinion, whereas very few medical claims are based on the strongest, most expensive evidence: an RCT, or multiple RCTs in a systematic review.

Anecdotes are unreliable - Although some major medical discoveries have been inspired by anecdotes, like Botox and Viagra, they are the exception, and they were verified with RCTs. Anecdotes can be a starting point in medical research, not the endpoint.
Expert opinion is unreliable - To be clear, the term "expert opinion" refers to an opinion derived from a single expert's clinical experience and physiologic reasoning, in the absence of stronger kinds of evidence, such as RCTs.
Studies from academia are unreliable -
Evaluating a Medical Claim
The process of approximating the likelihood that a medical claim is true is relatively simple: find the evidence for that claim and determine at what level of the pyramid that evidence sits. The lower it sits, the less likely the claim is to be true. (For a more detailed explanation on evaluating medical claims, check out my lecture on YouTube.)
For a medical professional, finding evidence and assessing what kind it is, is relatively easy. For a patient, this is far more challenging, and a perfect application for Artificial Intelligence.
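The lookup step can be sketched in a few lines of Python. This is only an illustration: the post places anecdotes and expert opinion at the bottom of the pyramid and RCTs and systematic reviews at the top, but the exact order of the middle levels in `EVIDENCE_LEVELS` is an assumption made here, and the function name `strongest_evidence` is hypothetical.

```python
# Weakest to strongest. The ends follow the pyramid described in this post;
# the ordering of the middle levels is an assumption for illustration.
EVIDENCE_LEVELS = [
    "anecdote",
    "expert opinion",
    "mechanism of action",
    "in vitro experiment",
    "animal experiment",
    "case series",
    "observational study",
    "rct",
    "systematic review of rcts",
]


def strongest_evidence(found):
    """Return the highest-ranked category among the evidence found for a claim."""
    categories = [c.strip().lower() for c in found]
    if not categories:
        raise ValueError("no evidence supplied for this claim")
    unknown = [c for c in categories if c not in EVIDENCE_LEVELS]
    if unknown:
        raise ValueError(f"unrecognized evidence categories: {unknown}")
    return max(categories, key=EVIDENCE_LEVELS.index)


# Hypothetical claim supported only by testimonials and rat studies.
best = strongest_evidence(["Anecdote", "Animal experiment"])
print(f"strongest evidence: {best} "
      f"(level {EVIDENCE_LEVELS.index(best) + 1} of {len(EVIDENCE_LEVELS)})")
```

A claim is judged by the best evidence behind it, so a pile of testimonials does not lift a claim above the level of its single strongest study.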
Using AI to Help You Evaluate Claims
Despite AI's reputation for being misleading and even hallucinating, if we learn to set the appropriate parameters and teach it HOW to think, it can be quite reliable. In AI lingo this is called "prompt engineering". With respect to healthcare, we must "prompt engineer" the AI with our lessons from meta-research, and direct it to look at the evidence, not expert opinion. Let me provide an example.
Suppose you want to find out if the popular supplement NAD+ will make you live longer. Asking AI, "Will NAD+ make me live longer?" will likely result in the response, "It looks promising". (Try it out for yourself.) This, of course, is the wrong answer, because you neglected to instruct it how to evaluate the claim. So it pools all the biased experts from all the blogs and podcasts selling the claim, who far outweigh the actual evidence from published studies.
Instead ask AI: "Taking into account the hierarchy of evidence, the replication crisis, and the pharmaceutical industry success rate of less than 1%… will NAD+ make me live longer?" AI will now be forced to assess the level of the evidence of NAD+, and calculate a probability of NAD+ working based on that evidence. Since most of the research is animal studies and cell experiments, it will correctly inform you that, "the likelihood NAD+ will make you live longer is very low".
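If you ask questions like this often, the framing can be reused. The Python sketch below simply prepends the same instruction to any question before you paste it into an AI tool. The names `PREAMBLE` and `build_prompt` are hypothetical, the layout of the prompt is an assumption, and the code does not call any AI service.

```python
# The framing from the example above, stored once so it can be reused.
PREAMBLE = (
    "Taking into account the hierarchy of evidence, the replication crisis, "
    "and the pharmaceutical industry success rate of less than 1%, "
    "answer the following question."
)


def build_prompt(question):
    """Combine the evidence-focused framing with the user's question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"{PREAMBLE}\n\n{question}"


prompt = build_prompt("Will NAD+ make me live longer?")
print(prompt)
```

The same wrapper works for any other claim, for example `build_prompt("Does creatine increase muscle mass?")`.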


