Hackademics II: The Hackers


Episode metadata

Show notes > One scientist decided to put the entire field of psychology to the test to see how many of its findings hold up to scrutiny. At the same time, he had scientists bet on the success rate of their own field. We look at the surprising paradoxes of humans being human, trying to learn about humans, and the elusive knowledge of human nature. Guest voices include Brian Nosek of the Center for Open Science, Andrew Gelman of Columbia University, Deborah Mayo of Virginia Tech, and Matthew Makel of Duke TiP. A philosophical take on the replication crisis in the sciences.

Episode AI notes

  1. Prioritizing replication over personal career incentives is crucial in academia to ensure the validity of research findings.
  2. Reproducibility is a core value of science, and scientific claims become credible through independent replication.
  3. Non-replication does not show the original finding was wrong; it shows that the inference rule used to reach it was less reliable than assumed.
  4. Replication plays a significant role in verifying the validity of study findings by involving larger samples, better measurements, and controls.
  5. Challenges in replication studies and effect sizes are observed across various human sciences fields, leading to issues with research standards and reliability.
  6. Selective reporting practices such as P-hacking and the file drawer effect undermine the integrity of research findings by manipulating data and selectively publishing successful studies.
  7. Questionable practices like optional stopping and data peeking blur the line between cheating and manipulating results in research, introducing bias and undermining validity.
  8. Awareness of researcher degrees of freedom in data analysis is crucial to combat biased results and ensure rigorous testing methodology.
  9. Balancing being right and not being wrong in research practices involves evolving towards more rigorous methodologies like pre-registering studies and implementing statistics effectively.
  10. Access Deborah Mayo’s blog at errorstatistics.com, Andrew Gelman’s at AndrewGelman.com, and the Center for Open Science at COS.io for additional resources and information.

Snips

[03:39] The conflicting motivations of publishing or replication

🎧 Play snip - 2min (01:44 - 03:48)

✨ Summary

The study showed that individuals on the extreme ends of the political spectrum were literally worse at identifying shades of gray, seeing the world in more black-and-white terms. Despite the striking initial finding, the researchers faced a dilemma: career incentives pushed them to publish the original study as it stood, while scientific rigor called for a replication to validate the result. They chose to run the replication, which yielded no significant result, showcasing the conflict between publishing for career success and ensuring the reliability of scientific findings.

📚 Transcript

Speaker 2

And what he found, astonishingly, was that people who were further out on the left and right were less able to identify accurately the shades of gray than people who are in the political center. So literally, people who are politically extreme are not able to see shades of gray. They see the world in more black and white terms. In academic psychology, this kind of finding is a big deal. We were like, oh my God, this is gonna make his career. As a senior grad student, he's getting close to being on the market. He is really enthused. We are both really enthused. The whole lab is very enthused, and we say, well, okay, that is a crazy finding, but totally cool.

Speaker 1

I mean, think about it. They already had the best title for their paper, 50 Shades of Gray. That book was an enormous bestseller at the time. And it's an age of extremist politics in America. So you're basically just handing print media and science bloggers clickbait on a platter.

Speaker 2

The easy thing to do, which would be playing into the incentives of what is at stake for Matt and his career, what's at stake for my career, is to have taken that initial finding and published just that. Not bother with doing a replication, because why would we do a replication? The only thing that we can do by doing a replication is lose this golden nugget.

Speaker 1

There's no requirement of researchers in psychology that before they publish a finding, they have to run the study again to make sure it works again. The original study had almost 2,000 participants too, so it's not like they had a small sample. And last week I talked about this rule of statistical significance, P less than 0.05, as the standard for drawing a conclusion from your study. Nosek and his student, Matt Motyl, got P less than 0.01. By any of the standards at the time, they did fine.

Speaker 2

And so we ran another study, did it again, and we got nothing.

[05:17] The Credibility of Scientific Claims

🎧 Play snip - 1min (04:05 - 05:02)

✨ Summary

Scientific claims gain credibility through reproducibility, where independent verification using the same methodology leads to the same results. If a claim cannot be replicated, its credibility diminishes. This principle is fundamental to the essence of scientific practice. In contrast, much of what is sensationalized in popular science media is often single-study findings supported by statistical significance, which may not hold true upon further investigation.

📚 Transcript

Speaker 2

A key aspect of a scientific claim becoming credible is that you don't have to rely on me, the originator of that finding, to say it's true. A scientific claim becomes credible because you can reproduce it. An independent person can follow the same methodology and obtain that result. And if that doesn't happen, claims that are supposed to be scientific claims become less credible. It's just a core value of what makes science science. In the daily practice of science, it isn't often how research is done.

Speaker 1

It's not how research is done. The way that research is done is that you publish the 50 Shades of Grey finding. And then enter the pantheon of cool psychological findings. A lot of what you hear reported in the popular science press is like this. Not everything, but a lot. A flashy finding supported by a single study based on statistical significance.

[13:31] Understanding the Implications of Non-Replication in Studies

🎧 Play snip - 1min (12:00 - 13:09)

✨ Summary

A failure to replicate in a study does not necessarily mean the original study was false. It could indicate flaws in the replicating study, making the first study accurate. It might also suggest that the initial finding only applies in specific conditions and not more broadly. Overall, non-replication serves as a cautionary signal that one’s confidence in the original findings may be overstated, highlighting the inaccuracy of the inference-making rule employed initially.

📚 Transcript

Speaker 5

What does a failure to replicate even mean? It doesn't mean that your earlier study is false.

Speaker 1

It could mean that the replication was the bad study, and the first study was a good one. That's possible. It could mean that your original finding just held in a very, very narrow circumstance in the world, and that in broader circumstances, it's not going to hold. That's possible too. These are all legitimate questions, but I think they miss the key takeaway. Non-replication doesn't show that you're wrong. It shows you didn't know that you were right in the first place. The point is that the rule that everyone settled on to make inferences about the world from your experiment is much less accurate than you thought it was. Much, much less accurate. You're way overconfident in your original findings, even if you still think they're true.

[18:20] Predicting Replication Outcomes in Psychology Studies

🎧 Play snip - 3min (15:13 - 17:53)

✨ Summary

Researchers conducted a study where participants bet on the replication outcomes of various psychology studies, finding that 71% of the time, the market successfully predicted the replication outcome. This result suggests that psychologists can identify which studies are likely to be hard to replicate, even if true. The study highlights the discrepancy between original studies and replication attempts, implying a need to reevaluate the significance of original findings when replication fails.

📚 Transcript

Speaker 1

So all this means is that there was no one study where there was a consensus view about whether it would or wouldn't replicate. So once people had finished betting, you could see how accurate they were at predicting which studies would replicate.

Speaker 2

If the price was anywhere over 50 cents, then it's predicting replication success. And across all outcomes, it was right on target: 71% of the time, the market successfully predicted the replication outcome. And so that was an amazing result to us, which was that you can actually predict these outcomes.

Speaker 1

What is going on? Psychologists are betting against their own standards for accepting studies. They're shorting themselves and winning. From the outside, one could very uncharitably state the finding of your paper as the finding that psychologists as a whole can detect the bullshit that's going on in their discipline. You don't have to respond.

Speaker 2

Yeah, it is a challenging result, for sure. But you can still have a charitable conclusion, which is that they can identify which ones are going to be hard to replicate, even though they're true.

Speaker 1

Brian Nosek is being very nice here. He could be right that the judgment people are making is that they won't find the effect in a replication, but it's still there. But what reason do people have for concluding that? Are the original studies so good that your conclusion should still stand even when a replication attempt fails? Andrew Gelman has an argument that this can't be right.

Speaker 2

The trick is that you do a mirror reflection.

Speaker 1

This is an excerpt from a talk he gave in Britain. He's asking us to think of what we would say if the replication came first, and the original study came second.

Speaker 2

In that case, what we have is a controlled pre-registered study consistent with no effect.

Speaker 3

Followed by a small sample study that was completely uncontrolled, where the researchers were allowed to look for anything, and they happened to find something that was statistically significant.

Speaker 1

The replications had larger samples, better measurements, and controls, and were specifically designed to find the effects that the original studies claimed to have found. If the people betting in the markets really thought that the effects were there, but the better designed studies would fail to detect them, they'd be putting a lot of weight on the original studies just because they came first.
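
A minimal sketch of the scoring rule behind the betting-market figure quoted earlier in this snip: a closing price above 50 cents counts as a prediction that the study will replicate, and those calls are then scored against the replication outcomes. The prices and outcomes below are invented for illustration; they are not the actual market data.

```python
# Illustrative only: made-up closing prices and outcomes, not the real market data.

def market_accuracy(prices, replicated):
    """Score the market: a price above $0.50 predicts successful replication."""
    predictions = [price > 0.50 for price in prices]
    hits = sum(pred == outcome for pred, outcome in zip(predictions, replicated))
    return hits / len(prices)

prices = [0.80, 0.30, 0.65, 0.20, 0.55, 0.45]          # hypothetical closing prices
replicated = [True, False, True, False, False, True]   # hypothetical replication outcomes

print(f"Market predicted the outcome {market_accuracy(prices, replicated):.0%} of the time")
```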

[20:03] Replicability and Reliability in Economic Research

🎧 Play snip - 1min (19:00 - 20:08)

✨ Summary

Economic research has a replicability rate of about 60% with significant p-values. However, like psychology, the effect sizes in replications are consistently smaller. When the field of economics bet on replication studies, it was accurate 75% of the time. Fields of science seem to doubt the reliability of their own standards, recognizing issues with the consensus standards for publishing research. Despite knowing this, individual research continues as before, leaving the public with little sense of which results can be independently verified. The focus on statistical significance with p-values less than .05 creates a facade of trustworthiness in publications, press releases, and media coverage, further misleading the public.

📚 Transcript

Speaker 1

People have now done it for economic research, which is about 60% replicable, based on P less than .05. But similarly, the effect sizes in the replications were all smaller, just like in psychology. And just like Nosek's betting markets, when the field of economics bet on replications, they were 75% accurate. Something is up. Entire fields of science are betting against the reliability of their own standards. They know as a whole that something is off with the consensus standards for publishing research, but individually, research like this continues. As a whole, these fields are pretty good at telling us which research can be independently verified and which can't. Then why don't they just tell us that? Why is there this whole song and dance around statistical significance, P less than .05, publications, press releases to the media, while the public turns out to believe all this stuff that they know can't be independently verified?

[23:41] The Statistical Significance Filter in Studies

🎧 Play snip - 2min (21:06 - 23:36)

✨ Summary

Studies based on statistics can exaggerate effect sizes due to the statistical significance filter. This filter causes a mathematical bias leading to overestimation of results. One example highlights the challenge of extracting accurate information from a small sample size, making it difficult to separate real effects from random chance. Statistical significance, often reliant on large differences, can render results misleading, as the confidence gained may be in noise rather than truth. Furthermore, in noisy studies, there is a significant risk that the actual effect might be contrary to the reported findings, even if deemed statistically significant.

📚 Transcript

Speaker 3

Statistics-based studies tend to overestimate effect sizes because of what we call the statistical significance filter, and that's just a mathematical bias.

Speaker 1

It's a subtle, but rather profound point that explains a lot of what Brian Nosek found. Imagine that there really are one percent of people in the country that would change their vote on Election Day if their favorite football team won the weekend before. How would we find this out? In reality, people aren't always honest if you ask them who they're going to vote for, and they're not always honest or even knowledgeable about why they're voting for the person they're voting for. But that's the best question you have to find out, so it'll have to be good enough. And you don't have enough time to talk to everyone in the country, so you can only pick a few hundred. This is an example of a noisy measurement and a small sample. A noisy measurement means you're going to get all of these differences in how people answer. That roughly picks out what you're looking for, but isn't very exact. When you combine all of this, now you have to compare two groups. Groups where their football team didn't win, and groups where their football team did win. When you compare those two groups and you found in your study a one percent difference between them, even though that's in reality the truth, that wouldn't be statistically significant because you wouldn't be able to separate that from random chance. Mathematically, the only way you would get significance is if you got a huge difference. Like you had a group where 10 percent of the people would change their vote because their football team won. If you saw that, you get statistical significance and you report it. But you'd be reporting an illusion. It would be a result of the sample you happened to pick out, or due to the noise in your measurement, or both.

Speaker 3

If you get statistical significance, you lost. Because you became very confident in something that is just noise.

Speaker 1

But it's even worse than that.

Speaker 3

If you have a noisy study, there can be a high chance that the true effect is in the opposite direction from where you think it is, even if it's statistically significant.
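
This significance-filter effect is easy to see in a small simulation. The numbers below are assumptions chosen only for illustration (a true difference of one percentage point, 10% vs. 11%, with 300 people per group, tested with a simple two-proportion z-test): the few runs that clear p < 0.05 report effects several times larger than the truth, and a portion of them point in the wrong direction.

```python
# Assumed numbers for illustration: true proportions 10% vs. 11% (a one-point
# effect), 300 people per group, a pooled two-proportion z-test, p < .05 cutoff.
# Only the "significant" runs get reported, and those estimates are inflated
# (Type M error) and sometimes have the wrong sign (Type S error).
import math
import random

random.seed(0)
P_CONTROL, P_TREATED, N, RUNS = 0.10, 0.11, 300, 10000

def two_prop_p(k1, k2, n):
    """Two-sided p-value for a pooled two-proportion z-test with equal group sizes."""
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = ((k2 - k1) / n) / se
    return math.erfc(abs(z) / math.sqrt(2))

significant = []   # estimated effects from the runs that cleared p < .05
for _ in range(RUNS):
    k_control = sum(random.random() < P_CONTROL for _ in range(N))
    k_treated = sum(random.random() < P_TREATED for _ in range(N))
    if two_prop_p(k_control, k_treated, N) < 0.05:
        significant.append((k_treated - k_control) / N)

exaggeration = sum(abs(e) for e in significant) / len(significant) / 0.01
wrong_sign = sum(e < 0 for e in significant) / len(significant)
print(f"significant in {len(significant) / RUNS:.1%} of runs")
print(f"'significant' estimates average {exaggeration:.1f}x the true one-point effect")
print(f"{wrong_sign:.0%} of the significant estimates point the wrong way")
```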

[28:17] Careerism and P-Hacking in Medical Trials

🎧 Play snip - 1min (27:13 - 28:26)

✨ Summary

Careerism can lead individuals to compromise research integrity by engaging in P-hacking techniques, such as selectively reporting data to support desired hypotheses. In medical trials, the issue of dropping out individuals who show improvement in the control group to ensure the treatment group performs significantly better is a common problem. This practice distorts the comparison between groups and undermines the credibility of study findings.

📚 Transcript

Speaker 1

Careerism incentivizes people to bend the rules.

Speaker 3

Here you are doing research and you want to make discoveries and your P value is like 0.1 instead of 0.05, that's too bad. So you can do things called P hacking. P hacking. P hacking. To get the P value less than 0.05. We all know that if you report data selectively, you will be able to find evidence for any hypothesis you want. So for example, you can throw away cases. Cherry picking. This happens a lot in the context of medical trials. It's a well-known problem that they like to drop out people who are getting better under the control.

Speaker 1

If it's your incentive to show that a drug works, you have to have the treatment group do significantly better than the control group. But if there's someone doing really well in the control group, that messes up your comparison. So you find some way to exclude them from the experiment. Maybe they weren't sick after all.
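
A toy simulation of the exclusion trick just described, with invented "improvement" scores and no real drug effect: both groups are drawn from the same distribution, and simply dropping the control patients who improved the most pushes the comparison toward significance.

```python
# Invented "improvement" scores: 50 treated and 50 control patients drawn from
# the SAME distribution, so the drug does nothing. Dropping the control patients
# who improved the most ("maybe they weren't sick after all") pushes the
# comparison toward statistical significance.
import math
import random
import statistics

random.seed(1)

def two_sided_p(a, b):
    """Approximate two-sided p-value for a difference in means (Welch z-test)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

treatment = [random.gauss(0, 1) for _ in range(50)]
control = [random.gauss(0, 1) for _ in range(50)]

print(f"honest comparison:            p = {two_sided_p(treatment, control):.3f}")

# Post hoc exclusion: quietly drop the 15 control patients with the best outcomes.
trimmed_control = sorted(control)[:-15]
print(f"after dropping 15 responders: p = {two_sided_p(treatment, trimmed_control):.3f}")
```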

[29:46] The File Drawer Effect in Research

🎧 Play snip - 1min (28:59 - 29:53)

✨ Summary

The file drawer effect refers to the practice of researchers selectively reporting successful outcomes while omitting unsuccessful ones. This can lead to an inflated perception of success because statistically significant results are more likely to be shared. Researchers may unintentionally contribute to this effect by ignoring failed attempts and only reporting positive results, giving a distorted view of the actual research process.

📚 Transcript

Speaker 4

What they'll do is let's say we tried the test once and we found that we couldn't reject the hypothesis. But then finally the third, the fourth time, we find something, and we ignore the cases that didn't show the result. And we only report the ones that did. That's typically what goes on. In fact, the probability of finding at least one statistically significant result by chance alone can be very high, depending on how hard you try.

Speaker 1

This is called the file drawer effect. A single researcher runs a study multiple times. All of the failed attempts they stick back into their file drawer. And the successful ones they end up publishing. From the outside it looks like you ran one study and got a significant result. When in reality you very selectively showed your wins and not your losses. The file drawer effect works at a large scale without any nefarious intentions from researchers.
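
The arithmetic behind "depending on how hard you try" is straightforward: under a true null, each attempt has a 5% chance of clearing p < .05 by luck alone, so the chance that at least one of k attempts lands in the publishable pile is 1 - 0.95^k. A few lines of Python make the point:

```python
# Under a true null, each attempt has a 5% chance of a false positive, so the
# chance that at least one of k attempts "works" is 1 - 0.95**k.
for attempts in (1, 3, 5, 10, 20):
    print(f"{attempts:2d} attempts -> {1 - 0.95 ** attempts:.0%} chance of at least one p < .05")
```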

[30:54] Optional stopping and data peeking in experimental science

🎧 Play snip - 1min (30:38 - 31:20)

✨ Summary

Optional stopping in experimental science is akin to changing the criteria or sample size of a study until desired results are obtained, leading to biased outcomes. This practice, also referred to as data peeking, can inflate the significance of results and undermine the reliability of the study findings.

📚 Transcript

Speaker 1

Optional stopping is like playing rock paper scissors with a friend. And then when you lose you say, hey, best two out of three. And then when you lose that you say, hey, best three out of five. And then when you finally win you get congratulated for a fair victory. In experimental science sometimes this is called data peeking. You run an analysis after studying a hundred people, but your results are just iffy, right on the borderline, not quite statistically significant. So then you add another 50 subjects and then boom, you get statistical significance. So you report that you ran a study of a hundred and fifty people and got a result.
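
A rough simulation of that exact scenario, under an assumed true null effect: test after 100 subjects per group, and if the result isn't significant, add 50 more per group and test again. Allowing yourself the second look pushes the false-positive rate above the advertised 5%.

```python
# Simulated true null: both groups come from the same distribution. Plan: test
# after 100 subjects per group; if not significant, add 50 more per group and
# test again, keeping whichever look came out best. The false-positive rate
# ends up above the nominal 5%.
import math
import random

random.seed(2)
RUNS = 10000

def p_value(a, b):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    n = len(a)
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

false_positives = 0
for _ in range(RUNS):
    g1 = [random.gauss(0, 1) for _ in range(100)]
    g2 = [random.gauss(0, 1) for _ in range(100)]
    if p_value(g1, g2) < 0.05:
        false_positives += 1
        continue
    g1 += [random.gauss(0, 1) for _ in range(50)]   # "add another 50 subjects"
    g2 += [random.gauss(0, 1) for _ in range(50)]   # and peek again
    if p_value(g1, g2) < 0.05:
        false_positives += 1

print(f"false-positive rate with one peek: {false_positives / RUNS:.1%} (nominal 5%)")
```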

[33:15] Analyzing Data and Thresholds for Statistical Significance in Studies

🎧 Play snip - 1min (32:00 - 33:09)

✨ Summary

Breaking up data into subgroups and shifting thresholds until something comes out significant is another researcher degree of freedom. One study that found no overall effect ended up reporting that women were more likely to wear red on cold days when ovulating, a conclusion reached only after splitting the data by day. Similarly, the choice of which comparison groups to use, such as which points on a rating scale to compare, can determine whether a result reaches statistical significance.

📚 Transcript

Speaker 3

Another thing you can do is break up your data. So we analyzed all people. We found this effect for men but not for women. Or just for women and not for men. There was a study that didn't find statistical significance. This was the study saying that women were more likely to wear red clothing during a certain time of the month. They looked and they found that they had data on two different days. And one day was a warm day and one day was a cold day. And it turned out there was a difference between the two days.

Speaker 1

And once you find a difference in the data when you break it up, you can then report, women are likely to wear red on cold days when they're ovulating. Even though originally you were trying to figure out whether they're likely to wear red in general when they're ovulating.

Speaker 3

You can change what your threshold is. There was the example of the study where the sociologist claimed that more beautiful parents were more likely to have girl babies. The attractiveness of the parents was rated on a one to five scale, and he compared the fives to the ones through fours. And he got statistical significance. But if he had compared the ones through threes to the fours and fives, he wouldn't have got it. Or the ones and twos to the threes, fours, and fives, he wouldn't have found it. He found the one comparison.
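
A quick simulation of the threshold trick on invented null data: parents rated 1 to 5, each child's sex a coin flip, so there is no real effect by construction. Trying every way of splitting "attractive" from "not attractive" and keeping whichever split works gives far more than a 5% chance of a spurious finding.

```python
# Simulated null data: attractiveness ratings 1-5 and child sex are independent
# by construction, yet trying all four ways to split "attractive" vs. "not"
# yields a spurious p < .05 far more often than the nominal 5%.
import math
import random

random.seed(3)

def two_prop_p(k1, n1, k2, n2):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    p1, p2, pooled = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 1.0 if se == 0 else math.erfc(abs(p2 - p1) / se / math.sqrt(2))

hits, SIMS, N = 0, 1000, 1000
for _ in range(SIMS):
    ratings = [random.randint(1, 5) for _ in range(N)]   # parent attractiveness, 1-5
    girls = [random.random() < 0.5 for _ in range(N)]    # child's sex: a fair coin flip
    for cut in (1, 2, 3, 4):   # cut=4 compares the fives against the ones through fours
        low = [g for r, g in zip(ratings, girls) if r <= cut]
        high = [g for r, g in zip(ratings, girls) if r > cut]
        if two_prop_p(sum(low), len(low), sum(high), len(high)) < 0.05:
            hits += 1
            break

print(f"At least one 'significant' split in {hits / SIMS:.0%} of null datasets (nominal 5%)")
```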

[35:47] Trouble with Researcher Degrees of Freedom

🎧 Play snip - 2min (35:20 - 36:56)

✨ Summary

Motivated reasoning can lead to significant results that are actually fragile due to researcher degrees of freedom. It is crucial to explore different decisions and their impact on results to ensure the findings are robust. Generating significant results consistently is essential to demonstrate a real effect. The challenge lies in accurately detecting genuine effects and ensuring that the test is reliable enough to confirm their presence.

📚 Transcript

Speaker 1

There's no dishonesty here. It's just a kind of motivated reasoning. To fight that, you have to consider the decisions you didn't make and how they would have affected your final results. And so Gelman has a solution to this problem, which people call the problem of researcher degrees of freedom. After you run your numbers for statistical significance, pretend you made different decisions and run the numbers for those decisions. Then you can see just how fragile your result turns out to be. We're not interested in isolated significance results.

Speaker 4

You have to show that you can generate significant results at will. You can generate results that rarely fail to be statistically significant to show that you have demonstrated an actual effect. This is the deeper problem.

Speaker 1

When you're trying to detect things, real things, real effects among human beings, and your particular test says, yeah, it's there. To be sure, or at least sure enough to tell everyone in the world, it can't be too easy for you to have failed to detect it. In all of these cases of single studies where people find significant results, the problem is never that the effects aren't there. Even if you're right and they are there, your test isn't good enough for you to know that.
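
One way to act on that suggestion is a multiverse-style robustness check. The sketch below uses invented data and invented analysis choices (an outlier cutoff and an age filter) purely to show the shape of the procedure: enumerate the decisions you could reasonably have made, recompute the p-value under each, and see how often the headline result survives.

```python
# Multiverse-style robustness sketch with invented data and invented analysis
# choices. The point is the procedure, not the particular numbers: rerun the
# same comparison under every reasonable combination of decisions and see how
# fragile the significance result is.
import math
import random
import statistics
from itertools import product

random.seed(4)

# Hypothetical dataset: an outcome score, a treatment flag, and an age per subject.
data = [{"score": random.gauss(0, 1), "treated": i % 2 == 0, "age": random.randint(18, 70)}
        for i in range(400)]

def p_value(treated_scores, control_scores):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(treated_scores) / len(treated_scores)
                   + statistics.variance(control_scores) / len(control_scores))
    z = (statistics.mean(treated_scores) - statistics.mean(control_scores)) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Decisions a researcher might plausibly have made differently.
outlier_cutoffs = [None, 2.0, 2.5]   # drop subjects with |score| above this, or keep everyone
age_filters = [None, 30, 40]         # restrict to subjects younger than this, or don't

p_values = []
for cutoff, max_age in product(outlier_cutoffs, age_filters):
    rows = [d for d in data
            if (cutoff is None or abs(d["score"]) <= cutoff)
            and (max_age is None or d["age"] < max_age)]
    treated = [d["score"] for d in rows if d["treated"]]
    control = [d["score"] for d in rows if not d["treated"]]
    p_values.append(p_value(treated, control))

print(f"{sum(p < 0.05 for p in p_values)} of {len(p_values)} analysis paths give p < .05")
print(f"p-values range from {min(p_values):.3f} to {max(p_values):.3f}")
```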

[39:02] Critique of Bem’s Research Practices and Changing Field Standards

🎧 Play snip - 1min (37:44 - 38:33)

✨ Summary

Skeptics argue that Bem’s research merely showcased researcher degrees of freedom, since he varied participant numbers based on significance levels. This practice, common in the field for decades, is being revamped with the rise of the Center for Open Science and its emphasis on pre-registering studies and a more informed use of statistics.

📚 Transcript

Speaker 2

The general consensus of the skeptics of the field is that what Bem really demonstrated was researcher degrees of freedom. Because in his multiple experiments in the 2011 paper, they have different numbers of participants. Why? Was it that he collected 50, then looked, saw he got something significant, and stopped? And then if he collected 50 and it wasn't significant yet, did he say let's collect 25 more and then peek again? I think he was doing what has largely been the accepted practice of the field for decades.

Speaker 1

These practices are changing with the advances at the Center for Open Science, and the move towards pre-registering studies and a more knowledgeable application of statistics.