Science AMA Series: We’re the Reproducibility Project: Cancer Biology team. AUA!

Abstract

This month, as part of the Reproducibility Project: Cancer Biology (RP:CB), we published the results of our first five Replication Studies in the journal eLife. We plan to complete 20+ more replication studies, along with a final meta-analysis of all studies conducted for RP:CB. Ultimately, the goal is to investigate the overall rate of reproducibility in a sample of high-impact cancer biology publications and to identify practices that facilitate both reproducibility and an accurate and efficient accumulation of scientific knowledge.

The results of the first five of our Replication Studies were mixed; we found that achieving reproducibility is hard. Each of these Replication Studies has elicited varied interpretations about what differs and why; and, ultimately, the results suggest that establishing reproducibility requires iterative experimentation and discovery. All of our work is done transparently; we are openly sharing all of our data, materials, analysis code, and methods upon publication (https://osf.io/e81xl/wiki/home/).

Ask us anything about our findings, process, or what we hope to accomplish with this research. We will be back at 2pm EST.

Responding are:

  • Timothy Errington, Center for Open Science
  • Alexandria Denis, Center for Open Science
  • Nicole Perfito, Science Exchange
  • Rachel Tsui, Science Exchange

Members of the Center for Open Science team or the Science Exchange team may join us on their personal accounts to answer questions and participate in the discussion as well.

Edit: David Mellor, Center for Open Science and Courtney Soderberg, Center for Open Science will be participating on their personal accounts as well.

4:06 pm Edit: This was very enjoyable, thank you to everyone who participated! We're signing off for now but we'll be back over the next few hours if more questions and discussions arise

Hi

Thank you for participating in this AMA - your work is very important, especially in this space where we are dealing with people's lives. It is also a difficult, complex area to work in, and I admire all of you for tackling this controversial space, since your findings may have far-reaching implications for reputation, funding, and ultimately, clinical decision-making, patient outcomes, and mortality.

It must be hard sometimes to be seen as the "internal auditors for science", and to appear to be critical of the work of others who have gained reputation and standing for what they have done - for that, I want to salute you, and offer you assurance that what you are doing is indeed important - keep up the good work.

To my question: What's your view on the way research governance and funding is structured, which rewards first discoveries and exciting announcements of cures and breakthroughs, essentially a race to publish first, potentially at the expense of taking things slowly to ensure there is some internal testing prior to publication? How would you change the system, if it is changeable in the first place?

Thanks in advance!

mvea

Great questions. We agree that there's a lot of emphasis put on novelty and 'clean' results, even though research itself is nuanced and not nearly as binary as much of our published research hints at. That's the whole reason we are conducting this research. But, as you pointed out, with incentives that reward exciting, unexpected breakthroughs, it becomes a race to see who can obtain those outcomes first, which creates the 'publish or perish' culture we're in. This speaks to the current scientific culture, so any change is a cultural change and is not likely to occur overnight. But yes, we can change this system by shifting incentives, acknowledging that an important aspect of science is cumulative knowledge, and rewarding open, reproducible, and rigorous research. So a general answer is making the full research process transparent. Science depends on transparency in order for others to evaluate the research; right now, you only observe publishable outcomes, not the process, data, materials, or detailed methods. Some specifics for aligning incentives, so that what is good for science is also good for success as a scientist:

(1) TOP Guidelines: https://cos.io/top - journals can incentivize or require more transparency in what is published.

(2) Registered Reports: https://cos.io/rr - journals can conduct peer review in advance of the experimentation, so the emphasis is on what is the best question and design instead of on the outcome. This complements peer review that occurs after the results are known. We are using this publishing model with this project.

(3) Registration: Exposing what research has been conducted so that results that do not survive peer review are still discoverable. This also helps distinguish confirmatory research (hypothesis testing) from exploratory research (hypothesis generating). Both are vital to science, but it's critical to be clear which is which, because one cannot confidently generate and test a hypothesis with the same data (the short simulation after this list illustrates why). We used the Open Science Framework (https://osf.io/) in our project to expose all the data/methods/etc. associated with each replication.

(4) Acknowledge open practices: https://cos.io/our-services/open-science-badges/ - This is another means to reward open practices. Journals can provide such incentives, which currently do not exist. This is different from imposing change; rather, it acknowledges the behavior. One would expect these to be current practices, but they are not. This can provide another means to shift the incentives from rewarding outcomes to rewarding process and openness.
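
To make the parenthetical in point (3) concrete, here is a minimal simulation, entirely our own illustration rather than anything from the RP:CB protocols: we "explore" pure-noise data with many candidate comparisons, pick the most promising one, and then "confirm" it either on the same data that suggested it or on an independent dataset.

```python
# Illustrative simulation (an assumption of ours, not an RP:CB analysis) of why
# hypothesis generation and hypothesis testing should be kept separate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_candidates, n = 5000, 20, 15
same_data_hits, new_data_hits = 0, 0

for _ in range(n_sims):
    # Exploratory dataset: 20 candidate outcomes, none truly different from 0.
    explore = rng.normal(0.0, 1.0, size=(n_candidates, n))
    best = np.argmax(np.abs(explore.mean(axis=1)))   # hypothesis generated from the data
    # "Confirm" on the same data that suggested the hypothesis:
    if stats.ttest_1samp(explore[best], 0.0).pvalue < 0.05:
        same_data_hits += 1
    # Confirm on an independent dataset instead:
    fresh = rng.normal(0.0, 1.0, size=n)
    if stats.ttest_1samp(fresh, 0.0).pvalue < 0.05:
        new_data_hits += 1

print(f"False positive rate, same data: {same_data_hits / n_sims:.2f}")  # well above 0.05
print(f"False positive rate, new data:  {new_data_hits / n_sims:.2f}")   # ~0.05
```

The same-data false positive rate far exceeds the nominal 5%, which is why registration that labels work as confirmatory or exploratory matters.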


Hi all,

Thanks for doing this AMA! I'm a graduate student in psychology and have been studying some replication studies of classic psychology experiments. Although this is not your field, I wanted to ask your opinion on how much you think replication problems in science, not purely psychology or social sciences, relate to a lack of understanding of methodologies/statistical tests. How can we educate researchers to conduct better/more robust studies?

Austion66

I think there are a lot of different issues underlying low rates of replicability in science, of which statistics is one. For example, a few different papers (Button et al., 2013; Dumas-Mallet et al., 2017) have documented generally low levels of power in a number of different disciplines. While power is often mentioned in statistics classes, in my experience and discussions with researchers, the false negative issues are mainly discussed, while the inflated effect sizes and low positive predictive value of underpowered studies are rarely taught.
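
As a rough illustration of those two points (this is not an analysis from any of the Replication Studies, and all parameter values are made up), the short simulation below shows how low power both lowers the positive predictive value of significant results and inflates the effect sizes that happen to reach significance.

```python
# Minimal sketch: consequences of low power (winner's curse and low PPV).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_per_group, true_d, n_experiments=20000, prior_true=0.1, alpha_z=1.96):
    """Return PPV and mean estimated effect among significant true effects."""
    sig_true, sig_false, est_d = 0, 0, []
    for _ in range(n_experiments):
        is_true = rng.random() < prior_true          # is there a real effect at all?
        d = true_d if is_true else 0.0
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)
        diff = b.mean() - a.mean()
        se = np.sqrt(a.var(ddof=1)/n_per_group + b.var(ddof=1)/n_per_group)
        if abs(diff / se) > alpha_z:                 # crude z-test cutoff
            if is_true:
                sig_true += 1
                est_d.append(abs(diff))              # pooled SD is ~1, so diff ~ d
            else:
                sig_false += 1
    ppv = sig_true / max(sig_true + sig_false, 1)
    return ppv, float(np.mean(est_d))

for n in (10, 100):                                  # low vs. high power
    ppv, mean_d = simulate(n_per_group=n, true_d=0.5)
    print(f"n={n:>3}: PPV={ppv:.2f}, mean significant effect={mean_d:.2f} (true 0.5)")
```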

In terms of improving education, I think there are a few options. One is to have methods/statistics classes include modules specifically about reproducibility/replicability: the statistical factors that increase or decrease replication likelihood, how to interpret the results of replications, science as a cumulative model, etc. For example, this project is a collection of syllabi from researchers who have integrated these topics into methods courses. An alternative model would be to increase the availability and discoverability of free courses and learning materials that researchers could use for self-teaching on these topics.


Hi RP Project Team,

Sounds like interesting work. Not many people set out simply to test the tests. What sort of cancer studies have you replicated, or are you planning to replicate, specifically? From the sounds of it, reproducing results isn't easy even when you are trying to; does this have strong ramifications for the reliability of the data these studies produce?

Lastly, how on earth do you get funding for your research when literally every aspect is publicly available!?

Thanks,

Kal.

HerbziKal

Hi Kal, Thanks for your questions. When looking for studies to reproduce, we looked for high-impact cancer biology papers from 2010-2012 and excluded clinical trials, projects with extremely specialized equipment, and case study papers. See details of the paper selection method here (https://osf.io/e81xl/wiki/studies/). We still have about 25 more replication studies that are in progress and will be published later this year.

Reproducibility is difficult, but that doesn't mean that the data is not reliable. I think that these mixed results mean that we need to re-think how research is being conducted, and ensure that scientists are encouraged to share their raw data and protocols with their publication.

The RP:CB was funded by the Laura and John Arnold Foundation, which funds projects to promote transformational change. Since our goal is to change how science is being done, it's a fitting way to obtain funding!


It's a bit early in your study for this, but which funding sources tend to generate the most reproducible science? Of the original 20 studies you intend to repeat, what's the breakdown of who paid for them?

smackwagon

This is something we can look into at the end of the RP:CB study. We are planning on doing a meta-analysis of all the studies and this could be an interesting thing to correlate - funding sources and reproducibility. All of the original papers will list their funding sources, whether it is NIH, NSF, or a private foundation. Definitely something interesting to look into!


I'm probably arriving a bit too late to this AMA, but as I've been reading the questions and responses, one topic I don't see being covered is the incentives for sharing raw data. Many studies now generate tremendous amounts of data that, even after an initial publication, can still be used to generate more results and publications. What incentives are there to share raw data, or how do we create them for researchers who might feel that they are giving away data that took a lot of work to generate?

jrandym

That's an interesting question. For large data sets such as NGS sequencing results, having the raw data publicly available after an initial publication does allow other researchers to benefit from the generated data set. However, it has always been the case that scientists can and should build upon previous research. We feel that, in the long run, being open and transparent with data improves efficiency and enables scientific discovery.

Adding to this answer is David Mellor from the Center for Open Science. We have several incentive programs to reward researchers for sharing their data. One is our badging program that allows researchers to signal best practices by being awarded an Open Data (or Open Materials, or Preregistered) badge. These badges are an easy way for journals to encourage data sharing -- see this paper on the effect of open data badges on the rates of data sharing: journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002456. Another is the TOP Guidelines, which provide journals with different policies to increase the rates of data sharing: data disclosure statements, requirements to share data, or a policy of verifying shared data.


What have you found to be the main cause of irreproducibility? Is it an inability to identify the original reagents (lack of supplier info/unique ID/batch numbers etc)? Difficulty in replicating methodology in general (again due to poor reporting)? Or something more fundamental like poor initial experimental design or incorrect conclusion being drawn?

With so much hype around irreproducibility and things like 'bad antibodies' (see Nature), I'm keen to see how much the so-called reproducibility crisis falls at the feet of reagents vs human/experimental error.

BeardyNerd

Thanks for the question. This gets at the heart of what the project is aiming to understand: not just an estimate of reproducibility (using multiple measures), but also the barriers to conducting replications. One of the biggest hurdles in getting the protocols nailed down before peer review was definitely lack of unique identification of reagents, as well as lack of detail for the lab techniques. We always tried to ask the original authors for their input when those questions came up, but it points to one of the easy improvements we could make here, which is making these details more easily accessible to any scientist who would like to understand and follow up on published results. Another common issue in biomedical research (as well as other disciplines) is low-powered designs. We're designing our studies with sufficient power (80% minimum) to detect the original effect size estimate. In many cases that means we're using larger sample sizes (such as more animals) than the original design. We'll explore more of this at the end of the project when we aggregate all the results.
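
For readers unfamiliar with this kind of a priori power calculation, here is a minimal sketch of the general approach. The effect size is a hypothetical placeholder, not a value from any RP:CB Registered Report, and the actual protocols specify their own designs and statistical tests.

```python
# Hedged sketch: choose a per-group sample size giving at least 80% power
# to detect the effect size estimated in an original study (two-sample t-test).
from statsmodels.stats.power import TTestIndPower
import math

original_effect_size = 1.2   # hypothetical Cohen's d from an original experiment

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=original_effect_size,
                                    power=0.80, alpha=0.05,
                                    alternative='two-sided')
print(f"Animals needed per group: {math.ceil(n_per_group)}")

# Achieved power after rounding up to whole animals
achieved = analysis.power(effect_size=original_effect_size,
                          nobs1=math.ceil(n_per_group), alpha=0.05,
                          ratio=1.0, alternative='two-sided')
print(f"Achieved power with that n: {achieved:.2f}")
```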


What are the main challenges in achieving reproducibility, and are those challenges due to things that can easily be overcome, like a lack of standardization in the methods that researchers use?

Also, were there cases where high-profile studies made it to mainstream news media only to be stopped by a lack of reproducibility? Or have many reproducibility studies been done by the time it gets picked up?

Really appreciate the work that you do, and thanks for doing this AMA!

Nocteliv

Some of the main challenges are, as you say, lack of standardization in the methods that researchers use. We've also found that oftentimes the main (first) author is a post-doc or graduate student who is no longer at the institution, resulting in data loss. This can make it difficult to get assistance on protocol details from the original authors if you are trying to reproduce their study.

With regard to your question about mainstream news media, usually the news will report and highlight new findings. These new findings usually have not yet been reproduced outside the scope of the original paper. That's not to say their results will not be reproducible, of course, but it's highly likely that formal replication studies will not yet have been conducted.


I just want to thank your team for specifically focusing on replication studies. As I am sure you know, they are a vital component of the scientific method, but unfortunately somewhat of a thankless job. The next time I see a click-bait headline about the latest new study showing some incredible correlation between eating chocolate and whatever, I will take solace in knowing that people like you are still out there doing real science. Now a question...What kind of tools do you have to help account for differences in genetic profiles of the human subjects you are studying?

theshsims

Most of the studies included in these replications used mouse models and cell lines as part of the experiments, but a few of the replications underway now have samples from human patients. If one of the original papers used the genetic profile of the experimental samples, then we would also screen our samples in the same way. For example if samples were screened for a specific genetic mutation, we would also confirm that mutation in our samples.


What are some of the challenges to reproducing studies? In my field, sometimes methods are not 100% step by step meaning I'd have a hard time reproducing a study with exactly the same setup. Even small variables would likely impact results. How do you figure out how to even set up the experiment to reproduce it? Do you work with the original authors? And how do those authors feel about your work?

firedrops

One of the biggest hurdles in getting the protocols nailed down before peer review was definitely lack of detail for experimental procedures. We started with the original paper (just like anyone else reading the study), but always tried to ask the original authors for their input when questions came up. Making these details more easily accessible to any scientist who would like to understand and follow up on published results would greatly improve efficiency all around.

The authors also had a chance to weigh in on the planned experimental protocols during peer review before any experimental work began. We saw a range of responses. Some authors did not provide information (including data, method details, and unique materials), while others were willing to share information and data for the replications and in some cases shared materials such as cell lines, antibodies, etc. We'll explore this more at the end of the project when we aggregate the results of all the replications.

You also raised a key point: if a replication does not achieve the same outcome as the original study, it could be due to a subtle detail/variable not communicated in the paper (or by the authors). Therefore, if there's no reason to expect a different result, but a different result is obtained, then differences not known to be important become targets for hypothesizing and investigation. This can increase our understanding of the effect and the conditions needed to obtain it. This discovery is unlikely to occur without direct replication and transparent reporting.


How can reproducibility be improved? And how hard would this be to implement?

[deleted]

We think that improving reproducibility requires a shift in how research is being done. That is - we should try to incentivize scientists to be open with their data and results, even the negative results. We also need to change the infrastructure of scientific funding and the business models of scholarly publications. All of this requires coordination which can be challenging in the scientific community. However, with initiatives such as the RP:CB, we can help take the initial steps toward that change.


Hello, and thanks for coming here to answer questions.

In several of the RP:CB studies that have been published so far there were issues with the mouse model developing tumors faster/slower/differently to what was observed in the original publications (e.g. almost all mice developing tumors by 1 week in the PREX2 study). What do you think are the implications of this observation? On the one hand, this suggests that something different is clearly going on in the replication, so it can arguably be said that some aspects of the original aren’t reproduced in the replication. On the other hand, this can be interpreted as an indication that the replication doesn’t completely recapitulate the conditions of the original, in such a way that the critical observations are obscured. How do you interpret this, and is there any plan to look into the reasons behind the differences?

A big question that has come up after the Reproducibility Project: Psychology (RP:P) has been about how much various factors influence reproducibility, particularly how much of a factor differences in experimental setup/context are relative to so-called “questionable research practices”. In the discussion of these issues in psychology, some people have raised the possibility that replicators who are less familiar with the techniques may fail to replicate a result because they aren’t familiar with the specifics of the experimental technique or setup. In the context of biological laboratory experiments, an adaptation of this might be that it is not an issue of familiarity or skill per se, but that original authors may do initial work to validate the experimental model prior to running the experiment, while replicators may not. The issues with the mouse models in the RP:CB potentially suggest such an interpretation. How much credence would you give this interpretation? Is there any plan to adapt the procedure in future replications to account for potential biological variability or unforeseen differences between the studies?

Related to this, some of the issues that the RP:CB will discover may be specific to the types of experiments considered (e.g. animal models). Other types of research might be strongly influenced by other factors. For example, I believe another commenter mentioned issues with antibodies. I haven’t had a chance to look over all the studies that will eventually be replicated, but do you think there is any particular focus on certain areas of cancer biology?

What is the plan for quantifying “reproducibility” after the replications are completed? The RP:P used several methods, but the follow-up coverage seems to focus a lot on p < 0.05 in the replication, which I think can be problematic or misleading in some cases. How will studies like the PREX2 study be coded in terms of reproducibility when looking at predictors/potential causes?

I haven’t looked into this in depth, but it seems that in some of the replications the SDs of quantitative measurements (particularly, it seems, in treatment groups, but I may be misremembering) are much higher than the SDs reported in the original studies. Have you noticed anything similar to this, or is it just my faulty recollection? If so, what could possibly be causing this?

To be a bit tongue-in-cheek, it rarely seems to be said outright, but I think a lot of people implicitly want to know the answer to the following question: Do replications fail because the replicators botched them or because of “hidden moderators”, or is it really because the original authors are all p-hacking and only submitting positive results to journals (aka publication bias)? Do you have opinions on the answer to this question? Will the RP:CB help in answering it?

meta_study

These are great questions. All of the replicating labs that are participating in the project are part of the Science Exchange network (https://www.scienceexchange.com) and are composed of expert scientists at commercial research organizations, other academic labs, and biotech companies. For the most part, the studies involve very standard laboratory techniques such as western blots, standard small animal surgeries, blood collection, and cell culture techniques. Given that the original authors have gone through the trial and error of optimizing the protocols they use, in some ways the replicating lab's job should be easier if the protocols are well described. It's our opinion that the deviations of the models from the original paper are more likely the result of biological variation and factors that can potentially be explained with further experimental work.

To your broader question, there are many reasons why a replication might fail, including the ones you've listed, but also including a simple false positive of the original effect or a false negative of the replication. It's often very difficult to disentangle these possibilities with one replication. Just like one original study is rarely definitive, one replication is also rarely definitive; both are just pieces of evidence that we can combine with other studies to work towards a more cumulative understanding of the phenomena in question. To really start to tease apart the different options, we often need many more replications. The Many Labs-style replication projects (https://osf.io/wx7ck/) are an example of this approach. Because many different labs replicate the same effect, we get both a very precise estimate of the effect size and a statistical estimate of the heterogeneity of the effect size. If we see significant heterogeneity of an effect, this would indicate that there is a moderator of the effect that we may not know about, which could lead to failed replications.
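
As a rough illustration of the heterogeneity estimate mentioned above, the sketch below computes the standard Cochran's Q and I² statistics across a set of invented lab-level effect estimates; the numbers are placeholders, not Many Labs or RP:CB results.

```python
# Minimal sketch of a fixed-effect meta-analysis with a heterogeneity estimate.
import numpy as np

def heterogeneity(effects, std_errors):
    """Return pooled effect, Cochran's Q, and I-squared (%) across labs."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2    # inverse-variance weights
    pooled = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - pooled) ** 2)                      # Cochran's Q
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0    # % of variance beyond chance
    return pooled, q, i2

# Hypothetical effect estimates and standard errors from several replicating labs
effects = [0.42, 0.10, 0.55, 0.31, 0.05, 0.48]
ses     = [0.12, 0.15, 0.10, 0.14, 0.13, 0.11]
pooled, q, i2 = heterogeneity(effects, ses)
print(f"Pooled effect = {pooled:.2f}, Q = {q:.1f} on {len(effects)-1} df, I2 = {i2:.0f}%")
# A large I2 would point to unmeasured moderators of the effect.
```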

Regarding how to code whether a replication was a success or not, as you pointed out there are many ways to do this (statistical significance is one, which often gets utilized and over-emphasized). Others are looking at the effect size, the direction of the effect, a meta-analysis of the two effects, and subjective assessment. Just like RP:P, these will be explored at the end of the study, but as you pointed out, some of these early results already suggest there will likely be a qualitative element to this aggregate analysis due to the models behaving differently between the two studies. Again, since the aim of the project is to use each replication result as a single data point in an aggregate analysis, we would not be able to make any specific claims about specific studies (such as PREX2) unless one were to perform a Many Labs-style project as described above. But looking at factors across all studies (data, materials, and method sharing rates; sample sizes; etc.) will help explore what factors might influence reproducibility. So yes, we hope RP:CB will help with these questions, but these are big questions for one study to address. Nevertheless, the results will provide an initial empirical basis to evaluate reproducibility and may help guide the broader discussion about reproducibility toward areas of significant challenge, productive areas for further inquiry, and possible interventions for improvement.
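
To make the idea of multiple success criteria concrete, here is a small hypothetical example (the effect sizes are invented, not values from PREX2 or any other Replication Study) applying several such criteria to a single original/replication pair: significance, direction, an effect-size comparison against the original confidence interval, and a simple fixed-effect meta-analysis. Note how the criteria can disagree, which is exactly why no single one is definitive.

```python
# Sketch: several ways to score one original/replication pair (placeholder numbers).
import numpy as np

# Hypothetical standardized effect estimates and standard errors
orig_d, orig_se = 0.80, 0.30
rep_d,  rep_se  = 0.25, 0.15

# 1. Statistical significance of the replication (the over-used criterion)
rep_significant = abs(rep_d / rep_se) > 1.96

# 2. Same direction of effect
same_direction = np.sign(rep_d) == np.sign(orig_d)

# 3. Replication estimate inside the original 95% confidence interval
orig_ci = (orig_d - 1.96 * orig_se, orig_d + 1.96 * orig_se)
inside_original_ci = orig_ci[0] <= rep_d <= orig_ci[1]

# 4. Fixed-effect meta-analytic combination of the two estimates
w = np.array([1 / orig_se**2, 1 / rep_se**2])
combined = np.sum(w * np.array([orig_d, rep_d])) / np.sum(w)
combined_se = np.sqrt(1 / np.sum(w))
combined_significant = abs(combined / combined_se) > 1.96

print(f"Replication significant: {rep_significant}")
print(f"Same direction:          {same_direction}")
print(f"Within original 95% CI:  {inside_original_ci}")
print(f"Meta-analytic estimate:  {combined:.2f} +/- {1.96*combined_se:.2f} "
      f"(significant: {combined_significant})")
```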


Hi all,

I have mixed feelings about these types of reproducibility projects. While I can certainly see some benefit to them, I was hoping you could help to allay my biggest concerns:

  • In many ways, science is a self-correcting reproducibility project. When an interesting result is published, labs (academic and industrial) from around the globe will try to reproduce the result and see if they can add to the body of work with additional twists and experiments. Work that is reproducible rises to the top and becomes highly cited; work that is not typically flounders at lower citation scores. If the magnitude of irreproducibility is large, then you will often see authors put out a report contradicting the original finding. This system seems to work reasonably well, so why the need for a task force directly targeting reproducibility? What is gained by having a reproducibility task force specifically focussing on repeating studies?

  • I believe your report talks about this, but streamlining models is hard. Mice behave differently in different animal facilities, tumors grow differently depending on a number of difficult-to-control conditions, and the skill of technicians varies dramatically from location to location. So while I admire the concept of prospectively designing studies to only include a certain number of trials for any given experiment, don't you feel it is important to first establish that the model works before undertaking a replication study? Otherwise, it seems like the results of a failure-to-replicate scenario are difficult to interpret.

  • The danger of witch hunting and unwarranted tarnishing of people's reputations seems very real with a project like this. What steps do you take to protect against this?

Thanks!

SirT6

Thank you, these are great questions and they highlight some of the most common concerns that are voiced with RP:CB and projects like it.

  1. We agree science is self-correcting, but at what rate? Right now we design multiple experiments to tackle and probe a phenomenon to see whether it holds up or not. But there's an assumption that each individual experiment is reproducible (reliable) - do we actually test that? We're testing that with this project - how reproducible individual results are (direct replications) - as opposed to how likely a phenomenon is to hold up (conceptual replications and extension work). We propose that there is an efficiency problem: scientific progress is still being made, but not at the rate it could be. There are many reasons for this and not all of them are easily addressed; however, one recurring issue is that this self-correcting mechanism can become stalled by a lack of openness and data sharing. Science values openness and reproducibility, but they are not standard practice. For researchers to efficiently and accurately replicate and build on the work of others, they need to be able to access materials such as protocols and detailed methods of the original with minimal barriers. I think we were surprised at how difficult this process can be and how often we needed to reach out to the original authors to clarify, get additional information, or make assumptions when information wasn't available. This isn't to point fingers at specific papers, but to highlight that current norms don't lend themselves to efficiently reproducing others' work. These added hurdles take time and effort on the part of everyone involved. One of the goals of RP:CB is to identify such barriers to reproducibility and consider ways to remedy them.

Additionally, publication bias makes it harder for null results to get published: once an effect is found and published in the literature, it can be difficult to get null results for that same effect into print. This makes it hard to accumulate contradictory evidence, leading to a slower self-correcting model (the short simulation after this list illustrates the mechanism).

  2. The robustness of the animal model is definitely something that can affect whether a result is reproducible. And if the result differs between two studies, the follow-up experiments would then focus on the model, not the intervention. The details that you're referring to are often not shared in the original publication (specific control conditions that may influence take rates, for example, or sometimes the take rates themselves). Also, publications are accepted largely because of the results, without a full appreciation of the reliability of the model - that is, how the model behaves in different labs. So, yes, it's important to understand whether models are behaving the same between two studies. Nature wrote a nice editorial on the project with points related to this question: http://www.nature.com/news/replication-studies-offer-much-more-than-technical-details-1.21311

  3. While there's no 'norm' for performing replications, we've taken steps throughout the project to be mindful of how it could affect reputations. One of the steps we took was to try to include the original authors wherever we could. We reached out to corresponding authors when preparing the Registered Reports, which were our protocols for replicating their studies. Some participated and some declined for various reasons. During peer review of the Registered Reports, an original author was also invited to participate. Additionally, we have made a conscious effort to avoid any claims about whether our replications did or did not 'succeed'. Although frustrating to some (mostly reporters), whether a replication was a success (or not) is not easy to define, and hopefully this project can shed more light on how to address this question. Also, no single replication from this project, just like no original experiment or study, can provide conclusive evidence for or against an effect; rather, it’s the cumulative evidence that forms the foundation of scientific knowledge. You might also be interested in this paper looking at scientists' reputations with respect to replications: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002460
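
Returning to the publication-bias point under (1), the sketch below is a simple simulation of our own construction (not an RP:CB analysis, and all parameters are invented) showing how publishing only significant positive results inflates the effect size a reader of the literature sees, which is one reason self-correction can be slow.

```python
# Illustrative simulation of publication bias with a small true effect and low power.
import numpy as np

rng = np.random.default_rng(2)
true_d, n_per_group, n_studies = 0.2, 15, 5000   # small true effect, underpowered studies

published, all_estimates = [], []
for _ in range(n_studies):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1)/n_per_group + b.var(ddof=1)/n_per_group)
    all_estimates.append(diff)
    if diff / se > 1.96:              # only "positive, significant" results get published
        published.append(diff)

print(f"True effect:                   {true_d:.2f}")
print(f"Mean estimate, all studies:    {np.mean(all_estimates):.2f}")
print(f"Mean estimate, published only: {np.mean(published):.2f}")
print(f"Fraction of studies published: {len(published)/n_studies:.2f}")
```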


What is the biggest thing you've found out?

themightykunal

We have only published the first five of 20+ Replication Studies; but I would say that so far we've learned that reproducibility is hard and results are nuanced. Some of the biggest hurdles we have faced right at the start of the project were getting detailed protocols nailed down and filling in necessary information that was sometimes lacking in the original methods. I think that this points to an incentives problem -- scientists aren't currently incentivized to share data and detailed methods. Hopefully this project will open up the possibility for continued discussions about establishing norms and what level of detail is and isn't necessary to make available.


[removed]

[deleted]

1) In terms of direct replications (which I think is what you are referring to), there are two ways. We can compare this to how yeast replicate (sexual and asexual reproductive cycles). If a researcher in the same lab as the original study performs a replication of the original experiment, that is equivalent to the asexual reproductive cycle in yeast, while if an independent researcher performs a replication, that would be equivalent to the sexual reproductive cycle.

2) We're quite interested in understanding how long it takes to perform replications. We'll explore this in more detail at the end of the project when we analyze all of the aggregated replications. We started this project at the end of 2013, and it took over 3 years to gather the necessary information, undergo peer review, perform the experiments, write up the results, and have those papers undergo peer review, which is what we published last month. Hopefully, as we make data/methods/material sharing more normative practices of science, the cycle can become faster.

3) This is a good question. For this project we are performing only one replication per study. However, if we had multiple labs performing replications at the same time, then indeed we'd arrive at multiple replication results. An example of this is the Many Labs 1 project (https://osf.io/wx7ck/). There, instead of twins, it produced trigintasextuplets!



License

This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.