
Why you shouldn't be afraid of playing with someone else's data - using existing public datasets to answer your own research questions

  • njtruter
  • Mar 27, 2024
  • 3 min read

Updated: Apr 17, 2024

As researchers, we typically generate hypotheses based on findings from our previous research and by trawling through literature describing the findings of others in our field. Rarely do we take the chance to dive directly into the datasets of other researchers or public databases. While the idea of delving into someone else's data may seem daunting, it presents a goldmine of opportunities for new ways of thinking and hypothesis generation.


Using multiple datasets relevant to your research question can broaden the scope of inquiry and spark new ideas for designing experiments or analyzing your own data. It is true that differences in experimental conditions limit how directly insights from different datasets can be compared, but such datasets still serve as an excellent starting point for hypothesis generation. Using more than one dataset also enables cross-validation of findings: observations that are consistent across independent datasets increase confidence in their accuracy. For instance, in trying to understand the molecular drivers of neuronal aging in Drosophila, we examined two datasets that characterised changes in the head proteome over the lifespan of the fly (link to study).


Additionally, we could relate the changes in the head proteome to the decline in climbing ability over time (a third dataset), even though the proteome studies did not measure climbing ability. This is possible because climbing ability declines in a relatively standard way as Drosophila age. Looking across multiple studies on a research topic (e.g. aging in Drosophila) and creatively interrogating their datasets therefore opens up new questions and hypotheses.
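
As a rough illustration of this kind of cross-validation, the sketch below keeps only the proteins that change in the same direction in two independent studies. It is a minimal example under stated assumptions, not the analysis from the study mentioned above: the column names (`protein`, `log2fc`), the threshold and all values are hypothetical toy data.

```python
import pandas as pd


def concordant_changes(study_a, study_b, min_abs_log2fc=1.0):
    """Return proteins that change in the same direction in both studies.

    Both inputs are assumed (hypothetically) to have a 'protein' column and
    a 'log2fc' column holding the old-vs-young log2 fold change.
    """
    # Inner join: only proteins measured in both studies are compared.
    merged = study_a.merge(study_b, on="protein", suffixes=("_a", "_b"))
    # Same sign in both studies means the product of the fold changes is positive.
    same_direction = (merged["log2fc_a"] * merged["log2fc_b"]) > 0
    # Require the change to be reasonably large in both studies.
    large_enough = (
        merged[["log2fc_a", "log2fc_b"]].abs() >= min_abs_log2fc
    ).all(axis=1)
    return merged[same_direction & large_enough].reset_index(drop=True)


# Toy values for illustration only; P1..P5 are placeholder identifiers.
study_a = pd.DataFrame(
    {"protein": ["P1", "P2", "P3", "P4"], "log2fc": [1.5, -2.0, 0.3, 1.2]}
)
study_b = pd.DataFrame(
    {"protein": ["P1", "P2", "P3", "P5"], "log2fc": [1.1, -1.4, -0.2, 2.0]}
)

shared = concordant_changes(study_a, study_b)
print(shared)
```

Proteins that survive this filter are candidates worth a closer look, since two independent experiments agree on them. With real data you would also want to account for each study's own statistics (for example adjusted p-values) rather than relying on fold change alone.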


Beyond running your own statistical analyses on public data, several existing tools enable hypothesis generation from public data. The following tools are great examples from the cancer research field:

  1. DepMap portal - for easily visualising which genes and phenotypes are associated with how dependent some cancers are on specific genes. Pro tip: dividing the cancers into lineages may reveal relationships that are otherwise not obvious.

  2. cBioPortal for TCGA patient populations - this tool enables exploration of patient data, which can be used to determine whether a gene of interest is significantly altered in a specific patient population. It has a steeper learning curve than DepMap, but since it offers direct insights into cancer patients, it is time well spent.

  3. CellMiner - for identifying different datasets and running univariate and multivariate analyses on them.


Finally, the recently established Hitchhikers AI community is enabling bench scientists to analyse public and proprietary datasets by developing software informed by the challenges these scientists face in data analysis and the use of artificial intelligence. This will enable you to use your own or custom public data for the specific analyses you want to perform. For example, this community published a package enabling scientists to apply basic machine learning techniques to DepMap data: https://www.hitchhikersai.org/ml-depmap. This package was used to identify a known relationship between MTAP deficiency and the MAT2A/PRMT5/RIOK1 axis by correlating how dependent a cell line is on MTAP with the abundance of the genes from the MAT2A/PRMT5/RIOK1 axis. The strength of this relationship was compared across tissues, which could inform a hypothesis about in which tissues, and eventually for which patients, targeting the MAT2A/PRMT5/RIOK1 axis would be effective.
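
To make the idea of comparing a relationship across tissues concrete, here is a minimal sketch using plain pandas. It does not use or reproduce the package linked above, whose interface is not described here. The column names (`lineage`, `expression`, `dependency`), the `min_lines` cut-off and every value are hypothetical toy data, not DepMap results.

```python
import pandas as pd


def correlation_by_lineage(df, x, y, min_lines=3):
    """Pearson correlation between columns x and y within each lineage.

    Assumes (hypothetically) one row per cell line and a 'lineage' column.
    Lineages with fewer than min_lines cell lines are skipped.
    """
    rows = []
    for lineage, group in df.groupby("lineage"):
        if len(group) < min_lines:
            continue  # too few cell lines for a meaningful correlation
        rows.append(
            {
                "lineage": lineage,
                "n": len(group),
                "pearson_r": group[x].corr(group[y]),
            }
        )
    result = pd.DataFrame(rows, columns=["lineage", "n", "pearson_r"])
    # Most negative correlations first.
    return result.sort_values("pearson_r").reset_index(drop=True)


# Toy table for illustration only: lineage names and values are made up.
cell_lines = pd.DataFrame(
    {
        "lineage": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
        "expression": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0],
        "dependency": [-0.2, -0.4, -0.6, -0.8, -0.5, -0.4, -0.6, -0.3, -0.1, -0.9],
    }
)

result = correlation_by_lineage(cell_lines, "expression", "dependency")
print(result)
```

Splitting by lineage before correlating is the point of the exercise: a relationship that is strong in one tissue can be diluted or hidden when all cell lines are pooled together. Correlations from a handful of cell lines are noisy, so any lineage that stands out should be treated as a hypothesis to test rather than a result.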


As we delve into our own research, we should take advantage of existing data by asking: "What public data exists in my field, and how can I use this data to inform my research hypothesis?"

 
 
 
