Making Common Fund data more findable: Catalyzing a Data Ecosystem


Abstract

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables users to discover datasets from across the U.S. National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. The CFDE’s federation system is centered on a catalog that ingests metadata from individual Common Fund Programs’ Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This uniform Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by the individual DCCs and is designed to enable easy expansion to accommodate new data types. We describe its use to ingest and index data from ten DCCs.
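
To make the federation model concrete, here is a minimal sketch of what a C2M2-style submission could look like: entity metadata as TSV tables bundled with a JSON descriptor. The table name, columns, and values below are illustrative assumptions, not the actual C2M2 specification.

```python
# Hypothetical sketch of a C2M2-style submission package: one entity table
# as TSV plus a Frictionless-style JSON descriptor. Names and columns are
# illustrative, not the real C2M2 schema.
import csv
import json
from pathlib import Path

pkg = Path("example_submission")
pkg.mkdir(exist_ok=True)

# One row of a hypothetical "biosample" entity table, identified by a
# namespace-qualified ID and annotated with ontology CURIEs.
rows = [
    {"id_namespace": "example.dcc.org", "local_id": "BS-0001",
     "anatomy": "UBERON:0002107", "assay_type": "OBI:0001271"},
]
with open(pkg / "biosample.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# A Frictionless-style descriptor tying the tables together; a real
# submission would list all tables and could then be bundled as a BDBag.
descriptor = {"name": "example-c2m2-submission",
              "resources": [{"path": "biosample.tsv", "format": "tsv"}]}
(pkg / "datapackage.json").write_text(json.dumps(descriptor, indent=2))
```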

Article activity feed

  1. Abstract

    Reviewer 1: J. Harry Caufield

    This manuscript by Charbonneau et al. details efforts to address challenges in enhancing the value of metadata among projects in the NIH's Common Fund Data Ecosystem. They describe how a new metadata model was developed and deployed to unify data properties across projects. Assembling such a model is a major accomplishment and a necessary step in promoting data reuse; applying the model is another commendable achievement. The manuscript text undersells the value of these efforts: how has the value of data in the CFDE improved due to the implementation of a unified metadata model and new infrastructure? The authors clearly delineate the challenges in searching CFDE data; these issues frequently appear in efforts toward improving biomedical data FAIRness and are directly relevant to the core challenges identified by Wilkinson et al. (2016) in their FAIR guiding principles. Much more emphasis could be placed on the overall impact of a consistent metadata model, whether within the CFDE alone or in the broader realm of bio-data management.

    Major issues:

    1. As noted on page 11, "All C2M2 controlled vocabulary annotations are optional". Data producers will use terms outside the controlled vocabulary as needed, and are unlikely to consult the CFDE working groups in every instance. Is there an automated system for term normalization in place? How will data producers be encouraged to preferentially use controlled terms? Are they warned during submission, as noted on page 22 regarding data contents? (A hypothetical normalization sketch follows this list.)

    Minor comments:
    2. The first example of the mismatch between user expectations and actual results of searching for Common Fund program data is very illustrative. I appreciate how it notes that even instances like matching Dr. Phil Blood's name in a search can complicate Findability.
    3. The abstract could include some brief description of the broader relevance and impact of the metadata model, including its potential for use outside the CFDE.
    4. On page 5, the sentence "Thus, a researcher interested in combining data across CF programs is faced with not only a huge volume, richness, and complexity of data, but also a wide variety, richness, and complexity of data access systems with their own vocabularies, file types, and data structures" feels somewhat redundant and could benefit from some editing.
    5. The structure of Figure 1 (or should this be Table 1?) is confusing. The general idea is clear - metadata types, properties, and formats are inconsistent across projects - but the two-column format presents issues with direct comparison.
    6. It is interesting that, among all values presented in Fig. 1, just one includes a CURIE (HMP's ENVO:02000020). This may be worth further comment as it is striking that few of these projects have adopted unique identifiers within their metadata schemata.
    7. Slightly more detail regarding the interviews with Common Fund programs would be helpful for understanding how these interactions contributed to the process. Were interviews primarily with PIs? Were several prominent issues repeatedly discussed in the context of multiple projects?
    8. Is the C2M2 master JSON schema publicly accessible?
    9. Some redundancy is present between the first and second paragraphs under the heading "Entities and associations are key structural features of the C2M2" - e.g., core entities and container entities are both described twice.
    10. In Figure 2, some lines connecting tables are very close to the edge of the figure borders and are difficult to see as a result.
    11. Is there a mechanism for dealing with obsolete terms as the ontologies contributing to the controlled vocabulary change? In the event that the NCBI Taxonomy renames a genus, for example, how will CFDE metadata change (if at all)?
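
    To make the normalization question in major issue 1 concrete, here is a minimal, hypothetical sketch of what automated term normalization against a controlled vocabulary could look like. The lookup table, function name, and CURIEs are illustrative assumptions, not actual CFDE tooling.

    ```python
    # Hypothetical sketch: normalizing free-text metadata values to controlled
    # vocabulary CURIEs, warning on unmapped terms. Mappings are illustrative;
    # this is not actual CFDE tooling.
    import warnings

    # Toy mapping from common free-text values to ontology CURIEs.
    TERM_TO_CURIE = {
        "liver": "UBERON:0002107",
        "whole blood": "UBERON:0000178",
        "homo sapiens": "NCBITaxon:9606",
    }

    def normalize_term(value: str) -> str | None:
        """Return a CURIE for a free-text term, or None with a warning."""
        curie = TERM_TO_CURIE.get(value.strip().lower())
        if curie is None:
            # A production system might instead flag this at submission time,
            # as the manuscript describes for data contents on page 22.
            warnings.warn(f"Uncontrolled term {value!r}; consider a controlled CURIE")
        return curie

    print(normalize_term("Liver"))       # UBERON:0002107
    print(normalize_term("hepatocyte"))  # None, with a warning
    ```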

    Reviewer 2: Carole Goble

    The article is a very useful contribution to the growing number of metadata models and data catalogues in the life science data ecosystem. The recent NIH mandates on data sharing emphasize the need for findability of datasets, and the need to operate within a federation and ecosystem recognises the reality of independent data centers and legacy data collections. The paper states the context of the CFDE well, setting up the need for a centralized portal capable of ingesting, indexing, searching, and supporting cross-dataset comparisons of datasets from different, independent data centers without the need for those centers to move, reformat, or rehost their data. This is a common pattern that many data infrastructure providers will recognise. The incremental approach that supports minimal uploads and respects local DOI implementations is pragmatic and, I suspect, has made onboarding the data centers feasible. The insight that mapping to common ontologies does not by itself lead to harmonised datasets, nor support search, is a useful lesson that resonates and bears reiterating (although it is already well known). Given that the approach is tabular, Frictionless Data makes sense. The process of working with the Centers is interesting, as is the choice of three core entities; some more discussion of why these three and only these three would be appreciated. The ingest pipeline and process are not so clear.
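
    Since the review highlights the Frictionless Data choice, a short sketch of how a tabular submission could be validated with the frictionless Python package may be useful; the file path is an assumption, and this illustrates the general mechanism rather than the CFDE's actual pipeline.

    ```python
    # Illustrative validation of a TSV-based data package with frictionless-py.
    # The package path is hypothetical; this is not the CFDE's ingest code.
    from frictionless import validate

    # Checks schema conformance, field types, and referential integrity
    # declared in the descriptor.
    report = validate("example_submission/datapackage.json")

    if report.valid:
        print("Data package is valid and ready for ingest.")
    else:
        # Surface row- and field-level problems for the submitting Center.
        for task in report.tasks:
            for error in task.errors:
                print(error.message)
    ```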

    • It seems that each Center is required to map its datasets to the current C2M2 model as 48 TSV files in a data package, which is then uploaded to the catalogue and ingested into the portal's database. Is the data package a complete reupload each time, or is it additive? There are hints in the text that it is a replacement each time.
    • What is the cost and complexity of this mapping and upload borne by the Centers? Any insights would be valuable. Is any tooling provided to help beyond the documentation?
    • Figure 5 could be improved to include the data that flows between the steps, and the actors. Could Figures 3 and 5 be merged?
    • If the datasets are reuploaded afresh each cycle, how are between-release analytics managed? Through the PIDs? Are there restrictions on what can be changed between releases? (A toy sketch of PID-based release comparison appears after these comments.)
    • As the datasets can be incrementally improved with each release, are there any trends between releases that indicate changes in metadata enrichment? On page 18 you state that "DCCs get better at using the C2M2".
    • The data package needs a clearer description: the relationship between the TSV files, the Frictionless Data JSON, and the BDBag is of interest to many in the community and warrants a more thorough discussion.

    The portal:
    • Why were these three basic kinds of search chosen? Were there user stories collected from the listening tour?
    • It would be helpful if there were some indication of the use of the catalogue by users, rather than just the ingest and publishing pipeline.
    • Page 5: the argument is made that reusing Common Fund data for cross-cutting analysis is challenging and requires the hiring of dedicated bioinformaticians ("at considerable cost of NIH"). How does making the datasets available through a catalogue relieve the need for skilled bioinformaticians to analyse the data? The data still needs to be processed. Hasn't the burden just shifted to the Centers to prepare the TSV files for the ingest pipeline?
    • Page 5 claims that the sociotechnical framework of the CFDE is a self-sustaining community. How? Working groups have been established, but to what extent are these managed by the community rather than by the dedicated action of developing the portal? What is the sustainability of the portal?
    • The easy expansion of the C2M2 seems to depend on two things: the incorporation of domain-specific vocabularies and the cycle of ingest releases at time points. Does this latter point constitute expansion that is easy? It would require each data center to adapt to the new table templates.
    • Page 9: Containers are mentioned, but it is not clear what the difference is between a container and a collection. Containers do not seem to appear again in the browsing.
    • Page 18: the visibility of Biosamples changing over time in Figure 4 wasn't so clear to me.
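
    To make the between-release question concrete, here is a toy sketch of how two releases could be compared via persistent identifiers. The (id_namespace, local_id) key, file paths, and table layout are assumptions for illustration, not the CFDE's actual mechanism.

    ```python
    # Toy sketch: diffing two C2M2-style releases by persistent identifier.
    # Paths and the (id_namespace, local_id) key are illustrative assumptions.
    import csv

    def load_release(path: str) -> dict[tuple[str, str], dict]:
        """Index an entity table by its persistent identifier pair."""
        with open(path, newline="") as f:
            return {(row["id_namespace"], row["local_id"]): row
                    for row in csv.DictReader(f, delimiter="\t")}

    old = load_release("release_2021_Q3/biosample.tsv")
    new = load_release("release_2021_Q4/biosample.tsv")

    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    # Records whose PID persists but whose metadata was enriched or changed.
    changed = [k for k in old.keys() & new.keys() if old[k] != new[k]]

    print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed")
    ```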