Richard Crooks's Website
Genetic Data Sharing
How Data Sharing Helps Clinical Genetics
It is said that "the product of a laboratory is data". Laboratories (of all forms) take samples and analyse them in order to produce data of interest to service users. Whether this is patient samples to help diagnose disease, manufacturing samples to help monitor manufacturing processes, or research samples to advance scientific knowledge, all of this produces a product, which is data.
As genetic medicine becomes more widely used, the amount of data being produced is increasing dramatically. However, genetic medicine is perhaps unique as a medical discipline in that the data it produces is not purely diagnostic, but crosses into research.
In a traditional laboratory medicine discipline such as biochemistry, a result of a test describes what is going on in a particular patient. The reference ranges of such tests, and the clinical interpretations of those tests are derived from a strong evidence base of what happens in different physiological conditions. The results are limited to one patient.
In clinical genetics, on the other hand, the analysis crosses from pure diagnostics into research. A common finding in clinical genetics is a variant of undetermined significance (VUS), where the evidence is insufficient to interpret it. Furthermore, even variants whose clinical significance can be determined are classified with reference to the available evidence, which means that a scientist's time is taken up gathering evidence of variant pathogenicity.
In simple monogenic conditions, which are caused by one (or a few) variants acting in isolation, the scientists who analyse these results are conducting research to determine whether the variants they see are pathogenic. They do this sometimes by reference to available literature, sometimes by testing relatives to see if the variants segregate, sometimes through in silico prediction tools and sometimes through relevant functional studies. The crucial point is that if a variant is pathogenic, then it should be pathogenic in every patient in whom it is seen. Clinical geneticists are therefore building the "reference range" of normal variation and pathogenic variation.
Since this is research that builds a reference range that all clinical geneticists should be able to use, there would be a great advantage in sharing data, and even a need to do so, in order to improve the efficiency and quality of clinical geneticists' work. Let us describe a couple of scenarios that would benefit from data sharing.
Differently Incomplete Data
Different scientists can encounter and compile different pieces of evidence for interpreting the same variants, depending on the resources available in their hospitals and the range of patients they see. Variant X could be investigated by two scientists (A and B) working in different labs as follows.
Scientist A works in a large lab that has a research contract with a large Russell Group university, and so can access a wide variety of literature. They see a single patient who has the variant. The family history is limited, as the patient, a child, has unknown parentage. However, as the scientist has access to some limited literature describing functional studies, they are able to assign this evidence to the interpretation.
Scientist B sees the same variant in an unrelated patient in another hospital. Scientist B works in a smaller hospital that does not have a research contract with a university, and so does not have access to a large body of literature in which to search for functional studies. However, the patient's parents are both known, and it can be confirmed that the variant is de novo, with maternity and paternity confirmed.
Neither of these scientists, working alone with their own piece of evidence, can assign a variant classification other than VUS (class 3) according to the ACMG guidelines (Richards et al. 2015), since each has only one piece of strong evidence. However, if they were to collaborate and share their data, they would both have 2 strong pieces of evidence, which is sufficient to classify the variant as pathogenic. Similar situations can arise with other mixtures of evidence, such as a variant seen in an individual whose condition has another known cause, or a variant seen in an unaffected individual.
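The scenario above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation of the ACMG guidelines: it applies only the "two or more strong criteria" rule described above and ignores the very strong, moderate, supporting and benign criteria that the full guidelines also combine. The `LabReport` class and the `classify` and `pool` functions are hypothetical names introduced for this example; PS2 and PS3 are the guideline codes for confirmed de novo occurrence and for functional studies respectively.

```python
from dataclasses import dataclass

# Strong pathogenic criteria from Richards et al. (2015).
PS2 = "PS2"  # de novo variant, maternity and paternity confirmed
PS3 = "PS3"  # functional studies support a damaging effect


@dataclass(frozen=True)
class LabReport:
    """Hypothetical record of the strong evidence one lab holds for a variant."""
    lab: str
    variant: str
    strong_evidence: frozenset


def classify(strong_evidence):
    """Deliberately simplified: apply only the 'two or more strong criteria' rule."""
    return "pathogenic" if len(strong_evidence) >= 2 else "VUS (class 3)"


def pool(reports):
    """Union the strong evidence that different labs hold for the same variant."""
    if len({r.variant for r in reports}) != 1:
        raise ValueError("reports describe different variants")
    pooled = frozenset()
    for report in reports:
        pooled |= report.strong_evidence
    return pooled


report_a = LabReport("Lab A", "variant X", frozenset({PS3}))
report_b = LabReport("Lab B", "variant X", frozenset({PS2}))

alone_a = classify(report_a.strong_evidence)     # one strong criterion only
alone_b = classify(report_b.strong_evidence)     # one strong criterion only
together = classify(pool([report_a, report_b]))  # two distinct strong criteria
```

Taking the union of the evidence sets means that the same criterion reported by both labs is counted once; that is an assumption of this sketch rather than a statement about how the guidelines treat repeated observations.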
When a patient is seen by a clinical genetics service and the genetic testing subsequently discovers a variant, it is likely that close relatives will be tested initially to discover if any parents or siblings also have the variant, and exhibit any clinical phenotype, to see if the variant segregates with the condition.
The number of immediate relatives available for testing is limited, and members of older generations may no longer be alive or may lack a clear medical history, so their genotype and phenotype may be unavailable. This means that the segregation data available in a single family is limited, unless it is a large family with a large number of affected and unaffected individuals.
Figure 1: An affected family in which the mother and son have the condition and the variant, while the father and the grandmother are confirmed as wildtype. The genotype of the grandfather cannot be determined as he is deceased, and his clinical history is unavailable.
Figure 2: Family 2 has an affected daughter and an affected father, with the wildtype genotype confirmed in the unaffected mother and grandfather. The grandmother is deceased and no clinical information is available for her.
Figure 3: After the pedigrees were shared between the labs and the families' histories were investigated, it was discovered that the families are related to one another, so the evidence that the variant segregates with the condition was greatly strengthened.
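As a rough illustration of why pooling pedigrees helps, the observations in Figures 1 and 2 can be tallied in Python. This is only a count of individuals whose genotype agrees with their phenotype; it is not a formal segregation or linkage analysis, which would depend on how the individuals are related. The tuple layout and the `tally` function are hypothetical, and the wildtype relatives in Figure 1 are assumed to be unaffected.

```python
from collections import Counter

# Each observation is (person, genotype, affected). Deceased relatives are left
# out because neither their genotype nor their phenotype is available.
family_1 = [
    ("mother", "variant", True),
    ("son", "variant", True),
    ("father", "wildtype", False),       # assumed unaffected
    ("grandmother", "wildtype", False),  # assumed unaffected
]
family_2 = [
    ("daughter", "variant", True),
    ("father", "variant", True),
    ("mother", "wildtype", False),
    ("grandfather", "wildtype", False),
]


def tally(observations):
    """Count individuals whose genotype agrees (or disagrees) with their phenotype."""
    counts = Counter()
    for _person, genotype, affected in observations:
        carries_variant = genotype == "variant"
        key = "concordant" if carries_variant == affected else "discordant"
        counts[key] += 1
    return counts


single_family = tally(family_1)
pooled = tally(family_1 + family_2)
```

With the data as entered, one family contributes four concordant observations and the pooled set contributes eight, with none discordant. The tally does not capture the further gain described in Figure 3, where the two families turn out to form a single larger pedigree.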
Enabling Data Sharing
Although the advantages of data sharing are clear, navigating data sharing is potentially difficult. The general utility of sharing data is recognised within clinical genetics, as well as in other medical specialties (Callahan et al., 2017). However, data protection law, specifically the European Union's General Data Protection Regulation, means that there is concern about the legal status of sharing data in healthcare (Neame, 2014), in genomics (Thorogood, 2018) and in scientific research more broadly (Chassang, 2017). This has led to concern that the regulatory window between sharing data for patient benefit and protecting patient privacy has narrowed (Phillips, 2018).
A proposed framework for data sharing is the FAIR principles (Wilkinson et al., 2016), which state that scientific data should be Findable, Accessible, Interoperable and Reusable. These were originally developed for online repositories of research data, such as UniProt or wwPDB. They have also attracted interest from clinical genetics (Corpas et al., 2018), other medical specialties (Callahan et al., 2017) and unrelated industries (Rychlik et al., 2018). This doesn't mean that the data is shared directly. Rather, others who may have a use for the data can become aware that it exists (findable) and can request access to it (accessible). This is particularly important for patient-sensitive data: although details about a variant may be interesting to other scientists, and may help the diagnosis of other patients, they are still patient results and should be protected as such. Findable and Accessible doesn't mean that variants, and the evidence used to classify them, should be freely available for anyone to browse. Rather, the data should exist, and there should be systems in place that allow people to request access to it where there is a clinical need.
Furthermore, the data should be in a format that can be used with multiple systems (interoperable) to aid its use by other clinical genetics services (reusable). There are a number of common data formats which can be used to share data between different systems. JSON and XML are two widely used formats in web-based APIs, which return data in a structured form. These data structures can be readily read and processed by built-in libraries in programming languages such as Python and PHP.
Data should be described in a way that is consistent and allows any user to know what the data is saying. In clinical genetics, the HGVS nomenclature (den Dunnen, 2017) is being adopted as a standard that provides a clear framework for describing variants. It allows data to be understood by anyone who has access to it, without ambiguity over whether the position described is a gene coordinate or a genomic coordinate, or which transcript the variant is found in. Consistent formats like this mean that all scientists in the field can see unambiguous descriptions of variants.
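The distinction between findable and accessible can be sketched as a small in-memory registry. Everything here is hypothetical (the `VariantRecord` and `Registry` classes, their method names and the example evidence), and a real system would need authentication, audit trails and a proper review process. The sketch only shows the separation: anyone can discover that a record exists and who holds it, but the patient-derived evidence is released only after a request has been approved.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRecord:
    variant: str      # discoverable: what the record is about
    holding_lab: str  # discoverable: who to ask for access
    evidence: dict    # protected: patient-derived detail


@dataclass
class Registry:
    records: list = field(default_factory=list)
    approved: set = field(default_factory=set)  # (requester, variant) pairs

    def deposit(self, record):
        self.records.append(record)

    def search(self, variant):
        """Findable: report which labs hold a record, without any evidence."""
        return [r.holding_lab for r in self.records if r.variant == variant]

    def approve(self, requester, variant):
        """Stand-in for the holding lab accepting a clinical-need request."""
        self.approved.add((requester, variant))

    def fetch(self, requester, variant):
        """Accessible: release evidence only for an approved request."""
        if (requester, variant) not in self.approved:
            raise PermissionError("access request has not been approved")
        return [r.evidence for r in self.records if r.variant == variant]


registry = Registry()
registry.deposit(VariantRecord("variant X", "Lab B", {"de_novo_confirmed": True}))

holders = registry.search("variant X")  # discovery needs no approval

try:
    registry.fetch("Lab A", "variant X")
    denied = False
except PermissionError:
    denied = True  # evidence is withheld until the request is approved

registry.approve("Lab A", "variant X")
released = registry.fetch("Lab A", "variant X")
```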
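As a small illustration of interoperability, the same record can be written to and read back from both JSON and XML using only Python's standard library (`json` and `xml.etree.ElementTree`). The record layout and the element names are assumptions made for this example, not an established exchange schema.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical variant record; the field names are illustrative only.
record = {
    "variant": "variant X",
    "classification": "VUS",
    "evidence": ["de novo, parentage confirmed"],
}

# JSON: serialise to text and parse it back.
json_text = json.dumps(record)
from_json = json.loads(json_text)

# XML: build an element tree, serialise it, then parse it back.
root = ET.Element("variant_record")
ET.SubElement(root, "variant").text = record["variant"]
ET.SubElement(root, "classification").text = record["classification"]
evidence_element = ET.SubElement(root, "evidence")
for item in record["evidence"]:
    ET.SubElement(evidence_element, "item").text = item
xml_text = ET.tostring(root, encoding="unicode")

parsed = ET.fromstring(xml_text)
from_xml = {
    "variant": parsed.findtext("variant"),
    "classification": parsed.findtext("classification"),
    "evidence": [element.text for element in parsed.findall("evidence/item")],
}
```

Both round trips recover the original dictionary, so a receiving system can work with whichever format its own tooling prefers.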
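To show how a consistent description removes ambiguity, the sketch below splits a simple HGVS-style substitution into its parts with a regular expression. It handles only single-nucleotide substitutions on coding (`c.`) or genomic (`g.`) coordinates and is nowhere near a full HGVS parser. The accession `NM_000000.0` is a made-up placeholder, and `parse_substitution` is a hypothetical helper.

```python
import re

# Matches descriptions such as "NM_000000.0:c.123A>G" (placeholder accession).
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+)"  # versioned reference sequence
    r":(?P<coordinate_type>[cg])\."       # c. = coding DNA, g. = genomic
    r"(?P<position>\d+)"                  # position in that coordinate system
    r"(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)


def parse_substitution(description):
    """Split a simple substitution into reference, coordinate type, position and bases."""
    match = HGVS_SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple substitution with a reference: {description!r}")
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


parsed = parse_substitution("NM_000000.0:c.123A>G")

# A description without a reference sequence is rejected, because the
# position alone does not say which transcript or genome build is meant.
try:
    parse_substitution("c.123A>G")
    rejected = False
except ValueError:
    rejected = True
```

Requiring the reference sequence and the coordinate prefix is the design point: those two parts are what tell a reader which transcript the position refers to and whether it is a coding or a genomic coordinate.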
References
Callahan, A., Anderson, K. D., Beattie, M. S., Bixby, J. L., Ferguson, A. R., Fouad, K., Jakeman, L. B., Nielson, J. L., Popovich, P. G., Schwab, J. M. and Lemmon, V. P.; FAIR Share Workshop Participants. (2017) Developing a data sharing community for spinal cord injury research. Exp Neurol. 295: 135-143.
Chassang, G. (2017) The impact of the EU general data protection regulation on scientific research. Ecancermedicalscience. 11: 709.
Corpas, M., Kovalevskaya, N. V., McMurray, A. and Nielsen, F. G. G. (2018) A FAIR guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol. 14: e1005873.
den Dunnen, J. T. (2017) Describing Sequence Variants Using HGVS Nomenclature. Methods Mol Biol. 1492: 243-251.
Neame, R. L. (2014) Privacy protection for personal health information and shared care records. Inform Prim Care. 21: 84-91.
Phillips, M. (2018) International data-sharing norms: from the OECD to the General Data Protection Regulation (GDPR). Hum Genet. In Press.
Richards, S., Aziz, N., Bale, S., Bick, D., Das, S., Gastier-Foster, J., Grody, W. W., Hegde, M., Lyon, E., Spector, E., Voelkerding, K., Rehm, H. L. and ACMG Laboratory Quality Assurance Committee. (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine. 17: 405-24.
Rychlik, M., Zappa, G., Añorga, L., Belc, N., Castanheira, I., Donard, O. F. X., Kouřimská, L., Ogrinc, N., Ocké, M. C., Presser, K. and Zoani, C. (2018) Ensuring Food Integrity by Metrology and FAIR Data Principles. Front Chem. 6: 49.
Thorogood, A. (2018) Genomic data sharing in Canada: flying under the regulatory radar? Hum Genet. In Press.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J. W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J., Groth, P., Goble, C., Grethe, J. S., Heringa, J., 't Hoen, P. A., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S. A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J. and Mons, B. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 3: 160018.