Genomics data storage: what, who and how?

By - World Healthcare Journal

Several key questions must be addressed before the revolutionary potential of genomic data can be fully harnessed.

“An organism’s complete set of DNA including all of its genes” was how the PHG Foundation defined a genome in its recent report on identification and genomic data. Genomic data, the report states, therefore refers “to sequenced DNA that can be in the form of raw data derived from sequencing, a person’s genome in whole or in part ... or individual DNA variations.”

Genomics is a big data field; just a single human genome sequence accounts for 200 gigabytes of raw data – roughly equivalent to 40,000 MP3 tracks. The potential of this field has been brought sharply into the spotlight by Covid-19, as scientists look to genomic sequencing to enhance detection of the virus and better understand the implications of new variants as they emerge. Going forward, genomic data will increasingly be used to improve the global response to emerging infectious disease outbreaks beyond Covid-19.
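The size comparison above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 GB = 1,000 MB), and the 5 MB track size is not stated in the article but is what the two quoted figures imply.

```python
# Figures quoted in the article.
GENOME_RAW_GB = 200      # raw data for a single human genome sequence
MP3_TRACKS = 40_000      # number of MP3 tracks given as the equivalent

# Assumption: decimal units, 1 GB = 1,000 MB.
MB_PER_GB = 1_000

# Track size implied by the two figures above.
implied_track_mb = GENOME_RAW_GB * MB_PER_GB / MP3_TRACKS
print(f"Implied size per MP3 track: {implied_track_mb:.1f} MB")


def tracks_for(gigabytes: float, track_mb: float = 5.0) -> int:
    """Number of whole tracks of size track_mb that fit in the given gigabytes."""
    return int(gigabytes * MB_PER_GB // track_mb)


print(f"Tracks per genome at 5 MB each: {tracks_for(GENOME_RAW_GB)}")
```

The two figures are consistent with each other if an average track is around 5 MB; a different assumed track size would change the equivalent count proportionally.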

However, while harnessing this data could be hugely beneficial for society, it requires individuals to share personal and intimate pieces of information, raising issues surrounding ethical storage and access. There are three key questions that must be addressed: (1) What is genomics data storage? (2) How is the data stored? (3) Who can access the data?

Should data be shared via open-access?

The Global Initiative on Sharing Avian Influenza Data (GISAID), the most popular platform for sharing SARS-CoV-2 genome sequences, currently holds over 1.5 million sequences. GISAID improves global collaboration and trend identification by allowing researchers from around the world to view each other’s submissions. However, unlike smaller platforms such as the International Nucleotide Sequence Database Collaboration (INSDC), which allow anonymous access, GISAID requires users to confirm their identity and agree not to reshare data.

Consequently, studies building on GISAID data cannot publish their full data sets, which makes them harder for other researchers to scrutinise. In response, an open letter signed by over 780 members of the scientific community, including Emmanuelle Charpentier (Nobel Laureate, Scientific and Managing Director, Max Planck Unit for the Science of Pathogens) and Sharon Peacock (Executive Director, Covid-19 Genomics UK Consortium, University of Cambridge), encourages researchers to share SARS-CoV-2 data openly via the INSDC platform, to improve efforts to stop the spread of Covid-19 and better prepare for future outbreaks. The letter argues that in “responding to a health crisis, data plays a critical role in understanding transmission, infection and symptoms, and in identifying drug targets, developing vaccines and designing public health responses.”

Covid-19 provided a stark reminder that, in the context of rapidly increasing globalisation, diseases will spread without consideration for national borders. International collaboration is therefore vital to effectively stem the tide of infectious disease in the modern era. However, the sentiment of this open letter does not represent a global voice: 99 per cent of signatories are based in Canada, the United States, and Europe. Some scientists in the Global South argue that open-access genomics data undermines their efforts and deprives them of credit. This view was presented by an anonymous writer in IOL, a South African news outlet. The writer suggests that encouraging poorer countries to waive rights to their data, and the right to be acknowledged for their contributions, reflects a wider problem: a “neo-colonial mentality has long permeated the scientific community, as scientists in wealthy countries have regularly refused to acknowledge – and have blatantly misappropriated – the contributions and discoveries of scientists in developing countries.”

Furthermore, it can be argued that a move towards open access may pressure hesitant researchers to share their data quickly, before it has been fully checked. This in turn raises questions about the quality of data sourced from open-access platforms.

What does data sharing mean for individuals?

Genomics data, when tied to a specific individual, contains detailed information regarding a person’s health. An EMBO report notes that the General Data Protection Regulation (GDPR) “lists genetic data as ‘special categories of personal data’ or sensitive data (Art. 9), which makes their processing for research purposes (Art. 9(2)(j)) subject to the adoption of adequate organizational and technical safeguards.”

Inside and outside the scientific community, valid concerns are being raised surrounding ethical governance of data sharing consent. This is an issue of particular concern for the Global Alliance for Genomics and Health (GA4GH), a not-for-profit coalition of 600+ leading organisations formed to accelerate the potential of research and medicine to advance human health. In response to concerns, GA4GH published a Framework for Responsible Sharing of Genomic and Health-Related Data which aims, among other goals, to protect and promote the welfare, rights, and interests of individuals, complement laws and regulations on privacy and personal data protection, foster responsible data sharing, and serve as a tool for the evaluation of responsible research.

Blockchain technology: a solution?

In response to concerns regarding data sharing consent and the immense volume of data, researchers and companies are beginning to look to blockchain technology for a solution. While blockchain began as the basis for the Bitcoin network, it offers promising benefits for genomic data storage. These benefits include (1) security: data is protected through strong cryptographic protocols; (2) decentralisation: no single entity can control the database; and (3) immutability: network consensus must be reached on the validity of a piece of data before it is recorded on the chain, and once recorded, it cannot be altered. For a detailed explanation of blockchain technology within the genomics field, see the paper on the potential application of blockchain technologies in genomics by Halil Ibrahim Ozercan, Atalay Mert Ileri, Erman Ayday, and Can Alkan.
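The immutability property described above can be illustrated with a toy hash chain. This is a minimal sketch, not how any real genomics platform works: it runs on a single machine, so it shows only why altering a recorded entry is detectable, not consensus or decentralisation. The function names, the sample labels, and the choice to record only a digest of the sequence data rather than the data itself are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """SHA-256 over a block's contents, including the previous block's hash."""
    record = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a block that is linked to the current tip of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })


def is_valid(chain):
    """Recompute every hash and check each link to the preceding block."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical records: each stores a digest of the data, not the data itself.
chain = []
append_block(chain, {"sample": "S1", "digest": hashlib.sha256(b"ACGTACGT").hexdigest()})
append_block(chain, {"sample": "S2", "digest": hashlib.sha256(b"TTGACCGA").hexdigest()})

print(is_valid(chain))                    # chain is intact
chain[0]["payload"]["sample"] = "edited"  # tamper with a recorded entry
print(is_valid(chain))                    # tampering is detected
```

Because each block's hash covers the previous block's hash, changing any earlier entry invalidates the chain from that point on. In a real network, many independent participants hold copies and must agree on new blocks, which is what makes rewriting history impractical rather than merely detectable.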

Nebula Genomics, a start-up based in the US, is currently exploring this avenue by offering whole-genome sequencing, with the results placed on a blockchain-based genetic marketplace. In comparison to traditional databases, users are granted both greater control over their data and greater privacy. Data is transmitted on a ‘need-to-know’ basis: users can view their own results and control who they are shared with, while Nebula Genomics employees can only access de-identified data that cannot be linked to individuals. This is a mutually beneficial data sharing relationship. Users can learn about and understand their genetic health data without risking abuse or misuse of this data. Meanwhile, companies and researchers can use de-identified data to understand population health dynamics without concern for breaching data-protection and privacy laws.

Moving forward and continuing the conversation

As genomics continues to diffuse into the everyday lexicon, it is vital we all continue to ask these important questions surrounding how genomic data is stored, accessed, and shared. These conversations need to occur within policy and political circles, but also among families, friends, and our wider communities.
