An Active Genomic Data Recovery Attack
M. AKGÜN

With the decreasing cost and growing availability of human genome sequencing, genomic privacy has become an important issue. Several methods have been proposed in the literature to address it, including cryptographic and privacy-preserving data mining techniques such as homomorphic encryption and cryptographic hardware. In recent work, Barman et al. studied privacy threats and practical solutions in an SNP-based scenario. The authors introduced a new protocol for a setting in which a malicious medical center mounts an active attack in order to retrieve the genomic data of a given patient, and stated that this protocol provides a trade-off between privacy and practicality. In this paper, we first give an overview of the system for SNP-based risk calculation. We provide the definitions of privacy threats and briefly describe Barman et al.'s protocol and solution. The authors proposed to use a weighted sum of SNP coefficients for calculating disease tendency. They argue that the specific choice of the bases would prevent unique identification of SNPs. Our main observation is that this is not true: contrary to the security claim, SNP combinations can be identified uniquely in many different scenarios. Our method exploits a pre-computed look-up table for retrieving SNP values from the test result. An attacker can obtain all SNP values of a given patient by using this table. We provide practical examples of weights and pre-computed tables. We also note that even when the table is too large for the attacker to handle at once, he can still gather information using multiple queries. Our work shows that more realistic attack scenarios must be considered in the design of genetic security systems.

Hospitals often lack the expertise needed to protect the genomic data of their patients. Given the size of the data and their limited resources, it is often difficult for hospitals to store, process and maintain genomic data safely. A shortage of highly skilled staff and of suitable technology may also prevent hospitals from defending against cyber-attacks. One solution to this problem is to store and process genomic data in a privacy-preserving manner at a third-party service provider. In this case, service providers must process the data without seeing its content.
The human genome consists of four different nucleotides (A, C, G, T). These nucleotides form about 20,000 to 25,000 genes responsible for producing various types of proteins, which perform their roles inside cells throughout life. About 99.5% of the genome is common to the human population, and the remaining portion makes up the genetic variance. Most genetic variants in an individual are Single Nucleotide Polymorphisms (SNPs). An SNP can be defined as a variation, occurring with some probability in a population, in which a single nucleotide differs from the reference genome. Genome-Wide Association Studies (GWAS) have shown that SNPs provide probabilistic information about susceptibility to a disease. Generally, a few SNPs are evaluated together to calculate the overall predisposition to a condition such as cardiovascular disease or Alzheimer's disease. Since SNPs form the non-redundant part of the genome and contain minimal information, it makes sense to consider privacy-preserving protocols in terms of SNPs.
There are many different types of threats and security models in which genomic privacy is a concern: querying private genomic data, secure querying of public data, and secure sequence alignment in public clouds [2]. Several methods have been proposed in the literature to address these problems, including cryptographic and privacy-preserving data mining techniques such as homomorphic encryption [3] and cryptographic hardware [4], [5].

A. Related Work
Ayday et al. [6] proposed a system based on homomorphic encryption to protect an individual's privacy in disease risk tests. In this system, a storage and processing unit (SPU) stores sensitive data in encrypted form, and authorized institutions perform disease risk tests using homomorphic encryption and secure integer comparison. The SPU stores all the SNPs (approximately 40 million) of the patient. Ayday et al. solved the storage problem in [13] without sacrificing privacy. They classify SNPs as real SNPs and potential SNPs, where real SNPs are the set of SNPs observed in the patient. The SPU stores the real SNPs instead of storing all SNPs. However, this constitutes a problem for privacy as the SPU stores positions of [...]
[Table I: Comparison of previous solutions.]
[...] [14] instead of the Paillier cryptosystem in order to decrease the computational overhead. The patient has a smartcard that participates in the protocol execution. Loss of the smartcard can cause a privacy violation. Furthermore, the cloud provider learns the number of SNPs of each patient, which is also a data leakage. Perillo and De Cristofaro [12] proposed a cryptographic protocol for running different types of tests on individuals' genetic data. Their scheme is also based on AH-ECC [14]. In addition, it provides authorization, meaning that SNP weights and locations are verified by a central authority such as the FDA.
Djatmiko et al. [9] proposed a privacy-preserving algorithm to compute genomic tests that need the linear combination of SNP values. They applied partially homomorphic Paillier encryption and private information retrieval techniques to protect patients' privacy. The computational overhead of their solution is very high when compared to that of Ayday et al.'s solution.
Zhang et al. [10] proposed a framework for disease risk calculation using SNP values. Their framework significantly reduces the storage overhead of previous solutions by using Bloom filters. It also reduces communication cost by indexing the encrypted genomic data.
Fan and Mohanty [11] proposed a solution for privacy preserving calculation of the susceptibility of a patient to a particular disease. The proposed scheme is based on Shamir's (l, n) secret sharing [15] which allows the computation of a certain number of multiplications and unlimited additions. It is more efficient than Ayday et al.'s solution [6] in terms of storage and computation time.

B. Our Contributions
All existing works provide security under the semi-honest model, in which the parties involved do not deviate from the protocol description. It is comparatively easy to provide security under this model with low communication and computation complexity because adversaries are not allowed to change their inputs or to collude with other parties. Consequently, all previous works are vulnerable to active SNP retrieval attacks, in which an attacker modifies SNP weights in order to learn SNP values. The comparison of previous solutions is given in Table I. Barman et al. [21] proposed a solution intended to make existing works secure against active SNP retrieval attacks. They studied privacy threats and practical solutions in an SNP-based disease risk calculation scenario. The authors introduced a new protocol for a setting in which a malicious medical center mounts an active attack in order to obtain the SNP values of an individual, and stated that this protocol provides a trade-off between privacy and practicality. In this study, we show that the solution offered by Barman et al. [21] does not prevent the leakage of SNP values. We show that SNP combinations can be uniquely identified in many different scenarios. Our method uses a pre-computed look-up table to retrieve the values of the SNPs from the test result. The attacker can obtain all SNP values of a particular patient using this table. We present practical examples of weights and pre-computed tables. We also observe that even if the look-up table is too large to handle, the attacker can still infer SNP values with multiple queries. Our study shows that more realistic attack scenarios should be considered in the design of genetic security systems.
This paper is organized as follows. In Section II, we give an overview of the system model for genetic risk test calculation. In Section III, we give the definitions of privacy threats. In Section IV, we briefly describe Barman et al.'s protocol [21] and their privacy solution. In Section V, we explain our observation that the solution does not in fact prevent the leakage of SNP values. In Section VI, we discuss possible and existing countermeasures against active SNP retrieval attacks. Finally, in Section VII, we conclude the paper.

II. SYSTEM MODEL
In this section, we give an overview of the generic model previously described in the literature [6], [21]. This model is designed to calculate a genetic risk test in a privacy-preserving way. In brief, a patient (P) sends his sample to the certified institution (CI) for sequencing. The CI extracts the genomic variants (SNPs) of the patient and encrypts them. Then, the CI sends the encrypted genomic data to the data center (DC). The CI is also responsible for distributing encryption keys to the relevant parties. The DC stores the encrypted genomic data. Medical centers (MC) communicate with the DC in order to compute a genetic risk test in a privacy-preserving way. The system model is summarized in Figure 1. The genetic risk (G) is usually computed as a weighted sum of the SNPs' values (Equation 1):
$$G = \sum_{i} W_i \cdot SNP_i \qquad (1)$$
where $W_i$ is the contribution (weight) of $SNP_i$. This computation can be done in a privacy-preserving way using secure multiparty computation or smart cards [22]. At the end of the test, the MC learns only the test result, not the SNPs' values. Furthermore, the DC does not learn the SNPs' weights.
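As a point of reference for the attacks discussed later, the weighted sum can be sketched in the clear, without any encryption. The function name `genetic_risk` and the example weights and SNP values are hypothetical and are not taken from [6] or [21].

```python
from typing import Sequence


def genetic_risk(weights: Sequence[float], snps: Sequence[int]) -> float:
    """Plain (unencrypted) weighted sum of SNP values: sum of W_i * SNP_i."""
    if len(weights) != len(snps):
        raise ValueError("weights and snps must have the same length")
    return sum(w * s for w, s in zip(weights, snps))


# Hypothetical example values, not taken from any real genetic test.
example_weights = [0.5, 1.25, 2.0]
example_snps = [1, 0, 2]
print(genetic_risk(example_weights, example_snps))  # 0.5*1 + 1.25*0 + 2.0*2 = 4.5
```

In the actual system this sum would be evaluated over encrypted SNP values, so that neither party sees the other's inputs; the sketch only fixes the arithmetic that the MC eventually observes.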

III. PRIVACY THREATS
Barman et al. [21] investigated privacy threats for the system model described in Section II. In the literature, P and the CI are considered honest parties, and the MC and the DC are considered honest-but-curious parties. Barman et al. [21] extended the set of possible privacy threats by considering the MC and the DC as honest, semi-honest (passive) or dishonest (active). They describe three main attacks.

A. Test Inference Attacks
The semi-honest DC can learn from test queries which SNPs are used and how often they are used. Therefore, the DC can infer the disease from which the corresponding patient is suffering. If the DC can re-identify P, this violates the privacy of the patient P. Danezis and De Cristofaro [8] proposed to use all SNP values of a given patient in the genetic risk computation in order to prevent the test inference attack. Another solution [23] proposed to use oblivious RAM to prevent the DC from learning the access patterns of the MC.

B. Passive SNP Retrieval Attack
The MC can learn SNPs' values from the test result because the risk calculation is a linear equation and the MC knows some of the parameters used in this equation, such as the SNPs' weights. As the number of queries for a given patient P increases, P's privacy decreases. Ayday et al. [7] proposed to deliver the test result as a range in order to prevent this attack.
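The effect of repeated queries can be illustrated with a small sketch. It assumes that each SNP value is encoded as 0, 1 or 2 and that the weights are small integers; both the encoding and the weights are chosen only for illustration. Each test result restricts the set of SNP assignments consistent with it, and intersecting the sets from several tests narrows down the patient's SNPs.

```python
from itertools import product
from typing import Sequence, Set, Tuple

SNP_VALUES = (0, 1, 2)  # assumed encoding of a single SNP


def consistent_snps(weights: Sequence[int], result: int) -> Set[Tuple[int, ...]]:
    """All SNP assignments whose weighted sum equals the observed result."""
    return {
        snps
        for snps in product(SNP_VALUES, repeat=len(weights))
        if sum(w * s for w, s in zip(weights, snps)) == result
    }


# Hypothetical patient with SNPs (2, 0, 1) and two legitimate-looking tests.
first = consistent_snps((1, 1, 2), 4)   # 1*2 + 1*0 + 2*1 = 4
second = consistent_snps((2, 1, 1), 5)  # 2*2 + 1*0 + 1*1 = 5
print(len(first))      # 5
print(first & second)  # {(2, 0, 1)}
```

In this toy case a single result leaves five candidates, while two results identify the SNP values uniquely.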

C. Active SNP Retrieval Attack
In an active SNP retrieval attack, the dishonest MC manipulates the SNPs' weights in order to retrieve the SNPs' values easily from test results. For example, the MC sets all SNP weights to 0 except $W_j = 1$. The MC can then retrieve the value of $SNP_j$, which is equal to the test result. In another version of the attack, the MC sets the SNPs' weights to consecutive powers of a number. Consider a test with three SNPs in which the MC sets the weights as follows: $W_0 = 4^0$, $W_1 = 4^1$ and $W_2 = 4^2$. Suppose the test result is $G = (36)_{10} = (210)_4$. An attacker can read the SNPs' values directly from the base-4 digits of $G$: $SNP_2 = 2$, $SNP_1 = 1$ and $SNP_0 = 0$.
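The power-of-a-number variant amounts to reading the digits of the test result in the chosen base. The following sketch reproduces the three-SNP example with base 4; the function name `decode_snps` is hypothetical, and the decoding only works if every SNP value is smaller than the base.

```python
from typing import List


def decode_snps(result: int, base: int, count: int) -> List[int]:
    """Recover SNP_0 .. SNP_{count-1} from G when W_i = base**i."""
    values = []
    for _ in range(count):
        result, digit = divmod(result, base)  # peel off the lowest digit
        values.append(digit)
    return values


weights = [4 ** 0, 4 ** 1, 4 ** 2]
snps = [0, 1, 2]                                # SNP_0, SNP_1, SNP_2
g = sum(w * s for w, s in zip(weights, snps))   # 0*1 + 1*4 + 2*16 = 36
print(decode_snps(g, base=4, count=3))          # [0, 1, 2]
```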

IV. BARMAN ET AL.'S PROTOCOL
Barman et al. [21] offer a solution to the active SNP retrieval attack. According to the authors' definition, the MC carries out an active SNP retrieval attack by setting new SNP weights for a given test in order to retrieve the SNPs' raw values without being detected. Their solution is to force the MC to iteratively reveal some SNP weights to the DC until the DC is assured that the current test is legitimate. As the authors mention, this solution weakens the MC's privacy while giving more power to the DC. Learning the test parameters might allow the DC to carry out the test inference attack, and the test parameters might also be private to the MC. The MC can therefore abort the protocol if it considers that it would have to reveal too much information about the test parameters. The authors assume that only the MC can get the mapping information from the CI and that the SNPs are stored in shuffled order at the DC. The suggested protocol based on the described system model is as follows: [...]
The authors declare that once the DC is sure that the test is legitimate, it computes the encryption of the partial test result, $\mathrm{ENC}(G_2)$. They argue that this guarantees that an active SNP retrieval attack cannot be performed, independently of the weights used for $\mathrm{ENC}(G_1)$.

V. PROPOSED ACTIVE SNP RETRIEVAL ATTACK
In this section, we present an active SNP retrieval attack. We apply our attack to Barman et al.'s protocol [21]. [...]
[...] SNPs and a commitment for each SNP weight $W_i$, to the DC.
6. The DC asks for the weights at two random indices $j, k \in [0, N-1]$. The MC responds with the corresponding $W_j$ and $W_k$.
7. The DC checks the commitments $C_j$ and $C_k$ and the weights $W_j$ and $W_k$. If both weights are non-zero and are not different powers of the same number, the DC concludes that the test is not an active SNP retrieval attack. Steps 3 and 4 are repeated until the DC is convinced or the MC aborts the protocol. For each iteration after the first, the DC can ask for only one new weight.
8. Once convinced that the test is not an attack, the DC sends the S (at most S = N-2) encrypted SNPs corresponding to the weights not seen during the previous steps. In our attack, the DC is eventually convinced because we choose at least two non-zero weights and the SNPs' weights are guaranteed not to be different powers of the same number.
9. The dishonest MC homomorphically computes the encryption of the first part of the test result, $\mathrm{ENC}(G_1)$, from the S SNPs and sends it to the DC.
10. The DC computes the encryption of the second part of the test result, $\mathrm{ENC}(G_2)$, from the other encrypted SNPs, whose weights are known from steps 3 and 4. The two partial results are homomorphically added to obtain $\mathrm{ENC}(G) = \mathrm{ENC}(G_1) + \mathrm{ENC}(G_2)$. $\mathrm{ENC}(G')$, the partial decryption of $\mathrm{ENC}(G)$, is sent to the dishonest MC.
11. The dishonest MC decrypts $\mathrm{ENC}(G')$ and obtains G. Then, the dishonest MC retrieves the SNPs' values by using the look-up table T.
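The core of the attack can be sketched as follows, leaving out encryption and commitments. The sketch assumes integer weights, SNP values encoded as 0, 1 or 2, and a DC check that rejects a pair of revealed weights only if one is zero or if they are different powers of the same integer. The weights (3, 10, 31) and all function names are hypothetical; they are chosen so that every pair passes the check while every weighted sum still maps to exactly one SNP combination.

```python
from itertools import combinations, product
from typing import Dict, Sequence, Tuple

SNP_VALUES = (0, 1, 2)  # assumed encoding of a single SNP


def primitive_base(n: int) -> int:
    """Smallest integer c >= 2 such that n is a power of c (requires n >= 2)."""
    for c in range(2, n + 1):
        m = n
        while m % c == 0:
            m //= c
        if m == 1:
            return c
    raise ValueError("n must be at least 2")


def passes_dc_check(w_j: int, w_k: int) -> bool:
    """Assumed DC test: both weights non-zero, not different powers of one number."""
    if w_j <= 0 or w_k <= 0:
        return False  # zero (or negative) weights are rejected in this sketch
    if w_j == w_k:
        return True   # equal weights are not *different* powers
    if 1 in (w_j, w_k):
        return False  # 1 = c**0 and the other weight is c**1 for c = that weight
    return primitive_base(w_j) != primitive_base(w_k)


def build_table(weights: Sequence[int]) -> Dict[int, Tuple[int, ...]]:
    """Pre-compute the look-up table T: test result -> unique SNP combination."""
    table: Dict[int, Tuple[int, ...]] = {}
    for snps in product(SNP_VALUES, repeat=len(weights)):
        g = sum(w * s for w, s in zip(weights, snps))
        if g in table:
            raise ValueError("weights do not identify SNPs uniquely")
        table[g] = snps
    return table


weights = (3, 10, 31)  # hypothetical malicious weights
all_pairs_pass = all(passes_dc_check(a, b) for a, b in combinations(weights, 2))
table = build_table(weights)

patient = (2, 0, 1)                                 # unknown to the MC
g = sum(w * s for w, s in zip(weights, patient))    # 6 + 0 + 31 = 37
print(all_pairs_pass, len(table), table[g])         # True 27 (2, 0, 1)
```

The table has $3^N$ entries for N SNPs, which is why a large test may have to be split across several queries.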

VI. COUNTERMEASURES
Ayday et al. [7] proposed to report the risk value as a range. In this solution, as the width of the range increases, patient privacy increases but the precision of the test decreases.
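A minimal sketch of range reporting is given below, assuming fixed-width ranges aligned to multiples of the width; the function name `report_range` and the width are illustrative and are not taken from [7].

```python
import math
from typing import Tuple


def report_range(result: float, width: float) -> Tuple[float, float]:
    """Return the half-open range [lower, lower + width) containing result."""
    lower = math.floor(result / width) * width
    return lower, lower + width


# Two different exact results fall into the same reported range,
# so the MC can no longer tell them apart.
print(report_range(36, 5))  # (35, 40)
print(report_range(37, 5))  # (35, 40)
```

A wider range hides more SNP combinations behind the same answer, at the cost of a coarser test result.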
As another solution, the weight values can be stored in encrypted form so that clinicians do not know them. It then becomes impossible to obtain SNP values from calculations that involve more than one SNP value. However, it is difficult to keep the weight values secret in a central system, and in practice this solution is very difficult to implement.
Another solution is to give the test result to the patient privately. To do this, the test result must be delivered to the patient in encrypted form, and the patient must be able to decrypt it. The patient can then share the test result with the clinician if he wishes. A method for transferring, storing and decrypting the data must be specified. These operations can be carried out safely using smart card technology: the partially decrypted test result is transferred to the user's smart card, which is capable of decryption, and the user reads the result privately using a card reader and software on a personal computer. Since smart cards are tamper-proof devices, the test result can be stored reliably.
None of the proposed countermeasures is definitive. A clinician who learns the exact result of the test can always infer the SNP values.

VII. CONCLUSION
A recent study [21] proposed a new threat model in which a malicious medical center tries to retrieve the genomic data of a given patient. The authors also proposed a solution to this type of attack. They acknowledge that their solution may be vulnerable to more sophisticated attacks involving multiple queries. We show that there is in fact a simpler attack: the attacker can learn genomic data using a simple pre-computed look-up table. Developing a security solution that prevents our active SNP retrieval attack remains future work. Researchers will have to make a trade-off between privacy and efficiency in order to reduce the effects of active SNP retrieval attacks.