\iffalse
Actually, we can formulate the above Pufferfish privacy cases as Hidden Markov Models.
For instance, in \textit{Example 3}, the state space consists of ($0, 1, 2, \underline{0}, \underline{1}, \underline{2}$).
The states $\underline{0}$, $\underline{1}$ and $\underline{2}$ emit, with certainty,
the observations $\textit{zero}$, $\textit{one}$ and $\textit{two}$ respectively.
Even though the attacker has the prior knowledge that the disease is contagious,
we want to ensure that he cannot infer whether any member of the data set
has contracted the disease. Thus the initial distribution pair available to the attacker is
($1,0,0,0,0,0$) and ($0,0,1,0,0,0$). Since the probabilities of observing
$\textit{zero}$ are $\frac{2}{3}$ and $\frac{1}{6}$ respectively, this violates $\ln(2)$-Pufferfish privacy.
\fi
In fact, we can model general Pufferfish privacy problems as HMMs and
design algorithms to check whether privacy is preserved.
To formalize the problem: given a set of secrets $\bbfS$,
a set of discriminative secret pairs $\bbfS_{\textmd{pairs}}$, a set of data evolution
scenarios $\bbfD$, and $\epsilon > 0$, we model the mechanism $\calM$ as a hidden Markov model
and check whether it is $\epsilon$-Pufferfish ($\bbfS$, $\bbfS_{\textmd{pairs}}$,
$\bbfD$) private. We use states and transitions to model the mechanism $\calM$,
derive the target initial distribution pairs from the prior knowledge $\bbfD$ and the discriminative secrets $\bbfS_{\textmd{pairs}}$,
attach the target outputs to states as observations, and then check, for every distribution pair and every observation sequence,
whether the two probabilities of the same observation sequence
are mathematically similar (i.e., differ by at most the multiplicative factor $e^{\epsilon}$).
Our task thus reduces to finding the observation sequence and distribution pair for which the observation probabilities differ the most.
That is, in the Pufferfish framework, where $\overline{\omega}=\omega_1\omega_2\ldots$ ranges over observation sequences, $(s_i, s_j) \in
\bbfS_{\textmd{pairs}}$ over secret pairs, and $\theta \in \bbfD$ over scenarios, privacy is preserved if and only if both
\begin{equation}\label{max1}
\max_{\overline{\omega},(s_i, s_j) \in
\bbfS_{\textmd{pairs}},\theta \in \bbfD}
{ \Pr (\calM (\oD) = \overline{\omega}| s_i, \theta) - e^\epsilon \Pr (\calM (\oD) = \overline{\omega}| s_j, \theta) }
\end{equation}
\begin{equation}\label{max2}
\max_{\overline{\omega},(s_i, s_j) \in
\bbfS_{\textmd{pairs}},\theta \in \bbfD}
{ \Pr (\calM (\oD) = \overline{\omega}| s_j, \theta) - e^\epsilon \Pr (\calM (\oD) = \overline{\omega}| s_i, \theta) }
\end{equation}
are no more than $0$.
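For a fixed observation sequence and distribution pair, each probability in (\ref{max1}) and (\ref{max2}) can be computed with the standard HMM forward algorithm. The following is a minimal sketch in Python, assuming the mechanism has already been encoded as a stochastic transition matrix and per-state emission probabilities; all identifiers are illustrative rather than part of a formal definition.
\begin{verbatim}
import numpy as np

def seq_prob(init, P, O, omega):
    """Forward algorithm: probability of emitting the observation
    sequence `omega` from the initial distribution `init`.
    init  : (n,) initial state distribution
    P     : (n, n) transition matrix, P[s, t] = Pr(s -> t)
    O     : dict symbol -> (n,) array, O[w][s] = Pr(w | state s)"""
    alpha = init * O[omega[0]]      # weight start states by first emission
    for w in omega[1:]:
        alpha = (alpha @ P) * O[w]  # one transition step, then emission
    return float(alpha.sum())

def witnesses_violation(d1, d2, P, O, omega, eps):
    """Does this sequence witness a violation for this pair, i.e.
    is the expression in (max1) or (max2) positive for this omega?"""
    p1, p2 = seq_prob(d1, P, O, omega), seq_prob(d2, P, O, omega)
    return p1 - np.exp(eps) * p2 > 0 or p2 - np.exp(eps) * p1 > 0
\end{verbatim}
Verifying $\epsilon$-Pufferfish privacy then amounts to showing that no distribution pair and observation sequence witnesses a violation; the theorem below shows that this search is NP-hard in general.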
However, as we will show below, this problem is NP-hard.
\begin{theorem}
The Pufferfish privacy problem in hidden Markov models is NP-hard.
\end{theorem}
\begin{proof}
In order to verify Pufferfish privacy in a hidden Markov model, we have to decide whether
expressions (\ref{max1}) and (\ref{max2}) are no more than $0$.
We simplify the problem by fixing a single initial distribution pair to compare,
so that only the observation sequence remains to be found.
We show that maximizing the difference is NP-hard by
a reduction from SAT, which is known to be NP-complete. Assume we have a formula $F(x_1,\ldots,x_n)$ in conjunctive normal form
with $n$ ($n \geq 3$) variables and $m$ clauses $C_1,\ldots,C_m$. We shall construct a hidden Markov model $H = (K, \Obs, o)$
such that, with $\epsilon = \ln(4)$, expressions (\ref{max1}) and (\ref{max2}) attain the maximal value $0$
if and only if the formula $F(x_1,\ldots,x_n)$ is satisfiable.
\textit{Construction.} The construction is similar to \cite{PCT:87:CMDP}. We first describe the Markov chain $K =
(S, p)$. For each clause $C_i$ and variable $x_j$, $S$ contains a state group $A$ with six states $A_{ij}$, $A'_{ij}$, $T_{Aij}$, $T'_{Aij}$, $F_{Aij}$, $F'_{Aij}$, and
a state group $B$ with six states $B_{ij}$, $B'_{ij}$, $T_{Bij}$, $T'_{Bij}$, $F_{Bij}$, $F'_{Bij}$.
In addition, there are $4m$ states $A_{i,n+1}$, $A'_{i,n+1}$, $B_{i,n+1}$, $B'_{i,n+1}$, four for each clause $C_i$.
The transition distribution $p$ is as follows. For group $A$, there are two transitions, each with probability $\frac{1}{2}$, leading from
state $A_{ij}$ to $T_{Aij}$ and $F_{Aij}$ respectively; similarly, there are two transitions, each with probability $\frac{1}{2}$, leading
from $A'_{ij}$ to $T'_{Aij}$ and $F'_{Aij}$. There is exactly one transition, taken with certainty, from $T_{Aij}$, $F_{Aij}$, $T'_{Aij}$, $F'_{Aij}$
to $A_{i,j+1}$, $A_{i,j+1}$, $A'_{i,j+1}$, $A'_{i,j+1}$ respectively, with two exceptions: if $x_j$ appears positively in $C_i$,
the transition from $T'_{Aij}$ leads to $A_{i,j+1}$ instead of $A'_{i,j+1}$; and if $x_j$ appears negatively, the transition from
$F'_{Aij}$ leads to $A_{i,j+1}$. For the state group $B$, all transitions mirror those in group $A$, only with different state names.
For instance, there are two transitions, each with probability $\frac{1}{2}$, from state $B_{ij}$ to $T_{Bij}$ and $F_{Bij}$, and so on.
Next we describe the observations $\Obs$ and the observation distribution. In the states
$A_{ij}$, $A'_{ij}$, $B_{ij}$, $B'_{ij}$ with $1\leq j \leq n$, one observes $X_j \in \Obs$ with certainty.
In the states $T_{Aij}$, $T'_{Aij}$, $T_{Bij}$, $T'_{Bij}$ with $1\leq j \leq n$, the sole observation is $T_j \in \Obs$;
similarly, the sole observation $F_j \in \Obs$ is emitted in the states $F_{Aij}$, $F'_{Aij}$, $F_{Bij}$, $F'_{Bij}$ with $1\leq j \leq n$.
In state $A_{i,n+1}$, we observe $\top \in \Obs$ with probability $\frac{4}{5}$ and $\bot \in \Obs$ with probability $\frac{1}{5}$,
while in state $B_{i,n+1}$, we observe $\top$ with probability $\frac{1}{5}$ and $\bot$ with probability $\frac{4}{5}$.
In states $A'_{i,n+1}$ and $B'_{i,n+1}$, $\top$ and $\bot$ are observed with equal probability $\frac{1}{2}$.
Then we describe the Pufferfish privacy scenario in this hidden Markov model. Assume that, according to
the prior knowledge $\bbfD$ and the discriminative secrets $\bbfS_{\textmd{pairs}}$, we only have
one initial distribution pair $\oD_1$ and $\oD_2$ to compare. $\oD_1$ is the uniform distribution
over the state set $\{A'_{i1} : 1 \leq i \leq m\}$, assigning probability $\frac{1}{m}$ to each member.
Similarly, $\oD_2$ assigns probability $\frac{1}{m}$ to each member of the state set $\{B'_{i1} : 1 \leq i \leq m\}$.
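For concreteness, the following Python sketch (illustrative only, and not part of the formal proof) builds the sparse transition and observation structure just described from a CNF formula, together with the two initial distributions; clauses are encoded as sets of non-zero integers, with literal $+j$ for $x_j$ and $-j$ for $\neg x_j$, and all identifiers are our own naming.
\begin{verbatim}
def build_hmm(clauses, n):
    """Sketch of the reduction HMM from the construction above.
    clauses : list of sets of non-zero ints (+j / -j literals)
    Returns sparse transitions, emissions and the two initial
    distributions D1 (over A'_{i1}) and D2 (over B'_{i1})."""
    m = len(clauses)
    trans, emit = {}, {}
    for g in "AB":                              # state groups A and B
        for i, clause in enumerate(clauses):
            for j in range(1, n + 1):
                for primed in (False, True):
                    s = (g, i, j, primed)       # A_{ij} / A'_{ij}, ...
                    t = (g, i, j, primed, True) # T_{Aij} / T'_{Aij}, ...
                    f = (g, i, j, primed, False)
                    trans[s] = [(t, 0.5), (f, 0.5)]
                    emit[s] = {("X", j): 1.0}
                    emit[t] = {("T", j): 1.0}
                    emit[f] = {("F", j): 1.0}
                    # a primed walk jumps to the unprimed track as soon
                    # as the chosen truth value satisfies the clause
                    trans[t] = [((g, i, j + 1,
                                  primed and j not in clause), 1.0)]
                    trans[f] = [((g, i, j + 1,
                                  primed and -j not in clause), 1.0)]
            for primed in (False, True):        # final states, j = n + 1
                s = (g, i, n + 1, primed)
                trans[s] = []
                if primed:                      # clause left unsatisfied
                    emit[s] = {"TOP": 0.5, "BOT": 0.5}
                elif g == "A":
                    emit[s] = {"TOP": 0.8, "BOT": 0.2}
                else:
                    emit[s] = {"TOP": 0.2, "BOT": 0.8}
    d1 = {("A", i, 1, True): 1.0 / m for i in range(m)}
    d2 = {("B", i, 1, True): 1.0 / m for i in range(m)}
    return trans, emit, d1, d2
\end{verbatim}
Each primed state records that the clause is still unsatisfied; once a walk leaves the primed track it never returns, matching the two exceptional transitions above.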
\textit{Reduction.} The intuition is that starting from state $A'_{i1}$ or $B'_{i1}$,
the clause $C_i$ is chosen, and the truth assignment of each variable $x_1,\ldots,x_n$ is then examined one by one against $C_i$.
As soon as the assignment of a variable $x_j$ satisfies $C_i$,
state $A_{i,j+1}$ or $B_{i,j+1}$ is reached. Hence, if state $A'_{i,n+1}$ or $B'_{i,n+1}$
is reached at the end, the clause $C_i$ is not satisfied under this assignment. Now, we
claim that $\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega})$ attains the maximal value $0$
if and only if $\overline{\omega}$ is an observation sequence $X_1A_1X_2\ldots A_n \top$, with $A_i \in \{T_i,F_i\}$ for each variable $x_i$, under whose encoded assignment
the formula $F(x_1,\ldots,x_n)$ is satisfied.
(A similar analysis applies to $\Pr (\calM (\oD_2) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_1) = \overline{\omega})$, except that it attains the
maximal value $0$ with $\bot$ as the last observation.)
We first argue that $0$ is the maximal value.
Observe that for an arbitrary observation sequence $\overline{\omega} = X_1A_1X_2\ldots$,
as long as neither $\top$ nor $\bot$ has been observed, $\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega}) < 0$.
This is because the state group $B$ exactly mirrors the state group $A$ before the states
$B_{i,n+1}$ and $B'_{i,n+1}$ are reached, so the two probabilities are equal and their difference is $-3$ times their common value. Thus the maximal value is either negative or attained after $\top$
or $\bot$ is observed.
Now consider $\overline{\omega}=X_1A_1X_2\ldots A_n \top$. If $C_i$ is satisfied under the assignment encoded by $\overline{\omega}$, then the walks starting from $A'_{i1}$ and $B'_{i1}$, each chosen with
probability $\frac{1}{m}$, end in $A_{i,n+1}$ and $B_{i,n+1}$ and emit $\overline{\omega}$ with probabilities $2^{-n} \times \frac{1}{m} \times \frac{4}{5}$
and $2^{-n} \times \frac{1}{m} \times \frac{1}{5}$ respectively;
if $C_i$ is not satisfied, the walks end in $A'_{i,n+1}$ and $B'_{i,n+1}$ and emit $\overline{\omega}$ with equal probability $2^{-n}\times \frac{1}{m} \times \frac{1}{2}$.
Thus a satisfied clause contributes $2^{-n} \times \frac{1}{m} \times \frac{4}{5} - 4 \times 2^{-n} \times \frac{1}{m} \times \frac{1}{5} = 0$ to
the difference, whereas an unsatisfied clause contributes a strictly negative amount.
Therefore, if we choose an observation sequence ending with $\top$ under whose assignment all the
clauses are satisfied, $\Pr (\calM (\oD_1) = \overline{\omega})- 4\times \Pr (\calM (\oD_2) = \overline{\omega})$ attains the maximal value $0$.
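Making the count explicit (this merely restates the computation above): if the assignment encoded by $\overline{\omega}$ satisfies $k$ of the $m$ clauses, then
\[
\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega})
= \frac{2^{-n}}{m}\left( k \left(\frac{4}{5} - 4 \cdot \frac{1}{5}\right) + (m-k)\left(\frac{1}{2} - 4 \cdot \frac{1}{2}\right) \right)
= -\frac{3(m-k)\,2^{-n}}{2m},
\]
which equals $0$ precisely when $k = m$, i.e., when every clause is satisfied.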
If we instead consider $\overline{\omega}=X_1A_1X_2\ldots A_n \bot$, a similar analysis shows that $\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega})$
is strictly less than $0$.
This establishes that $0$ is the maximal value of $\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega})$
over all observation sequences.
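As a sanity check (not part of the proof), the claim can be exercised mechanically: the sketch below reuses the illustrative \texttt{build\_hmm} above, runs a sparse forward pass over the $2^n$ candidate sequences ending in $\top$, and reports the maximal difference, which is $0$ (up to floating point) exactly for satisfiable formulas.
\begin{verbatim}
from itertools import product

def seq_prob_sparse(init, trans, emit, omega):
    """Forward pass over the sparse HMM built by build_hmm."""
    alpha = {s: p * emit[s].get(omega[0], 0.0)
             for s, p in init.items()}
    for w in omega[1:]:
        nxt = {}
        for s, a in alpha.items():
            for t, q in trans[s]:
                e = a * q * emit[t].get(w, 0.0)
                if e:
                    nxt[t] = nxt.get(t, 0.0) + e
        alpha = nxt
    return sum(alpha.values())

# (x1 v ~x2) & (x2 v x3) & (~x1 v x3): satisfiable (x1 = x3 = 1)
clauses, n = [{1, -2}, {2, 3}, {-1, 3}], 3
trans, emit, d1, d2 = build_hmm(clauses, n)
best = max(
    seq_prob_sparse(d1, trans, emit, omega)
    - 4 * seq_prob_sparse(d2, trans, emit, omega)
    for bits in product([True, False], repeat=n)
    for omega in [[x for j, b in enumerate(bits, 1)
                   for x in (("X", j),
                             ("T", j) if b else ("F", j))]
                  + ["TOP"]]
)
print(best)  # 0.0 here; strictly negative if F is unsatisfiable
\end{verbatim}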
Finally, from the argument above,
$\Pr (\calM (\oD_1) = \overline{\omega}) - 4 \times \Pr (\calM (\oD_2) = \overline{\omega})$ attains the maximal value $0$
if and only if $F(x_1,\ldots,x_n)$ is satisfied under the observation sequence $\overline{\omega}=X_1A_1X_2\ldots A_n \top$
with assignment $A_i \in \{T_i,F_i\}$ for each variable $x_i$.
Since determining whether Pufferfish privacy is preserved
amounts to determining whether this maximal value exceeds $0$,
and the reduction shows that distinguishing a maximal value of $0$ from a strictly negative one already decides the satisfiability of $F$,
the general problem for $\epsilon$-Pufferfish privacy is NP-hard.
\end{proof}