
Commit 0c79e42

koto authored and tensorflower-gardener committed on Oct 1, 2023

Added new threat model.

PiperOrigin-RevId: 569813726
1 parent 022d32e

1 file changed: SECURITY.md (+165 −256 lines)

# Using TensorFlow Securely

This document discusses the TensorFlow security model. It describes the security risks to consider when using models, checkpoints, or input data for training or serving. We also provide guidelines on what constitutes a vulnerability in TensorFlow and how to report one.

This document applies to other repositories in the TensorFlow organization, covering security practices for the entirety of the TensorFlow ecosystem.

## TensorFlow models are programs

TensorFlow [**models**](https://developers.google.com/machine-learning/glossary/#model) (to use a term commonly used by machine learning practitioners) are expressed as programs that TensorFlow executes. TensorFlow programs are encoded as computation [**graphs**](https://developers.google.com/machine-learning/glossary/#graph).
Since models are practically programs that TensorFlow executes, using untrusted models or graphs is equivalent to running untrusted code.

If you need to run untrusted models, execute them inside a [**sandbox**](https://developers.google.com/code-sandboxing). Memory corruptions in TensorFlow ops can be recognized as security issues only if they are reachable and exploitable through production-grade, benign models.
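
For concreteness, here is a minimal sketch (the path below is hypothetical) of why loading an untrusted model already amounts to code execution: invoking a loaded signature runs whatever ops its graph contains, with the privileges of the current process.

```python
import tensorflow as tf

# Hypothetical location of a model obtained from an untrusted source.
untrusted_dir = "/tmp/untrusted_model"

# Loading and invoking the model executes its graph: the ops it contains may
# read or write files, open network connections, or allocate huge buffers,
# all with the privileges of this Python process -- hence the sandbox advice.
model = tf.saved_model.load(untrusted_dir)
infer = model.signatures["serving_default"]

# Input names, dtypes, and shapes depend entirely on the untrusted model.
print(infer.structured_input_signature)
```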

### Compilation

Compiling models via the recommended entry points described in the [XLA](https://www.tensorflow.org/xla) and [JAX](https://jax.readthedocs.io/en/latest/jax-101/02-jitting.html) documentation should be safe. However, some of the testing and debugging tools that come with the compiler are not designed to be used with untrusted data and should be used with caution when working with untrusted models.
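
As a reference point, the documented TensorFlow entry point for explicit compilation is `tf.function(jit_compile=True)` (the JAX analogue is `jax.jit`). A minimal sketch:

```python
import tensorflow as tf

@tf.function(jit_compile=True)  # explicit XLA compilation through the public API
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 16])
w = tf.random.normal([16, 4])
b = tf.zeros([4])
y = dense_relu(x, w, b)  # compiled on first call for this input signature
```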

### Saved graphs and checkpoints

When loading untrusted serialized computation graphs (in the form of a `GraphDef`, `SavedModel`, or equivalent on-disk format), the set of computation primitives available to TensorFlow is powerful enough that you should assume the TensorFlow process effectively executes arbitrary code.
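
One partial mitigation is to inspect which op types a signature would run and flag unexpected ones (filesystem, network, `PyFunc`-style ops) before serving. This is a sketch only, and not a substitute for sandboxing, since `tf.saved_model.load` already parses attacker-controlled data; the path and the suspicious-op list below are illustrative.

```python
import tensorflow as tf

# Illustrative, non-exhaustive list of op types worth flagging.
SUSPICIOUS = {"ReadFile", "WriteFile", "PyFunc", "EagerPyFunc"}

def op_types_in_signature(saved_model_dir, signature="serving_default"):
    """Return the op types that the given signature's graph would execute."""
    model = tf.saved_model.load(saved_model_dir)
    fn = model.signatures[signature]
    return {op.type for op in fn.graph.get_operations()}

ops = op_types_in_signature("/tmp/untrusted_model")  # hypothetical path
print(sorted(ops & SUSPICIOUS) or "no flagged ops (not a proof of safety)")
```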

The risk of loading untrusted checkpoints depends on the code or graph that you are working with. When loading untrusted checkpoints, the values of the traced variables from your model are also untrusted. That means that if your code interacts with the filesystem, network, etc. and uses checkpointed variables as part of those interactions (for example, using a string variable to build a filesystem path), a maliciously created checkpoint might be able to change the targets of those operations, which could result in arbitrary reads, writes, or executions.
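
A minimal sketch of that risk (the module, variable, and paths below are hypothetical): the graph itself is benign, but the write target comes from a checkpointed variable, so whoever supplies the checkpoint chooses where the write lands.

```python
import tensorflow as tf

class Exporter(tf.Module):
    def __init__(self):
        super().__init__()
        # Stored in a tf.Variable, so the value travels with checkpoints.
        self.out_path = tf.Variable("/tmp/report.txt", dtype=tf.string)

    @tf.function(input_signature=[tf.TensorSpec([], tf.string)])
    def export(self, payload):
        # The write target is whatever the (possibly restored) variable holds.
        tf.io.write_file(self.out_path, payload)

exporter = Exporter()
# Restoring an untrusted checkpoint can silently repoint `out_path`
# (e.g. at a shell profile or an SSH key file) before `export` ever runs.
tf.train.Checkpoint(model=exporter).restore("/tmp/untrusted_ckpt-1")
exporter.export(tf.constant("attacker-chosen contents"))
```
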
### Running a TensorFlow server

TensorFlow is a platform for distributed computing, and as such there is a TensorFlow server (`tf.train.Server`). The TensorFlow server is intended for internal communication only. It is not built for use in untrusted environments or networks.

For performance reasons, the default TensorFlow server does not include any authorization protocol and sends messages unencrypted. It accepts connections from anywhere, and executes the graphs it is sent without performing any checks. Therefore, if you run a `tf.train.Server` in your network, anybody with access to the network can execute arbitrary code with the privileges of the user running the `tf.train.Server`.
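
For illustration, this is all it takes to expose such a server. A sketch only: the port and cluster layout are arbitrary, and `tf.distribute.Server` is assumed here to be the TF2 spelling of the server the text calls `tf.train.Server`.

```python
import tensorflow as tf

# A single-worker cluster bound to a local port (port number is arbitrary).
cluster = tf.train.ClusterSpec({"worker": ["localhost:2222"]})

# The server speaks unauthenticated, unencrypted gRPC and executes the graphs
# it is sent, so it must only be reachable from a trusted, isolated network.
server = tf.distribute.Server(cluster, job_name="worker", task_index=0)
print(server.target)  # e.g. "grpc://localhost:2222"
server.join()         # blocks, serving requests
```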

## Untrusted inputs during training and prediction

TensorFlow supports a wide range of input data formats. For example, it can process images, audio, video, and text. Several specialized modules take these formats, modify them, and/or convert them to intermediate formats that TensorFlow can process.

These modifications and conversions are handled by a variety of libraries that have different security properties and provide different levels of confidence when dealing with untrusted data. Based on the security history of these libraries, we consider it safe to work with untrusted inputs in the PNG, BMP, GIF, WAV, RAW, RAW\_PADDED, CSV, and PROTO formats. All other input formats, including tensorflow-io, should be sandboxed if used to process untrusted data.
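
As a sketch of staying on the safe-format list above, untrusted image bytes can be decoded in-process with the built-in PNG decoder; the file location and preprocessing are illustrative.

```python
import tensorflow as tf

def load_untrusted_png(png_bytes) -> tf.Tensor:
    # PNG is one of the formats listed above as safe to parse from untrusted
    # sources; formats outside that list should be decoded in a sandbox.
    image = tf.io.decode_png(png_bytes, channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

blob = tf.io.read_file("/tmp/upload.png")  # hypothetical upload location
tensor = load_untrusted_png(blob)
```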

For example, if an attacker were to upload a malicious video file, they could potentially exploit a vulnerability in the TensorFlow code that handles videos, which could allow them to execute arbitrary code on the system running TensorFlow.

It is important to keep TensorFlow up to date with the latest security patches and follow the sandboxing guideline above to protect against these types of vulnerabilities.

## Security properties of execution modes

TensorFlow has several execution modes, with Eager mode being the default in v2. Eager mode lets users write imperative-style statements that can be easily inspected and debugged, and it is intended to be used during the development phase.

As part of the differences that make Eager mode easier to debug, the [shape inference functions](https://www.tensorflow.org/guide/create_op#define_the_op_interface) are skipped, and any checks implemented inside the shape inference code are not executed.

The security impact of skipping those checks should be low, since the attack scenario would require a malicious user to be able to control the model, which, as stated above, is already equivalent to code execution. In any case, the recommendation is not to serve models using Eager mode, since it also has performance limitations.
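
To make the two modes concrete, a small sketch: eager statements run immediately, while wrapping the same code in `tf.function` traces it into a graph, which is the form better suited to serving, and graph-construction checks such as shape inference run during tracing.

```python
import tensorflow as tf

# Eager mode (TF2 default): runs op by op, easy to inspect while developing.
x = tf.ones([2, 3])
print(tf.nn.relu(x - 0.5))

# Graph mode via tf.function: traced once per input signature; suitable for
# serving/export, and graph-construction-time checks run during tracing.
@tf.function(input_signature=[tf.TensorSpec([None, 3], tf.float32)])
def serve(batch):
    return tf.nn.relu(batch - 0.5)

concrete = serve.get_concrete_function()  # the traced graph
print(concrete(tf.ones([2, 3])))
```
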
## Multi-Tenant environments

It is possible to run multiple TensorFlow models in parallel. For example, `ModelServer` collates all computation graphs exposed to it (from multiple `SavedModel`s) and executes them in parallel on available executors. Running TensorFlow in a multi-tenant design mixes the risks described above with the inherent risks of multi-tenant configurations. The primary areas of concern are tenant isolation, resource allocation, model sharing, and hardware attacks.

### Tenant isolation

Since any tenant or user providing models, graphs, or checkpoints can execute code in the context of the TensorFlow service, it is important to design isolation mechanisms that prevent unwanted access to other tenants' data.

Network isolation between different models is also important, not only to prevent unauthorized access to data or models, but also to prevent malicious users or tenants from sending graphs to execute under another tenant's identity.

The isolation mechanisms are the users' responsibility to design and implement, and therefore security issues deriving from their absence are not considered a vulnerability in TensorFlow.

### Resource allocation

A denial of service caused by one model could bring down the entire server, but we don't consider this a vulnerability, given that models can exhaust resources in many different ways and solutions exist to prevent this from happening (e.g., rate limits, ACLs, monitors to restart broken servers).
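
Rate limits and ACLs live outside TensorFlow, but as one in-process sketch, the GPU memory a single serving process can claim can be capped so that one tenant's model does not exhaust the accelerator; the 4 GiB figure is arbitrary.

```python
import tensorflow as tf

# Must run before the GPU is first used in this process.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],  # ~4 GiB
    )
```
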
### Model sharing

If the multi-tenant design allows sharing models, make sure that tenants and users are aware of the security risks detailed here and understand that they will, in practice, be running code provided by other users. Currently there are no good ways to detect malicious models, graphs, or checkpoints, so the recommended way to mitigate the risk in this scenario is to sandbox the model execution.

### Hardware attacks

Physical GPUs or TPUs can also be the target of attacks. [Published research](https://scholar.google.com/scholar?q=gpu+side+channel) shows that it might be possible to use side-channel attacks on the GPU to leak data from other models or processes running on the same system. GPUs can also have implementation bugs that might allow attackers to leave malicious code running and to leak or tamper with applications from other users. Please report such vulnerabilities to the vendor of the affected hardware accelerator.

## Reporting vulnerabilities

### Vulnerabilities in TensorFlow

This document covers different use cases for TensorFlow together with comments on whether those uses were recommended or considered safe, or where we recommend some form of isolation when dealing with untrusted data. As a result, this document also outlines which issues we consider to be TensorFlow security vulnerabilities.

We recognize issues as vulnerabilities only when they occur in scenarios that we outline as safe; issues that have a security impact only when TensorFlow is used in a discouraged way (e.g., running untrusted models or checkpoints, data parsing outside of the safe formats, etc.) are not treated as vulnerabilities.

### Reporting process

Please use the [Google Bug Hunters reporting form](https://g.co/vulnz) to report security vulnerabilities. Please include the following information along with your report:

- A descriptive title.
- Your name and affiliation (if any).
- A description of the technical details of the vulnerabilities.
- A minimal example of the vulnerability. It is very important to let us know how we can reproduce your findings. For memory corruption triggerable in TensorFlow models, please demonstrate an exploit against one of Alphabet's models in <https://tfhub.dev/>.
- An explanation of who can exploit this vulnerability, and what they gain when doing so. Write an attack scenario that demonstrates how your issue violates the use cases and security assumptions defined in the threat model. This will help us evaluate your report quickly, especially if the issue is complex.
- Whether this vulnerability is public or known to third parties. If it is, please provide details.

We will try to fix the problems as soon as possible. Vulnerabilities will, in general, be batched to be fixed at the same time as a quarterly release. We credit reporters for identifying security issues, although we keep your name confidential if you request it. Please see the Google Bug Hunters program website for more information.
