- Quick basics overview
- Strong Authentication
- Better Authorization
- Encryption
- Auditing & Visibility
- CM-based Configuration
- Perimeter
- Strong authentication
- Network isolation, edge nodes
- Firewalls, iptables
- Access
- Authorization controls
- Granular access to HDFS files, Hive/Impala objects
- Data
- Encryption-at-rest
- Encryption-in-transit
- Transport Layer Security (TLS)
- Visibility
- Auditing data practices without exposing content
- Separation of concerns: storage management vs. data stewardship
"Hadoop in Secure Mode" lists four areas of authentication concern. All of them depend on Kerberos, directly or indirectly
- Users
- Hadoop services
- Web consoles
- Data confidentiality
Linux supports MIT Kerberos
- See your Hadoop for Administrators notes for an overview
"Hadoop in Secure Mode" relies on Kerberos
- Data encryption services available out of the box
- RPC (SASL QOP "quality-of-protection")
- Browser authentication supported by HTTP SPNEGO
LDAP/Active Directory integration
- Applying existing user databases to a Hadoop cluster is a common ask
ELI5: Kerberos: a great introduction to (or refresher on) Kerberos concepts.
- Cloudera recommends Direct-to-AD integration as the preferred practice.
- The alternative is a one-way cross-realm trust to AD
- Requires an MIT Kerberos realm in the Hadoop cluster
- Avoids adding service principals to AD
- Common sticking points
- Admin reluctance(!)
- Version / feature incompatibility
- Misremembered details
- Other settings that "shouldn't be a problem"
`/etc/krb5.conf` doesn't authenticate to the KDC
- Test with `kinit AD_user`
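A quick way to verify this path is a manual ticket request. A minimal sketch, assuming a hypothetical AD user `aduser` in the realm `EXAMPLE.COM`:

```bash
# Request a TGT for an AD user using the realm settings in /etc/krb5.conf;
# a failure here points at krb5.conf, DNS, or KDC reachability rather than Hadoop.
kinit aduser@EXAMPLE.COM
# Confirm a krbtgt/EXAMPLE.COM@EXAMPLE.COM ticket was issued
klist
```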
The required encryption type isn't allowed by the JDK
- Install the JCE Unlimited Strength Jurisdiction Policy files
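One way to check whether the current JDK already allows strong ciphers (the unlimited-policy mechanics vary across JDK 8 updates, so treat this as a quick sanity check):

```bash
# Prints the maximum AES key length the JDK permits:
# 128 means the limited default policy; 2147483647 means unlimited strength is in place.
jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'
```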
Supported encryption types are disjoint
- Check AD "functional level"
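To see which encryption types were actually negotiated for your tickets (useful for spotting a mismatch between the cluster's krb5 settings and what AD offers):

```bash
# Lists cached tickets together with their encryption types;
# compare against the enctypes allowed in /etc/krb5.conf and by the AD domain.
klist -e
```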
To trace Kerberos & Hadoop
- `export KRB5_TRACE=/dev/stderr`
- Include `-Dsun.security.krb5.debug=true` in `HADOOP_OPTS` (& export it)
- `export HADOOP_ROOT_LOGGER="DEBUG,console"`
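Put together, a minimal tracing session might look like this (the `hdfs dfs -ls` call is just an example command that forces authentication):

```bash
export KRB5_TRACE=/dev/stderr                                      # MIT Kerberos library trace
export HADOOP_OPTS="$HADOOP_OPTS -Dsun.security.krb5.debug=true"   # JVM-side Kerberos debug output
export HADOOP_ROOT_LOGGER="DEBUG,console"                          # verbose Hadoop client logging
hdfs dfs -ls /                                                     # any command that authenticates to the cluster
```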
- HDFS permissions & ACLs
- Need principal definitions beyond user-group-world
- Relief from edge cases and implications of hierarchical data
- Can provide permissions for a restricted list of users and groups
- Apache Sentry (incubating)
- Database servers need files for storage, managed by admins
- Authorizations needed for database objects may be disjoint from file permissions
- Plain HDFS permissions are largely POSIX-ish
- The execute bit isn't used, except as the sticky bit on directories
- Applied to simple or Kerberos credentials
- The NameNode process owner is the HDFS superuser
- POSIX-style ACLs also supported
- Disabled by default (`dfs.namenode.acls.enabled`)
- Additional permissions for named users, groups, other, and the mask
- `chmod` operates on the mask, which filters the effective permissions
- Best used to refine, not replace, file permissions
- Some overhead to store/process them
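A minimal sketch of working with HDFS ACLs once `dfs.namenode.acls.enabled` is set to true (the user and path names here are made up):

```bash
hdfs dfs -setfacl -m user:alice:r-x /data/sales   # grant a named user access beyond owner/group/other
hdfs dfs -setfacl -m mask::r-x /data/sales        # the mask caps the effective permissions of named entries
hdfs dfs -getfacl /data/sales                     # show ACL entries and effective permissions
```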
- Originally a Cloudera project, now Apache incubating
- Some useful docs are not yet migrated to ASF
- Supports authorization for database objects
- Objects: server, database, table, view, URI
- Authorizations: `SELECT`, `INSERT`, `ALL`
- A Sentry policy is defined by mapping a role to a privilege
- A group (LDAP or Linux) is then assigned to a Sentry role
- Users can be added or removed from the group as necessary
- Supports Hive (through HiveServer2), Impala and Search (Solr) out of the box
- Sentry policy is defined by mappings
- Local/LDAP groups -> Sentry roles
- Sentry roles -> database object, privileges
- Each service has to bind to a policy engine
- Currently `impalad` and HiveServer2 have hooks
- Cloudera Search integration is a workaround
- Service Provider interfaces for persisting policies to a store
- Supports storing the policy file in HDFS or on the local filesystem
- The policy engine grants/revokes access
- Rules are applied to the user, the objects requested, and the necessary permission
- Sentry / HDFS Synchronization
- Automatically adds ACLs to match permission grants in Sentry
- A fully-formed config example is here
- You can watch a short video overview here
Sentry and HiveServer2
- Relational model and storage
- Introduced in CDH 5.1
- Uses a database to store policies
- CDH supports migrating file-based authorization:
  `sentry --command config-tool --policyIni policy_file --import`
- Impala & Hive must use the same provider (db or file)
- Cloudera Search can only use the file provider
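For the file provider, a policy file is an INI mapping groups to roles and roles to privileges. A minimal sketch (the group, role, and object names are invented), followed by the import command mentioned above:

```bash
# Hypothetical file-based Sentry policy: the 'analysts' group gets SELECT on one table
cat > /tmp/sentry-policy.ini <<'EOF'
[groups]
analysts = analyst_role

[roles]
analyst_role = server=server1->db=sales->table=orders->action=select
EOF

# Migrate the file-based policy into the database-backed Sentry store
sentry --command config-tool --policyIni /tmp/sentry-policy.ini --import
```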
Network ("in-flight") encryption
- For communication between web services (HTTPS)
- Digital certificates, private key stores
- HDFS block data transfer: `dfs.encrypt.data.transfer` (very slow; not recommended for now)
- RPC support already in place
- Support includes MR shuffling, Web UI, HDFS data and fsimage transfers
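A quick way to check whether wire encryption is already in effect on a client (property names from the Hadoop documentation; in this course the values are normally managed through Cloudera Manager):

```bash
hdfs getconf -confKey hadoop.rpc.protection       # "privacy" enables RPC encryption via SASL QOP
hdfs getconf -confKey dfs.encrypt.data.transfer   # "true" enables HDFS block data transfer encryption
```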
At-rest encryption
- Encryption/decryption that is transparent to Hadoop applications
- Need: Key-based protection
- Need: Minimal performance cost
- AES-NI on recent Intel CPUs.
- Navigator Encrypt
- Block device encryption at OS level
- HDFS Transparent Data Encryption
- Encryption Zones
- Key Management Server (KMS)
- Key Trustee
- Cloudera's enterprise-grade keystore
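A minimal sketch of HDFS Transparent Data Encryption once a KMS is configured (the key and path names are made up):

```bash
hadoop key create sales_key                                        # create an encryption key in the KMS
hdfs dfs -mkdir -p /secure/sales                                   # the zone root must be an empty directory
hdfs crypto -createZone -keyName sales_key -path /secure/sales     # mark the directory as an encryption zone
hdfs crypto -listZones                                             # confirm the zone (run as the HDFS superuser)
```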
Other requirements
- Tokenization
- Data masking
- Leverage partners for this (Protegrity, Dataguise, etc.)
- Provided by Cloudera Navigator
- See who has accessed resources (filesystem, databases, log of queries run)
- Custom reports
- e.g. show all failed access attempts
- Redaction of sensitive information
- Separation of duties
- Know the network ports that CDH and third-party software use
- Set up a dedicated Kerberos Domain Controller
- KRB5 MIT instructions are here
- Cloudera slightly higher-level instructions are here
- Or you can use RedHat's documentation
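On a RHEL/CentOS host the basic steps look roughly like this (a sketch; follow the linked documentation for the full procedure):

```bash
yum install -y krb5-server krb5-libs krb5-workstation   # KDC, libraries, and client tools
# Edit /etc/krb5.conf and /var/kerberos/krb5kdc/kdc.conf for your realm before continuing
kdb5_util create -s                                      # create the Kerberos database and stash file
systemctl start krb5kdc kadmin                           # start the KDC and the admin server
```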
- Make sure your KDC allows renewable tickets
- Kerberos tickets are not renewable by default in many Linux distributions (see RHEL docs).
- Configure renewable tickets before the Kerberos database is initialized.
- If you modify these parameters after initialization, you can:
  - Change the maxlife for all principals (`krbtgt/REALM` too) with `modprinc`, or
  - Destroy the KDB and remake it.
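If the database already exists, a sketch of the `modprinc` route (the realm name is illustrative):

```bash
# Allow renewable tickets for the ticket-granting principal; repeat for other existing principals as needed
kadmin.local -q "modprinc -maxrenewlife 7days krbtgt/EXAMPLE.COM@EXAMPLE.COM"
# For new principals, also set max_renewable_life = 7d in the realm stanza of kdc.conf
```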
- Create a KDC account for the Cloudera Manager user
- Plan one: follow the documentation here
- Plan two: Launch the Kerberos wizard and complete the checklist.
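A sketch of the principal Cloudera Manager typically uses to generate the service principals (the name, password, and realm are illustrative; check the linked documentation for the exact kadm5.acl privileges required):

```bash
kadmin.local -q "addprinc -pw changeme cloudera-scm/admin@EXAMPLE.COM"
# kadm5.acl must grant this principal rights to add and modify principals, e.g.:
#   cloudera-scm/admin@EXAMPLE.COM  *
```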
- Set up an MIT KDC
- Create a Linux account with your GitHub name
- Once your integration succeeds, add these files to your `security/` folder:
  - `/etc/krb5.conf`
  - `/var/kerberos/krb5kdc/kdc.conf`
  - `/var/kerberos/krb5kdc/kadm5.acl`
- Create a file `kinit.md` that includes:
  - The `kinit` command you use to authenticate your user
  - The output from `klist` showing your credentials
- Create a file `cm_creds.png` that shows the principals CM generated
- There's a lot of work in this lab. If you choose to do it, be sure to:
- Ignore the steps to set up CDH 5 (already done)
- Test client connectivity with JDBC
- Set up and integrate an Active Directory instance
- Test with a secured client connection
- Enable Kerberos
- Add a Sentry configuration to the mix
- Test client connection again
If you're comfortable with AD, this may take an hour. If not, maybe 2-3 hours. Let your instructors know if you want to attempt this lab.
Complete one of the following labs: