How can Hadoop help secure your big data?
Big data demands new large-capacity storage, and these relatively new platforms potentially make that data vulnerable to hacking or leakage. Hadoop is one mainstream big data platform available to organizations – so what steps are needed to make it as secure as possible?
Hadoop is an open-source software framework composed of various components that interact with each other, and this structure can be vulnerable if not adequately secured.
With our clients, we have noticed a tendency to focus primarily on perimeter security while neglecting internal security. Sometimes, inconsistent authorization methods are implemented across the different Hadoop components, leading to oversights and misconfigurations. It is also essential to consider critical security settings over and above Hadoop's default configurations.
There are four important areas to focus on when implementing Hadoop securely:
- Authentication
- Authorization
- Auditing
- Data-level controls, including encryption
There are two recommended measures to strengthen authentication in the Hadoop ecosystem:
- Implement Kerberos as the authentication mechanism
This provides greater security than Hadoop's default simple authentication protocol. Kerberos authenticates through a ticketing system and requires both services ("components") and users to be authenticated before they can access the cluster.
- Integrate authentication with an Enterprise Directory
This enables centralized management and administration of user access, as well as single sign-on (SSO). Hadoop provides native integration with Enterprise Directory services, including Lightweight Directory Access Protocol (LDAP) and Active Directory (AD).
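As an illustration, both measures are typically configured in Hadoop's `core-site.xml`. A minimal sketch, in which the directory URL, bind user, and search base are placeholders for your environment:

```xml
<!-- core-site.xml: switch from simple to Kerberos authentication -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

<!-- Resolve group membership from an enterprise LDAP/AD directory
     (URL, bind user, and search base below are placeholders) -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldaps://ad.example.com:636</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-bind,ou=service,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
```

Note that Kerberos also requires each service to be provisioned with its own principal and keytab; the settings above only switch the cluster's authentication mode and group lookup.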
Authorization is one of the most critical parts of securing a Hadoop infrastructure and specifies the actions that authenticated users can perform. Some key challenges include:
- Configuring the authorization to prevent access between departments
- Multilevel authorizations and queue management
- Managing the configurations that are unique to each component, since each component provides different services
Here are some recommended approaches for setting up and managing authorization within a Hadoop ecosystem:
- Set permissions down to the file level
Different groups within a shared Hadoop infrastructure can share the same file system. Enabling Hadoop Distributed File System (HDFS) permissions at file level ensures unauthorized users cannot access the data at rest and segregates data in a multi-tenant environment.
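As a sketch, both enforcement of file-level permissions and POSIX-style ACLs (for finer-grained, multi-tenant grants) are controlled in `hdfs-site.xml`:

```xml
<!-- hdfs-site.xml: enforce file-level permission checks on the NameNode -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<!-- Allow extended ACLs beyond the basic owner/group/other model -->
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
```

With ACLs enabled, per-group grants can then be applied with `hdfs dfs -setfacl`, for example to give a single analytics group read access to a shared dataset without widening its base permissions.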
- Configure service-level authorizations
The service-level authorizations dictate the users and groups that are able to run jobs or services, so it is important to secure components that provide access to the data in addition to securing data at rest.
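These checks are driven by `hadoop-policy.xml` once `hadoop.security.authorization` is enabled in `core-site.xml`. A sketch with placeholder group names:

```xml
<!-- hadoop-policy.xml: restrict who may act as an HDFS client or submit
     YARN applications; group names below are placeholders -->
<property>
  <name>security.client.protocol.acl</name>
  <value> hdfs-users</value>
</property>
<property>
  <name>security.applicationclient.protocol.acl</name>
  <value> yarn-submitters</value>
</property>
```

The value format is a comma-separated list of users, a space, then a comma-separated list of groups; a value that begins with a space grants access to the listed groups only.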
- Limit access to job data solely to the user that requested it
Secure jobs executed within the Hadoop ecosystem so that only the originator of a request has access to its output. This is important because, by default, all Hadoop jobs run under the same system ID, "yarn"; if that account is compromised, anyone in the cluster can read all the data within it.
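On a Kerberized cluster, this is typically achieved by running YARN containers as the submitting user rather than as the shared "yarn" account, via the `LinuxContainerExecutor`. A sketch of the relevant `yarn-site.xml` settings (the group name is a placeholder):

```xml
<!-- yarn-site.xml: run containers as the submitting user, not "yarn" -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```

The setuid `container-executor` binary must also be configured on each node (in `container-executor.cfg`), including banned users and a minimum UID, before this takes effect.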
- Implement a comprehensive queue modeling system
While securing access to individual queues, set priorities to segregate access for different departments or user groups in a multi-tenant environment.
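As a sketch of such a model, the Capacity Scheduler's `capacity-scheduler.xml` can define one queue per department with submit ACLs; queue names, capacities, and group names below are illustrative placeholders:

```xml
<!-- capacity-scheduler.xml: one queue per department with submit ACLs -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>finance,marketing</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>40</value>
</property>
<!-- Leading space in the value means "no individual users, only these groups" -->
<property>
  <name>yarn.scheduler.capacity.root.finance.acl_submit_applications</name>
  <value> finance-analysts</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.acl_submit_applications</name>
  <value> marketing-analysts</value>
</property>
```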
- Centralize the management of authorization configurations
Use an authorization manager, such as Apache Ranger, to reduce risk of inconsistent or out-of-date configurations across the Hadoop cluster.
Auditing measures can actively help prevent security breaches, so they should be treated as more than a means of satisfying regulatory and security compliance.
Two steps to take in creating an audit baseline within a Hadoop ecosystem are:
- Ensure HDFS and MapReduce audit logs are adequately set to track both service and job activities, at the file system level and at the compute layer.
- Ensure aggregate logs generated by the different components within the Hadoop ecosystem are passed into a Security Information and Event Management (SIEM) application, such as IBM QRadar. This helps manage the vast volume of logs generated by the various Hadoop components and allows for correlation.
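As an example of the first step, HDFS audit logging is controlled through Hadoop's `log4j.properties`. A sketch based on the logger names Hadoop ships with (the log file path is a placeholder):

```properties
# Route NameNode audit events to a dedicated rolling file
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false

log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=/var/log/hadoop/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```

A dedicated audit file also gives the SIEM forwarder a single, stable source to tail per node.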
No control is complete without applying data-level controls. There are two categories of data within Hadoop to focus on protecting:
- Sensitive data that has been loaded into Hadoop (business data or customer data) for analysis
- “Insights” — i.e., data that has already been analyzed. Such information, if exposed, can lead to great losses, as correlation has already been established
This data can flow within a Hadoop ecosystem as data at rest (in HDFS) or in motion (through the components). In either case, data can be secured by:
- Setting encryption zones across parts of the file system and leaving other areas unencrypted, in order to balance performance and security in a multi-tenant environment. A Hadoop cluster can easily contain petabytes of data, so securing it is critical given the wealth of information stored.
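  HDFS transparent encryption implements such zones through the Hadoop Key Management Server (KMS). A sketch, in which the KMS host is a placeholder:

  ```xml
  <!-- core-site.xml: point the cluster at a KMS instance (placeholder host) -->
  <property>
    <name>hadoop.security.key.provider.path</name>
    <value>kms://https@kms.example.com:9600/kms</value>
  </property>
  ```

  A zone is then created with `hadoop key create zone1key` followed by `hdfs crypto -createZone -keyName zone1key -path /secure` (the target directory must be empty); files written under that path are encrypted and decrypted transparently, while other paths remain unencrypted.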
- Securing access across endpoint connections, as these connections can provide an entry point for attackers into the Hadoop ecosystem. For example, at the client level, secure clients communicating with HDFS through remote procedure calls and the data transfer protocol; for user-facing mechanisms, secure browser and command-line interfaces via HTTPS and JDBC; and secure the shuffle by applying HTTPS during data exchange in the core analytics components.
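The in-motion protections above map onto a handful of standard settings; a sketch, using the strictest available values:

```xml
<!-- core-site.xml: encrypt Hadoop RPC traffic -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the DataNode data transfer protocol
     and force web UIs/REST endpoints onto HTTPS -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>privacy</value>
</property>
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>

<!-- mapred-site.xml: encrypt the MapReduce shuffle -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```

HTTPS settings additionally require keystores and truststores to be provisioned on each node (via `ssl-server.xml` and `ssl-client.xml`); the fragment above only enables the enforcement.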
Securing a big data infrastructure introduces further challenges that are not applicable to traditional relational database technologies and, as such, a structured approach should be taken.
Although there are many more configurations that can be applied, this article highlights just some of the key areas that should be prioritized when securing a big data infrastructure using Hadoop.