Many of the basic principles for securing a data lake will be familiar to anyone who has secured a cloud storage container. This should come as no surprise, since most commercial data lakes build on existing cloud infrastructure.
However, data lakes add elements such as data feeds and data analysis (data lakehouses, third-party analysis tools, etc.) that increase the complexity of interactions well beyond those of a simple storage container. Essentially, we are securing an app at scale, with enormous requirements for stored data, incoming data, data interactions, and network connections. Given the importance of “Big Data” analytics and applications to a company’s financial performance, securing data lakes is a critical priority for security teams.
There are too many possible data lake variations to provide specific steps for every security use case. However, just as with a server, a network, or any other IT infrastructure component, we can outline security principles. These principles help teams understand the goals; teams then need to work through the available options and ensure they align with their goals, principles, and policies.
Data Lake Security Scope
A data governance manager will focus intensely on the access, transmission, and storage of data, but an IT security manager must take a broader perspective that encompasses the infrastructure and tools. However, while the scope of possible concerns ranges from bare metal devices to the code within the applications, the realistic scope will be narrower and depend on how our specific data lake has been established.
Some data lakes lean towards a Software-as-a-Service (SaaS) model where most security functions have been built into the software and IT security teams just need to verify a few details. At the other end of the spectrum, in-house data lakes may even involve securing the bare metal devices and the rooms and buildings that contain them.
Despite this spectrum of possible implementations, IT professionals should already have a reasonable understanding of how to secure the typical SaaS solution or the bare-metal data center. Therefore, this article will focus on data lake-specific concerns and set aside general, well-understood aspects of security such as identity verification, malware scanning, resilience (backups, etc.), firewalls, network threat detection, and incident response.
Critical Data Lake Security Concerns: Visibility & Controls
A data lake potentially contains all of the data of an organization. With all data in one location, the risks become compounded.
Attackers will focus their efforts on gaining access to the data lake to attempt to exfiltrate valuable information. Insider threats will attempt to exceed their access limitations or inappropriately download valuable information. Even authorized and appropriate data access needs to be managed to protect from accidental breach of top secret or regulated data to unauthorized personnel.
To guard against inappropriate access, security teams need to know their data, know their users, and define authorized access. Once the security team implements appropriate access they then need to test that access and monitor ongoing access for inappropriate use.
These data lake security concerns can be generalized as either visibility or controls. Without mastering these two, every other aspect of security remains at risk.
Data Lake Visibility
To ensure the appropriate use of data, the data security team must first have visibility into the data. Once the data is known, controls should enforce appropriate access; even then, the security team will need visibility into usage to verify the controls are working properly.
Data Lake Data Visibility: Classification
Classification of the data will be critical for effective access control within the data lake. While initial classification can be performed based upon the intake sources, ultimately the data within the data lake files will need to be inspected for sensitive data.
Different data lakes offer different capabilities through built-in features, and additional features may be available through separate products or third-party add-ons. The capabilities of these features also vary: some products can classify files, while others only add classifications to the data extracted from the files.
For perspective on what this might mean, consider the following examples:
- AWS Data Lakes
  - Offer an add-on service: Amazon Macie
  - Macie is billed on a per-GB-processed basis
  - Macie searches documents using machine learning (ML) algorithms to identify and flag sensitive information
- Snowflake
  - Classification applies to data stored in tables and views
  - Classification incurs compute costs, but no extra licensing fee
  - Classification uses machine learning to suggest categories
  - Classification categories include:
    - Semantic categories (name, address, age, etc.)
    - Privacy categories (subcategories of semantic categories)
      - Direct identifiers (name, social security number, etc.)
      - Indirect identifiers (age, gender, zip, etc.)
      - Personal attributes (salary, healthcare status, etc.)
With the enormous scale of a typical data lake, the best practice will require automated detection and classification of data as data is loaded into the data lake. Once categorized, data can also be organized, cleaned, and moved to enable strong controls.
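As a simplified illustration of classifying data at ingestion, the sketch below tags records with a few regular expressions. The pattern names and patterns are invented for this example; production classifiers (Amazon Macie, Snowflake classification, etc.) use far richer ML-based detection.

```python
import re

# Illustrative patterns only -- real detection services use ML models
# and far broader pattern libraries than these.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify_record(record: str) -> set:
    """Return the set of sensitive-data categories found in a record."""
    return {name for name, rx in PATTERNS.items() if rx.search(record)}

def tag_on_ingest(records):
    """Attach classification tags to each record as it enters the lake."""
    return [{"data": r, "tags": sorted(classify_record(r))} for r in records]
```

Because the tags are attached at ingestion time, downstream controls can key off them without re-scanning the raw data.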
While more of a data governance issue, data security teams need to know how regulated data should be handled so that they can properly categorize and establish the controls for the data. For example, if a social security number or personal information related to a European Union (EU) citizen is located, what happens next?
Databricks recommends a proactive approach: delete EU PII during data ingestion, because the General Data Protection Regulation (GDPR) allows penalties of up to 20 million euros or 4% of annual global turnover, whichever is higher, to be levied against violators. Violations can occur for data breaches or even for failing to properly erase data (under the EU right to be forgotten).
Even if the company decides to keep the data, data governance needs to determine who can see or search the data and under what circumstances. One way to ensure proper handling will be to move data into specific folders (for Azure, Google, etc.) or categorized with the restricted data category to flag the data within the database (for Snowflake, Databricks, etc.).
While discussed in more detail below, folder-level security controls scale much better than more granular object-based controls (on files, tables, etc.) that need to be separately assigned, monitored, and maintained. Data lake tools can detect categories and move data during ingestion so that the data can be automatically categorized and secured quickly and at scale.
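A minimal sketch of category-driven routing during ingestion might look like the following. The folder prefixes and tag names are hypothetical; in practice they would come from your governance policy.

```python
# Hypothetical routing rules: first matching rule wins, so the most
# restrictive categories are listed first.
ROUTING = [
    ("US_SSN", "restricted/pii/"),
    ("EU_PII", "restricted/gdpr/"),
    ("EMAIL", "internal/contact/"),
]
DEFAULT_PREFIX = "raw/general/"

def destination(tags, filename: str) -> str:
    """Pick a destination folder for an ingested file based on its
    classification tags; unclassified data lands in a default zone."""
    for tag, prefix in ROUTING:
        if tag in tags:
            return prefix + filename
    return DEFAULT_PREFIX + filename
```

With routing like this, folder-level access controls applied once to `restricted/` automatically cover every newly ingested sensitive file.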
Data Lake Usage Visibility: Monitoring & Logging
Once a security team sets up the perfect categorization and controls, the data lake is protected! At least theoretically. The security team will need visibility into the actions of users and APIs to see if the theory continues to work in practice.
Security teams will need to establish, maintain, and audit various logs and alerts within the data lake environment. Considering the scale of the data lake, automation may be required on most alerts to block suspicious activity.
Security teams need to double-check audit logging within the data lake to determine what needs to be enabled based upon the capacity and budget of the security team. For example, admin activity is on by default for Google data lakes, but data access logs are off by default to reduce noise and storage volume.
In the event of an incident, the standard incident response practice should apply equally to the data lake as much as any other resource. However, security teams do need to verify the correct flow of alerts and evidence to the incident response team.
Also read: How to Create an Incident Response Plan
Data Lake Controls
Once data governance teams determine how data should be handled, data security teams enact controls to enforce those rules. Often, these rules will be extensions of existing IT policies, but some policies may need to be reviewed and revised. The major types of security controls for data lakes fall into the categories of: isolation, authorization, encryption, transmission, and storage.
Data Lake Isolation
As security professionals, we want to limit the number of connections we need to control and manage for our IT infrastructure. Many vendors recommend isolation as the first step in setting up a data lake.
Although terminology varies between solutions, data lakes can be established as private and hidden to prevent casual detection by outside parties. Some vendors recommend allowing the data lake to be accessed only from within a secured corporate network. With the wider adoption of perimeter-less security, the corporate network in that recommendation might be replaced with secure gateways or other zero trust network setups.
Tightly restricting communication between the data lake and the outside world can reduce the ability of attackers to exfiltrate data. However, Data Loss Prevention (DLP) measures should also be established to deter insider threat and successful attacks from exfiltrating data.
Similar isolation controls should be placed on computer clusters managed by the organization that will execute queries. Such controls can include, but are not limited to, restricting secure shell (SSH) and network access.
Azure Private Link and AWS PrivateLink offer cloud-native and branded solutions that create private network connections between company resources and isolate the data lake from the internet or other public resources. Of course, companies can always put in more effort and establish their own secure gateways for a similar result.
Some vendors, such as Databricks, build some of these security features into the default cluster. However, it remains the responsibility of the security team to verify the specifics and to consider possible weaknesses that may require mitigation.
Data Lake Authorization Controls
Authorization controls access based upon data categorization. Access may be determined by the user, the API, or even the specific query.
In alignment with modern zero-trust principles, the default data access status should be set to deny access entirely. Access should be actively granted based upon need.
To allow users to gain entry to the data lake, most data lakes will integrate with standard identity access management (IAM) technology, although the specifics might vary between implementations. Each technology can then add additional layers of security to associate specific users, groups, projects, or companies with specific data repositories (folders, files, or data columns).
For some examples of affiliated IAM tools:
- Azure Data Lakes
  - Azure Active Directory (AAD)
  - Azure Security Groups
- AWS Data Lakes
  - AWS Identity and Access Management
  - AWS Directory Service
- Databricks and Snowflake
  - Integrate with both Azure and AWS IAM
After authorizing a user to connect to the data lake, we need to determine how long the connection is permitted to last. Most data lakes apply a default session time-out, which may be modified to match normal corporate policies. However, security teams will need to work with data scientists to ensure accounts don’t time out in the middle of long database queries.
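The time-out logic described above could be sketched like this, with an illustrative grace rule that keeps a session alive while a query is still running. The timeout value is an assumed corporate default, not any vendor's setting.

```python
import time

SESSION_TIMEOUT_SECS = 30 * 60  # assumed corporate default: 30 minutes

def session_expired(last_activity: float, query_running: bool, now=None) -> bool:
    """Idle sessions expire after the timeout, but a session with a
    long-running query still in flight stays alive so results are not
    lost mid-query."""
    now = time.time() if now is None else now
    if query_running:
        return False
    return (now - last_activity) > SESSION_TIMEOUT_SECS
```

Real data lakes implement the equivalent inside the service; the sketch just makes the policy trade-off explicit.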
As for access to the data itself, most data lakes operate with some sort of hierarchy that roughly corresponds to traditional IT infrastructure.
Some data lakes (Azure, etc.) put raw data into folders and secure the folders like one might secure folders on a shared server. Folders can be assigned to specific user groups, projects, or even specific users and sub-folders can either inherit parent folder rights or be assigned unique rights.
Assigned rights might be full read-write-copy-query or may be limited (read-only, etc.). Security teams will need to verify how their specific data lake implementation handles newly created child folders to ensure that parent folder rights and user assignments are inherited as intended.
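To see how folder inheritance plays out, here is a toy resolver that walks up the folder tree until it finds an explicit grant, denying by default when none exists. The grant table, paths, and right names are hypothetical; real implementations (e.g., ADLS POSIX-style ACLs) are considerably richer.

```python
# Hypothetical explicit grants per folder path.
GRANTS = {
    "/finance": {"analysts": "read", "finance-team": "read-write"},
    "/finance/payroll": {"finance-team": "read"},  # overrides the parent
}

def effective_right(path: str, group: str) -> str:
    """Walk from the folder up toward the root; the nearest explicit
    grant wins, and with no grant anywhere the default is deny."""
    while path:
        rights = GRANTS.get(path, {})
        if group in rights:
            return rights[group]
        path = path.rsplit("/", 1)[0]
    return "deny"
```

Note how the `/finance/payroll` entry narrows the finance team's rights while `analysts` still inherit theirs from the parent folder.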
Other tools (Snowflake, etc.) use a combination of Discretionary Access Control (DAC) and Role-based Access Control (RBAC) to provide similar rights but based upon specific objects (warehouse, database, etc.) instead of using folders. In both cases, users need to be carefully assigned to groups or roles so that they are granted the appropriate rights.
Some tools provide many preset roles and place limits on the number of principals per policy. For example, each policy in Google’s data lake supports only 1,500 principals but comes with many predefined roles such as “Actions Admin,” “ApiGateway Viewer,” and “Monitoring Dashboard Configuration Editor.”
With Active Directory (AD) integration, these rights might be inherited from the organization’s source IT infrastructure, but security teams will need to audit these roles to verify that:
- Roles in AD transferred correctly
- Users are in the appropriate roles
- The appropriate roles correspond to the correct data in the data lake.
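An audit of this kind can be automated by diffing directory membership against data lake role membership. The report structure below is one possible shape for such a drift check, not any vendor's API.

```python
def audit_role_sync(ad_roles, lake_roles):
    """Compare role membership in the directory (source of truth)
    against role membership in the data lake and report drift.

    Both arguments map role name -> set of user names.
    """
    report = {"missing_roles": [], "extra_users": {}, "missing_users": {}}
    for role, members in ad_roles.items():
        if role not in lake_roles:
            report["missing_roles"].append(role)  # role never transferred
            continue
        extra = lake_roles[role] - members        # users AD never granted
        missing = members - lake_roles[role]      # users who lost access
        if extra:
            report["extra_users"][role] = sorted(extra)
        if missing:
            report["missing_users"][role] = sorted(missing)
    return report
```

Run on a schedule, a check like this turns a manual audit into an alert whenever the lake drifts from the directory.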
Data Lake Encryption
Regarding encryption, most tools will provide built-in encryption keys, but many will also integrate with various key management technologies to allow for an organization’s direct control of the encryption keys.
- Azure defaults to the Azure Key Vault technology
- AWS Data Lakes default to AWS Key Management Services
- Databricks integrates with Azure, AWS, and customer’s internal key management services
- Snowflake generates private keys and supports customers’ internal key management; a minimum 2048-bit RSA key pair is required
Encryption should be applied for data in transit, in storage, and also for data staged for loading.
Security teams should verify that the default encryption meets the minimum criteria of the organization’s security standards. Some data lakes will also permit periodic rekeying of encrypted data, and security teams seeking this higher level of security should check the time required to execute a rekey and the costs that might be incurred.
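Before enabling rekeying, a back-of-the-envelope estimate of duration and cost can frame the decision. The formula below is deliberately crude and the parameters are placeholders; real figures depend on the vendor's mechanism (some rewrap only key material, which is far cheaper than re-encrypting the data itself).

```python
def rekey_estimate(data_tb: float, throughput_mb_s: float, cost_per_tb: float):
    """Rough wall-clock hours and compute cost to re-encrypt a data set.

    data_tb         -- volume to re-encrypt, in TB
    throughput_mb_s -- sustained re-encryption throughput, in MB/s
    cost_per_tb     -- assumed compute/processing cost per TB
    """
    seconds = (data_tb * 1024 * 1024) / throughput_mb_s  # TB -> MB
    return seconds / 3600, data_tb * cost_per_tb
```

Even this crude arithmetic makes clear that rekeying a petabyte-scale lake by re-encrypting data is a project, not a nightly job.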
Data Lakes Transmission Security
We primarily think of networks when we consider data transmission. Most data lakes will default to encrypted data transmission, but security teams need to double-check and verify encryption is in place.
Data lakes should limit network exposure by restricting connections to the data lake to single IP lists or IP address ranges belonging to internal networks or network gateways (see Isolation, above). Some data lake tools (example: Snowflake) allow network policies to be applied to individual users, although this level of granularity may be difficult to maintain at scale for large numbers of users and applications.
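An IP-allowlist check of the kind these network policies perform can be sketched with Python's standard `ipaddress` module. The ranges shown are documentation examples, not real corporate addresses.

```python
import ipaddress

# Hypothetical allowlist -- substitute your real internal ranges and
# gateway addresses.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # internal corporate network
    ipaddress.ip_network("203.0.113.5/32"),  # secure gateway (example IP)
]

def connection_allowed(client_ip: str) -> bool:
    """Admit a connection only if its source IP falls inside an
    approved range; everything else is rejected."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Data lake platforms evaluate equivalent rules server-side, but the logic is the same: an explicit allowlist with an implicit deny.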
For tools that connect to multiple cloud resources (Databricks, Snowflake), network security managers will also need to ensure that the connections, even if established automatically through the application, have been configured correctly. There may also be manual private network connections (e.g., AWS PrivateLink) that will be fully the responsibility of the organization’s security team to connect, secure, and maintain.
Security teams must beware that not all tools will default to encrypted communication. For example, Microsoft Azure requires “Secure transfer required” to be enabled to block unencrypted HTTP and SMB connections.
However, beyond these network connections, data lakes also require security teams to consider API connections or specific query connections and the metadata or database column information they might return. Many tools add interface controls that provide additional security features, or the tools themselves may act as intermediaries that receive queries and pass data along while preventing the requestor from touching the data directly.
Some sensitive data will be retained and available for query, but not for display. Certain columns might be designated for obfuscation based on the type of data or the classification of the user viewing the results, with sensitive values replaced by asterisks or otherwise encrypted or blocked.
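Display-time obfuscation of this kind (often called dynamic data masking) can be illustrated with a small role-aware masking function. The privileged role name and the masking format are assumptions for this sketch.

```python
import re

SSN_RX = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def mask_ssn(value: str, viewer_role: str) -> str:
    """Return the value unmasked for privileged viewers; otherwise
    replace all but the last four SSN digits with asterisks. The data
    stays intact and queryable in storage -- only the display changes."""
    if viewer_role == "privacy-officer":  # hypothetical privileged role
        return value
    return SSN_RX.sub(lambda m: "***-**-" + m.group(3), value)
```

Platforms such as Snowflake implement this server-side through masking policies attached to columns, so the unmasked value never reaches an unprivileged client.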
Transmission security should also apply to any tool connected to the data lake. Each data lake tool will have its own specific connectors, APIs, and drivers that require specific formatting, configuration, and procedures to securely connect to the data lake.
Snowflake categorizes connections as Snowflake Ecosystem tools, Snowflake Partner connections, General Configuration (diagnostic tools, query text size limits, etc.), SnowSQL command-line client, and other connections and drivers for Python, Spark, etc. Security teams will need to check the equivalent connections for their specific data lake implementation and monitor the data lake to prevent rogue connections.
Data Lake Storage Controls
The most common security control for data storage is encryption, which we’ve already covered. So long as the data has been properly classified, the access controls should handle most of the access permissions to stored data.
However, security teams will also need to check what types of default security exist within the data lake and what might need to be enabled. For example, Microsoft Azure recommends enabling Microsoft Defender for the Cloud for all storage accounts to detect and eliminate malware loaded to the data lake.
Critical data can usually be designated for immutable storage, where it cannot be altered or deleted through user actions on the data lake. Soft delete options may also be available that allow container or data recovery within a certain time period after deletion.
Data can also be automatically designated for transfer to cold storage or deletion based upon its usage or age. Data security teams should work with data governance teams to classify and enable the data on the data lake appropriately for immutable, soft delete, cold storage, and automated deletion.
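The tiering decisions described above amount to a policy function over age, classification, and usage. The thresholds below are illustrative stand-ins for values that governance teams would actually define.

```python
def lifecycle_action(age_days: int, classification: str,
                     last_access_days: int) -> str:
    """Decide a storage action from age, classification, and usage.
    Thresholds and classification names are illustrative only."""
    if classification == "regulated":
        return "immutable"        # retain; block user alteration/deletion
    if age_days > 365 * 7:
        return "delete"           # assumed 7-year retention window passed
    if last_access_days > 90:
        return "cold-storage"     # rarely used; tier down to save cost
    return "hot"
```

Cloud storage lifecycle rules express the same logic declaratively, but writing it out makes the governance decisions explicit and testable.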
The basics of data lake security build on tried-and-true fundamentals of IT security: visibility and control. However, as with any other IT infrastructure, those fundamentals can be undermined if we ignore the details of the specific implementation.
Data lakes become more complicated because our IT security teams need to work with data governance and data mining professionals more directly than in many other applications. However, when all interested parties can communicate effectively, governing policies can be implemented to automatically classify and secure the data within a data lake effectively.