Security Data Lakes Emerge to Address SIEM Limitations

Written By

Chad Kime

Sep 21, 2022

10 minute read

Every security team craves clear visibility into the endpoints, networks, containers, applications, and other resources of the organization. Tools such as endpoint detection and response (EDR) and extended detection and response (XDR) send an increasing number of alerts to provide that visibility.

Unfortunately, the high storage and processing fees for traditional security information and event management (SIEM) tools often cause security teams to limit the alerts and logs that they feed into the tool in order to control costs. This limitation on the alerts also limits the visibility for the security team and constrains the ability of modern artificial intelligence (AI) and machine learning (ML) tools to learn and recognize potentially malicious behavior.

To address that limitation, a new tool is emerging: security data lakes (SDLs), which might give security teams unfiltered visibility. But what are the tradeoffs of that approach?

What is SIEM?
What is a Security Data Lake?
Security Data Lakes vs. SIEM: Pros & Cons
SIEM Compatibility with Security Data Lakes
Features to Look for in a Security Data Lake Vendor
Security Data Lake Vendors
Choosing an SDL or SIEM Solution

What is SIEM?

Organizations acquire security information and event management tools to consolidate, manage, and provide analysis on security alerts. These alerts are generated by the resources of an enterprise such as computers, servers, network traffic, and cloud applications.

SIEM tools ingest the security alerts, process the different formats into a common framework, and analyze the data. The analysis attempts to determine a baseline for normal behavior, so malicious and anomalous behavior can be flagged and escalated for review. Advanced SIEMs also use AI or ML to perform triage and prioritize or rank the logs and alerts for human review.
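The baseline idea can be illustrated with a very small sketch. It assumes a simple statistical baseline (mean and standard deviation of hourly event counts) and a hypothetical threshold of three standard deviations; commercial SIEM products use their own, typically more sophisticated, models, and the sample numbers here are invented for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(baseline_counts, new_counts, threshold=3.0):
    """Return (index, count) pairs from new_counts that sit more than
    `threshold` standard deviations above the baseline mean."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # Flat baseline: treat anything above it as unusual.
        return [(i, c) for i, c in enumerate(new_counts) if c > mu]
    return [(i, c) for i, c in enumerate(new_counts)
            if (c - mu) / sigma > threshold]

# Hypothetical hourly failed-login counts for one host.
baseline = [4, 6, 5, 7, 5, 6, 4, 5]
today = [5, 6, 48, 7]

anomalies = flag_anomalies(baseline, today)
print(anomalies)
```

Only the spike of 48 failed logins stands far enough from the baseline to be escalated; the other hours fall within normal variation.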

What is a Security Data Lake?

The typical data lake serves as a repository for an organization and holds unstructured data regarding company products, financial data, customer data, supplier data, and marketing information. Data lakes can easily be expanded to encompass the security data on company resources and function as a security data lake.

Real value can be extracted from SDLs that not only take in security logs and alerts but also encompass related security information such as open-source intelligence (OSINT), external threat intelligence feeds, malware databases, IP reputation databases, operation logs, and even dark web activity.

Security Data Lakes vs. SIEM: Pros & Cons

In practice, SIEM and SDLs can overlap in capabilities depending on the tool and the organization. For the purpose of this comparison, we’ll focus on broad SIEM and SDL characteristics for the typical product in the respective category.

Storage

SDLs offer potential advantages over SIEM in storage pricing and time scope.

Many SIEM vendors charge by the amount of data processed and stored in their systems, and this charge can be quite high, especially when compared with the commodity storage prices of the cloud.

Data lakes store data at enormously reduced prices, with prices below $25 per month for the equivalent of 3TB of log data. SDLs that share storage with an organization’s general data lake strategy may benefit from additional bulk storage discounts with the data lake provider.
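To put that figure in perspective, the arithmetic below turns the quoted ceiling of $25 per month for 3TB into a per-terabyte rate and a multi-year retention cost. It is a rough sketch that assumes a fixed data volume and flat pricing; real cloud bills depend on storage tier, retrieval, and query charges that are not modeled here.

```python
SDL_MONTHLY_USD = 25.0  # upper bound quoted above
SDL_TB = 3.0            # volume that price covers

def sdl_cost_ceiling(tb_stored, months):
    """Upper-bound storage cost in USD, assuming the quoted rate scales linearly."""
    per_tb_month = SDL_MONTHLY_USD / SDL_TB
    return per_tb_month * tb_stored * months

per_tb_month = SDL_MONTHLY_USD / SDL_TB
three_year_cost = sdl_cost_ceiling(tb_stored=3.0, months=36)

print(f"Per TB per month: ${per_tb_month:.2f}")
print(f"3 TB kept for 36 months: ${three_year_cost:.2f}")
```

Under these assumptions, keeping 3TB of logs for three years stays at or below roughly $900 in raw storage.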

In the case of time scope, the typical SIEM will hold less than a year of logs and alert data, often a mere 90 days. While this time scope captures the short-term health of the organization, longer-term trends and patterns cannot be recognized. By comparison, security data lakes can scale easily and retain security data for years instead of days.

Between the enormous cost savings and the much larger time scope, SDLs hold a distinct strategic and financial advantage over SIEM solutions.

Data ingestion

When it comes to data ingestion, the advantage depends upon the implementation or the tools. Self-service SDLs will require much more work to achieve the capabilities of even basic SIEM or SDL tools, so this aspect must be considered prior to executing an SDL strategy.

Logs and alerts arrive in many different file types such as JSON, XML, PCAP, and Syslog. A SIEM may have restrictions on the type of data it can ingest, but it will process compatible security event logs directly and normalize them for efficient and automatic processing.
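As an illustration of what normalization involves, the sketch below maps a JSON record, an XML record, and a simplified syslog line onto one common set of fields. The field names (timestamp, host, source, message), the input layouts, and the regular expression are assumptions made for this example; real tools rely on much richer parsers and schemas.

```python
import json
import re
import xml.etree.ElementTree as ET

# Simplified BSD-syslog-style line: "<PRI>Mon DD HH:MM:SS host app: message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s(?P<app>[^:]+):\s(?P<msg>.*)$"
)

def from_json(line):
    raw = json.loads(line)
    return {"timestamp": raw.get("time"), "host": raw.get("hostname"),
            "source": "json", "message": raw.get("msg")}

def from_xml(text):
    root = ET.fromstring(text)
    return {"timestamp": root.findtext("time"), "host": root.findtext("hostname"),
            "source": "xml", "message": root.findtext("msg")}

def from_syslog(line):
    match = SYSLOG_RE.match(line)
    if match is None:
        raise ValueError("unrecognized syslog line")
    return {"timestamp": match["ts"], "host": match["host"],
            "source": "syslog", "message": match["msg"]}

# Hypothetical records describing the same kind of event in three formats.
events = [
    from_json('{"time": "2022-09-21T10:00:00Z", "hostname": "web01", '
              '"msg": "Failed password for root"}'),
    from_xml("<event><time>2022-09-21T10:00:05Z</time>"
             "<hostname>web01</hostname>"
             "<msg>Failed password for root</msg></event>"),
    from_syslog("<34>Sep 21 10:00:09 web01 sshd: Failed password for root"),
]

for event in events:
    print(event)
```

Once every record shares the same keys, a single search or detection rule can run across all three sources. Note that the timestamps are still in different formats here, which a fuller pipeline would also need to reconcile.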

SDLs have no limit to the type of data that can be loaded and will accept all file types. Any SDL can load logs and information unrelated to security events such as access records, threat intelligence feeds, and performance logs.

However, the value of this additional data will be limited if the SDL cannot process the data for search and analysis. Some SIEM solutions can interface with SDLs to process data, and some SDL vendors process and normalize data.

The advantage in this category is capability-dependent in that there is no overall advantage between SIEM and SDLs. Security teams will need to perform tests to verify that their key data can be processed for adequate analysis and searches.

Additionally, teams used to a SIEM feed environment optimized for limited data processing may have disabled some log generation options in the past. Security teams should check whether they need to re-enable any of that log generation so the SDL advantage can be realized.

Infrastructure

Before anyone can hunt threats or analyze information, the infrastructure needs to be established, secured, and then maintained. In general, there is no inherent advantage to SDLs over SIEM tools, but the scale of the transition and shifting of resource management responsibilities will be critical to address before adopting an SDL strategy.

For those teams where the organization has embraced a company-wide data lake strategy, the security team can add their data to that existing data lake. These teams might even offload the infrastructure burden to those managing the overall data lake strategy and who might have superior technical or data scientist expertise with data lakes.

Some SIEM and SDL tools build in assistance to manage data feed connections, data processing, analysis, queries, and storage. Others do not and will require the security team to deploy programming skills and data science expertise.

Every security team using a self-hosted SIEM understands how to connect it to their data feeds, secure the SIEM infrastructure, and properly host data in the tool. Switching to an SDL will require the security team to learn or build these skills from scratch unless the SDL tool makes connections easy.

While SIEM solutions might have an edge in infrastructure support capabilities simply due to familiarity, they hold no inherent advantage over SDLs in the long run for infrastructure management. An organization determined to implement an SDL but lacking the infrastructure management capabilities will simply need to find a more full-service SDL solution.

Threat hunting

The key advantage of SDL technology over SIEM technology relates to threat hunting. SDLs can store more data, host that data for longer, ingest many more data types, and use all of this additional data for threat hunting or to train AI and ML algorithms.

SIEM tools skillfully parse alerts and can flag specific events for further investigation, but threat hunting must then typically be performed outside of the tool. SDLs hold the contextual information and provide the data query interface that help a threat hunter investigate key alerts and understand the context around them.

However, this advantage relies upon several key assumptions:

The data is ingested and processed correctly.
The team has the personnel resources for investigation.
SDL AI or ML algorithms are at least as good as the SIEM AI or ML algorithm.

The narrow focus and alerts generated by SIEMs often frustrate security teams because they lack the context of the organization, the user, and similar information. SDLs will not natively improve context, but proper data ingestion and feature-rich SDL tools can be applied to add data enrichment and context during ingestion.

For example, a malicious action on an IP address may be difficult to track down in a Wi-Fi environment where IP addresses are reused and assigned dynamically. With a strong processing methodology, the log file of that IP address can be associated with users, mapped to hosts, connected to geolocation data, and more.
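A minimal sketch of that kind of enrichment follows. It assumes the SDL also holds DHCP lease records and that each lease carries a user and hostname; the record layout, names, and sample values are hypothetical and only meant to show how a timestamp resolves a reused IP address to the right person.

```python
from datetime import datetime

# Hypothetical DHCP lease records: (ip, lease_start, lease_end, user, hostname)
LEASES = [
    ("10.0.5.23", datetime(2022, 9, 21, 8, 0), datetime(2022, 9, 21, 12, 0),
     "alice", "laptop-alice"),
    ("10.0.5.23", datetime(2022, 9, 21, 12, 0), datetime(2022, 9, 21, 18, 0),
     "bob", "laptop-bob"),
]

def enrich(event, leases=LEASES):
    """Return a copy of the event tagged with whoever held the IP at event time."""
    enriched = dict(event, user=None, hostname=None)
    for ip, start, end, user, hostname in leases:
        if ip == event["ip"] and start <= event["timestamp"] < end:
            enriched["user"] = user
            enriched["hostname"] = hostname
            break
    return enriched

# The same IP was held by two different people on the same day.
event = {"ip": "10.0.5.23",
         "timestamp": datetime(2022, 9, 21, 13, 30),
         "action": "port_scan"}

result = enrich(event)
print(result["user"], result["hostname"])
```

Without the lease history, the event would only point to an address that two laptops shared that day. Geolocation or asset inventory data could be joined in the same way.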

Alerts

As for alerts, SIEM tools process specific security data and provide standardized reports and alerts based upon that data, but some teams cannot keep up with the volume of alerts. However, teams suffering from alert fatigue may not find any relief in a switch to an SDL.

Evaluators need to run tests on the tools alongside the engineers using them to ensure any additional tools help the team instead of burdening them. Some tools claim that more efficient searches on the broader SDL dataset can dramatically reduce investigation time, but security teams need to verify those results for themselves before they find themselves with even more alerts and more data to deal with.

As for AI and ML algorithms, theoretically a limited dataset runs the risk of biasing the algorithm and preventing proper algorithm training. The unfiltered dataset of the SDL offers the possibility of a more robust training of AI and ML models to detect threats and anomalies.

However, despite this theoretical advantage of the SDL, different tools use different AI/ML algorithms, and security managers may need to work with data scientists to ensure the organization selects a vendor with adequate AI/ML algorithms. Most algorithms tend to operate as black boxes, so it may take significant testing time to verify AI/ML performance.

SIEM Compatibility with Security Data Lakes

Just because an SDL presents significant advantages for data storage and threat hunting does not mean organizations should abandon a quality SIEM. Many SIEM solutions now integrate with SDLs to try and deliver the best of both worlds.

The SIEMs can continue to analyze a limited set of key logs to provide meaningful security alerts, and security teams can return to the SDL to investigate those alerts within the context provided by the SDL. SIEMs possess much more experience in catering to the needs of security teams, and features such as alerts, dashboards, and ticketing will be very difficult to build from scratch.
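The sketch below illustrates that division of labor under simple assumptions: the SIEM raises an alert, and an analyst pivots into the SDL to pull months of history for the account involved. An in-memory SQLite table stands in for the data lake, and the table layout, column names, and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite table standing in for long-term SDL storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, host TEXT, username TEXT, action TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2022-03-02T09:15:00", "web01", "svc_backup", "login_success"),
        ("2022-09-20T23:50:00", "web01", "svc_backup", "login_failure"),
        ("2022-09-21T00:05:00", "web01", "svc_backup", "privilege_change"),
        ("2022-09-21T00:06:00", "db02", "alice", "login_success"),
    ],
)

def context_for_alert(conn, alert, since):
    """Fetch every stored event for the alerted account from `since` onward."""
    cursor = conn.execute(
        "SELECT ts, host, username, action FROM events "
        "WHERE username = ? AND ts >= ? ORDER BY ts",
        (alert["username"], since),
    )
    return cursor.fetchall()

# Hypothetical alert handed over by the SIEM.
alert = {"username": "svc_backup", "host": "web01", "rule": "privilege_change"}

history = context_for_alert(conn, alert, since="2022-01-01T00:00:00")
for row in history:
    print(row)
```

A SIEM holding only 90 days of data would not show the March login, which is the kind of longer-term context the SDL adds to the investigation.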

Security teams considering the development of an SDL strategy can easily look into incorporating existing or similar SIEM tools to minimize disruption of their current threat hunting processes. Security teams will need time to learn SDL functions, and integrating a SIEM tool can prevent a drop in threat hunting capabilities during the SDL training.

Features to Look for in a Security Data Lake Vendor

The capabilities and focus of a specific SDL tool will vary from vendor to vendor. Just as with SIEM vendors, SDL vendors focus on different types of customers and offer a spectrum of capabilities from full-service to self-service data analysis and infrastructure control.

However, across all potential SDL solutions, four key capabilities should be present:

Automated Collection and Parsing: Enterprises may receive billions of security-related logs and other data feed items per day. An effective SDL must be able to automatically ingest the data, convert it into a usable format, and parse the data for analysis. Some vendor tools might only connect to a limited number of feeds, so evaluators must explore if the tool’s API (application programming interface) and feed processes will be sufficient.
Security Context & IP Mapping: Event logs may be associated with specific IP addresses that are reassigned regularly. To be useful for security analysis, security information needs to be mapped or connected to relevant associated information such as hostnames, MAC addresses, user IDs, etc.
Simplified Analysis and Reporting Interface: Security investigators need to be experts in security, not programming. Instead of requiring them to learn an analytical language such as R, the SDL should provide a simple interface that facilitates analysis and reporting with minimal required programming.
Scalable Architecture: The point of the SDL is to house as much data as possible, so the SDL tools should be able to scale with the ever-increasing size of the security data.

Security Data Lake Vendors

The vendors below represent prominent SDL vendors from different categories. This list is not comprehensive and other vendors and capabilities are sure to be added in the near future.

Elysium

Elysium runs on Snowflake as an add-on software-as-a-service (SaaS) application to improve analysis of security feeds through ML, graphical representations, and other features. This tool will be favored by those seeking a full-service SDL experience through Snowflake.

Exabeam

SIEM expert Exabeam expanded its Log Manager solution to become the Exabeam Data Lake product. It integrates with other Exabeam products such as Cloud Connectors, Advanced Analytics, and the Security Intelligence Platform to combine SDL and SIEM capabilities. This tool will be favored by those seeking a full-service and segregated SDL experience.

Gurucul Security Data Lake

Gurucul focuses on log file and alert analytics. While customers can point Gurucul at other data repositories, Gurucul encourages the use of SDLs and even offers a free SDL with its products. This tool will be favored by customers seeking a self-service SDL experience.

Panther Security Data Lake

Panther provides an SDL enablement tool to collect security logs and parse, normalize, and analyze data with 200+ customizable Python detections. Panther can be deployed on AWS or Snowflake, and it automatically flags suspicious events and retains data in a customer-hosted data lake. This service will appeal to both full-service and self-service SDL customers.

Snowflake

As a leader in data lake hosting and analysis, Snowflake also offers its own solutions to explore cybersecurity data using the Snowflake tool. This tool will be favored by customers seeking a full-service SDL experience.

Varada

Varada bolts onto an existing data lake or other virtual private cloud solution to accelerate searches for security analytics. Varada estimates that 90% of compute resources are wasted on scanning data for searches, which it tries to eliminate with more effective searches and data caching to run searches as much as 100x faster and up to 60% cheaper. This tool will be favored by customers seeking a self-service SDL experience.

Choosing an SDL or SIEM Solution

While some vendors claim that security data lakes will replace SIEM solutions, not all SDL solutions match the features and alerting capabilities of all SIEM tools. Organizations considering SDLs need to verify their SDL capabilities and may even decide to integrate with a SIEM.

SDLs remain a technology in development. Tools supporting SDLs can have narrow capabilities, and the quality of those tools also varies, which makes evaluation challenging.

As with any IT or security product, customers must separate their analysis from the hype and really understand their own capabilities and what they want from their security solution. With that understanding, evaluation of specific technologies and candidate vendors becomes much more focused and easier to perform.

Chad Kime

eSecurity Planet lead writer Chad Kime covers a variety of security, compliance, and risk topics. Before joining the site, Chad studied electrical engineering at UCLA, earned an MBA from USC, managed 200+ ediscovery cases, and helped market a number of IT and cybersecurity products, then transitioned into technical writing of policies and penetration test reports for MSPs and MSSPs.
