As the CEO of a data analytics firm, I’m often asked what the best strategy is for storing large amounts of data that will be analyzed at a later date. The purpose of this article is to define the data lake and the data warehouse and to explain the benefits of each.

What is a data lake?

A system designed as a central repository for all types of data, both structured and unstructured.

What is a data warehouse?

A system designed to store structured data used for reporting and data analytics.

While both a data lake and a data warehouse store data and enable in-depth analytics, customers should evaluate four factors when deciding whether to implement a data lake or a data warehouse.

Type of Data

Depending on who you talk to, there are numerous types of data: big data, machine data, real-time data, and so on. For the purposes of this article, I am going to organize data into two categories: structured and unstructured. Structured data fits into a tabular format, with defined relationships between the rows and the columns.

Unstructured data is everything else, such as videos, pictures, and emails. Data warehouses are a better fit for structured data, whereas data lakes retain both structured and unstructured data.
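To make the distinction concrete, here is a minimal Python sketch. The sample order records, the email text, and the helper name has_tabular_shape are all hypothetical and exist only for illustration. It shows that structured data can be queried directly because its columns are known, while unstructured data has no such shape.

```python
import csv
import io

# Structured data: every row has the same named columns (hypothetical sample values).
structured_csv = "order_id,customer,amount\n1,Acme,250.00\n2,Globex,99.50\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Because the columns are known in advance, "total amount" is a one-line question.
total_amount = sum(float(row["amount"]) for row in rows)

# Unstructured data: free text (or image/video bytes) with no rows or columns.
unstructured_email = "Hi team, the Acme order shipped late again. Can we talk Monday?"


def has_tabular_shape(records):
    """Return True if every record exposes the same set of column names."""
    if not records:
        return False
    columns = set(records[0].keys())
    return all(set(record.keys()) == columns for record in records)


print(total_amount)
print(has_tabular_shape(rows))
```

The email string could still be stored and analyzed later, but nothing about it tells a query engine where the "customer" or the "amount" lives, which is why it fits a data lake more naturally than a data warehouse.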

Business Objective

When assessing a client’s data needs, I find that clients typically fall into one of two groups. In Group A, the client has data from multiple disparate sources and wants to consolidate it all in one location. They might have an idea of what they want to do with the data, but most importantly they want the data to stay flexible and scalable, depending on what they may want to analyze in the future. In Group B, the client knows exactly what they want to analyze, and they need that analysis to be very efficient so it can support specific business decisions. These clients tend to be willing to spend the time cleaning the data and putting it into a uniform schema to enable faster queries.

The client in Group A would opt to create a data lake. Data lakes are cost-effective for storing sizable amounts of data from many sources. Data lakes are flexible and do not rely on cleaning the data or segmenting it into a schema. The client in Group B would opt to build a data warehouse. A data warehouse is much more efficient for analyzing historical data for specific decisions.
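The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, the field names, and the to_sale_row helper are hypothetical, a plain list stands in for the data lake, and an in-memory SQLite table stands in for the data warehouse.

```python
import json
import sqlite3

# Hypothetical raw events from two sources that do not share a shape.
raw_events = [
    '{"source": "web", "user": "ann", "amount": 20.0}',
    '{"source": "pos", "customer_name": "bob", "total": "35.5", "store": 7}',
]

# Lake-style (Group A): keep every record exactly as it arrived; no schema is enforced.
lake = list(raw_events)


def to_sale_row(raw):
    """Clean one raw event into the uniform (source, customer, amount) schema."""
    event = json.loads(raw)
    customer = event.get("user") or event.get("customer_name")
    amount = float(event.get("amount", event.get("total", 0)))
    return (event["source"], customer, amount)


# Warehouse-style (Group B): clean the data first, then load it into a fixed schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (source TEXT, customer TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [to_sale_row(raw) for raw in raw_events],
)

# With a uniform schema, the specific business question is a simple query.
total_sales = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# The lake still holds fields the warehouse schema dropped (here, "store"),
# so a future analysis can parse them when it needs them.
stores_seen = [json.loads(raw).get("store") for raw in lake]

print(total_sales)
print(stores_seen)
```

The trade-off shows up in where the work happens: the warehouse pays the cleaning cost up front and gets fast, simple queries, while the lake defers that cost and keeps every original field available for questions nobody has asked yet.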

Data Users

Typically,

Size of the Data

Data lakes were designed as an efficient way to store massive amounts of data and are comparatively cheaper than data warehouses. They are cheaper in two ways. First, the actual cost of storing data in a data lake is lower. Second, a data lake is built to collect all types of data, so it does not require the time spent sorting and segmenting data that a data warehouse does.

Comparison chart of a data lake versus a data warehouse:

Conclusion

Remember, when deciding whether to implement a data lake or a data warehouse, evaluate the type of data, the data users, the size of the data, and the business objective. At times it might make sense to use both, depending on the business objective you are trying to achieve. So, if you’re in the evaluation process, the