Over the last few years, we have seen data grow faster than ever before. Many organizations see an opportunity in this big data and are developing strategies to monetize it. But the major challenge is “Where to store all the data?”
We have data warehouses that store data according to the prescribed standards of the organization. That means incoming data may be held back while various cleaning and smoothing operations are performed, and only then is it stored in the data warehouse. This raises a concern: resources are spent processing data that may not be needed frequently, if at all.
This is where the “Data Lake” comes in.
A Data Lake is a very large data repository where data is stored in its native form. It acts as a centralized repository where data coming from different sources is stored raw, without any cleaning or transformation.
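To make the idea of storing data in its raw form concrete, here is a minimal sketch in Python. It uses a local folder as a stand-in for the lake, and the `land_raw` helper, the `raw/<source>/<date>/` folder layout, and the sample files are all hypothetical names chosen for illustration, not part of any particular data lake product. The point is only that the bytes are written exactly as they arrive, messy values included.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes, day: date) -> Path:
    """Write payload unchanged under <lake_root>/raw/<source>/<YYYY-MM-DD>/<filename>."""
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, cleaning, or schema enforcement
    return target


# A temporary folder stands in for the lake (assumption for this sketch).
lake_root = Path(tempfile.mkdtemp())
day = date(2020, 1, 15)

# Structured, semi-structured and unstructured data land side by side.
csv_path = land_raw(lake_root, "crm", "customers.csv", b"id,name\n1, Alice \n2,\n", day)
json_path = land_raw(lake_root, "web", "events.json", b'{"user": 1, "action": "click"}\n', day)
text_path = land_raw(lake_root, "support", "ticket_42.txt", b"Customer reports a login problem.\n", day)

# List everything that landed, relative to the lake root.
landed = sorted(p.relative_to(lake_root).as_posix() for p in lake_root.rglob("*") if p.is_file())
print(landed)
```

Notice that the CSV keeps its stray spaces and its empty name field. A warehouse load would typically fix or reject such rows before storing them, whereas here any cleaning is deferred until someone actually reads the data.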
So why should one opt for a Data Lake?
Over the past two years, massive amounts of data have been generated, and organizations need a way to handle this explosion. The Data Lake is often compared with the Data Warehouse, but a Data Warehouse consists of several components and stores data according to standards that are enforced by its data transformation processes. A Data Lake, by contrast, can be thought of as a system that comes before data warehousing.
The term “Data Lake” was coined by James Dixon, CTO of Pentaho, in 2010 to contrast it with the “Data Mart” or “Data Warehouse”, a smaller repository of refined data extracted from the raw data.
He explained: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
The Data Lake is not a replacement for the Data Warehouse. In fact, if designed right, it can complement your existing Data Warehouse, and the two can work together effectively. The best part of this integration is that data of all formats (structured, semi-structured and unstructured) can be stored in one place.
(Image Source: www.solutionsreview.com)