Data Backup Digest


What is a Data Lake?

With the widespread onset of big data, mainstream computer users and IT experts alike are suddenly learning a whole new language of jargon, arcane phrases and muddled concepts. Just in case your ad hoc vocabulary wasn't large enough already, you'll now want to familiarize yourself with the idea of the data lake.

To some, the data lake is pretty self-explanatory. At its most fundamental, a data lake is nothing more than a large repository for raw data. The term is most aptly used to describe vast amounts of raw data stored on a non-hierarchical platform: instead of folders and files, a data lake uses unique identifiers and metadata tags to facilitate data queries and analysis as called upon.
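As a rough sketch of that idea, the toy store below keeps raw payloads in a flat space, indexed only by a unique identifier and a set of metadata tags; lookups work by tag rather than by folder path. (All names and tag formats here are invented for illustration; real data lakes sit on object stores such as HDFS or Amazon S3.)

```python
# Toy model of flat, metadata-tagged storage: no hierarchy, just IDs and tags.
lake = {}  # unique identifier -> (metadata tags, raw payload)

def put(object_id, tags, raw_bytes):
    """Store raw data untouched, indexed only by its ID and metadata tags."""
    lake[object_id] = (set(tags), raw_bytes)

def query(tag):
    """Find objects by metadata tag rather than by folder location."""
    return [oid for oid, (tags, _) in lake.items() if tag in tags]

put("log-001", {"source:web", "year:2014"}, b"raw clickstream ...")
put("img-042", {"source:camera", "year:2014"}, b"raw image bytes ...")

print(query("year:2014"))  # both objects match, despite their differing formats
```

Note that the payloads themselves are never parsed or validated on the way in; structure is only imposed later, by whoever queries the data.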

Even more specifically, the term data lake is typically associated with the Hadoop object storage platform. In this application, data is collected, mined and analyzed from the Hadoop framework itself. While some see the term "data lake" as a marketing ploy to increase consumer interest and familiarity with Hadoop, the phrase is increasingly being applied to other systems.

James Dixon, founder and CTO of Pentaho, has been credited with originally coining the phrase. He explained the concept of a data lake by saying: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

Key Differences

Although many see data lakes and data warehouses as one and the same, they are actually quite different. Whereas a data warehouse will store only data that has been specifically modeled or structured to a certain format, a data lake can accommodate any dataset.

This also means that data warehouses and data lakes process information in vastly different ways. Because data must be properly structured before it can be entered into a data warehouse, a certain amount of pre-processing is required up front. Data lakes, by contrast, defer that work: the raw data is stored as-is and structured only when it's called upon, an approach often described as schema-on-read.
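The contrast can be sketched in a few lines. In this hypothetical example, the warehouse-style path rejects anything that doesn't conform to its schema at load time, while the lake-style path stores raw records of mixed formats and only parses them when a query runs. (The record formats and function names are invented for illustration.)

```python
import csv
import io
import json

# Schema-on-write (warehouse-style): data must fit the schema before loading.
def load_into_warehouse(row):
    """Raises KeyError/ValueError at load time if the row doesn't conform."""
    return {"user": str(row["user"]), "amount": float(row["amount"])}

# Schema-on-read (lake-style): store raw text now, parse at query time.
raw_records = [
    '{"user": "ann", "amount": "9.50"}',  # a JSON record
    "bob,12.00",                          # a CSV record, stored untouched
]

def total_amounts(records):
    """Impose structure only while answering the query."""
    total = 0.0
    for rec in records:
        if rec.lstrip().startswith("{"):
            total += float(json.loads(rec)["amount"])
        else:
            _user, amount = next(csv.reader(io.StringIO(rec)))
            total += float(amount)
    return total

print(total_amounts(raw_records))  # 21.5
```

The trade-off is where the cost lands: the warehouse pays it once at ingest, while the lake pays it on every read but never has to turn data away.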

Data lakes can also help reduce an organization's overall IT costs. Because Hadoop is an open source framework, both the software license and community-driven technical support come at no cost to you.

The Purpose of the Data Lake

Ultimately, the data lake is an attempt to solve two primary issues. The first is that of data silos, which essentially serve as isolated, individual data repositories. These can be centralized and emptied into larger data lakes, thereby reducing overall costs and improving accessibility across the board.

The second issue involves big data, or, specifically, big data analytics that factor in various types of datasets from multiple sources. Because this data cannot be properly modeled into a standard format while it is incoming, the data lake approach is highly beneficial.

