Making sense of a data lake, delta lake, lakehouse, data warehouse and more

Sometimes when I look at all the definitions of data lakes that are pushed by so many different vendors, I think of Eminem's "The Real Slim Shady" and wonder: will the real data lake please stand up? Part of the problem is that a data lake is like a hairdo: it's constantly changing. Every few years I've had to go back and revisit my own definition of a data lake as we've learned more about what a data lake should be, and I've had to be "re-corrected" a few times.

This last round I asked James Dixon, who first defined the term while he was at Pentaho. For the record, James and I worked together back in the late 90s, when he helped create a really good ad hoc analytics tool called Wired for OLAP. James and I were teenagers at the time, of course. James' original definition was: "A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data."

I went back to James because I wanted to get his perspective on whether it was OK for a data lake to be coupled to a specific compute engine. For example, back when data lakes started to exist in some form, several companies implemented one using Hadoop. This meant the way you accessed the data in HDFS was typically Spark and/or Hive. I was OK accepting the reality of this coupling because that's what customers did, and the customer is always right.

James, which may not be too surprising, has remained true to his original definition, and for a few good reasons. A data lake should just be about storage. I realized this when I saw coupled storage and compute crippling some data pipelines and analytics architectures. It's really important for most companies to understand what a data lake should be, because many different vendors are selling different versions of "lakes" based on their role along the data pipeline. For example, if you are doing ETL, you need something like a "delta lake" to hold the raw data and intermediate data you create during Spark-based ETL and other data processing. If you want to combine your data lake and data warehouse, you might call it a "lakehouse." All of these definitions of a data lake that go beyond just "raw storage" cause problems.

The purpose of a data lake

After reliving the history, and the lessons learned, I came back to the original definition and goals of a data lake. It should be a storage area for raw data that makes any data readily available to anyone to use when they need it. That data should be:

- Raw: preserve as much detail as possible.
- Secure: adhere to internal and regulatory data and security requirements.
- Readily usable: be data that is cleansed and consistent across data sources.
- Easily accessible: support data engineers' and analysts' tools of choice.

You don't need to, and really you can't, implement all of this at once. But if your data lake does not satisfy all these requirements, you should ask yourself why first, and then decide when you do need to implement these parts.
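The "raw" requirement can be made concrete with a minimal sketch of a landing zone: each record is written in its native format, untouched, into a partitioned path. The lake root, the `raw/<source>/dt=<date>` layout, and the `land_raw_record` helper are all illustrative assumptions for this sketch, not a prescribed standard (a real lake would typically use object storage such as S3 rather than a local filesystem).

```python
import json
import pathlib
import tempfile
from datetime import date

def land_raw_record(lake_root: pathlib.Path, source: str, record: dict) -> pathlib.Path:
    """Land one record as-is under <root>/raw/<source>/dt=<YYYY-MM-DD>/.

    Hypothetical helper: no cleansing, no schema enforcement -- the point
    of the raw zone is to preserve as much detail as possible.
    """
    partition = lake_root / "raw" / source / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    # Simple monotonic file name within the partition for this sketch.
    out = partition / f"{len(list(partition.iterdir()))}.json"
    out.write_text(json.dumps(record))
    return out

# Stand-in for an object store bucket.
lake = pathlib.Path(tempfile.mkdtemp())
p = land_raw_record(lake, "orders", {"id": 1, "amount": 9.99})
print(p)
```

Cleansing ("readily usable") and access control ("secure") would then be layered on top of this raw zone, rather than mutating it, so the original detail is never lost.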