Big data and Analytics are always being a hot topic for the last couple of years. There is an enormous data is being created at each moment by every person. And these data have been used for making better solutions and opportunities in various sectors like AI, BI, Healthcare etc. At the same time, This is creating a lot of problems in terms of storing and processing these data. Years ago, Gartner estimated that 60% of big data projects are failing and recent study shows this is even worse. Choosing the right tool is the key element of success in Big Data analytics. The intention of this blog is to understand the differences between data lakes and data warehouses which are used to store and manage the big data efficiently.
“Information is the oil of the 21st century, and analytics is the combustion engine”
-Peter Sondergaard, Senior Vice President, Gartner
Data Lake
A data lake is a storage system, which is intended to store a large volume of data in its raw format, usually, object blobs or files. The intended purpose of storing these data are usually unknown at the time of storing the data. The sources of data in the data lake can be from a variety of places like social media posts, news, POS terminals, blogs, sensors etc. The forms of data can be structured data from relational databases (tables, rows and columns), partially structured (CSV, logs, XML, JSON), unstructured (documents, PDFs, emails) and binary (images, video, audio). As the data is stored on its natural format and there is no specific business logic requirement is on these data, there won’t be any specific schema on these data at the time of writing. Instead, it is having terminology called ‘Read on Schema’, This means the schema is only created when reading the data based on the specific purpose of business logic.
The main characteristics of Data Lakes are:
- Centralization – Centralization allows easier to manage, as well as making it easier to innovate non- disruptively around heterogeneous data sets.
- Scalability and easy access – As the emerging of cloud storage spaces like Google, Azure and AWS s3
- It is easier to store the data at any scale at low cost and high availability.
- Locate, Curate, and Secure Data – It is easier to keep track of what data you have, who has access to it, what type of data you are storing, and what it’s being used for in a centralized data lake.
- Increased flexibility – Having a centrally curated data lake allows your business to innovate in new ways of processing data. You can introduce new use cases without remodeling your architecture.
- The main criticism for the data lake is, it is often leading to a data swamp. Data swamp is a collection of data that is not useful or very difficult to analyze
Popular Data Lake examples are:
- Cloud storage services such as Google Cloud Storage, Azure and Amazon S3
- Distributed file system such as Apache Hadoop.
Data Warehouse
A data warehouse is also a storage system, which is intend to store large volume of data, but in a structured or semi structured format. The intended purpose of storing these data is usually known at the time of storing the data. The sources of data in data warehouse can also be from variety of places like CSV, TSV, Flat files, Relational databases etc. A data warehouse is a repository of many kinds of data and is highly modelled. Data you find in a data warehouse is carefully related to all of the other data in the data warehouse. In addition, data in a warehouse happens to be standardized and highly cleansed.
The main functions which are needed to perform are:
Data Extraction
This is the first step in ETL. In this step, Parse and extract the data from various source systems, which can be like JSON, relational databases, CSV, TSV, XML and flat files into the staging area. This step also includes the validation of data.
Data Transformation
Usually the extracted raw data from the first step is not in a usable format. In this step, some business logic is applied on the extracted data and converted into desirable format. In fact, this is the pivotal point where ETL process generates value and transforms data so that insightful BI reports can be generated. This step can also contain Filtering, Cleaning, Sorting of data.
Data Loading
The third and final step of the ETL process is loading. In this step, the cleaned and transformed data are loaded into the data warehouse. The loading process should be calibrated for performance as it needs to be done in small interval of time. Also, there should be recovery mechanism in load process to keep the integrity of the data
A data warehouse may contain multiple databases called data marts. Within each data marts, data is organized into tables and columns. These data are targeted for specific business purpose. These data will be well fitted in intended schema. This schema is used by Query tools to determine which data tables to be accessed and analyzed.
Popular Data Warehouse tools are:
- AWS Redshift
- Snowflake
- Yellowbrick
- Teradata
Key differences between Data lake and Data warehouse
Data Storage
In the data lake, all the data is kept in its native form. It will be transformed to a schema when reading occur. (The purpose is unknown until the read) In data ware house, the data will be kept as well-structured schema according to the purposes.
Data Versatility
In data lake, capture all kind of data semi-structured and unstructured in their original form from source systems like social media posts, news, POS terminals, blogs, sensors. In data ware house, capture structured information and organizes them in schemas as defined for data warehouse purposes. source of data are POS terminals, Flat files, Relational database etc.
Data Usability
Data lakes can store all the data as it receives, no cleaning and discarding the data is there. Also, it stores the important and non-important data. This helps the data analyst to go back in time and do analysis on whatever he wants. However, this may lead to data swamping phenomena. In the data warehouse, ETL process makes only relevant to particular business purpose data will keep and all others will be discarded.
Data Consumers
Data Consumers of data lakes are including mainly data scientist, machine learning, data scientists who need advanced analytical tools with capabilities such as predictive modelling and statistical analysis. The data warehouse is mainly used for BI reports, Data visualization, Marketing research etc.
History
Data lakes is relatively new technology. The data warehouse concept, unlike data lakes, has been used for a while in data analysis.
Storage Costs
Data storage in big data technologies is comparatively less expensive than storing data in a traditional data warehouse. Data warehouse is expensive and eats up a lot of time.
Task
Data lakes are not limited to any specific task. In fact, its purposes are unknown until it reads. So, thedata is not modelled to any specific kind of tasks. This gives data analyst lot of flexibility and liberty to do different kind of task on the data. On the other hand, data warehouses is limited predefined tasks.
Processing time
Data lakes have advantage on processing time compared to traditional data warehouse as it doesn’t need to do extract, validate, cleaned and transformed. This makes it getting the results fast. Data warehouses need to do lot of extract, validate, cleaning and transformation operation before it loads to data warehouses. Thus, it will be time consuming and there won’t be any real-time results.
Flexibility
Data is kept in raw form. It is only transformed when it is reading the data. This makes them adding new type of data and business logic is less expensive. On other hand Data warehouses offer insights into pre-defined business logic and it bounds to pre- defined data types. Making any change or adding a new business logic is time consuming and expensive operation
Applications of Data lake and Data warehouses
Data Science
Data science is a field of science that uses scientific methods to turn data into values. The methods include various algorithms and systems to extract knowledge from many structural and unstructured data. Data science deals with machine learning, deep learning, data mining, big data etc. Data scientists are heavily dependent on data lakes or data warehouses or a combination of both. which helps them to store the data from various sources and get insights.
In modern world the importance of data science is unlimited and still exploring. Some of the use cases are
- Pattern matching
- Anomaly detection
- Automation and decision-making,
- Classifications
- Healthcare
- Self-driving car
BI (Business intelligence) and Marketing research
Business intelligence and market research are both integral parts of developing a solid marketing strategy. BI usually used for getting better business decision by analysing the past and current data. It collects, clean, and enrich structured or semi-structured data and then stores them in different. SQL databases. Data warehouse stores data from different sources. The main purposes of BI technologies involve analytics, reporting, dashboard development, online analytical processing, data mining, complex event conversion, business performance management, benchmarking, predictive analytics, text mining, and prescriptive analytics. Marketing researchers usually collect the data from various POS terminals or by surveying the customers directly or by surveying through online application. The volume of data from POS terminals and surveys can be very huge. Usually, the ETL process of these data is very complex and time consuming. This needs to be cleaned, validated and apply some business logics before loading to Data ware houses. The data will be again categorised to by any specific criteria and loaded to a different database or data mart in order to get fast easy access. Marketing researchers usually do various analysis based on these data to get insights and making better decision for business solutions.
Conclusion
Data lakes are more useful when there are more data to store with no particular type of structures are defined. And need not do analyze all of it right away. The important characteristic of the data lake is its flexibility over data warehouse. There won’t be any ETL processing part in the data lake, which is intended to be for specific business logic. On the other hand, in traditional data warehouses, it processes and transforms the data for advanced analytics and querying in a highly structured database ecosystem. It gives you more accurate and specific functionality on BI and Reporting solutions. Although the architecture and capability of both have some similarities, they never meant a direct replacement for each other. they act as a co-existing technology that serves different use cases with some overlap. Data lake solutions are generally considered complementary to data warehouses. Nowadays, most of organizations are keeping data lake to support data warehouses. However, as the data volume growing, cloud data warehouses and data lakes are becoming the preferred solution over traditional data warehouses. The modern cloud technologies provide solutions like scaling, data security, monitoring, reliability and maintenance at low cost.