In this article, we explain what a modern data architecture is, what its layers and components are, and analyze the pros and cons of both the data warehouse and the data lake.
What is a data architecture?
Data architecture is a combination of technologies that allows an organization to address its information needs. For example: how much was sold, how many customers were gained or are at risk of being lost, what is the level of product stock, etc. It provides all the data that the business needs to make data-driven decisions.
Data architecture is the technological structure behind a data solution, whether it be dashboards, reports, alerting systems, artificial intelligence models, etc. The most important aspect is that it can ensure a comprehensive view of the business.
Components and Benefits of a Data Architecture
A data architecture has four main components:
Components of a data architecture
1) Data Sources: Information sources (or origin data) encompass everything that can store or generate data. They come from different places and can be structured (spreadsheets, CRM, sales, databases) and/or unstructured (social networks, emails, videos, images, third-party APIs, etc.).
2) Data Engineering Layer: Data engineering is the discipline responsible for obtaining, cleaning, integrating, and enriching data to enable any type of subsequent analysis. This stage, also known as the ETL (Extract, Transform, and Load) layer, is critical in the entire process and constitutes 80% of the time in analytics projects.
3) Storage: This is the place where data is stored — it can be a data warehouse, a data lake, or a lakehouse. This repository ensures a comprehensive view, availability, quality, and governance of data. If this doesn’t happen, unfortunately, the data architecture won’t serve the business.
4) Consumption: Consumption can be visual (through dashboards, reports, alerts) or analytical (data science, machine learning) and ensures that people in the organization can make decisions based on data. Additionally, consumption from other systems can be done through APIs.
The benefits of developing a good data architecture are:
- Achieving a unified view of reality by integrating various necessary sources of information.
- Ensuring the availability and quality of information.
- Centralizing access to information by providing a reasonable Service Level Agreement (SLA). This means that the entire organization consumes information from a single repository, rather than relying on different Excel spreadsheets, for example.
- Implementing data governance to ensure security, consumption, compliance with regulations (such as GDPR).
- Feeding all recurring information systems.
Data Warehouse: Pros and Cons
The data warehouse is where information from different sources is stored, following a transformation process through Extract, Transform, Load (ETL). This ensures data consumption by the business.
The main advantages of having a data warehouse are:
- Allows meeting the vast majority of data needs as it responds to any request for structured information (dashboards, reports, etc.)
- Subject-oriented: Data is organized so that all elements related to the same event or real-world object are linked.
- Integrated: The database contains data from all organizational systems. Therefore, this data must be consistent.
- Time-variant: Changes in data over time are recorded to ensure that generated reports reflect those variations.
- Non-volatile: Information is not modified or deleted; once data is stored, it becomes read-only and is maintained for future queries.
However, in practice, storing data in a data warehouse also involves some challenges.
- It can become a bottleneck as the number of data sources increases, and the capacity to process them does not keep up.
- It may lose information if the business demands data that was not initially modeled.
- It is designed to work with static information, not with real-time schemas or unstructured information (such as videos, images, social media, etc.).
- It does not support data science or machine learning teams as it lacks the necessary detail to train them.
Therefore, we could say that the data warehouse leaves many needs of some teams within the organization unresolved.
Data Lake: Pros and Cons
It is a repository of raw data, where data is stored as it is in its original form without any transformation, unlike what happens in the data warehouse. The advantages of having a data lake are:
- Addresses two of the main problems posed by the data warehouse: loss of information and slowness to react.
- Has virtually unlimited storage capacity as it can be deployed on distributed environments (Spark, Databricks) or in cloud environments (Azure, AWS, GCP).
- Has the ability to handle unstructured data.
However, a data lake also presents other challenges.
- Technologically, it is more complex to build.
- It focuses more on loading data and less on consumption, which is where the real value lies.
- Often becomes a data dumping ground rather than an informative repository since the latter needs to be structured for meaningful consumption.
- Increases complexity and decreases performance, impacting productivity as teams work increasingly in isolation.
The layers of a modern data architecture
Modern data architectures emerge as a response to the challenges posed by traditional data warehouses. They consist of three layers through which data flows based on the transformations applied to them or the purpose of their analysis. The names of these layers may vary in different literature; in this article, we will adopt the approach proposed by the Medallion Architecture.
Source: Databricks
- Layer 1 – Data Lake (Bronze): In this layer, raw data is stored as it comes from the source (structured or unstructured), maintaining it in its original state, including any errors or duplications.
- Layer 2 – Data Sandbox (Silver): At this level, certain transformations have been applied to the data, and it is easily accessible through SQL. Data engineering teams work on transformations, data science teams on their models, and some business analysts or power users might also contribute.
- Layer 3 – Data Warehouse (Gold): At this level, data is ready for use. The gold layer possesses the same characteristics as a traditional data warehouse but, due to the addition of the other two layers, it no longer loses information. End users, typically representing 90% of the community, access this layer.
Therefore, a modern data architecture combines the data warehouse complemented by the data lake.
Recommendations to Consider:
Shaping a modern data architecture involves building a network of services and assigning a specific purpose to each. Previously, the data warehouse solved everything, but now, at a minimum, one must consider the three layers explained above.
We are facing a paradigm shift: before, we modeled, loaded, and analyzed. Now things are done in reverse: we load everything into the data lake, analyze it, and only then create the model.
We have moved from ETL (Extract, Transform, Load) to ELT (Load, Transform, Extract): the data loading strategy adapts to the volume and computing capacity. In other words, instead of bringing the data to computing, we bring computing to the data.
Building a data architecture is not a simple task. Therefore, every effort should be made to minimize the complexity of the solution. Keep in mind that, regardless of the approach, we will have to maintain the solution. Hence, the key is to keep everything as simple as possible.
—
The original version of this article was written in Spanish and translated into English by ChatGPT.