In this article we're going to introduce key concepts of data engineering, including building data lakes and data warehouses.
The explosion of data creation in recent years has made data engineering a highly valuable skill in today's market. The importance of data engineering comes from the fact that the vast majority of data is never analyzed, which can be summarized by the following quote from McKinsey Research:
By 2020, some 50 billion smart devices will be connected, along with additional billions of smart sensors, ensuring that the global supply of data will continue to more than double every two years...and we’re still in the early days of this one: though about 90 percent of the digital data ever created in the world has been generated in just the past two years, only 1 percent of that data has been analyzed.
This article is based on notes from the Data Engineering with Google Cloud Specialization and is organized as follows:
- The Role of a Data Engineer
- Data Engineering Challenges
- Data Lakes and Data Warehouses
- Transactional Databases vs. Data Warehouses
- Partnering with Other Data Teams
- Managing Data Access & Governance
- Building Production-Ready Pipelines
The Role of a Data Engineer
The primary role of a data engineer is to build data pipelines that enable data-driven decision making.
The ultimate goal of a data pipeline is to get the company's raw data into a place and format where it can be useful, for example to be used in an analytics dashboard or a machine learning model.
A key term in data engineering is the concept of a data lake, which brings together data from across the enterprise into a single location. For example, the data may come from spreadsheets or relational databases in various departments, and the raw data is then stored in a data lake.
One option for storing a data lake is to use a cloud storage bucket.
When choosing between data lake options, a few key considerations include:
- Can the data lake handle all the data types you have?
- Can it elastically scale to meet your data demands?
- Does it support high-throughput ingestion?
- Is there fine-grained access control to objects?
- Can other tools connect to the data lake easily?
In answering these questions, keep in mind that the purpose of a data lake is to make data easily accessible for analytics.
A cloud storage bucket, for example, is designed to stage all of your data in one place before building transformation pipelines to send it to a data warehouse.
If the raw data needs additional processing, a data engineer may need to extract it from its original location, transform it, and load it. This is referred to as ETL, or Extract, Transform, and Load.
With Google Cloud Platform, for example, we can perform this data processing with services like Cloud Dataproc or Cloud Dataflow.
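The ETL pattern described above can be sketched in plain Python. This is a toy in-memory example standing in for a Dataproc or Dataflow job; the field names and values are hypothetical, and SQLite stands in for the downstream warehouse:

```python
import sqlite3

# Extract: raw rows as they might arrive from a spreadsheet export.
raw_rows = [
    {"dept": "Sales ", "revenue": "1200.50"},
    {"dept": "sales", "revenue": "300"},
    {"dept": "Ops", "revenue": "980.00"},
]

def transform(row):
    # Transform: normalize department names and parse numeric strings.
    return (row["dept"].strip().lower(), float(row["revenue"]))

# Load: write the cleaned rows into a SQL table (SQLite stands in
# for the warehouse here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (dept TEXT, amount REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)",
                 [transform(r) for r in raw_rows])

total_sales = conn.execute(
    "SELECT SUM(amount) FROM revenue WHERE dept = 'sales'"
).fetchone()[0]
print(total_sales)  # 1500.5
```

Note that the two "Sales" spellings only become joinable after the transform step, which is exactly why raw data needs processing before analytics.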
If your data arrives continuously and endlessly (i.e. streaming data), batch pipelines may not be enough. In this case you would need to use streaming data processing with services like Cloud Pub/Sub and BigQuery.
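The difference between batch and streaming processing can be illustrated with a small stand-in: a queue plays the role of a Pub/Sub subscription, and each message is handled as it arrives rather than waiting for a batch window to close. The event names are hypothetical:

```python
import queue

# A queue stands in for a Pub/Sub subscription: producers publish
# messages continuously, and the pipeline consumes them as they arrive.
subscription = queue.Queue()
for msg in ({"event": "click"}, {"event": "view"}, {"event": "click"}):
    subscription.put(msg)
subscription.put(None)  # sentinel: end of this demo stream

counts = {}
while True:
    msg = subscription.get()
    if msg is None:
        break
    # Each message updates the running aggregate immediately,
    # rather than being collected for a later batch job.
    counts[msg["event"]] = counts.get(msg["event"], 0) + 1

print(counts)  # {'click': 2, 'view': 1}
```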
Data Engineering Challenges
A few of the most common challenges encountered by data engineers when building data pipelines include:
- Data Access: Difficulty accessing data as it may be scattered across a variety of databases and external platforms.
- Data Quality: After accessing the data, it may not have the accuracy or quality that's required for analytics or machine learning models.
- Computational Resources: Even when data quality is adequate, data transformations may require a significant amount of computational resources.
- Query Performance: There may be challenges around query performance and being able to run all the necessary queries and transformations with the available computational resources.
Let's look at a few ways that you can solve each of these challenges.
Data Access
What makes data access so difficult is that data is often siloed in many upstream source systems: each department has its own transactional systems and business processes.
Data Accuracy & Quality
Cleaning, formatting, and preparing data for business insights often requires that you build ETL pipelines. Once the data has been cleaned and transformed, it is then stored in a data warehouse as opposed to a data lake.
Unlike a data lake in which the data is in a raw format, data in a data warehouse is easily joinable and can be queried efficiently.
In general, you can assume that any data from a raw source will need to be cleaned, transformed, and stored in a data warehouse before insights can be extracted.
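Cleaning a raw source typically means deduplicating records and quarantining malformed values before anything is loaded into the warehouse. A minimal sketch, with hypothetical records and a deliberately simple validity check:

```python
raw = [
    {"id": "1", "email": "a@example.com"},
    {"id": "1", "email": "a@example.com"},   # duplicate key
    {"id": "2", "email": "not-an-email"},    # fails validation
    {"id": "3", "email": "c@example.com"},
]

seen = set()
clean, rejected = [], []
for row in raw:
    if row["id"] in seen:
        continue  # drop rows whose key we've already loaded
    seen.add(row["id"])
    if "@" not in row["email"]:
        rejected.append(row)  # quarantine bad rows for inspection
    else:
        clean.append(row)

print(len(clean), len(rejected))  # 2 1
```

In a real pipeline the rejected rows would land in their own table or bucket so that data quality issues can be traced back to the source system.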
Availability of Computational Resources
If the raw data requires a significant amount of consolidation and clean up, the availability of computational resources can be a challenge.
If you're using an on-premise system, data engineers need to manage server and cluster capacity to ensure there's enough capacity to perform the necessary ETL processes.
The challenge is that the compute required for ETL jobs is not constant over time. When traffic is low, computational resources may be wasted; when traffic is high, the ETL jobs may take too long.
Once the data is in the data warehouse, you then need to optimize the query performance to ensure users are making the most efficient use of available compute resources.
One way to manage server overhead so that you can focus on business insights is to use a serverless data warehouse such as BigQuery. This service allows you to replace the typical hardware setup of a traditional data warehouse.
Here's a high-level overview of the data warehouse solution:
- A serverless data warehouse such as BigQuery organizes data into units called datasets and each dataset is tied to a GCP project.
- A data lake may contain files in cloud storage or data from transactional systems, for example, and BigQuery can define an external schema and issue queries directly against the external data source.
- Database tables and views function in BigQuery the same way as in a traditional data warehouse, which allows you to write queries using standard SQL.
- Cloud Identity and Access Management (IAM) is used to grant permission to perform actions with BigQuery.
In short, cloud-based data warehouses allow data engineers to spend less time managing hardware and enable analytics to scale.
In the case of computational provisioning, cloud solutions also allow you to allocate storage and query resources dynamically based on usage patterns.
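Because warehouse tables are queried with standard SQL, the analytics side of this setup can be sketched locally. Here SQLite stands in for the warehouse, and the table and column names are hypothetical; in BigQuery the table would live in a dataset tied to a project:

```python
import sqlite3

# SQLite stands in for the warehouse here. In BigQuery this table
# would be addressed as project.dataset.table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Standard SQL aggregations work the same way against a warehouse table.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```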
Data Lakes and Data Warehouses
Now that we've defined data lakes and data warehouses at a high-level, let's look at them in a bit more detail.
Recall that one of the primary roles of a data engineer is to get the data into a usable condition so that it can be used to extract insights.
Raw data first gets replicated and stored in a data lake. We then use an ETL pipeline in order to make the data usable and ready for storage in a data warehouse.
A few key considerations when deciding between data warehouse options include:
- Can it serve as a sink for both batch and streaming data pipelines?
- Can the data warehouse scale to the needs of the organization?
- How is data organized, cataloged, and access controlled?
- Is it designed for query performance?
- What level of maintenance is required by the data engineers?
Aside from using ETL pipelines, you can also treat a data warehouse such as BigQuery as just a query engine and allow it to query data directly in the data lake.
You can find an example of external data queries with BigQuery on GitHub.
Transactional Databases vs. Data Warehouses
Data engineers will often be responsible for both the backend transactional database systems that support the company's applications and the data warehouse that supports analytical workflows.
If you're using SQL Server, MySQL or Postgres as your relational database, these can all be migrated to Cloud SQL, which is a fully managed relational database.
The difference between these two is that:
- Cloud SQL is optimized as a database for transactions
- BigQuery is a data warehouse that's optimized for reporting workloads
In short, a relational database management system (RDBMS) helps your business manage transactions.
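The defining feature of a transactional database is atomicity: a group of writes either fully succeeds or fully rolls back. A minimal sketch with SQLite (a local stand-in for a system like Cloud SQL; the table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 10)")
conn.commit()

# Transactional workload: small atomic updates that must either
# fully succeed or fully roll back.
try:
    conn.execute("UPDATE inventory SET qty = qty - 3 WHERE item = 'widget'")
    conn.execute("INSERT INTO inventory VALUES ('widget', 99)")  # violates PK
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # the decrement is undone along with the bad insert

qty = conn.execute(
    "SELECT qty FROM inventory WHERE item = 'widget'"
).fetchone()[0]
print(qty)  # 10 -- the whole transaction rolled back
```

A reporting workload, by contrast, scans and aggregates many rows at once, which is what a warehouse like BigQuery is optimized for.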
Here's a high-level overview of how these all fit in the context of data engineering:
- Operational systems like relational databases that store transactions, inventory, promotions, etc. are the raw data source
- This raw data is gathered together into a single consolidated location, which is the data lake
- The data is processed through transformations and is then outputted to a data warehouse
- The data is then ready for use by downstream analytic teams, such as an ML engineering team or a business intelligence team using it for reporting and dashboards
Partnering with Other Data Teams
The three most common teams that Data Engineers need to partner with include:
- Machine Learning Engineers: ML teams rely on data engineers to help them capture new features in a stable pipeline.
- Data Analysts: Data analysts and business intelligence teams rely on data engineering to showcase their latest insights through reporting and dashboards.
- Other Data Engineers: other data engineering teams may rely on your pipelines being timely and error-free.
Managing Data Access & Governance
In order to work with these other teams, data engineers will need to set up data access and governance policies. A few key considerations to make in managing datasets and data pipelines include:
- Who should and shouldn't be able to access the data?
- How is personally identifiable information (PII) such as phone numbers or emails handled?
- How will end users discover different datasets that are available for analysis?
Two GCP services for data governance are Cloud Data Catalog, a managed data discovery platform, and the Data Loss Prevention (DLP) API for protecting personal information. You can find examples of serverless data loss prevention on GitHub.
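To make the PII-handling concern concrete, here is a simple regex-based redaction sketch. This is not the DLP API, which provides managed detectors for many more information types; the patterns below are deliberately basic and the message is hypothetical:

```python
import re

# Basic patterns for two common PII types. Real detectors
# (e.g. the DLP API's) handle far more formats and locales.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

msg = "Contact jane@example.com or 555-867-5309 for details."
print(redact(msg))  # Contact [EMAIL] or [PHONE] for details.
```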
Building Production-Ready Pipelines
After your data lakes and data warehouses are set up and the governance policy is in place, the next step is to productionize the entire pipeline.
Common goals of a production-ready pipeline include:
- Ensuring the health of the pipeline and cleanliness of the data
- Minimizing the required maintenance and maximizing uptime
- Responding and adapting to changing schemas and business needs
- Using the latest data engineering tools and best practices
Common workflow orchestration tools used by enterprises include Apache Airflow and Cloud Composer.
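These tools express a pipeline as a directed acyclic graph (DAG) of tasks that run in dependency order. A pure-Python stand-in for that idea, without Airflow installed, using the standard library's topological sorter (the task names are hypothetical):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, mirroring how
# an Airflow DAG wires operators together.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}

# A scheduler would execute tasks in this dependency-respecting order.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load_warehouse', 'refresh_dashboard']
```

Airflow adds scheduling, retries, and monitoring on top of this ordering, which is what makes the pipeline production-ready rather than a one-off script.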
Summary: Data Engineering, Data Lakes, and Data Warehouses
In this introduction to data engineering, we discussed key concepts, including how raw data sources feed into a data lake and are then processed into a data warehouse, where the data is used for analysis.
A data lake is a consolidated location for raw data that is durable and highly available. A data warehouse is where the data is stored after it has gone through preprocessing and is ready for analytical and machine learning workloads.