Designing a Data Lake on GCP with Data Fusion and Composer

Neha Joshi
Google Cloud - Community
Jan 14, 2021


As more organisations migrate their data platforms to the cloud, there is growing demand for cloud technologies that let them use the skill sets they already have while still ensuring a successful migration.

ETL developers often form a sizeable part of data teams in many organisations. These developers are well versed in GUI-based ETL tools and complex SQL, and many have, or are beginning to develop, programming skills in languages such as Python.

In this three-part article series, I will share an overview of:

  • a scalable data lake architecture for storage and consumption of structured data on Google Cloud Platform (GCP), using data integration and orchestration services suited to the skill set described above [this article]
  • a detailed solution design for easy-to-scale ingestion using Data Fusion and Cloud Composer
  • enabling discoverability of the data lake through Cloud Data Catalog [coming soon]

I will publish the code for this solution in the near future for those who might be interested in reuse. Look out for an update to this article with the link to the code.

Who will find this article useful

This article series will be useful for solution architects and designers getting started with GCP and looking to establish a data platform/data lake on GCP.

Key Requirements of the Use Case

There are a few broad requirements that I took into account while designing the architecture:

  1. Leverage the existing ETL skill set available in the organisation.

  2. Ingest from hybrid sources such as on-premise RDBMS (e.g., SQL Server, PostgreSQL), flat files and 3rd party API sources.

  3. Support complex dependency management in job orchestration, not just for the ingestion jobs but also for custom pre- and post-ingestion tasks.

  4. Design for a lean code base and configuration-driven ingestion pipelines (a hypothetical example of such a configuration follows this list).

  5. Enable data discoverability while still ensuring appropriate access controls.
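
To illustrate requirement 4, a configuration-driven approach could describe each source object as an entry similar to the hypothetical ones below, which a generic Composer DAG can iterate over and pass to a parameterised Data Fusion pipeline at runtime. The field names and values here are illustrative only, not the actual configuration schema of this solution.

```python
# Hypothetical ingestion configuration: one entry per source object.
# A generic DAG can loop over entries like these and hand them to a
# reusable, parameterised Data Fusion pipeline as runtime arguments.
INGESTION_CONFIG = [
    {
        "source_system": "sales_db",            # illustrative source name
        "source_type": "sqlserver",
        "source_object": "dbo.orders",
        "load_type": "incremental",
        "watermark_column": "last_updated_ts",
        "target_dataset": "data_lake_raw",
        "target_table": "orders",
        "pipeline_name": "generic_jdbc_to_bq",   # assumed reusable pipeline
    },
    {
        "source_system": "crm_api",
        "source_type": "http",
        "source_object": "/v1/contacts",
        "load_type": "full",
        "target_dataset": "data_lake_raw",
        "target_table": "contacts",
        "pipeline_name": "generic_http_to_bq",
    },
]
```

Adding a new source then becomes a matter of adding a configuration entry rather than writing a new pipeline, which keeps the code base lean.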

Solution Architecture

The architecture designed for the data lake to meet the above requirements is shown below. The key GCP services involved include services for data integration, storage, orchestration and data discovery.

Data Lake Architecture

Considerations for Tool Selection

GCP provides a comprehensive set of data and analytics services. There are multiple service options available for each capability and the choice of service requires architects and designers to consider a few aspects that apply to their unique scenarios.

In the following sections, I describe the considerations involved in selecting each type of service used in the data lake architecture, and the rationale behind my final choice for each.

There are multiple ways to design the architecture with different service combinations and what is described here is just one of the ways. Depending on your unique requirements, priorities and considerations, there are other ways to architect a data lake on GCP.

Data Integration Service

The decision tree below details the considerations involved in selecting a data integration service on GCP.

Data integration service options on GCP and considerations for selection

Integration Service Chosen

For my use case, data had to be ingested from a variety of data sources including on-premise flat files and RDBMS such as Oracle, SQL Server and PostgreSQL, as well as 3rd party data sources such as SFTP servers and APIs. The variety of source systems was expected to grow in the future. Also, the organisation this was being designed for had a strong presence of ETL skills in their data and analytics team.

Considering these factors, Cloud Data Fusion was selected for creating data pipelines.

What is Cloud Data Fusion?

Cloud Data Fusion is a GUI based data integration service for building and managing data pipelines. It is based on CDAP, which is an open source framework for building data analytics applications for on-premise and cloud sources. It provides a wide variety of out of the box connectors to sources on GCP, other public clouds and on-premise sources.

Below is a screenshot of a simple pipeline in Data Fusion.

Sample Data Fusion Pipeline

What sits under the hood?

Under the hood, Data Fusion executes pipelines on a Dataproc cluster. Data Fusion automatically converts GUI based pipelines into Dataproc jobs for execution whenever a pipeline is executed. It supports two execution engine options: MapReduce and Apache Spark.

Data Fusion: Under the hood
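
Because Data Fusion is built on CDAP, a deployed pipeline can also be started programmatically through the instance's CDAP REST API. The sketch below is a minimal illustration of that call; the instance endpoint and pipeline name are placeholders, and it assumes the pipeline is deployed in the default namespace.

```python
import google.auth
import google.auth.transport.requests
import requests

# Placeholder endpoint: use the apiEndpoint reported by
# `gcloud beta data-fusion instances describe INSTANCE --location=REGION`
CDAP_ENDPOINT = "https://my-instance-my-project-dot-usc1.datafusion.googleusercontent.com/api"
PIPELINE_NAME = "orders_to_bigquery"  # assumed name of a deployed pipeline

# Obtain a bearer token from the application default credentials
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# CDAP REST call that starts the pipeline's batch workflow
response = requests.post(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE_NAME}"
    "/workflows/DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {credentials.token}"},
)
response.raise_for_status()
print("Pipeline start requested:", response.status_code)
```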

What more can you do with Data Fusion?

Beyond the capability to create code free GUI based pipelines, Data Fusion also provides features for visual data profiling and preparation, simple orchestration features, as well as granular lineage for pipelines.

Capabilities provided by Data Fusion

Orchestration

The decision tree below details the considerations involved in selecting an orchestration service on GCP.

Orchestration options on GCP and considerations for selection

My use case required managing complex dependencies such as converging and diverging execution control. A UI for accessing operational information such as historical runs and logs, and the ability to restart workflows from the point of failure, were also important. Owing to these requirements, Cloud Composer was selected as the orchestration service.

What is Cloud Composer?

Cloud Composer is a fully managed workflow orchestration service. It is a managed version of open source Apache Airflow and is fully integrated with many other GCP services.

Workflows in Airflow are represented in the form of a Directed Acyclic Graph (DAG). A DAG is simply a set of tasks that need to be performed, along with the dependencies between them. Below is a screenshot of a simple Airflow DAG.

Sample Composer (Airflow) DAG

Airflow DAGs are defined using Python.

Here is a tutorial on how you can write your first DAG. For a more detailed read, see the tutorials in the Apache Airflow documentation. Airflow operators are used to perform the tasks in a DAG, and operators are available for a large number of GCP services as well as other public clouds. See this Airflow documentation page for the GCP operators available.
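
To make the idea concrete, here is a minimal sketch of a DAG with two dependent tasks. The DAG id, schedule and task names are purely illustrative and not part of the solution described in this series.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def log_completion():
    """Placeholder task: in a real pipeline this could write an audit record."""
    print("Ingestion batch finished")


with DAG(
    dag_id="sample_ingestion_dag",      # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_files",
        bash_command="echo 'extracting source files'",
    )
    notify = PythonOperator(
        task_id="log_completion",
        python_callable=log_completion,
    )

    # '>>' defines the dependency: extract_files runs before log_completion
    extract >> notify
```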

Segregation of Duties between Data Fusion and Composer

In the Data Lake solution being outlined, Data Fusion is used purely for data movement from source to destination. Cloud Composer is used for orchestration of Data Fusion pipelines and any other custom tasks performed outside of Data Fusion. Custom tasks could be written for tasks such as audit logging, applying data dictionary to the tables, archiving files or automating any other tasks in the integration lifecycle. This is described in more detail in the next article in the series.
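
As a hedged sketch of this segregation of duties, the DAG below starts a Data Fusion pipeline using the operator from the Airflow Google provider package and then runs a custom audit-logging task. The instance, region, DAG and pipeline names are assumptions for illustration only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)


def write_audit_record(**context):
    """Custom post-ingestion task, e.g. recording run metadata in an audit table."""
    print(f"Pipeline run finished for {context['ds']}")


with DAG(
    dag_id="orders_ingestion",               # illustrative DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start_pipeline = CloudDataFusionStartPipelineOperator(
        task_id="start_orders_pipeline",
        instance_name="datafusion-dev",      # assumed Data Fusion instance name
        location="australia-southeast1",     # assumed region
        pipeline_name="orders_to_bigquery",  # assumed deployed pipeline name
    )

    audit = PythonOperator(
        task_id="write_audit_record",
        python_callable=write_audit_record,
    )

    # Data Fusion does the data movement; Composer sequences it with custom tasks
    start_pipeline >> audit
```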

Data Lake Storage

The storage layer for the data lake needs to take into account the nature of the data being ingested and the purpose it will be used for. The image below (credit: GCP website) provides a decision tree for storage service selection based on these considerations.

Database options on GCP and when to use which service (image credit: GCP website)

Since this article addresses the solution architecture for structured data used for analytical purposes, BigQuery was selected as the storage service/database for this data lake solution.
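
In this architecture the Data Fusion pipelines write into BigQuery, but as a minimal illustration of what the landing-to-lake hop looks like programmatically, here is a sketch using the google-cloud-bigquery client. The project, dataset, table and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: replace with your own project, dataset, table and bucket
table_id = "my-project.data_lake_raw.customers"
source_uri = "gs://my-landing-bucket/customers/2021-01-14/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema for this sketch
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load the landed file(s) from GCS into the BigQuery table
load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```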

Data Discovery

Cloud Data Catalog is the GCP service for data discovery. It is a fully managed, highly scalable data discovery and metadata management service that automatically discovers technical metadata from BigQuery, Pub/Sub and Google Cloud Storage.

No additional process or workflow is required to make data assets in BigQuery, Cloud Storage and Pub/Sub available in Data Catalog; it discovers these assets automatically and makes them available to users for discovery.

GCP Data Catalog
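
As a quick illustration of how these self-discovered assets can be queried, below is a minimal sketch using the Python client for Data Catalog. The project id and search term are placeholders.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Placeholder project id: scope the search to your own project(s)
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project"]
)

# Search for tables whose name or metadata mentions "customer"
results = client.search_catalog(scope=scope, query="type=table customer")

for result in results:
    print(result.relative_resource_name)
```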

A glimpse again at the architecture…

Now that we have a better understanding of why Data Fusion and Cloud Composer were chosen, the rest of the architecture is self-explanatory.

The only additional aspect I want to touch upon is the reason for opting for a GCS landing layer.

Data Lake Architecture

To land or not to land files on GCS?

In this solution, data from on-premise flat files and SFTP is landed in Google Cloud Storage before being ingested into the lake. This addresses the requirement that the integration service should only be allowed to access selected files, so that sensitive files are never exposed to the data lake.

Below is a decision matrix with a few points to consider when deciding whether or not to land files on GCS before loading into BigQuery. It is quite likely that you will see a combination of these factors, and the approach you decide to take will be the one that works for all those factors that apply to you.

Source System: On-premise and SFTP Files

Scenarios and Approach

* Files can be copied or moved from network drives to GCS using GCP's Storage Transfer Service for on-premises data, or by writing custom scripts that run on-premise.

** Samba is supported, but other file sharing protocols/tools such as Connect:Direct, WebDAV, etc. are not.

Source System: 3rd Party APIs

Decision matrix

* The Data Fusion out-of-the-box source connector for API sources (i.e., the HTTP source plugin) supports basic authentication (ID/password based) and OAuth2-based authentication for source APIs.

RDBMS

No landing zone was used in this architecture for data from on-premise RDBMS systems. Data Fusion pipelines were used to directly read from source RDBMS using JDBC connectors available out of the box. This is because there was no sensitive data in those sources that needed to be restricted from being ingested into the data lake.

Takeaway

To recap, GCP provides a comprehensive set of services for Data and Analytics and there are multiple service options available for each task. Deciding which service option is suitable for your unique scenario requires you to consider a few factors that will influence the choices you make.

In this article, I have provided some insight into the considerations involved in choosing the right GCP services when designing a data lake.

Also, I have described the GCP architecture for a data lake that ingests data from a variety of hybrid sources, with ETL developers being the key persona in mind for skill set availability.

What next?

In the next article in this series, I describe in detail the solution design for ingesting structured data into the data lake, based on the architecture outlined here.
