Flight Plan: Automated Data Pipeline (ADP)

This BoosterPack was created and authored by: Intelius

DAIR BoosterPacks are free, curated packages of cloud-based tools and resources about a specific emerging technology, built by experienced Canadian businesses who have built products or services using that technology and are willing to share their expertise.

Ready for takeoff?

Here’s what you’ll find in this Flight Plan

Overview

What is an Automated Data Pipeline? 

An automated data pipeline (ADP) is an end-to-end solution for automating the ingestion, transformation, storage, and presentation of data on a scalable platform. This BoosterPack demonstrates how a series of open-source tools can be integrated to create an ADP. In the Sample Solution, two types of data are used to showcase this capability:  

  1. stock market data, and  
  2. news data.  

The main tools used in this BoosterPack are Apache Airflow, Apache Kafka, and MySQL. The entire solution is deployed on a one-node Kubernetes cluster, which is created by leveraging the same technique published in the “Automate Cloud Orchestration with Kubernetes” DAIR BoosterPack. 
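The four stages named above (ingestion, transformation, storage, and presentation) can be pictured with a small sketch. This is not the Sample Solution's code: to stay runnable without a cluster, it swaps in Python's standard-library queue.Queue as a stand-in for a Kafka topic and an in-memory SQLite database as a stand-in for MySQL. The function names, the quotes table, and the sample records are all made up for illustration.

```python
import json
import queue
import sqlite3


def ingest(raw_records, topic):
    """Ingestion: publish each raw record to the topic as a JSON message."""
    for record in raw_records:
        topic.put(json.dumps(record))


def transform(message):
    """Transformation: parse a message and normalize its fields."""
    record = json.loads(message)
    return (record["symbol"].upper(), float(record["price"]))


def store(topic, conn):
    """Storage: drain the topic and insert the transformed rows."""
    while not topic.empty():
        conn.execute(
            "INSERT INTO quotes (symbol, price) VALUES (?, ?)",
            transform(topic.get()),
        )
    conn.commit()


def present(conn):
    """Presentation: return the average price per symbol."""
    rows = conn.execute(
        "SELECT symbol, AVG(price) FROM quotes GROUP BY symbol ORDER BY symbol"
    )
    return dict(rows.fetchall())


def run_pipeline(raw_records):
    topic = queue.Queue()               # stand-in for a Kafka topic (assumption)
    conn = sqlite3.connect(":memory:")  # stand-in for MySQL (assumption)
    conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL)")
    ingest(raw_records, topic)
    store(topic, conn)
    result = present(conn)
    conn.close()
    return result


# Made-up sample records, for illustration only.
sample = [
    {"symbol": "abc", "price": "10.0"},
    {"symbol": "ABC", "price": "12.0"},
    {"symbol": "xyz", "price": "5.5"},
]
summary = run_pipeline(sample)
print(summary)
```

In a real deployment, a scheduler such as Airflow would trigger the stages, and each stage would run as its own service rather than as a function in one process.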

What value does it add to my business? 

This solution: 

  • automates the processes of ingesting, processing, storing, and presenting data. 
  • has zero licensing cost as all the tools used in this BoosterPack are open source. 
  • is cloud agnostic and can be deployed in various cloud platforms. 
  • implements a microservices-based architecture where the entire application is a collection of loosely coupled services. The individual services are independently deployable, highly maintainable, and testable. 
  • can be adapted for various business cases. 

The advantage of this BoosterPack is that it enables you to select and integrate a range of open-source tools to provide a reliable, scalable, extendable end-to-end solution for real-time (or near real-time) data management.

Why choose an Automated Data Pipeline over the alternatives? 

Traditionally, organizations have had to hire architects (data or enterprise) to design the architecture and then have a development team build the required services. Building an equivalent ADP framework from the ground up today would take several months and cost thousands of dollars.

This BoosterPack Sample Solution follows a generalized approach which suits most common data management projects. The Sample Solution helps organizations get started quickly by using this solution as a foundation and then customizing it according to their specific business use case. 

The tools used in this solution can be downloaded for free and most have been used in previous BoosterPack Sample Solutions. This BoosterPack provides the ability to select and integrate all these tools in a way that produces a reliable, scalable, and extendable end-to-end solution for real-time (or near real-time) data management.   

Best Practices

HTTPS Encryption

It is strongly recommended you use a valid HTTPS certificate to ensure communications between client and server are encrypted. Refer to the Security section in the Sample Solution on the use of TLS certificates. Let’s Encrypt can be used if the organization does not have an existing SSL/TLS certificate and/or domain registered in DNS, but this is not the preferred production solution.
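On the client side, encrypted communication only helps if the client actually verifies the server's certificate. The sketch below uses Python's standard ssl module to build a context that does so. The helper name make_client_context is hypothetical, and requiring TLS 1.2 as a minimum is an assumption, not a requirement stated by the Sample Solution.

```python
import ssl


def make_client_context():
    """Build a TLS context that verifies the server certificate and hostname."""
    # create_default_context() loads the system CA bundle and turns on
    # certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumption: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
```

A context like this can be passed to standard-library clients, for example urllib.request.urlopen(url, context=context).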

Backup Strategy

Argus was designed for deployment on K3s, a lightweight Kubernetes implementation installed on a single VM. K3s simplifies the process so that you don’t need to thoroughly understand and implement the underlying components required to operate Kubernetes. This allows the solution to be backed up easily by taking snapshots of the VM and recovered through VM restores. Not everyone will adopt this backup strategy, but it does allow teams to test Argus quickly and recover to a known copy without resorting to expensive backup solutions. Organizations that want to use Argus in a full Kubernetes environment can access the YAML files from the source code repository and customize the source to meet their needs (for example, if they wish to use Kubernetes packages like FluxCD for more DevOps-oriented workflows).

Admin Passwords

Strong admin passwords have been supplied by default for this Sample Solution. Do not keep passwords in the source code repository; instead, use the ‘secrets’ capability supplied by the CI/CD pipeline tools or cloud providers to securely hide them.
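One common way to keep a password out of the repository is to have the secret store inject it into the service as an environment variable and have the code read it at startup. The sketch below assumes that pattern. The variable name ADP_ADMIN_PASSWORD and the helper load_admin_password are hypothetical and do not come from the Sample Solution.

```python
import os


def load_admin_password(env_var="ADP_ADMIN_PASSWORD", environ=None):
    """Read the admin password from the environment instead of source code.

    env_var is a hypothetical name; use whatever name your CI/CD tool or
    cloud provider injects the secret under.
    """
    environ = os.environ if environ is None else environ
    password = environ.get(env_var, "")
    if not password:
        # Fail fast rather than falling back to a hard-coded default.
        raise RuntimeError(f"{env_var} is not set; inject it from your secret store")
    return password
```

Failing at startup when the variable is missing makes a misconfigured deployment obvious and avoids silently running with a default credential.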

Tips & Traps

The code in this Sample Solution can be customized according to your specific business use case:  

  • The ingestion service can be customized to ingest data from any other live data feed such as IoT devices, data from a WebSocket, etc. 
  • The processing service can be customized to include code for data cleaning, transformation to a different schema, joining with reference datasets, imputation, augmentation with calculated columns, etc. 
  • Analytical services can be used to build machine learning or deep learning models, from simple to complex. The models can be retrained on the data ingested via the cold path and later deployed on the cloud. The processing service can then invoke the most recently saved model in real time to score data arriving on the hot path, producing real-time prediction results. 

Resources

 

Tutorials 

The table below provides some useful tutorials to help you get “hands on” experience with the tools and technologies used in this package.

Tutorial Content  Summary 
Apache Airflow Tutorial  Learn how to define a pipeline, instantiate a DAG, define tasks, and schedule DAGs. 
Kubernetes Official Tutorials  Learn the basics about Kubernetes clusters (orchestration platform for containers).   
Docker 101 Tutorial  Learn the basics of containerization and Docker by installing Docker Desktop and following the first few lessons in this tutorial. 
FastAPI Tutorial – User Guide Intro  Learn how to develop and call REST APIs in Python.  
Intro to Apache Kafka: How Kafka Works  Several quick introductory videos will help you understand how Kafka works.  

 

Documentation

The following list provides links to some useful references for learning more about the tools and technologies used in this Sample Solution.

Document  Summary 
Official Apache Airflow Documentation  Airflow is a platform to programmatically author, schedule, and monitor workflows. This document discusses Apache Airflow core configuration and concepts, specifically DAGs and DAG runs, tasks, the scheduler, and the Celery executor. 
Official Documentation for Apache Airflow Helm Chart  While there are several Helm charts to install Airflow on a Kubernetes cluster, the official Helm chart (link to source code) is used in this package. Specifically, you can refer to the “Parameters References” to learn about all the parameters that can be configured in the Helm chart’s values.yaml file. 
Apache Kafka Documentation  Documentation about Apache Kafka, the event streaming and messaging tool used in this package. 
Documentation and Source Code for Apache Kafka Helm Chart (created by Bitnami)  A GitHub repository that contains the Helm chart source code to install Apache Kafka used in this package.  
MySQL Documentation  Official documentation for MySQL, our selected DBMS in this package. 
FastAPI Documentation   Documentation about FastAPI, a modern and high-performance framework for building APIs based on Python. FastAPI is used in this package to expose the news sentiment prediction API and middle-tier services APIs. 
BigBitBus’s Kubernetes Automation Toolkit main README  The main README and documentation for the toolkit used in the “Automate Cloud Orchestration with Kubernetes” BoosterPack. This resource also contains a concise introduction to containers and important Kubernetes concepts. 
Kubernetes Official Documentation  A comprehensive and well-structured reference to Kubernetes concepts and configuration. 

 

Support

Intelius Analytics has published the source code for this BoosterPack as an open-source project on GitHub. If you find a bug or have ideas for improvement, you can open an issue in this Git repository. 

In addition, you can post your questions in the DAIR Slack channel or via email to [email protected]. 

Got it? Now we’ll show you how to deploy ADP on the DAIR Cloud… 

Automated Data Pipeline (ADP) Sample Solution

For organizations that need a low-cost framework to automate their data management processes, the ADP Sample Solution is a great option. It demonstrates how open-source tools can be integrated into a framework based on a microservices architecture, running on a scalable platform that can be customized to your organization’s business needs.  

This can be accomplished without the need to embark on the time-consuming and costly journey of developing a highly customized framework from scratch, employing skilled data architects and engineers, or relying on costly SaaS solutions from vendors and public cloud providers.  

Click here for instructions on how to implement the ADP Sample Solution.