The role of ETL pipelines in the modern data stack

Fast-growing companies modernize their data infrastructure by adopting a suite of cloud-native tools called the modern data stack. ETL data pipelines are a vital component of the modern data stack and enable businesses to embrace new data sources and improve business intelligence.


Modern Data Stack


A data stack is a set of tools that perform specific processes as part of an integrated system. By combining tools that handle functions like storing data from multiple sources, transforming data, and visualizing it, you can assemble more functional and agile data operations with greater customizability and interchangeability. This means, for example, that when your data storage needs grow and your current storage solution no longer keeps up, you can swap out that layer of the stack without replacing the entire stack.


These days, an effective modern data stack enables three basic operations: 1) collecting data from different sources and ingesting it into a storage system, 2) cleaning and transforming the data, and 3) using the transformed data for BI visualizations. These three processes together make up the data pipeline. The structure of a data pipeline may differ based on a company’s requirements, but they all share the similarities summarized in the diagram below.




Main characteristics of a modern data stack:


  • A variety of connectors to raw data sources, enabling quick access to data
  • Automatic schema migration to detect changes to data sources
  • An intuitive user interface
  • An automated data pipeline tool such as Intempt
  • A cloud data destination such as Snowflake, BigQuery, or AWS Redshift
  • A business intelligence engine such as Looker, Chartio, or Tableau

ETL Pipelines in a Modern Data Stack


ETL is an acronym for Extract, Transform, and Load. An ETL pipeline is a series of connected processes that extract data from a source, transform it, and load it into a destination. The source can be, for example, a website, an application, an API, a marketing tool, or a transactional database, and the destination can be a database, a data warehouse, or a cloud-hosted data warehouse (e.g., Amazon Redshift, Google BigQuery, or Snowflake).


The purpose of the ETL Pipeline is to find the right data, make it ready for reporting, and store it in a place that allows for easy access and analysis.
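
To make the three stages concrete, here is a minimal Python sketch of an ETL pipeline. The CSV export, column names, and SQLite destination are placeholders standing in for a real source and warehouse, not part of the Intempt setup described later.

  import csv
  import sqlite3
  from datetime import datetime, timezone

  def extract(path):
      """Extract: read raw order rows from a CSV export (hypothetical file)."""
      with open(path, newline="") as f:
          return list(csv.DictReader(f))

  def transform(rows):
      """Transform: clean types and keep only the columns the report needs."""
      return [
          {
              "order_id": row["order_id"],
              "amount": float(row["amount"] or 0),
              "created_at": datetime.fromtimestamp(
                  int(row["created_ts"]), tz=timezone.utc
              ).isoformat(),
          }
          for row in rows
      ]

  def load(rows, db_path="analytics.db"):
      """Load: write the transformed rows into a destination table."""
      con = sqlite3.connect(db_path)
      con.execute(
          "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL, created_at TEXT)"
      )
      con.executemany(
          "INSERT INTO orders VALUES (:order_id, :amount, :created_at)",
          rows,
      )
      con.commit()
      con.close()

  if __name__ == "__main__":
      load(transform(extract("shopify_orders.csv")))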


In a modern data stack, ETL Pipelines are useful to:


  • Clean data. During ETL transformations, data is cleaned and joined across sources before it is saved in the destination database. These operations let you work with clean information and reject dirty data.
  • Unify data. A well-designed ETL pipeline includes all of the business sources relevant to your operations in a single destination database.
  • Improve data quality. ETL processes validate data at extraction and remove invalid data during transformation. This ensures the quality of the data is controlled before it is analyzed, which increases the accuracy of analysis with BI tools.
  • Increase data velocity. ETL processes can be extremely fast, improving the workflow of the whole organization. In a modern data stack, ETL processes can be designed to trigger the entire pipeline whenever new data arrives in the sources or existing data changes (see the sketch after this list).
  • Discover new insights. The entire ETL process brings structure to your company’s data stack. This allows you to spend more time analyzing complex questions and evaluating deep insights, rather than trying to perform tedious technical procedures.
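
As a rough illustration of the trigger-on-new-data pattern mentioned under “Increase data velocity,” the sketch below exposes a hypothetical webhook endpoint that kicks off the pipeline whenever a source reports new or changed records. The /hooks/data-changed path and run_pipeline() helper are assumptions for illustration; in practice your pipeline tool’s own scheduler or triggers would handle this.

  from flask import Flask, jsonify, request

  app = Flask(__name__)

  def run_pipeline(source: str) -> None:
      # Placeholder: call your extract/transform/load steps here.
      print(f"Running ETL pipeline for source: {source}")

  @app.route("/hooks/data-changed", methods=["POST"])
  def on_data_changed():
      # A source system POSTs here whenever new data arrives or existing data changes.
      payload = request.get_json(silent=True) or {}
      run_pipeline(payload.get("source", "unknown"))
      return jsonify({"status": "pipeline triggered"}), 202

  if __name__ == "__main__":
      app.run(port=8000)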


Example Use Case for ETL Pipelines: Optimize Your Funnel for Conversions


When new users sign up for the product, register in the system, or add items to the cart, some will inevitably get stuck and never finish the conversion. Maybe they’re confused by the UX, aren’t sure how to proceed, or can’t tell which pricing plan is right for them.


With accurate analytical data, you can optimize your funnel and increase the conversion rate by finding where the funnel leaks.


To analyze this, you need to extract data from multiple sources, transfer it to a data warehouse, and analyze it with a BI tool. In this example, we will use the Intempt platform to gather the necessary data from the Shopify connector (transaction data) and the Web and iOS sources (event data), and transfer all collections to the Redshift warehouse. After that, the visualization can be created in Mode Analytics.


Preparation


For most products and services, there is a sequence of steps you’d expect new users to navigate through to achieve the conversion.


The first step is to map out these events and start tracking them in Intempt.


Most common conversion events across different industries:


  • eCommerce – Order completed
  • SaaS – Trial/subscription started
  • Travel & Hospitality – Booking placed


Whatever yours is, make sure it ties to some meaningful part of your business’s revenue.


Once you have the one metric that matters, make sure you’re tracking the 3-5 steps that lead a user to the conversion event.


For example, for the eCommerce page, these steps could be:


Visited landing page → Visited product page → Added an item to cart → Opened checkout page.
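
Finding where the funnel leaks ultimately comes down to comparing how many users reach each step. A minimal Python sketch of that arithmetic, with made-up counts, looks like this:

  # Hypothetical user counts per funnel step, ordered from first touch to conversion.
  funnel = [
      ("Visited landing page", 10_000),
      ("Visited product page", 4_200),
      ("Added an item to cart", 1_300),
      ("Opened checkout page", 700),
      ("Order completed", 310),
  ]

  previous = None
  for step, users in funnel:
      if previous is None:
          print(f"{step}: {users} users")
      else:
          print(f"{step}: {users} users ({users / previous:.1%} of previous step)")
      previous = users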


ETL Process Flow


1. Extract



In the extraction stage of the ETL pipeline, data is sourced and extracted from different systems (e.g., CSVs, web services, advertising platforms, CRMs, and other services).


The most common data sources from which data can be extracted are:


  1. Databases
  2. Files
  3. Web
  4. Other sources, such as user feeds (RSS)

Data extraction methods (a minimal sketch follows the list):


  1. Full extraction
  2. Partial extraction without update notification
  3. Partial extraction with update notification
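
Here is a hedged sketch of the difference between full and partial (incremental) extraction, assuming a SQL source with an orders table that carries an updated_at column (both are placeholders):

  import sqlite3
  from datetime import datetime, timedelta, timezone

  # sqlite3 stands in here for any SQL source reachable over a DB-API connection.
  con = sqlite3.connect("source.db")

  def full_extraction():
      """Pull every row on every run: simple, but costly on large tables."""
      return con.execute("SELECT * FROM orders").fetchall()

  def partial_extraction(last_run):
      """Pull only rows changed since the previous run (requires updated_at)."""
      return con.execute(
          "SELECT * FROM orders WHERE updated_at > ?",
          (last_run.isoformat(),),
      ).fetchall()

  # Example: extract only the last 24 hours of changes.
  rows = partial_extraction(datetime.now(timezone.utc) - timedelta(hours=24))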

Add sources


In this example, we will use the Web (JS) and iOS data sources to start tracking selected events (e.g., page views, interaction events).




Next, we will connect Shopify as a source and pull in all the transaction data from your store’s database.



Create Pipelines


Raw data from the different sources needs to be transferred to the data warehouse (e.g., Redshift). After that, the BI tool (e.g., Mode Analytics) can query the data via the Redshift cluster to build the funnel models.


You will need to create three pipelines, one for each of the three data sources you have already added.


How to create pipeline #1 (Shopify)


  1. Go to Connections
  2. Select Intempt Demo Store as a source
  3. Select your warehouse as the pipeline destination


How to create pipeline #2 (Web JS)


  1. Go to Connections
  2. Select Intempt Demo Store (JS) as a source
  3. Select your warehouse as the pipeline destination


How to create pipeline #3 (iOS)


  1. Go to Connections
  2. Select Intempt iOS Demo Store as a source
  3. Select your warehouse as the pipeline destination


2. Transform


Transformation means converting the data into a format that can be used by different applications: typically, a change from the format in which the data is stored into the format required by the application that will use it.


There are multiple ways to transform data:


Data cleaning. This involves identifying irregular data and correcting or removing it. For example:

  • Removing rows with missing data;
  • Recoding missing values as NULLs or 0s;
  • Splitting a column into multiple columns, or merging multiple columns into one;
  • Converting data types to standard forms (e.g., converting DateTime and Unix timestamp values to the same data type);
  • Converting different versions of a value to a common denominator (e.g., T, 1, true to “TRUE”);
  • Using lookups to merge data;
  • Validating data (for example, an age value cannot be more than two digits).

Filtering. Choose only specific columns to load.
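
Here is a short pandas sketch of a few of these cleaning and filtering steps; the column names and sample values are made up for illustration:

  import pandas as pd

  df = pd.DataFrame({
      "order_id": ["1001", "1002", None, "1004"],
      "created_ts": [1710000000, 1710003600, 1710005400, 1710007200],
      "is_paid": ["T", "true", "1", "FALSE"],
      "amount": ["19.99", None, "7.50", "5.00"],
  })

  # Remove rows with missing identifiers.
  df = df.dropna(subset=["order_id"])

  # Recode missing numeric values as 0 and convert types to standard forms.
  df["amount"] = pd.to_numeric(df["amount"]).fillna(0)
  df["created_at"] = pd.to_datetime(df["created_ts"], unit="s", utc=True)

  # Convert different versions of a boolean to a common denominator.
  df["is_paid"] = df["is_paid"].isin(["T", "1", "true", "TRUE"])

  # Filtering: keep only the columns we actually want to load.
  df = df[["order_id", "created_at", "amount", "is_paid"]]
  print(df)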

Data enriching. Data enriching means adding new information to the collected raw data (see the sketch after this list).

  • Joining different sources. For example, consolidate customer information by joining data from a variety of purchasing apps.
  • Deduplication. Identify which information is duplicated and remove the redundant records.
  • Calculations. For example, calculate each customer’s average order value at import.
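
Continuing the pandas sketch, the enrichment steps above might look roughly like this (again with made-up data):

  import pandas as pd

  orders = pd.DataFrame({
      "order_id": ["1001", "1002", "1002", "1003"],
      "customer_id": ["c1", "c2", "c2", "c1"],
      "amount": [19.99, 5.00, 5.00, 12.50],
  })
  customers = pd.DataFrame({
      "customer_id": ["c1", "c2"],
      "email": ["a@example.com", "b@example.com"],
  })

  # Deduplication: drop redundant order rows before joining.
  orders = orders.drop_duplicates(subset=["order_id"])

  # Join different sources: consolidate customer information onto each order.
  enriched = orders.merge(customers, on="customer_id", how="left")

  # Calculations: each customer's average order value, computed at import.
  enriched["avg_order_value"] = enriched.groupby("customer_id")["amount"].transform("mean")
  print(enriched)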

In our example, we also need to apply transformations to the data tables so that the delivered data matches the schema in the data warehouse.


Example transformation:


Property rename. The input schema from Shopify has a property named “order_id.” However, the schema in the data warehouse stores the same property under a different name, “Purchase_Id.” A transformation is required so the output schema matches the schema in the data warehouse. To do this, apply a property rename transformation to the source schema, changing “order_id” to “Purchase_Id.”



Data enrichment. Data is enriched by adding new columns based on lookup tables. For example, after using a lookup table, store_number is converted to store_name, which makes the output data far more accessible.
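
Both example transformations can be sketched in a few lines of pandas; the lookup values below are invented for illustration:

  import pandas as pd

  shopify = pd.DataFrame({"order_id": ["1001", "1002"], "store_number": [7, 12]})
  store_lookup = pd.DataFrame({"store_number": [7, 12],
                               "store_name": ["Downtown", "Airport"]})

  # Property rename: match the warehouse schema, which expects "Purchase_Id".
  shopify = shopify.rename(columns={"order_id": "Purchase_Id"})

  # Data enrichment: use the lookup table to turn store_number into store_name.
  shopify = (shopify.merge(store_lookup, on="store_number", how="left")
                    .drop(columns=["store_number"]))
  print(shopify)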


Similar transformations need to be applied to the Web and iOS source pipelines. Both sources may have different event names or property values, so you need to make sure they match (e.g., an “Added to Cart” event).


3. Load


The last stage in the ETL pipeline is “load.” Now that the data is transformed, it must be stored in a database (relational database, NoSQL data store, data warehouse, or data lake), where it is ready for analysis. There are three main options for loading the data into the warehouse:


  • Initial Load — populating all the data warehouse tables.
  • Incremental Load — applying ongoing changes periodically as needed.
  • Full Refresh — removing the contents of one or more tables and reloading them with fresh data.

In our example, we will implement an initial load process.
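
As a rough sketch of what an initial load can look like under the hood, the snippet below creates the destination table and bulk-loads a staged file into Redshift with the COPY command, using psycopg2. The cluster endpoint, credentials, S3 path, and IAM role are all placeholders.

  import psycopg2

  # Placeholder connection details for the Redshift cluster.
  con = psycopg2.connect(
      host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
      port=5439,
      dbname="analytics",
      user="etl_user",
      password="********",
  )

  with con, con.cursor() as cur:
      # Initial load: create the table and populate it from a staged file in S3.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS orders (
              purchase_id VARCHAR(64),
              store_name  VARCHAR(128),
              amount      DECIMAL(10, 2),
              created_at  TIMESTAMP
          )
      """)
      cur.execute("""
          COPY orders
          FROM 's3://example-bucket/exports/orders.csv'
          IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
          CSV IGNOREHEADER 1
      """)
  con.close()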


After the data is delivered, you can use a BI tool like Mode Analytics to query the warehouse’s data and visualize the five funnel events.
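
Under the hood, the BI tool runs SQL against the warehouse. Here is a hedged sketch of the kind of funnel query Mode Analytics might execute; the events table and event names are assumptions based on the funnel defined earlier.

  # Hypothetical funnel query the BI tool could run against the warehouse.
  FUNNEL_QUERY = """
      SELECT
          COUNT(DISTINCT CASE WHEN event = 'Visited landing page'  THEN user_id END) AS landing,
          COUNT(DISTINCT CASE WHEN event = 'Visited product page'  THEN user_id END) AS product,
          COUNT(DISTINCT CASE WHEN event = 'Added an item to cart' THEN user_id END) AS cart,
          COUNT(DISTINCT CASE WHEN event = 'Opened checkout page'  THEN user_id END) AS checkout,
          COUNT(DISTINCT CASE WHEN event = 'Order completed'       THEN user_id END) AS converted
      FROM events
  """

  # Reusing the psycopg2 connection from the load sketch above:
  # with con.cursor() as cur:
  #     cur.execute(FUNNEL_QUERY)
  #     print(cur.fetchone())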


Each bar in the chart will show the number of users who made it to that stage, giving us a sense of the fall-off.


Mode Analytics will also give us a view of the individual users who fell off at each funnel stage.




Which ETL tool to choose for a Modern Data Stack?


There are many ETL tools that automate, accelerate, and take care of your ETL processes for you.


Their primary value is that they accelerate data pipeline delivery, automate ETL processes, and reduce maintenance costs (the tools take care of all changes, updates, and overall maintenance).


Intempt Platform automates your ETL process:


  • Automate data collection from third-party apps (and databases) with the Connectors.
  • Automate data cleaning and transformations with Transformations.
  • Move data between different storage systems with Pipelines.
  • Schedule your data pipeline tasks with advanced settings.


Intempt goes beyond the standard ETL process by offering advanced solutions for a variety of data needs:


  • Automate and orchestrate data pipelines by deploying them in just a couple of clicks with Intempt Pipelines.
  • Select which events and segments you want to send to your destination.

Try running data pipelines for yourself with Intempt’s free trial. Go to app.intempt.com to activate your free platform onboarding. Feel free to test the waters with the wide ecosystem Intempt offers, which extends well beyond ETL.