Travel & Hospitality
Data Cleansing for ETL pipelines
Clean data provides better insights into the product or business area and helps users understand the scenario to make informed decisions.
Determining the quality of data requires examining its characteristics, then weighing those characteristics according to what is most important to your organization and the application(s) in which the data will be used.
5 characteristics of quality data:
- Validity. The degree to which your data conforms to defined business rules or constraints.
- Accuracy. The degree to which your data is close to the true values.
- Completeness. The degree to which all required data is present.
- Consistency. The degree to which your data agrees within the same dataset and across multiple datasets.
- Uniformity. The degree to which the data is specified using the same unit of measure.
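The characteristics above can be expressed as per-record checks. Below is a minimal sketch in Python; the field names (`nights`, `price`, `currency`) and the rules themselves are hypothetical examples, not part of any specific schema.

```python
# Hypothetical quality checks for a booking record; each rule maps to one
# of the characteristics above (validity, accuracy, uniformity).
def check_record(record):
    """Return a list of quality issues found in a single booking record."""
    issues = []
    # Validity: business rule -- nights booked must be positive.
    if record.get("nights", 0) <= 0:
        issues.append("invalid nights")
    # Accuracy: price should fall within a plausible range.
    if not (0 < record.get("price", -1) < 100_000):
        issues.append("implausible price")
    # Uniformity: amounts must use the same currency unit.
    if record.get("currency") != "USD":
        issues.append("non-standard currency")
    return issues

record = {"nights": 2, "price": 350.0, "currency": "USD"}
print(check_record(record))  # -> []
```

A record that passes every rule yields an empty issue list; anything else tells you which characteristic it violates.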
You can clean your data by implementing the following steps:
Step 1: Create a pipeline
Create a pipeline and add the source for which you want to cleanse the data.
Step 2: Identify Critical Fields
First, analyze the data and identify the critical/essential fields for the given project to ensure proper and effective Data Cleansing. Then apply the necessary operators to clean the data.
Step 3: Remove Duplicates
When the data originates from multiple sources, duplication is likely. De-duplicating the data frees up storage and removes redundancy.
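De-duplication can be sketched as follows: keep the first occurrence of each record, identified by a key built from one or more fields. The key fields (`email`, `booking_id`) are hypothetical examples.

```python
# Minimal de-duplication sketch: keep the first occurrence of each record,
# keyed on a tuple of (hypothetical) identifying fields.
def deduplicate(records, key_fields=("email", "booking_id")):
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"email": "a@x.com", "booking_id": 1},
    {"email": "a@x.com", "booking_id": 1},  # duplicate from a second source
    {"email": "b@x.com", "booking_id": 2},
]
print(len(deduplicate(records)))  # -> 2
```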
Step 4: Filter Unwanted Data
Unwanted information refers to data records that are not relevant to the particular analysis. Suppose you are analyzing data for the current year, but your dataset also contains data from previous years. Removing those records makes the analysis more efficient and reduces the risk of skewed results.
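A minimal sketch of the current-year filter described above; the `date` field and its ISO format are assumptions for illustration.

```python
from datetime import date

# Keep only records whose (assumed ISO-formatted) date falls in the
# current year; older records are filtered out as unwanted data.
def filter_current_year(records, today=None):
    today = today or date.today()
    return [r for r in records if date.fromisoformat(r["date"]).year == today.year]

records = [
    {"date": "2024-03-01", "amount": 120},
    {"date": "2021-07-15", "amount": 80},   # older year, dropped
]
print(filter_current_year(records, today=date(2024, 6, 1)))
# -> [{'date': '2024-03-01', 'amount': 120}]
```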
Step 5: Handle Missing Values
There may be some missing values in the data. Those need to be handled correctly; otherwise, algorithms will not accept missing/blank data. You can drop the records containing missing values, impute specific values based on historical observations, or create a system that can handle null values.
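The first two options above, dropping and imputing, can be sketched as follows. The `spend` field and the historical default are hypothetical.

```python
HISTORICAL_AVG_SPEND = 150.0  # hypothetical default from past observations

# Option 1: drop records where the field is missing.
def drop_missing(records, field):
    return [r for r in records if r.get(field) is not None]

# Option 2: impute a default value where the field is missing.
def impute_missing(records, field, default):
    return [{**r, field: r.get(field) if r.get(field) is not None else default}
            for r in records]

records = [{"spend": 200.0}, {"spend": None}]
print(drop_missing(records, "spend"))                          # -> [{'spend': 200.0}]
print(impute_missing(records, "spend", HISTORICAL_AVG_SPEND))  # -> [{'spend': 200.0}, {'spend': 150.0}]
```

Dropping is safest when missing records are rare; imputing preserves row counts at the cost of introducing estimated values.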
Step 6: Standardize the Data
Standardize the data so that inconsistent representations of the same value can be replaced with a single canonical form. For example, a particular field may contain the values N.A., NA, Not Applicable, etc. This type of data has to be standardized so that one value appears across all rows.
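A minimal sketch of standardizing the "N.A." variants from the example above to a single canonical value; the variant set and the chosen canonical form `"N/A"` are assumptions.

```python
# Hypothetical set of variants that all mean "not applicable".
NOT_APPLICABLE_VARIANTS = {"n.a.", "na", "not applicable", "n/a"}

# Map any known variant (case- and whitespace-insensitive) to one
# canonical value; leave everything else unchanged.
def standardize(value):
    if isinstance(value, str) and value.strip().lower() in NOT_APPLICABLE_VARIANTS:
        return "N/A"
    return value

print([standardize(v) for v in ["N.A.", "NA", "Not Applicable", "Paris"]])
# -> ['N/A', 'N/A', 'N/A', 'Paris']
```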
Step 7: Set up the Process
Once you identify which data in the dataset has to be cleansed, determine the process by which the cleansing can be applied across the data, and use the Intempt Pipeline to run it.
Step 8: Analyze the results
At the end of the Data Cleansing activity, you need to perform QA on the clean data to answer the questions:
- Does the cleaned data make sense as per the requirement?
- Is it ready to feed into the algorithm?
- Is the data free from errors and unwanted rows, and are its fields standardized?
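The QA questions above can be turned into a small automated report. This is a sketch only; the required fields and the uppercase country-code rule are hypothetical.

```python
# Summarize QA checks over cleaned records: count rows, rows missing
# required fields, and rows with a non-standardized country value.
def qa_report(records, required_fields=("email", "date")):
    report = {"rows": len(records), "missing": 0, "non_standard": 0}
    for r in records:
        if any(r.get(f) is None for f in required_fields):
            report["missing"] += 1
        country = r.get("country")
        if country is not None and country != country.upper():
            report["non_standard"] += 1  # expects codes like "US", not "us"
    return report

clean = [{"email": "a@x.com", "date": "2024-01-01", "country": "US"}]
print(qa_report(clean))  # -> {'rows': 1, 'missing': 0, 'non_standard': 0}
```

A report with zero `missing` and zero `non_standard` rows suggests the data is ready to feed into downstream algorithms.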