Building an ETL Pipeline with Perl and Amazon Redshift
Creating an ETL pipeline that interacts with a data warehouse (e.g., Amazon Redshift, Google BigQuery, or Snowflake) is a common task in modern data engineering. In this blog post, we’ll walk through building an ETL pipeline in Perl that extracts data from a data warehouse, transforms it, and loads it into another data warehouse or database. For this example, we’ll use Amazon Redshift on both ends.
Overview
This ETL pipeline will:
- Extract: Fetch data from an Amazon Redshift data warehouse.
- Transform: Perform transformations on the data (e.g., cleaning, aggregations, or calculations).
- Load: Insert the transformed data into another Amazon Redshift table or a different data warehouse.
Prerequisites
- Perl: Ensure Perl is installed on your system.
- CPAN Modules: DBD::Pg for connecting to Amazon Redshift (Redshift is PostgreSQL-compatible). Install the module using CPAN:
  cpan DBD::Pg
- Amazon Redshift: Set up two Redshift clusters or databases:
  - Source Cluster: Contains the raw data.
  - Target Cluster: Stores the transformed data.
- Credentials: Ensure you have the database username, password, and endpoint for each cluster (the script connects with database credentials via DBD::Pg rather than AWS access keys). A quick connection check follows this list.
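Before wiring up the full pipeline, it can help to confirm that DBD::Pg can reach your cluster at all. Here is a minimal connection check; the endpoint, database name, and credentials are placeholders you would replace with your own.

#!/usr/bin/perl
# Minimal connectivity check for a Redshift cluster via DBD::Pg.
# Endpoint, database name, and credentials are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:Pg:dbname=source_db;host=your-source-redshift-endpoint;port=5439',
    'your_source_username',
    'your_source_password',
    { RaiseError => 1, PrintError => 0 },
);

# A trivial query to prove the round trip works.
my ($version) = $dbh->selectrow_array('SELECT version()');
print "Connected: $version\n";

$dbh->disconnect();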
Script: ETL Pipeline with Amazon Redshift
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
# Redshift Configuration
my $source_redshift_config = {
    dsn      => 'DBI:Pg:dbname=source_db;host=your-source-redshift-endpoint;port=5439',
    username => 'your_source_username',
    password => 'your_source_password',
};

my $target_redshift_config = {
    dsn      => 'DBI:Pg:dbname=target_db;host=your-target-redshift-endpoint;port=5439',
    username => 'your_target_username',
    password => 'your_target_password',
};

# Step 1: Extract - Fetch data from the source Redshift cluster
sub extract_data {
    my ($db_config) = @_;

    # Connect to the source Redshift cluster
    my $dbh = DBI->connect(
        $db_config->{dsn},
        $db_config->{username},
        $db_config->{password},
        { RaiseError => 1 },
    ) or die "Could not connect to source Redshift: $DBI::errstr";

    # Query to fetch data
    my $query = "SELECT id, name, age, salary FROM employees
                 WHERE last_updated > NOW() - INTERVAL '1 hour'";
    my $sth = $dbh->prepare($query);
    $sth->execute();

    # Fetch all rows
    my $data = $sth->fetchall_arrayref();

    # Disconnect from the source Redshift cluster
    $dbh->disconnect();

    return $data;
}

# Step 2: Transform - Perform data transformations
sub transform_data {
    my ($data) = @_;
    my @transformed_data;

    foreach my $row (@$data) {
        my ($id, $name, $age, $salary) = @$row;

        # Example transformations:
        # 1. Convert name to uppercase
        $name = uc($name);

        # 2. Calculate bonus (10% of salary)
        my $bonus = $salary * 0.10;

        # Add transformed row to the array
        push @transformed_data, [$id, $name, $age, $salary, $bonus];
    }

    return \@transformed_data;
}

# Step 3: Load - Insert transformed data into the target Redshift cluster
sub load_data {
    my ($data, $db_config) = @_;

    # Connect to the target Redshift cluster
    my $dbh = DBI->connect(
        $db_config->{dsn},
        $db_config->{username},
        $db_config->{password},
        { RaiseError => 1 },
    ) or die "Could not connect to target Redshift: $DBI::errstr";

    # Create table if it doesn't exist
    $dbh->do("
        CREATE TABLE IF NOT EXISTS transformed_employees (
            id INT PRIMARY KEY,
            name VARCHAR(255),
            age INT,
            salary DECIMAL(10, 2),
            bonus DECIMAL(10, 2)
        )
    ");

    # Prepare insert statement
    my $sth = $dbh->prepare("
        INSERT INTO transformed_employees (id, name, age, salary, bonus)
        VALUES (?, ?, ?, ?, ?)
    ");

    # Insert each row
    foreach my $row (@$data) {
        $sth->execute(@$row);
    }

    # Disconnect from the target Redshift cluster
    $dbh->disconnect();
}

# Main ETL Process
sub main {
    # Step 1: Extract
    my $extracted_data = extract_data($source_redshift_config);

    # Step 2: Transform
    my $transformed_data = transform_data($extracted_data);

    # Step 3: Load
    load_data($transformed_data, $target_redshift_config);

    print "ETL process completed successfully!\n";
}

# Run the ETL pipeline
main();
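A quick note on credentials: the configuration hashes above hard-code passwords to keep the example readable. In practice you might prefer to pull them from environment variables; here is a small sketch (the ETL_SRC_* variable names are just an illustration, not part of the script above).

# Sketch: build the source config from environment variables instead of
# hard-coded literals. The ETL_SRC_* names are only an example.
my $source_redshift_config = {
    dsn      => "DBI:Pg:dbname=$ENV{ETL_SRC_DB};host=$ENV{ETL_SRC_HOST};port=5439",
    username => $ENV{ETL_SRC_USER},
    password => $ENV{ETL_SRC_PASSWORD},
};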
Explanation
- Extract:
  - The extract_data function connects to the source Amazon Redshift cluster using the DBD::Pg module.
  - It fetches data from the employees table where records have been updated in the last hour.
- Transform:
  - The transform_data function performs transformations on the extracted data:
    - Converts names to uppercase.
    - Calculates a bonus (10% of salary) as a derived field.
- Load:
  - The load_data function connects to the target Amazon Redshift cluster and inserts the transformed data into a table called transformed_employees.
  - The table is created if it doesn’t already exist (a faster bulk-insert variant is sketched after this list).
- Main:
  - The main function orchestrates the ETL process by calling the extract, transform, and load functions in sequence.
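One practical refinement: Redshift handles row-by-row INSERTs poorly, and Amazon’s guidance favors bulk loading (COPY from Amazon S3, or at least multi-row INSERTs). The sketch below is not part of the script above; it shows how the insert loop in load_data could be collapsed into a single multi-row INSERT against the same transformed_employees table, assuming an already-open database handle.

# Sketch: load all transformed rows with one multi-row INSERT instead of
# executing the statement once per row. Assumes an open $dbh and the
# transformed_employees table from the main script.
sub load_data_bulk {
    my ($data, $dbh) = @_;
    return unless @$data;

    # One "(?, ?, ?, ?, ?)" group per row
    my $row_placeholders = '(' . join(', ', ('?') x 5) . ')';
    my $sql = 'INSERT INTO transformed_employees (id, name, age, salary, bonus) VALUES '
            . join(', ', ($row_placeholders) x scalar @$data);

    my $sth = $dbh->prepare($sql);
    $sth->execute(map { @$_ } @$data);
}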
Database Setup
- Source Redshift Cluster:
  - Table: employees (a sketch for seeding a couple of test rows follows this list)

    CREATE TABLE employees (
        id INT PRIMARY KEY,
        name VARCHAR(255),
        age INT,
        salary DECIMAL(10, 2),
        last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );

- Target Redshift Cluster:
  - Table: transformed_employees

    CREATE TABLE transformed_employees (
        id INT PRIMARY KEY,
        name VARCHAR(255),
        age INT,
        salary DECIMAL(10, 2),
        bonus DECIMAL(10, 2)
    );
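To try the pipeline end to end, the source table needs a few rows whose last_updated falls inside the one-hour extraction window. A small seeding sketch (the values mirror the sample data shown below; endpoint and credentials are placeholders):

#!/usr/bin/perl
# Sketch: seed the source employees table with a couple of test rows.
# last_updated is filled in by the column DEFAULT from the DDL above.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:Pg:dbname=source_db;host=your-source-redshift-endpoint;port=5439',
    'your_source_username', 'your_source_password',
    { RaiseError => 1 },
);

my $sth = $dbh->prepare(
    'INSERT INTO employees (id, name, age, salary) VALUES (?, ?, ?, ?)'
);
$sth->execute(1, 'John Doe',   30, 50000);
$sth->execute(2, 'Jane Smith', 25, 60000);

$dbh->disconnect();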
Running the Script
- Save the script to a file, e.g., etl_redshift.pl.
- Make the script executable:
  chmod +x etl_redshift.pl
- Run the script (a scheduling note follows this list):
  ./etl_redshift.pl
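Because the extract query only pulls rows updated in the last hour, the script is a natural fit for an hourly batch schedule. A hypothetical crontab entry (the script path and log location are assumptions):

0 * * * * /path/to/etl_redshift.pl >> /var/log/etl_redshift.log 2>&1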
Sample Data
- Source Table (employees):

  id  name        age  salary  last_updated
  1   John Doe    30   50000   2023-10-01 12:00:00
  2   Jane Smith  25   60000   2023-10-01 12:30:00

- Target Table (transformed_employees):

  id  name        age  salary  bonus
  1   JOHN DOE    30   50000   5000
  2   JANE SMITH  25   60000   6000
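The bonus column above is simply 10% of salary, exactly what transform_data computes. A tiny self-contained sanity check of that rule (Test::More ships with Perl; the transformation logic is repeated inline so the check stands alone):

#!/usr/bin/perl
# Sanity check for the transformation rules (uppercase name, 10% bonus).
# The input row mirrors the sample data above.
use strict;
use warnings;
use Test::More tests => 2;

my ($id, $name, $age, $salary) = (1, 'John Doe', 30, 50000);
my $transformed = [$id, uc($name), $age, $salary, $salary * 0.10];

is($transformed->[1], 'JOHN DOE', 'name is uppercased');
cmp_ok($transformed->[4], '==', 5000, 'bonus is 10% of salary');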
Conclusion
This blog post demonstrated how to build an ETL pipeline in Perl that interacts with Amazon Redshift, a popular data warehouse. The script extracts data from a source Redshift cluster, transforms it, and loads it into a target Redshift cluster. The same structure can be extended to other data warehouses such as Google BigQuery or Snowflake by adjusting the connection details and SQL syntax.
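For example, pointing the load step at a regular PostgreSQL database only requires changing the target configuration, since DBD::Pg speaks the PostgreSQL protocol; a warehouse such as Snowflake would instead need its own DBI driver (typically DBD::ODBC with the vendor’s ODBC driver) and possibly adjusted SQL. A hedged sketch of the PostgreSQL variant (host, database, and credentials are placeholders):

# Sketch: re-point the target at a plain PostgreSQL instance.
# Only the DSN and credentials change; load_data stays the same.
my $target_pg_config = {
    dsn      => 'DBI:Pg:dbname=analytics;host=your-postgres-host;port=5432',
    username => 'your_pg_username',
    password => 'your_pg_password',
};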
The pipeline is modular, so each stage can be tested or swapped independently, and it works just as well as a scheduled batch job as it does for more frequent incremental loads. It should serve as a solid starting point for developers working with data warehouses and ETL pipelines in Perl.