With the ever-growing amounts of data, enterprises create an increasing demand for data warehousing projects and systems for advanced analytics. ETL is their essential element. It ensures successful data integration within various databases and applications. In this ETL tools comparison, we will look at:
AWS Data Pipeline
They are among the most popular ETL tools 2019. Let's compare the pros and cons to find out the best solution for your project.
The ETL meaning is often misunderstood, due to the "simple" interpretation of its abbreviation. It stands for 3 data warehouse concepts: extract, transform, load. ETL process includes:
extracting data from different external sources
transforming it as the business model requires
loading data into the new warehouse
ETL is only a subset of data movement tasks. Ralph's Kimball book "The Data Warehouse ETL Toolkit" defines its 3 fundamental features:
data is downloaded in the suitable for analytics form
it is enriched with additional information
the system records and documents origin of data
So, the data shouldn't be just reloaded from one place to another. They are improved in the loading process. For example, an ETL developer can add new calculated or technical attributes. It's important to track how the data appeared in the database, as well as how and when it was changed.
ETL process steps
A web programmer can imagine ETL architecture as a set of 3 areas:
a data source
an intermediate area
a data receiver
A data stream is the movement of data from the source to the receiver. Each of the stages can be quite complicated. The process of creating ETL software includes different challenges:
The variety of external sources.
Unification of data according to business rules.
The frequency of updates and other specific requirements.
That's why an IT company needs to have a clear picture of structures of the source and destination applications.
What is an ETL example?
The common ETL task is to transfer data from RDBMS to another database that works better for business intelligence tools. ELT jobs are divided into 2 types:
The case with taking data from RDBMS is an example of the stream job. The data is transferred separately one by one for further processing. Otherwise, we can talk about a batch job. It means you can take a whole file, process it and save into a large file. Various ETL systems cope with these tasks in different ways.
Nowadays, the batch-only approach becomes a relic of time. The growing number of streaming data sources causes the popularity of ETL tools for stream job. They make the most recent data available as quickly as possible.
The variety of common and cloud-based data integration tools makes the choice really difficult. So, I prepared the list of 5 ETL solutions that are reliable in my experience.
Perfect implementation of dataflow programming concept
The opportunity to handle binary data
Lack of live monitoring and per-record statistics
The first in the list of the best ETL tools is an open source project Apache NiFi. Developed by Apache Software Foundation, it is based on the concept of Dataflow Programming. It means this ETL tool allows visually assemble programs from boxes and run them almost without coding. So, you don't have to know any programming languages.
One of the most popular open source ETL tools is capable to work with different sources. For example, RabbitMQ, JDBC query, Hadoop, MQTT, UDP socket, etc. As for the actions, there is an opportunity to filter, adjust, join, split, enhance, verify data.
Apache NiFi is written in Java and distributed under the Apache 2.0 license. It runs on a JVM and supports all JVM languages.
This ETL tool helps to create long-running jobs and suits for processing both streaming data and periodic batches. As for manually managed jobs, they are also possible. However, there is a risk to face difficulties while setting them up.
Thanks to the well-rounded architecture, Apache NiFi is considered as one of the best open source ETL tools. It's a powerful and easy-to-use solution. FlowFile includes meta-information. So, the tool's possibilities aren't limited to CSV. You can work with photos, videos, audio files, or binary data.
The processors include 3 outputs:
Failure means there are problems with appropriate processing of FlowFile.
Original shows that an incoming FlowFile has been processed.
Success denotes that the processing of FlowFiles was finished.
If you want to drop terminated outputs, you can use special checkboxes. You should pay attention to the Process Groups. They are necessary for combining elements of a complex dataflow in advanced ETL programming.
Another great feature is the possibility of using various queue policies (FIFO, LIFO, and others). Data Provenance is a connected service that records almost everything in your dataflows. It's very convenient, because you can see how the data was saved or performed. The only drawback is that the function requires lots of disk space.
Some users complain about the interface of Apache NiFi. Actually, it isn't impressive, but the usability is good enough. The UI has a clear, laconic design without extra elements. The only exception is the lack of automatic adjustment of text fields for long SQL queries. You should do them manually.
There is also an in-built Node cluster. You can pick up several instances and it will pull out the necessary ETL data. Apache NiFi includes back pressure. It is needed for fast connecting to MySQL, getting the file and adding it to the next processor.
To sum up, Apache NiFi is a nice alternative to other mainstream ETL tools. Its main advantage is more than 100 different embedded processors. They provide an opportunity to download files via HTTP or S3, Google Data Source and upload them to MySQL or other data receivers. You need just to configure the UI, press the RUN button and if everything is right, it will work.
Each processor has individual per-record statistics with nice visualization for effective debugging
Attractive user interface
Good tool for streaming or record-based data
The absence of reusable JDBC configuration
Changing a setting of one processor requires stopping of the whole dataflow
Apache StreamSets is a strong competitor of Apache NiFi. It's difficult to say which on these free ETL tools is better.
All data that you put into Streamsets automatically converts into exchangeable records. The common format is designed for smooth streaming. Unlike Apache Nifi, this ETL tool doesn't show queues between processors. If you want to use different formats, Apache Nifi requires turning from one version of the processor to another. Streamsets allows avoiding these manipulations. Instead of stopping only one processor, you need to stop the whole dataflow to change the settings.
It seems that fixing bugs is more difficult at Streamsets. In fact, it's easier due to the real-time debugging tool. Thanks to the user-friendly interface with live dashboard and all needed statistics, you can notice and fix all errors in time. Moreover, there is an opportunity to put record filters on the connections between processors to check suspicious records. There are 4 variations of the processor:
Origin processor receives information from data sources.
Processors that get and transform the received data.
Destinations put transformed data to the external files.
Executors process actions, completed by other processors.
Streamsets processors can generate actions and events, including bugs. In order to track and fix them, you need executors. Some users prefer Apache NiFi, because its design is simpler. All you need are the Processors and Controller Services. However, Streamsets also has well-thought architecture, which isn't difficult to get used to. And the UI also looks better.
I felt the lack of Controller Services, which is quite important for JDBC settings. Adjusting all JDBC settings for each processor is really annoying.
Streamsets checks all processors before you can run the dataflow. This feature seems quite useful. In my experience, it's a controversial thing. Streamsets doesn't allow you to leave disconnected processors for fixing bugs in the future. All of them must be connected before the dataflow starts. As for other cons, I felt a lack of possibility to choose more than 1 processor at once. Moving many processors and reorganizing them one-by-one takes too much time and efforts.
All in all, it's a mature, open source ETL tool with convenient visual data flow and modern web interface. I recommend you to try Streamsets and Apache NiFi to find out which of them is the most suitable for your purposes.
The modern platform for designing, creating and tracking workflows is an open source ETL software. It can be used with cloud services, including, GCP, Azure, and AWS. There is an opportunity to run Airflow on Kubernetes using Astronomer Enterprise.
You can code on Python, but not engage in XML or drag-and-drop GUIs. The workflows are written in Python, however the steps themselves can be done in anything you want. Airflow was created as a perfectly flexible task scheduler. One of the top ETL tools is suitable for lots of different purposes. It is used to train ML models, send notifications, track systems and power functions within various APIs.
The main facts about the platform:
Airflow-as-a-Service is available from Qubole and astronomer.io.
It is created by Airbnb in 2015 and transitioned to Apache in 2016.
The basis for Google's Cloud Composer (beta summer 2018).
Workflows are performed as directed acyclic graphs (DAGs).
Apache Airflow was designed according to 4 fundamental bases. The software developers aimed to create a dynamic, extensible, elegant and scalable solution. So, it provides a possibility of dynamic pipeline generation through coding on Python. You can also define your own operators, executors and extend the library according to the needed level of abstraction. The pipelines are clear and accurate because parameterizing is included into the core of the platform. Thanks to the modular design with a message queue, Airflow can be easily scaled.
Apache Airflow is suitable for most of the everyday tasks (running ETL jobs and ML pipelines, delivering data and completing DB backups). However, it's a bad choice for stream jobs.
The platform has a modern UI that is full of visualization elements. You can see all the running pipelines, track progress, and fix bugs. This helps complete complex tasks on DAGs.
As for workflows, they are constant and stable. The structure is just a little bit more dynamic than an ETL database. If you define workflows as code, they will be more collaborative, versionable, testable and maintainable.
The platform runs on a private Kubernetes cluster. It also includes resource management tools and analytics (StatsD, Prometheus, Grafana).
What about the ETL testing of Airflow workflows? You can use:
end-to-end tests (in some cases)
The first type is suitable for checking DAG loading, Python Operator functions, custom Operators and Bash/EMR scripts. The platform doesn't require any original configurations. The only thing that should be changed is the DB connection string. You need to create an empty database, and give the user permission to CREATE/ALTER. So, an airflow command will handle all the rest.
To conclude, Apache Airflow is a free, independent framework written on Python. It a good example of open source ETL tools. Airflow can be challenging for running alone, so you should use different operators.
The web service ensures processing and moving data between an AWS compute and various data sources. It provides permanent access to the stored data, as well as its transformation. The final results can be transferred to AWS services. They are Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon S3. This ETL tool simplifies the process of creating complex data processing workloads. It helps to achieve repeatable, highly available and reliable case-load.
AWS Data Pipeline gives a possibility to move and process data that was previously locked up in on-premises data silos. Amazon asserts that its ETL tool has 6 heavy advantages:
AWS Data Pipeline is a reliable service that automatically retries the active processes in case of any failures. You will also receive notifications via Amazon SNS. They can be set for successful runs, delays, or failures.
The drag-and-drop console allows fast and simple designing of pipelines. The built-in preconditions prevent you from writing any extra logic to use them. The web developer enjoys various popular features. I mean scheduling, dependency tracking and issues handling. The service's flexible design allows smooth processing of numerous files.
This product isn't expensive compared to other ETL tools. AWS Data Pipeline is an orchestration serverless service and you pay only for what you use. Moreover, there is a free trial version for new users. It's a transparent solution. The user receives full information on the pipelines and has complete control over the computational resources.
Finally, I especially recommend this ETL tool for performing pipe jobs. I use it on my current project for transferring data. Although AWS Data Pipeline doesn't have many built-in functions, it provides convenient UI. It can spawn instances and ensure cascading file management. I like this simple, inexpensive and useful tool with built-in processors that allow you to do everything via UI.
The code-based, serverless ETL alternative to traditional drag-and-drop platforms is effective, but an ambitious solution. AWS Glue allows creating and running an ETL job in the AWS Management Console. The service takes data and metadata from AWS, puts it in the catalog and makes it searchable, queryable, and available for ETL. The process includes 3 steps:
classifying data through building catalog (JSON, CSV, Parquet and many other formats are available)
generating ETL code and editing transformations (written on Scala or Python)
scheduling and running ETL jobs
Amazon distinguishes 3 main benefits of this ETL tool.
1. Convenience Having tight integration with numerous AWS services and engines, this tool is simple for those who already use Amazon products. The drawback is that you can't implement it on-premise or in any other cloud environment.
2. Profitable The serverless solution means you don't need to provision or manage infrastructure. So, the cost depends on the measure of "Data Processing Units." You pay only for the jobs that are running.
3. Powerful The automatization of creating, maintaining, and running ETL jobs is perfect. On the other hand, the service causes a lot of manual work too.
Apache Spark is used as a base of ETL logic. However, you may notice significant differences from an ordinary Spark. The service has "dynamic frame" with specific Glue methods, while Spark uses "data frame".
AWS Glue is a modern and strong part of the AWS ecosystem. But you should be mindful of its nuances. The service provides a level of abstraction in which you must identify tables. They represent your CSV files. There is a lot of manual work here, but in the end it will generate the code for Spark and launch it. You can download this code in Scala or Python and change it as you want. It's suitable for a wide range of data sources, but the service forces you to choose a specific solution. If you want to try another way, you may not be able to do that.
How to select the right ETL tool?
InfoWorld asserts that ETL causes the largest costs in building data warehousing systems. It's the bottleneck that requires special attention of any web developer. A correct ETL implementation is your chance to optimize costs and speed-up work. Choosing the best of the top ETL tools, consider 5 criteria:
the complexity of your system
your data requirements
costs of ETL technologies
special business needs
I hope this review was helpful for you in chasing for perfect ETL solution. If you want to learn more information on web development, subscribe to FreshCode blog. You can also get a free tech consultation on your project.