September 21, 2022 • 4 min read

Data migration: Thinking about using AWS Data Pipeline?

Written by Wiem Trigui

At Sicara, we are always encountering challenging tasks while manipulating data. One of the important tasks assigned to me was migrating data across AWS accounts. The target of this mission was duplicating DynamoDB tables from a production environment into a testing environment in order to observe how the application behaves with real data.

AWS solution: AWS Data Pipeline

One solution proposed by AWS to migrate DynamoDB data from one account to another is to use AWS Data Pipeline. A complete guide exists to perform the migration. The solution consists of creating a Data Pipeline in the source account that dumps data from DynamoDB tables to an S3 bucket in the destination account, and another Data Pipeline in the destination account that loads the S3 objects back into DynamoDB tables.

AWS solution to migrate DynamoDB data across AWS accounts

What is AWS Data Pipeline and how does it work?

AWS Data Pipeline is a web service that configures a pipeline to help you transform and move data between different AWS Compute and Storage services. The pipeline can be executed manually or automatically at specified intervals.

It consists of the following components:

  • A pipeline definition, where you describe the business logic of your data management to AWS Data Pipeline: the names, formats and locations of your data sources, the activities that transform them, the schedule of these activities, the resources that run them, the preconditions needed to run them...
  • A pipeline, which schedules and runs tasks by creating Amazon EC2 instances to perform the defined work activities. You upload your pipeline definition to the pipeline and then activate it. To change a data source, you have to deactivate the pipeline, make the modification and then activate it again; if you only need to edit the pipeline itself, you don't have to deactivate it (a minimal boto3 sketch of this lifecycle follows the list).
  • A task runner, which polls AWS Data Pipeline for the tasks described in the pipeline definition and performs them. It is installed and runs automatically on the resources created by pipeline definitions. AWS Data Pipeline provides a Task Runner application, but you can also write a custom one.
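As an illustration of this create / define / activate lifecycle, here is a minimal sketch using boto3. The pipeline name, unique ID, region and the near-empty definition are hypothetical; a real definition would also contain the EMR cluster, data nodes and activities described below.

```python
import boto3

# Assumption: credentials and region are configured for the account that owns the pipeline.
client = boto3.client("datapipeline", region_name="eu-west-1")

# 1. Create an empty pipeline (name and uniqueId are hypothetical).
pipeline_id = client.create_pipeline(
    name="dynamodb-export-pipeline",
    uniqueId="dynamodb-export-pipeline-v1",
)["pipelineId"]

# 2. Upload a pipeline definition: a list of objects, each made of key/value fields.
#    "Default" is the configuration object every pipeline definition starts from.
client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ONDEMAND"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                # Default IAM roles; the role names depend on your account setup.
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
        # ... EmrCluster, data nodes and activities would be added here.
    ],
)

# 3. Activate the pipeline; the task runner then picks up and executes the work.
client.activate_pipeline(pipelineId=pipeline_id)

# To change a data source later: deactivate, update the definition, then reactivate.
client.deactivate_pipeline(pipelineId=pipeline_id)
```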

To migrate DynamoDB data between different AWS accounts, we configured the Data Pipeline to use an EMR cluster. When the Data Pipeline runs, it launches EC2 virtual machines, installs a Hadoop cluster on them, and executes a job on the cluster that dumps data from one AWS resource to another. This work is done in parallel by many workers. Finally, a notification with the execution status is sent to the user.

AWS Data Pipeline Configuration to dump data from DynamoDB to S3 bucket
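To give an idea of what such a configuration looks like, here is a hedged sketch of the main pipeline objects behind a single-table export, expressed in the boto3 pipelineObjects format. The object IDs follow the naming of the AWS "Export DynamoDB table to S3" template; the table name, bucket path, instance types and the jar path of the export step are illustrative and may differ by region and version.

```python
# Hedged sketch of the pipeline objects behind a DynamoDB -> S3 export.
# Names, paths, instance types and throughput values are illustrative.
export_step = (
    "s3://dynamodb-emr-eu-west-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,"
    "org.apache.hadoop.dynamodb.tools.DynamoDbExport,"
    "#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
)

pipeline_objects = [
    {
        "id": "EmrClusterForBackup",
        "name": "EmrClusterForBackup",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            {"key": "masterInstanceType", "stringValue": "m3.xlarge"},
            {"key": "coreInstanceType", "stringValue": "m3.xlarge"},
            {"key": "coreInstanceCount", "stringValue": "1"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    },
    {
        "id": "DDBSourceTable",
        "name": "DDBSourceTable",
        "fields": [
            {"key": "type", "stringValue": "DynamoDBDataNode"},
            {"key": "tableName", "stringValue": "my-production-table"},
            {"key": "readThroughputPercent", "stringValue": "0.25"},
        ],
    },
    {
        "id": "S3BackupLocation",
        "name": "S3BackupLocation",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            # Bucket in the destination account; it must grant write access
            # to the source account.
            {"key": "directoryPath", "stringValue": "s3://destination-account-bucket/backups/#{@scheduledStartTime}"},
        ],
    },
    {
        "id": "TableBackupActivity",
        "name": "TableBackupActivity",
        "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EmrClusterForBackup"},
            {"key": "input", "refValue": "DDBSourceTable"},
            {"key": "output", "refValue": "S3BackupLocation"},
            {"key": "step", "stringValue": export_step},
        ],
    },
]
```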

Feedback on using AWS Data Pipeline

The solution proposed by AWS seems quite simple and easy to use. Unfortunately, it hides many difficulties that make customising it to our needs very complicated: retrieving the relevant information from the documentation is hard, and the logs are not always available.

Terraforming a Data Pipeline?

Terraforming the Data Pipeline was a necessary step in our case since, by policy, we don't use the web console for deployments in the production environment. After hours of scouring Google search results, it was clear to us that terraforming a Data Pipeline definition is still impossible. An aws_datapipeline_pipeline resource exists in the Terraform AWS provider, but it only describes the name of the resource and the optional tags attached to it. A manual step is still required to specify the pipeline definition.

The alternative was to create the Data Pipeline in the testing environment, verify that it worked correctly, then export the Data Pipeline architecture in JSON format and import it into the production environment. Hence, we were obliged to perform manual operations in the production environment, even though it was something we wanted to avoid.
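That export/import can also be scripted with boto3 rather than done through the console. Here is a minimal sketch, assuming you have credentials for both environments; the profile names and pipeline ID are hypothetical, and account-specific values in the exported definition (roles, subnets, bucket names) would still need adjusting.

```python
import boto3

# Hypothetical profile names for the two environments.
testing = boto3.Session(profile_name="testing").client("datapipeline")
production = boto3.Session(profile_name="production").client("datapipeline")

# Export the definition that was validated in the testing environment.
definition = testing.get_pipeline_definition(pipelineId="df-TESTING-PIPELINE-ID")

# Re-create the pipeline in production and import the same definition.
prod_pipeline_id = production.create_pipeline(
    name="dynamodb-export-pipeline",
    uniqueId="dynamodb-export-pipeline-prod-v1",
)["pipelineId"]

production.put_pipeline_definition(
    pipelineId=prod_pipeline_id,
    pipelineObjects=definition["pipelineObjects"],
    parameterObjects=definition.get("parameterObjects", []),
    parameterValues=definition.get("parameterValues", []),
)

production.activate_pipeline(pipelineId=prod_pipeline_id)
```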

AWS Data Pipeline logging system

AWS allows saving the pipeline log files to a persistent location and provides a guide to finding them. We therefore configured our pipeline definition to write log files to an S3 bucket, which helped us a lot in understanding the causes when the pipeline failed.

However, this wasn't always the case. When we first created an AWS Data Pipeline in the testing environment and activated it, the pipeline was cancelled. We tried to figure out the problem, but no logs were available: the log files were not even created, and no information was given to explain the problem. This was really painful since we couldn't understand the reason for the cancellation, and no solution proposed on the Internet was helpful. After many hypotheses and attempts, we decided to contact the AWS support team, who explained to us that the EMR cluster could not be launched successfully because no subnet was specified in the Data Pipeline.
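For reference, both points end up as fields in the pipeline definition: a pipelineLogUri on the Default object for the S3 logs, and a subnetId on the EmrCluster object so the cluster can actually launch. A minimal, hedged sketch (bucket and subnet IDs are illustrative):

```python
# Fields added to the pipeline definition (values are illustrative).
default_fields = [
    # Persist pipeline logs to S3 so failures can be investigated.
    {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs-bucket/datapipeline/"},
]

emr_cluster_fields = [
    # Without a subnet, the EMR cluster could not be launched and the
    # pipeline was cancelled with no logs at all.
    {"key": "subnetId", "stringValue": "subnet-0123456789abcdef0"},
]
```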

Troubleshooting when the pipeline failed was also a pain point. It is really time-consuming, since it took us about 15 minutes just waiting for the runners to begin the backup.

Dumping data from multiple DynamoDB tables

The solution given by AWS supports migrating only one table. But creating an AWS Data Pipeline resource per table was unreasonable in our case, since we had 14 DynamoDB tables to migrate.

Another alternative was finally found, which consisted of modifying the Data Pipeline architecture by adding an additional EmrActivity for each table. The pipeline thus creates a single EMR cluster and assigns all the activities to it. We only have to activate the pipeline once, and the EMR cluster runs the 14 activities in parallel.

This solution wasn't ideal either, since we had to create all the activities, their steps and the data sources they manipulate by hand. But at least it saved us from creating many Data Pipelines and activating them several times (a sketch of how such per-table objects could be generated is shown after the diagram below).

Data Pipeline Architecture to dump 14 DynamoDB tables to an S3 bucket
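As an illustration, the per-table objects could also be generated programmatically instead of being clicked together in the console. Here is a hedged sketch that builds one DynamoDBDataNode, one S3DataNode and one EmrActivity per table, all running on the same EMR cluster; the table names, bucket and export step are illustrative, and "EmrClusterForBackup" refers to the cluster object from the earlier sketch.

```python
# Hedged sketch: build one input node, one output node and one EmrActivity
# per table, all attached to the same EmrCluster ("EmrClusterForBackup").
TABLES = ["users", "orders", "payments"]  # ... up to the 14 tables to migrate

export_step = (  # same DynamoDbExport step as in the earlier sketch
    "s3://dynamodb-emr-eu-west-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,"
    "org.apache.hadoop.dynamodb.tools.DynamoDbExport,"
    "#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
)

def field(key, value, ref=False):
    """Build one key/value field in the Data Pipeline object format."""
    return {"key": key, ("refValue" if ref else "stringValue"): value}

per_table_objects = []
for table in TABLES:
    per_table_objects += [
        {
            "id": f"DDBSource-{table}",
            "name": f"DDBSource-{table}",
            "fields": [
                field("type", "DynamoDBDataNode"),
                field("tableName", table),
                field("readThroughputPercent", "0.25"),
            ],
        },
        {
            "id": f"S3Backup-{table}",
            "name": f"S3Backup-{table}",
            "fields": [
                field("type", "S3DataNode"),
                field("directoryPath", f"s3://destination-account-bucket/backups/{table}"),
            ],
        },
        {
            "id": f"Backup-{table}",
            "name": f"Backup-{table}",
            "fields": [
                field("type", "EmrActivity"),
                field("runsOn", "EmrClusterForBackup", ref=True),
                field("input", f"DDBSource-{table}", ref=True),
                field("output", f"S3Backup-{table}", ref=True),
                field("step", export_step),
            ],
        },
    ]
```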

Conclusion

Using AWS Data Pipeline to migrate DynamoDB data across AWS accounts was not an easy task since it hides many difficulties:

  • The approach is complicated to set up and is not documented well enough.
  • The job is very slow to start up.
  • The logs are not always available.

People who struggled while using AWS Data Pipeline have proposed other options, such as coding their own solution with AWS Lambda or using AWS Glue.

Are you looking for data migration experts? Don’t hesitate to contact us!

This article was written by Wiem Trigui.