One of the overarching design principles of the AWS Well-Architected Framework places strong focus on iteration: your architecture will continuously evolve as you review what you have built, and as new services and features become available. It is one of the major reasons that Infrastructure as Code (IaC) and CI/CD pipelines are key for deployments.

Well, a new service fitting that principle was announced not too long ago: AWS DataSync, a service that simplifies, automates, and accelerates data transfer between on-premises storage systems and AWS storage services.

The DataSync architecture reminded me of an architecture (see the very high-level depiction in Figure 1 below) that I designed for a customer a while ago. It was not a complex architecture, but it processed a very critical workload.

The client was looking to move files from on-premises to AWS, where further processing would then be carried out. Speed of transfer was important, as delayed processing could potentially have resulted in penalties. The data was highly sensitive and subject to regulation, so security was a huge consideration.

Another challenge was the handling of deltas: for a long list of reasons, we did not want to transfer full files unnecessarily.

At the time, we were not aware of an AWS service that could solve the above challenges while also taking care of:

  1. encryption
  2. data integrity validation
  3. the transfer of only the data that had changed

Figure 1: Old Architecture (pre-AWS DataSync)

The following characteristics of AWS DataSync address the challenges detailed above:

  • Delta file transfer: files containing only the data that has changed can be transferred without changes to the application. This frees up engineers' time and minimizes errors during file creation.
  • Data protection: getting buy-in from the owners of sensitive data, which previously proved challenging (as we experienced during this project), is now easier thanks to the available encryption mechanisms and the optional data integrity validation.
  • Rapid transfer of data: this is significant because of the penalties that could potentially have been imposed if files were processed late.

The DataSync architecture appended below (Figure 2) illustrates all the components of DataSync (i.e. Agent, Location, Task, and Task Execution; read more on these here) that now make it easier to move data between on-premises storage and AWS.

Figure 2: AWS DataSync Architecture
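To make those components concrete, here is a minimal boto3 sketch of how they fit together. The hostnames, bucket, role ARN, activation key, and Region are placeholders of my own, and error handling is omitted:

```python
import boto3

datasync = boto3.client("datasync", region_name="eu-west-1")

# Agent: activate the on-premises agent (the activation key is
# obtained from the agent VM after it has been deployed).
agent = datasync.create_agent(
    ActivationKey="ACTIVATION-KEY-FROM-AGENT",
    AgentName="onprem-agent",
)

# Location (source): an NFS share on the on-premises storage system,
# reached through the agent.
source = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",
    Subdirectory="/exports/outbound",
    OnPremConfig={"AgentArns": [agent["AgentArn"]]},
)

# Location (destination): an S3 bucket, accessed via an IAM role
# that DataSync assumes.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-landing-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Task: ties the source and destination locations together.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="onprem-to-s3",
)

# Task Execution: an individual run of the task.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```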

I have also included a diagram of the DataSync Task Execution Transition Phases (Figure 3) as I will be referring to these phases later.

Figure 3: Task Execution Transition Phases (individual run of a task)

The following components of AWS DataSync address the challenges detailed above:

  • Transfer of only the data that has changed: the PREPARING and TRANSFERRING phases of the Task Execution component take care of this (see Figure 3, and the status-polling sketch after Figure 4 below). Files are scanned for differences, and only data that has been changed, added, or deleted is transferred.
  • Encryption of data: the Shared Responsibility Model of course always applies when it comes to securing data in and for the cloud. The use of TLS for data in transit, integration with AWS IAM for access management, and support for S3-managed encryption keys (SSE-S3), among others, helps give the data owners assurance of data protection.
  • Data integrity validation: the optional data integrity validation (see Figure 4 below) further helps ensure that data is not tampered with or lost during the migration process.
  • Rapid transfer of data: data can be transferred over the network at a rate of up to 10 Gbps.

Figure 4: Task Configuration Settings
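To illustrate both the transition phases and the verification option, here is a minimal boto3 sketch that starts a task execution with integrity verification enabled and polls it through the phases of Figure 3. It assumes a task already exists; the task ARN is a placeholder, and the VerifyMode override is one way of setting the Figure 4 option via the API:

```python
import time
import boto3

datasync = boto3.client("datasync", region_name="eu-west-1")

# Start a run with data integrity verification enabled; VerifyMode
# is the API counterpart of the console setting shown in Figure 4.
execution = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:eu-west-1:111122223333:task/task-0123456789abcdef0",
    OverrideOptions={"VerifyMode": "POINT_IN_TIME_CONSISTENT"},
)

# Poll the execution as it moves through the transition phases of
# Figure 3: QUEUED -> LAUNCHING -> PREPARING -> TRANSFERRING ->
# VERIFYING -> SUCCESS (or ERROR).
while True:
    status = datasync.describe_task_execution(
        TaskExecutionArn=execution["TaskExecutionArn"]
    )["Status"]
    print(status)
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(30)
```

Note that the VERIFYING phase only runs when verification is enabled, which is why it is shown as optional in the task configuration.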

The DataSync service is available in most AWS Regions and can be accessed via the Management Console. DataSync can also be configured programmatically using the AWS CLI and/or the DataSync API.
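As a quick taste of the programmatic route, a minimal boto3 sketch (Region and credentials assumed to be configured) that lists the tasks defined in a Region via the DataSync API:

```python
import boto3

datasync = boto3.client("datasync", region_name="eu-west-1")

# Enumerate the DataSync tasks configured in this Region.
for task in datasync.list_tasks()["Tasks"]:
    print(task["Name"], task["Status"], task["TaskArn"])
```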

References:

https://docs.aws.amazon.com/datasync/latest/userguide/sync-dg.pdf#iam