Reproducible Assets for CloudFormation Stacks

When CloudFormation is used to create infrastructure or deploy an application, it is important to look at the artifacts that are generated and deployed. Artifacts, and Java artifacts (jar or zip files) in particular, have to meet some important requirements. Otherwise, the deployment might take longer than necessary or, when a CDK Pipeline is used to deploy the application, the pipeline can enter an infinite loop, updating itself over and over again.

When a CDK stack is created or updated via cdk deploy, all parts of the application, such as CloudFormation templates, images, scripts and Java code, are packaged as assets in the cloud assembly directory (usually cdk.out), published to S3 and used by the CloudFormation service during deployment. An important property of each asset is its hash. In this case the hash is a SHA-256 hash of the content of each file. It is used in the generated template and also in the filename of the asset. For example, the Java code for a Lambda function may become a file called asset.2789342b793c2d260717ac962952ef1ec03511f8a355f6abb9a5bfcd32bee712.jar. You can find these files in the cdk.out folder after the stack has been synthesized (cdk synth or cdk deploy).
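To make the relationship between content and file name concrete, here is a minimal Java sketch that hashes a byte array with SHA-256 and builds a name in the asset.<hash>.<extension> pattern described above. The class and method names (AssetHashSketch, sha256Hex, assetFileName) are made up for illustration. This is not the CDK's actual implementation, only an approximation of the scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the CDK: approximates the naming scheme only.
final class AssetHashSketch {

    // Returns the lowercase hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds a file name following the asset.<hash>.<extension> pattern.
    static String assetFileName(byte[] content, String extension) {
        return "asset." + sha256Hex(content) + "." + extension;
    }

    public static void main(String[] args) {
        byte[] first = "archive bytes, build 1".getBytes(StandardCharsets.UTF_8);
        byte[] second = "archive bytes, build 2".getBytes(StandardCharsets.UTF_8);
        // A single changed byte yields a completely different file name.
        System.out.println(assetFileName(first, "jar"));
        System.out.println(assetFileName(second, "jar"));
    }
}
```

Because the name is derived from the bytes alone, two archives only share a name if they are byte-for-byte identical.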

When it comes to hashing, keep in mind that any change to a file also alters its hash. For archive files (jar, zip, tar etc.) the hash is affected not only by the contents of the files in the archive (e.g. compiled Java classes). More importantly, the order of the files in the archive, their file permissions and their timestamps also influence the hash:

Jar contents:

 0 Fri Mar 24 09:09:32 CET 2023 META-INF/
64 Fri Mar 24 09:09:32 CET 2023 META-INF/MANIFEST.MF
 2 Fri Mar 24 09:09:14 CET 2023 MyClass1.class
 2 Fri Mar 24 09:09:18 CET 2023 MyClass2.class

Hash: 03fb71a84bd307794ef2737274cfc41f4c080e0a4d13827e5e8a95a853738626

Jar contents:

 0 Fri Mar 24 09:11:22 CET 2023 META-INF/
64 Fri Mar 24 09:11:22 CET 2023 META-INF/MANIFEST.MF
 2 Fri Mar 24 09:10:32 CET 2023 MyClass1.class
 2 Fri Mar 24 09:10:34 CET 2023 MyClass2.class

Hash: b15559db5309131b4ab0863b948b36ef3be358391d2014788ee9936ea3b8aec9

Even though the contents of the files in both archives are exactly the same, the hash of the archive itself differs because the file timestamps differ.
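The effect can be reproduced with the JDK alone. The following sketch (all names are hypothetical) writes the same bytes into two in-memory zip archives that differ only in the entry timestamp and compares their SHA-256 hashes. It assumes both timestamps are two minutes apart, which is well beyond the two-second resolution of zip timestamps.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: same content, different entry timestamps.
final class TimestampHashDemo {

    // Creates a zip with a single entry whose modification time is set explicitly.
    static byte[] zipWithTimestamp(long epochMillis) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry("MyClass1.class");
            entry.setTime(epochMillis);
            zip.putNextEntry(entry);
            zip.write("identical content".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns the lowercase hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long firstBuild = 1_679_645_372_000L;       // some moment in March 2023
        long secondBuild = firstBuild + 120_000L;   // two minutes later
        System.out.println(sha256Hex(zipWithTimestamp(firstBuild)));
        System.out.println(sha256Hex(zipWithTimestamp(secondBuild)));
    }
}
```

The two printed hashes differ even though the entry name and content are the same in both archives.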

The result is that CDK/CloudFormation will treat this asset as updated/changed and re-deploy it, even if the actual implementation did not change.

For a Lambda function this may not be an issue. It only takes some extra time to upload and re-deploy the archive.

However, when it comes to a CDK pipeline, this leads to an infinite loop, because the assets of the pipeline stack change each time the pipeline is synthesized:

  • Pipeline is synthesized
  • Asset hashes changed
  • Pipeline is updated
  • Pipeline restarts, because it has changed
  • Pipeline is synthesized
  • Asset hashes changed
  • Pipeline is updated
  • Pipeline restarts, because it has changed

Create Reproducible Archives

The previous section outlined why it is important that the content, and therefore the hash, of a Java archive (jar or zip) only changes when the actual implementation has changed. To achieve this, the following measures can be taken:

  1. always use the same timestamp for all files
  2. preserve the file order in the archive
  3. prevent code generators from generating changing content

When using Gradle to build the project, the first two requirements (fixed timestamp and file order) are achieved by adding the following configuration to the build.gradle.kts script:

tasks.withType<Jar> {
    isPreserveFileTimestamps = false
    isReproducibleFileOrder = true
}

The Gradle documentation gives a more detailed explanation of this topic. For Maven, there is also an article about Configuring for Reproducible Builds in the documentation.
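What a fixed timestamp and a stable file order achieve can be illustrated in plain Java, independent of any build tool. The sketch below is an assumption-laden illustration, not how Gradle or Maven implement it: the class name, the method name and the chosen constant date are all made up. Entries are sorted by name and every entry gets the same local timestamp, so the resulting bytes no longer depend on the input order or on the build time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of a reproducible archive writer.
final class ReproducibleZipSketch {

    // Arbitrary constant chosen for this example; any fixed date works.
    private static final LocalDateTime FIXED_TIME = LocalDateTime.of(1980, 2, 1, 0, 0);

    static byte[] reproducibleZip(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            // TreeMap sorts the entries by name, so the input order is irrelevant.
            for (Map.Entry<String, byte[]> file : new TreeMap<>(files).entrySet()) {
                ZipEntry entry = new ZipEntry(file.getKey());
                // A local date-time avoids a dependency on the machine's time zone.
                entry.setTimeLocal(FIXED_TIME);
                zip.putNextEntry(entry);
                zip.write(file.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] classA = "class A".getBytes(StandardCharsets.UTF_8);
        byte[] classB = "class B".getBytes(StandardCharsets.UTF_8);

        Map<String, byte[]> oneOrder = new LinkedHashMap<>();
        oneOrder.put("A.class", classA);
        oneOrder.put("B.class", classB);

        Map<String, byte[]> otherOrder = new LinkedHashMap<>();
        otherOrder.put("B.class", classB);
        otherOrder.put("A.class", classA);

        boolean identical = java.util.Arrays.equals(
                reproducibleZip(oneOrder), reproducibleZip(otherOrder));
        System.out.println("Archives identical: " + identical);
    }
}
```

Identical bytes imply an identical hash, so an asset built this way only gets a new name when the contained files actually change. File permissions, which java.util.zip does not write, are not covered by this sketch.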

The last requirement depends on the generator that is being used. For example, MapStruct includes a timestamp in each generated class by default:

@Generated(
    value = "org.mapstruct.ap.MappingProcessor",
    date = "2017-07-04T12:08:56.235-0700",
    comments = "version: 1.5.3.Final, compiler: IncrementalProcessingEnvironment from gradle-language-java-8.0.2.jar, environment: Java 11.0.17 (Amazon.com Inc.)"
)
public class MyFancyMapperImpl implements MyFancyMapper {
    //...
}

Since the date attribute changes each time the mapper implementation is generated (gradle clean build), the hash of the resulting archive also changes, even if the mapper implementation is otherwise exactly the same.

To prevent this in the case of MapStruct you need to add the following configuration:

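import org.mapstruct.MapperConfig;

// Shared configuration: omit the date attribute from @Generated annotations
@MapperConfig(suppressTimestampInGenerated = true)
public interface GeneralMapperConfig {
}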

Then add this configuration to all mapper interfaces or classes:

import org.mapstruct.Mapper;

@Mapper(
        config = GeneralMapperConfig.class
)
public interface MyFancyMapper {
    // ...
}

With this configuration MapStruct will omit the timestamp and re-create the same implementation class each time.

Conclusion

In this article we learned why it is important to create reproducible assets for applications that are deployed using CDK. We then looked at ways to create these assets in Java projects by configuring the generation of application artifacts like jar archives. Finally, we looked at MapStruct as a generator that also has to be configured in order to create reproducible implementations.