Introduction

Recently, we built a proof of concept around improving the “developer experience” for one of our customer’s teams. It was a very successful exercise, so we thought we would share it.

The organisation we are working with is some way into its DevOps journey. It has a product team that has moved to deploying its software as containers, and an ops and infrastructure team that has formalised into a platform team providing Kubernetes clusters in the cloud to deploy those containers to.

Since the product team is moving fast, it and the platform team are working together in a highly collaborative mode. Whilst this is adding significantly to the “cognitive load” of both teams, it is a great approach as both teams “find their way”.

However, a third team, an R&D team within the organisation, also put their hand up wanting the services of the platform team. Their problem space was slightly different from the product team’s, though. First off, they “just wanted somewhere to run stuff in the cloud”. They also had neither the bandwidth nor the inclination to learn the ins and outs of deploying to Kubernetes or running their own infrastructure.

Secondly, because their deployments are for R&D work, they are often short-lived and ad hoc. This did not fit well with the model the platform team was running. Additionally, the platform team did not have the capacity to engage in a collaborative manner with yet another team.

A different approach

So a different approach was tried. First, the communication mode between the two teams was set up as “X-as-a-Service”, with a heavy focus on implementing “self-service” for the R&D team. The plan was that the R&D team would be able to deploy workloads without the direct involvement of the platform team, but with all the rigour around security, scaling, logging and costs that the platform team had put in place.

It was also decided to put a heavy focus on the developer experience, making the service as simple as possible for the R&D team to use, so that they did not have to learn new skills or perform onerous steps to get deployments working.

Finally, this would be done as a proof of concept to capture “lessons learned” as fast as possible.

Step 1 – Packaging the software

The R&D team was already familiar with Docker, and many of their experiments used some pretty esoteric configuration and installs, especially those involving Machine Learning. So it was decided to use containers to package up their software for deployment.

With developer experience in mind, a simple GitHub Action was put together to build and push an image to Amazon Elastic Container Registry (ECR) whenever a piece of software’s repo was “released”:
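The action itself is small. A minimal sketch of such a workflow is shown below; the repository name, AWS region and role secret are illustrative assumptions rather than the actual values used:

```yaml
# Sketch only: build and push an image to ECR whenever a release is published.
name: release-to-ecr
on:
  release:
    types: [published]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # allows OIDC authentication to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.ECR_PUSH_ROLE_ARN }}
          aws-region: ap-southeast-2

      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # Tag the image with the release tag so a manifest can refer to it later
      - run: |
          IMAGE="${{ steps.ecr.outputs.registry }}/my-experiment:${{ github.event.release.tag_name }}"
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
```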

This is pretty much a “drag and drop” setup. All the R&D developer needs to do is create a valid Dockerfile and add the GitHub Action to their repo.

They can then use the GitHub UI to create a release with well-defined notes and a version tag, and their software will “magically” build and be ready for deployment.

Step 2 – Deploying the software

In order to limit the number of new things that the R&D team would need to learn, it was decided to create a simple JSON “manifest” file. This file contains the minimal information required to deploy a container; the most important items are the version of the container image to deploy and the amount of resources it may consume:
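For illustration, a manifest along these lines might look like the following. The field names are assumptions rather than the actual schema; in this sketch the container takes its name from the manifest’s filename, and the “gpu” parameter is described further below:

```json
{
  "version": "v1.2.0",
  "cpu": 1024,
  "memory": 4096,
  "gpu": 0,
  "port": 8080
}
```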

These manifest files, one per container, are stored in a GitHub repo, providing visibility and auditability of changes as they are made.

For the proof of concept, a commit of a manifest change triggers the deployment (or upgrade) of a container. It is then available for use in a minute or two, and the job is done from the R&D team’s perspective.

Progress, success and failure reports for the deployment are broadcast into a Slack channel to keep the R&D team informed of what is happening.
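Behind the scenes this can be as simple as a final step in the deployment workflow posting to a Slack incoming webhook. A sketch of such a step, assuming the webhook URL is stored as a repository secret:

```yaml
      # Sketch only: report the outcome of the deployment job to Slack
      - name: Notify Slack
        if: always()   # report failures as well as successes
        run: |
          curl -sS -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"Deployment finished with status: ${{ job.status }}\"}" \
            "${{ secrets.SLACK_WEBHOOK_URL }}"
```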

Removing a manifest results in the undeployment of the associated container, making it easy to decommission containers that are no longer required.

Step 3 – Behind the scenes

Behind the scenes, a commit of a manifest change triggers a GitHub Action that runs Terraform to configure and deploy AWS infrastructure.
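A minimal sketch of such a workflow, assuming the manifests live in a manifests/ directory and that the Terraform state and backend configuration are handled separately:

```yaml
# Sketch only: run Terraform whenever a manifest changes on the main branch
name: deploy-manifests
on:
  push:
    branches: [main]
    paths: ["manifests/**"]

jobs:
  terraform:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # allows OIDC authentication to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE_ARN }}
          aws-region: ap-southeast-2
      - run: terraform init
      - run: terraform apply -auto-approve
```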

For the proof of concept, it was decided to use Amazon Elastic Container Service (ECS) to host the containers. This allows the infrastructure to be “scaled down to zero” when nothing is in use, saving significant money.

The Terraform scripts iterate over the manifests and configure a number of resources for each one (a sketch of this iteration follows the list), including:

  • Security groups
  • Log groups
  • ECS tasks and services
  • Listeners for the Application Load Balancer (for REST and WebSocket based communication with the deployed containers)
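A sketch of how that iteration might be expressed, with most of the resource arguments trimmed away and the resource names assumed for illustration:

```hcl
# Sketch only: load every JSON manifest and create one ECS service per manifest
locals {
  manifests = {
    for f in fileset("${path.module}/manifests", "*.json") :
    trimsuffix(f, ".json") => jsondecode(file("${path.module}/manifests/${f}"))
  }
}

resource "aws_ecs_service" "container" {
  for_each        = local.manifests
  name            = each.key
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.container[each.key].arn
  desired_count   = 1
  # Security groups, log groups and load balancer listeners are created per
  # manifest in the same way. Removing a manifest removes it from the map,
  # so Terraform destroys the associated resources on the next apply.
}
```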

The scripts also ensure that all the foundational elements, such as the Application Load Balancer, the ECS cluster, auto-scaling and IAM roles, are created and are in the correct state. In fact, the only items that are “hand crafted” are the TLS certificate (managed by AWS Certificate Manager) and the CNAME for the Application Load Balancer (managed by Amazon Route 53). This is due to the way that DNS is set up and managed for the organisation.

Since many of the containers deal with Machine Learning, the manifest supports specifying the number of GPUs the container requires via a “gpu” parameter. If a non-zero value is specified, the container is deployed to a GPU-enabled EC2 instance managed by ECS. These EC2 instances are in an aggressively auto-scaling group to keep costs down, and if no GPU-enabled containers are running, the number of instances is scaled to zero.

If no GPUs are required by the container, the Terraform scripts configure ECS to deploy the container to much cheaper infrastructure, using AWS Fargate.
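A sketch of how that choice might be expressed on the task definition, reusing the manifest fields assumed earlier (the ECR registry local is another illustrative assumption):

```hcl
# Sketch only: GPU containers are targeted at EC2, everything else at Fargate
resource "aws_ecs_task_definition" "container" {
  for_each = local.manifests

  family                   = each.key
  cpu                      = each.value.cpu
  memory                   = each.value.memory
  network_mode             = "awsvpc"
  requires_compatibilities = [each.value.gpu > 0 ? "EC2" : "FARGATE"]

  container_definitions = jsonencode([{
    name  = each.key
    image = "${local.ecr_registry}/${each.key}:${each.value.version}"
    # Reserve GPUs only when the manifest asks for them; the reservation pins
    # the task to the GPU-enabled EC2 instances in the auto-scaling group.
    resourceRequirements = [
      for req in (each.value.gpu > 0 ? ["GPU"] : []) : {
        type  = req
        value = tostring(each.value.gpu)
      }
    ]
  }])
}
```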

As mentioned before, if a manifest is removed, the Terraform scripts will remove the running container and all of its associated resources.

Conclusion

The proof of concept has been very successful. The R&D team is able to self-service the deployment of their software with minimal effort, and the platform team is comfortable that all the “right things” are being done to ensure that the deployments are safe and cost-effective.

The use of technologies that are already familiar to the teams, such as Docker, Terraform, Git and Slack, was well received and eased adoption.

The lessons learned are currently being reviewed, and the next steps are being worked through to address things such as self-service access to the container logs and the configuration of container environment variables and secrets (without complicating the manifests). Get in touch with us if you need any help or advice on your DevOps journey, containerisation, continuous delivery pipelines, cloud-friendly architectures or team topologies.

 

By Jonathan Ackerman, Solution Architect, ClearPoint