Cutting AWS on-demand spending in half may seem like crazy talk, but at Power Costs, Inc. (PCI) we achieved it by automating the shutdown and snapshotting of our instances. It’s been about two years since we moved the majority of our internal development servers to AWS EC2. The move has given us new levels of capability and flexibility, along with the monetary costs that come with them.
In the graph below you can see our journey through EBS snapshots. Before we took any cost-cutting measures, EBS comprised about 70% of our daily AWS EC2 spending. When we introduced our first snapshot routine, EBS costs fell to about 55% of daily spend for volumes and 2% for snapshots. These savings came from snapshotting the volume that held the actual database data. Our second snapshot routine snapshots all volumes attached to an instance. This brought EBS volumes down to 25% of our daily spend, with snapshots at 8%.
When we first started our move to the cloud, we decided to create a simple CLI app for users. This app talks to a server that performs all the AWS API calls and tracks instance state and metadata. In the beginning we focused on the basics: create, stop, start and terminate. We also automated instance shutdown at 7 PM to keep initial costs under control. This taught us the fundamentals of how AWS and EC2 work.
As usage of our service grew, we started analyzing our costs to determine the most expensive part of our daily AWS spend. It turned out that over 70% of our cost was due to EBS volume space, for two main reasons:
- Databases running on EC2 needed anywhere from 30 to 900 GB of volume space
- Users create a new database, use it once and let it sit around offline for months
To begin reducing our EBS usage, we decided to snapshot each database as it was shut down. Because we would be deleting and creating volumes, our solution needed to be robust. We considered writing our own implementation but discovered an Amazon service that fit the bill.
Sample code and services referenced in this post are available on our GitHub.
AWS Step Functions
Step Functions enable coordination of multiple AWS services into a serverless workflow. Step Functions are built out of states, such as Task, Choice and Wait, that control your workflow. Coca-Cola gave a great talk at re:Invent 2017 on how they use Step Functions for creating nutrition labels.
Our Step Functions are chains of AWS Lambda functions that call the AWS EC2 API. These Step Functions shut down an instance and convert its EBS volumes to snapshots. Looping is made possible by combining Wait and Choice states. Instead of waiting inside a Lambda function for a snapshot to complete, we output the current status into the Step Function state. Then a Choice state checks that output. If the action is still in progress, we wait for a period of time and loop back to the check status function. If it succeeded, we go to the next step in the workflow, and if it failed, we route to a failure state.
    StopInstances:
      Type: Task
      Resource: ${self:custom.function-arn}-StopInstances
      Next: WaitForInstancesStop
    WaitForInstancesStop:
      Type: Wait
      Seconds: 15
      Next: CheckInstancesStopped
    CheckInstancesStopped:
      Type: Task
      Resource: ${self:custom.function-arn}-CheckInstancesStopped
      Next: EvaluateInstancesStopped
    EvaluateInstancesStopped:
      Type: Choice
      Choices:
        - Variable: '$.instanceShutdownStatus'
          StringEquals: 'FAILED'
          Next: StopInstancesFailure
        - Variable: '$.instanceShutdownStatus'
          StringEquals: 'SUCCESS'
          Next: DetachVolumes
      Default: WaitForInstancesStop
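To make the "report status, don't wait" idea concrete, here is a minimal Python sketch of what a function like CheckInstancesStopped could look like. It is not our production code: the `instanceIds` key, the `PENDING` value, and the injected `describe_states` callable are assumptions for illustration. Only `instanceShutdownStatus`, `SUCCESS` and `FAILED` come from the state machine above.

```python
# Hypothetical sketch of a status-check Lambda for the loop above.
# State names are the standard EC2 instance states; PENDING, instanceIds
# and describe_states are assumptions made for this example.

SUCCESS = "SUCCESS"
FAILED = "FAILED"
PENDING = "PENDING"  # anything other than SUCCESS/FAILED takes the Default branch


def classify_shutdown(states):
    """Map a list of EC2 instance state names to a single workflow status."""
    if states and all(state == "stopped" for state in states):
        return SUCCESS
    # An instance that is being terminated can never reach "stopped".
    if any(state in ("shutting-down", "terminated") for state in states):
        return FAILED
    return PENDING


def handler(event, context, describe_states):
    """Report the current status and return immediately; never sleep here.

    describe_states is an injected callable (instance IDs -> state names) so
    the sketch runs without AWS access. A deployed function would bind it to
    a wrapper around the EC2 DescribeInstances call.
    """
    states = describe_states(event["instanceIds"])
    # Return the whole event so later states keep their input.
    return {**event, "instanceShutdownStatus": classify_shutdown(states)}
```

Because the function returns right away, the waiting happens in the Wait state rather than in billed Lambda execution time.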
[Figures: Stop Step Function | Start Step Function]
Drawbacks
There are some drawbacks to this approach. First, there is a known performance degradation of volumes created from snapshots. When you create a new volume from a snapshot, AWS loads the blocks from S3 as the operating system requests them. This can degrade performance until the volume has received all its blocks from S3. Amazon has a recommended solution if this is a concern for you. Second, this approach increases startup and shutdown time. Typical EC2 startup and shutdown time is a few minutes. Our shutdown and startup processes take about 7 minutes each.
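One documented way to work around the first drawback is to read every block of the restored volume once (the AWS documentation on initializing EBS volumes does this with `dd` or `fio`), which forces the data to be fetched from S3 before your workload needs it. The sketch below is a rough Python equivalent, not something we run ourselves. The function name and the example device path are assumptions, and reading a raw device normally requires root.

```python
def read_all_blocks(path, chunk_size=1024 * 1024):
    """Sequentially read every byte of `path` and return the byte count.

    Reading each block once makes a snapshot-backed volume pull it from S3,
    so later reads of that block no longer pay the first-access penalty.
    """
    total = 0
    # buffering=0 avoids a second Python-level buffer on top of our chunks.
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # empty read means end of device/file
                break
            total += len(chunk)
    return total
```

For example, `read_all_blocks("/dev/xvdf")` would touch a volume attached at that (hypothetical) device name. On a 900 GB volume this takes a while, so it only makes sense where the first-read latency matters more than the startup delay.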
Other Considerations
One thing to keep in mind when designing a Step Function is safe loop iteration. If you have an array of objects that each need an action performed on them exactly once, you need a reliable way to do so. The pattern we follow is to:
- Have the actor Lambda function take in the array and an index value to act upon
- Actor Lambda performs work on that index element of the array
- Iterator Lambda increments the index after the actor Lambda completes
- Choice state completes the loop or sends it back to step 1 if there are more elements in the array
This pattern allows you to handle a single array element failure instead of trying to reprocess the entire array. A great example is detaching volumes from an instance. If you have 2 volumes and only 1 detaches on the first call, Amazon will throw an error if you repeat the exact same call. We have identified several key API calls that need this pattern:
- Creating volumes
- Detaching volumes
- Attaching volumes
- Creating snapshots
Amazon has a great example in the Step Functions docs on how to do this. You can view our own Iterator Lambda here and see it in action below.
    CreateSnapshot:
      Type: Task
      Resource: ${self:custom.function-arn}-CreateSnapshot
      Next: CreateSnapshotIterator
      Retry: ${file(common.yml):reqLimitRetry}
    CreateSnapshotIterator:
      Type: Task
      Resource: ${self:custom.function-arn}-Iterate
      InputPath: '$.snapshotCreateIterator'
      ResultPath: '$.snapshotCreateIterator'
      Next: IsCreateSnapshotIterationComplete
      Retry: ${file(common.yml):reqLimitRetry}
    IsCreateSnapshotIterationComplete:
      Type: Choice
      Choices:
        - Variable: '$.snapshotCreateIterator.continue'
          BooleanEquals: true
          Next: CreateSnapshot
      Default: WaitForSnapshotsCreate
    WaitForSnapshotsCreate:
      Type: Wait
      Seconds: 60
      Next: CheckSnapshotsCreateStatus
    CheckSnapshotsCreateStatus:
      Type: Task
      Resource: ${self:custom.function-arn}-CheckSnapshotsCreateStatus
      Next: EvaluateSnapshotsCreateStatus
      ResultPath: '$.snapshotStatus'
      Retry: ${file(common.yml):reqLimitRetry}
    EvaluateSnapshotsCreateStatus:
      Type: Choice
      Choices:
        - Variable: '$.snapshotStatus'
          StringEquals: ${self:custom.status.failed}
          Next: StopInstancesFailure
        - Variable: '$.snapshotStatus'
          StringEquals: ${self:custom.status.success}
          Next: DeleteVolume
      Default: WaitForSnapshotsCreate
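For readers who want to see the actor and iterator halves side by side, here is a minimal Python sketch that fits the states above. It is an illustration rather than our actual Lambda code. Only the `snapshotCreateIterator` path and its `continue` flag appear in the state machine; the `index`, `count`, `volumeIds` and `snapshotIds` names and the injected `snapshot_volume` callable are assumptions, loosely modeled on the iterator example in the AWS docs.

```python
# Hypothetical actor and iterator Lambdas for the snapshot loop above.
# Only `continue` is taken from the state machine; every other field
# name here is assumed for the sake of the example.


def iterate(iterator, context=None):
    """Advance the loop counter after the actor has handled one element."""
    index = iterator["index"] + 1
    return {
        "index": index,
        "count": iterator["count"],
        # True while there are still unprocessed elements.
        "continue": index < iterator["count"],
    }


def create_snapshot(state, context=None, snapshot_volume=None):
    """Act on exactly one element: the volume at the current index."""
    index = state["snapshotCreateIterator"]["index"]
    volume_id = state["volumeIds"][index]
    # snapshot_volume stands in for the EC2 CreateSnapshot call.
    snapshot_id = snapshot_volume(volume_id)
    # Return the full state so the iterator and later steps still see it.
    return {**state, "snapshotIds": state.get("snapshotIds", []) + [snapshot_id]}
```

Since the actor only ever touches `volumeIds[index]`, a retry of a failed call repeats the work for one volume and leaves the snapshots already created alone.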
Another great feature of Step Functions is the Retry block. Amazon’s SDK automatically retries API calls that receive a throttling error code, but during times of increased API activity we get throttled more than the SDK can handle. AWS recommends wrapping API calls in a retry with exponential backoff. The Retry block handles this situation without us having to write our own implementation.
Here is a list of the various exception names that we have discovered through trial and error. These cover the various throttling exceptions in the EC2 and EBS APIs.
    CheckEC2TargetStatus:
      Type: Task
      Resource: ${self:custom.function-arn}-CheckEC2TargetStatus
      Next: EvaluateEC2TargetStatus
      ResultPath: '$.onEC2TargetStatus'
      Retry:
        - ErrorEquals:
            - RequestLimitExceeded
            - ThrottlingException
            - SnapshotCreationPerVolumeRateExceeded
            - Lambda.SdkClientException
            - Lambda.AWSLambdaException
            - Lambda.ServiceException
            - Throttling
            - PriorRequestNotComplete
            - Lambda.Unknown
          IntervalSeconds: 10
          MaxAttempts: 10
          BackoffRate: 2
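It helps to work out what these three numbers mean in practice. In a Retry block, IntervalSeconds is the wait before the first retry, BackoffRate multiplies the wait on each subsequent retry, and MaxAttempts caps the number of retries. The short sketch below computes the resulting schedule. The function name is ours, and it assumes no maximum delay or jitter is configured, as in the block above.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait before each retry, in seconds."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


delays = retry_delays(interval_seconds=10, max_attempts=10, backoff_rate=2)
print(delays)       # [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120]
print(sum(delays))  # 10230 seconds, roughly 2 hours 50 minutes
```

The last few waits dominate the total, so it is worth checking that a worst case of nearly three hours is acceptable for your workflow before copying these values.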
Final Thoughts
Our next goal is to split our current Step Functions into small composable actions. This will allow us to string actions together via a meta “Runner” Step Function. The Runner function will execute a child “Action” Step Function and watch its progress. Once the first action is complete it will start the next action with the output of the previous action. Using this pattern will also mean that we should be able to regression test all our actions via the Runner.
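As a rough illustration of the Runner idea, the Python sketch below chains plain functions in place of child Step Function executions. Everything in it is hypothetical, including the action names; it only shows the "output of one action becomes the input of the next" contract we have in mind.

```python
def run_actions(actions, initial_input):
    """Run actions in order, feeding each one the previous action's output."""
    payload = initial_input
    completed = []
    for name, action in actions:
        # In the real Runner this would start a child execution and wait for it.
        payload = action(payload)
        completed.append(name)
    return payload, completed


# Hypothetical actions standing in for child "Action" Step Functions.
def stop_instance(state):
    return {**state, "instanceState": "stopped"}


def snapshot_volumes(state):
    return {**state, "snapshotIds": ["snap-" + v for v in state["volumeIds"]]}


result, completed = run_actions(
    [("StopInstance", stop_instance), ("SnapshotVolumes", snapshot_volumes)],
    {"volumeIds": ["vol-a"]},
)
```

Keeping each action a function of its input alone is also what would make regression testing through the Runner straightforward.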
I hope you have enjoyed this blog post and learned something along the way. Reach out to me on Twitter at @chadjvw if you have any questions and I’d be happy to answer them.