How to keep AWS accounts tidy?

Summary

Managing multiple AWS accounts across teams often leads to resource sprawl, including orphaned volumes, unused Elastic IPs, and forgotten infrastructure. Manual audits are reactive and inefficient, and third-party governance tools often involve billing intermediaries or percentage-based models, which we aim to avoid. This post explores best practices for automated AWS resource hygiene and cost governance in multi-account environments.

Root Cause

The root cause is a lack of automation combined with inconsistent enforcement of cleanup policies. Key issues include:

Manual processes: DevOps teams rely on periodic audits, which are time-consuming and error-prone.
Forgotten resources: Temporary infrastructure (e.g., backup volumes) is often overlooked, leading to unnecessary costs.
Inadequate tagging: Poor tagging practices make it difficult to identify and manage resources.
Limited visibility: Teams lack real-time insights into resource usage and costs.
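
The tagging gap above can be made measurable. What follows is a minimal sketch, assuming a team has agreed on a set of required tag keys (the Owner, Project, and ExpiresOn keys are hypothetical, not an AWS default) and that resources arrive in the shape EC2 describe calls return, with a Tags list of Key/Value pairs. The sample volumes are made-up data standing in for a real response.

```python
# Hypothetical tag keys; substitute whatever your organization mandates.
REQUIRED_TAGS = frozenset({"Owner", "Project", "ExpiresOn"})


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource.

    `resource` is a dict shaped like an item from an EC2 describe call,
    where tags (if any) appear as [{"Key": ..., "Value": ...}, ...].
    """
    present = {
        tag["Key"]
        for tag in resource.get("Tags", [])
        if tag.get("Value", "").strip()
    }
    return sorted(set(required) - present)


# Made-up sample data standing in for a describe_volumes response.
volumes = [
    {"VolumeId": "vol-aaa", "Tags": [{"Key": "Owner", "Value": "data-team"}]},
    {"VolumeId": "vol-bbb"},  # no Tags key at all: nothing was ever tagged
]
report = {v["VolumeId"]: missing_tags(v) for v in volumes}
print(report)
```

A report like this gives each team a concrete list to fix, which is usually easier to act on than a general request to "tag better".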

Why This Happens in Real Systems

Silos between teams: Different teams manage resources independently, leading to inconsistent practices.
Rapid provisioning: AWS’s ease of use encourages quick resource creation, but cleanup is often neglected.
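Complexity of multi-account setups: Every additional account and region is another place for resources to hide, and AWS Organizations and Control Tower add their own layers that must be configured before they help with governance.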
Lack of ownership: Resources created for temporary purposes often lack clear ownership, leading to orphaned assets.

Real-World Impact

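Financial waste: Unused resources keep billing until someone deletes them. A forgotten 8 TB backup volume, for example, is charged for its full provisioned size every month whether or not anything reads it.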
Operational inefficiency: Manual audits divert resources from higher-value tasks.
Security risks: Forgotten resources may expose vulnerabilities or violate compliance policies.
Scalability challenges: As the number of accounts and resources grows, manual management becomes unsustainable.
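
To put a rough number on the financial waste, here is a small worked calculation. The price is an assumption for illustration only: 0.08 USD per GiB-month is a placeholder, and actual EBS pricing depends on region and volume type. The sketch also assumes the 8 TB volume is 8 TiB (8192 GiB).

```python
def monthly_storage_cost(size_gib, price_per_gib_month):
    """Monthly cost in USD of keeping a volume of `size_gib` provisioned."""
    if size_gib < 0 or price_per_gib_month < 0:
        raise ValueError("size and price must be non-negative")
    return round(size_gib * price_per_gib_month, 2)


# ASSUMPTION: illustrative price only; look up the real rate for your
# region and volume type before relying on the result.
ASSUMED_PRICE = 0.08
size_gib = 8 * 1024  # 8 TiB expressed in GiB

per_month = monthly_storage_cost(size_gib, ASSUMED_PRICE)
per_year = round(per_month * 12, 2)
print(f"{size_gib} GiB: ~${per_month}/month, ~${per_year}/year at the assumed price")
```

At the assumed price this comes to roughly 655 USD per month for a single forgotten volume, before counting any snapshots taken from it.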

How Senior Engineers Fix It

Senior engineers implement proactive, automated solutions:

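AWS-native tools: Use AWS Config rules to detect non-compliant or idle resources, and AWS Lambda or AWS Systems Manager Automation to remediate them.
Tagging policies: Enforce mandatory tagging with Service Control Policies (SCPs) that deny untagged resource creation, and standardize tag keys and values with tag policies in AWS Organizations.
Lifecycle management: Use Amazon Data Lifecycle Manager to automate the creation, retention, and deletion of EBS snapshots and EBS-backed AMIs.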
Custom automation: Build serverless scripts to detect and delete unused resources (e.g., EBS volumes, snapshots).
Cost monitoring: Use AWS Cost Explorer and AWS Budgets with alerts, plus AWS Cost Anomaly Detection to flag unexpected spending.
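Open-source tools: Adopt a tool such as Cloud Custodian for policy-based governance. CloudHealth is sometimes mentioned alongside it, but it is a commercial product rather than open source, so check its pricing model against the constraint stated in the summary.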
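
As an illustration of the tagging item in the list above, the following sketch builds an SCP document that denies launching EC2 instances or creating EBS volumes when the request carries no value for a given tag. The function name require_tag_scp and the Owner tag key are hypothetical. The Null condition on aws:RequestTag follows the pattern AWS documents for requiring tags at creation, but treat this as a starting point to test in a sandbox organizational unit, not a finished policy.

```python
import json


def require_tag_scp(tag_key):
    """Build an SCP document that denies creating EC2 instances or EBS
    volumes when the request does not include `tag_key`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateWithoutTag",
                "Effect": "Deny",
                "Action": ["ec2:RunInstances", "ec2:CreateVolume"],
                "Resource": [
                    "arn:aws:ec2:*:*:instance/*",
                    "arn:aws:ec2:*:*:volume/*",
                ],
                # "Null": "true" matches requests where the tag is absent.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


# "Owner" is a hypothetical tag key chosen for this example.
policy_json = json.dumps(require_tag_scp("Owner"), indent=2)
print(policy_json)
```

Because the statement covers both instance and volume resources, a launch would need to tag the instance and its volumes at creation time. Generating the JSON from code keeps the policy reviewable in version control alongside the rest of the automation.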

Why Juniors Miss It

Junior engineers often:

Underestimate cleanup importance: Focus on provisioning rather than decommissioning.
Lack awareness of AWS-native tools: Rely on manual methods instead of automation.
Ignore tagging: Fail to implement consistent tagging practices, making resource tracking difficult.
Overlook cost implications: Don’t realize the long-term financial impact of orphaned resources.

Example or Code

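import boto3

def delete_unused_volumes(dry_run=True):
    """Delete EBS volumes in the 'available' (unattached) state."""
    ec2 = boto3.client('ec2')
    # Paginate so accounts with many volumes are covered in full.
    paginator = ec2.get_paginator('describe_volumes')
    pages = paginator.paginate(Filters=[{'Name': 'status', 'Values': ['available']}])
    for page in pages:
        for volume in page['Volumes']:
            volume_id = volume['VolumeId']
            if dry_run:
                print(f"[dry run] Would delete volume: {volume_id}")
                continue
            print(f"Deleting volume: {volume_id}")
            ec2.delete_volume(VolumeId=volume_id)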

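This script lists EBS volumes in the available state, meaning they are not attached to any instance, in the client's default region. With the default dry_run=True it only prints the matching volumes; pass dry_run=False to delete them. An unattached volume is not necessarily an unneeded one, so review the dry-run output (or take a snapshot) before deleting anything.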
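
The same dry-run pattern extends to the unused Elastic IPs mentioned in the summary. This is a sketch under stated assumptions: the function names are hypothetical, and the EC2 client is passed in as a parameter so the logic can be exercised without credentials. It relies on describe_addresses returning an AssociationId only for addresses that are attached to something.

```python
def unassociated_addresses(ec2):
    """Return Elastic IP records that are not attached to anything.

    `ec2` is any object exposing describe_addresses() like a boto3 EC2
    client. Associated addresses carry an "AssociationId"; idle ones do not.
    """
    response = ec2.describe_addresses()
    return [a for a in response["Addresses"] if "AssociationId" not in a]


def release_addresses(ec2, addresses, dry_run=True):
    """Release the given addresses; with dry_run=True, only report them."""
    released = []
    for addr in addresses:
        label = addr.get("AllocationId", addr.get("PublicIp"))
        if dry_run:
            print(f"[dry run] would release {label}")
            continue
        ec2.release_address(AllocationId=addr["AllocationId"])
        released.append(label)
    return released
```

In real use, the client would be boto3.client('ec2'), and a run across several accounts and regions would create one client per account and region pair. Keeping dry_run=True as the default means an accidental invocation reports rather than releases.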

Key Takeaways

Automate resource cleanup using AWS-native tools and open-source frameworks.
Enforce tagging policies to improve visibility and accountability.
Monitor costs proactively with AWS Cost Explorer and Budgets.
Educate teams on the importance of resource hygiene and cost governance.