Summary
Managing multiple AWS accounts across teams often leads to resource sprawl, including orphaned volumes, unused Elastic IPs, and forgotten infrastructure. Manual audits are reactive and inefficient, and third-party governance tools often involve billing intermediaries or percentage-based models, which we aim to avoid. This post explores best practices for automated AWS resource hygiene and cost governance in multi-account environments.
Root Cause
The root cause is a lack of automation combined with inconsistent enforcement of cleanup policies. Key issues include:
- Manual processes: DevOps teams rely on periodic audits, which are time-consuming and error-prone.
- Forgotten resources: Temporary infrastructure (e.g., backup volumes) is often overlooked, leading to unnecessary costs.
- Inadequate tagging: Poor tagging practices make it difficult to identify and manage resources.
- Limited visibility: Teams lack real-time insights into resource usage and costs.
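To make the tagging gap concrete, here is a minimal sketch of checking resource records against a required tag set. The tag keys Owner and Environment are hypothetical, not taken from any particular policy, and the records are assumed to be shaped like entries from the EC2 describe_volumes response, where Tags is a list of Key/Value pairs and may be absent.

```python
REQUIRED_TAGS = {"Owner", "Environment"}  # hypothetical keys; adjust to your own policy


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that a resource record lacks or leaves empty."""
    present = {
        tag["Key"]
        for tag in resource.get("Tags", [])
        if tag.get("Value", "").strip()
    }
    return set(required) - present


def untagged_report(resources, id_field="VolumeId"):
    """Map resource ID -> sorted list of missing tag keys, skipping compliant resources."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource)
        if missing:
            report[resource[id_field]] = sorted(missing)
    return report


if __name__ == "__main__":
    sample = [
        {"VolumeId": "vol-0aaa", "Tags": [{"Key": "Owner", "Value": "data-team"}]},
        {"VolumeId": "vol-0bbb"},  # no tags at all
    ]
    print(untagged_report(sample))
```

Keeping the check as a pure function over plain dictionaries means it can run against any inventory source and be unit-tested without AWS credentials.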
Why This Happens in Real Systems
- Silos between teams: Different teams manage resources independently, leading to inconsistent practices.
- Rapid provisioning: AWS’s ease of use encourages quick resource creation, but cleanup is often neglected.
- Complexity of multi-account setups: AWS Organizations and Control Tower add layers of complexity, making governance harder.
- Lack of ownership: Resources created for temporary purposes often lack clear ownership, leading to orphaned assets.
Real-World Impact
- Financial waste: Unused resources incur unnecessary costs; a forgotten 8 TB backup volume, for example, keeps accruing storage charges every month until someone deletes it.
- Operational inefficiency: Manual audits divert resources from higher-value tasks.
- Security risks: Forgotten resources may expose vulnerabilities or violate compliance policies.
- Scalability challenges: As the number of accounts and resources grows, manual management becomes unsustainable.
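As a rough illustration of the financial impact, the sketch below estimates what an idle volume costs over time. The rate of 0.08 USD per GiB-month is purely an illustrative assumption; actual EBS pricing depends on volume type and Region, so substitute the figure from your own bill.

```python
def monthly_storage_cost(size_gib, price_per_gib_month):
    """Storage cost for one month, in the same currency as the price."""
    if size_gib < 0 or price_per_gib_month < 0:
        raise ValueError("size and price must be non-negative")
    return size_gib * price_per_gib_month


def idle_cost(size_gib, price_per_gib_month, months_idle):
    """Total cost accrued while a volume sits unused for the given number of months."""
    return monthly_storage_cost(size_gib, price_per_gib_month) * months_idle


if __name__ == "__main__":
    size_gib = 8 * 1024      # an 8 TiB volume
    assumed_price = 0.08     # ASSUMPTION: illustrative USD per GiB-month
    print(f"Per month: {monthly_storage_cost(size_gib, assumed_price):.2f} USD")
    print(f"Idle for 6 months: {idle_cost(size_gib, assumed_price, 6):.2f} USD")
```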
How Senior Engineers Fix It
Senior engineers implement proactive, automated solutions:
- AWS-native tools: Use AWS Config rules to detect non-compliant or idle resources, and AWS Lambda or AWS Systems Manager Automation to remediate them.
- Tagging policies: Standardize tag keys and values with tag policies in AWS Organizations, and use Service Control Policies (SCPs) to deny creation of resources that lack required tags.
- Lifecycle management: Use Amazon Data Lifecycle Manager to automate the creation, retention, and deletion of EBS snapshots and EBS-backed AMIs.
- Custom automation: Build serverless scripts to detect and delete unused resources (e.g., EBS volumes, snapshots).
- Cost monitoring: Use AWS Cost Explorer to analyze spend, AWS Budgets to alert on thresholds, and AWS Cost Anomaly Detection to flag unexpected spending.
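- Open-source tools: Adopt Cloud Custodian for policy-based governance. Commercial platforms such as CloudHealth cover similar ground but are not open source, so check their pricing model against the goal of avoiding billing intermediaries.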
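The tag enforcement idea above can be sketched as a small generator for an SCP document. This is a minimal illustration, not a complete policy: the required tag keys Owner and ExpiresOn are hypothetical, only EC2 instance and volume creation are covered, and the statement follows the common pattern of a Deny with a Null condition on aws:RequestTag.

```python
import json

# Hypothetical required tag keys; not taken from any specific policy.
DEFAULT_REQUIRED_TAGS = ("Owner", "ExpiresOn")


def build_tag_enforcement_scp(required_tags=DEFAULT_REQUIRED_TAGS):
    """Build an SCP that denies EC2 instance/volume creation when a required tag is absent."""
    statements = []
    for key in required_tags:
        # Statement IDs may only contain alphanumeric characters.
        sid_suffix = "".join(ch for ch in key if ch.isalnum())
        statements.append({
            "Sid": f"DenyCreateWithout{sid_suffix}",
            "Effect": "Deny",
            "Action": ["ec2:RunInstances", "ec2:CreateVolume"],
            "Resource": [
                "arn:aws:ec2:*:*:instance/*",
                "arn:aws:ec2:*:*:volume/*",
            ],
            # Null == "true" matches requests that do not supply this tag.
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_tag_enforcement_scp(), indent=2))
```

Because SCPs apply to every principal in the accounts they are attached to, it is safer to attach a generated policy to a sandbox organizational unit first and confirm that existing pipelines already supply the tags.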
Why Juniors Miss It
Junior engineers often:
- Underestimate cleanup importance: Focus on provisioning rather than decommissioning.
- Lack awareness of AWS-native tools: Rely on manual methods instead of automation.
- Ignore tagging: Fail to implement consistent tagging practices, making resource tracking difficult.
- Overlook cost implications: Don’t realize the long-term financial impact of orphaned resources.
Example or Code
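import boto3


def delete_unused_volumes(dry_run=True):
    """Delete EBS volumes in the 'available' (unattached) state."""
    ec2 = boto3.client('ec2')
    # Paginate so that accounts with many volumes are fully covered.
    paginator = ec2.get_paginator('describe_volumes')
    pages = paginator.paginate(Filters=[{'Name': 'status', 'Values': ['available']}])
    for page in pages:
        for volume in page['Volumes']:
            if dry_run:
                print(f"Would delete volume: {volume['VolumeId']}")
            else:
                print(f"Deleting volume: {volume['VolumeId']}")
                ec2.delete_volume(VolumeId=volume['VolumeId'])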
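This script finds EBS volumes in the available state, meaning they are not attached to any instance, and deletes them only when dry_run is set to False; with the default dry_run=True it just prints the volume IDs. An unattached volume is not necessarily unneeded, so review the dry-run output before enabling deletion.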
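Deleting every unattached volume is a blunt rule. A hedged refinement is to delete only volumes that are also old enough and not explicitly protected. The sketch below assumes volume records shaped like describe_volumes entries (State, CreateTime, optional Tags); the 30-day threshold and the DoNotDelete tag key are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_cleanup_candidate(volume, now, min_age_days=30, keep_tag="DoNotDelete"):
    """Return True if a volume is unattached, old enough, and not opted out via tag."""
    if volume.get("State") != "available":
        return False
    tag_keys = {tag["Key"] for tag in volume.get("Tags", [])}
    if keep_tag in tag_keys:  # hypothetical opt-out tag
        return False
    # CreateTime is the creation time, not the time the volume was detached.
    return now - volume["CreateTime"] >= timedelta(days=min_age_days)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = {
        "VolumeId": "vol-0example",
        "State": "available",
        "CreateTime": now - timedelta(days=90),
    }
    print(is_cleanup_candidate(sample, now))
```

Note the limitation in the comment: a volume created long ago but detached yesterday still passes the age check, so this guard reduces risk rather than removing it.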
Key Takeaways
- Automate resource cleanup using AWS-native tools and open-source frameworks.
- Enforce tagging policies to improve visibility and accountability.
- Monitor costs proactively with AWS Cost Explorer and Budgets.
- Educate teams on the importance of resource hygiene and cost governance.