The $10,000 Label: How We Used Go, Clean Architecture, and AWS to Build a FinOps-Driven Cloud
Imagine your cloud bill is a massive corporate expense report. Without proper tagging, you're paying thousands every month for line items labelled simply "Server." You know something is running — you just have no idea what it belongs to, who owns it, or whether it's still needed.
That's the problem we set out to solve with sys-tag-manager.
The Real Cost of Inconsistent Tagging
Organisations running at scale face three compounding problems when tagging is treated as optional:
Untracked resources generate unnecessary costs. Orphaned infrastructure (EC2 instances from old experiments, forgotten load balancers, detached EBS volumes nobody deleted) keeps billing you. Without tags, these resources are invisible to cost allocation tooling.
Finance teams can't do accurate cost attribution. When a CTO asks "how much does the payments platform cost to run?", the answer shouldn't be "we'll get back to you in two weeks." Proper tagging makes showback and chargeback possible in near real-time.
Unmanaged resources fall outside compliance and security patch cycles. If you don't know a resource exists, you're not patching it, not monitoring it, and not rotating its credentials.
Why We Built Our Own Tool
AWS Config rules and third-party FinOps platforms exist, but they either require significant configuration overhead or lock you into a vendor. We needed something that:
- Integrated with our existing Terraform-managed SSM Parameter Store for rules
- Ran on a schedule as a Kubernetes CronJob or Lambda
- Could be extended by any team without understanding AWS internals
- Started fast and stayed cheap to run
That last point drove the language choice.
Why Go
We chose Go for its small memory footprint and fast startup, both of which matter for Lambda and CronJob workloads. A CronJob starts a fresh process on every run, and a Lambda function pays the initialisation cost on every cold start; a Python or JVM-based equivalent typically takes longer to get going in both cases. At the frequency we needed to run compliance checks, that adds up.
Go also gives us a single statically linked binary: no runtime dependency management, no virtualenv, no JVM tuning. The container image is tiny. The Lambda package is under 10 MB.
The Architecture: Clean Architecture in Go
We structured sys-tag-manager around Clean Architecture principles, with hard boundaries between layers:
sys-tag-manager/
├── domain/
│   ├── resource.go      # Core entity: what a cloud resource looks like
│   └── compliance.go    # Business rule: what "compliant" means
├── usecases/
│   ├── discover.go      # Orchestrate: find resources, check compliance
│   └── remediate.go     # Apply: tag resources that fail validation
├── adapters/
│   ├── aws/
│   │   ├── explorer.go  # AWS Resource Explorer implementation
│   │   ├── ssm.go       # SSM Parameter Store for rules
│   │   └── tagger.go    # AWS tagging API calls
│   └── config/
│       └── loader.go    # Environment & config loading
└── cmd/
    └── main.go
Domain Layer
The domain layer contains pure Go — no AWS SDK imports, no HTTP clients. It defines what a resource is and what compliance means:
// domain/resource.go
type Resource struct {
	ARN    string
	Region string
	Type   string
	Tags   map[string]string
}
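// domain/compliance.go
import "slices"

type ComplianceRule struct {
	RequiredKeys  []string
	AllowedValues map[string][]string
}

// IsCompliant reports whether every required key is present and every
// restricted key that is present holds one of its allowed values.
func (r *Resource) IsCompliant(rule ComplianceRule) bool {
	for _, key := range rule.RequiredKeys {
		if _, ok := r.Tags[key]; !ok {
			return false
		}
	}
	for key, allowed := range rule.AllowedValues {
		if value, ok := r.Tags[key]; ok && !slices.Contains(allowed, value) {
			return false
		}
	}
	return true
}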
This layer is independently testable. You can unit test all business logic without mocking AWS.
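To illustrate how far the domain logic can be exercised without any AWS dependency, here is a minimal sketch of a reporting variant of the compliance check. `Violations` and `contains` are hypothetical names that do not appear in the original tool, and the types are trimmed copies of the domain types; treat it as one possible shape, not the project's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource and ComplianceRule are trimmed copies of the domain types.
type Resource struct {
	ARN  string
	Tags map[string]string
}

type ComplianceRule struct {
	RequiredKeys  []string
	AllowedValues map[string][]string
}

// Violations is a hypothetical helper: it returns one message per problem
// instead of a single bool, which makes a report easier to act on.
func Violations(r Resource, rule ComplianceRule) []string {
	var out []string
	for _, key := range rule.RequiredKeys {
		if _, ok := r.Tags[key]; !ok {
			out = append(out, "missing tag: "+key)
		}
	}
	for key, allowed := range rule.AllowedValues {
		value, ok := r.Tags[key]
		if !ok {
			continue // a missing key is only a problem if it is required
		}
		if !contains(allowed, value) {
			out = append(out, "invalid value for "+key+": "+value)
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out
}

func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	rule := ComplianceRule{
		RequiredKeys:  []string{"team", "environment", "cost-center", "service"},
		AllowedValues: map[string][]string{"environment": {"production", "staging", "development"}},
	}
	r := Resource{
		ARN:  "arn:aws:ec2:eu-west-1:123456789012:instance/i-0abc",
		Tags: map[string]string{"team": "payments", "environment": "prod"},
	}
	for _, v := range Violations(r, rule) {
		fmt.Println(r.ARN, v)
	}
}
```

Because nothing here imports the AWS SDK, the same function can be covered by ordinary table-driven tests.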
Use Cases Layer
Use cases orchestrate the workflow without knowing how AWS is actually called:
// usecases/discover.go
type ResourceDiscoverer interface {
	Discover(ctx context.Context, regions []string) ([]domain.Resource, error)
}

type RuleLoader interface {
	LoadRules(ctx context.Context) (domain.ComplianceRule, error)
}

type DiscoverUseCase struct {
	discoverer ResourceDiscoverer
	rules      RuleLoader
}

func (uc *DiscoverUseCase) Run(ctx context.Context) ([]domain.Resource, error) {
	rules, err := uc.rules.LoadRules(ctx)
	if err != nil {
		return nil, err
	}
	resources, err := uc.discoverer.Discover(ctx, []string{"eu-west-1", "us-east-1"})
	if err != nil {
		return nil, err
	}
	var nonCompliant []domain.Resource
	for _, r := range resources {
		if !r.IsCompliant(rules) {
			nonCompliant = append(nonCompliant, r)
		}
	}
	return nonCompliant, nil
}
Adapters Layer
The adapters layer is where AWS lives. Swapping AWS Resource Explorer for a different discovery mechanism only requires changing this layer:
// adapters/aws/explorer.go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/resourceexplorer2"
	// plus the project's own domain package
)

type ResourceExplorerAdapter struct {
	client *resourceexplorer2.Client
}

// Discover pages through the search results and maps them to domain resources.
// Note: regions is unused in this excerpt; the query is not filtered by Region.
func (a *ResourceExplorerAdapter) Discover(
	ctx context.Context,
	regions []string,
) ([]domain.Resource, error) {
	var resources []domain.Resource
	paginator := resourceexplorer2.NewSearchPaginator(a.client, &resourceexplorer2.SearchInput{
		QueryString: aws.String("resourcetype:aws:ec2:instance OR resourcetype:aws:s3:bucket"),
	})
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, r := range page.Resources {
			resources = append(resources, domain.Resource{
				ARN:    aws.ToString(r.Arn),
				Region: aws.ToString(r.Region),
				Type:   aws.ToString(r.ResourceType),
				Tags:   flattenTags(r.Properties), // helper not shown in this post
			})
		}
	}
	return resources, nil
}
Discovery with AWS Resource Explorer
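AWS Resource Explorer gives us cross-region resource discovery through a single search API, so there is no need to iterate over every service in every Region manually. This relies on an aggregator index in one Region that collects the local indexes from the others. Resource Explorer keeps those indexes up to date, and we query them on a schedule.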
The key advantage: it returns tags as part of the resource payload, so a single query gives us both the resource inventory and current tag state. We don't need separate Describe* calls per resource type.
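The adapter calls a `flattenTags` helper that the post does not show. As a rough sketch of what such a helper might do, the function below converts a tag list into the `map[string]string` the domain expects. It assumes the tags property has already been serialised to JSON in the form `[{"Key": "...", "Value": "..."}]`; that shape, and the name `flattenTagJSON`, are assumptions to verify against real search output, and extracting the bytes from the SDK's property type is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagEntry is the assumed shape of one element of the tags property.
type tagEntry struct {
	Key   string `json:"Key"`
	Value string `json:"Value"`
}

// flattenTagJSON (hypothetical) turns a JSON list of key/value pairs into a
// flat map. Empty input yields an empty, non-nil map.
func flattenTagJSON(raw []byte) (map[string]string, error) {
	tags := map[string]string{}
	if len(raw) == 0 {
		return tags, nil
	}
	var entries []tagEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, fmt.Errorf("decode tags: %w", err)
	}
	for _, e := range entries {
		tags[e.Key] = e.Value
	}
	return tags, nil
}

func main() {
	raw := []byte(`[{"Key":"team","Value":"payments"},{"Key":"environment","Value":"production"}]`)
	tags, err := flattenTagJSON(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(tags["team"], tags["environment"])
}
```

Returning an error instead of an empty map on malformed input keeps a decoding problem from being misread as "resource has no tags".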
Compliance Rules in SSM Parameter Store
Rules live in SSM Parameter Store as JSON, managed by Terraform:
resource "aws_ssm_parameter" "tagging_rules" {
  name  = "/finops/tagging-rules/v1"
  type  = "String"
  value = jsonencode({
    required_keys  = ["team", "environment", "cost-center", "service"]
    allowed_values = {
      environment = ["production", "staging", "development"]
    }
  })
}
This gives us an auditable GitOps workflow: rule changes go through PR review, are applied by Terraform, and are automatically picked up by the next sys-tag-manager run. No manual config file deployments.
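On the Go side, the parameter value has to be decoded before it can be used as a rule. The following is a minimal sketch of that step, assuming the JSON keys shown in the Terraform above. `ruleDocument` and `parseRules` are hypothetical names, and fetching the value from SSM is omitted; the sketch starts from the raw string.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// ruleDocument mirrors the JSON written by the jsonencode call. Keeping the
// JSON tags on an adapter-side type leaves the domain struct free of them.
type ruleDocument struct {
	RequiredKeys  []string            `json:"required_keys"`
	AllowedValues map[string][]string `json:"allowed_values"`
}

// parseRules decodes the parameter value and rejects documents that contain
// unknown fields or no required keys, so a typo fails loudly.
func parseRules(raw string) (ruleDocument, error) {
	var doc ruleDocument
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&doc); err != nil {
		return ruleDocument{}, fmt.Errorf("decode tagging rules: %w", err)
	}
	if len(doc.RequiredKeys) == 0 {
		return ruleDocument{}, errors.New("tagging rules list no required keys")
	}
	return doc, nil
}

func main() {
	raw := `{"allowed_values":{"environment":["production","staging","development"]},` +
		`"required_keys":["team","environment","cost-center","service"]}`
	doc, err := parseRules(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doc.RequiredKeys, doc.AllowedValues["environment"])
}
```

Rejecting unknown fields is a deliberate choice here: a misspelled key such as `required_key` would otherwise decode to an empty rule and silently pass every resource.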
Handling Shared Infrastructure
Not every resource belongs to a single team. Networking components — VPCs, transit gateways, NAT gateways, shared security groups — are used by multiple teams and shouldn't be flagged as non-compliant for missing a team tag.
We handle this with a fallback mechanism: resources matching a set of ARN patterns or resource types are classified as "shared infrastructure" and receive a generic set of tags rather than team-specific ones:
func isSharedInfrastructure(r domain.Resource) bool {
	sharedTypes := map[string]bool{
		"aws:ec2:vpc":            true,
		"aws:ec2:natgateway":     true,
		"aws:ec2:transitgateway": true,
	}
	return sharedTypes[r.Type]
}
This reduces false positives significantly and keeps the compliance reports actionable.
Results
After rolling sys-tag-manager out across our AWS accounts:
| Metric | Before | After |
|---|---|---|
| Time to full compliance audit | Weeks | Minutes |
| Orphaned resources | ~12% of inventory | <1% |
| Cost allocation accuracy | Disputed in every review | Automated, audit-ready |
The orphaned resource reduction alone recovered meaningful spend — resources that had been running unnoticed for months were finally visible, investigated, and terminated.
What We'd Do Differently
Start with a read-only mode. Our first production run tagged more resources than expected due to a rule misconfiguration. Read-only dry-run mode should be the default, with explicit opt-in to write mode.
Add a notification layer early. Teams need to know when their resources are flagged non-compliant before the remediation run tags them automatically. We added a Slack notification adapter later — it should have been in the first version.
Version the rules schema. We're on /finops/tagging-rules/v1 in SSM for a reason. When we updated the rule format, old runs kept reading cached versions. Explicit schema versioning saves you from subtle bugs during rule migrations.
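A lightweight guard for this is to have the tool check the version encoded in the parameter name before it decodes anything. The sketch below assumes the `/vN` suffix convention seen in `/finops/tagging-rules/v1`; the function names and the `supportedSchema` constant are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportedSchema is the rule format this build understands (assumed value).
const supportedSchema = 1

// schemaVersion extracts N from a parameter name ending in "/vN".
func schemaVersion(name string) (int, error) {
	i := strings.LastIndex(name, "/v")
	if i < 0 {
		return 0, fmt.Errorf("parameter %q has no /vN suffix", name)
	}
	v, err := strconv.Atoi(name[i+2:])
	if err != nil || v < 1 {
		return 0, fmt.Errorf("parameter %q has an invalid version suffix", name)
	}
	return v, nil
}

// checkSchema fails fast when the parameter uses a format this build does
// not understand, instead of decoding it into a half-empty rule.
func checkSchema(name string) error {
	v, err := schemaVersion(name)
	if err != nil {
		return err
	}
	if v != supportedSchema {
		return fmt.Errorf("rules schema v%d is not supported (want v%d)", v, supportedSchema)
	}
	return nil
}

func main() {
	for _, name := range []string{"/finops/tagging-rules/v1", "/finops/tagging-rules/v2"} {
		if err := checkSchema(name); err != nil {
			fmt.Println("rejecting:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```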
The Broader Point
Tagging is governance infrastructure. It's boring until it isn't — until a cost spike lands and nobody can explain it, or a security audit finds resources with no known owner. Building tagging enforcement as a proper software system, rather than a wiki page of guidelines, is what makes it stick.
Go and Clean Architecture turned out to be the right combination here: fast enough to run frequently, structured enough to extend safely, and simple enough that any engineer on the team can contribute without needing to understand the full AWS tagging API surface.
Originally published on DEV Community