Tagging Everything

I have always been a big believer in metadata and observability. Tagging is another form of observability; the idea is very basic and yet not well explored in our industry. You add metadata to a resource. Metadata is just data that describes data. Why bother? Well, at scale, there will be hundreds to thousands of resources, including EC2 machines, container images, security groups, load balancers, and all kinds of applications like services, BFFs, aggregators, and much more. How do you make sense of these resources? How do you know if you need them at all? Cloud computing is great, but it is also a big cost center. Understanding your resources is critical, not only for savings but for better infrastructure management. Tags help with cost, but they go beyond cost.

Endless Resources

So let's say you have some scale: you can easily have hundreds to thousands of EC2 machines and dozens to hundreds of lambdas. The first question that should come to mind is ownership: who owns all these resources?

Without ownership, you will face these challenges and issues:
  • Maybe they are not used anymore and are therefore just a cost. Who can tell? Who do you ask?
  • Who will apply patches and security fixes?
  • In case of an incident, who will troubleshoot it?
  • If inefficiencies are present, who will improve them?

Lambdas and serverless are often taken for granted as a free lunch, but they are not. From the infrastructure management perspective, they are no different than services running on EC2. Sure, serverless via FaaS provides a higher level of abstraction by abstracting the OS and language runtime for you, but you still manage dependencies and the concerns listed above.

OK. But that's only true for services running on EC2 and lambdas, right?


No. The same philosophy applies to things that you might not even consider, like unit tests and Jenkins jobs. If you have hundreds of services, you should have thousands of tests. The same goes for Jenkins jobs: maybe not thousands, but hundreds for sure. Test automation and Jenkins jobs are different resources compared with EC2 servers and lambdas, but in the sense of ownership, we have exactly the same concerns.

Tagging

Tagging can be done in a variety of ways. Tagging is metadata; it should be applied to all resources, and we need to think of custom tags that help us make sense of all the resources in our infrastructure.
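To make the idea of "tag all resources" a bit more concrete, here is a minimal sketch of a check that reports which resources are missing required tags. The tag keys (`owner`, `business_domain`, `created_at`) and the resource names are assumptions for illustration only, not a standard or an AWS API.

```python
# Hypothetical tag policy: these key names are illustrative, not a standard.
REQUIRED_TAGS = {"owner", "business_domain", "created_at"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return the required tag keys that are absent or empty."""
    return {key for key in REQUIRED_TAGS if not tags.get(key, "").strip()}


# Made-up inventory: resource id -> tags.
resources = {
    "ec2-checkout-1": {
        "owner": "team-a",
        "business_domain": "checkout",
        "created_at": "2023-01-10",
    },
    "lambda-resize": {"owner": "", "business_domain": "media"},
    "jenkins-job-42": {},
}

for resource_id, tags in resources.items():
    gaps = missing_tags(tags)
    if gaps:
        print(f"{resource_id}: missing {sorted(gaps)}")
```

The same function works for any kind of resource (servers, lambdas, Jenkins jobs, tests) because it only looks at the tags, not at what the resource is.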

Tagging Use Cases

There are many use cases where tagging can help. Here are a few:

Governance: Think about access rules and segregation of responsibilities. Should team A have access to team B's server job? Should that call happen? When a service calls another service, it creates coupling. Coupling is fine if it is low, but you don't want high coupling, and some calls should not be allowed at all, like service A calling service B's database. Tagging can allow us to see such relationships. For service calls, tagging is not the only solution; there are others, like dynamic tracing, which is cool but expensive and intrusive at scale.
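As a rough sketch of how tags could surface a forbidden relationship like service A calling service B's database, the snippet below checks a list of observed calls against the tags of each resource. The tag keys (`kind`, `service`, `owner`), the rule itself, and the list of calls are all assumptions made up for this example.

```python
def is_call_allowed(caller_tags: dict[str, str], target_tags: dict[str, str]) -> bool:
    """Allow service-to-service calls; allow database access only from the owning service."""
    if target_tags.get("kind") == "database":
        owning_service = target_tags.get("service")
        return owning_service is not None and caller_tags.get("service") == owning_service
    return True


# Hypothetical resources and their tags.
tags_by_resource = {
    "service-a": {"kind": "service", "service": "a", "owner": "team-a"},
    "service-b": {"kind": "service", "service": "b", "owner": "team-b"},
    "db-b": {"kind": "database", "service": "b", "owner": "team-b"},
}

# Hypothetical (caller, target) pairs, e.g. collected from configs or logs.
observed_calls = [
    ("service-a", "service-b"),
    ("service-b", "db-b"),
    ("service-a", "db-b"),
]

violations = [
    (caller, target)
    for caller, target in observed_calls
    if not is_call_allowed(tags_by_resource[caller], tags_by_resource[target])
]
print(violations)  # [('service-a', 'db-b')]
```

An untagged database is treated as off limits here, which is a deliberate choice in this sketch: missing metadata should fail closed rather than open.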

Management: How do you effectively plan and manage time? Don't you need to know if you are going to spend 3 days troubleshooting some network issue because you don't know exactly why your service call doesn't work? A better understanding of your infrastructure and code allows more realistic estimates. Otherwise, when you don't know, the estimate is the proverbial, and yet wrong, 35 sprints. A better approach would be to do a 1-week spike and figure it out in a POC before giving such an estimate, but either way, tags help you understand.

Cost: Probably the number one reason to use tags. When you know what the resource is and have some extra information, it is easier to reason about. For instance, consider applying tags such as business domain, owner, create date, and last deploy to an EC2 machine in AWS.

By having the business domain and owner, you have a point of contact and someone to respond and be accountable for the cost. By having the create date and the last deploy in tags, you can tell if the machine is up to date or not, which helps with patches and even helps in investigations to tell if the machine is actually used.
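The reasoning above can be sketched in a few lines: group cost by the owner tag and flag machines whose last deploy is old or unknown. The machine ids, costs, tag key spellings, and the 180-day threshold are invented for illustration; in practice the data would come from your cloud provider's billing and inventory exports.

```python
from collections import defaultdict
from datetime import date

# Made-up data: tag keys mirror the ones discussed above; costs are invented.
machines = [
    {"id": "i-001", "monthly_cost": 120.0,
     "tags": {"business_domain": "checkout", "owner": "team-a",
              "create_date": "2021-03-01", "last_deploy": "2024-05-20"}},
    {"id": "i-002", "monthly_cost": 80.0,
     "tags": {"business_domain": "checkout", "owner": "team-a",
              "create_date": "2020-07-12", "last_deploy": "2022-01-15"}},
    {"id": "i-003", "monthly_cost": 300.0, "tags": {}},
]


def cost_by_owner(items: list[dict]) -> dict[str, float]:
    """Sum monthly cost per owner tag; untagged machines land under UNOWNED."""
    totals: dict[str, float] = defaultdict(float)
    for item in items:
        totals[item["tags"].get("owner", "UNOWNED")] += item["monthly_cost"]
    return dict(totals)


def stale_machines(items: list[dict], today: date, max_age_days: int = 180) -> list[str]:
    """Return ids of machines with no last_deploy tag or one older than max_age_days."""
    stale = []
    for item in items:
        last_deploy = item["tags"].get("last_deploy")
        if last_deploy is None or (today - date.fromisoformat(last_deploy)).days > max_age_days:
            stale.append(item["id"])
    return stale


print(cost_by_owner(machines))
print(stale_machines(machines, today=date(2024, 6, 1)))
```

Note how the untagged machine is both the most expensive and the one nobody can be asked about, which is exactly the situation tags are meant to prevent.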

Inventory: Inventories are a must for cloud migrations. Platform teams do migrations for a living; migrations never end. Since you do a lot of migrations, it is a good idea to learn how to do them better and more effectively. One of the very first steps in a migration is to have an inventory. Let's say you are migrating from JUnit 4 to JUnit 5: tags can be useful here. Consider tagging JUnit tests with the business domain, feature, ID of the target service, type of test, and execution time.

Having the business domain, feature, and ID of the target service is critical to understanding the coverage of the tests. If we also have the type of test, we get an even better understanding of the coverage; for instance, we can analyze whether we have too many e2e and too few unit tests.

Having the latest or average execution time (feel free to add percentiles like p75, p90, p95, and p99), we can tell which tests are slow and have an overall metric across all tests for a given service or business domain. Tags play an important role here, not only for understanding but also for giving us the full picture to make the tests better, more reliable, and with decent coverage.
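Here is a small sketch of what such a test inventory could answer once the tags exist. The test names, tag keys (`test_type`, `service_id`, `avg_time_ms`, and so on), values, and the 1000 ms threshold are hypothetical; how the tags get attached to the tests is left out on purpose.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical inventory of tagged tests; all values are invented.
tests = [
    {"name": "CartTotalTest", "business_domain": "checkout", "feature": "cart",
     "service_id": "svc-cart", "test_type": "unit", "avg_time_ms": 12},
    {"name": "CouponRuleTest", "business_domain": "checkout", "feature": "coupons",
     "service_id": "svc-cart", "test_type": "unit", "avg_time_ms": 8},
    {"name": "CartRepositoryIT", "business_domain": "checkout", "feature": "cart",
     "service_id": "svc-cart", "test_type": "integration", "avg_time_ms": 950},
    {"name": "CheckoutFlowE2E", "business_domain": "checkout", "feature": "payment",
     "service_id": "svc-cart", "test_type": "e2e", "avg_time_ms": 4200},
    {"name": "TaxCalcTest", "business_domain": "checkout", "feature": "tax",
     "service_id": "svc-cart", "test_type": "unit", "avg_time_ms": 15},
]

# How many tests of each type do we have? Helps spot an inverted test pyramid.
tests_by_type = Counter(t["test_type"] for t in tests)

# Which tests are slower than an (arbitrary) threshold?
slow_tests = [t["name"] for t in tests if t["avg_time_ms"] > 1000]

# p90 of execution time across the inventory; index 89 is the 90th percentile cut point.
times = [t["avg_time_ms"] for t in tests]
p90_ms = quantiles(times, n=100, method="inclusive")[89]

print(dict(tests_by_type))
print(slow_tests)
print(p90_ms)
```

For a JUnit 4 to JUnit 5 migration, the same list could carry one more tag for the JUnit version, and the inventory would then double as a progress report.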

Troubleshooting: It requires all we can get to make sense of systems: Linux perf tools, logs, metrics and alerts, profilers, debuggers, and tags. Tags play a role in troubleshooting as well. Tags will not replace metrics but complement them. Consider the tradeoffs between metrics and tags.


Metrics tend to grow much bigger than tags. Tags can also be sent in the form of metrics to a metric system for correlating resources, but metrics tend to be a bit more static and have a bigger granularity. The funny fact is that good metric systems often allow you to add tags to metrics. :-)

Security: Let's say you don't allow services to auto-tag. In that case, you could have a central coordinator or even a GitOps solution that applies tags for you. Some tags could be used to check and allow or restrict service calls. AWS has amazing capabilities with IAM, and tags can be used in custom expressions to validate access rules.
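One way to picture the central coordinator is as a reconciliation step: the desired tags live in a reviewed file in git, and the coordinator computes what to set and what to remove on the real resource. The sketch below shows only that planning step, under the assumption that tags are plain key/value pairs; the tag names are made up, and actually applying the plan through a cloud API is out of scope.

```python
def plan_tag_changes(
    desired: dict[str, str], actual: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return (to_set, to_remove) so that the resource's tags would match the desired state."""
    to_set = {key: value for key, value in desired.items() if actual.get(key) != value}
    to_remove = sorted(key for key in actual if key not in desired)
    return to_set, to_remove


# Hypothetical desired state, as it might be declared in a git repository.
desired_tags = {"owner": "team-b", "allowed_callers": "service-b"}

# Hypothetical current state, including a tag the service assigned to itself.
actual_tags = {"owner": "team-x", "self_assigned": "true"}

to_set, to_remove = plan_tag_changes(desired_tags, actual_tags)
print(to_set)     # {'owner': 'team-b', 'allowed_callers': 'service-b'}
print(to_remove)  # ['self_assigned']
```

Because the plan removes anything not declared in git, a service cannot grant itself access by adding its own tags, which is the point of not allowing auto-tagging.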

Takeaways

Here are some takeaways from this blog post, and some ideas to keep in mind.

  • Tags are a form of observability
  • Tags should be used alongside metrics
  • Tags can be used in a variety of use cases like governance, management, cost optimization, inventory for migrations, troubleshooting, and security.
  • Tag all your resources for better understanding and support for custom engineering
  • Cost is a big thing in cloud computing; be aware of it and always work to improve it.

cheers,

Diego Pacheco


