Health checks are becoming an essential part of a modern microservices setup. Every service is expected to expose a health check endpoint that a server monitoring tool can access. Health checks are important because they allow the process responsible for running the application to restart or kill it when it starts to misbehave or fail. The check itself needs to be designed carefully: it should not be so aggressive that it consumes a significant share of the cycles the service needs for real work.
What you record with health checks is entirely your choice. However, the following are commonly recommended:
– Data store connection status (general connection state, connection pool status)
– Current response time (rolling average)
– Current connections
– Bad requests (running average)
How to determine what constitutes an unhealthy state needs to be part of the discussion during the design of the service. For example, no connectivity to the database means the service is completely inoperable; it would report unhealthy, which allows the orchestrator to recycle the container. An exhausted connection pool, on the other hand, could just mean that the service is under high load; while it is not completely inoperable, it could be suffering degraded performance and should just report a warning.
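The distinction between unhealthy and degraded described above can be sketched roughly as follows. This is only an illustration: the `Snapshot` fields, the `Status` names, and the `evaluate` function are hypothetical, and a real service would gather these readings from its own data store client.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"    # still serving, but should raise a warning
    UNHEALTHY = "unhealthy"  # the orchestrator may recycle the container


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings collected by the service."""
    db_connected: bool
    pool_in_use: int
    pool_size: int


def evaluate(snapshot: Snapshot) -> Status:
    # No database connectivity: the service cannot do any useful work.
    if not snapshot.db_connected:
        return Status.UNHEALTHY
    # Exhausted pool: most likely high load, so warn rather than fail.
    if snapshot.pool_in_use >= snapshot.pool_size:
        return Status.DEGRADED
    return Status.HEALTHY


print(evaluate(Snapshot(db_connected=True, pool_in_use=10, pool_size=10)).value)
```

Keeping the decision in one small function makes it easier to review the rules with the team during design, which is where the text suggests they should be agreed.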
The same goes for the current response time. When you load test your service once it has been deployed to production, you can build up a picture of the thresholds of operating health. These numbers can be stored in the config and used by the health check. For example, suppose you know that your service handles an average request with 50 milliseconds of latency for 4,000 concurrent users, but at 5,000 users this grows to 500 milliseconds because the connection pool is exhausted. You could set your SLA upper boundary to 100 milliseconds and start reporting degraded performance from your health check once it is exceeded. This should, however, be a rolling average based on the normal distribution. It is always possible for one or two requests to fall far outside the standard deviation of normal operation, and you do not want these to skew your average and cause the service to report unhealthy when the slow response was actually due to an upstream service having slow network connectivity, not your internal state.
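A minimal sketch of the rolling average idea, assuming a fixed-size window of recent latencies and the 100 millisecond boundary from the example above. The class name `RollingLatency` and the window size of 100 are illustrative choices, not recommendations, and the sketch uses a plain mean rather than any explicit outlier rejection.

```python
from collections import deque


class RollingLatency:
    """Rolling mean over the most recent `window` request latencies."""

    def __init__(self, window: int, degraded_above_ms: float) -> None:
        # A bounded deque drops the oldest sample automatically.
        self._samples = deque(maxlen=window)
        self._degraded_above_ms = degraded_above_ms

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def average_ms(self) -> float:
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def is_degraded(self) -> bool:
        return self.average_ms() > self._degraded_above_ms


tracker = RollingLatency(window=100, degraded_above_ms=100.0)
for _ in range(99):
    tracker.record(50.0)
tracker.record(500.0)  # a single slow request among 99 normal ones
print(tracker.average_ms(), tracker.is_degraded())  # prints: 54.5 False
```

With this window, one 500 millisecond request only moves the average to 54.5 milliseconds, so the service keeps reporting healthy; a sustained run of slow requests would push the average over the boundary.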
When discussing health checks, the handshake pattern often comes up: each client sends a handshake request to the downstream service before connecting, to check whether it is capable of receiving the request. Since services operate normally most of the time, this adds an enormous amount of chatter to your application and is overkill. It also implies that you are using client-side load balancing, as with a server-side approach you would have no guarantee that the service you handshake with is the one you connect to. However, the concept of the downstream service deciding whether it can or cannot handle a request is a valid one. Why not instead call your internal health check as the first operation before processing a request? This way you could fail immediately and give the client the opportunity to attempt another endpoint in the cluster. This call would add almost no overhead to your processing time, as all you are doing is reading the state from the health endpoint, not processing any data.
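One way to picture checking internal health as the first step of request handling is sketched below. It assumes that some background check (not shown) updates a shared `HealthState`; the class, the `handle_request` function, and the response messages are hypothetical stand-ins for whatever framework the service uses.

```python
import threading
from typing import Tuple


class HealthState:
    """Holds the latest health result; a background check would update it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._healthy = True

    def set_healthy(self, healthy: bool) -> None:
        with self._lock:
            self._healthy = healthy

    def is_healthy(self) -> bool:
        with self._lock:
            return self._healthy


def handle_request(state: HealthState, payload: str) -> Tuple[int, str]:
    # Only the stored state is read here; no probes run on the request path.
    if not state.is_healthy():
        return 503, "service unavailable, try another instance"
    return 200, "processed " + payload


state = HealthState()
print(handle_request(state, "order-1"))
state.set_healthy(False)
print(handle_request(state, "order-2"))
```

Because the handler only reads a flag, the cost per request is negligible, and a client that receives the 503 can retry against another endpoint in the cluster.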
When we discussed service discovery, we examined the concepts of server-side and client-side discovery. For many years server-side discovery was the only option, and there was also a preference for doing SSL termination on the load balancer because of the performance cost of handling encryption in every service. It is nevertheless a good idea to use TLS secure connections internally. But what about being able to do sophisticated traffic distribution? That can only be achieved if you have a central source of knowledge. There could also be a benefit to only sending a certain number of connections to a particular host; but then how do you measure health? You can use layer 6 or 7, but as we have seen with smart health checks, a service that is too busy can simply reject a connection. To support multiple load-balancing strategies across multiple instances, such as round-robin, random, or something more sophisticated like distributed statistics, you can define your own strategy.
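As an illustration of pluggable strategies, the sketch below treats a strategy as any function that returns the next host. The names `round_robin`, `random_choice`, and `LoadBalancer`, and the host addresses, are hypothetical; a statistics-based strategy would fit the same shape but is not shown.

```python
import itertools
import random
from typing import Callable, Optional, Sequence

# A strategy is any callable that returns the next host to try.
Strategy = Callable[[], str]


def round_robin(hosts: Sequence[str]) -> Strategy:
    if not hosts:
        raise ValueError("at least one host is required")
    cycle = itertools.cycle(list(hosts))
    return lambda: next(cycle)


def random_choice(hosts: Sequence[str], rng: Optional[random.Random] = None) -> Strategy:
    if not hosts:
        raise ValueError("at least one host is required")
    pool = list(hosts)
    picker = rng or random.Random()
    return lambda: picker.choice(pool)


class LoadBalancer:
    """Delegates host selection to whichever strategy it was given."""

    def __init__(self, strategy: Strategy) -> None:
        self._strategy = strategy

    def next_host(self) -> str:
        return self._strategy()


hosts = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
balancer = LoadBalancer(round_robin(hosts))
print([balancer.next_host() for _ in range(4)])
```

Swapping `round_robin(hosts)` for `random_choice(hosts)` changes the distribution without touching the code that calls `next_host`.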
One way you can improve the performance of a service is by caching results from databases and other downstream calls in an in-memory cache or a side cache like Redis, rather than hitting a database every time. Caches are designed to deliver massive throughput by storing precompiled objects in a fast-access data store, frequently based around the concept of a hash key. We know from looking at algorithm performance that a hash table lookup has an average performance of O(1); that is as fast as it gets. Without going too deep into Big O notation, this means that the time to find the item you want stays roughly constant no matter how many items are in the collection. As a result, you can not only reduce the load on the database but also reduce your infrastructure costs. Typically, a database is limited by the amount of data that can be read from and written to the disk and the time it takes for the CPU to process this information. With an in-memory cache, this limitation is removed by using pre-aggregated data that is stored in fast memory rather than on a stateful device like a disk. This comes at the cost of consistency, because you cannot guarantee that all clients will have the same information at the same time.
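A rough sketch of this idea, using a Python dictionary (a hash table) as the in-memory cache in front of a slower lookup. `CacheAside` and `slow_lookup` are hypothetical names, and the loader stands in for a real database call.

```python
from typing import Callable, Dict


class CacheAside:
    """Checks an in-memory hash table before falling back to the loader."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load                 # stands in for a database call
        self._cache: Dict[str, str] = {}  # dict lookups are O(1) on average
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._cache:
            return self._cache[key]
        # Cache miss: pay the cost of the slow lookup once, then remember it.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value


def slow_lookup(key: str) -> str:
    # Placeholder for a query against the data store.
    return "value-for-" + key


cache = CacheAside(slow_lookup)
cache.get("user:1")
cache.get("user:1")
print(cache.misses)  # prints: 1
```

The second call never reaches the loader, which is where the reduced database load comes from. Note that this sketch never expires entries, so it also shows the consistency cost: a changed value in the data store would not be seen.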
Caching strategies can be chosen based on your requirements for this consistency. In theory, the longer the cache expiry, the greater the cost saving and the faster the system, at the expense of reduced consistency. So when planning a feature, you should discuss consistency and its tradeoffs with performance and cost, and document the decision, as doing so will greatly help create a more successful implementation.
You have probably heard the phrase premature optimization, so does that mean you should not implement caching until you need it? No; it means you should attempt to predict, at design time, the initial load that your system will be under and the growth in capacity over time, as you are considering the whole application lifecycle. While you put this data together, you will not be able to reliably predict the speed at which a service will run. However, you do know that a cache will be cheaper to operate than a data store; so, if possible, you should design to use the smallest and cheapest data store possible, and make provision to extend your service by introducing caching at a later date. This way you only do the work necessary to get the service out of the door, but you have done the design up front to be able to extend the service when it needs to scale.
The cache will normally have an expiry time on it. However, if you implement the cache so that your code decides when to invalidate an entry, you can potentially avoid problems if a downstream service or database disappears. Again, this comes back to thinking about failure states and asking what is better: the user seeing slightly out-of-date information or an error page? If your cache entry has expired and the call to the downstream service fails, you can still decide to serve the stale entry back to the calling client. In some instances, this will be better than returning a 5xx error.
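The stale-on-failure behaviour can be sketched as below, assuming the code keeps expired entries around instead of letting the cache store evict them. `StaleOnErrorCache`, its `ttl_seconds` parameter, and the injectable `clock` are illustrative names; catching every `Exception` is a simplification, and a real service would likely narrow this to the errors it expects from the downstream call.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Expires entries after a TTL but keeps them to serve if a reload fails."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (value, time at which the value stops being fresh)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh
        try:
            value = self._load(key)
        except Exception:
            if entry is not None:
                return entry[0]  # expired, but better than an error page
            raise  # nothing cached at all, so the failure must surface
        self._entries[key] = (value, now + self._ttl)
        return value
```

A caller would wrap its downstream call, for example `StaleOnErrorCache(fetch_profile, ttl_seconds=30.0)`, where `fetch_profile` is a hypothetical function that queries the downstream service. Whether serving stale data is acceptable remains a per-feature decision.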