Many Layers of Availability

Today’s world is very different from what it was 20 years ago. We are used to a website loading in a second, a mobile app calculating a complex route in moments, or ordering a taxi with one button. This is made possible by combining many technologies and approaches that ensure smooth and fast operation.

It is hard to imagine a taxi service suddenly stopping for 10 minutes. It is equally hard to picture a map or GPS navigator that suddenly stops updating, or a major online store freezing during payment.

One of the key reasons such services work reliably is application availability. But what does that mean?

In technical terms, availability is the percentage of total time during which an application is ready to process requests. Suppose a shop is open from 8 a.m. to 8 p.m. every day. The checkout counter's availability is then 50% - 12 hours out of 24 per day, or 4,380 hours out of 8,760 per year. For most web-based applications, availability is much higher - 99%, 99.9%, or even 99.99%. The higher the availability, the less downtime the app has. For example, 99.99% allows only about 52 minutes of downtime per year.
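The arithmetic above is easy to check with a short script (the percentages come from this article; the function name is just for illustration):

```python
def downtime_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per (non-leap) year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)

for a in (99.0, 99.9, 99.99):
    print(f"{a}% -> {downtime_per_year(a):.1f} minutes of downtime/year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten.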

So how is high availability achieved? Let’s go step by step.

Application

The first line of defense for improving availability is the application itself, and there are many tools here.

  flowchart LR
  User[User]

  subgraph App[Application]
    LBnode[Load Balancer]

    R1[Replica 1]
    R2[Replica 2]
    R3[Replica 3]
  end

  User --> LBnode
  LBnode lb-r1@--> R1
  LBnode lb-r2@--> R2
  LBnode lb-r3@--> R3

lb-r1@{animate: true}
lb-r2@{animate: true}
lb-r3@{animate: true}

Architecture

A major step toward better availability is rethinking the application architecture. A large monolithic application is a single point of failure. If it crashes or becomes unreachable due to network issues, the entire service goes down.

A microservices or other distributed architecture helps avoid this. When we split an app into independent modules, the failure of one usually does not bring down the whole system. Only part of the functionality is affected, and we can minimize that impact with other practices.

Redundancy

One such practice is redundancy of components, called replication. Each module runs as several instances at the same time. Incoming load - whether requests, messages, or background tasks - is distributed among them. If one process fails, others take over its workload, keeping the service running.

Load balancing

For requests, load distribution is handled by a load balancer - a network component that knows the list of replica addresses and forwards incoming requests to them. In a simple case, it cycles through addresses in turn. In more advanced setups, it considers the load on each replica and adjusts distribution accordingly.
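The simple cycling case described above is round-robin balancing; it can be sketched in a few lines (the replica addresses are made up for illustration):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Forwards each incoming request to the next replica in turn."""

    def __init__(self, replicas: list[str]):
        self._cycle = cycle(replicas)  # endless iterator over the address list

    def next_replica(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([lb.next_replica() for _ in range(4)])
# the fourth pick wraps back to the first address
```

A load-aware balancer would replace `cycle` with a choice based on each replica's reported load or open-connection count.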

This also supports scaling - adding or removing replicas so total computing resources match demand. A load balancer helps bring a new replica online for clients or remove an old one by directing traffic to healthy processes.

For message processing, distribution is handled by a queue or message broker, which also uses load-balancing algorithms.

Reducing dependencies

Still, these measures cannot fully prevent outages. Poor architecture can cause a chain reaction where the failure of one module cascades to others, making the whole system unavailable. To avoid this, we must minimize dependencies - communicate only through APIs, keep API models small, and group related features in one module so each handles a separate business domain.

This limits the “blast radius” of failures, improving overall availability.

Asynchronous interaction

However, even within one module’s boundaries, failures can affect others that depend on it. To avoid this, synchronous calls (client-server) can be replaced with asynchronous communication (message-based or event-based).

With message-based communication, services do not depend on each other at runtime. If one fails, the result is only a delay, not a failure. Still, logical dependency remains.

Event-based communication removes even that. Services no longer know about each other - they react to events happening in the system, published by other services. This approach underlies event sourcing, where the system’s state is stored as a sequential log of events rather than as models in a database.
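A minimal in-process sketch of event-based communication - the event name and handlers are invented for illustration, and a real system would publish through a broker such as Kafka rather than an in-memory bus:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Publishers emit events; subscribers react without knowing the publisher."""

    def __init__(self):
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
log: list[str] = []

# Billing and shipping react to the same event independently;
# neither knows about the service that published it.
bus.subscribe("order.created", lambda e: log.append(f"bill {e['id']}"))
bus.subscribe("order.created", lambda e: log.append(f"ship {e['id']}"))
bus.publish("order.created", {"id": 42})
```

If the shipping handler were down in a broker-backed setup, the event would simply wait in the log - a delay, not a failure.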

Platform

  flowchart LR
  User[User]
  LB[API Load Balancer]
  CLB[Application<br/>Load Balancer]

  subgraph CP["Control-Plane"]
    direction TB
    CP1[cp-node-1]
    CP2[cp-node-2]
    CP3[cp-node-3]
  end

  subgraph WP["Worker Nodes"]
    direction LR
    W1[worker-1]
    W2[worker-2]
    W3[worker-3]
  end

  %% Access path
  User --> CLB
  CLB --> W1
  CLB --> W2
  CLB --> W3

  LB --> CP1
  LB --> CP2
  LB --> CP3

  W1 --> LB
  W2 --> LB
  W3 --> LB

Many engineers believe that distributed, scalable, and ideally asynchronous architecture guarantees the magic “four nines” or “five nines” (99.999%) availability. But that’s not enough - the infrastructure running the application must also be fault-tolerant, from the platform down to the lowest level.

Cluster architecture

Modern platforms like Kubernetes or Hadoop use cluster architecture - a set of nodes (virtual or physical machines) that share load. Clusters allow scaling resources by adding or removing machines.

Replica distribution

Another key method is spreading service replicas across nodes (anti-affinity). This avoids placing all replicas on one node, where a single failure could take the whole service down.
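A toy scheduler illustrating the anti-affinity idea: each new replica goes to the node that currently hosts the fewest replicas of that service (node names are hypothetical; real platforms like Kubernetes express this declaratively):

```python
def place_replicas(service: str, count: int, nodes: list[str]) -> dict[str, str]:
    """Spread replicas of one service across nodes, least-loaded node first."""
    load = {node: 0 for node in nodes}
    placement: dict[str, str] = {}
    for i in range(count):
        node = min(load, key=load.get)  # node with fewest replicas so far
        placement[f"{service}-{i}"] = node
        load[node] += 1
    return placement

print(place_replicas("web", 3, ["node-1", "node-2", "node-3"]))
# with 3 nodes and 3 replicas, each replica lands on a different node
```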

Self-healing

Platforms can also self-heal. If a node fails, the cluster automatically creates a new one and moves the workload. A distributed application architecture plus a load balancer prevents downtime here, and anti-affinity ensures not all replicas are lost at once.

Data

Availability is not just about the app - data safety and availability are often even more critical. Managing data in a high-load, complex system requires deep expertise.

  flowchart LR
  U[Users]
  LB[Load Balancer]
  APP[App Service]
  CACHE[(Cache)]
  subgraph database
    direction LR
    DBW[(Primary DB / Writer)]
    DBR[(Read Replicas)]
  end

  U --> LB --> APP
  APP -->|read| CACHE
  CACHE -->|miss| DBR
  APP -->|write| DBW
  DBW --replicates--> DBR
  DBW -.invalidate.-> CACHE

Redundancy

Data faces many risks - read/write errors, failed transactions, database faults. The solution is again redundancy. Databases run multiple parallel replicas that copy data between them. Typically, writes go to one replica (the leader), while reads are possible on all. If one replica fails, the database load balancer reroutes queries. If the leader fails, another replica is promoted to become the new leader.
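The single-writer, many-readers split can be sketched as a routing rule (connection handling is omitted; the class and addresses are illustrative):

```python
import random

class ReplicaRouter:
    """Route writes to the leader, reads to any replica (leader included)."""

    def __init__(self, leader: str, read_replicas: list[str]):
        self.leader = leader
        self.read_replicas = read_replicas

    def route(self, query: str) -> str:
        is_write = query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
        if is_write:
            return self.leader
        return random.choice([self.leader, *self.read_replicas])

    def promote(self, replica: str) -> None:
        """Failover: a read replica becomes the new leader."""
        self.read_replicas.remove(replica)
        self.leader = replica

router = ReplicaRouter("db-1", ["db-2", "db-3"])
print(router.route("INSERT INTO orders VALUES (1)"))  # always the leader, db-1
```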

A similar approach works for message brokers. In Apache Kafka, each topic partition is written to a leader broker and replicated to others. If that broker fails, one of the remaining replicas takes over and keeps the partition available.

Caching

Another tool for availability is caching. Both services and databases may be temporarily unavailable. Without caching, requests that depend on them will fail. With caching, we can serve a recent version of the data, avoiding failure and reducing inter-component dependency.
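A cache-aside read path in miniature - `slow_db_read` stands in for a real database call, and the in-memory dicts are stand-ins for a cache server and a database:

```python
cache: dict[str, str] = {}
db = {"user:1": "Alice"}
db_reads: list[str] = []

def slow_db_read(key: str) -> str:
    db_reads.append(key)  # track how often we actually hit the database
    return db[key]

def get(key: str) -> str:
    if key in cache:            # hit: serve from cache, no DB dependency
        return cache[key]
    value = slow_db_read(key)   # miss: fetch from the DB and populate the cache
    cache[key] = value
    return value

get("user:1")  # miss -> one DB read
get("user:1")  # hit  -> served from cache, DB not touched
```

If the database goes down after the first read, the second request still succeeds - at the cost of possibly serving slightly stale data.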

Infrastructure

Infrastructure includes virtual machines, disks, and networks - the foundation of any system.

  flowchart LR
  LB[Load Balancer]

  subgraph Rack["Primary Rack (Compute Hosts)"]
    direction LR
    VM1[VM #1]
    VM2[VM #2]
  end

  Storage[(Shared Storage / Datastore)]
  Mgmt[Cluster Manager]
  DR[(DR Site / Replicated Storage)]

  %% Traffic
  LB --> VM1
  LB --> VM2

  %% Storage & DR
  VM1 --> Storage
  VM2 --> Storage
  Storage -.replication.-> DR

  %% Orchestration & Hot Migration
  Mgmt --> Rack
  VM1 -.hot migration.-> VM2

Virtual machines

VMs often support hot (live) migration to another physical host. When a host degrades or must be taken offline, the VM’s memory is copied and its storage reattached to another host, restoring service in seconds.

Cloud providers also offer instance groups - sets of identical VMs that the provider monitors and automatically replaces if they fail, with optional auto-scaling.

Networks and load balancers

A working network is essential. High network availability is achieved through SDN (software-defined networking) and transport protocols like TCP, which ensures packet delivery through checksums and retransmission.

Load balancers themselves must be replicated so they do not become a single point of failure.

Data storage

Reliable storage systems like Ceph split data into small chunks, replicate them across many disks, and enable fast reads and quick recovery if a disk fails.

Hardware

At the hardware level, solutions like RAID combine multiple drives into a redundant array so data remains available if one disk fails.
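The idea behind RAID's parity redundancy can be shown with XOR: store a parity block alongside the data blocks, and if one disk's block is lost, XOR-ing the survivors with the parity recovers it (RAID 5 in miniature; real arrays stripe data and rotate parity across disks):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

disk1, disk2 = b"hello", b"world"
parity = xor_blocks(disk1, disk2)      # stored on a third disk

# disk1 fails: rebuild its contents from the surviving disk and the parity
recovered = xor_blocks(disk2, parity)
print(recovered)  # b'hello'
```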

In data centers, servers from the same cluster are placed in different racks to avoid losing them all if one rack’s power fails. Networks also have backup internet links to keep connectivity.

Global availability

High availability extends beyond one data center. A whole data center can still fail - due to fire, earthquake, or even an excavator cutting cables. Cloud providers therefore offer availability zones - three or more data centers within a region, connected by high-speed networks. Platforms and databases are spread across these zones, so losing one does not stop operations.

  flowchart LR
  U[Users 🌍]
  GSLB[Global Server Load Balancer]

  subgraph R1["Region A"]
    R1LB[Regional Load Balancer]
    R1APP[App/Service Pool]
  end

  subgraph R2["Region B"]
    R2LB[Regional Load Balancer]
    R2APP[App/Service Pool]
  end

  U --> GSLB
  GSLB -->|primary / nearest| R1LB
  GSLB -->|failover / overflow| R2LB
  R1LB --> R1APP
  R2LB --> R2APP

For critical services, even a full-region outage must be considered. In such cases, services are replicated across multiple regions worldwide, with GSLB (global server load balancing) and BGP Anycast directing users to the nearest available server.

Closing thoughts

This is a large but incomplete list of techniques for improving application availability. It’s a complex topic requiring expertise at all levels. Most of these complexities are hidden from everyday engineers by cloud providers or internal IT departments.

Still, understanding them is essential for designing critical systems with high availability. Since total availability is the product of each component’s availability, it’s important to know how it’s achieved at each level and how to combine these approaches into a working solution.
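Since the availability of serially dependent components multiplies, two "three nines" components in a chain already dip below three nines together - a quick calculation makes the point:

```python
from math import prod

def chain_availability(*availabilities: float) -> float:
    """Availability of components that must all be up (in series)."""
    return prod(availabilities)

print(chain_availability(0.999, 0.999))  # 0.998001 - below three nines
```

This is why every layer described above matters: a single weak component caps the availability of the whole system.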
