Review of “Designing Distributed Systems”

June 18, 2019

Filed under: Computing,Technology — admin @ 11:34 am

“Designing Distributed Systems” provides design patterns for building distributed systems with the support of container technologies such as Kubernetes. The book consists of three sections: the first covers single-node patterns, the second covers long-running services, and the third covers batch computation.

Sidecar

The first pattern in the book is the sidecar pattern, which improves modularity and reusability. A single application is made up of two containers: the application container and a sidecar container that provides additional functionality, such as adding an SSL proxy in front of the service or collecting metrics for the application container. The sidecar container can be configured via a dynamic configuration service.

Ambassadors

The ambassador pattern introduces an ambassador container that sits between the application and external services, so that all incoming/outgoing traffic goes through it. It also helps with modularity and reusability: the ambassador may hide a sharded service (or A/B testing) so that neither the client nor the service itself needs to know all the details. You may also use an ambassador container for service brokering, where it looks up an external service and connects to it.

Anti-corruption layer

The anti-corruption layer sits between two systems that don’t share the same semantic data model and translates requests between them.

Adapters

The adapter pattern uses a special container to modify the interface of the application container, e.g. you can deploy a monitoring adapter to automatically collect health metrics using Prometheus or other tools. Similarly, you may use an adapter container to collect Kubernetes logs (stdout/stderr) and reformat them before sending them to a log aggregator (Fluentd).

Replicated Load-Balanced Services

This pattern is part of the long-running services section, where a load balancer is added in front of the service for scalability. Each service is designed to be stateless so that requests can be sent to any replica behind the load balancer. Each replica needs to provide a readiness probe so that the load balancer knows whether it can serve requests. In some cases, you may need to support session-tracked services, where a user’s requests are routed to the same replica using sticky sessions or a consistent hashing function. You may add a caching layer (e.g. Varnish) that is deployed along with your service container (as a sidecar). Further, you may need to provide rate limiting to protect against DoS attacks (reporting the remaining quota in X-RateLimit-Remaining headers). This pattern can also implement SSL termination, where external traffic is encrypted with a different certificate than internal traffic.

Sharded Services

This pattern partitions the traffic so that each shard serves a subset of all requests. As opposed to replicated services, which are generally stateless, sharded services are used for building stateful services. You may put a sharded cache between the user and the front-end to improve end-user performance and latency. You may add replicas for each shard for further redundancy and scalability. Sharding requires selecting a key to route the traffic, e.g. you may use the IP address, together with a consistent hash function to avoid remapping most keys when new shards are added. If one of the shards becomes hot, you can add a replicated sharded cache to handle the increased load.
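
To make the point about remapping concrete, here is a minimal sketch of a consistent hash ring. It is not taken from the book: the Ring type, the shard names, the number of virtual nodes and the FNV hash are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a hypothetical consistent-hash ring that maps keys to shard names.
type Ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> shard name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes virtual points per shard on the ring.
func NewRing(shards []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, s := range shards {
		for i := 0; i < vnodes; i++ {
			p := hashOf(fmt.Sprintf("%s#%d", s, i))
			if _, taken := r.owner[p]; taken {
				continue // skip the rare hash collision
			}
			r.owner[p] = s
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the shard that owns the first point at or after the key's hash.
func (r *Ring) Lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around to the start of the ring
	}
	return r.owner[r.points[i]]
}

// sampleKeys generates n made-up routing keys.
func sampleKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("user-%d", i)
	}
	return keys
}

func main() {
	old := NewRing([]string{"shard-a", "shard-b", "shard-c"}, 100)
	grown := NewRing([]string{"shard-a", "shard-b", "shard-c", "shard-d"}, 100)
	keys := sampleKeys(10000)
	moved := 0
	for _, k := range keys {
		if old.Lookup(k) != grown.Lookup(k) {
			moved++
		}
	}
	// Only keys taken over by shard-d move; the rest stay where they were.
	fmt.Printf("%d of %d keys moved\n", moved, len(keys))
}
```

With a plain hash(key) % N scheme, going from three to four shards would remap most keys. With the ring, a key either stays put or moves to the newly added shard.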

Scatter/Gather

The scatter/gather pattern adds parallelism to request handling: the work is broken up and farmed out to multiple services, and the results are aggregated before returning to the user. For example, you can implement distributed document search by farming the query out to multiple leaf machines that return matching documents, while the root node aggregates the results. You can also add support for sharded data by searching each shard in parallel, with the root node generating the union of all documents returned by each shard (leaf node). One downside of this pattern is the straggler problem: the total response time depends on the slowest response, so you may need to replicate each shard to add computational power.
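
The root/leaf split can be sketched with goroutines. This is only an illustration: the in-memory shards and the searchShard and scatterGather functions are made-up stand-ins for real leaf services.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// searchShard stands in for a leaf node: it returns the documents in one
// shard that contain the term.
func searchShard(docs []string, term string) []string {
	var hits []string
	for _, d := range docs {
		if strings.Contains(d, term) {
			hits = append(hits, d)
		}
	}
	return hits
}

// scatterGather plays the root node: it queries every shard in parallel
// and merges the partial results into one sorted list.
func scatterGather(shards [][]string, term string) []string {
	results := make([][]string, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []string) {
			defer wg.Done()
			results[i] = searchShard(shard, term) // each goroutine writes its own slot
		}(i, shard)
	}
	wg.Wait() // total latency is set by the slowest shard (the straggler)

	var merged []string
	for _, r := range results {
		merged = append(merged, r...)
	}
	sort.Strings(merged)
	return merged
}

func main() {
	shards := [][]string{
		{"raft consensus", "paxos made simple"},
		{"consensus in the cloud", "load balancing"},
		{"sharded caches"},
	}
	fmt.Println(scatterGather(shards, "consensus"))
	// prints: [consensus in the cloud raft consensus]
}
```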

Functions and Event-Driven Processing

This pattern is used to implement function-as-a-service (FaaS) products. FaaS simplifies development and deployment as the code is managed and scaled automatically. However, FaaS requires that you decouple your application into small parts that can be run independently. FaaS uses event systems to communicate with each function or to create a data pipeline. You can use external data services for storing state that is shared by these functions.

Ownership Election

This pattern helps in a multi-node environment where a specific task must be owned by a single process. For example, when you have multiple replicas, you may need to elect a master using a consensus algorithm such as Paxos or Raft, or a system built on one such as etcd, ZooKeeper or Consul. You can use distributed locks to implement ownership (optionally with a lease or TTL). You may need to verify that you still hold the lock before proceeding, e.g.

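// Assumes: type Lock struct { locked bool; lockTime time.Time; ttl time.Duration } and import "time".
// isLocked reports whether the lock is held and less than 75% of its TTL has elapsed.
func (l *Lock) isLocked() bool {
	deadline := l.lockTime.Add(l.ttl * 3 / 4) // leave a safety margin before the lease expires
	return l.locked && time.Now().Before(deadline)
}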

Work Queue Systems

This pattern is part of the batch computation section and handles work items that must each be processed within a certain amount of time. You may use a work-queue manager container along with an ambassador container to connect to an external queue source, where the source might be a storage API, network storage, or a pub/sub system like Kafka or Redis. Once the queue manager receives a work item, it launches a worker container. Kubernetes provides a Job object that allows for the reliable execution of the work queue. In order to limit the number of worker containers running concurrently, you can limit the number of Job objects that your work queue is willing to create. You can also use the multi-worker pattern, where different worker containers are combined into a single unified container that implements the worker interface.

Event-Driven Batch Processing

This pattern allows data pipelining, where the output of one work queue becomes the input to another work queue; such pipelines are referred to as workflow systems. Here are the patterns of event-driven processing:
Copier: This pattern just duplicates the work item into two or more identical streams.
Filter: This pattern reduces a stream of work items to a smaller stream by filtering out items that don’t meet particular criteria.
Splitter: This works like a filter, but instead of eliminating input, it sends different inputs to different queues based on criteria.
Sharder: This is a more generic form of splitter and divides a stream of work items among several queues based on a sharding function.
Merger: This is the opposite of copier and merges two different work queues into a single work queue.

You may use a pub/sub API to communicate between different workers.
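
As a rough sketch of how such stages compose, the following uses Go channels as stand-ins for work queues. In a real deployment each stage would be a container reading from a pub/sub topic; the function names and the integer work items are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// source emits the given work items on a queue and then closes it.
func source(items ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, v := range items {
			out <- v
		}
	}()
	return out
}

// filter forwards only the items that satisfy keep.
func filter(in <-chan int, keep func(int) bool) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			if keep(v) {
				out <- v
			}
		}
	}()
	return out
}

// splitter routes each item to one of two queues based on a criterion.
func splitter(in <-chan int, toFirst func(int) bool) (<-chan int, <-chan int) {
	a, b := make(chan int), make(chan int)
	go func() {
		defer close(a)
		defer close(b)
		for v := range in {
			if toFirst(v) {
				a <- v
			} else {
				b <- v
			}
		}
	}()
	return a, b
}

// merger combines several queues into a single queue.
func merger(ins ...<-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	for _, in := range ins {
		wg.Add(1)
		go func(in <-chan int) {
			defer wg.Done()
			for v := range in {
				out <- v
			}
		}(in)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// runPipeline drops non-positive items, splits the rest into even and odd
// queues, merges them again and returns the sum of what came through.
func runPipeline(items ...int) int {
	positives := filter(source(items...), func(v int) bool { return v > 0 })
	evens, odds := splitter(positives, func(v int) bool { return v%2 == 0 })
	sum := 0
	for v := range merger(evens, odds) {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(runPipeline(-1, 1, 2, 3, 4)) // prints 10
}
```

Because the channels are unbuffered, both outputs of the splitter must be consumed concurrently, which the merger does here.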

Coordinated Batch Processing

This pattern resembles the Reduce part of MapReduce: after work is broken up and distributed to multiple nodes in parallel, the outputs have to be brought back together. You may need Join or Barrier Synchronization to wait for intermediate results before proceeding to the next stage of the workflow. For example, the reduce phase merges several outputs into a single output.

Claim-check

Instead of sending large messages to a messaging queue, you can store the payload in an external storage service and send only a reference to it in the messaging event.
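
A minimal sketch of the idea, assuming an in-memory map in place of the external storage service and a Go channel in place of the messaging queue (BlobStore, Message, send and receive are hypothetical names):

```go
package main

import "fmt"

// BlobStore is an in-memory stand-in for an external storage service.
type BlobStore struct {
	blobs map[string][]byte
	next  int
}

func NewBlobStore() *BlobStore {
	return &BlobStore{blobs: make(map[string][]byte)}
}

// Put stores the payload and returns the claim check (a reference to it).
func (s *BlobStore) Put(payload []byte) string {
	s.next++
	ref := fmt.Sprintf("blob-%d", s.next)
	s.blobs[ref] = payload
	return ref
}

// Get redeems a claim check.
func (s *BlobStore) Get(ref string) ([]byte, bool) {
	p, ok := s.blobs[ref]
	return p, ok
}

// Message is what travels through the queue: small metadata plus the reference.
type Message struct {
	Kind string
	Ref  string
}

// send stores the payload externally and enqueues only the reference.
func send(store *BlobStore, queue chan<- Message, kind string, payload []byte) {
	queue <- Message{Kind: kind, Ref: store.Put(payload)}
}

// receive dequeues one message and fetches the payload it points to.
func receive(store *BlobStore, queue <-chan Message) ([]byte, bool) {
	msg := <-queue
	return store.Get(msg.Ref)
}

func main() {
	store := NewBlobStore()
	queue := make(chan Message, 1)
	send(store, queue, "report", make([]byte, 5<<20)) // the 5 MiB payload stays out of the queue
	payload, ok := receive(store, queue)
	fmt.Println(len(payload), ok) // prints: 5242880 true
}
```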

Index Table

If the data-store doesn’t support secondary indexes, you can define a separate index table whose primary key is the field you want to query by, such as the customer phone number. You can use the same partition key for both the fact table and the index tables so that they are colocated.

Compensating Transaction

When invoking several external services under an eventual consistency model, any of which can fail, you can use a compensating transaction to undo the operation and revert the changes.

Saga distributed transactions

The saga pattern uses a sequence of transactions to invoke multiple services; if a step fails, it executes compensating transactions for the steps that have already completed.
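
Here is a small sketch of a saga runner, assuming in-process functions in place of remote service calls. The Step type, RunSaga and the bookTrip scenario are invented for illustration, and a production saga would also need to persist its progress.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs an action with the compensating action that undoes it.
type Step struct {
	Name       string
	Do         func() error
	Compensate func()
}

// RunSaga executes the steps in order. If one fails, it runs the
// compensations of the already completed steps in reverse order.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Do(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].Compensate()
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
	}
	return nil
}

// bookTrip is a hypothetical saga; it returns the log of actions taken.
func bookTrip(paymentFails bool) ([]string, error) {
	var log []string
	undo := func(msg string) func() {
		return func() { log = append(log, msg) }
	}
	do := func(msg string) func() error {
		return func() error {
			log = append(log, msg)
			return nil
		}
	}
	charge := func() error {
		if paymentFails {
			return errors.New("card declined")
		}
		log = append(log, "charge card")
		return nil
	}
	err := RunSaga([]Step{
		{"flight", do("book flight"), undo("cancel flight")},
		{"hotel", do("book hotel"), undo("cancel hotel")},
		{"payment", charge, undo("refund card")},
	})
	return log, err
}

func main() {
	log, err := bookTrip(true)
	fmt.Println(log, err)
	// prints: [book flight book hotel cancel hotel cancel flight] step "payment" failed: card declined
}
```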

Scheduler Agent Supervisor

This pattern coordinates distributed actions as a single transaction and undoes the work if any action fails. It uses a scheduler to orchestrate the steps of the transaction, an agent to invoke a remote service and a supervisor to monitor the task being performed.

Competing Consumers

This allows multiple concurrent consumers to receive messages from the same messaging queue, which improves scalability, reliability and availability.

Command and Query Responsibility Segregation (CQRS)

CQRS separates query and update operations for a data-store. Commands are task/transaction-based operations that update data in the write model, while the query model is optimized for high-performance reads.

Event Sourcing

Event sourcing stores the state of a domain object in an append-only store as a sequence of events, which are replayed to rebuild the current state of the application. This improves performance and scalability, and provides an audit trail and a basis for compensating actions.
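
A toy example may help here. It assumes a bank-account domain with two made-up event kinds and an in-memory slice as the append-only store.

```go
package main

import "fmt"

// Event is an immutable fact appended to the store.
type Event struct {
	Kind   string // "deposited" or "withdrawn" (hypothetical event kinds)
	Amount int
}

// EventStore is an in-memory append-only log.
type EventStore struct {
	events []Event
}

// Append adds an event; existing events are never modified or removed.
func (s *EventStore) Append(e Event) {
	s.events = append(s.events, e)
}

// Len returns the number of recorded events (the audit trail length).
func (s *EventStore) Len() int { return len(s.events) }

// Balance rebuilds the current state by replaying every event from the start.
func (s *EventStore) Balance() int {
	balance := 0
	for _, e := range s.events {
		switch e.Kind {
		case "deposited":
			balance += e.Amount
		case "withdrawn":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	store := &EventStore{}
	store.Append(Event{"deposited", 100})
	store.Append(Event{"withdrawn", 30})
	// A compensating action is just another event; history is not rewritten.
	store.Append(Event{"deposited", 30})
	fmt.Println(store.Balance(), store.Len()) // prints: 100 3
}
```

Real systems usually add snapshots so that they do not have to replay the whole log on every read.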

Asynchronous Request-Reply

Front-end applications often expect synchronous APIs, while backend processing may be asynchronous. This pattern puts a request-reply facade in front of the asynchronous backend APIs so that the front-end is decoupled from them.

Retry

The retry pattern recovers from transient failures by retrying the failed operation after an exponentially increasing delay with jitter.
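
A minimal sketch of such a retry loop follows. The "full jitter" variant (a random delay between zero and the capped exponential delay) is one common choice rather than the only one, and backoff, Retry and errTransient are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errTransient is a made-up error used by the demo.
var errTransient = errors.New("transient failure")

// backoff returns the delay before retry number attempt (0-based):
// base doubled attempt times, capped at max, then randomized to a value
// in [0, delay] ("full jitter"). base and max must not be negative.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// Retry calls op up to attempts times, sleeping between failed attempts.
func Retry(attempts int, base, max time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoff(i, base, max))
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := Retry(5, 10*time.Millisecond, 200*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // the first two calls fail
		}
		return nil
	})
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

The jitter matters when many clients fail at the same moment: without it they would all retry in lockstep and hit the service again together.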

Circuit Breaker

When a network request keeps failing because of a fault that may take a while to clear, the circuit breaker can be used to prevent an application from repeatedly retrying an operation that is likely to fail.
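
The state machine behind this can be sketched in a few lines. This version is an assumption-laden simplification: it counts consecutive failures, is not safe for concurrent use, and the Breaker type, ErrOpen and errDown are invented names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrOpen is returned while the breaker is rejecting calls.
var ErrOpen = errors.New("circuit open")

// errDown is a made-up error used by the demo.
var errDown = errors.New("service down")

// Breaker opens after threshold consecutive failures. Once cooldown has
// passed it lets a trial call through (half-open); a success closes it.
type Breaker struct {
	threshold int
	cooldown  time.Duration
	failures  int
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Call runs op unless the circuit is open.
func (b *Breaker) Call(op func() error) error {
	if b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown {
		return ErrOpen // fail fast without touching the remote service
	}
	if err := op(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now() // open (or re-open) the circuit
		}
		return err
	}
	b.failures = 0 // a success closes the circuit
	return nil
}

func main() {
	b := NewBreaker(2, time.Minute)
	calls := 0
	failing := func() error {
		calls++
		return errDown
	}
	fmt.Println(b.Call(failing)) // service down
	fmt.Println(b.Call(failing)) // service down
	fmt.Println(b.Call(failing)) // circuit open
	fmt.Println(calls)           // 2: the third call never reached the service
}
```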

Bulkhead

In the bulkhead pattern, components of an application are isolated so that a failure in one component doesn’t cause a cascading failure. For example, you may use different database connection pools, thread pools or rate limits for different types of requests.
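
One simple way to sketch this in Go is a counting semaphore per request class. The Bulkhead type, ErrFull and the two request classes below are assumptions for the example; rejecting immediately when full is a design choice, and queueing with a timeout is another option.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrFull is returned when a compartment has no free slot.
var ErrFull = errors.New("bulkhead full")

// Bulkhead caps the number of concurrent calls for one class of requests.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(limit int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, limit)}
}

// Run executes op if a slot is free and rejects immediately otherwise,
// so a saturated compartment cannot tie up callers of other compartments.
func (b *Bulkhead) Run(op func()) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }() // release the slot when op returns
		op()
		return nil
	default:
		return ErrFull
	}
}

func main() {
	reports := NewBulkhead(1)  // slow, low-priority requests
	checkout := NewBulkhead(1) // critical requests get their own compartment
	reports.Run(func() {
		// While the reports compartment is saturated...
		fmt.Println(reports.Run(func() {}))  // bulkhead full
		fmt.Println(checkout.Run(func() {})) // <nil>
	})
}
```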

Orchestration vs Choreography

Micro-services can communicate with other services using an orchestration model, where a centralized service orchestrates requests to other services and acknowledges all operations. Alternatively, you can use message-driven or asynchronous messaging to implement a choreography design, where each service reacts to events without a central coordinator.

Deployment Stamps

In order to scale deployment by tenants or groups, you can use deployment stamps (scale units), each hosting an independent instance of the application. This provides scalability, sharding and separation of data per subset of customers.

External Configuration Store

The external configuration store keeps application configuration in a centralized location, where it can be cached and shared across applications.

Edge Workload Configuration

The edge workload configuration pattern is often used with IoT deployments to allow scaling of the data-store and configuration at the edge. For example, you may define an edge-configuration controller and an edge-configuration data-store to host device configurations that are updated asynchronously from the cloud configuration controller and data-store.

Federated Identity

Federated identity delegates authentication to an external identity provider so that user management, authentication and authorization can be simplified.

Gatekeeper

The gatekeeper acts as a broker between clients and services so that requests can be sanitized, throttled and authenticated.

Gateway Aggregation

A client may need to invoke multiple data services to assemble application state. With gateway aggregation, the client instead sends a single request to a gateway, which invokes the services and aggregates their responses.

Gateway Offloading

When communicating with a service that has special requirements for connections or security, you can use a gateway to proxy all requests so that other services don’t need to implement all those measures.

Gateway Routing

This allows clients to use a single endpoint to access all services, so that requests can be authenticated, throttled and validated consistently.

Geode

This allows deploying applications to geographically distributed nodes or regions in active/active style so that latency and availability can be improved.

Health Endpoint Monitoring

This exposes an endpoint that external tools can call to check application health for monitoring purposes.

Leader Election

In order to avoid conflicts, a single task instance can be elected as a leader so that it coordinates actions of other subordinate tasks.

Pipes and Filters

The pipes and filters pattern decomposes a task into a series of components that can be reused, which improves performance, scalability, and reusability.

Rate Limiting

Rate limiting controls resource consumption based on limits so that you can predict throughput accurately.
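
The post does not name a specific algorithm, so as one possible illustration here is a token bucket, a common way to enforce such limits. Timestamps are passed in as plain seconds to keep the example deterministic; the TokenBucket type and its parameters are assumptions.

```go
package main

import "fmt"

// TokenBucket allows bursts of up to capacity requests and refills at
// rate tokens per second.
type TokenBucket struct {
	capacity float64
	rate     float64
	tokens   float64
	last     float64 // time of the previous call, in seconds
}

// NewTokenBucket returns a full bucket created at time now (seconds).
func NewTokenBucket(capacity, rate, now float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: now}
}

// Allow reports whether a request arriving at time now (seconds) may proceed.
func (b *TokenBucket) Allow(now float64) bool {
	if elapsed := now - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rate // refill for the time that has passed
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := NewTokenBucket(2, 1, 0) // burst of 2, refill 1 token per second
	for _, t := range []float64{0, 0, 0, 1, 1, 10} {
		fmt.Println(t, b.Allow(t))
	}
	// prints:
	// 0 true
	// 0 true
	// 0 false
	// 1 true
	// 1 false
	// 10 true
}
```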

Sequential Convoy

This allows processing a group of messages in sequential order by using consumers that are partitioned by the group identifier.

Valet Key

The valet key acts as a token that provides access to a restricted resource. For example, the client first requests a valet key, which is then used to access the resource within the lease time.
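
One way such a key can be built is as an HMAC-signed token that names the resource and its expiry. This is a sketch under several assumptions: the issue and verify functions, the "|" separator (so resource names must not contain it) and the use of Unix seconds are all made up for the example, and real storage services define their own token formats.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// sign returns the hex-encoded HMAC-SHA256 of payload under secret.
func sign(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// issue returns a valet key granting access to resource until expiry
// (Unix seconds). The resource name must not contain "|".
func issue(secret []byte, resource string, expiry int64) string {
	payload := resource + "|" + strconv.FormatInt(expiry, 10)
	return payload + "|" + sign(secret, payload)
}

// verify checks the signature, the resource and the lease time.
func verify(secret []byte, key, resource string, now int64) bool {
	parts := strings.Split(key, "|")
	if len(parts) != 3 {
		return false
	}
	payload := parts[0] + "|" + parts[1]
	if !hmac.Equal([]byte(parts[2]), []byte(sign(secret, payload))) {
		return false // forged or tampered key
	}
	expiry, err := strconv.ParseInt(parts[1], 10, 64)
	return err == nil && parts[0] == resource && now < expiry
}

func main() {
	secret := []byte("demo-secret")
	key := issue(secret, "reports/q2.pdf", 1000)
	fmt.Println(verify(secret, key, "reports/q2.pdf", 900))  // true
	fmt.Println(verify(secret, key, "reports/q2.pdf", 1000)) // false: lease expired
	fmt.Println(verify(secret, key, "reports/q3.pdf", 900))  // false: different resource
}
```

The resource server only needs the shared secret to validate a key, so the client can go to storage directly without the application proxying the data.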

Shahzad Bhatti Welcome to my ramblings and rants!
