Patterns are typical solutions to common problems in various phases of the software development lifecycle, and you may find many books and resources on various types of patterns.
However, you may also find low-level implementation patterns that developers often apply to common coding problems. Following are a few of these coding patterns that I have found particularly fascinating and pragmatic in my experience:
Functional Options Pattern
The options pattern is used to pass configuration options to a method, e.g., you can pass a struct that holds configuration settings. These configuration properties may have default values that are used when they are not explicitly defined. You may implement the options pattern using the builder pattern to initialize config properties, but it requires explicitly building the configuration options even when there is nothing to override. In addition, error handling with the builder pattern poses more complexity when chaining methods. The functional options pattern, on the other hand, defines each configuration option as a function (referenced in Functional Options in Go and 100 Go Mistakes and How to Avoid Them), which can validate the configuration option and return an error for invalid data. For example:
type options struct {
    port    *int
    timeout *time.Duration
}

type Option func(opts *options) error

func WithPort(port int) Option {
    return func(opts *options) error {
        if port < 0 {
            return errors.New("port should be positive")
        }
        opts.port = &port
        return nil
    }
}

func NewServer(addr string, opts ...Option) (*http.Server, error) {
    var options options
    for _, opt := range opts {
        err := opt(&options)
        if err != nil {
            return nil, err
        }
    }
    var port int
    if options.port == nil {
        port = defaultHTTPPort
    } else {
        if *options.port == 0 {
            port = randomPort()
        } else {
            port = *options.port
        }
    }
    ...
}
You can then pass configuration options as follows:
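For instance, here is a minimal usage sketch that assumes the definitions above and the standard log package (the address and port values are illustrative):
func main() {
    // Override only the port; omitted options fall back to their defaults inside NewServer.
    server, err := NewServer("localhost", WithPort(8081))
    if err != nil {
        log.Fatal(err) // invalid option values surface here
    }
    log.Printf("server configured with address %s", server.Addr)
}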
The above solution allows handling errors when overriding default values fails, and it supports passing an empty list of options. In languages where errors can propagate implicitly to the calling code (for example, via exceptions), you may still use the builder pattern for configuration, such as:
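For example, a hypothetical Java builder might look like the following sketch (class and property names are illustrative, not from the original example); invalid values surface as exceptions thrown while chaining or from build():
public final class ServerOptions {
    private final int port;
    private final int timeoutMillis;

    private ServerOptions(Builder builder) {
        this.port = builder.port;
        this.timeoutMillis = builder.timeoutMillis;
    }

    public static final class Builder {
        private int port = 8080;          // default value
        private int timeoutMillis = 5000; // default value

        public Builder port(int port) {
            if (port < 0) {
                throw new IllegalArgumentException("port should be positive");
            }
            this.port = port;
            return this;
        }

        public Builder timeoutMillis(int timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
            return this;
        }

        public ServerOptions build() {
            return new ServerOptions(this);
        }
    }
}

// Usage: ServerOptions options = new ServerOptions.Builder().port(9090).build();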
However, this approach still requires building the configuration object even when no properties are overridden.
State Pattern with Enum
The state pattern is one of the GoF design patterns and is used to implement finite-state machines or the strategy pattern. The state pattern can be easily implemented using “sum” (alternative) algebraic data types with union or enum constructs. For example, here is an implementation of the state pattern in Rust:
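A minimal sketch of such a state machine follows (the order workflow, states and events are illustrative):
#[derive(Debug)]
enum OrderState {
    Created,
    Paid,
    Shipped,
    Cancelled,
}

impl OrderState {
    // Attempt a transition; invalid transitions are rejected at runtime.
    fn next(self, event: &str) -> Result<OrderState, String> {
        match (self, event) {
            (OrderState::Created, "pay") => Ok(OrderState::Paid),
            (OrderState::Created, "cancel") => Ok(OrderState::Cancelled),
            (OrderState::Paid, "ship") => Ok(OrderState::Shipped),
            (state, event) => Err(format!("invalid transition from {:?} on {}", state, event)),
        }
    }
}

fn main() {
    let state = OrderState::Created;
    let state = state.next("pay").unwrap();
    println!("current state: {:?}", state.next("ship").unwrap());
}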
However, the above implementation relies on the runtime to validate state transitions. Alternatively, you can use structs (the typestate approach) to check valid transitions at compile time, e.g.:
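A minimal typestate-style sketch (illustrative): each state is a separate struct, and transitions are only defined for the states where they are valid, so invalid transitions fail to compile:
struct Created;
struct Paid;
struct Shipped;

struct Order<State> {
    id: u64,
    state: State,
}

impl Order<Created> {
    fn new(id: u64) -> Self {
        Order { id, state: Created }
    }
    fn pay(self) -> Order<Paid> {
        Order { id: self.id, state: Paid }
    }
}

impl Order<Paid> {
    fn ship(self) -> Order<Shipped> {
        Order { id: self.id, state: Shipped }
    }
}

fn main() {
    let order = Order::new(1).pay().ship();
    // Order::new(2).ship(); // does not compile: ship() is only defined for Order<Paid>
    println!("order {} shipped", order.id);
}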
Recursion with Thunks and Trampolines
Recursion uses divide and conquer to solve complex problems, where a function calls itself to break a problem down into smaller subproblems. However, each recursive call adds a stack frame to the call stack, so many functional languages convert a recursive implementation into an iterative one through tail-call elimination, where the recursive call is the final action of a function. In languages that don't support tail-call optimization, you can use thunks and trampolines instead. A thunk is a no-argument function that is evaluated lazily and may in turn produce another thunk for the next function call. A trampoline defines a Computation data structure that either holds the final result or the next thunk to evaluate. For example, the following code illustrates an implementation of a trampoline in Rust:
trait FnThunk {
    type Out;
    fn call(self: Box<Self>) -> Self::Out;
}

pub struct Thunk<'a, T> {
    fun: Box<dyn FnThunk<Out = T> + 'a>,
}

impl<T, F> FnThunk for F
where
    F: FnOnce() -> T,
{
    type Out = T;
    fn call(self: Box<Self>) -> T {
        (*self)()
    }
}

impl<'a, T> Thunk<'a, T> {
    pub fn new(fun: impl FnOnce() -> T + 'a) -> Self {
        Self { fun: Box::new(fun) }
    }
    pub fn compute(self) -> T {
        self.fun.call()
    }
}

pub enum Computation<'a, T> {
    Done(T),
    Call(Thunk<'a, Computation<'a, T>>),
}

pub fn compute<T>(mut res: Computation<'_, T>) -> T {
    loop {
        match res {
            Computation::Done(x) => break x,
            Computation::Call(thunk) => res = thunk.compute(),
        }
    }
}

fn factorial(n: u128) -> u128 {
    fn fac_with_acc(n: u128, acc: u128) -> Computation<'static, u128> {
        if n > 1 {
            Computation::Call(Thunk::new(move || fac_with_acc(n - 1, acc * n)))
        } else {
            Computation::Done(acc)
        }
    }
    compute(fac_with_acc(n, 1))
}

fn main() {
    println!("factorial result {}", factorial(5));
}
Memoization
Memoization allows caching the results of expensive function calls so that repeated invocations return the cached result instead of recomputing it. It can be implemented using the thunk pattern described above. For example, the following Rust implementation caches the first computed value and returns it for subsequent calls, as the test at the end shows:
use std::borrow::Borrow;
use std::marker::PhantomData;
use std::ops::{Deref, DerefMut};

enum Memoized<I: 'static, O: Clone, Func: Fn(I) -> O> {
    UnInitialized(PhantomData<&'static I>, Box<Func>),
    Processed(O),
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Memoized<I, O, Func> {
    fn new(lambda: Func) -> Memoized<I, O, Func> {
        Memoized::UnInitialized(PhantomData, Box::new(lambda))
    }
    // Returns the cached value if one exists, otherwise computes and caches it.
    fn fetch(&mut self, data: I) -> O {
        let (flag, val) = match self {
            &mut Memoized::Processed(ref x) => (false, x.clone()),
            &mut Memoized::UnInitialized(_, ref z) => (true, z(data)),
        };
        if flag {
            *self = Memoized::Processed(val.clone());
        }
        val
    }
    fn is_initialized(&self) -> bool {
        match self {
            &Memoized::Processed(_) => true,
            _ => false,
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Deref for Memoized<I, O, Func> {
    type Target = O;
    fn deref(&self) -> &Self::Target {
        match self {
            &Memoized::Processed(ref x) => x,
            _ => panic!("Attempted to dereference uninitialized memoized value"),
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> DerefMut for Memoized<I, O, Func> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        if self.is_initialized() {
            match self {
                &mut Memoized::Processed(ref mut x) => return x,
                _ => unreachable!(),
            };
        } else {
            // Note: zeroed() is used only as a placeholder value here and is not
            // safe for every output type O.
            *self = Memoized::Processed(unsafe { std::mem::zeroed() });
            match self {
                &mut Memoized::Processed(ref mut x) => return x,
                _ => unreachable!(),
            };
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Borrow<O> for Memoized<I, O, Func> {
    fn borrow(&self) -> &O {
        match self {
            &Memoized::Processed(ref x) => x,
            _ => panic!("Attempted to borrow uninitialized memoized value"),
        }
    }
}
mod test {
    use super::Memoized;

    #[test]
    fn test_memoized() {
        let lambda = |x: i32| -> String { x.to_string() };
        let mut dut = Memoized::new(lambda);
        assert_eq!(dut.is_initialized(), false);
        assert_eq!(&dut.fetch(5), "5");
        assert_eq!(dut.is_initialized(), true);
        assert_eq!(&dut.fetch(2000), "5");
        let x: &str = &dut;
        assert_eq!(x, "5");
    }
}
Type Conversion
Type conversion allows converting an object from one type to another; for example, the following Converter interface is defined in the Spring framework:
public interface Converter<S, T> {

    @Nullable
    T convert(S source);

    default <U> Converter<S, U> andThen(Converter<? super T, ? extends U> after) {
        Assert.notNull(after, "'after' Converter must not be null");
        return (S s) -> {
            T initialResult = convert(s);
            return (initialResult != null ? after.convert(initialResult) : null);
        };
    }
}
This kind of type conversion looks very similar to the map/reduce primitives defined in functional programming languages, e.g., Java 8 added the Function interface for such transformations. In addition, Scala supports implicit conversion from one type to another, e.g.:
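A minimal sketch (Scala 2 syntax; the Number wrapper type is illustrative and mirrors the Rust example below):
case class Number(value: Int)

object Conversions {
  import scala.language.implicitConversions

  // Implicit conversion applied whenever an Int is used where a Number is expected.
  implicit def intToNumber(i: Int): Number = Number(i)
}

object Main extends App {
  import Conversions._

  val num: Number = 42 // implicitly converted via intToNumber
  println(num)
}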
Rust also supports From and Into traits for converting types, e.g.,
use std::convert::From;

#[derive(Debug)]
struct Number {
    value: i32,
}

impl From<i32> for Number {
    fn from(item: i32) -> Self {
        Number { value: item }
    }
}

fn main() {
    let num1: Number = Number::from(10);
    let num2: Number = 20.into();
    println!("{:?} {:?}", num1, num2);
}
Though the REST standard for remote APIs is fairly loose, you can document the API shape and structure using standards such as OpenAPI and Swagger specifications. A documented API specification ensures that both the consumer/client and the producer/server abide by the specification and prevents unexpected behavior. The API provider may also define service-level objectives (SLO) so that the API meets latency, security, availability and other service-level indicators (SLI). The API provider can then use contract tests to validate API interactions based on the documented specifications. Contract testing includes both the consumer and the producer, where a consumer makes an API request and the producer produces the result. The contract tests ensure that both consumer requests and producer responses match the contract request and response definitions in the API specification. These contract tests don't just validate the API schema; they validate the interactions between consumer and producer, so they can also be used to detect breaking or backward-incompatible changes, allowing consumers to continue using the APIs without any surprises.
In order to demonstrate contract testing, we will use the api-mock-service library to generate mock/stub client requests and server responses based on OpenAPI specifications or customized test contracts. These test contracts can be used by both consumers and producers for validating API contracts, and the contract tests can evolve as the API specifications are updated.
Sample REST API Under Test
A sample eCommerce application will be used to demonstrate contracts testing. The application will use various REST APIs to implement online shopping experience. The primary purpose of this example is to show how different request structures can be passed to the REST APIs and then generate a valid result or an error condition for contract testing. You can view the Open-API specifications for this sample app here.
Customer REST APIs
The customer APIs define operations to manage customers who shop online, e.g.:
Product REST APIs
The product APIs define operations to manage products that can be shopped online, e.g.:
Payment REST APIs
The payment APIs define operations to charge credit card and pay for online shopping, e.g.:
Order REST APIs
The order APIs define operations to purchase a product from the online store; they use the above APIs to validate customers, check product inventory, charge payments and then store a record of the order, e.g.:
Generating Stub Server Responses based on Open-API Specifications
In this example, stub server responses will be generated by api-mock-service based on open-api specifications ecommerce-api.json by starting the mock service first as follows:
It will generate test contracts with stub/mock responses for all APIs defined in the ecommerce-api.json Open API specification. For example, you can fetch results from the customers REST API, e.g.:
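For instance, assuming the mock server listens on port 8000 as in the example below (the randomly generated response body is omitted here):
curl -v http://localhost:8000/customers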
The above response is randomly generated based on the types, formats, regex patterns and min/max limits of the properties defined in the Open-API specification, and calling this API will automatically generate both valid and error responses, e.g., calling “curl http://localhost:8000/customers” again will return:
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type:
< Vary: Origin
< X-Mock-Path: /customers
< X-Mock-Request-Count: 9
< X-Mock-Scenario: getCustomerByEmail-customers-500-8a93b6c60c492e730ea149d5d09e79d85701c01dbc017d178557ed1d2c1bad3d
< Date: Sun, 01 Jan 2023 20:41:17 GMT
< Content-Length: 67
<
* Connection #0 to host localhost left intact
{"logRef":"achieve_output_fresh","message":"buffalo_rescue_street"}
Consumer-driven Contract Testing
Upon uploading the Open-API specifications of microservices, the api-mock-service generates test contracts for each REST API and response statuses. You can then customize these test cases for consumer-driven contract testing.
For example, here is the default test contract generated for finding a customer by id with path “/customers/:id”:
You can customize the above response contents using the built-in template functions in the api-mock-service library, or create additional test contracts for each distinct input parameter. For example, the following contract defines the interaction between consumer and producer to add a new customer:
The above template defines the interaction for adding a new customer, where the request section defines the format of the request and the matching criteria using the match_content property. The response section includes the headers and contents that are generated by the stub/mock server for consumer-driven contract testing. You can then invoke the test contract using:
curl -X POST http://localhost:8000/customers -d '{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'
Note: The response will not match the request body as the contract testing only tests interactions between consumer and producer without maintaining any server side state. You can use other types of testing such as integration/component/functional testing for validating state based behavior.
Producer-driven Generated Tests
The process of defining contracts to generate tests for validating producer REST APIs is similar to consumer-driven contracts. For example, you can upload open-api specifications or user-defined contracts to the api-mock-service provided mock/stub server.
For example, you can upload open-API specifications for ecommerce-api.json as follows:
Upon uploading the specifications, the mock server will generate contracts for each REST API and status. You can customize those contracts with additional validation or assertion and then invoke server generated tests either by specifying the REST API or invoke multiple REST APIs belonging to a specific group. You can also define an order for executing tests in a group and can optionally pass data from one invocation to the next invocation of REST API.
For testing purpose, we will customize customer REST APIs for adding a new customer and fetching a customer by its id, i.e.,
The request section defines a content property that builds the input request, which will be sent to the producer's REST API. The server section defines match_contents to match a regex for each response property. In addition, the response section defines assertions to compare the response contents, headers and status against the expected output.
The above template defines similar properties to generate the request body, and defines match_contents with assertions to match the expected output headers, body and status. Based on the order of the tests, the generated test to add a new customer will be executed first, followed by the test to find a customer by id. As we are testing against real REST APIs, the path for finding a customer is defined as “/customers/{{.id}}”, which populates the id from the output of the first test based on the pipe_properties.
Uploading Contracts
Once you have the api-mock-service mock server running, you can upload contracts using:
You can start your service before invoking generated tests, e.g. we will use sample-openapi for the testing purpose and then invoke the generated tests using:
The above command will execute all tests for the customers group and will invoke each REST API 5 times. After executing the APIs, it will generate results as follows:
Though generated tests are executed against real services, it is recommended that the service implementation use test doubles or mock services for any dependent services, as contract testing is not meant to replace component or end-to-end tests, which provide better support for integration testing.
Recording Consumer/Producer interactions for Generating Stub Requests and Responses
Contract testing does not always depend on API specifications such as OpenAPI and Swagger; instead, you can record interactions between consumers and producers using the api-mock-service tool.
For example, if you have an existing REST API or a legacy service such as above sample API, you can record an interaction as follows:
export http_proxy="http://localhost:9000"
export https_proxy="http://localhost:9000"
curl -X POST -H "Content-Type: application/json" http://localhost:8080/customers -d \
'{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'
This will invoke the remote REST API, record contract interactions and then return server response:
You can then invoke consumer-driven contracts to generate stub response or invoke generated tests to test against producer implementation as described in earlier section. Another benefit of capturing test contracts using recorded session is that it can accurately capture all URLs, parameters and headers for both requests and responses so that contract testing can precisely validate against existing behavior.
Summary
Though unit testing, component testing and end-to-end testing are common testing strategies used by most organizations, they don't provide adequate support to validate API specifications and the interactions between consumers/clients and producers/providers of the APIs. Contract testing ensures that consumers and producers do not deviate from the specifications and can be used to validate changes for backward compatibility when APIs evolve. It also decouples consumers and producers while the API is still in development, as both parties can write code against the agreed contracts and test them independently.
A service owner can generate producer contracts using tools such as api-mock-service based on an Open API specification or user-defined constraints. Consumers can provide their consumer-driven contracts to the service providers to ensure that API changes don't break any consumers. These contracts can be stored in a source code repository or on a registry service so that contract tests can easily access and execute them as part of the build and deployment pipelines. The api-mock-service tool greatly assists in adding contract testing to your software development lifecycle and is freely available from https://github.com/bhatti/api-mock-service.
The software development cycle for microservices generally includes unit testing during development, where mock implementations of the dependent services are injected with the desired behavior to test various scenarios and failure conditions. However, development teams often use real dependent services for integration testing of a microservice in a local environment. This poses a considerable challenge, as each dependent service may keep its own state, which makes it harder to reliably validate regression behavior or simulate certain error responses. Further, as the number of request parameters to the service or downstream services grows, the combinatorial explosion of test cases becomes unmanageable.
This is where property-based testing offers relief, as it allows testing against automatically generated input fuzz-data, which is why this form of testing is also referred to as generative testing. A generator defines a function that generates random data based on the type of input and constraints on the range of input values. The property-based test driver then iteratively calls the system under test to validate the result and assert the desired behavior, e.g.:
def pre_condition_test_input_param(kind):
    ### assert pre-condition based on type of parameter and range of input values it may take

def generate_test_input_param(kind):
    ### generate data meeting pre-condition for the type

def generate_test_input_params(kinds):
    return [generate_test_input_param(kind) for kind in kinds]

for i in range(max_attempts):
    [a, b, c, ...] = generate_test_input_params(type1, type2, type3, ...)
    output = function_under_test(a, b, c, ...)
    assert property1(output)
    assert property2(output)
    ...
In the above example, the input parameters are randomly generated based on a precondition. The generated parameters are passed to the function under test, and the test driver validates the result based on property assertions. This entire process is also referred to as fuzzing, and it is repeated over a fixed number of attempts to identify any input parameters for which the property assertions fail. There are many libraries for property-based testing in various languages such as QuickCheck, fast-check, junit-quickcheck, ScalaCheck, etc., but we will use the api-mock-service library to demonstrate these capabilities for testing microservice APIs.
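As a concrete illustration of the QuickCheck-style libraries mentioned above, here is a minimal runnable sketch using Python's hypothesis library; the function under test and its properties are illustrative assumptions and are not part of the api-mock-service examples that follow:
from hypothesis import given, strategies as st

def add_discount(price_cents: int, discount_pct: int) -> int:
    """Function under test: apply a percentage discount to a price in cents."""
    return price_cents * (100 - discount_pct) // 100

# Generators: hypothesis produces random inputs within the declared ranges and
# repeatedly invokes the test, shrinking any failing case it finds.
@given(price_cents=st.integers(min_value=0, max_value=10_000_000),
       discount_pct=st.integers(min_value=0, max_value=100))
def test_discount_properties(price_cents, discount_pct):
    output = add_discount(price_cents, discount_pct)
    # Property assertions: the discounted price never exceeds the original
    # price and never goes negative.
    assert 0 <= output <= price_cents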
Following sections describe how the api-mock-service library can be used for testing microservice with fuzzing/property-based approaches and for mocking dependent services to produce the desired behavior:
Sample Microservices Under Test
A sample eCommerce application will be used to demonstrate property-based and generative testing. The application will use various microservices to implement online shopping experience. The primary purpose of this example is to show how different parameters can be passed to microservices, where microservice APIs will validate the input parameters, perform a simple business logic and then generate a valid result or an error condition. You can view the Open-API specifications for this sample app here.
Customer APIs
The customer APIs define operations to manage customers who shop online, e.g.:
Product APIs
The product APIs define operations to manage products that can be shopped online, e.g.:
Payment APIs
The payment APIs define operations to charge credit card and pay for online shopping, e.g.:
Order APIs
The order APIs define operations to purchase a product from the online store; they use the above APIs to validate customers, check product inventory, charge payments and then store a record of the order, e.g.:
Defining Test Scenarios with Open-API Specifications
In this example, test scenarios will be generated by api-mock-service based on open-api specifications ecommerce-api.json by starting the mock service first as follows:
The above response is randomly generated based on the properties defined in the Open-API specification, and calling this API will automatically generate both valid and error responses, e.g., calling “curl http://localhost:8000/products” again will return:
Applying Property-based/Generative Testing for Clients of Microservices
Upon uploading the Open-API specifications of microservices, the api-mock-service automatically generates templates for producing mock responses and error conditions, which can be customized for property-based and generative testing of microservice clients by defining constraints for generating input/output data and assertions for request/response validation.
Client-side Testing for Listing Products
You can find generated mock scenarios for listing products on the mock service using:
In above example, we slightly improved the test template by generating product entries in a loop and using built-in functions to randomize the data. You can upload this scenario using:
Invoking “curl -v http://localhost:8000/products” will randomly return both of those test scenarios so that client code can test for various conditions.
Client-side Testing for Creating Products
You can find mock scenarios for creating products that were generated from above Open-API specifications using:
The client code can test for product properties and other error scenarios can be added to simulate failure conditions.
Applying Property-based/Generative Testing for Microservices
The api-mock-service test scenarios defined above can also be used to test against the microservice implementations. You can start your service, e.g. we will use sample-openapi for testing purpose and then invoke test request for server-side testing using:
If you try to run it again, the execution will fail with following error because none of the categories include X:
{
"results": {
"getProducts_0": {},
"getProducts_1": {},
"getProducts_2": {},
"getProducts_3": {},
"getProducts_4": {}
},
"errors": {
"saveProduct_0": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
"saveProduct_1": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
"saveProduct_2": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
"saveProduct_3": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
"saveProduct_4": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'"
},
"succeeded": 5,
"failed": 5
}
Summary
Unit testing and other forms of testing don't rule out the presence of bugs, but they can greatly reduce their probability. However, with large test suites, the maintenance of tests incurs a high development cost, especially if those tests are brittle and require frequent changes. Property-based/generative testing can help fill gaps in unit testing while keeping the size of the test suite small. The api-mock-service tool is designed to mock and test microservices using fuzzing and property-based testing techniques. This mocking library can be used to test both client and server side implementations and can also be used to generate error conditions that are not easily reproducible. It can be a powerful tool in your toolbox when developing distributed systems with a large number of services, which can be difficult to deploy and test locally. You can read more about the api-mock-service library at “Mocking and Fuzz Testing Distributed Micro Services with Record/Play, Templates and OpenAPI Specifications” and download it freely from https://github.com/bhatti/api-mock-service.
1. Introduction
Over the last few decades, software systems have evolved from mainframe and client-server models to distributed systems and service-oriented architecture. In the early 2000s, Amazon and Netflix led the industry in adopting microservices by applying Conway's law and a self-organizing team structure, where a small two-pizza team owns the entire lifecycle of its microservices, including operational responsibilities. The microservice architecture, with its small, cross-functional and independent structure, has helped team agility to develop, test and deploy microservices independently. However, software systems are becoming increasingly complex with the Cambrian explosion of microservices, and the ecosystem of microservices is reaching a boiling point where building new features, releasing enhancements, and the operational load from maintaining high availability, scalability, resilience, security, observability, etc. are slowing down the development teams and raising artificial complexity as a result of mixing different concerns.
Following diagram shows difference between monolithic architecture and microservice architecture:
As you can see in above diagram, each microservice is responsible for managing a number of cross-cutting concerns such as AuthN/AuthZ, monitoring, rate-limit, configuration, secret-management, etc., which adds a scope of work for development of each service. Following sections dive into fundamental causes of the complexity that comes with the microservices and a path forward for evolving the development methodology to build these microservices more effectively.
2. Perils in Building Distributed Systems
Following is a list of major pitfalls faced by the development teams when building distributed systems and services:
2.1 Coordinating Feature Development
The feature development becomes more convoluted with increase in dependencies of downstream services and any large change in a single service often requires the API changes or additional capabilities from multiple dependent services. This creates numerous challenges such as prioritization of features development among different teams, coordinating release timelines and making progress in absence of the dependent functionalities in the development environment. This is often tackled with additional personnel for project management and managing dependencies with Gantt charts or other project management tools but it still leads to unexpected delays and miscommunication among team members about the deliverables.
2.2 Low-level Concurrency Controls
The development teams often use imperative languages and apply low-level abstractions to implement distributed services where each request is served by a native thread in a web server. Due to extensive overhead of native threads such as stack size limits, the web server becomes constrained with the maximum number of concurrent requests it can support. In addition, these native threads in the web server often share a common state, which must be protected with a mutex, semaphore or lock to avoid data corruption. These low-level and primitive abstractions couple business logic with the concurrency logic, which add accidental complexity when safeguarding the shared state or communicating between different threads. This problem worsens with the time as the code size increases that results in subtle concurrency related heisenbugs where these bugs may produce incorrect results in the production environment.
2.3 Security
Each microservice requires implementing authentication, authorization, secure communication, key management and other aspects of the security. The development teams generally have to support these security aspects for each service they own and they have to be on top of any security patches and security vulnerabilities. For example, a zero-day log4j vulnerability in December 2021 created a havoc in most organizations as multiple services were affected by the bug and needed a patch immediately. This resulted in large effort by each development team to patch their services and deploy the patched services as soon as possible. Worst, the development teams had to apply patches multiple times because initial bug fixes from the log4j team didn’t fully work, thus further multiplying the work by each development team. With growth of the dependency stack or bill of material for third party libraries in modern applications, the development teams face an overwhelming operational burden and enormous security risk to support their services safely.
2.4 Web Server
In general, each microservice requires a web server, which adds additional computing and administration overhead for deploying and running the service stack. The web server must be running all the time whether the service is receiving requests or not, thus wasting CPU, memory and storage resources needlessly.
2.5 Colocating Multiple Services
The development teams often start with a monolithic style applications that hosts multiple services on a single web server or with segregated application servers hosting multiple services on the same web server to lessen the development and deployment effort. The monolithic and service colocation architecture hinders speed, agility and extensibility as the code becomes complicated, harder to maintain and reasoned due to lack of isolation. In this style of deployment, computing resources can be entirely consumed by a single component or a bug in one service can crash the entire system. As each service may have unique runtime or usage characteristics, it’s also arduous to scale a single service, to plan service capacity or to isolate service failures in a colocated runtime environment.
2.6 Cross Cutting Concerns
Building distributed systems require managing a lot of horizontal concerns such as security, resilience, business continuity, and availability but coupling these common concerns with the business logic results in inconsistencies and higher complexity by different implementations in microservices. Mixing these different concerns with business logic in microservices means each development team will have to solve those concerns independently and any omission or divergence may cause miserable user experience, faulty results, poor protection against load spikes or a security breach in the system.
2.7 Service Discovery
Though, microservices use various synchronous and asynchronous protocols for communicating with other services but they often store the endpoints of other services locally in the service configurations. This adds maintenance and operational burden for maintaining the endpoints for all dependent services in each development, test and production environment. In addition, services may not be able to apply certain containment or access control policies such as not invoking cross-region service to maintain lower latency or sustain a service for disaster recovery.
2.8 Architectural Quality Attributes
The architecture quality attributes include performance, availability sustainability, security, scalability, fault tolerance, performance, resilience, recovery and usability, etc. Each development team not only has to manage these attributes for each service but often requires coordination with other teams when scaling their services so that downstream services can handle additional load or meet the availability/reliability guarantees. Thus, the availability, fault tolerance, capacity management or other architectural concerns become tied with downstream services as an outage in any of those services directly affect upstream services. Thus, improving availability, scalability, fault tolerance or other architecture quality attributes often requires changes from the dependent services, which adds scope of the development work.
2.9 Superfluous Development Work
When developing a microservice, a development team owns the end-to-end development and release process, which includes full software development lifecycle support such as:
maintaining build scripts for CI/CD pipelines
building automation tools for integration/functional/load/canary tests
defining access policies related to throttling/rate-limits
implementing consistent error handling, idempotency behavior, contextual information across services
providing customized personal and system dashboards
supporting data encryption and managing secret keys
defining security policies related to AuthN/AuthZ/Permissions/ACL
defining network policies related to VPN, firewall, load-balancer, network/gateway/routing configuration
adding compression, caching, and any other common pre/post processing for services
As a result, any deviations or bugs in implementation of these processes or misconfiguration in underlying infrastructure can lead to conflicting user experience, security gaps and outages. Some organizations maintain lengthy checklists and hefty review processes for applying best practices before the software release but they often miss key learnings from other teams and slow down the development process due to cumbersome release process.
2.10 Yak shaving when developing and testing features
Due to enormous artificial complexity of microservices with abysmal dependency stack, the development teams have to spend inordinate amount of time in setting up a development environment when building a new feature. The feature development requires testing the features using unit tests with a mock behavior of dependent services and using integration tests with real dependent services in a local development environment. However, it’s not always possible to run complete stack locally and fully test the changes for new features, thus the developers are encumbered with finding alternative integration environment where other developers may also be testing their features. All this yak shaving makes the software development awfully tedious and error prone because developers can’t test their features in isolation with a high confidence. This means that development teams find bugs later in phases of the release process, which may require a rollback of feature changes and block additional releases until bugs are fixed in the main/release branch.
3. Path to the Enlightenment
Following are a few recommendations to remedy above pitfalls in the development of distributed systems and microservices:
3.1 Higher level of Development Abstraction
Instead of using low-level imperative languages or low-level concurrency controls, high-level abstractions can be applied to simplify the development of the microservices as follows:
3.1.1 Actor Model
The Actor model was first introduced in 1973 by Carl Hewitt and it provides a high-level abstraction for concurrent computation. An actor uses a mailbox or a queue to store incoming messages and processes one message at a time using local state in a single green thread or a coroutine. An actor can create other actors or send messages without using any lock-based synchronization and blocking. Actors are reactive so they cannot initiate any action on their own, instead they simply react to external stimuli in the form of message passing. Actors provide much better error handling where applications can define a hierarchy of actors with parent/child relationships and a child actor may crash when encountering system errors, which are monitored and supervised by parent actors for failure recovery.
The Actor model is supported natively in many languages such as Erlang/Elixir, Scala, Swift and Pony, and it's available as a library in many other languages. In these languages, actors generally use green threads, coroutines or a preemptive scheduler to schedule actors with non-blocking I/O operations. As actors incur much lower overhead compared to native threads, they can be used to implement microservices with much greater scalability and performance. In addition, message passing circumvents the need to guard shared state, as each actor only maintains local state, which provides a more robust and reliable implementation of microservices. Here is an example of the actor model in the Erlang language:
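A minimal sketch of such an actor (module, function and message names are illustrative): it keeps a running sum as its local state and reacts to add and get messages.
-module(sum_actor).
-export([start/0, loop/1]).

start() ->
    %% Spawn the actor and register it under the symbol sumActor.
    Pid = spawn(?MODULE, loop, [0]),
    register(sumActor, Pid),
    Pid.

%% Tail-recursive receive loop; the accumulated sum is the actor's local state.
loop(Sum) ->
    receive
        {add, N} ->
            loop(Sum + N);
        {get, From} ->
            From ! {sum, Sum},
            loop(Sum)
    end.

%% Client code sends messages to the registered name, e.g.:
%%   sumActor ! {add, 5},
%%   sumActor ! {get, self()}.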
In the above example, an actor is spawned to run the loop function, which uses tail recursion to receive the next message from the queue and then processes it based on the tag of the message, such as add or get. The client code sends messages to the symbol sumActor, which is registered with a local registry. As an actor only maintains local state, microservices may use an external data store and manage a state machine using an orchestration-based SAGA pattern to trigger the next action.
3.1.2 Function as a service (FaaS) and Serverless Computing
Function as a service (FaaS) offers serverless computing to simplify managing physical resources. Cloud vendors offer APIs for AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads. There is also open-source support for FaaS computing, such as OpenFaas and OpenWhisk on top of Kubernetes or OpenShift. These functions resemble the actor model, as each function is triggered based on an event and is designed with single responsibility, idempotency and shared-nothing principles so that it can be executed concurrently.
The FaaS and serverless computing can be used to develop microservices where business logic can be embedded within a serverless functions but any platform services such as storage, messaging, caching, SMS can be exposed via Backend as a Service (BaaS). The Backend as a Service (BaaS) adds additional business logic on top of Platform as a Service (PaaS).
3.1.3 Agent based computing
Microservices architecture decouples data access from the business logic and microservices fetch data from the data-store, which incurs a higher overhead if the business logic needs to fetch a lot of data for processing or filtering before generating results. As opposed, agent style computing allow migrating business logic remotely where data or computing resources reside, thus it can process data more efficiently. In a simplest example, an agent may behave like a stored procedure where a function is passed to a data store for executing a business logic, which is executed within the database but other kind of agents may support additional capabilities to gather data from different data stores or sources and then produces desired results after processing the data remotely.
3.2 Service and Schema Registry
The service registry allows microservices to register the endpoints so that other services can look up the endpoints for communication with them instead of storing the endpoints locally. This allows service registry to enforce any authorization and access policies for communication based on geographic location or other constraints. A service registry may also allow registering mock services for testing in a local development environment to facilitate feature development. In addition, the registry may store schema definitions for the API models so that services can validate requests/responses easily or support multiple versions of the API contracts.
3.3 API Router, Reverse-Proxy or Gateway
API router, reverse-proxy or an API gateway are common patterns with microservices for routing, monitoring, versioning, securing and throttling APIs. These patterns can also be used with FaaS architecture where an API gateway may provide these capabilities and eliminate the need to have a web server for each service function. Thus, API gateway or router can result in lowering computing cost for each service and reducing the complexity for maintaining non-functional capabilities or -ilities.
3.4 Virtualization
Virtualization abstracts computer hardware and uses a hypervisor to create multiple virtual computers with different operating systems and applications on top of a single physical computer.
3.4.1 Virtual Machines
The initial implementation of virtualization was based on Virtual Machines for building virtualized computing environments and emulating a physical computer. The virtual machines use a hypervisor to communicate with the physical computer.
3.4.2 Containers
Containers implement virtualization using host operating system instead of a hypervisor, thus provide more light-weight and faster provisioning of computing resources. The containers use platforms such as Docker and Kubernetes to execute applications and services, which are bundled into images based on Open Container Initiative (OCI) standard.
3.4.3 MicroVM
MicroVMs such as Firecracker and crosVM are based on kernel-based VM (KVM) and use hostOS acting as a hypervisor to provide isolation and security. As MicroVMs only include essential features for network, storage, throttling and metadata so they are quick to start and can scale to support multiple VMs with minimal overhead. A number of serverless platforms such as AWS Lambda, appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube have adopted Firecracker VM, which offers low overhead for starting a new virtual machine or executing a serverless workload.
3.4.4 WebAssembly
The WebAssembly is a stack-based virtual machine that can run at the edge or in cloud. The applications written Go, C, Rust, AssemblyScript, etc. are compiled into WebAssembly binary and are then executed on a WebAssembly runtime such as extism, faasm, wasmtime, wamr, wasmr and wagi. The WebAssembly supports WebAssembly System Interface (WASI) standard, which provides access to the systems APIs for different operating systems similar to POSIX standard. There is also an active development of WebAssembly Component Model with proposals such as WebIDL bindings and Interface Types. This allows you to write microservices in any supported language, compile the code into WASM binary and then deploy in a managed platform with all support for security, traffic management, observability, etc.
3.5 Instrumentation
In order to reduce service code that deals specifically with non-functional capabilities such as authentication, authorization, logging, monitoring, etc., the business service can be instrumented to provide those capabilities at compile or deployment time. This means that the development team can largely focus on the business requirements, and instrumentation takes care of adding metrics, failure reporting, diagnostics, monitoring, etc. without any development work. In addition, any external platform dependencies such as messaging services, orchestration, databases, key/value stores, and caching can be injected into the service dynamically at runtime. The runtime can be configured to provide different implementations of the platform services, e.g., it may use a local Redis server for a key/value store in a hosted environment or AWS/Azure's implementation in a cloud environment.
When deploying services with WebAssembly support, the instrumentation may use WebAssembly libraries for extending the services to support authentication, rate-limiting, observability, monitoring, state management, and other non-functional capabilities, so that the development work can be shortened as shown below:
3.6 Orchestration and Choreography
The orchestration and choreography allows writing microservices that can be easily composed to provide a higher abstractions for business services. In orchestration design, a coordinator manages synchronous communication among different services whereas choreography uses event-driven architecture to communicate asynchronously. An actor model fits naturally with event based architecture for communicating with other actors and external services. However, orchestration services can be used to model complex business processes where a SAGA pattern is used to manage state transitions for different activities. Microservices may also use staged event-driven architecture (SEDA) to decompose complex event-driven services into a set of stages that are connected by different queues, which supports better modularity and code-reuse. SEDA allows enforcing admission control on each event queue and grants flexible scheduling for processing events based on adaptive workload controls and load shedding policies.
3.7 Automation, Continuous Deployment and Infrastructure as a code
Automation is a key to remove any drudgery work during the development process and consolidate common build processes such as continuous integration and deployment for improving the productivity of a development team. The development teams can employ continuous delivery to deploy small and frequent changes by developers. The continuous deployment often uses rolling updates, blue/green deployments or canary deployments to minimize disruption to end users. The monitoring system watches for error rates at each stage of the deployment and automatically rollbacks changes if a problem occurs.
Infrastructure as code (IaC) uses a declarative language to define development, test and production environment, which is managed by the source code management software. These provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Many cloud vendors provide support for IaC such as Azure Resource Manager (ARM), AWS Cloud Development Kit (CDK), Hashicorp Terraform etc to deploy computing resources.
3.8 Proxy Patterns
The following sections show how to implement cross-cutting concerns using proxy patterns when building microservices:
3.8.1 Sidecar Model
The sidecar model helps modularity and reusability, where an application requires two containers: an application container and a sidecar container, and the sidecar container provides additional functionality such as an SSL proxy for the service, observability, and collecting metrics for the application container.
The Sidecar pattern generally uses another container to proxy off all traffic, which enforces security, access control, throttling before forwarding the incoming requests to the microservice.
3.8.2 Service Mesh
The service mesh uses a mesh of sidecar proxies to enable:
Dynamic request routing for blue-green deployments, canaries, and A/B testing
Load balancing based on latency, geographic locations or health checks
Service discovery based on a version of the service in an environment
Circuit breaker and retries using libraries like Finagle, Stubby, and Hystrix to isolate unhealthy instances and gradually add them back after successful health checks
Error handling and fault tolerance
Control plane to manage routing tables, service discovery, load balancer and other service configuration
The service mesh pattern uses a data plane to host microservices and all incoming and outgoing requests go through a sidecar proxy that implements cross cutting concerns such as security, routing, throttling, etc. The control-plan in mesh network allows administrators to change the behavior of data plane proxies or configuration for data access. Popular service mesh frameworks include Consul, Distributed Application Runtime (Dapr), Envoy, Istio, Linkerd, Kong, Koyeb and Kuma for providing control-plane and data-plane with builtin support for networking, observability, traffic management and security. Dapr mesh network also supports Actor-model using the Orleans Virtual Actor pattern, which leverages the scalability and reliability guarantees of the underlying platform.
3.9 Cell-Based Architecture
A Cell-based Architecture (CBA) allows grouping business functions into single units called “cells”, which provides better scalability, modularity, composability, disaster-recovery and governance for building microservices. It reduces the number of deployment units for microservices, thus simplifying the artificial complexity in building distributed systems.
A cell is an immutable collection of components and services that is independently deployable, manageable, and observable. The components and services in a cell communicate with each other using local network protocols and other cells via network endpoints through a cell gateway or brokers. A cell defines an API interface as part of its definition, which provides better isolation, short radius blast and agility because releases can be rolled out cell by cell. However, in order to maintain the isolation, a cell should be either self contained for all service dependencies or dependent cells should be on the same region so that there is no cross region dependency.
3.10 12-Factor app
The 12-factor app is a set of best practices from Heroku for deploying applications and services in a virtualized environment. It recommends using declarative configuration for automation, enabling continuous deployment and other best practices. Though these recommendations are a bit dated, most of them still hold, except that you may consider storing credentials or keys in a secret store or in encrypted files instead of simply using environment variables.
4. New Software Development Workflow
The architecture patterns, virtual machines, WebAssembly and serverless computing described in above section can be used to simplify the development of microservices and easily support both functional and non-functional requirements of the business needs. Following sections describe how these patterns can be integrated with the software development lifecycle:
4.1 Development
4.1.1 Adopting WebAssembly as the Lingua Franca
WebAssembly along with its component model and WASI standard has been gaining support and many popular languages now can be compiled into wasm binary. Microservices can be developed in supported languages and ubiquitous WebAssembly support in production environment can reduce the development and maintenance drudgery for the application development teams. In addition, teams can leverage a number of serverless platforms that support WebAssembly such as extism, faasm, netlify, vercel, wasmcloud and wasmedge to reduce the operational load.
4.1.2 Instrumentation and Service Binding
The compiled WASM binary for microservices can be instrumented similar to aspects so that the generated code automatically supports horizontal concerns such as circuit breaker, diagnostics, failure reporting, metrics, monitoring, retries, tracing, etc. without any additional endeavor by the application developer.
In addition, the runtime can use service mesh features to bind external platform services such as event bus, data store, caching, service registry or dependent business services. This simplifies the development effort as development team can use the shared services for developing microservices instead of configuring and deploying infrastructure for each service. The shared infrastructure services support multi-tenancy and distinct namespaces for each microservice so that it can manage its state independently.
4.1.3 Adopting Actor Model for Microservices
As discussed above, the actor model offers a light-weight and highly concurrent model to implement a microservice APIs. The actors based microservices can be coded in any supported high-level language but then compiled into WebAssembly with a WASI support. A number of WebAssembly serverless platforms including Lunatic and wasmCloud already support Actor model while other platforms such as fermyon use http based request handlers, which are invoked for each request similar to actors based message passing. For example, here is a sample actor model in wasmCloud in Rust language though any language with wit-bindgen is supported as well:
4.1.4 Service Composition with Orchestration and Choreography
As described above, actors based microservices can be extended with the orchestration patterns such as SAGA and choreography/event driven architecture patterns such as SEDA to build composable services. These design patterns can be used to build loosely coupled and extensible systems where additional actors and components can be added without changing existing code.
4.2 Deployment and Runtime
4.2.1 Virtual Machines and Serverless Platform
Following diagram shows the evolution of virtualized environments for hosting applications, services, serverless functions and actors:
In this architecture, the microservices are compiled into wasm binary, instrumented and then deployed in a micro virtual machine.
4.2.2 Sidecar Proxy and Service Mesh
Though, the service code will be instrumented with additional support for error handling, metrics, alerts, monitoring, tracing, etc. before the deployment but we can further enforce access policies, rate limiting, key management, etc. using a sidecar proxy or service mesh patterns:
For example, WasmEdge can be integrated with Dapr service mesh for adding building blocks for state management, event bus, orchestration, observability, traffic routing, and bindings to external services. Similarly, wasmCloud can be extended with additional non-functional capabilities by implementing capability provider. wasmCloud also provides a lattice, self-healing mesh network for simplifying communication between actors and capability providers.
4.2.3 Cellular Deployment
As described above, Cell-based Architecture (CBA) provides better scalability, modularity, composability and business continuity for building microservices. The following diagram shows how the above design with virtual machines, service mesh and WebAssembly can be extended to support cell-based architecture:
In the above architecture, each cell deploys a set of related microservices for an application, persists state in a replicated data store and communicates with other cells over an event bus. In this model, separate cells are employed for data-plane services and for control-plane services used for configuration and administration purposes.
5. Conclusion
The transition from monolithic services towards microservice architecture over the last several years has helped development teams to be more agile in building modular and reusable code. However, as teams build a growing number of new microservices, they are also tasked with supporting non-functional requirements for each service such as high availability, capacity planning, scalability, performance, recovery, resilience, security and observability. In addition, each microservice may depend on numerous other microservices, so a microservice becomes susceptible to scaling or availability limits and security breaches in any of the downstream services. Such tight coupling of horizontal concerns escalates complexity, development effort and operational load for each development team, resulting in a longer time to market for new features and a larger risk of outages due to divergence in implementing non-functional concerns.
Though serverless computing, function as a service (FaaS) and event-driven compute services have emerged to solve many of these problems, they remain limited in the range of capabilities they offer and lack common standards across vendors. The advancements in micro virtual machines and containers have been a boon to serverless platforms such as appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube. In addition, the widespread adoption of WebAssembly along with its component model and the WASI standard is helping serverless platforms such as extism, faasm, netlify, vercel, wasmcloud and wasmedge build more modular and reusable components. These serverless platforms allow development teams to focus primarily on building business features and to offload non-functional concerns to the underlying platforms. Many of these platforms also support service mesh and sidecar patterns so that they can bind platform and dependent services and automatically handle concerns such as throttling, security, state management, secrets and key management.
Though cell-based architecture is still relatively new and only supported by more mature serverless and cloud platforms, it further improves the scalability, modularity, composability, business continuity and governance of microservices. As each cell is isolated, it adds agility to deploy code changes to a single cell and use canary tests to validate the changes before deploying the code to all cells. Due to such isolated deployments, cell-based architecture reduces the blast radius if a bug or other production issue is discovered during validation. Finally, automating continuous deployment processes and applying Infrastructure as Code (IaC) can simplify local development and deployment so that developers use the same infrastructure setup for local testing as the production environment. This means that services can be deployed in any environment consistently, which reduces manual configuration and subtle bugs due to misconfiguration.
In summary, development teams will greatly benefit from the architecture patterns, virtualized environments, WebAssembly and serverless platforms described above, so that application developers are not burdened with maintaining horizontal concerns and can instead focus on building core product features, which will be the differentiating factors in competitive markets. These serverless and managed platforms not only boost developer productivity but also lower infrastructure cost, operational overhead, cruft development work and the time to market for releasing new features.
Building large distributed systems often requires integrating with multiple distributed microservices, which makes development particularly difficult as it's not always easy to deploy and test all dependent services in a local environment with constrained resources. In addition, you might be working on a large system with multiple teams where you may have received new API specs from another team but the API changes are not available yet. Though you can use mocking frameworks based on API specs when writing unit tests, integration or functional testing requires access to the network service. A common solution that I have used in past projects is to configure a mock service that can simulate different API operations. I wrote a JVM based mock service many years ago with the following use-cases:
Use-Cases
As a service owner, I need to mock remote dependent service(s) by capturing/recording request/responses through an HTTP proxy so that I can play it back when testing the remote service(s) without connecting with them.
As a service owner, I need to mock remote dependent service(s) based on Open API/Swagger specifications so that my service client can test all service behavior per the specifications for the remote service(s) even when the remote service is not fully implemented or accessible.
As a service owner, I need to mock remote dependent service(s) based on a mock scenario defined in a template so that my service client can test service behavior per expected request/response in the template even when remote service is not fully implemented or accessible.
As a service owner, I need to inject various response behavior and faults to the output of a remote service so that I can build a robust client that prevents cascading failures and is more resilient to unexpected faults.
As a service owner, I need to define test cases with faulty or fuzz responses to test my own service so that I can predict how it will behave with various input data and assert the service response based on expected behavior.
Features
API mock service for REST/HTTP based services with following features:
Record API requests/responses by working as an HTTP proxy server (native http/https or via API) between the client and remote service.
Play back API responses that were previously recorded, based on request parameters.
Define API behavior manually by specifying request parameters and response contents using static data or dynamic data based on GO templating language.
Generate API behavior from open standards such as Open API/Swagger and automatically create constraints and regex based on the specification.
Customize API behavior using a GO template language so that users can generate dynamic contents based on input parameters or other configuration.
Generate large responses using the template language with dynamic loops so that you can test performance of your system.
Define multiple test scenarios for the API based on different input parameters or simulating various error cases that are difficult to reproduce with real services.
Store API request/responses locally as files so that it’s easy to port stubbed request/responses to any machine.
Allow users to define API request/response with various formats such as XML/JSON/YAML and upload them to the mock service.
Support test fixtures that can be uploaded to the mock service and can be used to generate mock responses.
Define a collection of helper methods to generate different kind of random data such as UDID, dates, URI, Regex, text and numeric data.
Ability to playback all test scenarios or a specific scenario and change API behavior dynamically with different input parameters.
Support multiple mock scenarios for the same API that can be selected either using round-robin order, custom predicates based on parameters or based on scenario name.
Inject error conditions and artificial delays so that you can test how your system handles error conditions that are difficult to reproduce or use for game days/chaos testing.
Generate client requests for a remote API for chaos and stochastic testing where a set of requests are sent with a dynamic data generated based on regex or other constraints.
I used this service in many past projects; however, I felt it needed a fresh approach to meet the above goals, so I rewrote it in Go, which has robust support for writing network services. You can download the new version from https://github.com/bhatti/api-mock-service. As it's written in Go, you can either install the Go runtime environment or use Docker to run it locally. If you haven't installed Docker, you can download the community version from https://docs.docker.com/engine/installation/ or find an installer for your OS at https://docs.docker.com/get-docker/.
Alternatively, you can run it locally with GO environment, e.g.,
make && ./out/bin/api-mock-service
For full command line options, execute api-mock-service -h, which will show command line options such as:
./out/bin/api-mock-service -h
Starts mock service
Usage:
api-mock-service [flags]
api-mock-service [command]
Available Commands:
chaos Executes chaos client
completion Generate the autocompletion script for the specified shell
help Help about any command
version Version will output the current build information
Flags:
--assetDir string asset dir to store static assets/fixtures
--config string config file
--dataDir string data dir to store mock scenarios
-h, --help help for api-mock-service
--httpPort int HTTP port to listen
--proxyPort int Proxy port to listen
Recording a Mock Scenario via HTTP/HTTPS Proxy
Once the API mock service is running, it listens on two ports: the first port (default 8080) is used to record/play mock scenarios, update templates or upload OpenAPI specifications. The second port (default 8081) sets up an HTTP/HTTPS proxy server that you can point your client at to record scenarios, e.g.
Requests routed through the proxy are automatically recorded and a mock scenario is created for playback. For example, if you call the same API again, it will return the locally recorded response instead of contacting the server. You can customize the recording behavior by adding an X-Mock-Record: true header to your request.
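For illustration, here is a small Rust sketch, assuming the reqwest crate with its blocking feature, that routes a request through the recording proxy on the default port 8081 and sets the X-Mock-Record header; the target URL is just an example, and TLS handling may differ depending on how the proxy intercepts HTTPS traffic:

use reqwest::blocking::Client;
use reqwest::Proxy;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Route traffic through the mock service's proxy port (default 8081).
    let client = Client::builder()
        .proxy(Proxy::all("http://localhost:8081")?)
        // May be needed if the proxy re-signs HTTPS traffic with its own certificate.
        .danger_accept_invalid_certs(true)
        .build()?;

    // X-Mock-Record: true asks the proxy to record this request/response pair
    // as a mock scenario for later playback.
    let resp = client
        .get("https://jsonplaceholder.typicode.com/todos/1")
        .header("X-Mock-Record", "true")
        .send()?;

    println!("status: {}", resp.status());
    Ok(())
}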
Recording a Mock Scenario via API
Alternatively, you can invoke an internal API as a pass-through to a remote API so that you can automatically record the API behavior and play it back later, e.g.
In the above example, the curl command passes the URL of the real service in the Mock-Url HTTP header. In addition, you can pass other authorization headers as needed.
Viewing the Recorded Mock Scenario
The API mock-service will store the request/response in a YAML file under a data directory that you can specify. For example, you may see a file under:
The matching request parameters will be used to select the mock scenario to execute, and you can use regular expressions to validate them; e.g. the above example will match if the content type is application/json, and it will validate that the name query parameter is alphanumeric and 1-50 characters long.
Example Request Parameters:
The example request parameters show the contents captured from record/play so that you can use and customize them to define matching parameters:
URL Query Parameters
URL Request Headers
Request Body
Response Properties
The response properties will include:
Response Headers
Response Body statically defined or loaded from a test fixture
Response can also be loaded from a test fixture file
Status Code
Matching header and contents
Assertions
You can copy a recorded scenario to another folder, customize it using templates, and then upload it for playback.
The matching headers and contents use match_headers and match_contents, similar to the request, to validate the response in case you want to test the response from a real service for chaos testing. Similarly, assertions define a set of predicates to test against the response from a real service:
Besides customizing your template with dynamic properties or conditional logic, you can also send the X-Mock-Response-Status HTTP header to override the HTTP status to return, or X-Mock-Wait-Before-Reply to add artificial latency using duration syntax.
Debug Headers from Playback
The playback request will return mock-headers to indicate the selected mock scenario, path and request count, e.g.
In the above example, I assigned the name stripe-cash-balance to the mock scenario and changed the API path to /v1/customers/:customer/cash_balance so that it can capture customer-id as a path variable. I added a regular expression to ensure that the HTTP request includes an Authorization header matching Bearer sk_test_[0-9a-fA-F]{10}$ and defined dynamic properties such as {{.customer}}, {{.page}} and {{.pageSize}} so that they will be replaced at runtime.
The mock scenario uses the built-in Go template syntax. You can then upload it as follows:
As you can see, the values of customer, page and pageSize are dynamically updated, and the response header includes the name of the mock scenario along with the request count. You can upload multiple mock scenarios for the same API and the mock API service will play them back sequentially. For example, you can upload another scenario for the above API as follows:
which will return the following error response:
> GET /v1/customers/123/cash_balance?page=2&pageSize=55 HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.65.2
> Accept: */*
> Authorization: Bearer sk_test_0123456789
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< Mock-Request-Count: 1
< X-Mock-Scenario: stripe-customer-failure
< Stripe-Version: 2018-09-06
< Vary: Origin
< Date: Sat, 29 Oct 2022 17:29:15 GMT
< Content-Length: 15
Dynamic Templates with Mock API Scenarios
You can use loops and conditional primitives of the template language, as well as custom functions provided by the API mock library, to generate dynamic responses as follows:
The above example includes a number of template primitives and custom functions to generate dynamic contents, such as:
Loops
Go templates support loops that can be used to generate multiple data entries in the response, e.g.
{{- range $val := Iterate .pageSize }}
Builtin functions
Go templates support custom functions that you can add to your templates. The mock service includes a number of helper functions to generate random data and inspect request state, such as:
{{if NthRequest 10 }} -- for every 10th request
{{if GERequest 10 }} -- if number of requests made to API so far are >= 10
{{if LTRequest 10 }} -- if number of requests made to API so far are < 10
The template syntax allows you to define conditional logic, such as:
In the above example, the mock API will return HTTP status 500 or 501 for every 10th request and 200 or 400 for other requests. You can use this conditional syntax to simulate different error statuses or customize the response.
Loops
{{- range $val := Iterate 10}}
{{if LastIter $val 10}}{{else}},{{end}}
{{ end }}
Test fixtures
The mock service allows you to upload a test fixture that you can refer to in your template, e.g.
The above example loads a random line from a lines.txt fixture. As you may need deterministic random data in some cases, you can use Seeded functions to generate predictable data so that the service returns the same data. The following example reads a text fixture to load a property from a file:
You can also customize the response status by setting the X-Mock-Response-Status request header, and add a delay before the response by setting the X-Mock-Wait-Before-Reply header.
Using Test Fixtures
You can define test data in your test fixtures and then upload them as follows:
In the above example, test fixtures for lines.txt and props.yaml are uploaded and become available for all GET requests under the /devices URL path. You can then refer to the above fixtures in your templates. You can also use this to serve binary files, e.g. you can define an image template file as follows:
The API mock service defines the following custom functions that can be used to generate test data:
Numeric Random Data
Following functions can be used to generate numeric data within a range or with a seed to always generate deterministic test data:
Random
SeededRandom
RandNumMinMax
RandIntArrayMinMax
Text Random Data
Following functions can be used to generate text data within a range or with a seed to always generate deterministic test data:
RandStringMinMax
RandStringArrayMinMax
RandRegex
RandEmail
RandPhone
RandDict
RandCity
RandName
RandParagraph
RandPhone
RandSentence
RandString
RandStringMinMax
RandWord
Email/Host/URL
RandURL
RandEmail
RandHost
Boolean
Following functions can be used to generate boolean data:
RandBool
SeededBool
UDID
Following functions can be used to generate UDIDs:
Udid
SeededUdid
String Enums
Following functions can be used to generate a string from a set of Enum values:
EnumString
Integer Enums
Following functions can be used to generate an integer from a set of Enum values:
EnumInt
Random Names
Following functions can be used to generate random names:
RandName
SeededName
City Names
Following functions can be used to generate random city names:
RandCity
SeededCity
Country Names or Codes
Following functions can be used to generate random country names or codes:
RandCountry
SeededCountry
RandCountryCode
SeededCountryCode
File Fixture
Following functions can be used to generate random data from a fixture file:
RandFileLine
SeededFileLine
FileProperty
JSONFileProperty
YAMLFileProperty
Generate Mock API Behavior from OpenAPI or Swagger Specifications
If you are using Open API or Swagger for API specifications, you can simply upload a YAML based API specification. For example, here is a sample Open API specification from Twilio:
openapi: 3.0.1
paths:
/v1/AuthTokens/Promote:
servers:
- url: https://accounts.twilio.com
description: Auth Token promotion
x-twilio:
defaultOutputProperties:
- account_sid
- auth_token
- date_created
pathType: instance
mountName: auth_token_promotion
post:
description: Promote the secondary Auth Token to primary. After promoting the
new token, all requests to Twilio using your old primary Auth Token will result
in an error.
responses:
'200':
content:
application/json:
schema:
$ref: '#/components/schemas/accounts.v1.auth_token_promotion'
description: OK
security:
...
schemas:
accounts.v1.auth_token_promotion:
type: object
properties:
account_sid:
type: string
minLength: 34
maxLength: 34
pattern: ^AC[0-9a-fA-F]{32}$
nullable: true
description: The SID of the Account that the secondary Auth Token was created
for
auth_token:
type: string
nullable: true
description: The promoted Auth Token
date_created:
type: string
format: date-time
nullable: true
description: The ISO 8601 formatted date and time in UTC when the resource
was created
date_updated:
type: string
format: date-time
nullable: true
description: The ISO 8601 formatted date and time in UTC when the resource
was last updated
url:
type: string
format: uri
nullable: true
description: The URI for this resource, relative to `https://accounts.twilio.com`
...
It will generate mock scenarios for each API based on mime-type, status-code, parameter formats, regex and data ranges, e.g.:
name: UpdateAuthTokenPromotion-xx
path: /v1/AuthTokens/Promote
description: Promote the secondary Auth Token to primary. After promoting the new token, all requests to Twilio using your old primary Auth Token will result in an error.
request:
match_query_params: {}
match_headers: {}
match_content_type: ""
match_contents: ""
example_path_params: {}
example_query_params: {}
example_headers: {}
example_contents: ""
response:
headers: {}
content_type: application/json
contents: '{"account_sid":"{{RandRegex `^AC[0-9a-fA-F]{32}$`}}",\
"auth_token":"{{RandStringMinMax 0 0}}","date_created":"{{Time}}",\
"date_updated":"{{Time}}","url":"https://{{RandName}}.com"}'
contents_file: ""
status_code: 200
wait_before_reply: 0s
In the above example, the account_sid uses a regex to generate data and the URI format to generate a URL. You can then invoke the mock API as:
curl -v -X POST http://localhost:8080/v1/AuthTokens/Promote
In addition to serving mock APIs, you can also use the built-in chaos client for stochastic testing of remote services, generating random data based on regex or API specifications. For example, you may capture a test scenario for a remote API using the HTTP proxy, such as:
You can then customize this scenario with additional assertions, and you may remove all response contents as they won't be used. Note that the above scenario is defined with the group todos. You can then submit a request for chaos testing as follows:
The above request will submit 10 requests to the todo server with random data and return a response such as:
{"errors":null,"failed":0,"succeeded":10}
If you have locally captured data, you can also run the chaos client from the command line without running the mock server, e.g.:
go run main.go chaos --base_url https://jsonplaceholder.typicode.com --group todos --times 10
Static Assets
The mock service can also serve static assets from a user-defined folder, as follows:
cp static-file default_assets
# execute the API mock server
make && ./out/bin/api-mock-service
# access assets
curl http://localhost:8080/_assets/default_assets
API Reference
The API specification for the mock library defines details for managing mock scenarios and customizing the mocking behavior.
Summary
Building and testing distributed systems often requires deploying a deep stack of dependent services, which makes development hard in a local environment with limited resources. Ideally, you should be able to deploy and test the entire stack without using the network or requiring remote access, so that you can spend more time on building features instead of configuring your local environment. The above examples show how you can use https://github.com/bhatti/api-mock-service to mock APIs for testing purposes and define test scenarios for simulating both happy and error cases, as well as injecting faults or network delays into your testing process so that you can test for fault tolerance. This mock library can be used to define API mock behavior using record/play, the template language or API specification standards. I have found tools like this to be of great use when developing microservices, and hopefully you find it useful too. Feel free to connect with your feedback or suggestions.
I recently needed a way to control access to shared resources in a distributed system for concurrency control. Though you may use Consul or ZooKeeper (or low-level Raft/Paxos implementations if you are brave) for managing distributed locks, I wanted to reuse an existing database without adding another dependency to my tech stack. Most databases support transactions or conditional updates with varying degrees of transactional guarantees, but they can't be used for distributed locks if the business logic you need to protect resides outside the database. I found a lock client library based on AWS databases, but it didn't support semaphores. The library implementation tightly coupled the concerns of lock management and database access, and it wasn't easy to extend. For example, the following diagram shows the cyclic dependencies in the code:
Due to the above deficiencies in existing solutions, I decided to implement my own distributed locks library in Rust with the following capabilities:
Allow creating lease based locks that can either protect a single shared resource with a Mutex lock or protect a finite set of shared resources with a Semaphore lock.
Allow renewing leases based on periodic intervals so that stale locks can be acquired by other users.
Allow releasing Mutex and semaphore locks explicitly after the user performs a critical action.
CRUD APIs to manage Mutex and Semaphore entities in the database.
Multi-tenancy support for different clients in case the database is shared by multiple users.
Fair locks to support first-come, first-serve access grants when acquiring the same lock concurrently.
A scalable solution supporting tens of thousands of concurrent mutexes and semaphores.
Support for multiple data stores: relational databases such as MySQL, PostgreSQL and SQLite, as well as NoSQL/cache data stores such as AWS DynamoDB and Redis.
High-level Design
I chose Rust to build the library for managing distributed locks due to strict performance and correctness requirements. Following diagram shows the high-level components in the new library:
LockManager Interface
The client interacts with the LockManager, which defines the following operations to acquire, release and renew lock leases and to manage the lifecycle of Mutexes and Semaphores:
#[async_trait]
pub trait LockManager {
// Attempts to acquire a lock until it either acquires the lock, or a specified additional_time_to_wait_for_lock_ms is
// reached. This method will poll database based on the refresh_period. If it does not see the lock in database, it
// will immediately return the lock to the caller. If it does see the lock, it will note the lease expiration on the lock. If
// the lock is deemed stale, (that is, there is no heartbeat on it for at least the length of its lease duration) then this
// will acquire and return it. Otherwise, if it waits for as long as additional_time_to_wait_for_lock_ms without acquiring the
// lock, then it will return LockError::NotGranted.
//
async fn acquire_lock(&self, opts: &AcquireLockOptions) -> LockResult<MutexLock>;
// Releases the given lock if the current user still has it, returning true if the lock was
// successfully released, and false if someone else already stole the lock. Deletes the
// lock item if it is released and delete_lock_item_on_close is set.
async fn release_lock(&self, opts: &ReleaseLockOptions) -> LockResult<bool>;
// Sends a heartbeat to indicate that the given lock is still being worked on.
// This method will also set the lease duration of the lock to the given value.
// This will also either update or delete the data from the lock, as specified in the options
async fn send_heartbeat(&self, opts: &SendHeartbeatOptions) -> LockResult<MutexLock>;
// Creates mutex if doesn't exist
async fn create_mutex(&self, mutex: &MutexLock) -> LockResult<usize>;
// Deletes mutex lock if not locked
async fn delete_mutex(&self,
other_key: &str,
other_version: &str,
other_semaphore_key: Option<String>) -> LockResult<usize>;
// Finds out who owns the given lock, but does not acquire the lock. It returns the metadata currently associated with the
// given lock. If the client currently has the lock, it will return the lock, and operations such as release_lock will work.
// However, if the client does not have the lock, then operations like releaseLock will not work (after calling get_lock, the
// caller should check mutex.expired() to figure out if it currently has the lock.)
async fn get_mutex(&self, mutex_key: &str) -> LockResult<MutexLock>;
// Creates or updates semaphore with given max size
async fn create_semaphore(&self, semaphore: &Semaphore) -> LockResult<usize>;
// Returns semaphore for the key
async fn get_semaphore(&self, semaphore_key: &str) -> LockResult<Semaphore>;
// find locks by semaphore
async fn get_semaphore_mutexes(&self,
other_semaphore_key: &str,
) -> LockResult<Vec<MutexLock>>;
// Deletes semaphore if all associated locks are not locked
async fn delete_semaphore(&self,
other_key: &str,
other_version: &str,
) -> LockResult<usize>;
}
The LockManager interacts with a LockStore to access mutexes and semaphores, which delegates to implementations of the mutex and semaphore repositories for lock management. The library defines two implementations of LockStore: DefaultLockStore, which supports mutexes and semaphores, where mutexes are used to acquire a singular lock and semaphores are used to acquire a lock from a set of finite shared resources; and FairLockStore, which uses a Redis-specific implementation of fair semaphores for managing lease-based semaphores that support first-come, first-serve order. The LockManager supports waiting for a lock if it is not immediately available, periodically checking for the availability of the mutex or semaphore. Due to this periodic polling, the fair semaphore algorithm won't maintain FIFO order if a new client requests a lock while a previous lock request is waiting for the next polling interval.
Create Lock Manager
You can instantiate a Lock Manager with relational database store as follows:
let config = LocksConfig::new("test_tenant");
let mutex_repo = factory::build_mutex_repository(RepositoryProvider::Rdb, &config)
.await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
RepositoryProvider::Rdb, &config)
.await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));
let locks_manager = LockManagerImpl::new(
&config, store, &default_registry()).expect("failed to initialize lock manager");
Alternatively, you can choose AWS Dynamo DB as follows:
let mutex_repo = factory::build_mutex_repository(
RepositoryProvider::Ddb, &config).await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
RepositoryProvider::Ddb, &config).await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));
Or Redis based data-store as follows:
let mutex_repo = factory::build_mutex_repository(
RepositoryProvider::Redis, &config).await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
RepositoryProvider::Redis, &config).await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));
Note: The AWS DynamoDB implementation uses the strongly consistent reads feature, because reads are eventually consistent by default.
Acquiring a Mutex Lock
You will need to build options for acquiring the lock with a key name and a lease period, and then acquire it:
let opts = AcquireLockOptionsBuilder::new("mylock")
.with_lease_duration_secs(10).build();
let lock = lock_manager.acquire_lock(&opts)
.expect("should acquire lock");
The acquire_lock operation will automatically create the mutex lock if it doesn't exist; otherwise it will wait up to the lease period if the lock is not available. This returns a mutex lock structure that includes:
A lock is only held for the duration specified by lease_duration, but you can renew it periodically if needed:
let opts = SendHeartbeatOptionsBuilder::new("one", "258d513e-bae4-4d91-8608-5d500be27593")
.with_lease_duration_secs(15)
.build();
let updated_lock = lock_manager.send_heartbeat(&opts)
.expect("should renew lock");
Note: The lease renewal will also update the version of the lock, so you will need to use the updated version to renew or release the lock.
Releasing the lease of Lock
You can build release options from the lock returned by the above API and then release it:
let release_opts = ReleaseLockOptionsBuilder::new("one", "258d513e-bae4-4d91-8608-5d500be27593")
.build();
lock_manager.release_lock(&release_opts)
.expect("should release lock");
Acquiring a Semaphore based Lock
Semaphores allow you to define a set of locks for a resource with a maximum size. The operation for acquiring a semaphore is similar to acquiring a regular lock except that you specify the semaphore size, e.g.:
let opts = AcquireLockOptionsBuilder::new("my_pool")
.with_lease_duration_secs(15)
.with_semaphore_max_size(10)
.build();
let lock = lock_manager.acquire_lock(&opts)
.expect("should acquire semaphore lock");
The acquire_lock operation will automatically create the semaphore if it doesn't exist; it will then check for available locks and wait if all the locks are busy. This returns a lock structure that includes:
Fair semaphores are only available for Redis due to the internal implementation, and they require enabling the fair_semaphore configuration option; otherwise their usage is similar to the above operations, e.g.:
let mut config = LocksConfig::new("test_tenant");
config.fair_semaphore = Some(true);
let fair_semaphore_repo = factory::build_fair_semaphore_repository(
RepositoryProvider::Redis, &config)
.await.expect("failed to create fair semaphore");
let store = Box::new(FairLockStore::new(&config, fair_semaphore_repo));
let locks_manager = LockManagerImpl::new(
&config, store, &default_registry())
.expect("failed to initialize lock manager");
Then acquire lock similar to the semaphore syntax as before:
let opts = AcquireLockOptionsBuilder::new("my_pool")
.with_lease_duration_secs(15)
.with_semaphore_max_size(10)
.build();
let lock = lock_manager.acquire_lock(&opts)
.expect("should acquire semaphore lock");
The acquire_lock operation will automatically create the semaphore if it doesn't exist; it will then check for available locks and wait if all the locks are busy. This returns a lock structure that includes:
The fair semaphore lock does not use mutexes internally, but for API compatibility it builds a mutex with a key based on a combination of the semaphore key and version. You can then query the semaphore state as follows:
let semaphore = locks_manager.get_semaphore("one").await
.expect("failed to find semaphore");
In addition to the Rust-based interface, the distributed locks library also provides a command-line interface for managing mutex and semaphore locks, e.g.:
Mutexes and Semaphores based Distributed Locks with databases.
Usage: db-locks [OPTIONS] [PROVIDER] <COMMAND>
Commands:
acquire
heartbeat
release
get-mutex
delete-mutex
create-mutex
create-semaphore
get-semaphore
delete-semaphore
get-semaphore-mutexes
help
Print this message or the help of the given subcommand(s)
Arguments:
[PROVIDER]
Database provider [default: rdb] [possible values: rdb, ddb, redis]
Options:
-t, --tenant <TENANT>
tentant-id for the database [default: local-host-name]
-f, --fair-semaphore <FAIR_SEMAPHORE>
fair semaphore lock [default: false] [possible values: true, false]
-j, --json-output <JSON_OUTPUT>
json output of result from action [default: false] [possible values: true, false]
-c, --config <FILE>
Sets a custom config file
-h, --help
Print help information
-V, --version
Print version information
For example, you can acquire fair semaphore lock as follows:
% REDIS_URL=redis://192.168.1.102 cargo run -- --fair-semaphore true --json-output true redis acquire --key one --semaphore-max-size 10
You can run following command for renewing above lock:
% REDIS_URL=redis://192.168.1.102 cargo run -- --fair-semaphore true --json-output true redis heartbeat --key one_69816448-7080-40f3-8416-ede1b0d90e80 --semaphore-key one --version 69816448-7080-40f3-8416-ede1b0d90e80
And then release it as follows:
% REDIS_URL=redis://192.168.1.102 cargo run -- --fair-semaphore true --json-output true redis release --key one_69816448-7080-40f3-8416-ede1b0d90e80 --semaphore-key one --version 69816448-7080-40f3-8416-ede1b0d90e80
Summary
I was able to meet the initial goals for implementing distributed locks, though this library is still early in its development. You can download and try it from https://github.com/bhatti/db-locks. Feel free to send your feedback or contribute to this library.
As businesses grow with a larger customer base and hire more employees, they face challenges in meeting customer demands in terms of scaling their systems and maintaining rapid product development with bigger teams. Businesses aim to scale systems linearly with additional computing and human resources. However, system architectures such as a monolith or a big ball of mud make scaling systems linearly onerous. Similarly, teams become less efficient as they grow in size and turn into silos. A general solution to scaling business or technical problems is divide and conquer: partition the problem into multiple sub-problems. A number of factors affect the scalability of software architectures and organizations, such as the interactions among system components or the communication between teams. For example, coordination, communication and data/knowledge coherence among system components and teams become disproportionately expensive with growth in size. Software engineering and business management have developed a number of laws and principles that can be used to evaluate constraints and trade-offs related to these scalability challenges. Following is a list of a few laws from the technology and business domains for scaling software architectures and business organizations:
Amdahl's Law
Amdahl's Law, named after Gene Amdahl, is used to predict the speedup of a task's execution time when it is scaled to run on multiple processors. It states that the maximum speedup is limited by the serial fraction of the task execution, as that fraction creates resource contention:
Speed up (P, N) = 1 / [ (1 - P) + P / N ]
Where P is the fraction of the task that can run in parallel on N processors. When N becomes large, P / N approaches 0, so the speedup is restricted to 1 / (1 - P), where the serial fraction (1 - P) becomes a source of contention due to data coherence, state synchronization, memory access, I/O or other shared resources.
Amdahl’s law can also be described in terms of throughput using:
N / [ 1 + a (N - 1) ]
Where a is the serial fraction between 0 and 1. In parallel computing, a class of problems known as embarrassingly parallel workloads has little or no dependency among the parallel tasks, so their value of a is close to 0 because they require almost no inter-task communication overhead.
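As a quick illustration of both forms, here is a small Rust sketch (with hypothetical numbers) that computes the speedup and relative throughput for a growing number of processors:

// Amdahl's law: speedup for a task whose parallel fraction is p on n processors.
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

// Throughput form: relative throughput with serial fraction a.
fn amdahl_throughput(a: f64, n: f64) -> f64 {
    n / (1.0 + a * (n - 1.0))
}

fn main() {
    let p = 0.9; // 90% of the task is parallelizable, so the serial fraction is 0.1
    for n in [1.0, 2.0, 4.0, 8.0, 16.0, 1024.0] {
        println!(
            "n={:>6} speedup={:6.2} throughput={:7.2}",
            n,
            amdahl_speedup(p, n),
            amdahl_throughput(1.0 - p, n)
        );
    }
    // Even with 1024 processors the speedup stays below 1 / (1 - p) = 10.
}

Even a small serial fraction dominates once the processor count grows, which is the same effect that serial hand-offs have on team throughput.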
Amdahl's law can also be applied to scaling teams as an organization grows, where teams are organized as small, cross-functional groups to parallelize feature work for different product lines or business domains; however, the maximum speedup will still be limited by the serial fraction of the work. The serial work can be: build and deployment pipelines; reviewing and merging changes; communication and coordination between teams; and dependencies on deliverables from other teams. Fred Brooks described in his book The Mythical Man-Month how adding people to a highly divisible task can reduce overall task duration, but other tasks are not so easily divisible: while it takes one woman nine months to make one baby, “nine women can’t make a baby in one month”.
Brooks’s Law
Brooks's law, coined by Fred Brooks, states that adding manpower to a late software project makes it later due to ramp-up time. As the size of a team increases, the ramp-up time for new employees also increases due to the quadratic communication overhead among team members, e.g.
Number of communication channels = N x (N - 1) / 2
Organizations can build small teams, such as two-pizza/single-threaded teams, where the communication channels within each team do not explode and the cross-functional nature of the teams requires less communication with, and fewer dependencies on, other teams. Brooks's law can equally be applied to technology when designing distributed services or components, so that each service is designed as a loosely coupled module around a business domain to minimize communication with other services, and services only communicate through well designed interfaces.
Universal Scalability Law
The Universal Scalability Law is used for capacity planning and was derived from Amdahl’s law by Dr. Neil Gunther. It describes relative capacity in terms of concurrency, contention and coherency:
C(N) = N / [1 + a (N - 1) + B x N x (N - 1)]
Where C(N) is the relative capacity, a is the serial fraction between 0 and 1 due to resource contention, and B is the delay for data coherency or consistency. Because the data coherency term (B) is quadratic in N, it becomes more expensive as N increases; e.g. using a consensus algorithm such as Paxos to reach state consistency among a large set of servers is impractical because it requires additional communication between all servers. Instead, large-scale distributed storage services generally use sharding/partitioning and a gossip protocol with a leader-based consensus algorithm to minimize peer-to-peer communication.
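To see how the contention (a) and coherency (B) terms limit and eventually reduce capacity, here is a small Rust sketch with hypothetical coefficients:

// Universal Scalability Law: relative capacity C(N) for N workers,
// with contention coefficient a and coherency coefficient b.
fn usl_capacity(n: f64, a: f64, b: f64) -> f64 {
    n / (1.0 + a * (n - 1.0) + b * n * (n - 1.0))
}

fn main() {
    let (a, b) = (0.03, 0.0005); // hypothetical contention and coherency costs
    for n in [1.0, 8.0, 16.0, 32.0, 64.0, 128.0] {
        println!("N={:>4} capacity={:6.2}", n, usl_capacity(n, a, b));
    }
    // Unlike Amdahl's law, the quadratic coherency term makes capacity peak
    // and then decline as N keeps growing (retrograde scaling).
}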
The Universal Scalability Law can be applied to scaling teams similarly to Amdahl's law, where a models the serial work or dependencies between teams and B models the communication needed to maintain a consistent understanding among team members. The cost of B can be minimized by building small, cross-functional teams so that teams can make progress independently. You can also apply this model to any decision-making process by keeping the number of stakeholders or decision makers small, so that they can reach agreement without grinding to a halt.
Gossip protocols also apply to people: they can be used along with a writing culture, lunch & learns and osmotic communication to spread knowledge and learnings from one team to other teams.
Little's Law
Little's law comes from queueing theory and relates the average number of items in a system to their arrival rate and the time they spend in the system:
L = A x W
Where L is the average number of items within the system or queue, A is the average arrival rate of items and W is the average time an item spends in the system. Little's law and queuing theory can be used for capacity planning of computing servers and for minimizing waiting time in the queue.
Little's law can be applied to predicting the task completion rate in an agile process, where L represents the work-in-progress (WIP) for a sprint, A represents the arrival and departure rate or throughput/capacity of tasks, and W represents the lead time, i.e. the average amount of time a task spends in the system.
WIP = Throughput x Lead-Time
Lead-Time = WIP / Throughput
You can use this relationship to reduce the work in progress or the lead time and improve the throughput of task completion. Little's law observes that you can accomplish more by keeping work-in-progress or inventory small. You will also be able to respond better to unpredictable delays if you keep a buffer in your capacity and avoid 100% utilization.
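As a small illustration of the WIP/throughput/lead-time relationship, consider this Rust sketch with hypothetical sprint numbers:

// Little's law applied to a delivery process: WIP = throughput x lead_time,
// so lead_time = WIP / throughput.
fn lead_time(wip: f64, throughput_per_day: f64) -> f64 {
    wip / throughput_per_day
}

fn main() {
    let throughput_per_day = 2.0; // hypothetical: the team finishes 2 tasks per day
    for wip in [4.0, 10.0, 20.0] {
        println!(
            "WIP={:>4} tasks -> average lead time {:.1} days",
            wip,
            lead_time(wip, throughput_per_day)
        );
    }
    // Halving the work-in-progress halves the average lead time
    // for the same throughput.
}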
Kingman's Formula
Kingman's formula extends Little's law by adding utilization and variability to predict the average wait time before a request is served:
E(W) = [p / (1 - p)] x [(ca^2 + cs^2) / 2] x T
where T is the mean service time, m = 1/T is the service rate, A is the mean arrival rate, p = A/m is the utilization, ca is the coefficient of variation for arrivals and cs is the coefficient of variation for service times. Kingman's formula shows that queue sizes grow toward infinity as you approach 100% utilization, and that queues get longer with greater variability of work. These insights can be applied to both technical and business processes so that you can build systems with greater predictability of processing time, a smaller wait time E(W) and higher throughput.
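A small Rust sketch of the formula above, using hypothetical utilization and variability values, shows how quickly the expected wait grows as utilization approaches 100%:

// Kingman's approximation for the mean wait time in a G/G/1 queue:
// E(W) ~= [p / (1 - p)] * [(ca^2 + cs^2) / 2] * T
fn kingman_wait(utilization: f64, ca: f64, cs: f64, mean_service_time: f64) -> f64 {
    (utilization / (1.0 - utilization)) * ((ca * ca + cs * cs) / 2.0) * mean_service_time
}

fn main() {
    let (ca, cs, t) = (1.0, 1.0, 0.1); // hypothetical variability and a 100ms service time
    for p in [0.5, 0.8, 0.9, 0.95, 0.99] {
        println!("utilization={:4.2} expected wait={:7.3}s", p, kingman_wait(p, ca, cs, t));
    }
    // At 99% utilization the expected wait is nearly 100x the wait at 50% utilization.
}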
Note: See Erlang analysis for serving requests in a system without a queue where new requests are blocked or rejected if there is not sufficient capacity in the system.
Gustafson’s Law
Gustafson's law refines Amdahl's law with the observation that parallel computing enables solving larger problems by performing computations on very large data sets in a fixed amount of time. It is defined as:
S = s + p x N
S = s + (1 - s) x N
S = N + (1 – N) x s
where S is the theoretical speed up with parallelism, N is the number of processors, s is the serial fraction and p is the parallel part such that s + p = 1.
Gustafson's law shows that limitations imposed by the sequential fraction of a program may be countered by increasing the total amount of computation. This allows solving bigger technical and business problems with greater computing and human resources.
Conway’s Law
Conway's law states that an organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. It means that the architecture of a system is derived from the team structures of an organization; however, you can also use the architecture to derive the team structures. This allows building teams along architecture boundaries so that each team is small, cross-functional and cohesive. A study by Harvard Business School found that large co-located teams often tended to produce more tightly coupled and monolithic codebases, whereas small distributed teams produced more modular codebases. These lessons can be applied to scaling teams and architecture so that teams and system modules are built around organizational boundaries and independent concerns to promote autonomy and reduce tight coupling.
Pareto Principle
The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes. This principle shows up in numerous technical and business problems, such as: 20% of the code has 80% of the errors; customers use 20% of the functionality 80% of the time; 80% of optimization improvements come from 20% of the effort; etc. It can also be used to identify hotspots or critical paths when scaling, as some microservices or teams may receive disproportionate demand. Though scaling computing resources is relatively easy, scaling a team beyond an organizational boundary is hard. You will have to apply other management tools such as prioritization, planning, metrics, automation and better communication to manage critical work.
Metcalfe's Law and Reed's Law
Metcalfe's Law states that the value of a communication network grows with the square of the number of connected users, since the number of possible pairwise connections grows quadratically:
Number of possible pair connections = N x (N - 1) / 2
Reed’s Law expanded this law and observed that the utility of large networks can scale exponentially with the size of the network.
Number of possible subgroups of a network = 2^N - N - 1
This law explains the popularity of social networking services driven by viral communication. These laws can be applied to model information flow between teams or message exchange between services, to avoid peer-to-peer communication within an extremely large group of people or set of nodes. A common alternative is to use a gossip protocol, or to designate a partition leader for each group that communicates with other leaders and then disseminates information within the group.
Dunbar Number
Dunbar's number is a suggested cognitive limit to the number of people with whom one can maintain stable social relationships. It has a commonly used value of 150 and can be used to limit direct communication connections within an organization.
Wirth’s Law and Parkinson’s Law
Wirth's Law, named after Niklaus Wirth, posits that software tends to become slower at a rate that outpaces the speed at which hardware becomes faster. This observation reflects a trend where, despite significant advancements in processor speeds as predicted by Moore's Law, software complexity increases correspondingly. Developers often leverage these hardware improvements to create more intricate and feature-rich software, which can negate the hardware gains. Additionally, the use of programming languages and tools that do not prioritize efficiency can lead to bloated code.
In the realm of software development, there are similar principles, such as Parkinson's law, which suggests that work expands to fill the time allotted for its completion. This implies that given more time, software projects may become more complex or extended than initially necessary. Moreover, Hofstadter's Law offers a cautionary perspective, stating, “It always takes longer than you expect, even when you take into account Hofstadter's Law.” Brooks's Law further adds to these insights with the adage, “Adding manpower to a late software project makes it later.” These laws collectively emphasize that the demand upon a resource tends to expand to match the supply of the resource, and that adding resources late also poses challenges due to the complexity of software development and project management.
Principle of Priority Inversion
In modern operating systems, the concept of priority inversion arises when a high-priority process needs resources or data from a low-priority process, but the low-priority process never gets a chance to execute due to its lower priority. This creates a deadlock or inefficiency where the high-priority process is blocked indefinitely. To avoid this, schedulers in modern operating systems adjust the priority of the lower-priority process to ensure it can complete its task and release the necessary resources, allowing the high-priority process to continue.
This same principle applies to organizational dynamics when scaling teams and projects. Imagine a high-priority initiative that requires collaboration from another team whose priorities do not align. Without proper coordination, the team working on the high-priority initiative may never get the support they need, leading to delays or blockages. Just as in operating systems, where a priority adjustment is needed to keep processes running smoothly, organizations must also ensure alignment across teams by managing a global list of priorities. A solution is to maintain a global prioritized list of projects that is visible to all teams. This ensures that the most critical initiatives are recognized and appropriately supported by every team, regardless of their individual workloads. This centralized prioritization ensures that teams working on essential projects can quickly receive the help or resources they need, avoiding bottlenecks or deadlock-like situations where progress stalls because of misaligned priorities.
Load Balancing (Round Robin, Least Connection)
Load balancing algorithms distribute tasks across multiple servers to optimize resource utilization and prevent any single server from becoming overwhelmed. Common strategies include round-robin (distributing tasks evenly across servers) and least connection (directing new tasks to the server with the fewest active connections).
Load balancing can be applied to distribute work among teams or individuals. For instance, round-robin can ensure that tasks are equally assigned to team members, while the least-connection principle can help assign tasks to those with the lightest workload, ensuring no one is overloaded. This leads to more efficient task management, better resource allocation, and balanced work distribution.
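To make the two strategies concrete, here is a minimal Rust sketch (with hypothetical worker names): round-robin cycles through the workers in a fixed order, while least-connection picks the worker with the smallest active load:

// Minimal sketch of two load-balancing strategies over a pool of workers.
struct Pool {
    workers: Vec<&'static str>,
    active: Vec<usize>, // current number of tasks per worker
    next: usize,        // cursor for round-robin
}

impl Pool {
    fn new(workers: Vec<&'static str>) -> Self {
        let n = workers.len();
        Pool { workers, active: vec![0; n], next: 0 }
    }

    // Round-robin: hand out workers in a fixed rotation.
    fn round_robin(&mut self) -> &'static str {
        let i = self.next % self.workers.len();
        self.next += 1;
        self.active[i] += 1;
        self.workers[i]
    }

    // Least connection: pick the worker with the fewest active tasks.
    fn least_connection(&mut self) -> &'static str {
        let (i, _) = self
            .active
            .iter()
            .enumerate()
            .min_by_key(|(_, count)| **count)
            .expect("pool must not be empty");
        self.active[i] += 1;
        self.workers[i]
    }
}

fn main() {
    let mut pool = Pool::new(vec!["alice", "bob", "carol"]);
    // Simulate assigning five tasks with round-robin.
    for task in 0..5 {
        println!("task {} -> {} (round robin)", task, pool.round_robin());
    }
    pool.active = vec![3, 1, 2]; // pretend workers already carry uneven load
    for task in 5..8 {
        println!("task {} -> {} (least connection)", task, pool.least_connection());
    }
}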
MapReduce
MapReduce splits a large task into smaller sub-tasks (map step) that can be processed in parallel, then aggregates the results (reduce step) to provide the final output. In a large project, teams or individuals can be assigned sub-tasks that they can work on independently. Once all the sub-tasks are complete, the results can be aggregated to deliver the final outcome. This fosters parallelism, reduces bottlenecks, and allows for scalable team collaboration, especially for large or complex projects.
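The following Rust sketch illustrates the idea with standard library threads: the map step computes partial word counts on independent chunks in parallel, and the reduce step merges them (the documents and chunking are hypothetical):

use std::collections::HashMap;
use std::thread;

fn main() {
    let documents = vec![
        "map reduce splits a large task",
        "smaller sub tasks run in parallel",
        "reduce merges the partial results",
    ];

    // Map step: count words in each chunk on its own thread.
    let handles: Vec<_> = documents
        .into_iter()
        .map(|doc| {
            thread::spawn(move || {
                let mut counts = HashMap::new();
                for word in doc.split_whitespace() {
                    *counts.entry(word.to_string()).or_insert(0u32) += 1;
                }
                counts
            })
        })
        .collect();

    // Reduce step: merge the partial counts into a final result.
    let mut total: HashMap<String, u32> = HashMap::new();
    for handle in handles {
        for (word, count) in handle.join().expect("map task panicked") {
            *total.entry(word).or_insert(0) += count;
        }
    }

    println!("{:?}", total);
}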
Deadlock Prevention (Banker’s Algorithm)
The Banker’s Algorithm is used to prevent deadlocks by allocating resources in such a way that there is always a safe sequence of executing processes, avoiding circular wait conditions. In managing interdependent teams or tasks, it’s important to avoid deadlocks where teams wait on each other indefinitely. By proactively ensuring that resources (e.g., knowledge, tools, approvals) are available before committing teams to work, project managers can prevent deadlock scenarios. Prioritizing resource allocation and anticipating dependencies can ensure steady progress without one team stalling another.
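As a sketch of the idea, the core of the Banker's Algorithm is a safety check: a request is only granted if, afterwards, there is still some order in which every process can finish with the remaining resources. The resource matrices below are hypothetical:

// Banker's algorithm safety check: is there an order in which all processes
// can finish given the currently available resources?
fn is_safe(mut available: Vec<u32>, need: &[Vec<u32>], allocation: &[Vec<u32>]) -> bool {
    let n = need.len();
    let mut finished = vec![false; n];
    loop {
        // Find a process whose remaining need fits within the available resources.
        let candidate = (0..n).find(|&i| {
            !finished[i] && need[i].iter().zip(available.iter()).all(|(nd, av)| nd <= av)
        });
        match candidate {
            Some(i) => {
                // Pretend process i runs to completion and releases its allocation.
                for (av, alloc) in available.iter_mut().zip(&allocation[i]) {
                    *av += alloc;
                }
                finished[i] = true;
            }
            None => return finished.iter().all(|&f| f),
        }
    }
}

fn main() {
    // Hypothetical state: two resource types, three processes.
    let available = vec![3, 2];
    let allocation = vec![vec![0, 1], vec![2, 0], vec![3, 0]];
    let need = vec![vec![4, 1], vec![1, 2], vec![0, 2]];
    println!("safe state: {}", is_safe(available, &need, &allocation));
}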
Consensus Algorithms (Paxos, Raft)
Consensus algorithms ensure that distributed systems agree on a single data value or decision, despite potential failures. Paxos and Raft are used to maintain consistency across distributed nodes. In projects involving multiple stakeholders or teams, reaching consensus on decisions can be challenging, especially with different priorities and viewpoints. Consensus-building techniques, inspired by these algorithms, could involve ensuring that key stakeholders agree before any significant action is taken, much like how Paxos ensures agreement across distributed systems. This avoids misalignment and fosters collaboration and trust across teams.
Rate Limiting
Rate limiting controls the number of requests or operations that can be performed in a given timeframe to prevent overloading a system. This concept applies to managing expectations, particularly in teams with multiple incoming requests. Rate limiting can be applied to protect teams from being overwhelmed by too many requests at once. By limiting how many tasks or requests a team can handle at a time, project managers can ensure a sustainable work pace and prevent burnout, much like how rate limiting helps protect system stability.
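Here is a minimal token-bucket sketch in Rust (with hypothetical capacity and refill rate) that accepts work only while tokens are available:

use std::time::Instant;

// Minimal token-bucket rate limiter: refills at `rate` tokens per second
// up to `capacity`, and each request consumes one token.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, rate, last_refill: Instant::now() }
    }

    fn try_acquire(&mut self) -> bool {
        // Refill tokens based on the time elapsed since the last call.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Hypothetical limits: a burst of 5 requests, refilling 2 tokens per second.
    let mut bucket = TokenBucket::new(5.0, 2.0);
    for request in 0..8 {
        println!("request {} accepted: {}", request, bucket.try_acquire());
    }
}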
Summary
The above laws offer strategies for optimizing both technical systems and team dynamics. Amdahl's Law and the Universal Scalability Law highlight the challenges of parallelizing work, emphasizing the need to manage coordination and communication overhead as bottlenecks when scaling teams or systems. Brooks's and Metcalfe's Laws reveal the rapid growth of communication paths, suggesting that effective team scaling requires managing these paths to avoid coordination paralysis. Little's Law and Kingman's Formula suggest limiting work in progress and preventing 100% resource utilization to ensure reliable throughput, while Conway's Law underscores the alignment between team structures and system architecture. Teams and their responsibilities should mirror modular architectures, fostering autonomy and reducing cross-team dependencies.
The Pareto Principle can guide teams to make small but impactful changes in architecture or processes that yield significant productivity improvements. Wirth’s Law and Parkinson’s Law serve as reminders to prevent work bloat and unnecessary complexity by setting clear deadlines and objectives. Dunbar’s Number highlights the human cognitive limit in maintaining external relationships, suggesting that team dependencies should be kept minimal to maintain effective collaboration. The consensus algorithms used in distributed systems can be applied to decision-making and collaboration, ensuring alignment among teams. Error correction algorithms are useful for feedback loops, helping teams iteratively improve. Similarly, techniques like load balancing strategies can optimize task distribution and workload management across teams.
Before applying these laws, it is essential to have clear goals, metrics, and KPIs to measure baselines and improvements. Prematurely implementing these scalability strategies can exacerbate issues rather than resolve them. The focus should be on global optimization of the entire organization or system, rather than focusing on local optimizations that don’t align with broader goals.
Software is eating the world, and today's businesses demand shipping software features at a higher velocity to enable learning at a greater pace without compromising quality. However, each new feature increases the viscosity of the existing code, which can add more complexity and technical debt, so the time to market for new features becomes longer. Managing a sustainable pace for software delivery requires continuous improvements to the software development architecture and practices.
Software Architecture
The software architecture defines the guiding principles and structure of a software system. It also includes quality attributes such as performance, sustainability, security, scalability and resiliency. The software architecture is then continuously updated through an iterative software development process and a feedback cycle from actual use in the production environment. The software architecture decays if it is ignored, which results in higher complexity and technical debt. In order to reduce technical debt, you can build a backlog of technical and architecture related changes so that you can prioritize them along with product development. In order to maintain a consistent architecture throughout your organization, you can document architecture principles to define high-level guidelines for best practices, documentation templates, the review process and guidance for architecture decisions.
Quality Attributes
Following are major quality attributes of the software architecture:
Availability — It defines the percentage of time the system is available, e.g. available-for-use-time / total-time. It is generally expressed in “nines” such as 99.99%, which corresponds to a downtime of about 52 minutes in a year. It can also be calculated from the mean time between failures (MTBF) and the mean time to recover (MTTR) as MTBF / (MTBF + MTTR). The availability depends not only on the service that you are providing but also on its dependent services, e.g. P-service * P-dep-service-1 * P-dep-service-2. You can improve availability with redundant services, whose combined availability is 1 - (1 - Service-availability) ^ Redundancy-factor (see the availability sketch after this list of attributes). In order to further improve availability, you can detect faults and use redundancy and state synchronization for fault recovery. The system should also handle exceptions gracefully so that it doesn’t crash or go into a bad state.
Capacity — Capacity defines how the system scales by adding hardware resources.
Extensibility — Extensibility defines how the system meets future business requirements without significantly changing existing design and code.
Fault Tolerance — Fault tolerance prevents a single point of failure and allows the system to continue operating even when parts of the system fail.
Maintainability — Higher quality code allows building robust software with higher stability and availability. This improves software delivery due to modular and loosely coupled design.
Performance — It is defined in terms of the latency of an operation under normal or peak load. Performance may degrade as resource consumption grows, which affects the throughput and scalability of the system. You can measure user response time, throughput and utilization of computational resources by stress testing the system. A number of tactics can be used to improve performance such as prioritization, reducing overhead, rate-limiting, asynchronicity, caching, etc. Performance testing can be integrated with the continuous delivery process, using load and stress tests to measure performance metrics and resource utilization.
Resilience — Resilience accepts the fact that faults and failures will occur, so system components resist them by retrying, restarting, limiting error propagation or other measures. A failure is when a system deviates from its expected behavior as a result of an accidental fault, misconfiguration, transient network issue or programming error. Two metrics related to resilience are mean time between failures (MTBF) and mean time to recover (MTTR); resilient systems pay more attention to recovery, i.e. a shorter MTTR for fast recovery.
Recovery — Recovery looks at how the system recovers in relation to availability and resilience. Two metrics related to recovery are the recovery point objective (RPO) and the recovery time objective (RTO), where RPO determines how much data can be lost in case of failure and RTO defines how long the system may take to recover.
Reliability — Reliability looks at the probability of failure or failure rate.
Reproducibility — Reproducibility uses version control for code, infrastructure, configuration so that you can track and audit changes easily.
Reusability — It encourages code reuse to improve reliability and productivity, and to save the cost of duplicated effort.
Scalability — It defines the ability of the system to handle an increase in workload without performance degradation. It can be expressed in terms of vertical or horizontal scalability, where horizontal scaling reduces the impact of isolated failures and improves workload availability. Cloud computing offers elastic and auto-scaling features for adding hardware when a higher request rate is detected at the load balancer.
Security — Security primarily looks at confidentiality, integrity and availability (CIA) and is critical in building distributed systems. Building secure systems depends on security practices such as strong identity management, defense in depth, zero trust networks, auditing, and protecting data in motion and at rest. You can adopt DevSecOps, which shifts security left to earlier phases of the software development lifecycle with processes such as Security by Design (SbD), STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege), PASTA (Process for Attack Simulation and Threat Analysis), VAST (Visual, Agile and Simple Threat), CAPEC (Common Attack Pattern Enumeration and Classification), and OCTAVE (Operationally Critical Threat, Asset, and Vulnerability Evaluation).
Testability — It encourages building systems in such a way that it is easier to test them.
Usability — It defines the user experience of the user interface and information architecture.
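To make the availability arithmetic above concrete, here is a minimal Go sketch (the MTBF/MTTR values and the dependency availability are hypothetical) that computes single-instance availability, the combined availability of redundant instances, and the effect of a serial dependency:

package main

import (
    "fmt"
    "math"
)

// availability computes MTBF / (MTBF + MTTR) as a fraction between 0 and 1.
func availability(mtbf, mttr float64) float64 {
    return mtbf / (mtbf + mttr)
}

// withRedundancy computes the combined availability of n redundant
// instances, i.e. 1 - (1 - a)^n.
func withRedundancy(a float64, n int) float64 {
    return 1 - math.Pow(1-a, float64(n))
}

func main() {
    a := availability(1000, 10)              // hypothetical MTBF/MTTR in hours
    fmt.Printf("single instance: %.4f\n", a) // ~0.9901
    fmt.Printf("two instances:   %.6f\n", withRedundancy(a, 2))
    // Serial dependencies multiply: P-service * P-dep-service-1 * ...
    fmt.Printf("with dependency: %.6f\n", a*0.999)
}

Two redundant instances at roughly 99% availability each yield about 99.99% combined availability, while each serial dependency multiplies its own availability back into the total.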
Architecture Patterns
Following is a list of architecture patterns that help build high-quality software:
Asynchronicity
Synchronous services are difficult to scale and to recover from failures because they require low latency and can easily be overwhelmed. Messaging-based asynchronous communication, whether point-to-point or publish/subscribe, is more suitable for handling faults or high load. This improves resilience because service components can restart after a failure while messages remain in the queue.
Admission Control
Admission control adds an authentication, authorization or validation check in front of the event queue so that the service can handle the load and prevent overload when demand exceeds the server capacity.
Back Pressure
When a producer generates work faster than the server can process it, long request queues can build up. Back pressure signals clients that the servers are overloaded and that they need to slow down. However, rogue clients may ignore these signals, so servers often employ other tactics such as admission control, load shedding, rate limiting or throttling.
Big fleet in front of small fleet
You should look at all transitive dependencies when scaling a service with a large fleet of hosts so that you don’t drive large network traffic to dependent services with a smaller fleet. You can use load testing to find the bottlenecks and update SLAs for the dependent services so that they are aware of the network load from your APIs.
Blast Radius
A blast radius defines the impact of a failure on the overall system when an error occurs. In order to limit the blast radius, the system should eliminate single points of failure, roll out changes gradually using canary deployments, and stop cascading failures using circuit breakers, retries and timeouts.
Bulkheads
Bulkheads isolate faults in one component from another, e.g. you may use separate thread pools for different workloads or use multiple regions/availability zones to isolate failures in a specific datacenter.
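As an illustration of the bulkhead idea, the following Go sketch gives each workload its own bounded pool (a buffered channel used as a semaphore) so that a slow or failing workload cannot exhaust capacity reserved for others; the pool sizes and workload names are hypothetical:

package main

import (
    "fmt"
    "sync"
    "time"
)

// bulkhead caps the number of concurrent requests for a single workload.
type bulkhead struct{ slots chan struct{} }

func newBulkhead(size int) *bulkhead {
    return &bulkhead{slots: make(chan struct{}, size)}
}

// run executes fn if a slot is free, otherwise rejects immediately so that
// one overloaded workload does not drag down the rest of the process.
func (b *bulkhead) run(fn func()) error {
    select {
    case b.slots <- struct{}{}:
        defer func() { <-b.slots }()
        fn()
        return nil
    default:
        return fmt.Errorf("bulkhead full, request rejected")
    }
}

func main() {
    payments := newBulkhead(2) // isolated capacity for the payments workload
    reports := newBulkhead(5)  // a separate pool for slow reporting queries

    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            if err := payments.run(func() { time.Sleep(100 * time.Millisecond) }); err != nil {
                fmt.Println("payment", i, err)
            }
        }(i)
    }
    _ = reports.run(func() {}) // reporting traffic draws from its own pool
    wg.Wait()
}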
Caching
Caching can be implemented at several layers to improve performance, such as a database cache, application cache, proxy/edge cache, pre-compute cache and client-side cache.
Circuit Breakers
The circuit breaker is defined as a state machine with three states: normal, checking and tripped. It can be used to detect persistent failures in a dependent service and trip its state to disable invocation of the service temporarily with some default behavior. It can be later changed to the checking state for detecting success, which changes its state to normal after a successful invocation of the dependent service.
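A minimal Go sketch of this three-state machine, using the state names from the description above (normal, tripped, checking); the failure threshold and cooldown are hypothetical, and a production implementation would also need locking for concurrent callers:

package main

import (
    "errors"
    "fmt"
    "time"
)

type state int

const (
    normal   state = iota // calls flow through to the dependency
    tripped               // calls are short-circuited with a default behavior
    checking              // a trial call probes whether the dependency recovered
)

type circuitBreaker struct {
    st        state
    failures  int
    threshold int
    openUntil time.Time
    cooldown  time.Duration
}

// call wraps an invocation of a dependent service with the breaker logic.
func (cb *circuitBreaker) call(fn func() error) error {
    if cb.st == tripped {
        if time.Now().Before(cb.openUntil) {
            return errors.New("circuit open: using default behavior")
        }
        cb.st = checking // cooldown elapsed: allow a single trial invocation
    }
    if err := fn(); err != nil {
        cb.failures++
        if cb.st == checking || cb.failures >= cb.threshold {
            cb.st = tripped
            cb.openUntil = time.Now().Add(cb.cooldown)
        }
        return err
    }
    cb.st, cb.failures = normal, 0 // success closes the circuit again
    return nil
}

func main() {
    cb := &circuitBreaker{threshold: 3, cooldown: 30 * time.Second}
    err := cb.call(func() error { return errors.New("dependency timeout") })
    fmt.Println(err, cb.st)
}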
CQRS / Event Sourcing
Command and Query Responsibility Segregation (CQRS) separates read and update operations in the database. It’s often implemented using event-sourcing that records changes in an append-only store for maintaining consistency and audit trails.
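As a minimal sketch of the event-sourcing side (the account/deposit example is hypothetical), changes are recorded as immutable events in an append-only log, and the current state is a projection rebuilt by replaying those events:

package main

import "fmt"

// Event is an immutable record appended to the store; it is never updated.
type Event struct {
    Type   string
    Amount int
}

// EventStore is an append-only log per aggregate (e.g. per account).
type EventStore struct{ events []Event }

func (s *EventStore) Append(e Event) { s.events = append(s.events, e) }

// Balance is the read-side projection rebuilt by replaying all events.
func (s *EventStore) Balance() int {
    balance := 0
    for _, e := range s.events {
        switch e.Type {
        case "Deposited":
            balance += e.Amount
        case "Withdrawn":
            balance -= e.Amount
        }
    }
    return balance
}

func main() {
    store := &EventStore{}
    store.Append(Event{Type: "Deposited", Amount: 100})
    store.Append(Event{Type: "Withdrawn", Amount: 30})
    fmt.Println("balance:", store.Balance()) // 70, with a full audit trail in events
}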
Default Values
Default values offer a simple way to fall back to limited or degraded behavior in case of a failure in a dependent configuration or control service.
Disaster Recovery
Disaster recovery (DR) enables business continuity in the event of a large-scale failure of data centers. Based on cost, availability and RTO/RPO constraints, you can deploy services to multiple regions for a hot site; replicate only data from one region to another while keeping servers on standby for a warm site; or use backup/restore for a cold site. It is essential to periodically test and verify these DR procedures and processes.
Distributed Saga
Maintaining data consistency in a distributed system where data is stored in multiple databases can be hard, and using two-phase commit may incur high complexity and performance overhead. You can use a distributed saga for implementing long-running transactions; it maintains the state of the transaction and applies compensating transactions in case of a failure.
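The following Go sketch illustrates the compensation idea behind a distributed saga (the step names are hypothetical): each step pairs a forward action with a compensating action, and when a later step fails, the previously completed steps are compensated in reverse order:

package main

import (
    "errors"
    "fmt"
)

// step pairs a forward action with its compensating action.
type step struct {
    name       string
    action     func() error
    compensate func()
}

// runSaga executes steps in order and compensates completed steps on failure.
func runSaga(steps []step) error {
    for i, s := range steps {
        if err := s.action(); err != nil {
            for j := i - 1; j >= 0; j-- {
                steps[j].compensate()
            }
            return fmt.Errorf("saga aborted at %s: %w", s.name, err)
        }
    }
    return nil
}

func main() {
    err := runSaga([]step{
        {"reserve-inventory", func() error { return nil }, func() { fmt.Println("release inventory") }},
        {"charge-payment", func() error { return errors.New("card declined") }, func() { fmt.Println("refund payment") }},
    })
    fmt.Println(err)
}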
Failing Fast
You can fail fast if the workload cannot serve the request due to unavailability of resources or dependent services. In some cases, you can queue requests, however it’s best to keep those queues short so that you are not spending resources to serve stale requests.
Function as a Service
Function as a service (FaaS) offers serverless computing to simplify managing physical resources. Cloud vendors offer platforms such as AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads. These functions can be easily scaled to handle load spikes, however you have to be careful that any services they depend on can also support the scaled-up workload. Each function should be designed with single responsibility, idempotency and shared-nothing principles so that it can be executed concurrently. Serverless applications generally use an event-based architecture for triggering functions, and as serverless functions are more granular, they incur more communication overhead. In addition, chaining functions within the code can result in tightly coupled applications; instead, use a state machine or a workflow to orchestrate the communication flow. There is also open source support for FaaS-based serverless computing, such as OpenFaas and OpenWhisk on top of Kubernetes or OpenShift, which prevents lock-in to a specific cloud provider.
Graceful Degradation
Instead of failing a request when dependent components are unhealthy, a service may use the circuit-breaker pattern to return a predefined or default response.
Health Checks
Health checks run a dummy or synthetic transaction that performs an action without affecting real data in order to verify a system component and its dependencies.
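For illustration, here is a minimal Go health-check handler (the dependency URL and timeout are hypothetical) that runs a cheap, read-only probe against a dependency and reports the result without touching real data:

package main

import (
    "context"
    "net/http"
    "time"
)

// healthHandler verifies a downstream dependency with a short, synthetic
// request that does not modify any real data.
func healthHandler(dependencyURL string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
        defer cancel()
        req, _ := http.NewRequestWithContext(ctx, http.MethodGet, dependencyURL, nil)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, "dependency unreachable", http.StatusServiceUnavailable)
            return
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            http.Error(w, "dependency unhealthy", http.StatusServiceUnavailable)
            return
        }
        w.Write([]byte("ok"))
    }
}

func main() {
    // Hypothetical dependency endpoint; a real check might also ping a database.
    http.Handle("/health", healthHandler("https://example.com/ping"))
    http.ListenAndServe(":8080", nil)
}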
Idempotency
Idempotent services complete an API request exactly once, so resending the same request due to retries has no side effects. Idempotent APIs typically use a client-generated identifier or token, and the idempotent service returns the same response if a duplicate request is received.
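A minimal Go sketch of the token-based approach (the handler and in-memory store are hypothetical; real services persist keys in a database with expiry): the server remembers the response keyed by the client-generated idempotency key, so a retried request returns the stored response instead of repeating the side effect:

package main

import (
    "fmt"
    "sync"
)

// idempotentHandler caches responses by client-generated idempotency key.
type idempotentHandler struct {
    mu   sync.Mutex
    seen map[string]string // key -> previously returned response
}

func (h *idempotentHandler) handle(key, payload string) string {
    h.mu.Lock()
    defer h.mu.Unlock()
    if resp, ok := h.seen[key]; ok {
        return resp // duplicate request: no side effect, same response
    }
    resp := fmt.Sprintf("processed %q", payload) // perform the side effect once
    h.seen[key] = resp
    return resp
}

func main() {
    h := &idempotentHandler{seen: map[string]string{}}
    fmt.Println(h.handle("req-123", "charge $10"))
    fmt.Println(h.handle("req-123", "charge $10")) // retried request is a no-op
}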
Layered Architecture
The layered architecture separates software into different concerns such as:
Presentation Layer
Business Logic Layer
Service Layer
Domain Model Layer
Data Access Layer
Load Balancer
A load balancer distributes traffic among a group of resources so that a single resource is not overloaded. Load balancers also monitor the health of servers, and you can set up a load balancer for each group of resources to ensure that requests are not routed to unhealthy or unavailable resources.
Load Shedding
Load shedding rejects work at the edge when the server exceeds its capacity, e.g. a server may return an HTTP 429 error to signal clients that they should retry at a slower rate.
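As an illustration, the following Go middleware sheds load at the edge (the in-flight limit is hypothetical): once the number of in-flight requests exceeds the server’s capacity, new requests are rejected immediately with HTTP 429 instead of being queued:

package main

import (
    "net/http"
    "sync/atomic"
)

// shedLoad rejects requests once in-flight work exceeds maxInFlight.
func shedLoad(maxInFlight int64, next http.Handler) http.Handler {
    var inFlight int64
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if atomic.AddInt64(&inFlight, 1) > maxInFlight {
            atomic.AddInt64(&inFlight, -1)
            http.Error(w, "server overloaded, retry later", http.StatusTooManyRequests)
            return
        }
        defer atomic.AddInt64(&inFlight, -1)
        next.ServeHTTP(w, r)
    })
}

func main() {
    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", shedLoad(100, handler))
}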
Loosely coupled dependencies
Using queuing systems, streaming systems, and workflows isolates the behavior of dependent components and increases resiliency through asynchronous communication.
MicroServices
Microservices evolved from service-oriented architecture (SOA) and support both point-to-point protocols such as REST/gRPC and asynchronous protocols based on a messaging/event bus. You can apply the bounded contexts of domain-driven design (DDD) to design loosely coupled services.
Model-View Controller
It decouples user interface from the data model and application functionality so that each component can be independently tested. Other variations of this pattern include model–view–presenter (MVP) and model–view–viewmodel (MVVM).
NoSQL
NoSQL databases provide support for high availability and variable or write-heavy workloads and can be easily scaled with additional hardware. NoSQL designs navigate the CAP and PACELC tradeoffs of consistency, availability, partition tolerance and latency. A number of cloud vendors provide managed NoSQL database solutions, however they can create latency issues if the services accessing these databases are not colocated.
No Single Point of Failure
In order to eliminate single points of failure and provide high availability and failover, you can deploy redundant services to multiple regions and availability zones.
Ports and Adapters
Ports and Adapters (Hexagonal architecture) separates interfaces (ports) from implementations (adapters). The business logic is encapsulated in the hexagon and is invoked through adapters when actors operate on the capabilities offered by the ports.
Rate Limiting and Throttling
Rate limiting defines the rate at which clients can access the services based on the license policy, while throttling can be used to restrict access in response to an unexpected increase in demand. For example, the server can return HTTP 429 to notify clients that they should back off or retry at a slower rate.
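For illustration, here is a minimal token-bucket rate limiter in Go (the rate and burst values are hypothetical); requests that find no token available can be rejected with HTTP 429 as described above:

package main

import (
    "fmt"
    "sync"
    "time"
)

// tokenBucket refills tokens at a fixed rate up to a maximum burst size.
type tokenBucket struct {
    mu       sync.Mutex
    tokens   float64
    burst    float64
    rate     float64 // tokens added per second
    lastFill time.Time
}

func newTokenBucket(rate, burst float64) *tokenBucket {
    return &tokenBucket{tokens: burst, burst: burst, rate: rate, lastFill: time.Now()}
}

// allow returns true if a request may proceed, consuming one token.
func (tb *tokenBucket) allow() bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()
    now := time.Now()
    tb.tokens += now.Sub(tb.lastFill).Seconds() * tb.rate
    if tb.tokens > tb.burst {
        tb.tokens = tb.burst
    }
    tb.lastFill = now
    if tb.tokens < 1 {
        return false // caller should respond with HTTP 429
    }
    tb.tokens--
    return true
}

func main() {
    limiter := newTokenBucket(5, 10) // 5 requests/second with a burst of 10
    for i := 0; i < 12; i++ {
        fmt.Println(i, limiter.allow())
    }
}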
Retries with Backoff and Jitter
A remote operation can be retried if it fails due to a transient failure or a server overload, however each retry should use a capped exponential backoff so that retries don’t add further load on the server. In a layered architecture, retries should be performed at a single layer to avoid multiplying them. Retries can be combined with circuit breakers and rate limiting to throttle requests. In some cases, requests may time out for clients but succeed on the server side, so APIs must be designed with idempotency to be safe to retry. In order to avoid synchronized retries, a small random jitter should be added to the backoff.
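A minimal Go sketch of retries with capped exponential backoff and jitter (the base delay, cap and attempt count are hypothetical):

package main

import (
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// retry invokes op up to maxAttempts times with capped exponential backoff
// plus random jitter, so synchronized clients do not retry in lockstep.
func retry(maxAttempts int, base, maxDelay time.Duration, op func() error) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        backoff := base << attempt // exponential growth: base, 2x, 4x, ...
        if backoff > maxDelay {
            backoff = maxDelay // cap the backoff
        }
        jitter := time.Duration(rand.Int63n(int64(backoff)))
        time.Sleep(backoff/2 + jitter/2) // spread retries across clients
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
    err := retry(4, 100*time.Millisecond, 2*time.Second, func() error {
        return errors.New("transient failure") // e.g. a timeout from a dependency
    })
    fmt.Println(err)
}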
Rollbacks
The software should be designed with rollbacks in mind so that all code, database schemas and configurations can be easily rolled back. A production environment might be running multiple versions of the same service, so care must be taken to design APIs that are both backward and forward compatible.
Stateless and Shared nothing
A shared-nothing architecture helps build stateless and loosely coupled services that can be easily scaled horizontally for high availability. This architecture allows recovering from isolated failures and supports auto-scaling by shrinking or expanding resources based on traffic patterns.
Startup dependencies
Upon startup, services may need to connect to certain configuration or bootstrap services, so care must be taken to avoid thundering-herd problems that can overwhelm those dependent services in the event of a wide region outage.
Timeouts
Timeouts help build resilient systems by bounding the invocation of external services and preventing the thundering-herd problem. Timeouts can also be used when retrying a failed operation after a transient failure or a server overload. A timeout can also add a small jitter to randomly spread the load on the server; jitter can likewise be applied to timers of scheduled jobs or delayed work.
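The following Go sketch shows a timeout with a small jitter around an external call using the standard context package (the endpoint URL, base timeout and jitter range are hypothetical):

package main

import (
    "context"
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

func main() {
    // Add a small random jitter to the base timeout so that many callers
    // do not expire and retry at exactly the same moment.
    jitter := time.Duration(rand.Intn(200)) * time.Millisecond
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second+jitter)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/health", nil)
    if err != nil {
        fmt.Println("bad request:", err)
        return
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        // A context deadline surfaces here; the caller can fall back or retry later.
        fmt.Println("call failed or timed out:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}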
Watchdogs and Alerts
A watchdog monitors a system component for specific signals such as latency, traffic, errors, saturation and SLOs. It then sends an alert based on the monitoring configuration that triggers an email, on-call paging or an escalation.
Virtualization and Containers
Virtualization abstracts computing resources using virtual machines or containers so that you don’t depend on a physical implementation. A virtual machine is a complete operating system running on top of a hypervisor, whereas a container is an isolated, lightweight environment for running applications. Virtualization allows building immutable infrastructure that is specially designed to meet application requirements and can be easily deployed on a variety of hardware resources.
Architecture Practices
Following are best practices for sustainable software delivery:
Automation
Automation builds pipelines for continuous integration, continuous testing and continuous delivery to improve the speed and agility of software delivery. Operational procedures for deployment and monitoring can be stored in version control and then applied automatically through CI/CD. In addition, automated procedures can track failures based on key performance indicators and trigger recovery or repair for erroneous components.
Automated Testing
Automated testing allows building software with a suite of unit, integration, functional, load and security tests that verify its behavior and ensure it can meet production demand. These automated tests run as part of CI/CD pipelines and stop deployment if any of the tests fail. In order to run end-to-end and load tests, the deployment scripts create a new environment and set up test data. These tests may replay synthetic transactions based on production traffic and benchmark the performance metrics.
Capacity Planning
Load testing, along with monitoring production traffic patterns, demand and workload utilization, helps forecast the resources needed for future growth. This can be further strengthened with a capacity model that calculates the unit price of resources and a growth forecast so that you can automate adding or removing resources based on demand.
Cloud Computing
Adopting cloud computing simplifies resource provisioning, and its elasticity allows organizations to grow or shrink those resources based on demand. You can also add automation to optimize utilization of the resources and reduce costs.
Continuous Delivery
Continuous delivery automates the production deployment of small and frequent changes by developers. Continuous delivery relies on continuous integration that runs automated tests, followed by automated deployment without any manual intervention. During the software development process, a developer picks a feature, works on changes and then commits the changes to source control after peer code review. The automated build system runs a pipeline to create a container image based on the commit and then deploys it to a test or QA environment. The test environment runs automated unit, integration and regression tests using test data in the database. The code is then promoted to the main branch, and the automated build system tags and builds the image from the head commit of the main branch, which is pushed to the container registry. The pre-prod environment pulls the image, restarts the pre-prod container and runs more comprehensive tests, including performance tests, with a larger set of test data in the database. You may need multiple stages of pre-prod deployment such as alpha, beta and gamma environments, where each environment may require deployment to a unique datacenter. After successful testing, the production systems are updated with the new image using rolling updates, blue/green deployments or canary deployments to minimize disruption to end users. The monitoring system watches error rates at each stage of the deployment and automatically rolls back changes if a problem occurs.
Deploy over Multiple Zones and Regions
In order to provide high availability, compliance and reduced latency, you can deploy to multiple availability zones and regions. Global load balancers can be used to route traffic based on geographic proximity to the closest region. This also helps implement business continuity, as applications can easily fail over to another region with minimal data loss.
Service Mesh
In order to easily build distributed systems, a number of platforms based on service-mesh pattern have emerged to abstract a common set of problems such as network communication, security, observability, etc:
Dapr – Distributed Application Runtime
The Distributed Application Runtime (Dapr) provides a variety of communication protocols, encryption, observability and secret management for building secured and resilient distributed services.
Envoy
Envoy is a service proxy for building cloud native applications with built-in support for networking protocols and observability.
Istio service mesh
Istio is built on top of Kubernetes and Envoy to provide a service mesh with built-in support for networking, traffic management, observability and security. A service mesh also addresses features such as A/B testing, canary deployments, rate limiting, access control, encryption, and end-to-end authentication.
Linkerd
Linkerd is a service mesh for Kubernetes that consists of a control plane and a data plane with built-in support for networking, observability and security. The control plane allows controlling services, and the data plane acts as a sidecar container that handles network traffic and communicates with the control plane for configuration.
WebAssembly
WebAssembly is a binary instruction format for a stack-based virtual machine that can run at the edge or in the cloud. A number of WebAssembly platforms, such as wasmCloud and Lunatic, have adopted the Actor model to build platforms for writing distributed applications.
Documentation
The architecture document defines goals and constraints of the software system and provides various perspectives such as use-cases, logical, data, processes, and physical deployment. It also includes non-functional or quality attributes such as performance, growth, scalability, etc. You can document these aspects using standards such as 4+1, C4, and ERD as well as document the broader enterprise architecture using methodologies like TOGAF, Zachman, and EA.
Incident management
Incident management defines the process of root-cause analysis and the actions an organization takes when an incident affects the production environment. It defines best practices such as clear ownership, reducing the time to detect and mitigate, blameless postmortems and prevention measures. The organization can then implement preventive measures and share lessons learned from all operational events and failures across teams. You can also use pre-mortems to identify potential areas that can be improved or mitigated. Another way to simulate potential problems is chaos engineering or setting up game days to test the workloads for various scenarios and outages.
Infrastructure as Code
Infrastructure as code uses a declarative language to define development, test and production environments, which is managed in source control. This provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Following is a list of frameworks for building infrastructure as code:
Azure Resource Manager
Azure offers Azure Resource Manager (ARM) templates based on a JSON format to declaratively define the infrastructure that you intend to deploy.
AWS Cloud Development Kit
The Cloud Development Kit (CDK) supports high-level programming languages to construct cloud resources on Amazon Web Services so that you can easily build cloud applications.
Hashicorp Terraform
Terraform uses HCL based configurations to describe computing resources that can be deployed to multiple cloud providers.
Monitoring
Monitoring measures key performance indicators (KPIs) and service-level objectives (SLOs) that are defined at the infrastructure, application, service and end-to-end levels. These include both business and technical metrics, such as error counts, hot spots and call graphs, which are visible to the entire team for monitoring trends and reacting quickly to failures.
Multi-tenancy
If your system is consumed by different groups or tenants of users, you will need to design your system and services to isolate data and computing resources in a secure and reliable fashion. Each layer of the system can be designed to treat the tenant context as a first-class construct, which is tied to the user identity. You can capture usage metrics per tenant to identify bottlenecks, estimate cost and analyze resource utilization for capacity planning and growth projections. Operational dashboards can also use these metrics to construct tenant-based operational views and proactively respond to unexpected load.
Security Review
In order to minimize the security risk, the development teams can adopt shift-left on security and DevSecOps practices to closely collaborate with the InfoSec team and integrate security review into every phase of the software development lifecycle.
Version Control Systems
Version control systems such as Git or Mercurial help track changes to code, configurations and scripts over time. You can adopt workflows such as gitflow or trunk-based development for the check-in process. Other common practices include smaller commits, testing code and running static analysis, linters or profiling tools before check-in.
Summary
Software complexity is a major reason for missed deadlines and slow or buggy software. This complexity can be essential complexity within the business domain, but it is often accidental complexity resulting from technical debt, poor architecture and development practices. Another source of accidental complexity comes from distributed computing, where concerns such as security, rate limiting and observability need to be handled consistently across the distributed system. For example, virtualization helps build immutable infrastructure and adopt infrastructure as code; functions as a service simplify building microservices; and platforms such as Istio and Linkerd remove a lot of cruft such as security, observability, traffic management and communication protocols when building distributed systems. The goal of a good architecture is to simplify building, testing, deploying and operating software. You need to continually improve the system’s architecture and its practices to build sustainable software delivery pipelines that can meet both current and future demands of users.
User identity and authentication are commonly implemented using usernames and passwords (U/P). Despite the wide adoption of password-based authentication, passwords are considered the weakest link in the chain of secure access. Weak passwords and password reuse are common attack vectors for cyber-crime. Most organizations enforce strong passwords, but this burdens users with following arbitrary rules for strong passwords and storing them securely. Nevertheless, these strong passwords are still vulnerable to account takeover via phishing or social engineering attacks. Managing passwords also imposes a high maintenance cost on service providers and help desks.
Simply put, passwords are broken. An alternative to password-based authentication is passwordless authentication, which verifies user credentials using biometrics and asymmetric cryptography instead of passwords. A number of new industry standards, such as WebAuthn, FIDO2, NIST AAL, and the Client to Authenticator Protocol (CTAP), are in development to implement passwordless authentication. Passwordless authentication can be further strengthened with multi-factor authentication (MFA) using one-time passwords (OTP) or SMS. Passwordless authentication eliminates the credential stuffing and phishing attacks that lead to data breaches [1].
Passwordless authentication can work with both centralized/federated systems and decentralized systems that may use decentralized identity. Decentralized identity uses decentralized identifiers (DIDs) developed by the Decentralized Identity Foundation (DIF) and the W3C Credentials Community Group. This allows users to completely control their identity and data privacy so that they can choose which data is shared with other applications and services. The following diagram shows the progression of authentication from password-based to decentralized identity:
Drawbacks of Password based Authentication
Following are major drawbacks of password based authentication:
Inconsistent Password Rules
There are no industry-wide standards for strong passwords, and most organizations use internal rules for defining passwords, such as password length and the use of special or alphanumeric characters. In general, longer passwords with a mix of characters provide higher entropy and are more secure [NIST], but this forces users to follow arbitrary rules.
Too many passwords
An average user may need to create accounts on hundreds of websites, which burdens users with managing passwords for all those accounts.
Sharing same password
Due to the large number of accounts, users may share the same password across multiple websites. This poses a high security risk in the event of a breach, as reported in the DBIR, and it is recommended not to reuse the same password across multiple websites.
Changing password
A user may need to change a password after a security breach or on a regular basis, which adds unnecessary work for users.
Unsafe storage of passwords by service providers
Many organizations use internal standards for storing passwords, such as storing passwords in plaintext or unsalted, or using weak hashing algorithms instead of standardized algorithms such as Argon2. Such weak security measures lead to credential stuffing, where stolen credentials are used to access other private data.
Costs of IT Support
Password-reset and forgot-password requests are among the major help-desk costs and can consume up to 50% of the IT budget.
Mitigating Password Management
Following are a few remedies that can mitigate password management:
Password managers
As strong passwords or long passphrases are difficult to memorize, users often write them down, where they can be easily stolen. It is recommended to use password managers for storing passwords safely, but they often come with their own set of security risks such as unsafe storage in memory, on local disk or in the cloud [2, 3].
OAuth/OpenID standards
The OAuth/OpenID Connect/SAML standards can be used to implement single sign-on or social login, which allows users to access a website or an application using the same credentials. However, this still poses a security risk if the account credentials are revealed, and it weakens data privacy because users can be traced more easily when the same account is used across multiple websites.
Multi-Factor Authentication
Password authentication can use multi-factor authentication (MFA) to harden security. It is recommended not to use SMS or voice for MFA due to common attacks via SIM swapping, phishing, social engineering or loss of device. Instead, one-time passwords (OTPs) provide a better alternative and can be generated using a symmetric key with the HOTP/TOTP schemes.
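For illustration, here is a minimal Go sketch of a time-based one-time password generator in the spirit of HOTP/TOTP (HMAC over a time-derived counter followed by dynamic truncation); the shared secret, 30-second step and 6-digit length are hypothetical, and a real deployment should follow RFC 4226/6238 and use a vetted library:

package main

import (
    "crypto/hmac"
    "crypto/sha1"
    "encoding/binary"
    "fmt"
    "time"
)

// hotp computes an HMAC-SHA1 based one-time password for a counter value.
func hotp(secret []byte, counter uint64, digits uint32) uint32 {
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], counter)
    mac := hmac.New(sha1.New, secret)
    mac.Write(buf[:])
    sum := mac.Sum(nil)
    // Dynamic truncation: take 4 bytes at the offset given by the low nibble
    // of the last byte, mask the sign bit, then reduce to the desired digits.
    offset := sum[len(sum)-1] & 0x0f
    code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff
    mod := uint32(1)
    for i := uint32(0); i < digits; i++ {
        mod *= 10
    }
    return code % mod
}

// totp derives the counter from the current time using a fixed step.
func totp(secret []byte, step time.Duration, digits uint32) uint32 {
    counter := uint64(time.Now().Unix()) / uint64(step.Seconds())
    return hotp(secret, counter, digits)
}

func main() {
    secret := []byte("hypothetical-shared-secret") // provisioned out of band
    fmt.Printf("%06d\n", totp(secret, 30*time.Second, 6))
}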
Long-term refresh tokens
Instead of frequent password prompts, some applications use long-term refresh tokens or cookies to extend user sessions. However, it’s recommended to use such a mechanism with other multi-factor authentication or PIN/passcodes.
Passwordless Authentication
Passwordless authentication addresses the shortcomings of passwords by diminishing the security risks associated with them. In addition, it provides a better user experience, convenience and user productivity by eliminating the friction of password management. Passwordless authentication also reduces the IT overhead related to password resets and allows the use of biometrics on smartphones instead of hard keys or smart cards.
Passwordless Standards
The following section describes industry standards for implementing passwordless authentication:
Asymmetric password-authenticated key exchange (PAKE) protocols such as Secure Remote Password (SRP) allow authentication without exchanging passwords. However, SRP is an older design with known weaknesses, and more recent standards such as CPace and OPAQUE provide better alternatives. OPAQUE uses an oblivious pseudorandom function (OPRF), where a random output can be generated without the server learning the input.
Fast IDentity Online 2 (FIDO2)
FIDO2 is an open standard that uses asymmetric keys and physical devices to authenticate users. The FIDO2 specification defines the Client to Authenticator Protocol (CTAP) and Web Authentication (WebAuthn) protocols for communication and authentication. WebAuthn can work with roaming physical devices such as a YubiKey as well as platform devices that use built-in biometrics such as TouchID for authentication.
Decentralized Identity and Access Management
Decentralized or self-sovereign identity (SSI) allows users to own and control their identity. This provides passwordless authentication as well as decentralized authorization to control access to private data. Users use a local digital wallet to store credentials and authenticate their identity instead of relying on external identity providers.
Comparison
Following diagrams compare architecture of centralized, federated and decentralized authentication:
Centralized
In a centralized model, the user directly interacts with the website or application, e.g.
Federated
In a federated model, the user interacts with one or more identity providers using Security Assertion Markup Language (SAML), OAuth, or OpenID Connect (social login), e.g.
Decentralized
In a decentralized model, user establishes peer-to-peer relationships with the authenticating party, e.g.,
The issuers in decentralized identity can be organizations, universities or government agencies that generate and sign credentials. The holders are the subjects of authentication that request verifiable credentials from issuers and then hold them in a digital wallet. The verifiers are independent entities that verify the claims and identity in verifiable credentials using the issuer’s digital signature.
Benefits
Following is a list of major benefits of decentralized or self-sovereign identity:
Reduced costs
User and access management, such as account creation, login, logout and password reset, incurs a major cost for service providers. Decentralized identity shifts this work to the client side, where users can use a digital wallet to manage their identity and authentication.
Reduced liability
Storing private user data in the cloud is a liability, and organizations often lack proper security and access control to protect it. With decentralized identity, users control what data is shared for each use case, which eliminates the liability associated with centralized storage of private data.
Authorization
Decentralized identity can provide claims for authorization decisions using attribute-based access control or role-based access control. These claims can be cryptographically verified in real-time using zero-knowledge proofs (ZKP) to protect user data and privacy.
Sybil Attack
Decentralized identity makes it easy to generate new identities, which opens the door to Sybil or spam attacks where a user creates multiple identifiers to mimic multiple users. Service providers can use ID verification, referrals, probationary periods or reputation systems to prevent such attacks.
Loyalty and rewards programs
As decentralized identity empowers users to control the data, they can choose to participate in various loyalty and reward programs by sharing specific data with other service providers.
Data privacy and security
Decentralized identity provides better privacy and security required by regulations such as Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), etc. This also prevents identity theft and related cyber-crimes because user data is not stored in a cloud or centralized storage.
Key Building Blocks of Self-Sovereign Identity
The above diagram shows the major building blocks of decentralized or self-sovereign identity, as described below:
Verifiable Credentials
A verifiable credential (VC) stores credentials and claims for a subject that can be verified using public key infrastructure. A VC may be stored using Blockcerts, W3C VC or JSON-LD data models, and it may carry additional claims or attributes that support data privacy using zero-knowledge proofs.
Decentralized Identifiers (DID)
The W3C Working Group defines a decentralized identifier (DID) as a globally unique identifier, similar to a URL, that uses peer-to-peer communication and decentralized public key infrastructure (DPKI).
The DID resolves to a DID document that contains metadata about the DID subject, e.g. a DID of the form did:plexobject:<unique-id>, where plexobject is the DID method; the DID standards allow custom methods as defined in DID registries.
Public Keys/DID Registries
The public key/DID registries allow looking up public keys or decentralized identifiers (DIDs). A DID registry uses the DID method and protocol to interact with the remote client. Due to support for decentralized identities on blockchains, several implementations of DID registries use blockchains such as uPort, Sovrin, Veres, etc. If a decentralized identifier doesn’t need to be registered, other standards such as the Peer DID Method Specification and Key Event Receipt Infrastructure can be used to interact directly with the peer device using the DIDComm messaging protocol.
Digital Wallet
The keystore or digital wallet stores private keys, decentralized identifiers (DIDs), verifiable credentials, passwords, etc. The digital wallet uses a digital agent to provide access to internal data or generate new identities. In order to protect wallet data and recovery keys in case of device loss or data corruption, it can create an encrypted backup in the cloud or in cold storage. Alternatively, a digital wallet may use social recovery by sharding the recovery key into multiple fragments using Shamir’s Secret Sharing, which are then shared with trusted contacts.
Hardware security module
Hardware security module (HSM) is a physical computing device for storing digital keys and credentials cryptographically. It provides a security enclave for secure execution and preventing unauthorized access to private data.
Multiple Devices
As decentralized applications rely on peer-to-peer communication, digital wallets can use the Decentralized Key Management System (DKMS), Key Event Receipt Infrastructure (KERI) and DIDComm messaging protocols to support multiple devices.
Push Notification
Decentralized identity may use push notifications to verify identity or to receive alerts when a user needs to permit access.
Summary
Passwords are the bane of the online experience and incur a high maintenance cost for service providers. Alternative user authentication mechanisms using passwordless approaches, biometrics and asymmetric key exchange provide a better user experience and stronger security. Decentralized identity further improves interoperability and data privacy, where users can control who can access their private data and earn rewards for sharing it. As modern mobile devices already have built-in support for HSM-based security enclaves, biometrics and push notifications, they provide a simple way to implement digital wallets and decentralized identity. This reduces security breaches and identity theft, eases the burden of data protection regulations, and lessens the need for complex authentication protocols such as SAML/OIDC, because data is cryptographically protected on local storage. Though open standards for passwordless authentication and decentralized identity are still in development, they show a bright future for the next generation of applications that provide better data security, privacy, protection, end-to-end encryption, portability, instant payments and user experience.
The term Web3 was coined by Ethereum co-founder Gavin Wood to distinguish it from the Web2 architecture. The Web2 architecture popularized interactive and interoperable applications, whereas the Web3 architecture adds decentralization based on blockchains for protecting user data and privacy. I came across a number of recent posts [1, 2, 3, 4] that question the claims of the Web3 architecture, including a latest blog from Moxie. As a co-author of the Signal protocol with a strong cryptography background, Moxie makes the following observations:
People don’t want to run their own servers, and never will.
A protocol moves much more slowly than a platform.
NFT marketplaces look similar to centralized websites and databases instead of blockchain-based distributed state.
Blockchains lack a client/server interface for interacting with the data in a trustless fashion.
The dApp clients often communicate with centralized services such as Infura or Alchemy and lack client-side validation.
Moxie identifies the following security gaps in NFT use-cases:
The blockchain does not store the data for an NFT; instead it only points to a URL, and it lacks a hash commitment for the data located at that URL.
Crypto wallets such as MetaMask connect to central services such as Etherscan and OpenSea without client-side validation.
Moxie claims that innovations at centralized services such as OpenSea are outpacing standards such as ERC-721, similar to how GMail outpaced email standards. He claims that decentralization has little practical or pressing importance to the majority of people, who just want to make money. In the end, he concludes:
We should accept the premise that people will not run their own servers by designing systems that can distribute trust without having to distribute infrastructure.
We should try to reduce the burden of building software because distributed systems add considerable complexity.
Moxie raises valid concerns regarding the need for thin clients that can validate blockchain data with trustless security, as well as the use of message authentication codes and metadata when creating or accessing NFTs. However, he conflates Bitcoin, Ethereum, and blockchains in general, and there are existing standards such as Simplified Payment Verification (SPV) that allow users to connect to the Bitcoin blockchain without running a full node. Due to differences in the Ethereum architecture, thin clients are harder to operate on its platform, but there are a number of alternatives such as Light Clients, Tendermint Light, Argent Wallet and Pockt Network. NFT data can be stored on Fleek or on decentralized storage solutions such as IPFS, Skynet, Sia, Celestia, Arweave and Filecoin. Some blockchains such as Tezos and Avax also offer layer-1 solutions for data storage. Despite the popularity of OpenSea with its centralized access, a number of alternatives are being worked on such as OpenDao and NFTX. Further, NFT standards such as ERC-2981 and ERC-1155 are in the works to support royalties and other enhancements.
When reviewing the Signal app, I observed that Moxie advocated centralized cryptography and actively tried to stop other open source projects such as LibreSignal from using Signal’s servers. Moxie also declined collaboration with open messaging specifications such as Matrix. Centralized cryptography is an oxymoron, as truly secure systems don’t require trust. Centralized messaging platforms such as WhatsApp impose forwarding limits, which require encryption of multiple messages using the same key. Despite the open source development of the Signal app, there is no guarantee that it runs the same code on its infrastructure (e.g. it stopped updating the open source project for a year) or that it doesn’t log metadata when delivering messages.
Moxie’s experience with centralized services within the blockchain ecosystem highlights a crucial area of work, and a number of solutions are being worked on, admittedly with slow progress. However, his judgment is clouded by his prior positions and beliefs, as he prefers centralized cryptography and systems over decentralized and federated ones. In my previous blog, I discussed the merits of decentralized messaging apps, which provide better support for data privacy, censorship resistance, anonymity, deniability and confidentiality with trustless security. This is essential for building Web3 infrastructure, and more work is needed to simplify the development and adoption of decentralized computing and trustless security.