Shahzad Bhatti Welcome to my ramblings and rants!

January 1, 2023

Consumer-driven and Producer-generated Contract Testing for REST APIs

Filed under: REST,Testing,Web Services — admin @ 9:43 pm

Though the REST standard for remote APIs is fairly loose, you can document an API's shape and structure using standards such as OpenAPI/Swagger specifications. A documented API specification ensures that both the consumer/client and the producer/server abide by the specification and prevents unexpected behavior. The API provider may also define service-level objectives (SLOs) so that the API meets specified latency, security, availability and other service-level indicators (SLIs). The API provider can use contract tests to validate API interactions based on the documented specification. Contract testing involves both the consumer and the producer, where a consumer makes an API request and the producer produces the result. The contract tests ensure that both consumer requests and producer responses match the request and response definitions in the API specification. These contract tests don't just validate the API schema; they validate the interactions between consumer and producer, so they can also detect breaking or backward-incompatible changes and let consumers keep using the APIs without any surprises.

In order to demonstrate contract testing, we will use the api-mock-service library to generate mock/stub client requests and server responses based on OpenAPI specifications or customized test contracts. These test contracts can be used by both consumers and producers for validating API contracts and can evolve as the API specifications are updated.

Sample REST API Under Test

A sample eCommerce application will be used to demonstrate contract testing. The application uses various REST APIs to implement an online shopping experience. The primary purpose of this example is to show how different request structures can be passed to the REST APIs to generate either a valid result or an error condition for contract testing. You can view the OpenAPI specifications for this sample app here.

Customer REST APIs

The customer APIs define operations to manage customers who shop online, e.g.:

Customer APIs

Product REST APIs

The product APIs define operations to manage products that can be shopped online, e.g.:

Product APIs

Payment REST APIs

The payment APIs define operations to charge credit card and pay for online shopping, e.g.:

Payment APIs

Order REST APIs

The order APIs define operations to purchase a product from the online store; they use the above APIs to validate customers, check product inventory, charge payments and then store a record of the order, e.g.:

Order APIs

Generating Stub Server Responses based on Open-API Specifications

In this example, stub server responses will be generated by api-mock-service based on the OpenAPI specification ecommerce-api.json. First, start the mock service as follows:

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8000 -p 9000:9000 -e HTTP_PORT=8000 -e PROXY_PORT=9000 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets api-mock-service

And then upload the OpenAPI specification ecommerce-api.json:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

This will generate test contracts with stub/mock responses for all APIs defined in the ecommerce-api.json OpenAPI specification. For example, you can fetch a result from the customers REST API:

curl http://localhost:8000/customers

to produce:

[
  {
    "address": {
      "city": "PpCJyfKUomUOdhtxr",
      "countryCode": "US",
      "id": "ede97f59-2ef2-48e5-913f-4bce0f152603",
      "streetAddress": "Se somnis cibo oculi, die flammam petimus?",
      "zipCode": "06826"
    },
    "creditCard": {
      "balance": {
        "amount": 53965,
        "currency": "CAD"
      },
      "cardNumber": "7345-4444-5461",
      "customerId": "WB97W4L2VQRRkH5L0OAZGk0MT957r7Z",
      "expiration": "25/0000",
      "id": "ae906a78-0aff-4d4e-ad80-b77877f0226c",
      "type": "VISA"
    },
    "email": "abigail.appetitum@dicant.net",
    "firstName": "sciam",
    "id": "21c82838-507a-4745-bc1b-40e6e476a1fb",
    "lastName": "inquit",
    "phone": "1-717-5555-3010"
  },
...  

The above response is randomly generated based on the types, formats, regex patterns and min/max limits of the properties defined in the OpenAPI specification, and calling this API will automatically generate both valid and error responses, e.g. calling “curl http://localhost:8000/customers” again will return:

* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type:
< Vary: Origin
< X-Mock-Path: /customers
< X-Mock-Request-Count: 9
< X-Mock-Scenario: getCustomerByEmail-customers-500-8a93b6c60c492e730ea149d5d09e79d85701c01dbc017d178557ed1d2c1bad3d
< Date: Sun, 01 Jan 2023 20:41:17 GMT
< Content-Length: 67
<
* Connection #0 to host localhost left intact
{"logRef":"achieve_output_fresh","message":"buffalo_rescue_street"}

Consumer-driven Contract Testing

Upon uploading the OpenAPI specifications of the microservices, api-mock-service generates test contracts for each REST API and response status. You can then customize these test cases for consumer-driven contract testing.
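
The generated contracts are stored as test scenarios on the mock server, and you can list them (along with the URL path for fetching each contract definition) using the _scenarios endpoint that is also used in later sections of this article, e.g.:

curl -v http://localhost:8000/_scenarios | jq '.' | grep getCustomer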

For example, here is the default test contract generated for finding a customer by id with path “/customers/:id”:

method: GET
name: getCustomer-customers-200-61a298e
path: /customers/:id
description: ""
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{}'
    path_params:
        id: \w+
    query_params: {}
    headers: {}
response:
    headers: 
      Content-Type:
        - application/json
    contents: '{"address":{"city":"{{RandStringMinMax 2 60}}","countryCode":"{{EnumString `US CA`}}","id":"{{UUID}}","streetAddress":"{{RandRegex `\\w+`}}","zipCode":"{{RandRegex `\\d{5}`}}"},"creditCard":{"balance":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandRegex `(USD|CAD|EUR|AUD)`}}"},"cardNumber":"{{RandRegex `\\d{4}-\\d{4}-\\d{4}`}}","customerId":"{{RandStringMinMax 30 36}}","expiration":"{{RandRegex `\\d{2}/\\d{4}`}}","id":"{{UUID}}","type":"{{EnumString `VISA MASTERCARD AMEX`}}"},"email":"{{RandRegex `.+@.+\\..+`}}","firstName":"{{RandRegex `\\w`}}","id":"{{UUID}}","lastName":"{{RandRegex `\\w`}}","phone":"{{RandRegex `1-\\d{3}-\\d{4}-\\d{4}`}}"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

The above template describes the interaction between consumer and producer by defining properties such as the following (see the sketch after this list for an example of customizing the matching criteria):

  • method – of the REST API such as GET/POST/PUT/DELETE
  • name – of the test case
  • path – of the REST API
  • description – of the test
  • predicate – defines a condition which must be true to select this test contract
  • request section defines input properties for the REST API including:
    • match_query_params – to match query input parameters for selecting the test contract
    • match_headers – to match input headers for selecting the test contract
    • match_contents – defines regex patterns for matching the input body
    • path_params – defines path variables and their regex patterns
    • query_params and headers – define sample input parameters and headers
  • response section defines output properties for the REST API including:
    • headers – defines response headers
    • contents – defines the body of the response
    • contents_file – allows loading the response from a file
    • status_code – defines the HTTP response status
  • wait_before_reply – defines the wait time before returning the response
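
For example, a consumer could narrow when the above contract is selected by tightening the matching criteria and adding a predicate. The following is a hypothetical sketch that reuses the properties and template functions shown elsewhere in this article (response section omitted for brevity):

method: GET
name: getCustomer-customers-200-custom
path: /customers/:id
predicate: '{{NthRequest 2}}'
request:
    match_query_params:
        verbose: "true"
    match_headers:
        Content-Type: application/json
    path_params:
        id: \w+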

You can then invoke the test contract using:

curl http://localhost:8000/customers/1

which generates a test response from the mock/stub server provided by the api-mock-service library, e.g.:

{
  "address": {
    "city": "PanHQyfbHZVw",
    "countryCode": "US",
    "id": "ff5d0e98-daa5-49c8-bb79-f2d7274f2fb1",
    "streetAddress": "Sumus o proferens etiamne intuerer fugasti, nuntiantibus da?",
    "zipCode": "01364"
  },
  "creditCard": {
    "balance": {
      "amount": 80704,
      "currency": "USD"
    },
    "cardNumber": "3226-6666-2214",
    "customerId": "0VNf07XNWkLiIBhfmfCnrE1weTlkhmxn",
    "expiration": "24/5555",
    "id": "f9549ef3-a5eb-4df4-a8a9-85a30a6a49c6",
    "type": "VISA"
  },
  "email": "amanda.doleat@fructu.com",
  "firstName": "quaero",
  "id": "9aeee733-932d-4244-a6f8-f21d2883fd27",
  "lastName": "habeat",
  "phone": "1-052-5555-4733"
}

You can customize the above response contents using the built-in template functions in the api-mock-service library or create additional test contracts for each distinct input parameter. For example, the following contract defines the interaction between consumer and producer for adding a new customer:

method: POST
name: saveCustomer-customers-200-ddfceb2
path: /customers
description: ""
order: 0
group: Sample Ecommerce API
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5})","creditCard.balance.amount":"(__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10})))","creditCard.balance.currency":"(__string__(USD|CAD|EUR|AUD))","creditCard.cardNumber":"(__string__\\d{4}-\\d{4}-\\d{4})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}/\\d{4})","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w)","lastName":"(__string__\\w)","phone":"(__string__1-\\d{3}-\\d{4}-\\d{4})"}'
    path_params: {}
    query_params: {}
    headers:
        ContentsType: application/json
    contents: '{"address":{"city":"__string__\\w+","countryCode":"__string__(US|CA)","streetAddress":"__string__\\w+","zipCode":"__string__\\d{5}"},"creditCard":{"balance":{"amount":"__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10}))","currency":"__string__(USD|CAD|EUR|AUD)"},"cardNumber":"__string__\\d{4}-\\d{4}-\\d{4}","customerId":"__string__\\w+","expiration":"__string__\\d{2}/\\d{4}","type":"__string__(VISA|MASTERCARD|AMEX)"},"email":"__string__.+@.+\\..+","firstName":"__string__\\w","lastName":"__string__\\w","phone":"__string__1-\\d{3}-\\d{4}-\\d{4}"}'
    example_contents: |
        address:
            city: Ab fabrorum meminerim conterritus nota falsissime deum?
            countryCode: CA
            streetAddress: Mei nisi dum, ab amaremus antris?
            zipCode: "00128"
        creditCard:
            balance:
                amount: 3000.4861560368768
                currency: USD
            cardNumber: 7740-7777-6114
            customerId: Fudi eodem sed habitaret agam pro si?
            expiration: 85/2222
            type: AMEX
        email: larry.neglecta@audio.edu
        firstName: fatemur
        lastName: gaudeant
        phone: 1-543-8888-2641
response:
    headers: 
      Content-Type: 
        - application/json
    contents: '{"address":{"city":"{{RandStringMinMax 2 60}}","countryCode":"{{EnumString `US CA`}}","id":"{{UUID}}","streetAddress":"{{RandRegex `\\w+`}}","zipCode":"{{RandRegex `\\d{5}`}}"},"creditCard":{"balance":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandRegex `(USD|CAD|EUR|AUD)`}}"},"cardNumber":"{{RandRegex `\\d{4}-\\d{4}-\\d{4}`}}","customerId":"{{RandStringMinMax 30 36}}","expiration":"{{RandRegex `\\d{2}/\\d{4}`}}","id":"{{UUID}}","type":"{{EnumString `VISA MASTERCARD AMEX`}}"},"email":"{{RandRegex `.+@.+\\..+`}}","firstName":"{{RandRegex `\\w`}}","id":"{{UUID}}","lastName":"{{RandRegex `\\w`}}","phone":"{{RandRegex `1-\\d{3}-\\d{4}-\\d{4}`}}"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

The above template defines the interaction for adding a new customer, where the request section defines the format of the request and the matching criteria using the match_contents property. The response section includes the headers and contents that are generated by the stub/mock server for consumer-driven contract testing. You can then invoke the test contract using:

curl -X POST http://localhost:8000/customers -d '{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'

This will return a response such as:

{
  "address": {
    "city": "j77oUSSoB5lJCUtc4scxtm0vhilPRdLE7Nc8KzAunBa87OrMerCZI",
    "countryCode": "CA",
    "id": "9bb21030-29d0-44be-8f5a-25855e38c164",
    "streetAddress": "Qui superbam imago cernimus, sensarum nuntii tot da?",
    "zipCode": "08020"
  },
  "creditCard": {
    "balance": {
      "amount": 75666,
      "currency": "AUD"
    },
    "cardNumber": "1383-8888-5013",
    "customerId": "nNaUd15lf6lqkAEwKoguVTvBnPMBVDhdeO",
    "expiration": "73/5555",
    "id": "554efad7-17ab-49f9-967a-3e47381a4d34",
    "type": "AMEX"
  },
  "email": "deborah.vivit@desivero.gov",
  "firstName": "contexo",
  "id": "db70b737-ee1d-48ed-83da-c5a8773c7a5f",
  "lastName": "delectat",
  "phone": "1-013-7777-0054"
}

Note: The response will not match the request body, as contract testing only tests interactions between consumer and producer without maintaining any server-side state. You can use other types of testing such as integration, component or functional testing to validate state-based behavior.

Producer-driven Generated Tests

The process of defining contracts to generate tests for validating producer REST APIs is similar to that for consumer-driven contracts: you upload OpenAPI specifications or user-defined contracts to the mock/stub server provided by api-mock-service.

For example, you can upload the OpenAPI specification ecommerce-api.json as follows:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

Upon uploading the specifications, the mock server will generate contracts for each REST API and response status. You can customize those contracts with additional validations or assertions and then invoke the generated tests either for a single REST API or for multiple REST APIs belonging to a specific group. You can also define an order for executing tests within a group and optionally pass data from one invocation to the next.

For testing purposes, we will customize the customer REST APIs for adding a new customer and fetching a customer by its id, i.e.:

A contract for adding a new customer

method: POST
name: save-customer
path: /customers
group: customers
order: 0
request:
    headers:
        Content-Type: application/json
    contents: |
        address:
            city: {{RandCity}}
            countryCode: {{EnumString `US CA`}}
            id: {{UUID}}
            streetAddress: {{RandSentence 2 3}}
            zipCode: {{RandRegex `\d{5}`}}
        creditCard:
            balance:
                amount: {{RandNumMinMax 20 500}}
                currency: {{EnumString `USD CAD`}}
            cardNumber: {{RandRegex `\d{4}-\d{4}-\d{4}`}}
            customerId: {{UUID}}
            expiration: {{RandRegex `\d{2}/\d{4}`}}
            id: {{UUID}}
            type: {{EnumString `VISA MASTERCARD`}}
        email: {{RandEmail}}
        firstName: {{RandName}}
        id: {{UUID}}
        lastName: {{RandName}}
        phone: {{RandRegex `1-\d{3}-\d{3}-\d{4}`}}
response:
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.id":"(__string__\\w+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5}.?\\d{0,4})","creditCard.balance.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__[\\d-]{10,20})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}.\\d{4})","creditCard.id":"(__string__\\w+)","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w+)","id":"(__string__\\w+)","lastName":"(__string__\\w+)","phone":"(__string__[\\-\\w\\d]{9,15})"}'
    pipe_properties:
      - id
      - email
    assertions:
      - VariableContains contents.email @
      - VariableContains contents.creditCard.type A
      - VariableContains headers.Content-Type application/json
      - VariableEQ status 200

The request section defines a contents property that builds the input request sent to the producer's REST API. The response section defines match_contents to match each response property against a regex, and assertions to compare the response contents, headers and status against the expected output.
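
For example, you could tighten the contract above by appending further assertions in the response section using the same assertion functions demonstrated in this article; the specific values below are hypothetical:

    assertions:
      - VariableGE contents.creditCard.balance.amount 20
      - VariableContains contents.phone 1-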

A contract for finding an existing customer

method: GET
name: get-customer
path: /customers/{{.id}}
description: ""
order: 1
group: customers
predicate: ""
request:
    path_params:
        id: \w+
    query_params: {}
    headers:
      Content-Type: application/json
    contents: ""
    example_contents: ""
response:
    headers: {}
    match_headers:
      Content-Type: application/json    
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5})","creditCard.balance.amount":"(__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10})))","creditCard.balance.currency":"(__string__(USD|CAD|EUR|AUD))","creditCard.cardNumber":"(__string__\\d{4}-\\d{4}-\\d{4})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}/\\d{4})","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w)","lastName":"(__string__\\w)","phone":"(__string__1-\\d{3}-\\d{3}-\\d{4})"}'
    pipe_properties:
      - id
      - email
    assertions:
      - VariableContains contents.email @
      - VariableContains contents.creditCard.type A
      - VariableContains headers.Content-Type application/json
      - VariableEQ status 200

The above template defines similar properties for generating the request body, along with match_contents and assertions to match the expected output headers, body and status. Based on the order of the tests, the generated test for adding a new customer will execute first, followed by the test for finding a customer by id. As we are testing against real REST APIs, the path for finding a customer is defined as “/customers/{{.id}}” so that the id is populated from the output of the first test based on pipe_properties.

Uploading Contracts

Once you have the api-mock-service mock server running, you can upload contracts using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_customer.yaml \
	http://localhost:8000/_scenarios
curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_customer.yaml \
	http://localhost:8000/_scenarios

Start your service before invoking the generated tests; e.g., we will use sample-openapi for testing purposes and then invoke the generated tests using:

curl -X POST http://localhost:8000/_contracts/customers -d \
	'{"base_url": "http://localhost:8080", "execution_times": 5, "verbose": true}'

The above command will execute all tests in the customers group, invoking each REST API 5 times. After executing the APIs, it will generate a result such as:

{
  "results": {
    "get-customer_0": {
      "email": "anna.intra@amicum.edu",
      "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"
    },
    "get-customer_1": {
      "email": "aaron.sequi@laetus.gov",
      "id": "c5128ac0-865c-4d91-bb0a-23940ac8a7cb"
    },
    "get-customer_2": {
      "email": "edward.infligi@evellere.com",
      "id": "a485739f-01d4-442e-9ddc-c2656ba48c63"
    },
    "get-customer_3": {
      "email": "gary.volebant@istae.com",
      "id": "ef0eacd0-75cc-484f-b9a4-7aebfe51d199"
    },
    "get-customer_4": {
      "email": "alexis.dicant@displiceo.net",
      "id": "da65b914-c34e-453b-8ee9-7f0df598ac13"
    },
    "save-customer_0": {
      "email": "anna.intra@amicum.edu",
      "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"
    },
    "save-customer_1": {
      "email": "aaron.sequi@laetus.gov",
      "id": "c5128ac0-865c-4d91-bb0a-23940ac8a7cb"
    },
    "save-customer_2": {
      "email": "edward.infligi@evellere.com",
      "id": "a485739f-01d4-442e-9ddc-c2656ba48c63"
    },
    "save-customer_3": {
      "email": "gary.volebant@istae.com",
      "id": "ef0eacd0-75cc-484f-b9a4-7aebfe51d199"
    },
    "save-customer_4": {
      "email": "alexis.dicant@displiceo.net",
      "id": "da65b914-c34e-453b-8ee9-7f0df598ac13"
    }
  },
  "errors": {},
  "metrics": {
    "getcustomer_counts": 5,
    "getcustomer_duration_seconds": 0.006,
    "savecustomer_counts": 5,
    "savecustomer_duration_seconds": 0.006
  },
  "succeeded": 10,
  "failed": 0
}

Though generated tests are executed against real services, it's recommended that the service implementation use test doubles or mock services for any dependent services, as contract testing is not meant to replace component or end-to-end tests, which provide better support for integration testing.

Recording Consumer/Producer interactions for Generating Stub Requests and Responses

Contract testing does not have to depend on API specifications such as OpenAPI/Swagger; instead, you can record interactions between consumers and producers using the api-mock-service tool.

For example, if you have an existing REST API or a legacy service such as the above sample API, you can record an interaction as follows:

export http_proxy="http://localhost:9000"
export https_proxy="http://localhost:9000"
curl -X POST -H "Content-Type: application/json" http://localhost:8080/customers -d \
	'{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'

This will invoke the remote REST API, record the contract interaction and then return the server response:

{
  "id": "95d655e1-405e-4087-8a7d-56791eaf51cc",
  "firstName": "quendam",
  "lastName": "formaeque",
  "email": "andrew.recorder@ipsas.net",
  "phone": "1-345-6666-0618",
  "creditCard": {
    "id": "d966aafa-c28b-4078-9e87-f7e9d76dd848",
    "customerId": "tgzwgThaiZqc5eDwbKk23nwjZqkap7",
    "type": "VISA",
    "cardNumber": "5566-2222-8282",
    "expiration": "70/6666",
    "balance": {
      "amount": 57012,
      "currency": "USD"
    }
  },
  "address": {
    "id": "4a788c96-e532-4a97-9b8b-bcb298636bc1",
    "streetAddress": "Cura diu me, miserere me?",
    "city": "rwjJS",
    "zipCode": "24121",
    "countryCode": "US"
  }
}

The recorded contract can be used to generate the stub response; e.g., the following configuration defines the recorded contract:

method: POST
name: recorded-customers-200-55240a69747cac85a881a3ab1841b09c2c66d6a9a9ae41c99665177d3e3b5bb7
path: /customers
description: recorded at 2023-01-02 03:18:11.80293 +0000 UTC for http://localhost:8080/customers
order: 0
group: customers
predicate: ""
request:
    match_query_params: {}
    match_headers:
        Content-Type: application/json
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__\\w+)","address.id":"(.+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5,5})","creditCard.balance.amount":"(__number__[+-]?\\d{1,10})","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__\\d{4,4}[-]\\d{4,4}[-]\\d{4,4})","creditCard.customerId":"(.+)","creditCard.expiration":"(.+)","creditCard.id":"(.+)","creditCard.type":"(__string__\\w+)","email":"(__string__\\w+.?\\w+@\\w+.?\\w+)","firstName":"(__string__\\w+)","id":"(.+)","lastName":"(__string__\\w+)","phone":"(__string__\\d{1,1}[-]\\d{3,3}[-]\\d{4,4}[-]\\d{4,4})"}'
    path_params: {}
    query_params: {}
    headers:
        Accept: '*/*'
        Content-Length: "522"
        Content-Type: application/json
        User-Agent: curl/7.65.2
    contents: '{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'
    example_contents: ""
response:
    headers:
        Content-Type:
            - application/json
        Date:
            - Mon, 02 Jan 2023 03:18:11 GMT
    contents: '{"id":"95d655e1-405e-4087-8a7d-56791eaf51cc","firstName":"quendam","lastName":"formaeque","email":"andrew.recorder@ipsas.net","phone":"1-345-6666-0618","creditCard":{"id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","type":"VISA","cardNumber":"5566-2222-8282","expiration":"70/6666","balance":{"amount":57012.00,"currency":"USD"}},"address":{"id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","city":"rwjJS","zipCode":"24121","countryCode":"US"}}'
    contents_file: ""
    example_contents: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__\\w+)","address.id":"(.+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5,5})","creditCard.balance.amount":"(__number__[+-]?\\d{1,10})","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__\\d{4,4}[-]\\d{4,4}[-]\\d{4,4})","creditCard.customerId":"(.+)","creditCard.expiration":"(.+)","creditCard.id":"(.+)","creditCard.type":"(__string__\\w+)","email":"(__string__\\w+.?\\w+@\\w+.?\\w+)","firstName":"(__string__\\w+)","id":"(.+)","lastName":"(__string__\\w+)","phone":"(__string__\\d{1,1}[-]\\d{3,3}[-]\\d{4,4}[-]\\d{4,4})"}'
    pipe_properties: []
    assertions: []
wait_before_reply: 0s

You can then invoke the consumer-driven contracts to generate stub responses, or invoke the generated tests against the producer implementation as described in the earlier sections. Another benefit of capturing test contracts from a recorded session is that it accurately captures all URLs, parameters and headers for both requests and responses, so that contract testing can precisely validate against existing behavior.
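
For example, since the recorded contract above belongs to the customers group, its generated tests can be executed against a running implementation with the same _contracts endpoint shown earlier:

curl -X POST http://localhost:8000/_contracts/customers -d \
	'{"base_url": "http://localhost:8080", "execution_times": 5, "verbose": true}'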

Summary

Though unit testing, component testing and end-to-end testing are common testing strategies used by most organizations, they don't provide adequate support for validating API specifications and interactions between consumers/clients and producers/providers of the APIs. Contract testing ensures that consumers and producers do not deviate from the specifications and can be used to validate changes for backward compatibility as APIs evolve. It also decouples consumers and producers while the API is still in development, as both parties can write code against the agreed contracts and test them independently. A service owner can generate producer contracts using tools such as api-mock-service based on an OpenAPI specification or user-defined constraints. Consumers can provide their consumer-driven contracts to the service providers to ensure that API changes don't break any consumers. These contracts can be stored in a source code repository or on a registry service so that contract tests can easily be accessed and executed as part of the build and deployment pipelines. The api-mock-service tool greatly assists in adding contract testing to your software development lifecycle and is freely available from https://github.com/bhatti/api-mock-service.

December 20, 2022

Property-based and Generative testing for Microservices

Filed under: REST,Technology,Testing — Tags: — admin @ 1:26 pm

The software development cycle for microservices generally includes unit testing during development, where mock implementations of the dependent services are injected with the desired behavior to test various scenarios and failure conditions. However, development teams often use real dependent services for integration testing of a microservice in a local environment. This poses a considerable challenge, as each dependent service may keep its own state, which makes it harder to reliably validate regression behavior or simulate certain error responses. Further, as the number of request parameters to the service or its downstream services grows, the combinatorial explosion of test cases becomes unmanageable. This is where property-based testing offers relief, as it allows testing against automatically generated input fuzz data, which is why this form of testing is also referred to as generative testing. A generator defines a function that generates random data based on the type of input and constraints on the range of input values. The property-based test driver then iteratively calls the system under test to validate the result and assert the desired behavior, e.g.:

import random
import string

MAX_ATTEMPTS = 100

def precondition(kind, value):
    # pre-condition based on the type of parameter and the range of input values it may take
    return 0 <= value <= 1000 if kind == "int" else 1 <= len(value) <= 50

def generate_test_input_param(kind):
    # generate random data meeting the pre-condition for the type
    if kind == "int":
        return random.randint(0, 1000)
    return "".join(random.choices(string.ascii_letters, k=random.randint(1, 50)))

def generate_test_input_params(kinds):
    return [generate_test_input_param(kind) for kind in kinds]

def function_under_test(count, name):
    # hypothetical system under test: builds a summary from the generated inputs
    return {"name": name.upper(), "total": count * 2}

for _ in range(MAX_ATTEMPTS):
    count, name = generate_test_input_params(["int", "string"])
    assert precondition("int", count) and precondition("string", name)
    output = function_under_test(count, name)
    # property assertions that must hold for every randomly generated input
    assert output["total"] % 2 == 0
    assert output["name"] == name.upper()

In the above example, the input parameters are randomly generated based on a pre-condition. The generated parameters are passed to the function under test, and the test driver validates the result using property assertions. This entire process is also referred to as fuzzing; it is repeated a fixed number of times to identify any input parameters for which the property assertions fail. There are many libraries for property-based testing in various languages, such as QuickCheck, fast-check, junit-quickcheck and ScalaCheck, but we will use the api-mock-service library to demonstrate these capabilities for testing microservice APIs.

The following sections describe how the api-mock-service library can be used to test microservices with fuzzing/property-based approaches and to mock dependent services so that they produce the desired behavior.

Sample Microservices Under Test

A sample eCommerce application will be used to demonstrate property-based and generative testing. The application uses various microservices to implement an online shopping experience. The primary purpose of this example is to show how different parameters can be passed to the microservices, where the microservice APIs validate the input parameters, perform simple business logic and then generate either a valid result or an error condition. You can view the OpenAPI specifications for this sample app here.

Customer APIs

The customer APIs define operations to manage customers who shop online, e.g.:

Customer APIs

Product APIs

The product APIs define operations to manage products that can be shopped online, e.g.:

Product APIs

Payment APIs

The payment APIs define operations to charge credit card and pay for online shopping, e.g.:

Payment APIs

Order APIs

The order APIs define operations to purchase a product from the online store; they use the above APIs to validate customers, check product inventory, charge payments and then store a record of the order, e.g.:

Order APIs

Defining Test Scenarios with Open-API Specifications

In this example, test scenarios will be generated by api-mock-service based on the OpenAPI specification ecommerce-api.json. First, start the mock service as follows:

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8000 -p 9000:9000 -e HTTP_PORT=8000 -e PROXY_PORT=9000 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets api-mock-service

And then upload the OpenAPI specification ecommerce-api.json:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

It will generate mock APIs for all microservices; for example, you can fetch a result from the products API:

curl http://localhost:8000/products

to produce:

[
  {
    "id": "fd6a5ddb-35bc-47a9-aacb-9694ff5f8a32",
    "category": "TOYS",
    "inventory": 13,
    "name": "Se nota.",
    "price":{
      "amount":2,
      "currency": "USD"
    }
  },
  {
    "id": "47aab7d9-ecd2-4593-b1a6-c34bb5ca02bc",
    "category": "MUSIC",
    "inventory": 30,
    "name": "Proferuntur mortem.",
    "price":{
      "amount":23,
      "currency": "CAD"
    }
  },
  {
    "id": "ae649ae7-23e3-4709-b665-b1b0f436c97a",
    "category": "BOOKS",
    "inventory": 8,
    "name": "Cor.",
    "price":{
      "amount":13,
      "currency": "USD"
    }
  },
  {
    "id": "a3bd8426-e26d-4f66-8ee8-f55798440dc3",
    "category": "MUSIC",
    "inventory": 43,
    "name": "E diutius.",
    "price":{
      "amount":22,
      "currency": "USD"
    }
  },
  {
    "id": "7f328a53-1b64-4e4f-b6a6-7a69aed1b183",
    "category": "BOOKS",
    "inventory": 54,
    "name": "Dici utroque.",
    "price":{
      "amount":23,
      "currency": "USD"
    }
  }
]

The above response is randomly generated based on the properties defined in the OpenAPI specification, and calling this API will automatically generate both valid and error responses, e.g. calling “curl http://localhost:8000/products” again will return:

< HTTP/1.1 400 Bad Request
< Content-Type:
< Vary: Origin
< X-Mock-Path: /products
< X-Mock-Request-Count: 1
< X-Mock-Scenario: getProductByCategory-07ef44df0d38389ca9d589faaab9e458bd79e8abe7d2e1149e56c00820fac1fb
< Date: Tue, 20 Dec 2022 04:54:58 GMT
< Content-Length: 122
<
{ [122 bytes data]

* Connection #0 to host localhost left intact
{
  "errors": [
    "category_gym_bargain",
    "expand_tuna_stomach",
    "cage_enroll_between",
    "bulk_choice_category",
    "trend_agree_purse"
  ]
}

Applying Property-based/Generative Testing for Clients of Microservices

Upon uploading the OpenAPI specifications of the microservices, api-mock-service automatically generates templates for producing mock responses and error conditions, which can be customized for property-based and generative testing of microservice clients by defining constraints for generating input/output data and assertions for request/response validation.

Client-side Testing for Listing Products

You can find generated mock scenarios for listing products on the mock service using:

curl -v http://localhost:8000/_scenarios|jq '.'|grep "GET.getProductByCategory"

which returns:

"/_scenarios/GET/getProductByCategory-1a6d4d84e4a8a1ad706d671a26e66c419833b3a99f95cc442942f96d0d8f43f8/products": {
"/_scenarios/GET/getProductByCategory-6e522e565bb669ab3d9b09cc2e16b9d636220ec28a860a1cc30e9c5104e41f53/products": {
"/_scenarios/GET/getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1/products": {
"/_scenarios/GET/getProductByCategory-9ed14ecd11bbeb9f7bfde885d00efcbf168661354e4c48fe876c545e9a778302/products": {

and then invoke one of the above URL paths, e.g.:

curl -v http://localhost:8000/_scenarios/GET/getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1/products

which will return the generated scenario definition, e.g.:

method: GET
name: getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1
path: /products
description: ""
order: 1
group: products
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{}'
    path_params: {}
    query_params:
        category: '[\x20-\x7F]{1,128}'
    headers:
        "Content-Type": "application/json"
    contents: ""
response:
    headers: {}
    contents: '[{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 10000 10000}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandStringMinMax 0 0}}"}}]'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":".+","id":"(__string__\\w+)","inventory":".+","name":"(__string__\\w+)","price.amount":".+","price.currency":"(__string__\\w+)"}'
wait_before_reply: 0s

We can customize the above response contents using built-in template functions in the api-mock-service library to generate a fuzz response, e.g.:

    headers:
        "Content-Type":
          - "application/json"
    contents: >
      [
{{- range $val := Iterate 5}}
        {
          "id": "{{UUID}}",
          "category": "{{EnumString `BOOKS MUSIC TOYS`}}",
          "inventory": {{RandNumMinMax 1 100}},
          "name": "{{RandSentence 1 3}}",
          "price":{
            "amount":{{RandNumMinMax 1 25}},
            "currency": "{{EnumString `USD CAD`}}"
          }
        }{{if lt $val 4}},{{end}}
{{ end }}
      ]
    status_code: 200

In the above example, we slightly improved the test template by generating product entries in a loop and using built-in functions to randomize the data. You can upload this scenario using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_products.yaml \
	http://localhost:8000/_scenarios

You can similarly define a template for returning an error response, i.e.:

method: GET
name: error-products
path: /products
description: ""
order: 2
group: products
predicate: '{{NthRequest 2}}'
request:
    headers:
        "Content-Type": "application/json"
    query_params:
        category: '[\x20-\x7F]{1,128}'
response:
    headers: {}
    contents: '{"errors":["{{RandSentence 5 10}}"]}'
    contents_file: ""
    status_code: {{EnumInt 400 415 500}}
    match_contents: '{"errors":"(__string__\\w+)"}'
wait_before_reply: 0s

Invoking curl -v http://localhost:8000/products repeatedly will return both of those test scenarios so that the client code can test for various conditions.
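
For example, a client test could simply invoke the endpoint several times and check that it handles both the success and error status codes; a minimal sketch using curl:

# invoke the products API a few times to exercise both the success and error scenarios
for i in {1..5}; do
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8000/products?category=BOOKS"
done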

Client-side Testing for Creating Products

You can find the mock scenarios for creating products that were generated from the above OpenAPI specifications using:

curl -v http://localhost:8000/_scenarios|jq '.'|grep "POST.saveProduct"

You can then customize the scenario as follows:

method: POST
name: saveProduct
path: /products
description: ""
order: 0
group: products
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    path_params: {}
    query_params: {}
    headers:
        "Content-Type": "application/json"
    contents: |
        category: MUSIC
        id: suavitas
        inventory: 5408.89695278641
        name: leporem
        price:
            amount: 7373.800941656166
            currency: cordis
response:
    headers: {}
    contents: '{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 5 500}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandStringMinMax 0 0}}"}}'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    pipe_properties:
      - id
      - name
    assertions: []
wait_before_reply: 0s

And then upload the customized scenario and invoke the above POST /products API using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_product.yaml http://localhost:8000/_scenarios

curl  http://localhost:8000/products -d \
  '{"category":"BOOKS","id":"123","inventory":"10","name":"toy 1","price":{"amount":12,"currency":"USD"}}'

The client code can then test for product properties, and other error scenarios can be added to simulate failure conditions.

Applying Property-based/Generative Testing for Microservices

The api-mock-service test scenarios defined above can also be used to test the microservice implementations. You can start your service (e.g., we will use sample-openapi for testing purposes) and then invoke the test requests for server-side testing using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_products.yaml \
	http://localhost:8000/_scenarios
curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_product.yaml \
	http://localhost:8000/_scenarios

curl -k -v -X POST http://localhost:8000/_contracts/products -d \
	'{"base_url": "http://localhost:8080", "execution_times": 5, "verbose": true}'

The above command will submit a request to execute all scenarios belonging to the products group five times and then return:

{
  "results": {
    "getProducts_0": {},
    "getProducts_1": {},
    "getProducts_2": {},
    "getProducts_3": {},
    "getProducts_4": {},
    "saveProduct_0": {
      "id": "895f584b-dc65-4950-982e-167680bcd133",
      "name": "Opificiis misera dei."
    },
    "saveProduct_1": {
      "id": "d89b6c16-549c-4baa-9dca-4dd9bb4b3ecf",
      "name": "Ea sumus aula teneant."
    },
    "saveProduct_2": {
      "id": "15dd54eb-fe89-4de8-9570-59fca20b9969",
      "name": "Vim odor et respondi."
    },
    "saveProduct_3": {
      "id": "e3769044-2a19-4e86-b0aa-9724378a0113",
      "name": "Me tua timeo an."
    },
    "saveProduct_4": {
      "id": "07ee20b9-df9a-487d-9ff9-cf76bef09a8f",
      "name": "Ruminando latinae omnibus."
    }
  },
  "metrics": {
    "getProducts_counts": 5,
    "getProducts_duration_seconds": 0.007,
    "saveProduct_counts": 5,
    "saveProduct_duration_seconds": 0.005
  },  
  "errors": {},
  "succeeded": 10,
  "failed": 0
}

You can also add custom assertions to validate the response in the save-product scenario:

method: POST
name: saveProduct
path: /products
description: ""
order: 0
group: products
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"category":"(BOOKS|MUSIC|TOYS)","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    path_params: {}
    query_params: {}
    headers:
        "Content-Type": "application/json"
    contents: |
        category: TOYS
        id: tempus
        inventory: 3890.9145609093966
        name: pleno
        price:
            amount: 5539.183583809511
            currency: "{{EnumString `USD CAD`}}"
response:
    headers: {}
    contents: '{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 5 500}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"$"}}'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    pipe_properties:
      - id
      - name
    assertions:
        - VariableGE contents.inventory 5
        - VariableContains contents.category S
        - VariableContains contents.category X
wait_before_reply: 0s

If you try to run it again, the execution will fail with the following error because none of the category values contain an X:

{
  "results": {
    "getProducts_0": {},
    "getProducts_1": {},
    "getProducts_2": {},
    "getProducts_3": {},
    "getProducts_4": {}
  },
  "errors": {
    "saveProduct_0": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_1": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_2": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_3": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_4": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'"
  },
  "succeeded": 5,
  "failed": 5
}

Summary

Unit testing and other testing methodologies don't rule out the presence of bugs, but they can greatly reduce their probability. However, with large test suites, the maintenance of tests incurs a high development cost, especially if those tests are brittle and require frequent changes. Property-based/generative testing can help fill gaps in unit testing while keeping the size of the test suite small. The api-mock-service tool is designed to mock and test microservices using fuzzing and property-based testing techniques. This mocking library can be used to test both client and server-side implementations and can also be used to generate error conditions that are not easily reproducible. It can be a powerful tool in your toolbox when developing distributed systems with a large number of services, which can be difficult to deploy and test locally. You can read more about the api-mock-service library in “Mocking and Fuzz Testing Distributed Micro Services with Record/Play, Templates and OpenAPI Specifications” and download it freely from https://github.com/bhatti/api-mock-service.

December 4, 2022

Evolving Software Development Model for Building Distributed Systems

Filed under: Business,Technology — Tags: , — admin @ 5:19 pm

1. Overview

Over the last few decades, software systems have evolved from mainframe and client-server models to distributed systems and service-oriented architecture. In the early 2000s, Amazon and Netflix led the industry in adopting microservices by applying Conway's law and self-organizing team structures, where a small 2-pizza team owns the entire lifecycle of its microservices, including operational responsibilities. The microservice architecture, with its small, cross-functional and independent structure, has helped teams develop, test and deploy microservices independently and with greater agility. However, software systems are becoming increasingly complex with the Cambrian explosion of microservices, and the ecosystem is reaching a boiling point where building new features, releasing enhancements and handling the operational load of maintaining high availability, scalability, resilience, security, observability, etc. are slowing down development teams and raising artificial complexity as a result of mixing different concerns.

The following diagram shows the difference between monolithic architecture and microservice architecture:

microservices

As you can see in the above diagram, each microservice is responsible for managing a number of cross-cutting concerns such as AuthN/AuthZ, monitoring, rate limiting, configuration, secret management, etc., which adds to the scope of work for developing each service. The following sections dive into the fundamental causes of the complexity that comes with microservices and a path forward for evolving the development methodology to build these microservices more effectively.

2. Perils in Building Distributed Systems

The following is a list of major pitfalls faced by development teams when building distributed systems and services:

2.1 Coordinating Feature Development

Feature development becomes more convoluted as the dependencies on downstream services increase, and any large change in a single service often requires API changes or additional capabilities from multiple dependent services. This creates numerous challenges such as prioritizing feature development among different teams, coordinating release timelines and making progress in the absence of the dependent functionality in the development environment. This is often tackled with additional personnel for project management and by managing dependencies with Gantt charts or other project management tools, but it still leads to unexpected delays and miscommunication among team members about the deliverables.

2.2 Low-level Concurrency Controls

Development teams often use imperative languages and low-level abstractions to implement distributed services, where each request is served by a native thread in a web server. Due to the extensive overhead of native threads, such as stack size limits, the web server becomes constrained by the maximum number of concurrent requests it can support. In addition, these native threads often share common state, which must be protected with a mutex, semaphore or lock to avoid data corruption. These low-level, primitive abstractions couple business logic with concurrency logic, which adds accidental complexity when safeguarding the shared state or communicating between different threads. This problem worsens over time as the code size increases, resulting in subtle concurrency-related heisenbugs that may produce incorrect results in the production environment.

2.3 Security

Each microservice requires implementing authentication, authorization, secure communication, key management and other aspects of security. The development teams generally have to support these security aspects for each service they own, and they have to stay on top of security patches and vulnerabilities. For example, the zero-day log4j vulnerability in December 2021 created havoc in most organizations, as multiple services were affected by the bug and needed to be patched immediately. This resulted in a large effort by each development team to patch their services and deploy the patched services as soon as possible. Worse, the development teams had to apply patches multiple times because the initial bug fixes from the log4j team didn't fully work, thus further multiplying the work for each development team. With the growth of the dependency stack or bill of materials for third-party libraries in modern applications, development teams face an overwhelming operational burden and enormous security risk to support their services safely.

2.4 Web Server

In general, each microservice requires a web server, which adds computing and administration overhead for deploying and running the service stack. The web server must be running all the time whether the service is receiving requests or not, thus needlessly consuming CPU, memory and storage resources.

2.5 Colocating Multiple Services

Development teams often start with monolithic-style applications that host multiple services on a single web server, or with segregated application servers hosting multiple services on the same web server, to lessen the development and deployment effort. Monolithic and service-colocation architectures hinder speed, agility and extensibility as the code becomes complicated and harder to maintain and reason about due to the lack of isolation. In this style of deployment, computing resources can be entirely consumed by a single component, or a bug in one service can crash the entire system. As each service may have unique runtime or usage characteristics, it's also arduous to scale a single service, plan service capacity or isolate service failures in a colocated runtime environment.

2.6 Cross Cutting Concerns

Building distributed systems requires managing a lot of horizontal concerns such as security, resilience, business continuity and availability, but coupling these common concerns with the business logic results in inconsistencies and higher complexity due to the differing implementations across microservices. Mixing these concerns with business logic in microservices means each development team has to solve them independently, and any omission or divergence may cause a miserable user experience, faulty results, poor protection against load spikes or a security breach in the system.

2.7 Service Discovery

Though microservices use various synchronous and asynchronous protocols to communicate with other services, they often store the endpoints of other services locally in their service configurations. This adds a maintenance and operational burden for maintaining the endpoints of all dependent services in each development, test and production environment. In addition, services may not be able to apply certain containment or access-control policies, such as not invoking a cross-region service in order to maintain lower latency or to sustain a service for disaster recovery.

2.8 Architectural Quality Attributes

Architectural quality attributes include performance, availability, sustainability, security, scalability, fault tolerance, resilience, recovery, usability, etc. Each development team not only has to manage these attributes for each service but also often requires coordination with other teams when scaling their services so that downstream services can handle the additional load or meet the availability/reliability guarantees. Thus, availability, fault tolerance, capacity management and other architectural concerns become tied to downstream services, as an outage in any of those services directly affects upstream services. Improving availability, scalability, fault tolerance or other architectural quality attributes therefore often requires changes from the dependent services, which expands the scope of the development work.

2.9 Superfluous Development Work

When developing a microservice, a development team owns the end-to-end development and release process, which includes full software development lifecycle support such as:

  • maintaining build scripts for CI/CD pipelines
  • building automation tools for integration/functional/load/canary tests
  • defining access policies related to throttling/rate-limits
  • implementing consistent error handling, idempotency behavior, contextual information across services
  • adding alarms/metrics/monitoring/observability/notification/logs
  • providing customized personal and system dashboards
  • supporting data encryption and managing secret keys
  • defining security policies related to AuthN/AuthZ/Permissions/ACL
  • defining network policies related to VPN, firewall, load-balancer, network/gateway/routing configuration
  • adding compression, caching, and any other common pre/post processing for services

As a result, any deviations or bugs in the implementation of these processes, or misconfiguration of the underlying infrastructure, can lead to an inconsistent user experience, security gaps and outages. Some organizations maintain lengthy checklists and hefty review processes for applying best practices before a software release, but these often miss key learnings from other teams and slow down development due to the cumbersome release process.

2.10 Yak shaving when developing and testing features

Due to the enormous artificial complexity of microservices with deep dependency stacks, development teams have to spend an inordinate amount of time setting up a development environment when building a new feature. Feature development requires testing the features using unit tests with mocked behavior of dependent services and using integration tests with real dependent services in a local development environment. However, it's not always possible to run the complete stack locally and fully test the changes for new features, so developers are encumbered with finding an alternative integration environment where other developers may also be testing their features. All this yak shaving makes software development awfully tedious and error prone because developers can't test their features in isolation with high confidence. This means that development teams find bugs in later phases of the release process, which may require a rollback of feature changes and block additional releases until the bugs are fixed in the main/release branch.

3. Path to the Enlightenment

Following are a few recommendations to remedy the above pitfalls in the development of distributed systems and microservices:

3.1 Higher level of Development Abstraction

Instead of using low-level imperative languages or low-level concurrency controls, high-level abstractions can be applied to simplify the development of the microservices as follows:

3.1.1 Actor Model

The Actor model, first introduced by Carl Hewitt in 1973, provides a high-level abstraction for concurrent computation. An actor uses a mailbox or queue to store incoming messages and processes one message at a time using local state in a single green thread or coroutine. An actor can create other actors and send messages without any lock-based synchronization or blocking. Actors are reactive, so they cannot initiate any action on their own; they simply react to external stimuli in the form of message passing. Actors also provide better error handling: applications can define a hierarchy of actors with parent/child relationships, and a child actor that crashes on a system error is monitored and supervised by its parent actor for failure recovery.

[Figure: Actor model]

The actor model is supported natively in many languages such as Erlang/Elixir, Scala, Swift and Pony, and is available as a library in many other languages. In these languages, actors generally use green threads, coroutines or a preemptive scheduler with non-blocking I/O operations. As actors incur much lower overhead compared to native threads, they can be used to implement microservices with much greater scalability and performance. In addition, message passing circumvents the need to guard shared state, as each actor only maintains local state, which leads to a more robust and reliable implementation of microservices. Here is an example of the actor model in Erlang:

-module(sum).
-export([init/0, add/1, get/0]).

init() ->
    Pid = spawn(fun() -> loop(0) end),
    register(sumActor, Pid).

loop(N) ->
    receive
        {add, X} -> loop(N+X);
        {Client, get} ->
            Client ! N,
            loop(N)
    end.

add(X) ->
    sumActor ! {add, X}.

get() ->
    sumActor ! {self(), get},
    receive Result -> Result end.

In the above example, an actor is spawned to run the loop function, which uses tail recursion to receive the next message from the queue and then processes it based on the message tag such as add or get. The client code sends messages to the sumActor symbol, which is registered with a local registry. As an actor only maintains local state, microservices may use an external data store and manage a state machine using an orchestration-based SAGA pattern to trigger the next action.
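
Since the post recommends an orchestration-based SAGA to drive the next action, here is a rough Go sketch (for illustration only) of a coordinator that runs each step and triggers compensations in reverse order on failure; the step names are hypothetical:

package main

import (
    "errors"
    "fmt"
)

// step pairs an action with a compensating action used for rollback.
type step struct {
    name       string
    action     func() error
    compensate func()
}

// runSaga executes steps in order and compensates completed steps if one fails.
func runSaga(steps []step) error {
    for i, s := range steps {
        if err := s.action(); err != nil {
            for j := i - 1; j >= 0; j-- {
                steps[j].compensate()
            }
            return fmt.Errorf("saga failed at %s: %w", s.name, err)
        }
    }
    return nil
}

func main() {
    err := runSaga([]step{
        {"reserve-inventory", func() error { return nil }, func() { fmt.Println("release inventory") }},
        {"charge-payment", func() error { return errors.New("card declined") }, func() { fmt.Println("refund payment") }},
    })
    fmt.Println(err)
}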

3.1.2 Function as a service (FaaS) and Serverless Computing

Function as a service (FaaS) offers serverless computing to simplify managing physical resources. Cloud vendors provide AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads, and there is also open-source support for FaaS such as OpenFaaS and OpenWhisk on top of Kubernetes or OpenShift. These functions resemble the actor model: each function is triggered by an event and is designed around single responsibility, idempotency and shared-nothing principles so that it can be executed concurrently.

[Figure: FaaS and BaaS]

FaaS and serverless computing can be used to develop microservices where business logic is embedded within serverless functions, while platform services such as storage, messaging, caching and SMS are exposed via Backend as a Service (BaaS). BaaS adds additional business logic on top of Platform as a Service (PaaS).
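
As a concrete illustration, a FaaS handler is just a single event-driven function; here is a minimal sketch using the AWS Lambda Go runtime, where the event payload and handler name are assumptions for illustration:

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-lambda-go/lambda"
)

// OrderEvent is a hypothetical event payload; real payloads depend on the trigger.
type OrderEvent struct {
    OrderID string  `json:"orderId"`
    Amount  float64 `json:"amount"`
}

// handle processes a single event; like an actor, it is stateless, idempotent and
// reacts to one message (event) at a time.
func handle(ctx context.Context, evt OrderEvent) (string, error) {
    return fmt.Sprintf("processed order %s for %.2f", evt.OrderID, evt.Amount), nil
}

func main() {
    lambda.Start(handle) // the Lambda runtime invokes handle once per event
}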

3.1.3 Agent based computing

Microservices architecture decouples data access from the business logic, so microservices fetch data from a data store, which incurs a higher overhead if the business logic needs to fetch a lot of data for processing or filtering before generating results. In contrast, agent-style computing allows migrating business logic to where the data or computing resources reside, so it can process data more efficiently. In the simplest example, an agent may behave like a stored procedure, where a function is passed to a data store and executed within the database; other kinds of agents may support additional capabilities to gather data from different data stores or sources and then produce the desired results after processing the data remotely.

3.2 Service and Schema Registry

A service registry allows microservices to register their endpoints so that other services can look them up instead of storing the endpoints locally. This also allows the registry to enforce authorization and access policies for communication based on geographic location or other constraints. A service registry may additionally allow registering mock services for testing in a local development environment to facilitate feature development. In addition, the registry may store schema definitions for the API models so that services can easily validate requests/responses or support multiple versions of the API contracts.
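
For example, a service might register its endpoint and health check with a registry such as HashiCorp Consul at startup; a minimal sketch, where the service name, address and health endpoint are assumptions:

package main

import (
    "log"

    consul "github.com/hashicorp/consul/api"
)

func main() {
    // Connect to a local Consul agent (assumed to be running with default settings).
    client, err := consul.NewClient(consul.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register this service's endpoint so other services can discover it
    // instead of hard-coding the address in their own configuration.
    registration := &consul.AgentServiceRegistration{
        ID:      "orders-1", // hypothetical instance id
        Name:    "orders",   // hypothetical service name
        Address: "10.0.0.12",
        Port:    8080,
        Check: &consul.AgentServiceCheck{
            HTTP:     "http://10.0.0.12:8080/health", // assumed health endpoint
            Interval: "10s",
        },
    }
    if err := client.Agent().ServiceRegister(registration); err != nil {
        log.Fatal(err)
    }
}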

3.3 API Router, Reverse-Proxy or Gateway

An API router, reverse-proxy or API gateway is a common pattern with microservices for routing, monitoring, versioning, securing and throttling APIs. These patterns can also be used with a FaaS architecture, where an API gateway provides these capabilities and eliminates the need for a web server in each service function. Thus, an API gateway or router can lower the computing cost for each service and reduce the complexity of maintaining non-functional capabilities or -ilities.
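
A minimal sketch of this pattern using the reverse proxy from Go's standard library, with an illustrative upstream address and a naive authorization check standing in for the gateway's cross-cutting concerns:

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // The upstream microservice address is an assumption for illustration.
    upstream, err := url.Parse("http://orders.internal:8080")
    if err != nil {
        log.Fatal(err)
    }
    proxy := httputil.NewSingleHostReverseProxy(upstream)

    // The gateway applies cross-cutting concerns (auth, throttling, metrics)
    // before forwarding the request to the upstream service.
    http.HandleFunc("/orders/", func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("Authorization") == "" {
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        proxy.ServeHTTP(w, r)
    })

    log.Fatal(http.ListenAndServe(":9090", nil))
}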

3.4 Virtualization

Virtualization abstracts computer hardware and uses a hypervisor to create multiple virtual computers with different operating systems and applications on top of a single physical computer.

3.4.1 Virtual Machines

The initial implementation of virtualization was based on virtual machines for building virtualized computing environments and emulating a physical computer. Virtual machines use a hypervisor to interface with the underlying physical computer.

[Figure: Virtual machines]

3.4.2 Containers

Containers implement virtualization using the host operating system instead of a hypervisor, thus providing more lightweight and faster provisioning of computing resources. Containers use platforms such as Docker and Kubernetes to execute applications and services, which are bundled into images based on the Open Container Initiative (OCI) standard.

[Figure: Containers]

3.4.3 MicroVM

MicroVMs such as Firecracker and crosvm are based on the kernel-based virtual machine (KVM), with the host OS acting as a hypervisor to provide isolation and security. As MicroVMs only include essential features for networking, storage, throttling and metadata, they are quick to start and can scale to support many VMs with minimal overhead. A number of serverless platforms such as AWS Lambda, appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube have adopted the Firecracker VM, which offers low overhead for starting a new virtual machine or executing a serverless workload.

[Figure: MicroVM]

3.4.4 WebAssembly

WebAssembly is a stack-based virtual machine that can run at the edge or in the cloud. Applications written in Go, C, Rust, AssemblyScript, etc. are compiled into a WebAssembly binary and then executed on a WebAssembly runtime such as extism, faasm, wasmtime, wamr, wasmr and wagi. WebAssembly supports the WebAssembly System Interface (WASI) standard, which provides access to system APIs across operating systems, similar to the POSIX standard. There is also active development of the WebAssembly Component Model with proposals such as WebIDL bindings and Interface Types. This allows you to write microservices in any supported language, compile the code into a WASM binary and deploy it on a managed platform with support for security, traffic management, observability, etc.


A number of WebAssembly platforms such as teaclave, wasmCloud, fermyon and Lunatic have also adopted the actor model to build platforms for writing distributed applications.

"If WASM+WASI existed in 2008, we wouldn't have needed to created Docker. That's how important it is. Webassembly on the server is the future of computing. A standardized system interface was the missing link. Let's hope WASI is up to the task!" — Solomon Hykes, co-founder of Docker

3.5 Instrumentation and Service Binding

In order to reduce service code that deals specifically with non-functional capabilities such as authentication, authorization, logging, monitoring, etc., the business service can be instrumented to provide those capabilities at compile or deployment time. This means that the development team can largely focus on the business requirements while instrumentation takes care of adding metrics, failure reporting, diagnostics, monitoring, etc. without any development work. In addition, any external platform dependencies such as a messaging service, orchestration, database, key/value store or caching can be injected into the service dynamically at runtime. The runtime can be configured to provide different implementations of the platform services, e.g. it may use a local Redis server for the key/value store in a hosted environment or AWS/Azure's implementation in a cloud environment.
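
A rough sketch of the same idea at the code level: instrumentation is layered around a business handler as middleware so that the handler itself stays free of non-functional concerns (the handler and log fields are illustrative):

package main

import (
    "log"
    "net/http"
    "time"
)

// withInstrumentation wraps any handler with request logging and latency measurement
// without touching the business logic.
func withInstrumentation(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)
        log.Printf("method=%s path=%s latency=%s", r.Method, r.URL.Path, time.Since(start))
    })
}

func main() {
    // The business handler only knows about the business response.
    business := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.Handle("/orders", withInstrumentation(business))
    log.Fatal(http.ListenAndServe(":8080", nil))
}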

When deploying services with WebAssembly support, the instrumentation may use WebAssembly libraries for extending the services to support authentication, rate-limiting, observability, monitoring, state management, and other non-functional capabilities, so that the development work can be shortened as shown below:

[Figure: WASM platform]

3.6 Orchestration and Choreography

Orchestration and choreography allow writing microservices that can be easily composed to provide higher abstractions for business services. In an orchestration design, a coordinator manages synchronous communication among different services, whereas choreography uses an event-driven architecture to communicate asynchronously. The actor model fits naturally with event-based architecture for communicating with other actors and external services, while orchestration services can be used to model complex business processes where a SAGA pattern manages the state transitions of different activities. Microservices may also use a staged event-driven architecture (SEDA) to decompose complex event-driven services into a set of stages connected by queues, which supports better modularity and code reuse. SEDA allows enforcing admission control on each event queue and grants flexible scheduling for processing events based on adaptive workload controls and load-shedding policies.
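
A small Go sketch of the SEDA idea, using buffered channels as the queues that connect stages; the stage names and payloads are illustrative:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Each stage is connected to the next by a bounded queue (buffered channel),
    // so admission control and back-pressure can be applied per stage.
    raw := make(chan string, 10)
    validated := make(chan string, 10)

    // Stage 1: validation.
    go func() {
        for msg := range raw {
            if strings.TrimSpace(msg) != "" {
                validated <- msg
            }
        }
        close(validated)
    }()

    // Feed a few events into the first stage.
    go func() {
        for _, e := range []string{"order-1", "", "order-2"} {
            raw <- e
        }
        close(raw)
    }()

    // Stage 2: processing.
    for msg := range validated {
        fmt.Println("processed", msg)
    }
}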

3.7 Automation, Continuous Deployment and Infrastructure as a code

Automation is key to removing drudgery from the development process and consolidating common build processes such as continuous integration and deployment, which improves the productivity of a development team. Development teams can employ continuous delivery to deploy small and frequent changes. Continuous deployment often uses rolling updates, blue/green deployments or canary deployments to minimize disruption to end users, and the monitoring system watches error rates at each stage of the deployment and automatically rolls back changes if a problem occurs.

Infrastructure as code (IaC) uses a declarative language to define development, test and production environments, which is managed in source control. This provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Many cloud vendors provide support for IaC, such as Azure Resource Manager (ARM), AWS Cloud Development Kit (CDK) and HashiCorp Terraform, to deploy computing resources.

3.8 Proxy Patterns

The following sections show how to implement cross-cutting concerns using proxy patterns for building microservices:

3.8.1 Sidecar Model

The sidecar model helps with modularity and reusability: an application requires two containers, an application container and a sidecar container, where the sidecar container provides additional functionality such as an SSL proxy for the service, observability and metrics collection for the application container.

[Figure: Sidecar pattern]

The sidecar pattern generally uses another container to proxy all traffic, enforcing security, access control and throttling before forwarding incoming requests to the microservice.

3.8.2 Service Mesh

The service mesh uses a mesh of sidecar proxies to enable:

  • Dynamic request routing for blue-green deployments, canaries, and A/B testing
  • Load balancing based on latency, geographic locations or health checks
  • Service discovery based on a version of the service in an environment
  • TLS/mTLS encryption
  • Authentication and Authorization
  • Keys, certificates and secrets management
  • Rate limiting and throttling
  • State management and PubSub
  • Observability, metrics, monitoring, logging
  • Distributed tracing
  • Traffic management and traffic splitting
  • Circuit breakers and retries using libraries like Finagle, Stubby, and Hystrix to isolate unhealthy instances and gradually add them back after successful health checks
  • Error handling and fault tolerance
  • Control plane to manage routing tables, service discovery, load balancer and other service configuration
[Figure: Service mesh]

The service mesh pattern uses a data plane to host microservices, and all incoming and outgoing requests go through a sidecar proxy that implements cross-cutting concerns such as security, routing, throttling, etc. The control plane in the mesh network allows administrators to change the behavior of the data-plane proxies or the configuration for data access. Popular service mesh frameworks include Consul, Distributed Application Runtime (Dapr), Envoy, Istio, Linkerd, Kong, Koyeb and Kuma, which provide a control plane and data plane with built-in support for networking, observability, traffic management and security. Dapr also supports the actor model using the Orleans Virtual Actor pattern, which leverages the scalability and reliability guarantees of the underlying platform.

3.9 Cell-Based Architecture

A cell-based architecture (CBA) allows grouping business functions into single units called “cells”, which provides better scalability, modularity, composability, disaster recovery and governance for building microservices. It reduces the number of deployment units for microservices, thus simplifying the artificial complexity of building distributed systems.

[Figure: Cell-based architecture]

A cell is an immutable collection of components and services that is independently deployable, manageable, and observable. The components and services in a cell communicate with each other using local network protocols, and with other cells via network endpoints through a cell gateway or brokers. A cell defines an API interface as part of its definition, which provides better isolation, a smaller blast radius and agility because releases can be rolled out cell by cell. However, in order to maintain this isolation, a cell should either be self-contained for all service dependencies, or its dependent cells should be in the same region so that there is no cross-region dependency.

[Figure: Cell-based data plane]

3.10 12-Factor app

The 12-factor app is a set of best practices from Heroku for deploying applications and services in a virtualized environment. It recommends declarative configuration for automation, continuous deployment and other best practices. Though these recommendations are a bit dated, most of them still hold, except that you may consider storing credentials or keys in a secret store or in encrypted files instead of simply using environment variables.
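
For example, a service might keep non-sensitive settings in environment variables per the 12-factor guidance while reading credentials from a secret file mounted by the platform; a sketch where the secret path is an assumption:

package main

import (
    "log"
    "os"
    "strings"
)

func main() {
    // Non-sensitive configuration still comes from the environment (12-factor style).
    port := os.Getenv("HTTP_PORT")
    if port == "" {
        port = "8080"
    }

    // Credentials are read from a mounted secret file (e.g. injected by the platform)
    // instead of being exposed as an environment variable.
    secret, err := os.ReadFile("/var/run/secrets/db-password") // assumed mount path
    if err != nil {
        log.Fatalf("failed to read secret: %v", err)
    }
    dbPassword := strings.TrimSpace(string(secret))

    log.Printf("starting on port %s (db password length=%d)", port, len(dbPassword))
}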

4. New Software Development Workflow

The architecture patterns, virtual machines, WebAssembly and serverless computing described in the sections above can be used to simplify the development of microservices and support both the functional and non-functional requirements of the business. The following sections describe how these patterns can be integrated with the software development lifecycle:

4.1 Development

4.1.1 Adopting WebAssembly as the Lingua Franca

WebAssembly, along with its component model and the WASI standard, has been gaining support, and many popular languages can now be compiled into a wasm binary. Microservices can be developed in any supported language, and ubiquitous WebAssembly support in the production environment can reduce development and maintenance drudgery for application development teams. In addition, teams can leverage a number of serverless platforms that support WebAssembly, such as extism, faasm, netlify, vercel, wasmcloud and wasmedge, to reduce the operational load.

4.1.2 Instrumentation and Service Binding

The compiled WASM binary for a microservice can be instrumented, similar to aspect-oriented programming, so that the generated code automatically supports horizontal concerns such as circuit breakers, diagnostics, failure reporting, metrics, monitoring, retries, tracing, etc. without any additional effort by the application developer.

In addition, the runtime can use service mesh features to bind external platform services such as an event bus, data store, caching, service registry or dependent business services. This simplifies the development effort as the development team can use shared services for developing microservices instead of configuring and deploying infrastructure for each service. The shared infrastructure services support multi-tenancy and distinct namespaces for each microservice so that it can manage its state independently.

4.1.3 Adopting Actor Model for Microservices

As discussed above, the actor model offers a lightweight and highly concurrent model to implement microservice APIs. Actor-based microservices can be coded in any supported high-level language and then compiled into WebAssembly with WASI support. A number of WebAssembly serverless platforms including Lunatic and wasmCloud already support the actor model, while other platforms such as fermyon use HTTP-based request handlers, which are invoked for each request in a way similar to actor-based message passing. For example, here is a sample wasmCloud actor in Rust, though any language with wit-bindgen support works as well:

// Assumed imports from the wasmCloud actor SDK and HTTP server interface:
use wasmbus_rpc::actor::prelude::*;
use wasmcloud_interface_httpserver::{HttpRequest, HttpResponse, HttpServer, HttpServerReceiver};

#[derive(Debug, Default, Actor, HealthResponder)]
#[services(Actor, HttpServer)]
struct HelloActor {}

#[async_trait]
impl HttpServer for HelloActor {
    async fn handle_request(
        &self,
        _ctx: &Context,
        req: &HttpRequest,
    ) -> std::result::Result<HttpResponse, RpcError> {
        let text=form_urlencoded::parse(req.query_string.as_bytes())
            .find(|(n, _)| n == "name")
            .map(|(_, v)| v.to_string())
            .unwrap_or_else(|| "World".to_string());
        Ok(HttpResponse {
            body: format!("Hello {}", text).as_bytes().to_vec(),
            ..Default::default()
        })
    }
}

wasmCloud supports contract-driven design and development (CDD) using wasmCloud interfaces based on the Smithy IDL for building microservices and composable systems. There is also pending work to support OpenFaaS with wasmCloud for invoking functions on capability providers with appropriate privileges.

The following example demonstrates a similar capability with Fermyon's spin_sdk, which can be deployed to Fermyon Cloud:

use anyhow::Result;
use spin_sdk::{
    http::{Request, Response},
    http_component,
};
#[http_component]
fn hello_rust(req: Request) -> Result<Response> {
    println!("{:?}", req.headers());
    Ok(http::Response::builder()
        .status(200)
        .header("foo", "bar")
        .body(Some("Hello, Fermyon".into()))?)
}

Following example shows how Dapr and WasmEdge work together to support lightweight WebAssembly-based microservices in a cloud-native environment:

fn main() -> std::io::Result<()> {
    let port = std::env::var("PORT").unwrap_or(9005.to_string());
    println!("new connection at {}", port);
    let listener = TcpListener::bind(format!("127.0.0.1:{}", port))?;
    loop {
        let _ = handle_client(listener.accept()?.0);
    }
}

fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
  ... ...
}

fn handle_http(req: Request<Vec<u8>>) -> bytecodec::Result<Response<String>> {
  ... ...
}

The WasmEdge can also be used with other serverless platforms such as Vercel, Netlify, AWS Lambda, SecondState and Tencent.

4.1.4 Service Composition with Orchestration and Choreography

As described above, actor-based microservices can be extended with orchestration patterns such as SAGA and choreography/event-driven patterns such as SEDA to build composable services. These design patterns can be used to build loosely coupled and extensible systems where additional actors and components can be added without changing existing code.

4.2 Deployment and Runtime

4.2.1 Virtual Machines and Serverless Platform

Following diagram shows the evolution of virtualized environments for hosting applications, services, serverless functions and actors:

[Figure: Evolution of hosting apps and services]

In this architecture, the microservices are compiled into wasm binary, instrumented and then deployed in a micro virtual machine.

4.2.2 Sidecar Proxy and Service Mesh

Though the service code will already be instrumented with support for error handling, metrics, alerts, monitoring, tracing, etc. before deployment, we can further enforce access policies, rate limiting, key management, etc. using sidecar proxy or service mesh patterns:

For example, WasmEdge can be integrated with the Dapr service mesh to add building blocks for state management, event bus, orchestration, observability, traffic routing, and bindings to external services. Similarly, wasmCloud can be extended with additional non-functional capabilities by implementing a capability provider. wasmCloud also provides a lattice, a self-healing mesh network that simplifies communication between actors and capability providers.

4.2.3 Cellular Deployment

As described above, a cell-based architecture (CBA) provides better scalability, modularity, composability and business continuity for building microservices. The following diagram shows how the above design with virtual machines, service mesh and WebAssembly can be extended to support a cell-based architecture:

In the above architecture, each cell deploys a set of related microservices for an application, persists state in a replicated data store and communicates with other cells via an event bus. In this model, separate cells are employed for data-plane services and for control-plane services used for configuration and administration purposes.

5. Conclusion

The transition from monolithic services towards a microservices architecture over the last many years has helped development teams be more agile in building modular and reusable code. However, as teams build a growing number of new microservices, they are also tasked with supporting non-functional requirements for each service such as high availability, capacity planning, scalability, performance, recovery, resilience, security, observability, etc. In addition, each microservice may depend on numerous other microservices, so a microservice becomes susceptible to scaling/availability limits or security breaches in any of its downstream services. Such tight coupling of horizontal concerns escalates the complexity, development effort and operational load for each development team, resulting in a longer time to market for new features and a larger risk of outages due to divergence in implementing non-functional concerns.

Serverless computing, function as a service (FaaS) and event-driven compute services have emerged to solve many of these problems, but they remain limited in the range of capabilities they offer and lack common standards across vendors. The advancements in micro virtual machines and containers have been a boon to serverless platforms such as appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube. In addition, the widespread adoption of WebAssembly, along with its component model and the WASI standard, is helping serverless platforms such as extism, faasm, netlify, vercel, wasmcloud and wasmedge build more modular and reusable components. These serverless platforms allow development teams to focus primarily on building business features and offload the non-functional concerns to the underlying platform. Many of these platforms also support service mesh and sidecar patterns so that they can bind platform and dependent services and automatically handle concerns such as throttling, security, state management, secrets, key management, etc.

Though cell-based architecture is still relatively new and only supported by more mature serverless and cloud platforms, it further raises the scalability, modularity, composability, business continuity and governance of microservices. As each cell is isolated, it adds agility to deploy code changes to a single cell and use canary tests to validate the changes before deploying them to all cells. Due to such isolated deployment, cell-based architecture reduces the blast radius if a bug is found during validation or other production issues are discovered. Finally, automating continuous deployment processes and applying infrastructure as code (IaC) can simplify local development and deployment so that developers use the same infrastructure setup for local testing as in the production environment. This means that services can be deployed to any environment consistently, which reduces manual configuration and subtle bugs due to misconfiguration.

In summary, development teams will greatly benefit from the architecture patterns, virtualized environments, WebAssembly and serverless platforms described above, so that application developers are not burdened with maintaining horizontal concerns and can instead focus on building core product features, which are the differentiating factors in competitive markets. These serverless and managed platforms not only boost developer productivity but also lower infrastructure cost, operational overhead, cruft development work and the time to market for new features.

October 18, 2022

Mocking and Fuzz Testing Distributed Micro Services with Record/Play, Templates and OpenAPI Specifications

Filed under: GO,REST,Technology — admin @ 11:36 am

Building large distributed systems often requires integrating with multiple distributed micro-services, which makes development particularly difficult as it's not always easy to deploy and test all dependent services in a local environment with constrained resources. In addition, you might be working on a large system with multiple teams where you have received new API specs from another team but the API changes are not implemented yet. Though you can use mocking frameworks based on API specs when writing unit tests, integration or functional testing requires access to the network service. A common solution that I have used in past projects is to configure a mock service that can simulate different API operations. I wrote a JVM-based mock service many years ago with the following use-cases:

Use-Cases

  • As a service owner, I need to mock remote dependent service(s) by capturing/recording request/responses through an HTTP proxy so that I can play it back when testing the remote service(s) without connecting with them.
  • As a service owner, I need to mock remote dependent service(s) based on Open API/Swagger specifications so that my service client can test all service behavior per the specifications for the remote service(s) even when the remote service is not fully implemented or accessible.
  • As a service owner, I need to mock remote dependent service(s) based on a mock scenario defined in a template so that my service client can test service behavior per expected request/response in the template even when remote service is not fully implemented or accessible.
  • As a service owner, I need to inject various response behavior and faults to the output of a remote service so that I can build a robust client that prevents cascading failures and is more resilient to unexpected faults.
  • As a service owner, I need to define test cases with faulty or fuzz responses to test my own service so that I can predict how it will behave with various input data and assert the service response based on expected behavior.

Features

The API mock service supports REST/HTTP-based services with the following features:

  • Record API request/response by working as a HTTP proxy server (native http/https or via API) between client and remote service.
  • Playback API responses that were previously recorded, based on request parameters.
  • Define API behavior manually by specifying request parameters and response contents using static data or dynamic data based on GO templating language.
  • Generate API behavior from open standards such as Open API/Swagger and automatically create constraints and regex based on the specification.
  • Customize API behavior using a GO template language so that users can generate dynamic contents based on input parameters or other configuration.
  • Generate large responses using the template language with dynamic loops so that you can test performance of your system.
  • Define multiple test scenarios for the API based on different input parameters or simulating various error cases that are difficult to reproduce with real services.
  • Store API request/responses locally as files so that it’s easy to port stubbed request/responses to any machine.
  • Allow users to define API request/response with various formats such as XML/JSON/YAML and upload them to the mock service.
  • Support test fixtures that can be uploaded to the mock service and can be used to generate mock responses.
  • Define a collection of helper methods to generate different kind of random data such as UDID, dates, URI, Regex, text and numeric data.
  • Ability to playback all test scenarios or a specific scenario and change API behavior dynamically with different input parameters.
  • Support multiple mock scenarios for the same API that can be selected either using round-robin order, custom predicates based on parameters or based on scenario name.
  • Inject error conditions and artificial delays so that you can test how your system handles error conditions that are difficult to reproduce or use for game days/chaos testing.
  • Generate client requests for a remote API for chaos and stochastic testing where a set of requests are sent with a dynamic data generated based on regex or other constraints.

I used this service in many past projects; however, I felt it needed a fresh approach to meet the above goals, so I rewrote it in GO, which has robust support for writing network services. You can download the new version from https://github.com/bhatti/api-mock-service. As it's written in GO, you can either install the GO runtime environment or use Docker to run it locally. If you haven't installed Docker, you can download the community version from https://docs.docker.com/engine/installation/ or find an installer for your OS at https://docs.docker.com/get-docker/.

docker build -t api-mock-service .
docker run -p 8000:8080 -p 8081:8081 -e HTTP_PORT=8080 -e PROXY_PORT=8081 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets api-mock-service

or pull an image from docker hub (https://hub.docker.com/r/plexobject/api-mock-service), e.g.

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8080 -p 8081:8081 -e HTTP_PORT=8080 -e PROXY_PORT=8081 -e DATA_DIR=/tmp/mocks \
	-e ASSET_DIR=/tmp/assets plexobject/api-mock-service:latest

Alternatively, you can run it locally with GO environment, e.g.,

make && ./out/bin/api-mock-service

For full command line options, execute api-mock-service -h, which will show command line options such as:

./out/bin/api-mock-service -h
Starts mock service

Usage:
  api-mock-service [flags]
  api-mock-service [command]

Available Commands:
  chaos       Executes chaos client
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  version     Version will output the current build information

Flags:
      --assetDir string   asset dir to store static assets/fixtures
      --config string     config file
      --dataDir string    data dir to store mock scenarios
  -h, --help              help for api-mock-service
      --httpPort int      HTTP port to listen
      --proxyPort int     Proxy port to listen

Recording a Mock Scenario via HTTP/HTTPS Proxy

Once you have the API mock service running, it will listen on two ports: the first port (default 8080) is used to record/play mock scenarios, update templates or upload OpenAPI specifications. The second port (default 8081) sets up an HTTP/HTTPS proxy server that you can point your client to in order to record scenarios, e.g.

export http_proxy="http://localhost:8081"
export https_proxy="http://localhost:8081"

curl -k -v -H "Authorization: Bearer sk_test_xxxx" \
	https://api.stripe.com/v1/customers/cus_xxx/cash_balance

The above curl command will automatically record all requests and responses and create a mock scenario to play back. For example, if you call the same API again, it will return the local response instead of contacting the server. You can customize the recording behavior of the proxy by adding an X-Mock-Record: true header to your request.
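
If your client or test code is written in GO, you can also point the HTTP client at the recording proxy instead of exporting the environment variables; a sketch mirroring the curl example above, where the Stripe URL and token are placeholders:

package main

import (
    "crypto/tls"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Route requests through the mock service's proxy port (default 8081) so they are recorded.
    proxyURL, err := url.Parse("http://localhost:8081")
    if err != nil {
        panic(err)
    }
    client := &http.Client{
        Transport: &http.Transport{
            Proxy: http.ProxyURL(proxyURL),
            // Equivalent to curl's -k flag, since the proxy intercepts TLS traffic.
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        },
    }

    req, _ := http.NewRequest("GET", "https://api.stripe.com/v1/customers/cus_xxx/cash_balance", nil)
    req.Header.Set("Authorization", "Bearer sk_test_xxxx")
    req.Header.Set("X-Mock-Record", "true") // explicitly ask the proxy to record this call

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}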

Recording a Mock Scenario via API

Alternatively, you can invoke an internal API as a pass-through to a remote API so that you can automatically record the API behavior and play it back later, e.g.

% curl -H "X-Mock-Url: https://api.stripe.com/v1/customers/cus_**/cash_balance" \
	-H "Authorization: Bearer sk_test_***" http://localhost:8080/_proxy

In the above example, the curl command passes the URL of the real service as the X-Mock-Url HTTP header. In addition, you can pass authorization and other headers as needed.

Viewing the Recorded Mock Scenario

The API mock-service will store the request/response in a YAML file under a data directory that you can specify. For example, you may see a file under:

default_mocks_data/v1/customers/cus_***/cash_balance/GET/recorded-scenario-***.scr

Note: the sensitive authentication and customer keys are masked in the above example, but you will see contents like the following in the captured data file:

method: GET
name: recorded-v1-customers-cus
path: /v1/customers/cus_**/cash_balance
description: recorded at 2022-10-29 04:26:17.24776 +0000 UTC
request:
     match_query_params: {}
     match_headers: {}
     match_content_type: ""
     match_contents: ""
     example_path_params: {}
     example_query_params: {}
     example_headers:
         Accept: '*/*'
         Authorization: Bearer sk_test_xxx
         User-Agent: curl/7.65.2
         X-Mock-Url: https://api.stripe.com/v1/customers/cus_/cash_balance
     example_contents: ""
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Access-Control-Allow-Methods:
            - GET, POST, HEAD, OPTIONS, DELETE
        Access-Control-Allow-Origin:
            - '*'
        Access-Control-Expose-Headers:
            - Request-Id, Stripe-Manage-Version, X-Stripe-External-Auth-Required, X-Stripe-Privileged-Session-Required
        Access-Control-Max-Age:
            - "300"
        Cache-Control:
            - no-cache, no-store
        Content-Length:
            - "168"
        Content-Type:
            - application/json
        Date:
            - Sat, 29 Oct 2022 04:26:17 GMT
        Request-Id:
            - req_lOP4bCsPIi5hQC
        Server:
            - nginx
        Strict-Transport-Security:
            - max-age=63072000; includeSubDomains; preload
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: |-
        {
          "object": "cash_balance",
          "available": null,
          "customer": "cus_",
          "livemode": false,
          "settings": {
            "reconciliation_mode": "automatic"
          }
        }
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

The above example defines a mock scenario for testing the /v1/customers/cus_**/cash_balance path. A test scenario includes:

Predicate

  • This is a boolean condition if you need to enable or disable a scenario test based on dynamic parameters or request count.

Group

  • This specifies the group for related test scenarios.

Request Matching Parameters:

The matching request parameters are used to select the mock scenario to execute, and you can use regular expressions to validate:

  • URL Query Parameters
  • URL Request Headers
  • Request Body

You can use these parameters so that test scenario is executed only when the parameters match, e.g.

    match_query_params:
      name: [a-z0-9]{1,50}
    match_headers:
      Content-Type: "application/json"

The matching request parameters are used to select the mock scenario to execute, and you can use regular expressions for validation; e.g., the above example will match if the content type is application/json, and it will validate that the name query parameter is alphanumeric with a length of 1-50.

Example Request Parameters:

The example request parameters show the contents captured from record/playback so that you can use and customize them to define matching parameters:

  • URL Query Parameters
  • URL Request Headers
  • Request Body

Response Properties

The response properties will include:

  • Response Headers
  • Response Body statically defined or loaded from a test fixture
  • Response can also be loaded from a test fixture file
  • Status Code
  • Matching header and contents
  • Assertions

You can copy a recorded scenario to another folder, customize it using templates and then upload it for playback.

The matching headers and contents use match_headers and match_contents, similar to the request, to validate the response in case you want to test the response from a real service for chaos testing. Similarly, assertions define a set of predicates to test against the response from a real service:

    assertions:
        - VariableContains contents.id 10
        - VariableContains contents.title illo
        - VariableContains headers.Pragma no-cache 

The above example will check the API response and verify that the id property contains 10, the title contains illo and the response headers include a Pragma: no-cache header.

Playback the Mock API Scenario

You can play back the recorded response from the above example as follows:

% curl http://localhost:8080/v1/customers/cus_***/cash_balance

This will return the captured response, such as:

{
  "object": "cash_balance",
  "available": null,
  "customer": "cus_***",
  "livemode": false,
  "settings": {
    "reconciliation_mode": "automatic"
  }
}%

Besides customizing your template with dynamic properties or conditional logic, you can also send the X-Mock-Response-Status HTTP header to override the HTTP status to return, or X-Mock-Wait-Before-Reply to add artificial latency using duration syntax.

Debug Headers from Playback

The playback response will include mock headers to indicate the selected mock scenario, path and request count, e.g.

X-Mock-Path: /v1/jobs/{jobId}/state
X-Mock-Request-Count: 13
X-Mock-Scenario: setDefaultState-bfb86eb288c9abf2988822938ef6d4aa3bd654a15e77158b89f17b9319d6f4e4

Upload Mock API Scenario

You can customize the above recorded scenario, e.g. you can add path variables to the API as follows:

method: GET
name: stripe-cash-balance
path: /v1/customers/:customer/cash_balance
request:
    match_headers:
        Authorization: Bearer sk_test_[0-9a-fA-F]{10}$
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Access-Control-Allow-Methods:
            - GET, POST, HEAD, OPTIONS, DELETE
        Access-Control-Allow-Origin:
            - '*'
        Access-Control-Expose-Headers:
            - Request-Id, Stripe-Manage-Version, X-Stripe-External-Auth-Required, X-Stripe-Privileged-Session-Required
        Access-Control-Max-Age:
            - "300"
        Cache-Control:
            - no-cache, no-store
        Content-Type:
            - application/json
        Request-Id:
            - req_2
        Server:
            - nginx
        Strict-Transport-Security:
            - max-age=63072000; includeSubDomains; preload
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: |-
        {
          "object": "cash_balance",
          "available": null,
          "customer": {{.customer}}
          "livemode": false,
          "page": {{.page}}
          "pageSize": {{.pageSize}}
          "settings": {
            "reconciliation_mode": "automatic"
          }
        }
    status_code: 200
wait_before_reply: 1s

In the above example, I assigned the name stripe-cash-balance to the mock scenario and changed the API path to /v1/customers/:customer/cash_balance so that it can capture the customer id as a path variable. I added a regular expression to ensure that the HTTP request includes an Authorization header matching Bearer sk_test_[0-9a-fA-F]{10}$ and defined dynamic properties such as {{.customer}}, {{.page}} and {{.pageSize}} so that they are replaced at runtime.
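
The {{...}} placeholders are standard GO text/template expressions; as a rough illustration of what happens at playback time (this is not the library's actual code, and the parameter map is assumed), the contents template is rendered with the captured path and query parameters:

package main

import (
    "os"
    "text/template"
)

func main() {
    // A trimmed-down version of the contents template from the scenario above.
    const contents = `{"object": "cash_balance", "customer": "{{.customer}}", "page": {{.page}}, "pageSize": {{.pageSize}}}`

    tmpl := template.Must(template.New("contents").Parse(contents))

    // Path and query parameters captured from the request are passed as template data.
    params := map[string]any{"customer": "123", "page": 2, "pageSize": 55}
    if err := tmpl.Execute(os.Stdout, params); err != nil {
        panic(err)
    }
}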

The mock scenario uses GO's built-in template syntax. You can then upload it as follows:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/stripe-customer.yaml \
	http://localhost:8080/_scenarios

and then play it back as follows:

curl -v -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

and it will generate:

< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Mock-Request-Count: 1
< X-Mock-Scenario: stripe-cash-balance
< Request-Id: req_2
< Server: nginx
< Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
< Stripe-Version: 2018-09-06
< Date: Sat, 29 Oct 2022 17:29:12 GMT
< Content-Length: 179
<
{
  "object": "cash_balance",
  "available": null,
  "customer": 123
  "livemode": false,
  "page": 2
  "pageSize": 55
  "settings": {
    "reconciliation_mode": "automatic"
  }

As you can see, the values of customer, page and pageSize are dynamically updated and the response headers include the name of the mock scenario along with the request count. You can upload multiple mock scenarios for the same API and the mock API service will play them back sequentially. For example, you can upload another scenario for the above API as follows:

method: GET
name: stripe-customer-failure
path: /v1/customers/:customer/cash_balance
request:
    match_headers:
        Authorization: Bearer sk_test_[0-9a-fA-F]{10}$
response:
    headers:
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: My custom error
    status_code: 500
wait_before_reply: 1s

And then play it back as before:

curl -v -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

which will return the following error response:

> GET /v1/customers/123/cash_balance?page=2&pageSize=55 HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.65.2
> Accept: */*
> Authorization: Bearer sk_test_0123456789
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< Mock-Request-Count: 1
< X-Mock-Scenario: stripe-customer-failure
< Stripe-Version: 2018-09-06
< Vary: Origin
< Date: Sat, 29 Oct 2022 17:29:15 GMT
< Content-Length: 15

Dynamic Templates with Mock API Scenarios

You can use the loop and conditional primitives of the template language along with custom functions provided by the API mock library to generate dynamic responses as follows:

method: GET
name: get_devices
path: /devices
description: ""
request:
    match_content_type: "application/json; charset=utf-8"
response:
    headers:
        "Server":
            - "SampleAPI"
        "Connection":
            - "keep-alive"
    content_type: application/json
    contents: >
     {
     "Devices": [
{{- range $val := Iterate .pageSize }}
      {
        "Udid": "{{SeededUdid $val}}",
        "Line": { {{SeededFileLine "lines.txt" $val}}, "Type": "Public", "IsManaged": false },
        "Amount": {{JSONFileProperty "props.yaml" "amount"}},        
        "SerialNumber": "{{Udid}}",
        "MacAddress": "{{Udid}}",
        "Imei": "{{Udid}}",
        "AssetNumber": "{{RandString 20}}",
        "LocationGroupId": {
         "Id": {
           "Value": {{RandNumMax 1000}},
         },
         "Name": "{{SeededCity $val}}",
         "Udid": "{{Udid}}"
        },
        "DeviceFriendlyName": "Device for {{SeededName $val}}",
        "LastSeen": "{{Time}}",
        "Email": "{{RandEmail}}",
        "Phone": "{{RandPhone}}",        
        "EnrollmentStatus": {{SeededBool $val}}
        "ComplianceStatus": {{RandRegex "^AC[0-9a-fA-F]{32}$"}}
        "Group": {{RandCity}},
        "Date": {{TimeFormat "3:04PM"}},
        "BatteryLevel": "{{RandNumMax 100}}%",
        "StrEnum": {{EnumString "ONE TWO THREE"}},
        "IntEnum": {{EnumInt 10 20 30}},
        "ProcessorArchitecture": {{RandNumMax 1000}},
        "TotalPhysicalMemory": {{RandNumMax 1000000}},
        "VirtualMemory": {{RandNumMax 1000000}},
        "AvailablePhysicalMemory": {{RandNumMax 1000000}},
        "CompromisedStatus": {{RandBool}},
        "Add": {{Add 2 1}},
      }{{if LastIter $val $.PageSize}}{{else}},  {{end}}
{{ end }}
     ],
     "Page": {{.page}},
     "PageSize": {{.pageSize}},
     "Total": {{.pageSize}}
     }
    {{if NthRequest 10 }}
    status_code: {{EnumInt 500 501}}
    {{else}}
    status_code: {{EnumInt 200 400}}
    {{end}}
wait_before_reply: {{.page}}s

Above example includes a number of template primitives and custom functions to generate dynamic contents such as:

Loops

GO templates support loops that can be used to generate multiple data entries in the response, e.g.

{{- range $val := Iterate .pageSize }}

Builtin functions

GO template supports custom functions that you can add to your templates. The mock service includes a number of helper functions to generate random data such as:

Add numbers

  "Num": "{{Add 1 2}}",

Date/Time

  "LastSeen": "{{Time}}",
  "Date": {{Date}},
  "DateFormatted": {{TimeFormat "3:04PM"}},
  "LastSeen": "{{Time}}",

Comparison

  {{if EQ .MyVariable 10 }}
  {{if GE .MyVariable 10 }}
  {{if GT .MyVariable 10 }}
  {{if LE .MyVariable 10 }}
  {{if LT .MyVariable 10 }}
  {{if Nth .MyVariable 10 }}

Enums

  "StrEnum": {{EnumString "ONE TWO THREE"}},
  "IntEnum": {{EnumInt 10 20 30}},

Random Data

  "SerialNumber": "{{Udid}}",
  "AssetNumber": "{{RandString 20}}",
  "LastSeen": "{{Time}}",
  "Host": "{{RandHost}}",
  "Email": "{{RandEmail}}",
  "Phone": "{{RandPhone}}",
  "URL": "{{RandURL}}",
  "EnrollmentStatus": {{SeededBool $val}}
  "ComplianceStatus": {{RandRegex "^AC[0-9a-fA-F]{32}$"}}
  "City": {{RandCity}},
  "Country": {{RandCountry}},
  "CountryCode": {{RandCountryCode}},
  "Completed": {{RandBool}},
  "Date": {{TimeFormat "3:04PM"}},
  "BatteryLevel": "{{RandNumMax 100}}%",
  "Object": "{{RandDict}}",
  "IntHistory": {{RandIntArrayMinMax 1 10}},
  "StringHistory": {{RandStringArrayMinMax 1 10}},
  "FirstName": "{{SeededName 1 10}}",
  "LastName": "{{RandName}}",
  "Score": "{{RandNumMinMax 1 100}}",
  "Paragraph": "{{RandParagraph 1 10}}",
  "Word": "{{RandWord 1 1}}",
  "Sentence": "{{RandSentence 1 10}}",
  "Colony": "{{RandString}}",

Request count and Conditional Logic

{{if NthRequest 10 }}   -- for every 10th request
{{if GERequest 10 }}    -- if number of requests made to API so far are >= 10
{{if LTRequest 10 }}    -- if number of requests made to API so far are < 10

The template syntax allows you to define a conditional logic such as:

    {{if NthRequest 10 }}
    status_code: {{AnyInt 500 501}}
    {{else}}
    status_code: {{AnyInt 200 400}}
    {{end}}

In the above example, the mock API will return HTTP status 500 or 501 for every 10th request, and 200 or 400 for other requests. You can use the conditional syntax to simulate different error statuses or customize the response.

Loops

  {{- range $val := Iterate 10}}

     {{if LastIter $val 10}}{{else}},{{end}}
  {{ end }}

Variables

     {{if VariableContains "contents" "blah"}}
     {{if VariableEquals "contents" "blah"}}
     {{if VariableSizeEQ "contents" "blah"}}
     {{if VariableSizeGE "contents" "blah"}}
     {{if VariableSizeLE "contents" "blah"}}

Test fixtures

The mock service allows you to upload a test fixture that you can refer to in your template, e.g.

  "Line": { {{SeededFileLine "lines.txt" $val}}, "Type": "Public", "IsManaged": false },

The above example loads a random line from the lines.txt fixture. As you may need to generate deterministic random data in some cases, you can use the Seeded functions to generate predictable data so that the service returns the same data (see the sketch after the sample output below). The following example reads a property from a fixture file:

  "Amount": {{JSONFileProperty "props.yaml" "amount"}},

This template file will generate content as follows:

{ "Devices": [
 {
   "Udid": "fe49b338-4593-43c9-b1e9-67581d000000",
   "Line": { "ApplicationName": "Chase", "Version": "3.80", "ApplicationIdentifier": "com.chase.sig.android", "Type": "Public", "IsManaged": false },
   "Amount": {"currency":"$","value":100},
   "SerialNumber": "47c2d7c3-c930-4194-b560-f7b89b33bc2a",
   "MacAddress": "1e015eac-68d2-42ee-9e8f-73fb80958019",
   "Imei": "5f8cae1b-c5e3-4234-a238-1c38d296f73a",
   "AssetNumber": "9z0CZSA03ZbUNiQw2aiF",
   "LocationGroupId": {
    "Id": {
      "Value": 980
    },
    "Name": "Houston",
    "Udid": "3bde6570-c0d4-488f-8407-10f35902cd99"
   },
   "DeviceFriendlyName": "Device for Alexander",
   "LastSeen": "2022-10-29T11:25:25-07:00",
   "Email": "john.smith@abc.com",
   "Phone": "1-408-454-1507",
   "EnrollmentStatus": true,
   "ComplianceStatus": "ACa3E07B0F2cA00d0fbFe88f5c6DbC6a9e",
   "Group": "Chicago",
   "Date": "11:25AM",
   "BatteryLevel": "43%",
   "StrEnum": "ONE",
   "IntEnum": 20,
   "ProcessorArchitecture": 243,
   "TotalPhysicalMemory": 320177,
   "VirtualMemory": 768345,
   "AvailablePhysicalMemory": 596326,
   "CompromisedStatus": false,
   "Add": 3
 },
...
 ], "Page": 2, "PageSize": 55, "Total": 55 }  

Artificial Delays

You can specify artificial delay for the API request as follows:

wait_before_reply: {{.page}}s

The above example shows a delay based on the page number, but you can use any parameter to customize this behavior.


Playing back a specific mock scenario

You can pass an X-Mock-Scenario header to specify the name of the scenario if you have multiple scenarios for the same API, e.g.

curl -v -H "X-Mock-Scenario: stripe-cash-balance" -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

You can also customize the response status by setting the X-Mock-Response-Status request header, and the delay before the reply by setting the X-Mock-Wait-Before-Reply header.

Using Test Fixtures

You can define test data in your test fixtures and then upload them as follows:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/lines.txt \
	http://localhost:8080/_fixtures/GET/lines.txt/devices

curl -v -H "Content-Type: application/yaml" --data-binary @fixtures/props.yaml \
    http://localhost:8080/_fixtures/GET/props.yaml/devices

In the above example, the test fixtures for lines.txt and props.yaml are uploaded and become available to all GET requests under the /devices URL path. You can then refer to these fixtures in your templates. You can also use this mechanism to serve binary files, e.g. you can define an image template file as follows:

method: GET
name: test-image
path: /images/mock_image
description: ""
request:
response:
    headers:
      "Last-Modified":
        - {{Time}}
      "ETag":
        - {{RandString 10}}
      "Cache-Control":
        - max-age={{RandNumMinMax 1000 5000}}
    content_type: image/png
    contents_file: mockup.png
    status_code: 200

Then upload a binary image using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/mockup.png \
	http://localhost:8080/_fixtures/GET/mockup.png/images/mock_image

And then serve the image using:

curl -v "http://localhost:8080/images/mock_image"

Custom Functions

The API mock service defines following custom functions that can be used to generate test data:

Numeric Random Data

Following functions can be used to generate numeric data within a range or with a seed to always generate deterministic test data:

  • Random
  • SeededRandom
  • RandNumMinMax
  • RandIntArrayMinMax

Text Random Data

Following functions can be used to generate text data within a range or with a seed to always generate deterministic test data:

  • RandStringMinMax
  • RandStringArrayMinMax
  • RandRegex
  • RandEmail
  • RandPhone
  • RandDict
  • RandCity
  • RandName
  • RandParagraph
  • RandPhone
  • RandSentence
  • RandString
  • RandStringMinMax
  • RandWord

Email/Host/URL

  • RandURL
  • RandEmail
  • RandHost

Boolean

Following functions can be used to generate boolean data:

  • RandBool
  • SeededBool

UDID

Following functions can be used to generate UDIDs:

  • Udid
  • SeededUdid

String Enums

Following functions can be used to generate a string from a set of Enum values:

  • EnumString

Integer Enums

Following functions can be used to generate an integer from a set of Enum values:

  • EnumInt

Random Names

Following functions can be used to generate random names:

  • RandName
  • SeededName

City Names

Following functions can be used to generate random city names:

  • RandCity
  • SeededCity

Country Names or Codes

Following functions can be used to generate random country names or codes:

  • RandCountry
  • SeededCountry
  • RandCountryCode
  • SeededCountryCode

File Fixture

Following functions can be used to generate random data from a fixture file:

  • RandFileLine
  • SeededFileLine
  • FileProperty
  • JSONFileProperty
  • YAMLFileProperty

Generate Mock API Behavior from OpenAPI or Swagger Specifications

If you are using Open API or Swagger for API specifications, you can simply upload a YAML based API specification. For example, here is a sample Open API specification from Twilio:

openapi: 3.0.1
paths:
  /v1/AuthTokens/Promote:
    servers:
    - url: https://accounts.twilio.com
    description: Auth Token promotion
    x-twilio:
      defaultOutputProperties:
      - account_sid
      - auth_token
      - date_created
      pathType: instance
      mountName: auth_token_promotion
    post:
      description: Promote the secondary Auth Token to primary. After promoting the
        new token, all requests to Twilio using your old primary Auth Token will result
        in an error.
      responses:
        '200':
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/accounts.v1.auth_token_promotion'
          description: OK
      security:

...


   schemas:
     accounts.v1.auth_token_promotion:
       type: object
       properties:
         account_sid:
           type: string
           minLength: 34
           maxLength: 34
           pattern: ^AC[0-9a-fA-F]{32}$
           nullable: true
           description: The SID of the Account that the secondary Auth Token was created
             for
         auth_token:
           type: string
           nullable: true
           description: The promoted Auth Token
         date_created:
           type: string
           format: date-time
           nullable: true
           description: The ISO 8601 formatted date and time in UTC when the resource
             was created
         date_updated:
           type: string
           format: date-time
           nullable: true
           description: The ISO 8601 formatted date and time in UTC when the resource
             was last updated
         url:
           type: string
           format: uri
           nullable: true
           description: The URI for this resource, relative to `https://accounts.twilio.com`
...           

You can then upload the API specification as:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/oapi/twilio_accounts_v1.yaml \
		http://localhost:8080/_oapi

It will generate mock scenarios for each API based on mime-type, status-code, parameter formats, regex and data ranges, e.g.,

name: UpdateAuthTokenPromotion-xx
path: /v1/AuthTokens/Promote
description: Promote the secondary Auth Token to primary. After promoting the new token, all requests to Twilio using your old primary Auth Token will result in an error.
request:
    match_query_params: {}
    match_headers: {}
    match_content_type: ""
    match_contents: ""
    example_path_params: {}
    example_query_params: {}
    example_headers: {}
    example_contents: ""
response:
    headers: {}
    content_type: application/json
    contents: '{"account_sid":"{{RandRegex `^AC[0-9a-fA-F]{32}$`}}",\
    "auth_token":"{{RandStringMinMax 0 0}}","date_created":"{{Time}}",\
    "date_updated":"{{Time}}","url":"https://{{RandName}}.com"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

In the above example, the account_sid uses a regex to generate the data and the url uses the URI format to generate a URL. Then invoke the mock API as:

curl -v -X POST http://localhost:8080/v1/AuthTokens/Promote

Which will generate dynamic response as follows:

{
  "account_sid": "ACF3A7ea7f5c90f6482CEcA77BED07Fb91",
  "auth_token": "PaC7rKdGER73rXNi6rVKZMN1Jw0QYxPFeEkqyvnM7Ojw2nziOER7SMWkIV6N2hXYTKxAfDMfS9t0",
  "date_created": "2022-10-29T11:54:46-07:00",
  "date_updated": "2022-10-29T11:54:46-07:00",
  "url": "https://Billy.com"
}

Listing all Mock Scenarios

You can list all available mock APIs using:

curl -v http://localhost:8080/_scenarios

Which will return summary of APIs such as:

{
  "/_scenarios/GET/FetchCredentialAws-8b2fcf02dfb7dc190fb735a469e1bbaa3ccb5fd1a24726976d110374b13403c6/v1/Credentials/AWS/{Sid}": {
    "method": "GET",
    "name": "FetchCredentialAws-8b2fcf02dfb7dc190fb735a469e1bbaa3ccb5fd1a24726976d110374b13403c6",
    "path": "/v1/Credentials/AWS/{Sid}",
    "match_query_params": {},
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
  "/_scenarios/GET/FetchCredentialPublicKey-60a01dcea5290e6d429ce604c7acf5bd59606045fc32c0bc835e57ac2b1b8eb6/v1/Credentials/PublicKeys/{Sid}": {
    "method": "GET",
    "name": "FetchCredentialPublicKey-60a01dcea5290e6d429ce604c7acf5bd59606045fc32c0bc835e57ac2b1b8eb6",
    "path": "/v1/Credentials/PublicKeys/{Sid}",
    "match_query_params": {},
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
  "/_scenarios/GET/ListCredentialAws-28717701f05de4374a09ec002066d308043e73e30f25fec2dcd4c3d3c001d300/v1/Credentials/AWS": {
    "method": "GET",
    "name": "ListCredentialAws-28717701f05de4374a09ec002066d308043e73e30f25fec2dcd4c3d3c001d300",
    "path": "/v1/Credentials/AWS",
    "match_query_params": {
      "PageSize": "\\d+"
    },
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
...  

Chaos Testing

In addition to serving mock responses, you can also use a built-in chaos client for stochastic testing of remote services, which generates random data based on regex patterns or API specifications. For example, you may capture a test scenario for a remote API using the HTTP proxy such as:

export http_proxy="http://localhost:8081"
export https_proxy="http://localhost:8081"
curl -k https://jsonplaceholder.typicode.com/todos

This will capture a mock scenario such as:

method: GET
name: recorded-todos-ff9a8e133347f7f05273f15394f722a9bcc68bb0e734af05ba3dd98a6f2248d1
path: /todos
description: recorded at 2022-12-12 02:23:42.845176 +0000 UTC for https://jsonplaceholder.typicode.com:443/todos
group: todos
predicate: ""
request:
    match_query_params: {}
    match_headers:
        Content-Type: ""
    match_contents: '{}'
    example_path_params: {}
    example_query_params: {}
    example_headers:
        Accept: '*/*'
        User-Agent: curl/7.65.2
    example_contents: ""
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Age:
            - "19075"
        Alt-Svc:
            - h3=":443"; ma=86400, h3-29=":443"; ma=86400
        Cache-Control:
            - max-age=43200
        Cf-Cache-Status:
            - HIT
        Cf-Ray:
            - 7782ffe4bd6bc62c-SEA
        Connection:
            - keep-alive
        Content-Type:
            - application/json; charset=utf-8
        Date:
            - Mon, 12 Dec 2022 02:23:42 GMT
        Etag:
            - W/"5ef7-4Ad6/n39KWY9q6Ykm/ULNQ2F5IM"
        Expires:
            - "-1"
        Nel:
            - '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}'
        Pragma:
            - no-cache
    contents: |-
      [
        {
          "userId": 1,
          "id": 1,
          "title": "delectus aut autem",
          "completed": false
        },
        {
          "userId": 1,
          "id": 2,
          "title": "quis ut nam facilis et officia qui",
          "completed": false
        },
      ...
        ]
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"completed":"__string__.+","id":"(__number__[+-]?[0-9]{1,10})","title":"(__string__\\w+)","userId":"(__number__[+-]?[0-9]{1,10})"}'
    assertions: []

You can then customize this scenario with additional assertions, and you may remove all response contents as they won’t be used. Note that the above scenario is defined with the group todos. You can then submit a request for chaos testing as follows:

curl -k -v -X POST http://localhost:8080/_chaos/todos -d '{"base_url": "https://jsonplaceholder.typicode.com", "execution_times": 10}'

The above request will submit 10 requests to the todo server with random data and return a response such as:

{"errors":null,"failed":0,"succeeded":10}

If you have locally captured data, you can also run the chaos client from the command line without running the mock server, e.g.:

go run main.go chaos --base_url https://jsonplaceholder.typicode.com --group todos --times 10

Static Assets

The mock service can also serve static assets from a user-defined folder, e.g.:

cp static-file default_assets

# execute the API mock server
make && ./out/bin/api-mock-service

# access assets
curl http://localhost:8080/_assets/default_assets

API Reference

The API specification for the mock library defines details for managing mock scenarios and customizing the mocking behavior.

Summary

Building and testing distributed systems often requires deploying a deep stack of dependent services, which makes development hard in a local environment with limited resources. Ideally, you should be able to deploy and test the entire stack without using the network or requiring remote access so that you can spend more time on building features instead of configuring your local environment. The above examples show how you can use the https://github.com/bhatti/api-mock-service library to mock APIs for testing purposes and define test scenarios for simulating both happy and error cases, as well as injecting faults or network delays into your testing processes so that you can test for fault tolerance. This mock library can be used to define the API mock behavior using record/play, a template language or API specification standards. I have found tools like this to be of great use when developing microservices and hopefully you find it useful too. Feel free to connect with your feedback or suggestions.

October 10, 2022

Implementing Distributed Locks (Mutex and Semaphore) with Databases

Filed under: Concurrency,Rust,Uncategorized — admin @ 10:55 pm

Overview

I recently needed a way to control access to shared resources in a distributed system for concurrency control. You may use Consul or Zookeeper (or low-level Raft / Paxos implementations if you are brave) for managing distributed locks, but I wanted to reuse an existing database without adding another dependency to my tech stack. Most databases support transactions or conditional updates with varying degrees of transactional guarantees, but they can’t be used for distributed locks if the business logic you need to protect resides outside the database. I found a lock client library for AWS databases but it didn’t support semaphores. The library implementation was also tightly coupled with the concerns of lock management and database access, so it wasn’t easy to extend. For example, the following diagram shows the cyclic dependencies in the code:

class diagram

Due to the above deficiencies in existing solutions, I decided to implement my own distributed locks library in Rust with the following capabilities:

  • Allow creating lease based locks that can either protect a single shared resource with a Mutex lock or protect a finite set of shared resources with a Semaphore lock.
  • Allow renewing leases based on periodic intervals so that stale locks can be acquired by other users.
  • Allow releasing Mutex and semaphore locks explicitly after the user performs a critical action.
  • CRUD APIs to manage Mutex and Semaphore entities in the database.
  • Multi-tenancy support for different clients in case the database is shared by multiple users.
  • Fair locks to support first-come, first-serve access when acquiring the same lock concurrently.
  • Scalable solution for supporting tens of thousands of concurrent mutexes and semaphores.
  • Support for multiple data stores: relational databases such as MySQL, PostgreSQL and SQLite, as well as NoSQL/cache data stores such as AWS DynamoDB and Redis.

High-level Design

I chose Rust to build the library for managing distributed locks due to strict performance and correctness requirements. The following diagram shows the high-level components in the new library:

LockManager Interface

The client interacts with the LockManager, which defines the following operations to acquire, release and renew lock leases and to manage the lifecycle of Mutexes and Semaphores:

#[async_trait]
pub trait LockManager {
    // Attempts to acquire a lock until it either acquires the lock, or a specified additional_time_to_wait_for_lock_ms is
    // reached. This method will poll database based on the refresh_period. If it does not see the lock in database, it
    // will immediately return the lock to the caller. If it does see the lock, it will note the lease expiration on the lock. If
    // the lock is deemed stale, (that is, there is no heartbeat on it for at least the length of its lease duration) then this
    // will acquire and return it. Otherwise, if it waits for as long as additional_time_to_wait_for_lock_ms without acquiring the
    // lock, then it will return LockError::NotGranted.
    //
    async fn acquire_lock(&self, opts: &AcquireLockOptions) -> LockResult<MutexLock>;

    // Releases the given lock if the current user still has it, returning true if the lock was
    // successfully released, and false if someone else already stole the lock. Deletes the
    // lock item if it is released and delete_lock_item_on_close is set.
    async fn release_lock(&self, opts: &ReleaseLockOptions) -> LockResult<bool>;

    // Sends a heartbeat to indicate that the given lock is still being worked on.
    // This method will also set the lease duration of the lock to the given value.
    // This will also either update or delete the data from the lock, as specified in the options
    async fn send_heartbeat(&self, opts: &SendHeartbeatOptions) -> LockResult<MutexLock>;

    // Creates mutex if doesn't exist
    async fn create_mutex(&self, mutex: &MutexLock) -> LockResult<usize>;

    // Deletes mutex lock if not locked
    async fn delete_mutex(&self,
                          other_key: &str,
                          other_version: &str,
                          other_semaphore_key: Option<String>) -> LockResult<usize>;

    // Finds out who owns the given lock, but does not acquire the lock. It returns the metadata currently associated with the
    // given lock. If the client currently has the lock, it will return the lock, and operations such as release_lock will work.
    // However, if the client does not have the lock, then operations like releaseLock will not work (after calling get_lock, the
    // caller should check mutex.expired() to figure out if it currently has the lock.)
    async fn get_mutex(&self, mutex_key: &str) -> LockResult<MutexLock>;

    // Creates or updates semaphore with given max size
    async fn create_semaphore(&self, semaphore: &Semaphore) -> LockResult<usize>;

    // Returns semaphore for the key
    async fn get_semaphore(&self, semaphore_key: &str) -> LockResult<Semaphore>;

    // find locks by semaphore
    async fn get_semaphore_mutexes(&self,
                                   other_semaphore_key: &str,
    ) -> LockResult<Vec<MutexLock>>;

    // Deletes semaphore if all associated locks are not locked
    async fn delete_semaphore(&self,
                              other_key: &str,
                              other_version: &str,
    ) -> LockResult<usize>;
}

The LockManager interacts with the LockStore to access mutexes and semaphores, which delegates to the mutex and semaphore repository implementations for lock management. The library defines two implementations of LockStore: first, DefaultLockStore, which supports mutexes and semaphores, where mutexes are used to acquire a singular lock and semaphores are used to acquire a lock from a finite set of shared resources; second, FairLockStore, which uses a Redis-specific implementation of fair semaphores for managing lease-based semaphores that support first-come, first-serve order. If a lock is not immediately available, the LockManager supports waiting for it by periodically polling for the availability of the mutex or semaphore based lock. Due to this periodic polling, the fair semaphore algorithm won’t preserve FIFO order if a new client requests a lock while a previous lock request is waiting for the next polling interval.

Create Lock Manager

You can instantiate a Lock Manager with relational database store as follows:

let config = LocksConfig::new("test_tenant");
let mutex_repo = factory::build_mutex_repository(RepositoryProvider::Rdb, &config)
	.await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
  	RepositoryProvider::Rdb, &config)
	.await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));

let locks_manager = LockManagerImpl::new(
  	&config, store, &default_registry()).expect("failed to initialize lock manager");

Alternatively, you can choose AWS Dynamo DB as follows:

let mutex_repo = factory::build_mutex_repository(
  	RepositoryProvider::Ddb, &config).await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
  	RepositoryProvider::Ddb, &config).await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));

Or Redis based data-store as follows:

let mutex_repo = factory::build_mutex_repository(
  	RepositoryProvider::Redis, &config).await.expect("failed to build mutex repository");
let semaphore_repo = factory::build_semaphore_repository(
  	RepositoryProvider::Redis, &config).await.expect("failed to build semaphore repository");
let store = Box::new(DefaultLockStore::new(&config, mutex_repo, semaphore_repo));

Note: The AWS DynamoDB implementation uses the strongly consistent reads feature, as DynamoDB defaults to eventually consistent reads.

Acquiring a Mutex Lock

You will need to build options with the key name and lease period, and then acquire the lock:

let opts = AcquireLockOptionsBuilder::new("mylock")
	.with_lease_duration_secs(10).build();
let lock = lock_manager.acquire_lock(&opts)
	.expect("should acquire lock");

The acquire_lock operation will automatically create the mutex lock if it doesn’t exist; otherwise it will wait up to the lease period if the lock is not available. This will return a mutex lock structure that includes:

{
  "mutex_key":"one",
  "tenant_id":"local-host-name",
  "version":"258d513e-bae4-4d91-8608-5d500be27593",
  "lease_duration_ms":15000,
  "locked":true,
  "expires_at":"2022-10-11T03:04:43.126542"
}

Renewing the lease of Lock

A lock is only held for the duration specified by lease_duration, but you can renew it periodically if needed:

let opts = SendHeartbeatOptionsBuilder::new("one", "258d513e-bae4-4d91-8608-5d500be27593")
                .with_lease_duration_secs(15)
                .build();

let updated_lock = lock_manager.send_heartbeat(&opts)
					.expect("should renew lock");

Note: The lease renewal will also update the version of lock so you will need to use the updated version to renew or release the lock.

Releasing the lease of Lock

You can build release options from the lock returned by the above API and then release it:

let opts = ReleaseLockOptionsBuilder::new("one", "258d513e-bae4-4d91-8608-5d500be27593")
                .build();
lock_manager.release_lock(&opts)
				.expect("should release lock");

Acquiring a Semaphore based Lock

Semaphores allow you to define a set of locks for a resource with a maximum size. The operation for acquiring a semaphore is similar to acquiring a regular lock except that you specify the semaphore size, e.g.:

let opts = AcquireLockOptionsBuilder::new("my_pool")
                    .with_lease_duration_secs(15)
                    .with_semaphore_max_size(10)
                    .build();
let lock = lock_manager.acquire_lock(&opts)
				.expect("should acquire semaphore lock");

The acquire_lock operation will automatically create the semaphore if it doesn’t exist, and it will then check for available locks and wait if all the locks are busy. This will return a lock structure that includes:

{
  "mutex_key":"one_0000000000",
  "tenant_id":"local-host-name",
  "version":"5ad557df-dbe6-439d-8a31-dc367e32eab9",
  "lease_duration_ms":15000,
  "semaphore_key":"one",
  "locked":true,
  "expires_at":"2022-10-11T04:03:33.662484"
}

The semaphore lock will create mutexes internally that will be numbered from 0 to max-size (exclusive). You can get semaphore details using:

let semaphore = locks_manager.get_semaphore("one").await
                .expect("failed to find semaphore");

That would return:

{
  "semaphore_key": "one",
  "tenant_id": "local-host-name",
  "version": "4ff77432-ed84-48b5-9831-8e53f56c2620",
  "max_size": 10,
  "lease_duration_ms": 15000,
  "busy_count": 1,
  "fair_semaphore": false,
}

Or, fetch state of all mutexes associated with the semaphore using:

let mutexes = locks_manager.get_semaphore_mutexes("one").await
                .expect("failed to find semaphore mutexes");

Which would return:

  {
    "mutex_key": "one_0000000000",
    "tenant_id": "local-host-name",
    "version": "ba5a62e5-80f1-474e-a895-c4a18d252cb9",
    "lease_duration_ms": 15000,
    "semaphore_key": "one",
    "locked": true,
  },
  {
    "mutex_key": "one_0000000001",
    "tenant_id": "local-host-name",
    "version": "749b4ded-e356-4ef5-a23b-73a4984130c8",
    "lease_duration_ms": 15000,
    "semaphore_key": "one",
    "locked": false,
  },
  ...

Renewing the lease of Semaphore Lock

A lock is only held for the duration specified by lease_duration, but you can renew it periodically if needed:

let opts = SendHeartbeatOptionsBuilder::new(
  				"one_0000000000", "749b4ded-e356-4ef5-a23b-73a4984130c8")
                .with_lease_duration_secs(15)
                .with_opt_semaphore_key("one")
                .build();
let updated_lock = lock_manager.send_heartbeat(&opts)
					.expect("should renew lock");

Note: The lease renewal will also update the version of lock so you will need to use the updated version to renew or release the lock.

Releasing the lease of Semaphore Lock

You can build release options from the lock returned by the above API and then release it:

let opts = ReleaseLockOptionsBuilder::new("one_0000000000", "749b4ded-e356-4ef5-a23b-73a4984130c8")
                .with_opt_semaphore_key("one")
                .build();

lock_manager.release_lock(&opts)
				.expect("should release lock");

Acquiring a Fair Semaphore

Fair semaphores are only available for Redis due to the internal implementation, and they must be enabled via the fair_semaphore configuration option; otherwise their usage is similar to the above operations, e.g.:

let mut config = LocksConfig::new("test_tenant");
config.fair_semaphore = Some(true);

let fair_semaphore_repo = factory::build_fair_semaphore_repository(
  	RepositoryProvider::Redis, &config)
	.await.expect("failed to create fair semaphore");
let store = Box::new(FairLockStore::new(&config, fair_semaphore_repo));
let locks_manager = LockManagerImpl::new(
  	&config, store, &default_registry())
	.expect("failed to initialize lock manager");

Then acquire lock similar to the semaphore syntax as before:

let opts = AcquireLockOptionsBuilder::new("my_pool")
                    .with_lease_duration_secs(15)
                    .with_semaphore_max_size(10)
                    .build();
let lock = lock_manager.acquire_lock(&opts)
			.expect("should acquire semaphore lock");

The acquire_lock operation will automatically create the semaphore if it doesn’t exist, and it will then check for available locks and wait if all the locks are busy. This will return a lock structure that includes:

{
  "mutex_key": "one_0fec9a7b-4354-4712-b537-ac14213bc5e8",
  "tenant_id": "local-host-name",
  "version": "0fec9a7b-4354-4712-b537-ac14213bc5e8",
  "lease_duration_ms": 15000,
  "semaphore_key": "one",
  "locked": true,
}

The fair semaphore lock does not use mutexes internally, but for API compatibility it builds a mutex with a key based on the combination of the semaphore key and version. You can then query the semaphore state as follows:

let semaphore = locks_manager.get_semaphore("one").await
                .expect("failed to find semaphore");

That would return:

{
  "semaphore_key": "one",
  "tenant_id": "local-host-name",
  "version": "5779b01f-eaea-4043-8ae0-9f8b942c2727",
  "max_size": 10,
  "lease_duration_ms": 15000,
  "busy_count": 1,
  "fair_semaphore": true,
}

Or, fetch state of all mutexes associated with the semaphore using:

let mutexes = locks_manager.get_semaphore_mutexes("one").await
                .expect("failed to find semaphore mutexes");

Which would return:

  [
  {
    "mutex_key": "one_0fec9a7b-4354-4712-b537-ac14213bc5e8",
    "tenant_id": "local-host-name",
    "version": "0fec9a7b-4354-4712-b537-ac14213bc5e8",
    "lease_duration_ms": 15000,
    "semaphore_key": "one",
    "locked": true,
    "expires_at": "2022-10-11T04:41:43.845711",
  },
  {
    "mutex_key": "one_0000000001",
    "tenant_id": "local-host-name",
    "version": "",
    "lease_duration_ms": 15000,
    "semaphore_key": "one",
    "locked": false,
  },
  ...

Note: The mutex_key will be slightly different for unlocked mutexes as the mutex key isn’t needed by the internal implementation.

Renewing the lease of Fair Semaphore Lock

You can renew lease of fair semaphore similar to above semaphore syntax, e.g.:

let opts = SendHeartbeatOptionsBuilder::new(
  			"one_0fec9a7b-4354-4712-b537-ac14213bc5e8", "0fec9a7b-4354-4712-b537-ac14213bc5e8")
                .with_lease_duration_secs(15)
                .with_opt_semaphore_key("one")
                .build();
let updated_lock = lock_manager.send_heartbeat(&opts)
					.expect("should renew lock");

Note: Due to internal implementation of fair semaphore, the version won’t be changed upon lease renewal.

Releasing the lease of Fair Semaphore Lock

You can build options for releasing from the lock returned by above API as follows and then release it:

let opts = ReleaseLockOptionsBuilder::new(
    			"one_0fec9a7b-4354-4712-b537-ac14213bc5e8", "0fec9a7b-4354-4712-b537-ac14213bc5e8")
                .with_opt_semaphore_key("one")
                .build();

lock_manager.release_lock(&opts)
				.expect("should release lock");

Command Line Interface

In addition to a Rust based interface, the distributed locks library also provides a command line interface for managing mutex and semaphore based locks, e.g.:

Mutexes and Semaphores based Distributed Locks with databases.

Usage: db-locks [OPTIONS] [PROVIDER] <COMMAND>

Commands:
  acquire

  heartbeat

  release

  get-mutex

  delete-mutex

  create-mutex

  create-semaphore

  get-semaphore

  delete-semaphore

  get-semaphore-mutexes

  help
          Print this message or the help of the given subcommand(s)

Arguments:
  [PROVIDER]
          Database provider [default: rdb] [possible values: rdb, ddb, redis]

Options:
  -t, --tenant <TENANT>
          tenant-id for the database [default: local-host-name]
  -f, --fair-semaphore <FAIR_SEMAPHORE>
          fair semaphore lock [default: false] [possible values: true, false]
  -j, --json-output <JSON_OUTPUT>
          json output of result from action [default: false] [possible values: true, false]
  -c, --config <FILE>
          Sets a custom config file
  -h, --help
          Print help information
  -V, --version
          Print version information

For example, you can acquire fair semaphore lock as follows:

% REDIS_URL=redis://192.168.1.102 cargo run --  --fair-semaphore true --json-output true redis acquire --key one --semaphore-max-size 10

Which would return:

{
  "mutex_key": "one_69816448-7080-40f3-8416-ede1b0d90e80",
  "tenant_id": "local-host-name",
  "version": "69816448-7080-40f3-8416-ede1b0d90e80",
  "lease_duration_ms": 15000,
  "semaphore_key": "one",
  "locked": true,
}

You can run the following command to renew the above lock:

% REDIS_URL=redis://192.168.1.102 cargo run --  --fair-semaphore true --json-output true redis heartbeat --key one_69816448-7080-40f3-8416-ede1b0d90e80 --semaphore-key one --version 69816448-7080-40f3-8416-ede1b0d90e80

And then release it as follows:

% REDIS_URL=redis://192.168.1.102 cargo run --  --fair-semaphore true --json-output true redis release --key one_69816448-7080-40f3-8416-ede1b0d90e80 --semaphore-key one --version 69816448-7080-40f3-8416-ede1b0d90e80

Summary

I was able to meet the initial goals for implementing distributed locks, though this library is still early in its development. You can download and try it from https://github.com/bhatti/db-locks. Feel free to send your feedback or contribute to this library.

May 12, 2022

Applying Laws of Scalability to Technology and People

As businesses grow their customer base and hire more employees, they face challenges in meeting customer demands, both in scaling their systems and in maintaining rapid product development with bigger teams. Businesses aim to scale systems linearly with additional computing and human resources. However, a systems architecture such as a monolith or ball of mud makes scaling systems linearly onerous. Similarly, teams become less efficient as they grow and become silos. A general solution to scaling business or technical problems is to use divide & conquer and partition them into multiple sub-problems. A number of factors affect the scalability of software architecture and organizations, such as the interactions among system components or the communication between teams. For example, coordination, communication and data/knowledge coherence among system components and teams become disproportionately expensive with growth in size. Software engineering and business management have developed a number of laws and principles that can be used to evaluate constraints and trade-offs related to these scalability challenges. Following is a list of a few laws from the technology and business domains for scaling software architectures and business organizations:

Amdahl’s Law

Amdahl’s Law, named after Gene Amdahl, is used to predict the speedup of a task’s execution time when it is scaled to run on multiple processors. It simply states that the maximum speedup will be limited by the serial fraction of the task execution as it creates resource contention:

Speed up (P, N) = 1 / [ (1 - P) + P / N ]

Where P is the fraction of task that can run in parallel on N processors. When N becomes large, P / N approaches 0 so speed up is restricted to 1 / (1 – P) where the serial fraction (1 – P) becomes a source of contention due to data coherence, state synchronization, memory access, I/O or other shared resources.
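For example, if 90% of a task can be parallelized (P = 0.9) and it runs on 16 processors, the speedup is 1 / [(1 - 0.9) + 0.9 / 16] = 1 / 0.15625 = 6.4. Even with an unlimited number of processors, the speedup can never exceed 1 / (1 - 0.9) = 10.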

Amdahl’s law can also be described in terms of throughput using:

N / [ 1 + a (N - 1) ]

Where a is the serial fraction between 0 and 1. In parallel computing, a class of problems known as embarrassingly parallel workloads is one where the parallel tasks have little or no dependency among each other, so their value of a will be 0 because they don’t require any inter-task communication overhead.

Amdahl’s law can also be used to scale teams as an organization grows, where the teams can be organized as small, cross-functional groups to parallelize feature work for different product lines or business domains; however, the maximum speedup will still be limited by the serial fraction of the work. The serial work can be: build and deployment pipelines; reviewing and merging changes; communication and coordination between teams; and dependencies on deliverables from other teams. Fred Brooks described in his book The Mythical Man-Month how adding people to a highly divisible task can reduce overall task duration but other tasks are not so easily divisible: while it takes one woman nine months to make one baby, “nine women can’t make a baby in one month”.

The theoretical speedup of the latency of the execution of a program according to Amdahl’s law (credit wikipedia).

Brooks’s Law

Brooks’s law, coined by Fred Brooks, states that adding manpower to a late software project makes it later due to ramp-up time. As the size of a team increases, the ramp-up time for new employees also increases due to the quadratic communication overhead among team members, e.g.

Number of communication channels = N x (N - 1) / 2
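For example, a team of 5 people has 5 x 4 / 2 = 10 communication channels, while a team of 50 has 50 x 49 / 2 = 1,225 channels, which is why ramp-up and coordination costs grow so quickly with team size.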

Organizations can build small teams, such as two-pizza/single-threaded teams, where communication channels within each team do not explode and the cross-functional nature of the teams requires less communication with and fewer dependencies on other teams. Brooks’s law can be equally applied to technology when designing distributed services or components so that each service is designed as a loosely coupled module around a business domain to minimize communication with other services, and services only communicate using well designed interfaces.

Universal Scalability Law

The Universal Scalability Law is used for capacity planning and was derived from Amdahl’s law by Dr. Neil Gunther. It describes relative capacity in terms of concurrency, contention and coherency:

C(N) = N / [1 + a(N - 1) + B x N x (N - 1)]

Where C(N) is the relative capacity, a is the serial fraction between 0 and 1 due to resource contention and B is the delay for data coherency or consistency. As the data coherency cost (B) is quadratic in N, it becomes more expensive as N increases, e.g. using a consensus algorithm such as Paxos to reach state consistency among a large set of servers is impractical because it requires additional communication between all servers. Instead, large scale distributed storage services generally use sharding/partitioning and gossip protocols with a leader-based consensus algorithm to minimize peer to peer communication.
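For example, with illustrative values of a = 0.05 and B = 0.005, the relative capacity at N = 10 is 10 / [1 + 0.05 x 9 + 0.005 x 10 x 9] = 10 / 1.9 ≈ 5.3, but at N = 100 it drops to 100 / [1 + 0.05 x 99 + 0.005 x 100 x 99] = 100 / 55.45 ≈ 1.8, i.e. adding more nodes actually reduces capacity once the coherency cost dominates.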

The Universal Scalability Law can be applied to scale teams similar to Amdahl’s law, where a models the serial work or dependencies between teams and B models the communication needed for a consistent understanding among the team members. The cost of B can be minimized by building cross-functional small teams so that teams can make progress independently. You can also apply this model to any decision making process by keeping the set of stakeholders or decision makers small so that they can reach agreement without grinding to a halt.

The gossip protocol also applies to people: it can be used along with a writing culture, lunch & learn sessions and osmotic communication to spread knowledge and learnings from one team to other teams.

Little’s Law

Little’s Law was developed by John Little to predict the number of items in a queue for a stable, non-preemptive system. It is part of queueing theory and is described mathematically as:

L = A W

Where L is the average number of items within the system or queue, A is the average arrival rate of items and W is the average time an item spends in the system. Little’s law and queuing theory can be used for capacity planning of computing servers and for minimizing the time spent waiting in the queue.

The Little’s law can be applied for predicting task completion rate in an agile process where L represents work-in-progress (WIP) for a sprint; A represents arrival and departure rate or throughput/capacity of tasks; W represents lead-time or an average amount of time in the system.

WIP = Throughput x Lead-Time

Lead-Time = WIP / Throughput
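For example, if a team completes 5 tasks per day (throughput) and the average lead time is 4 days, the work-in-progress is 5 x 4 = 20 tasks; conversely, capping WIP at 10 tasks with the same throughput reduces the lead time to 10 / 5 = 2 days.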

You can use this relationship to reduce the work in progress or lead time and improve the throughput of task completion. Little’s law observes that you can accomplish more by keeping work-in-progress or inventory small. You will be able to better respond to unpredictable delays if you keep a buffer in your capacity and avoid 100% utilization.

Kingman’s Formula

Kingman’s formula expands Little’s law by adding utilization and variability to predict the wait time before requests are served:

E(Wq) ≈ [p / (1 - p)] x [(ca^2 + cs^2) / 2] x T
(credit wikipedia)

where T is the mean service time, m (1/T) is the service rate, A is the mean arrival rate, p = A/m is the utilization, ca is the coefficient of variation for arrivals and cs is the coefficient of variation for service times. Kingman’s formula shows that the queue size increases to infinity as you reach 100% utilization and that you will have longer queues with greater variability of work. These insights can be applied to both technical and business processes so that you can build systems with a greater predictability of processing time, a smaller wait time E(Wq) and a higher throughput.
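For example, with moderate variability (ca = cs = 1) and a mean service time of 1 second, the expected wait is roughly (0.5 / 0.5) x 1 x 1 = 1 second at 50% utilization, but (0.9 / 0.1) x 1 x 1 = 9 seconds at 90% utilization and about 99 seconds at 99% utilization, which is why running near full utilization destroys predictability.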

Note: See Erlang analysis for serving requests in a system without a queue where new requests are blocked or rejected if there is not sufficient capacity in the system.

Gustafson’s Law

Gustafson’s law improves Amdahl’s law with a keen observation that parallel computing enables solving larger problems by computations on very large data sets in a fixed amount of time. It is defined as:

S = s + p x N

S = s + (1 - s) x N

S = N + (1 - N) x s

where S is the theoretical speed up with parallelism, N is the number of processors, s is the serial fraction and p is the parallel part such that s + p = 1.

Gustafson’s law shows that limitations imposed by the sequential fraction of a program may be countered by increasing the total amount of computation. This allows solving bigger technical and business problems with a greater computing and human resources.
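For example, if the serial fraction is s = 0.1, then with N = 100 processors the scaled speedup is S = 100 + (1 - 100) x 0.1 = 90.1, because the parallel portion of the work grows with the problem size while the serial portion stays fixed.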

Conway’s Law

Conway’s law states that an organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure. It means that the architecture of a system is derived from the team structures of an organization; however, you can also use the architecture to derive the team structures. This allows building teams along the architecture boundaries so that each team is small, cross-functional and cohesive. A study by the Harvard Business School found that large co-located teams often tended to produce more tightly-coupled and monolithic codebases whereas small distributed teams produce more modular codebases. These lessons can be applied to scaling teams and architecture so that teams and system modules are built around organizational boundaries and independent concerns to promote autonomy and reduce tight coupling.

Pareto Principle

The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes. This principle shows up in numerous technical and business problems, e.g. 20% of the code has 80% of the errors; customers use 20% of the functionality 80% of the time; 80% of optimization improvements come from 20% of the effort, etc. It can also be used to identify hotspots or critical paths when scaling, as some microservices or teams may receive disproportionate demands. Though scaling computing resources is relatively easy, scaling a team beyond an organizational boundary is hard. You will have to apply other management tools such as prioritization, planning, metrics, automation and better communication to manage critical work.

Metcalfe’s Law

Metcalfe’s law states that if there are N users of a telecommunications network, the value of the network is proportional to N^2. It’s also referred to as the network effect and applies to social networking sites.

Number of possible pair connections = N * (N – 1) / 2

Reed’s Law expanded this law and observed that the utility of large networks can scale exponentially with the size of the network.

Number of possible subgroups of a network = 2^N - N - 1
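For example, a network of 10 nodes has 10 x 9 / 2 = 45 possible pair connections and 2^10 - 10 - 1 = 1,013 possible subgroups; at 20 nodes these grow to 190 pairs and 1,048,555 subgroups.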

This law explains the popularity of social networking services via viral communication. These laws can be applied to model information flow between teams or message exchange between services to avoid peer to peer communication with an extremely large group of people or set of nodes. A common alternative is to use a gossip protocol or designate a partition leader for each group that communicates with other leaders and then disseminates information to the group internally.

Dunbar Number

Dunbar’s number is a suggested cognitive limit to the number of people with whom one can maintain stable social relationships. It has a commonly used value of 150 and can be used to limit direct communication connections within an organization.

Wirth’s Law and Parkinson’s Law

Wirth’s Law, named after Niklaus Wirth, posits that software tends to become slower at a rate that outpaces the speed at which hardware becomes faster. This observation reflects a trend where, despite significant advancements in processor speeds as predicted by Moore’s Law, software complexity increases correspondingly. Developers often leverage these hardware improvements to create more intricate and feature-rich software, which can negate the hardware gains. Additionally, the use of programming languages and tools that do not prioritize efficiency can lead to bloated code.

In the realm of software development, there are similar principles, such as Parkinson’s law, which suggests that work expands to fill the time allotted for its completion. This implies that given more time, software projects may become more complex or extended than initially necessary. Moreover, Hofstadter’s Law offers a cautionary perspective, stating, “It always takes longer than you expect, even when you take into account Hofstadter’s Law.” This highlights the often-unexpected delays in software development timelines. Brooks’s Law further adds to these insights with the adage, “Adding manpower to a late software project makes it later.” These laws collectively emphasize that the demand upon a resource tends to expand to match the supply of the resource, but adding resources late also poses challenges due to the complexity of software development and project management.

Summary

The above laws show how you can partition tightly coupled architecture and large teams into a modular architecture and small autonomous teams. For example, Amdahl’s and the Universal Scalability laws demonstrate that you have to account for the cost of serial work, coordination and communication between partitions as you parallelize a problem, because they become the bottleneck as you scale. Brooks’s and Metcalfe’s laws indicate that you will need to manage the number of communication paths among modules or teams as they can explode quadratically, thus stifling your growth. Little’s law and Kingman’s formula establish that you need to reduce inventory or work in progress and avoid 100% utilization in order to provide reliable throughput. Conway’s law shows how architecture and team structures can be aligned for maximum autonomy and productivity. This allows you to accomplish more work by using small cross-functional teams who own independent product lines and build modular architecture to reduce dependency on other teams and subsystems. The Pareto principle can be used to make small changes to the architecture or teams that result in higher scalability and productivity. Wirth’s Law and Parkinson’s Law, when applied judiciously, can be instrumental in enhancing efficiency in software development: by setting more stringent timelines and clear, concise objectives, you can counteract the tendency for work to expand to fill the available time. The Dunbar number only applies to people, but it can be used to limit dependencies on external teams as a human mind has a finite capacity to maintain external relationships. However, before applying these laws, you should have clear goals and collect proper metrics and KPIs so that you can measure the baseline and the improvements gained from these laws. You should also be cautious when applying these laws prematurely for scalability as it may make things worse. Finally, when solving scalability and performance related problems, it is vital to focus on global optimization to scale the entire organization or system, as opposed to local optimization that focuses strictly on a specific part of the system.

March 24, 2022

Architecture Patterns and Practices for Sustainable Software Delivery Pipelines

Filed under: Project Management,Technology — Tags: , , , , — admin @ 10:31 pm

Abstract

Software is eating the world and today’s businesses demand shipping software features at a higher velocity to enable learning at a greater pace without compromising the quality. However, each new feature increases viscosity of existing code, which can add more complexity and technical debt so the time to market for new features becomes longer. Managing a sustainable pace for software delivery requires continuous improvements to the software development architecture and practices.

Software Architecture

The Software Architecture defines the guiding principles and structure of the software systems. It also includes quality attributes such as performance, sustainability, security, scalability, and resiliency. The software architecture is then continuously updated through an iterative software development process and the feedback cycle from actual use in the production environment. The software architecture decays if it’s ignored, which results in higher complexity and technical debt. In order to reduce technical debt, you can build a backlog of technical and architecture related changes so that you can prioritize them along with the product development. In order to maintain a consistent architecture throughout your organization, you can document the architecture principles to define high-level guidelines for best practices, documentation templates, the review process and guidance for architecture decisions.

Quality Attributes

Following are major quality attributes of the software architecture:

  • Availability — It defines the percentage of time the system is available, e.g. available-for-use-time / total-time. It is generally expressed in percentages such as 99.99%, which indicates a downtime of about 52 minutes in a year. It can also be calculated in terms of mean-time between failures (MTBF) and mean-time to recover (MTTR) using MTBF / (MTBF + MTTR). The availability will depend not only on the service that you are providing but also on its dependent services, e.g. P-service * P-dep-service-1 * P-dep-service-2. You can improve availability with redundant services, where the availability of N redundant instances is 1 - (1 - Service-availability) ** N (see the worked example after this list). In order to further improve availability, you can detect faults and use redundancy and state synchronization for fault recovery. The system should also handle exceptions gracefully so that it doesn’t crash or go into a bad state.
  • Capacity — Capacity defines how the system scales by adding hardware resources.
  • Extensibility — Extensibility defines how the system meets future business requirements without significantly changing existing design and code.
  • Fault Tolerance — Fault tolerance prevents a single point of failure and allows the system to continue operating even when parts of the system fail.
  • Maintainability — Higher quality code allows building robust software with higher stability and availability. This improves software delivery due to modular and loosely coupled design.
  • Performance — It is defined in terms of the latency of an operation under normal or peak load. Performance may degrade with consumption of resources, which affects the throughput and scalability of the system. You can measure a user’s response-time, throughput and utilization of computational resources by stress testing the system. A number of tactics can be used to improve performance such as prioritization, reducing overhead, rate-limiting, asynchronicity, caching, etc. Performance testing can be integrated with the continuous delivery process by using load and stress testing to measure performance metrics and resource utilization.
  • Resilience — Resilience accepts the fact that faults and failure will occur so instead system components resist them by retrying, restarting, limiting error propagation or other measures. A failure is when a system deviates from its expected behavior as a result of accidental fault, misconfigurations, transient network issues or programming error. Two metrics related to resilience are mean-time between failure (MTBF) and mean-time to recover (MTTR), however resilient systems pay more attention to recovery or a shorter MTTR for fast recovery.
  • Recovery — Recovery looks at system recover in relation with availability and resilience. Two metrics related to recovery are recovery point objective (RPO) and recovery time objective (RTO), where RPO determines data that can be lost in case of failure and RTO defines wait time for the system recovery.
  • Reliability — Reliability looks at the probability of failure or failure rate.
  • Reproducibility — Reproducibility uses version control for code, infrastructure, configuration so that you can track and audit changes easily.
  • Reusability — It encourages code reuse to improve reliability, productivity and cost savings from the duplicated effort.
  • Scalability — It defines ability of the system to handle increase in workload without performance degradation. It can be expressed in terms of vertical or horizontal scalability, where horizontal reduces impact of isolated failure and improves workload availability. Cloud computing offers elastic and auto-scaling features for adding additional hardware when higher request rate is detected by the load balancer.
  • Security — Security primarily looks at confidentiality, integrity and availability (CIA) and is critical in building distributed systems. Building secure systems depends on security practices such as strong identity management, defense in depth, zero trust networks, auditing, and protecting data in motion and at rest. You can adopt DevSecOps that shifts security left to earlier in the software development lifecycle with processes such as Security by Design (SbD), STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of privilege), PASTA (Process for Attack Simulation and Threat Analysis), VAST (Visual, Agile and Simple Threat), CAPEC (Common Attack Pattern Enumeration and Classification), and OCTAVE (Operationally Critical Threat, Asset, and Vulnerability Evaluation).
  • Testability — It encourages building systems in a such way it’s easier to test them.
  • Usability — It defines user experience of user interface and information architecture.
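As a worked example of the availability attribute above: a service with an MTBF of 1,000 hours and an MTTR of 1 hour has an availability of 1000 / (1000 + 1) ≈ 99.9%. If that service also depends on two downstream services that are each 99.9% available, the overall availability drops to 0.999 x 0.999 x 0.999 ≈ 99.7%. Conversely, running two redundant instances of a 99% available service yields 1 - (1 - 0.99) ** 2 = 0.9999, i.e. about 99.99% availability.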

Architecture Patterns

Following is a list of architecture patterns that help in building high quality software:

Asynchronicity

Synchronous services are difficult to scale and to recover from failures because they require low latency and can easily be overwhelmed. Messaging-based asynchronous communication, based on point-to-point or publish/subscribe, is more suitable for handling faults or high load. This improves resilience because service components can restart in case of failure while messages remain in the queue.

Admission Control

Admission control adds an authentication, authorization or validation check in front of the event queue so that the service can handle the load and prevent overload when demand exceeds the server capacity.

Back Pressure

When a producer is generating workload faster than the server can process, it can result in long request queues. Back pressure signals clients that servers are overloaded and clients need to slow down. However, rogue clients may ignore these signals so servers often employ other tactics such as admission control, load shedding, rate limiting or throttling requests.

Big fleet in front of small fleet

You should look at all transitive dependencies when scaling a service with a large fleet of hosts so that you don’t drive large network traffic to dependent services with a smaller fleet. You can use load testing to find the bottlenecks and update SLAs for the dependent services so that they are aware of the network load from your APIs.

Blast Radius

A blast radius defines the impact of failure on the overall system when an error occurs. In order to limit the blast radius, the system should eliminate single points of failure, roll out changes gradually using canary deployments, and stop cascading failures using circuit breakers, retries and timeouts.

Bulkheads

Bulkheads isolate faults from one component to another, e.g. you may use different thread pool for different workloads or use multiple regions/availability-zones to isolate failures in a specific datacenter.

Caching

Caching can be implemented at a several layers to improve performance such as database-cache, application-cache, proxy/edge cache, pre-compute cache and client-side cache.

Circuit Breakers

The circuit breaker is defined as a state machine with three states: normal, checking and tripped. It can be used to detect persistent failures in a dependent service and trip its state to disable invocation of the service temporarily with some default behavior. It can be later changed to the checking state for detecting success, which changes its state to normal after a successful invocation of the dependent service.
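The following is a minimal, self-contained sketch of such a state machine in Rust (the type names, threshold and default-value behavior are illustrative and not taken from any particular library):

use std::time::{Duration, Instant};

// The three states described above: normal, checking and tripped.
#[derive(Clone, Copy)]
enum State {
    Normal,
    Checking,
    Tripped { since: Instant },
}

struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32, // trip after this many consecutive failures
    cooldown: Duration,     // how long to stay tripped before checking again
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self { state: State::Normal, consecutive_failures: 0, failure_threshold, cooldown }
    }

    // Invokes the dependent operation, or returns the default value while tripped.
    fn call<T>(&mut self, op: impl FnOnce() -> Result<T, String>, default: T) -> T {
        if let State::Tripped { since } = self.state {
            if since.elapsed() < self.cooldown {
                return default; // fail fast with the default/degraded behavior
            }
            self.state = State::Checking; // allow a trial invocation
        }
        match op() {
            Ok(value) => {
                self.consecutive_failures = 0;
                self.state = State::Normal;
                value
            }
            Err(_) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = State::Tripped { since: Instant::now() };
                }
                default
            }
        }
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(3, Duration::from_secs(30));
    // A persistently failing dependency trips the breaker after three consecutive errors.
    for _ in 0..5 {
        let value = breaker.call(|| Err::<&str, String>("timeout".to_string()), "default response");
        println!("{value}");
    }
}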

CQRS / Event Sourcing

Command and Query Responsibility Segregation (CQRS) separates read and update operations in the database. It’s often implemented using event-sourcing that records changes in an append-only store for maintaining consistency and audit trails.

Default Values

Default values provide a simple way to provide limited or degraded behavior in case of failure in dependent configuration or control service.

Disaster Recovery

Disaster recovery (DR) enables business continuity in the event of large-scale failure of data centers. Based on cost, availability and RTO/RPO constraints, you can deploy services to multiple regions for hot site; only replicate data from one region to another region while keeping servers as standby for warm site; or use backup/restore for cold site. It is essential to periodically test and verify these DR procedures and processes.

Distributed Saga

Maintaining data consistency in a distributed system where data is stored in multiple databases can be hard and using 2-phase-commit may incur high complexity and performance. You can use distributed Saga for implementing long-running transactions. It maintains state of the transaction and applies compensating transactions in case of a failure.

Failing Fast

You can fail fast if the workload cannot serve the request due to unavailability of resources or dependent services. In some cases, you can queue requests, however it’s best to keep those queues short so that you are not spending resources to serve stale requests.

Function as a Service

Function as a service (FaaS) offers serverless computing to simplify managing physical resources. Cloud vendors offer APIs for AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads. These functions can be easily scaled to handle load spikes, however you have to be careful scaling these functions so that any services that they depend on can support the workload. Each function should be designed with a single responsibility, idempotency and shared nothing principles that can be executed concurrently. The serverless applications generally use event-based architecture for triggering functions and as the serverless functions are more granular, they incur more communication overhead. In addition, chaining functions within the code can result in tightly coupled applications, instead use a state machine or a workflow to orchestrate the communication flow. There is also an open source support for FaaS based serverless computing such as OpenFaas and OpenWhisk on top of Kubernetes or OpenShift, which prevents locking into a specific cloud provider.

Graceful Degradation

Instead of failing a request when dependent components are unhealthy, a service may use circuit-breaker pattern to return a predefined or default response.

Health Checks

Health checks run a dummy or synthetic transaction that performs an action without affecting real data to verify a system component and its dependencies.

Idempotency

Idempotent services complete an API request exactly once, so resending the same request due to retries has no side effect. Idempotent APIs typically use a client-generated identifier or token, and the service returns the same response if a duplicate request is received.
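A minimal Python sketch of this idea follows, assuming the client sends a hypothetical idempotency key with each request; a real service would keep the key-to-response mapping in a durable store with a TTL:

# Map idempotency keys to previously computed responses (durable store in production).
processed: dict[str, dict] = {}

def charge(idempotency_key: str, amount_cents: int) -> dict:
    # If this key was already processed, return the same response without charging twice.
    if idempotency_key in processed:
        return processed[idempotency_key]
    response = {"status": "charged", "amount_cents": amount_cents}
    processed[idempotency_key] = response
    return response

first = charge("req-8f14e45f", 1999)
retry = charge("req-8f14e45f", 1999)   # a client retry returns the stored response
assert first == retry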

Layered Architecture

The layered architecture separates software into different concerns such as:

  • Presentation Layer
  • Business Logic Layer
  • Service Layer
  • Domain Model Layer
  • Data Access Layer

Load Balancer

A load balancer distributes traffic among a group of resources so that a single resource is not overloaded. Load balancers also monitor the health of servers, and you can set up a load balancer for each group of resources to ensure that requests are not routed to unhealthy or unavailable resources.

Load Shedding

Load shedding allows rejecting work at the edge when the server side exceeds its capacity, e.g. a server may return an HTTP 429 error to signal clients that they should retry at a slower rate.

Loosely coupled dependencies

Using queuing systems, streaming systems, and workflows isolates the behavior of dependent components and increases resiliency through asynchronous communication.

MicroServices

Microservices evolved from service oriented architecture (SOA) and support both point to point protocols such as REST/gRPC and asynchronous protocols based on messaging/event bus. You can apply bounded-context of domain-driven design (DDD) to design loosely coupled services.

Model-View Controller

It decouples user interface from the data model and application functionality so that each component can be independently tested. Other variations of this pattern include model–view–presenter (MVP) and model–view–viewmodel (MVVM).

NoSQL

NoSQL database technology provides support for high availability and variable/write-heavy workloads that can be easily scaled with additional hardware. NoSQL optimizes the CAP and PACELC tradeoffs of consistency, availability, partition tolerance and latency. A number of cloud vendors provide managed NoSQL database solutions, however they can create latency issues if the services accessing these databases are not colocated.

No Single Point of Failure

In order to eliminate single-points of failures for providing high availability and failover, you can deploy redundant services to multiple regions and availability zones.

Ports and Adapters

Ports and Adapters (Hexagonal architecture) separates interfaces (ports) from implementations (adapters). The business logic is encapsulated in the hexagon and is invoked through the adapters when actors operate on the capabilities offered by a port.

Rate Limiting and Throttling

Rate-limiting defines the rate at which clients can access the services based on a license policy, while throttling can be used to restrict access as a result of an unexpected increase in demand. For example, the server can return HTTP 429 to notify clients that they should back off or retry at a slower rate.
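A minimal Python sketch of a token bucket rate limiter that returns HTTP 429 when the limit is exceeded is shown below; the rate and burst values are illustrative:

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_sec=5.0, burst=10)

def handle_request() -> tuple[int, str]:
    # Reject the request at the edge when the bucket is empty.
    if not bucket.allow():
        return 429, "Too Many Requests - retry at a slower rate"
    return 200, "OK"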

Retries with Backoff and Jitter

A remote operation can be retried if it fails due to transient failure or a server overload, however each retry should use a capped exponential backoff so that retries don’t cause additional load on the server. In a layered architecture, retry should be performed at a single point to minimize multifold retries. Retries can use circuit-breakers and rate-limiting to throttle requests. In some cases, requests may timeout for clients but succeed on the server side so the APIs must be designed with idempotency so that they are safe to retry. In order to avoid retries at the same time, a small random jitter can be added with retries.
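A minimal Python sketch of retries with capped exponential backoff and full jitter follows; the delay and attempt values are illustrative:

import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Cap the exponential backoff and add random jitter so that clients
            # do not retry in lockstep and overload the server again.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))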

Rollbacks

The software should be designed with rollbacks in mind so that all code, database schema and configuration changes can be easily rolled back. A production environment might be running multiple versions of the same service, so care must be taken to design APIs that are both backward and forward compatible.

Stateless and Shared nothing

A shared-nothing architecture helps build stateless and loosely coupled services that can be easily scaled horizontally to provide high availability. This architecture allows recovering from isolated failures and supports auto-scaling by shrinking or expanding resources based on traffic patterns.

Startup dependencies

Upon startup, services may need to connect to certain configuration or bootstrap services, so care must be taken to avoid thundering herd problems that can overwhelm those dependencies in the event of a wide regional outage.

Timeouts

Timeouts help build resilient systems by bounding the invocation of external services and preventing the thundering herd problem. Timeouts can also be used when retrying a failed operation after a transient failure or a server overload. A timeout can also include a small jitter to randomly spread the load on the server; jitter can likewise be applied to timers of scheduled jobs or delayed work.

Watchdogs and Alerts

A watchdog monitors a system component for specific signals such as latency, traffic, errors, saturation and SLOs. It then sends an alert based on the monitoring configuration that triggers an email, on-call paging or an escalation.

Virtualization and Containers

Virtualization allows abstracting computing resources using virtual machines or containers so that you don't depend on the physical implementation. A virtual machine runs a complete operating system on top of a hypervisor, whereas a container is an isolated, lightweight environment for running applications. Virtualization allows building immutable infrastructure that is specifically designed to meet application requirements and can be easily deployed on a variety of hardware resources.

Architecture Practices

Following are best practices for sustainable software delivery:

Automation

Automation builds pipelines for continuous integration, continuous testing and continuous delivery to improve the speed and agility of software delivery. Any kind of operational procedure for deployment and monitoring can be stored in version control and then applied automatically through CI/CD. In addition, automated procedures can be defined to track failures based on key performance indicators and trigger recovery or repair for the erroneous components.

Automated Testing

Automated testing allows building software with a suite of unit, integration, functional, load and security tests that verify the behavior and ensure that it can meet production demand. These automated tests run as part of CI/CD pipelines and stop the deployment if any of the tests fail. In order to run end-to-end and load tests, the deployment scripts create a new environment and set up test data. These tests may replicate synthetic transactions based on production traffic and benchmark the performance metrics.

Capacity Planning

Load testing, along with monitoring production traffic patterns, demand and workload utilization, helps forecast the resources needed for future growth. This can be further strengthened with a capacity model that calculates the unit price of resources and a growth forecast so that you can automate the addition or removal of resources based on demand.

Cloud Computing

Adopting cloud computing simplifies resource provisioning, and its elasticity allows organizations to grow or shrink those resources based on demand. You can also add automation to optimize resource utilization and reduce costs.

Continuous Delivery

Continuous delivery automates production deployment of small and frequent changes by developers. It relies on continuous integration that runs automated tests and on automated deployment without any manual intervention. During the development process, a developer picks a feature, works on changes and then commits the changes to source control after peer code review. The automated build system runs a pipeline to create a container image based on the commit and deploys it to a test or QA environment. The test environment runs automated unit, integration and regression tests using test data in the database. The code is then promoted to the main branch, and the automated build system tags and builds the image on the head commit of the main branch, which is pushed to the container registry. The pre-prod environment pulls the image, restarts the pre-prod container and runs more comprehensive tests, including performance tests, with a larger set of test data in the database. You may need multiple stages of pre-prod deployment such as alpha, beta and gamma environments, where each environment may require deployment to a unique datacenter. After successful testing, the production systems are updated with the new image using rolling updates, blue/green deployments or canary deployments to minimize disruption to end users. The monitoring system watches for error rates at each stage of the deployment and automatically rolls back changes if a problem occurs.

Deploy over Multiple Zones and Regions

In order to provide high availability, compliance and reduced latency, you can deploy to multiple availability zones and regions. Global load balancers can be used to route traffic based on geographic proximity to the closest region. This also helps implement business continuity, as applications can easily fail over to another region with minimal data loss.

Service Mesh

In order to easily build distributed systems, a number of platforms based on service-mesh pattern have emerged to abstract a common set of problems such as network communication, security, observability, etc:

Dapr – Distributed Application Runtime

The Distributed Application Runtime (Dapr) provides a variety of communication protocols, encryption, observability and secret management for building secured and resilient distributed services.

Envoy

Envoy is a service proxy for building cloud native applications with built-in support for networking protocols and observability.

Istio service mesh

Istio is built on top of Kubernetes and Envoy to provide a service mesh with built-in support for networking, traffic management, observability and security. A service mesh also addresses features such as A/B testing, canary deployments, rate limiting, access control, encryption, and end-to-end authentication.

Linkerd

Linkerd is a service mesh for Kubernetes and consists of a control plane and a data plane with built-in support for networking, observability and security. The control plane allows controlling services, and the data plane acts as a sidecar container that handles network traffic and communicates with the control plane for configuration.

WebAssembly

WebAssembly is a stack-based virtual machine that can run at the edge or in the cloud. A number of WebAssembly platforms such as wasmCloud and Lunatic have adopted the Actor model to build platforms for writing distributed applications.

Documentation

The architecture document defines goals and constraints of the software system and provides various perspectives such as use-cases, logical, data, processes, and physical deployment. It also includes non-functional or quality attributes such as performance, growth, scalability, etc. You can document these aspects using standards such as 4+1, C4, and ERD as well as document the broader enterprise architecture using methodologies like TOGAF, Zachman, and EA.

Incident management

Incident management defines the process of root-cause analysis and the actions an organization can take when an incident affects the production environment. It defines best practices such as clear ownership, reducing time to detect/mitigate, blameless postmortems and prevention measures. The organization can then implement preventive measures and share lessons learned from all operational events and failures across teams. You can also use pre-mortems to identify potential areas that can be improved or mitigated. Another way to simulate potential problems is using chaos engineering or setting up game days to test the workloads for various scenarios and outages.

Infrastructure as Code

Infrastructure as code uses a declarative language to define development, test and production environments, which is managed in source control. This provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Following is a list of frameworks for building infrastructure as code:

Azure Resource Manager

The Azure cloud offers Azure Resource Manager (ARM) templates in JSON format to declaratively define the infrastructure that you intend to deploy.

AWS Cloud Development Kit

The Cloud Development Kit (CDK) supports high-level programming languages to construct cloud resources on Amazon Web Services so that you can easily build cloud applications.

Hashicorp Terraform

Terraform uses HCL based configurations to describe computing resources that can be deployed to multiple cloud providers.

Monitoring

Monitoring measures key performance indicators (KPI) and service-level objectives (SLO) that are defined at the infrastructure, applications, services and end-to-end levels. These include both business and technical metrics such as number of errors, hot spots, call graphs, which are visible to the entire team for monitoring trends and reacting quickly to failures.

Multi-tenancy

If your system is consumed by different groups or tenants of users, you will need to design your system and services to isolate data and computing resources in a secure and reliable fashion. Each layer of the system can be designed to treat tenant context as a first-class construct that is tied to the user identity. You can capture usage metrics per tenant to identify bottlenecks, estimate cost and analyze resource utilization for capacity planning and growth projections. The operational dashboards can also use these metrics to construct tenant-based operational views and proactively respond to unexpected load.

Security Review

In order to minimize the security risk, the development teams can adopt shift-left on security and DevSecOps practices to closely collaborate with the InfoSec team and integrate security review into every phase of the software development lifecycle.

Version Control Systems

Version control systems such as Git or Mercurial help track changes to code, configurations and scripts over time. You can adopt workflows such as gitflow or trunk-based development for the check-in process. Other common practices include smaller commits, testing code and running static analysis or linters/profiling tools before check-in.

Summary

Software complexity is a major reason for missed deadlines and slow or buggy software. This complexity can be essential complexity within the business domain, but it's often accidental complexity resulting from technical debt, poor architecture and poor development practices. Another source of incidental complexity comes from distributed computing, where concerns such as security, rate-limiting and observability need to be handled consistently across distributed systems. For example, virtualization helps build immutable infrastructure and adopt infrastructure as code; functions as a service simplify building microservices; and distributed platforms such as Istio and Linkerd remove a lot of cruft such as security, observability, traffic management and communication protocols when building distributed systems. The goal of a good architecture is to simplify building, testing, deploying and operating software. You need to continually improve the system's architecture and its practices to build sustainable software delivery pipelines that can meet both current and future demands of users.

January 21, 2022

Evolving from Password-based Authentication

Filed under: Encryption — admin @ 1:36 pm

Abstract

User identity and authentication are commonly implemented using usernames and passwords (U/P). Despite the wide adoption of password-based authentication, passwords are considered the weakest link in the chain of secure access. Weak passwords and password reuse are common attack vectors for cyber-crime. Most organizations enforce strong passwords, but this burdens users with following arbitrary rules and storing the passwords securely. Nevertheless, strong passwords are still vulnerable to account takeover via phishing or social engineering attacks. Managing passwords also incurs a high cost of maintenance for service providers and help desks.

Simply put, passwords are broken. An alternative to password-based authentication is passwordless authentication, which verifies user credentials using biometrics and asymmetric cryptography instead of passwords. A number of industry standards such as WebAuthn, FIDO2, NIST AAL and the Client to Authenticator Protocol (CTAP) are in development to implement passwordless authentication. Passwordless authentication can be further strengthened with multi-factor authentication (MFA) using one-time passwords (OTP) and SMS. Passwordless authentication eliminates credential stuffing and phishing attacks that lead to data breaches [1].

Passwordless authentication can work with centralized/federated systems as well as decentralized systems that may use decentralized identity. Decentralized identity uses decentralized identifiers (DIDs) developed by the Decentralized Identity Foundation (DIF) and the W3C Credentials Community Group. This allows users to fully control their identity and data privacy, so they can choose which data is shared with other applications and services. The following diagram shows the progression of authentication from password-based to decentralized identity:

Drawbacks of Password based Authentication

Following are major drawbacks of password based authentication:

Inconsistent Password Rules

There are no industry-wide standards for strong passwords, so most organizations use internal rules for defining passwords such as minimum length and the use of special or alphanumeric characters. In general, longer passwords with a mix of characters provide higher entropy and are more secure [NIST], but this forces users to follow arbitrary rules.

Too many passwords

An average user may need to create accounts on hundreds of websites, which burdens them with managing passwords for all those accounts.

Sharing same password

Due to the large number of accounts, users may reuse the same password across multiple websites. This poses a high security risk in the event of a breach, as reported by the DBIR, and it's recommended not to reuse the same password across websites.

Changing password

A user may need to change password after a security breach or on a regular basis, which adds unnecessary work for users.

Unsafe storage of passwords by service providers

Many organizations use weak internal practices for storing passwords, such as storing them in plaintext or unsalted, or using weak hashing algorithms instead of a standardized algorithm such as Argon2. Such weak security measures lead to credential stuffing, where stolen credentials are used to access other private data.

Costs of IT Support

Password-reset and forgot-password requests are among the major costs of help-desk support and can consume up to 50% of the IT budget.

Mitigating Password Management

Following are a few remedies that can mitigate password management:

Password managers

As strong passwords or long passphrases are difficult to memorize, users often write them down, where they can be easily observed. It's recommended to use password managers for storing passwords safely, but they come with their own security risks such as unsafe storage in memory, on local disk or in the cloud [2, 3].

OAuth/OpenID standards

The OAuth/OpenID Connect/SAML standards can be used to implement single sign-on or social login, which allows users to access multiple websites or applications using the same credentials. However, this still poses a security risk if the account credentials are compromised, and it weakens data privacy because users can be easily traced when the same account is used to access multiple websites.

Multi-Factor Authentication

Password authentication can use multi-factor authentication (MFA) to harden security. It's recommended not to use SMS or voice for MFA due to common attacks via SIM swapping/hijacking, phishing, social engineering or loss of device. Instead, one-time passwords (OTPs) provide a better alternative and can be generated using a symmetric key with the HOTP/TOTP schemes.
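A minimal Python sketch of the HOTP/TOTP schemes (RFC 4226/6238) follows; the base32 secret is a hypothetical example that would normally come from enrollment:

import base64
import hashlib
import hmac
import struct
import time

def hotp(secret_base32: str, counter: int, digits: int = 6) -> str:
    key = base64.b32decode(secret_base32, casefold=True)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window.
    offset = mac[-1] & 0x0F
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

def totp(secret_base32: str, period_secs: int = 30) -> str:
    # The moving factor is the number of elapsed time steps since the Unix epoch.
    return hotp(secret_base32, int(time.time()) // period_secs)

print(totp("JBSWY3DPEHPK3PXP"))  # one-time password valid for the current 30s window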

Long-term refresh tokens

Instead of frequent password prompts, some applications use long-term refresh tokens or cookies to extend user sessions. However, it’s recommended to use such a mechanism with other multi-factor authentication or PIN/passcodes.

Passwordless Authentication

Passwordless authentication addresses the shortcomings of passwords by diminishing the security risks associated with them. In addition, it provides better user experience, convenience and productivity by eliminating the friction of password management. Passwordless authentication also reduces IT overhead related to password resets and allows the use of biometrics on smartphones instead of hard keys or smart cards.

Passwordless Standards

Following section defines industry standards for implementing passwordless authentication:

Asymmetric password-authenticated key exchange (PAKE)

Asymmetric password-authenticated key exchange (PAKE) protocols such as Secure Remote Password (SRP) allow authentication without exchanging passwords. However, SRP is old and susceptible to MITM attacks; more recent standards such as CPace and OPAQUE provide better alternatives. OPAQUE uses an oblivious pseudorandom function (OPRF), where a random output can be generated without knowing the input.

Fast IDentity Online 2 (FIDO2)

FIDO2 is an open standard that uses asymmetric keys and physical devices to authenticate users. The FIDO2 specification defines the Client to Authenticator Protocol (CTAP) and Web Authentication (WebAuthn) protocols for communication and authentication. WebAuthn can work with roaming physical devices such as YubiKey as well as platform devices that may use built-in biometrics such as TouchID for authentication.

Decentralized Identity and Access Management

Decentralized or self-sovereign identity (SSI) allows users to own and control their identity. This provides passwordless authentication as well as decentralized authorization to control access to private data. The users use a local digital wallet to store credentials and authenticate their identity instead of using external identity providers.

Comparison

Following diagrams compare architecture of centralized, federated and decentralized authentication:

Centralized

In a centralized model, the user directly interacts with the website or application, e.g.

Federated

In a federated model, the user interacts with one or more identity providers using Security Assertion Markup Language (SAML), OAuth, or OpenID Connect (social login), e.g.

Decentralized

In a decentralized model, user establishes peer-to-peer relationships with the authenticating party, e.g.,

The issuers in decentralized identity can be organizations, universities or government agencies that generate and sign credentials. The holders are the subjects of authentication who request verifiable credentials from issuers and then hold them in a digital wallet. The verifiers are independent entities that verify the claims and identity in verifiable credentials using the issuer's digital signature.

Benefits

Following is a list of major benefits of decentralized or self-sovereign identity:

Reduced costs

User and access management such as account creation, login, logout, password reset, etc. incurs a major cost for service providers. Decentralized identity shifts this work to the client side, where users can use a digital wallet to manage their identity and authentication.

Reduced liability

Storing private user data in the cloud is a liability, and organizations often lack proper security and access control to protect it. With decentralized identity, users can control what data is shared for each use case, which eliminates the liability associated with centralized storage of private data.

Authorization

Decentralized identity can provide claims for authorization decisions using attribute-based access control or role-based access control. These claims can be cryptographically verified in real-time using zero-knowledge proofs (ZKP) to protect user data and privacy.

Sybil Attack

Decentralized identity makes it easy to generate new identities, which opens the door to Sybil or spam attacks where a user creates multiple identifiers to mimic multiple users. Service providers can use ID verification, referrals, probationary periods or reputation systems to prevent such attacks.

Loyalty and rewards programs

As decentralized identity empowers users to control the data, they can choose to participate in various loyalty and reward programs by sharing specific data with other service providers.

Data privacy and security

Decentralized identity provides better privacy and security required by regulations such as Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), etc. This also prevents identity theft and related cyber-crimes because user data is not stored in a cloud or centralized storage.

Key Building Blocks of Self-Sovereign Identity

The above diagram shows the major building blocks of decentralized or self-sovereign identity, as described below:

Verifiable Credentials

A verifiable credential (VC) stores credentials and claims for a subject that can be verified using public key infrastructure. The VC may be represented using Blockcerts, W3C or JSON-LD data models. It may store additional claims or attributes that support data privacy using zero-knowledge proofs.

A credential holder may use an identity hub to store credentials, which can use DIF's Presentation Exchange spec to serve this data using data-storage standards, e.g.

{
          "@context": [
            "https://www.w3.org/2018/credentials/v1",
            "https://schema.org",
          ],
          "id": "http://plexobject.com/credentials/58473",
          "type": ["VerifiableCredential", "Person"],
          "credentialSubject": {
            "id": "did:plexobject:ebfeb1f712ebc6f1c276e12ec21",
            "alumniOf": "PlexObject Solutions"
          },
          "proof": { ... }
}
Decentralized identifiers (DIDs)

W3C Working Group defines decentralized identifier (DID) as a globally unique identifier similar to URL that uses peer-to-peer communication and decentralized public key infrastructure (DPKI).

The DID points to the associated document that contains metadata about the DID subject, e.g.

did:plexobject:xyz

and its associated DID document:

{
          "@context": "https://www.w3.org/ns/did/v1",
          "id": "did:plexobject:xyz",
          "authentication": [{
            "id": "did:plexobject:xyz#keys-1",
            "type": "Ed25519VerificationKey2018",
            "controller": "did:plexobject:xyz",
            "publicKeyBase58" : "..."
          }],
          "service": [{
            "id":"did:plexobject:xyz#vcs",
            "type": "VerifiableCredentialService",
            "serviceEndpoint": "https://plexobject.com/vc/"
          }] 
}

In the above example, plexobject is considered a DID method and the DID standards allow customized methods as defined by DID registries.

Public Keys/DID Registries

The public key/DID registries allow lookup of public keys or decentralized identifiers (DIDs). A DID registry uses the DID method and protocol to interact with the remote client. Due to support for decentralized identities on blockchains, several DID registry implementations are built on blockchains such as uPort, Sovrin, Veres, etc. If a decentralized identifier doesn't need to be registered, other standards such as the Peer DID Method Specification and Key Event Receipt Infrastructure can be used to interact directly with the peer device using the DIDComm messaging protocol.

Digital Wallet

The keystore or digital wallet stores private keys, decentralized identifiers (DIDs), verifiable credentials, passwords, etc. The digital wallet uses a digital agent to provide access to internal data or generate new identities. In order to protect wallet data and recovery keys in case of device loss or data corruption, it can create an encrypted backup in the cloud or cold storage. Alternatively, a digital wallet may use social recovery by sharding the recovery key into multiple fragments using Shamir's Secret Sharing, which are then shared with trusted contacts.

Hardware security module

A hardware security module (HSM) is a physical computing device for storing digital keys and credentials cryptographically. It provides a secure enclave for execution and prevents unauthorized access to private data.

Multiple Devices

As decentralized applications rely on peer-to-peer communication, digital wallets can use Decentralized Key Management System (DKMS), Key Event Receipt Infrastructure (KERI) and DIDComm messaging protocols to support multiple devices.

Push Notification

Decentralized identity may use push notifications to verify identity or to alert the user when access needs to be approved.

Summary

Passwords are the bane of the online experience and incur a high cost of maintenance for service providers. Alternative user authentication mechanisms using passwordless authentication, biometrics and asymmetric key exchange provide better user experience and stronger security. Decentralized identity further improves interoperability and data privacy, where users can control who can access their private data and earn rewards for sharing it. As modern mobile devices already have built-in support for HSM-based secure enclaves, biometrics and push notifications, they provide a simple way to implement digital wallets and decentralized identity. This eliminates security breaches, identity theft and the need for data protection regulations and complex authentication protocols such as SAML/OIDC, because data is cryptographically protected on local storage. Though open standards for passwordless authentication and decentralized identity are still in development, they show a bright future for building the next generation of applications with better data security, privacy, protection, end-to-end encryption, portability, instant payments and user experience.

January 10, 2022

Promise of Web3 and Decentralization

Filed under: Encryption,Web3 — admin @ 8:59 am

The term Web3 was coined by Ethereum co-founder Gavin Wood to distinguish it from the Web2 architecture. Web2 popularized interactive and interoperable applications, whereas Web3 adds decentralization based on blockchain for protecting user data and privacy. I came across a number of recent posts [1, 2, 3, 4] that question the claims of Web3 architecture, including a recent blog from Moxie. As a co-founder of the Signal protocol with a strong cryptography background, Moxie makes the following observations:

  • People don’t want to run their own servers, and never will.
  • A protocol moves much more slowly than a platform.

Moxie shares his experience of making the dApps Autonomous Art and First Derivative and reaches the following conclusions:

  • NFT marketplaces look similar to centralized websites and databases rather than blockchain-based distributed state.
  • The blockchain lacks a client/server interface for interacting with the data in a trustless fashion.
  • The dApp clients often communicate with centralized services such as Infura or Alchemy and lack client-side validation.

Moxie identifies the following security gaps in NFT use-cases:

  • The blockchain does not store the data for an NFT; instead it only points to a URL and lacks a hash commitment for the data located at that URL.
  • Crypto wallets such as MetaMask connect to central services such as Etherscan and OpenSea without client-side validation.

Moxie claims that innovation at centralized services such as OpenSea is outpacing standards such as ERC-721, similar to how Gmail outpaced email standards. He claims that decentralization has little practical or pressing importance to the majority of people, who just want to make money. In the end, he concludes:

  • We should accept the premise that people will not run their own servers by designing systems that can distribute trust without having to distribute infrastructure.
  • We should try to reduce the burden of building software because distributed systems add considerable complexity.

Moxie raises valid concerns regarding the need for thin clients that can validate blockchain data with trustless security, as well as the use of message authentication codes and metadata when creating or accessing NFTs. However, he conflates Bitcoin, Ethereum and blockchain in general; there are existing standards such as Simplified Payment Verification that allow users to connect to the Bitcoin blockchain without running a full node. Due to differences in the Ethereum architecture, thin clients are more difficult to operate on its platform, but there are a number of alternatives such as Light Clients, Tendermint Light, Argent Wallet and Pocket Network. NFT data can be stored on Fleek or on decentralized storage solutions such as IPFS, Skynet, Sia, Celestia, Arweave and Filecoin. Some blockchains such as Tezos and Avax also offer layer-1 solutions for data storage. Despite the popularity of OpenSea with its centralized access, a number of alternatives are being worked on such as OpenDao and NFTX. Further, NFT standards such as ERC-2981 and ERC-1155 are in the works to support royalties and other enhancements.

When reviewing the Signal app, I observed that Moxie advocated centralized cryptography and actively tried to stop other open source projects such as LibreSignal from using Signal's servers. Moxie also declined collaboration with open messaging specifications such as Matrix. Centralized cryptography is an oxymoron, as truly secure systems don't require trust. Centralized messaging platforms such as WhatsApp impose forwarding limits, which require encrypting multiple messages using the same key. Despite the open source development of the Signal app, there is no guarantee that it runs the same code on its infrastructure (e.g. it stopped updating the open source project for a year) or that it doesn't log metadata when delivering messages.

Moxie's experience with centralized services within the blockchain ecosystem highlights a crucial area of work, and a number of solutions are being developed, admittedly with slow progress. However, his judgment is clouded by his prior position and beliefs, where he prefers centralized cryptography and systems over decentralized and federated systems. In my previous blog, I discussed the merits of decentralized messaging apps, which provide better support for data privacy, censorship resistance, anonymity, deniability and confidentiality with trustless security. This is essential for building Web3 infrastructure, and more work is needed to simplify the development and adoption of decentralized computing and trustless security.

September 10, 2021

Notes from “Monolith to Microservices”

Filed under: Uncategorized — admin @ 2:04 pm

I recently read Sam Newman's book Monolith to Microservices. I had read his previous book, Building Microservices, on a related topic, which focused more on the design and implementation of microservices, but there is some overlap between the two books.

Chapter 1 – Just Enough Microservices

The first chapter defines how microservices can be designed to be deployed independently by modeling around a business domain.

Benefits

The major benefits of microservices include:

Independent Deployability

Modeled Around a Business Domain

The author explains that one reason the three-tier architecture (UI/business logic/database) is so common is Conway's law, which posits that a system design mimics the organization's communication structure.

Own Their Own Data

In order to reduce coupling, the author recommends against sharing data between microservices.

Problems

Though microservices provide isolation and flexibility, they also add the complexity that comes with network communication such as higher latencies, distributed transactions, and handling network failures. Other problems include:

User Interface

The author also cautions against focusing only on the server side and leaving UI as monolithic.

Technology

The author also cautions against chasing shiny new technologies instead of leveraging what you already know.

Size

The size of a microservice depends on the context, but having a small size should not be the primary concern.

Ownership

The microservices architecture allows strong ownership, but it requires that services are designed around the business domain and product lines.

Monolith

The author explains monolithic apps where all code is packaged into a single process.

Modular Monolith

In a modular monolith, the code is broken into separate modules that are still combined into a single unit for deployment.

Distributed Monolith

If the boundaries of microservices are not loosely coupled, they can result in a distributed monolith that has the disadvantages of both monoliths and microservices.

Challenges and Benefits of Monolith

The author explains that monolithic design is not necessarily a legacy but a choice that depends on the context.

Cohesion and Coupling

He uses cohesion and coupling when defining microservices: stable systems encourage high cohesion and low coupling, which enables independent deployment and minimizes chatty services.

Implementation Coupling

The implementation coupling may be caused by sharing domain or database.

Temporal Coupling

Temporal coupling occurs when synchronous APIs are used to perform an operation.

Deployment Coupling

Deployment coupling adds the risk of deploying unchanged modules along with the changed ones.

Domain Coupling

Domain coupling is caused by sharing a full domain object instead of sharing events or trimming unrelated information.

Domain-Driven Design

The author reviews domain-driven design concepts such as aggregate and bounded context.

Aggregate

In DDD, an aggregate uses a state machine to manage the lifecycle of a group of objects; a microservice can be designed so that it handles the lifecycle and storage of one or more aggregates.

Bounded Context

A bounded context represents a boundary of the business domain that may contain one or more aggregates. These concepts can be used to define service boundaries so that each service is cohesive with a well-defined interface.

Chapter 2 – Planning a Migration

Chapter two defines a migration path for microservices by defining goals for the migration and why you should adopt a microservice architecture.

Why Might You Choose Microservices

Common reasons for adopting this architecture include:

  • improving autonomy
  • reduce time to market
  • scaling independently
  • improving robustness
  • scaling the number of developers while minimizing coordination
  • embracing new technologies

When Might Microservices Be a Bad Idea?

The author also describes scenarios when a microservice architecture is a bad idea such as:

  • when business domain is unclear
  • early adopting microservices in startups
  • customer-installed software.

Changing Organizations

The author describes some of the ways organizations can be persuaded to adopt this architecture using Dr. John Kotter's eight-step process:

  • establishing a sense of urgency
  • creating the guiding coalition
  • developing a vision and strategy
  • communicating the change vision
  • empowering employees for broad-based action
  • generating short-term wins
  • consolidating gains and producing more change
  • anchoring new approaches in the culture

Importance of Incremental Migration

Incremental migration to a microservice architecture is important so that you can release services to production and learn from actual use.

Cost of Change

The author explains the cost of change in terms of the reversible and irreversible decisions framework commonly used at Amazon.

Domain Driven Design

The author goes over domain-driven design again and shows how bounded contexts can define the boundaries of microservices. You can use event storming to define a shared domain model, where participants first define domain events and then group them. You can then pick a context that has few inbound dependencies to start with and use the strangler fig pattern for incremental migration. The author also shows a two-axis model for service decomposition that compares benefit versus ease of decomposition.

Reorganizing Teams

The chapter then reviews team restructuring so that you can reassign responsibilities toward a cross-functional model where teams fully own and operate specific microservices.

How Will You Know if the Transition is Working?

You can determine if the transition is working by:

  • having regular checkpoints
  • using quantitative measures
  • using qualitative measures
  • avoiding the sunk cost fallacy

Chapter 3 – Splitting the Monolith

Chapter three describes how to split the monolith in small steps.

To change the Monolith or Not?

You will have to decide whether to move existing code as is or reimplement.

Refactoring the Monolith

A key blocker in breaking the monolith might be tightly coupled design that requires some refactoring before the migration. The next step in this process might be a modular monolith where you have a single unit of deployment but with statically linked modules.

Pattern: Strangler Fig Application

The Strangler Fig Application incrementally migrates existing code by moving modules to external microservices. In some cases those microservices may need to invoke other common behavior in the monolith.

HTTP Proxy

If the monolith sits behind an HTTP reverse proxy that intercepts incoming calls, the proxy can be extended to redirect requests to the new service. If the new service chooses to implement a new protocol, it may require other changes to the proxy layer that add risk and go against the general recommendation of "keep the pipes dumb, the endpoints smart." One way to remediate this is to create a layer that converts the protocol from the legacy format to the new one, such as REST to gRPC.
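A minimal Python sketch of such a routing proxy follows, assuming the monolith listens on port 8080, the extracted service on 8081, and that requests under a hypothetical /orders path have been migrated:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

MONOLITH = "http://localhost:8080"
NEW_SERVICE = "http://localhost:8081"

class StranglerProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route migrated paths to the new service; everything else stays with the monolith.
        upstream = NEW_SERVICE if self.path.startswith("/orders") else MONOLITH
        with urlopen(Request(upstream + self.path)) as resp:
            body = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "text/plain"))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 9000), StranglerProxy).serve_forever()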

Service Mesh

Instead of a centralized proxy, you can use a service mesh, where a proxy such as Envoy is deployed as a sidecar alongside each service and handles communication on its behalf, while a separate control plane configures the proxies.

Message Interception and Content-based Routing

If a monolith is using messaging, you can intercept messages and use a content-based router to send them to the new service.

Pattern: UI Composition

UI composition looks at how the user interface can be migrated from a monolithic backend to a microservice architecture.

Page Composition

Page composition migrates one page at a time based on product verticals, ensuring that old page links are replaced with the new URLs when they change. You can choose a low-risk vertical for the UI migration to reduce the risk of breaking functionality.

Widget Composition

Widget composition reduces the UI migration risk by replacing a single widget at a time. For example, you may use Edge-Side Includes (ESI) to define a template in the web page, and a web server splices in this content. This technique is less common these days because the browser itself can make requests to populate a widget. Orbitz used this technique to render UI modules from a central orchestration service, but due to its monolithic design it became a bottleneck for coordinating changes; the central orchestration service was then migrated to microservices incrementally.

Mobile Applications

These UI composition changes can also be applied to mobile apps; however, a mobile app is itself a monolith and the whole application needs to be deployed. Some organizations such as Spotify allow dynamic changes from the server side.

Micro Frontends

Modern web browsers and standards such as the Web Components specification help build single-page apps and micro frontends using widget-based composition.

UI composition is highly effective when migrating vertical slices, provided you have the ability to change the existing user interface.

Pattern: Branch by Abstraction

Branch by abstraction can be used alongside the strangler fig pattern for incremental migration when the functionality is deeply nested and other developers may be making changes to the same codebase. In order to prevent merge conflicts from large changes and to minimize disruption for developers, you create an abstraction for the functionality to be replaced. This abstraction can then be implemented by both the existing code and the new implementation. You can then switch the abstraction over to the new implementation once it is ready and clean up the old implementation.

Step1: Create abstraction

Create an abstraction using “Extract Interface” refactoring.

Step2: Use abstraction

Refactor the existing clients to use the new abstraction point.

Step3: Create new implementation

Create a new implementation of the abstraction inside the monolith that calls the external service. You may simply return "Not Implemented" errors in the new implementation and deploy the code to production, as this new implementation isn't actually being called yet.

Step4: Switch implementation

Once the new implementation is done, you can switch the abstraction to point to the new implementation. You may also use feature flags to toggle back and forth quickly.

Step5: Clean up

Once the new implementation is fully working, the old implementation can be removed and you may also remove the abstraction if needed.

Fallback Mechanism

A variation of the branch by abstraction pattern, called verify branch by abstraction, implements live verification: if the new implementation fails, the old implementation is used instead.

Overall, branch by abstraction is a fairly general-purpose pattern and useful in most cases where you can change the existing code.
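A minimal Python sketch of the steps above follows, using a hypothetical notifications capability and a feature flag to switch between the old and new implementations:

from abc import ABC, abstractmethod

class Notifications(ABC):
    # Step 1: the abstraction extracted from the existing code.
    @abstractmethod
    def send(self, user_id: str, message: str) -> None: ...

class LegacyNotifications(Notifications):
    # Existing in-process monolith code, now behind the abstraction (step 2).
    def send(self, user_id: str, message: str) -> None:
        print(f"[monolith] notify {user_id}: {message}")

class NotificationsServiceClient(Notifications):
    # Step 3: new implementation that would call the extracted service; it may
    # initially raise NotImplementedError until the service is ready.
    def send(self, user_id: str, message: str) -> None:
        print(f"[service] notify {user_id}: {message}")

USE_NEW_NOTIFICATIONS = False   # step 4: feature flag toggled during migration

def notifications() -> Notifications:
    return NotificationsServiceClient() if USE_NEW_NOTIFICATIONS else LegacyNotifications()

notifications().send("user-42", "Your order has shipped")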

Pattern: Parallel Run

As the strangler fig pattern and branch by abstraction pattern allow both old and new implementations to coexist in production, you can use parallel runs to call both implementations and compare results. Typically, the old implementation is considered the source of truth until the new implementation can be verified (Examples: new pricing system, FTP upload).
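A minimal Python sketch of a parallel run follows, using a hypothetical pricing calculation; the old implementation remains the source of truth while mismatches with the new implementation are logged for review:

import logging

def old_price(order: dict) -> int:
    return order["quantity"] * order["unit_price_cents"]

def new_price(order: dict) -> int:
    return order["quantity"] * order["unit_price_cents"]   # candidate implementation

def price(order: dict) -> int:
    trusted = old_price(order)
    try:
        candidate = new_price(order)
        if candidate != trusted:
            logging.warning("parallel run mismatch: old=%s new=%s", trusted, candidate)
    except Exception:
        logging.exception("parallel run: new implementation failed")
    return trusted   # always serve the old result until the new one is verified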

N-Version Programming

Critical control systems such as flight systems use redundant subsystems to interpret signals and then use a quorum to determine the result.

Verification Techniques

In addition to simply comparing results, you may also need to compare nonfunctional aspects such as latency and failure rate.

Using Spies

A spy is used, as in unit testing, to stub out a piece of functionality such as communication with an external service and to verify the results (examples: sending an email or a remote notification). The spy is generally run as an external process, and you may use record/replay to replay the events for testing. GitHub's Scientist is a notable library for this pattern.

Dark Launching and Canary Releasing

A parallel run can be combined with a canary release to test the new implementation before releasing it to all users. Similarly, dark launching enables the new implementation only for selected users (A/B testing).

Parallel run requires a lot of effort so care must be taken when it’s used.

Pattern: Decorating Collaborator

If you need to trigger some behavior based on calls into the monolith, you can use the decorating collaborator pattern to attach new functionality (example: a loyalty program that earns points on orders). You may use a proxy to intercept the request and add the new functionality. On the downside, this may require additional data, which adds more complexity and latency.

This is a simple and elegant approach, but it works best when the required information can be extracted from the request.

Pattern: Change Data Capture

This pattern allows reacting to changes made in a datastore instead of intercepting the calls. The underlying capture system is coupled with the monolithic datastore (example: issuing loyalty cards via a trigger on insert that calls the loyalty card printing service).

Implementing Change Data Capture

Database triggers

For example, defining a trigger on a relational database that calls a service when a record is inserted.

Transaction log pollers

The transactional logs from transactional databases can be inspected by external tools to capture data changes and then pass this data to message brokers or other services.

Batch delta copier

This simply scans the database on a regular schedule for the data that has changed and copies this data to the destination.
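A minimal Python sketch of a batch delta copier follows, assuming a hypothetical orders table with an updated_at column and SQLite as the source and destination:

import sqlite3

def copy_delta(source: sqlite3.Connection, target: sqlite3.Connection, last_sync: str) -> str:
    rows = source.execute(
        "SELECT id, total_cents, updated_at FROM orders WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    for row_id, total_cents, updated_at in rows:
        # Upsert each changed row into the destination (e.g. a reporting database).
        target.execute(
            "INSERT INTO orders (id, total_cents, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET total_cents=excluded.total_cents, "
            "updated_at=excluded.updated_at",
            (row_id, total_cents, updated_at),
        )
    target.commit()
    # Return the new high-water mark to use on the next scheduled run.
    return max((r[2] for r in rows), default=last_sync)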

Change data capture has a lot of implementation challenges, so its use should be kept to a minimum.

Chapter 4 – Decomposing the Database

This chapter reviews patterns for managing a single shared database:

Pattern: The Shared Database

Implementation coupling is generally caused by a shared database because the ownership or usage of the database schema is not clear. This weakens the cohesion of business logic because the behavior is spread across multiple services. The author points out that the only two situations where you may share the database are when using the database for read-only static reference data or when a service directly exposes a database to handle multiple consumers (database as a service interface). Also, in some cases the work involved in splitting the database might be too large for incremental migration, in which case you may use coping patterns to manage the shared database.

Pattern: Database View

When sharing a database, you may define database views for different services to limit the data that is visible to the service.

The Database as a Public Contract

When sharing a database, it might be difficult to know who is reading or writing the data, especially when different applications use the same credentials. This also prevents changing the database schema because some of the applications might not be actively maintained.

Views to Present

One way to change the schema without breaking existing applications is to define views that look like the old schema. The database view may also project only limited information to implement a form of information hiding. In some cases you may need a materialized view to improve performance. You should use this pattern only when the existing monolithic schema can't be decomposed.

Pattern: Database Wrapping Service

The database wrapping service hides the database behind a service. This can be used when the underlying schema can't be broken down. It provides better support for customized projections and read/write access than database views.

Pattern: Database-as-a-Service Interface

In cases where you just need to query the database, you may create a dedicated reporting database that is exposed as a read-only endpoint. A mapping engine can listen for changes in the internal database and then persist them in the reporting database for querying by internal/external consumers.

Implementing a Mapping Engine

You may use a change data capture system (e.g. Debezium), a batch process to copy the data, or a message broker to listen for data events. This allows presenting data in a different database technology and provides more flexibility than database views.

Pattern: Aggregate Exposing Monolith

When a microservice needs data inside the monolith database, you can expose a service endpoint in the monolith to provide access to the domain aggregate. This pattern exposes the information through a well-defined interface and, despite the additional work, is safer than exposing the shared database.

Pattern: Change Data Ownership

If the monolith needs to access data in a shared database that should belong to the new microservice, then the monolith can be updated to call the new service and treat it as the source of truth.

Database Synchronization

The strangler fig pattern allows switching back to the monolith if we find an issue in the microservice, but it requires that the data between the monolith and the new service remains in sync. You may use a database view and a shared database in the short term until the migration is successfully completed. Alternatively, you may sync both databases via code, but that requires some careful thought.

Pattern: Synchronize Data in Application

Migrating data from one database to another requires performing synchronization between two data sources.

Step 1: Bulk Synchronize Data

You may take a snapshot of the source data and import it into the new data source while the existing system keeps running. As the source data may change while the import process is running, a change data capture process can be implemented to import the changes made since the snapshot. You can then deploy the new version of the application after this process.

Step 2: Synchronize on Write, Read from Old Schema

Once both databases are in sync, the new application writes all data to both databases while reading from the old database.

Step 3: Synchronize on Write, Read from New Schema

Once you verify that reads from the new database work, you can switch the application to use the new database as the source of truth.
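A minimal Python sketch of steps 2 and 3 follows, using hypothetical old_db/new_db client objects and a flag that controls which schema serves reads during the migration:

READ_FROM_NEW = False   # flipped only after writes to both databases are verified

def save_customer(old_db, new_db, customer: dict) -> None:
    # Steps 2 and 3: write to both schemas so they stay in sync during the migration.
    old_db.insert("customers", customer)
    new_db.insert("customers", customer)

def load_customer(old_db, new_db, customer_id: str) -> dict:
    # Step 2 reads from the old schema; step 3 switches reads to the new schema.
    return (new_db if READ_FROM_NEW else old_db).get("customers", customer_id)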

Pattern: Trace Write

This is a variation of the synchronize data in application pattern where the source of truth is moved incrementally and both sources are considered sources of truth during migration. For example, you may just migrate one domain aggregate at a time and other services may use either data source depending on the information they need.

Data Synchronization

If duplicated data becomes inconsistent, you may apply one of the following options:

  • Write to one source – data is synchronized to the other source after the write.
  • Send writes to both sources – the client makes a call to both sources or uses an intermediary to broadcast the request.
  • Send writes to either source – the data is synchronized behind the scenes.

The last option should be avoided as it requires two-way synchronization. In the other cases, there will be some delay before the data becomes consistent (eventual consistency).

Splitting Apart the Database

Physical versus Logical Database Separation

A physical database can host multiple logically separated schemas, so migration to separate physical databases can be planned later to improve robustness and throughput/latency. A single physical database can become a single point of failure, but this can be remedied by using multi-primary database modes.

Splitting the Database First, or the Code?

Split the Database First

This may cause multiple database calls instead of a single SELECT statement or break transactional integrity, which lets you detect performance problems earlier. However, it won't yield many short-term benefits.

Pattern: Repository per bounded context

Breaking down the repositories along the boundaries of bounded contexts helps understand the dependencies and coupling between tables.

Pattern: Database per bounded context

This allows you to decompose the database along the lines of bounded contexts so that each bounded context uses a distinct schema. This pattern can be applied in startups, where requirements may change drastically, so you can keep multiple schemas while using a monolithic architecture.

Split the Code First

This helps you understand the data dependencies of the new service and gives the benefit of independent deployment, thus offering short-term improvements.

Pattern: Monolith as data access layer

This allows creating an API in the monolith to provide access to the data.

Pattern: Multi-schema storage

When adding new tables while migrating from the monolith, add them to their own schema.

Split Database and Code Together

This splits the code and data at once, but it takes more effort.

Pattern: Split Table

This splits a single shared table into multiple tables based on boundaries of bounded contexts. However, this may break database transactions.

Pattern: Move Foreign-Key Relationship to Code

Moving the Join

Instead of using a database join, the new service will need to query the data from another service.

Database Consistency

As you can't rely on the referential integrity enforced by the database when using multiple schemas, you may need to enforce it in the application, for example:

  • check for existence before deletion, but this can be error prone.
  • handle deletion gracefully – for example by not showing missing information; services may also subscribe to add/delete events for related data (recommended).
  • don't allow deletion.

Shared Static Data

Duplicate static reference data

This will result in some data inconsistencies.

Pattern: Dedicated reference data schema

The shared reference data lives in its own dedicated schema; however, you may need to share the physical database.

Pattern: Static reference data library

This may not be suitable when using different programming languages and you will have to support multiple versions of the library.

Pattern: Static reference data service

This will add another dependency, but the data can be cached on the client side with update events to sync the local cache.
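A minimal Go sketch of such a client-side cache is shown below, assuming a hypothetical reference data service that broadcasts update events; the CountryCache type and event shape are illustrative only.

package refdata

import "sync"

// CountryCache is a hypothetical client-side cache of static reference data
// served by a reference data service. It is refreshed by update events
// instead of querying the service on every lookup.
type CountryCache struct {
	mu        sync.RWMutex
	countries map[string]string // code -> name
}

func NewCountryCache(initial map[string]string) *CountryCache {
	return &CountryCache{countries: initial}
}

// Lookup serves reads from the local cache.
func (c *CountryCache) Lookup(code string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	name, ok := c.countries[code]
	return name, ok
}

// OnUpdateEvent applies a change broadcast by the reference data service,
// keeping the local cache in sync without polling.
func (c *CountryCache) OnUpdateEvent(code, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.countries[code] = name
}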

Transactions

ACID Transactions

This will be hard with multiple schemas, but you may use a state such as PENDING/VERIFIED to manage atomicity.
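The sketch below illustrates the idea in Go, assuming a hypothetical order flow where the record stays PENDING until a dependent write in another schema or service succeeds; the repository and payment interfaces are assumptions made for illustration.

package checkout

import "context"

// OrderState models the intermediate states used to approximate atomicity
// when a logical operation spans multiple schemas or services.
type OrderState string

const (
	StatePending  OrderState = "PENDING"
	StateVerified OrderState = "VERIFIED"
)

type Order struct {
	ID     string
	Amount float64
}

// OrderRepo and PaymentClient are assumed interfaces for this sketch.
type OrderRepo interface {
	Save(ctx context.Context, o Order, s OrderState) error
	UpdateState(ctx context.Context, id string, s OrderState) error
}

type PaymentClient interface {
	Charge(ctx context.Context, orderID string, amount float64) error
}

// PlaceOrder records the order as PENDING, performs the dependent write
// (charging the payment in another schema/service), and only then marks the
// order VERIFIED. Readers treat PENDING records as incomplete.
func PlaceOrder(ctx context.Context, orders OrderRepo, payments PaymentClient, o Order) error {
	if err := orders.Save(ctx, o, StatePending); err != nil {
		return err
	}
	if err := payments.Charge(ctx, o.ID, o.Amount); err != nil {
		// Leave the order PENDING; a background job can retry or clean it up.
		return err
	}
	return orders.UpdateState(ctx, o.ID, StateVerified)
}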

Two-Phase Commits

This breaks a transaction into voting and commit phases; a rollback message is sent if any worker doesn’t vote to commit.

Distributed Transactions – just say No

Sagas

A saga, or long-lived transaction (LLT), uses an algorithm that coordinates multiple changes in state while avoiding locking resources. It breaks the LLT down into a sequence of short-lived transactions that can occur independently.

Saga Failure Modes

Sagas provide backward and forward recovery: backward recovery reverts a failure by using compensating transactions, while forward recovery allows continuation from the point of failure by retrying.

Note: A rollback will undo all previously executed transactions. You can move the steps that are most likely to fail earlier in the saga to avoid triggering compensating transactions across a large number of steps.

Implementing Sagas

  • Orchestrated sagas – you may use one or more orchestrators to break down the saga using BPM or other tools (see the sketch after this list).
  • Choreographed sagas – events are broadcast using a message broker. However, the scope of the saga transaction may not be readily visible.
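Here is a minimal Go sketch of an orchestrated saga with backward recovery through compensating transactions; the Step type and Run orchestrator are illustrative assumptions rather than a specific BPM tool.

package saga

import (
	"context"
	"fmt"
)

// Step is a single short-lived transaction in a saga along with its
// compensating action that reverts it.
type Step struct {
	Name       string
	Action     func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Run executes the steps in order (orchestrated saga). If any step fails,
// the compensating actions of previously completed steps are run in reverse
// order (backward recovery).
func Run(ctx context.Context, steps []Step) error {
	completed := make([]Step, 0, len(steps))
	for _, s := range steps {
		if err := s.Action(ctx); err != nil {
			for i := len(completed) - 1; i >= 0; i-- {
				if cerr := completed[i].Compensate(ctx); cerr != nil {
					return fmt.Errorf("step %s failed (%v) and compensation of %s also failed: %w",
						s.Name, err, completed[i].Name, cerr)
				}
			}
			return fmt.Errorf("saga aborted at step %s: %w", s.Name, err)
		}
		completed = append(completed, s)
	}
	return nil
}

In an e-commerce flow, each step (reserve inventory, charge payment, create order) would be registered with both its action and its compensation.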

Chapter 5 – Growing Pains

More Services, More Pain

Ownership at Scale

  • Strong code ownership – large scale microservices
  • Weak code ownership
  • Collective code ownership

Breaking Changes

A change in a microservice may break backward compatibility for its consumers. You can ensure that you don’t break contracts when making changes to the services by using explicit schemas and maintaining semantics. You may also support multiple versions of the service if you need to break backward compatibility, allowing existing clients to migrate to the new version.

Reporting

A monolithic database simplifies reporting but with different schemas and databases, you may need to build a reporting database to aggregate data from different services.

Monitoring and Troubleshooting

A monolithic app is easier to monitor, but with multiple microservices you will need to use log aggregation and define a correlation id with tracing tools to see the events of a transaction across different services.
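A minimal Go sketch of propagating a correlation id through HTTP middleware follows; the header name is an assumption, and in practice a tracing library such as OpenTelemetry would manage trace identifiers for you. It uses the github.com/google/uuid module.

package tracing

import (
	"net/http"

	"github.com/google/uuid"
)

// CorrelationID is a hypothetical header name used for this sketch.
const CorrelationID = "X-Correlation-ID"

// Middleware ensures every incoming request carries a correlation id, so
// that log aggregation can stitch together the events of one transaction
// across services. The id is echoed back in the response and should be
// propagated on any outgoing calls.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(CorrelationID)
		if id == "" {
			id = uuid.NewString()
			r.Header.Set(CorrelationID, id)
		}
		w.Header().Set(CorrelationID, id)
		next.ServeHTTP(w, r)
	})
}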

Test in Production

You may use synthetic transactions to perform end-to-end testing in production.

Observability

You can gather information about the system through traces, logs, and other system events.

Local Developer Experience

When setting up a local environment, you may need to set up a large number of services, which can slow down the development process. You may need to stub out services or use tools such as the Telepresence proxy to call remote services.

Running Too Many Things

You may use Kubernetes to manage the deployment and troubleshooting of the microservices.

End-to-End Testing

A microservice architecture increases the scope of end-to-end testing, and you may have to debug false negatives due to environmental issues. You can use the following approaches for end-to-end testing:

  • Limit scope of functional automated tests
  • Use consumer-driven contracts
  • Use automated release remediation and progressive delivery
  • Continually refine your quality feedback cycles

Global versus Local Optimization

You may have to solve the same problem, such as service deployment, across multiple services. You may need to classify decisions as irreversible or reversible and hold a broader discussion for irreversible decisions.

Robustness and Resiliency

Distributed systems exhibit a variety of failures, so you may need to run redundant services, use asynchronous communication, or apply other patterns such as circuit breakers and retries.
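Below is a minimal, non-concurrent Go sketch that combines retries with a simple circuit breaker; production systems would typically use a hardened resilience library, and the thresholds and backoff here are arbitrary assumptions.

package resilience

import (
	"errors"
	"time"
)

// Breaker is a minimal circuit breaker: after maxFailures consecutive
// failed calls it opens and rejects further calls until the cooldown elapses.
type Breaker struct {
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

var ErrOpen = errors.New("circuit breaker is open")

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn with up to `retries` additional attempts and tracks failures
// so that an unhealthy downstream service is not hammered while it recovers.
func (b *Breaker) Call(retries int, fn func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		return ErrOpen
	}
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = fn(); err == nil {
			b.failures = 0
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond) // simple linear backoff
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openedAt = time.Now()
	}
	return err
}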

Orphaned Services

Orphaned services don’t have clear owners, so you can’t immediately fix them if they stop working. You may need a service registry or other tools to track services; however, some services may predate these tools.
