Shahzad Bhatti Welcome to my ramblings and rants!

August 23, 2023

Failures in MicroService Architecture

Filed under: Computing,Microservices — admin @ 12:54 pm

Microservice architecture is an evolution of Monolithic and Service-Oriented Architecture (SOA), in which an application is built as a collection of loosely coupled, independently deployable services. Each microservice usually corresponds to a specific business functionality and can be developed, deployed, and scaled independently. In contrast to Monolithic Architecture, which lacks modularity, and SOA, which is more coarse-grained and prone to a single point of failure, Microservice architecture offers better support for modularity, independent deployment, and distributed development, with teams often organized around services in line with Conway’s law. However, Microservice architecture introduces several challenges in terms of:

  • Network Complexity: Microservices communicate over the network, increasing the likelihood of network-related issues (See Fallacies of distributed computing).
  • Distributed System Challenges: Managing a distributed system introduces complexities in terms of synchronization, data consistency, and handling partial failures.
  • Monitoring and Troubleshooting: Due to the distributed nature, monitoring and troubleshooting can become more complex, requiring specialized tools and practices.
  • Potential for Cascading Failures: Failure in one service can lead to failures in dependent services if not handled properly.
Microservices Challenges

Faults, Errors and Failures

The challenges associated with microservice architecture manifest at different stages, and understanding them requires the concepts of faults, errors, and failures:

1. Faults:

Faults in a microservice architecture could originate from various sources, including:

  • Software Bugs: A defect in one service may cause incorrect behavior but remain dormant until triggered.
  • Network Issues: Problems in network connectivity can be considered faults, waiting to lead to errors.
  • Configuration Mistakes: Incorrect configuration of a service is another potential fault.
  • Dependency Vulnerabilities: A weakness or vulnerability in an underlying library or service that hasn’t yet caused a problem.

The following are the major concerns that a Microservice architecture must address when managing faults:

  • Loose Coupling and Independence: With services being independent, a fault in one may not necessarily impact others, provided the system is designed with proper isolation.
  • Complexity: Managing and predicting faults across multiple services and their interactions can be complex.
  • Isolation: Properly isolating faults can prevent them from causing widespread problems. For example, a fault in one service shouldn’t impact others if isolation is well implemented.
  • Detecting and Managing Faults: Given the distributed nature of microservices, detecting and managing faults can be complex.

2. Errors:

When a fault gets activated under certain conditions, it leads to an error. In microservices, errors can manifest as:

  • Communication Errors: Failure in service-to-service communication due to network problems or incompatible data formats.
  • Data Inconsistency: An error in one service leading to inconsistent data across different parts of the system.
  • Service Unavailability: A service failing to respond due to an internal error.

A Microservice architecture should account for the following when diagnosing and handling errors:

  • Propagation: Errors can propagate quickly across services, leading to cascading failures if not handled properly.
  • Transient Errors: Network-related or temporary errors might be resolved by retries, adding complexity to error handling.
  • Monitoring and Logging Challenges: Understanding and diagnosing errors in a distributed system can be more complex.

3. Failures:

Failure is the inability of a system to perform its required function due to unhandled errors. In microservices, this might include:

  • Partial Failure: Failure of one or more services leading to degradation in functionality.
  • Total System Failure: Cascading errors causing the entire system to become unavailable.

Further, failure handling in Microservice architecture poses additional challenges such as:

  • Cascading Failures: A failure in one service might lead to failures in others, particularly if dependencies are tightly interwoven and error handling is insufficient.
  • Complexity in Recovery: Coordinating recovery across multiple services can be challenging.

The faults and errors can be further categorized into customer related and system related:

  • Customer-Related: These stem from incorrect actions taken by the client, such as improper usage of an API, sending incorrect input data, calling an endpoint that doesn’t exist, or attempting an action without proper authorization. Since these errors are due to incorrect usage, simply retrying the same request without fixing the underlying issue is unlikely to resolve the error. For example, if a customer sends an invalid parameter, retrying the request with the same invalid parameter will produce the same error. In many cases, customer errors are returned with specific HTTP status codes in the 4xx range (e.g., 400 Bad Request, 403 Forbidden), indicating that the client must modify the request before retrying.
  • System-Related: These can stem from various aspects of the microservices, such as coding bugs, network misconfigurations, a timeout occurring, or issues with underlying hardware. These errors are typically not the fault of the client and may be transient, meaning they could resolve themselves over time or upon retrying. System errors often correlate with HTTP status codes in the 5xx range (e.g., 500 Internal Server Error, 503 Service Unavailable), indicating an issue on the server side. In many cases, these requests can be retried after a short delay, possibly succeeding if the underlying issue was temporary.
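
The customer/system distinction above can be sketched as a small status-code classifier. This is a minimal illustration, not a standard: the function names and the exact set of retryable codes are assumptions, and treating 408 and 429 as retryable even though they are in the 4xx range is a common policy choice rather than something the text prescribes.

```python
# Assumption: which status codes count as transient is a per-service policy choice.
RETRYABLE_STATUSES = {
    408,  # Request Timeout
    429,  # Too Many Requests
    500,  # Internal Server Error
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
}


def classify(status_code: int) -> str:
    """Return 'success', 'customer_error', 'system_error', or 'other'."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "customer_error"  # the client must change the request
    if 500 <= status_code < 600:
        return "system_error"    # the server side is at fault
    return "other"


def is_retryable(status_code: int) -> bool:
    """True when retrying the same, unchanged request might succeed."""
    return status_code in RETRYABLE_STATUSES
```

A caller would typically log customer errors for the client to fix and hand only retryable responses to a retry policy.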

Causes Related to Faults, Errors and Failures

The challenges in microservice architecture are rooted in its distributed nature, complexity, and interdependence of services. Here are common causes of the challenges related to faults, errors, and failures, including the distinction between customer and system errors:

1. Network Complexity:

  • Cause: Multiple services communicate over the network, and one service may be unable to reach another. For example, Amazon Simple Storage Service (S3) had an outage on February 28, 2017, and many services that were tightly coupled to it failed as well due to limited fault isolation. The post-mortem analysis recommended proper fault isolation, redundancy across regions, and better understanding and management of complex inter-service dependencies.
  • Challenges: Leads to network-related issues, such as latency, bandwidth limitations, and network partitioning, causing both system errors and potentially triggering faults.

2. Data Consistency:

  • Cause: Maintaining data consistency across services that use different databases. Inconsistency can occur when a microservice stores data in multiple data stores without proper anti-entropy validation, or when it relies on eventual consistency. For example, a trading firm might use the CQRS pattern, where transaction events are persisted in a write datastore and then replicated to a query datastore, so a user may not see up-to-date data when querying recently stored data.
  • Challenges: Ensuring transactional integrity and eventual consistency can be complex, leading to system errors if not managed properly.

3. Service Dependencies:

  • Cause: Tight coupling between services. For example, an online travel booking platform might deploy multiple microservices for managing hotel bookings, flight reservations, car rentals, etc. If these services are tightly coupled, then a minor update to the flight reservation service may unintentionally break compatibility with the hotel booking service.
  • Challenges: Cascading failures and difficulty in isolating faults. A failure in one service can easily propagate to others if not properly isolated.

4. Scalability Issues:

  • Cause: Individual services may require different scaling strategies. For example, on October 29, 2012, Netflix suffered a major outage when, due to a scaling issue, the Amazon Elastic Load Balancer (ELB) that was used for routing couldn’t route requests effectively. The lessons learned from the incident included improved scaling strategies, redundancy and failover planning, and monitoring and alerting enhancements.
  • Challenges: Implementing effective scaling without affecting other services or overall system stability. Mismanagement can lead to system errors or even failures.

5. Security Concerns:

  • Cause: Protecting the integrity and confidentiality of data as it moves between services. For example, on July 19, 2019, Capital One discovered a major breach of its data stored on AWS. A former AWS employee discovered a misconfigured firewall and exploited it, accessing sensitive customer data. The incident caused significant reputational damage and legal consequences for Capital One, which then undertook a broader review of its security practices, emphasizing the need for proper configuration, monitoring, and adherence to best practices.
  • Challenges: Security breaches or misconfigurations could be seen as faults, leading to potential system errors or failures.

6. Monitoring and Logging:

  • Cause: The need for proper monitoring and logging across various independent services to gain insights when microservices are misbehaving. For example, if a service silently behaves erratically and causes intermittent failures for customers, the lack of proper monitoring and logging can turn it into a more significant outage that takes longer to diagnose and resolve.
  • Challenges: Difficulty in tracking and diagnosing both system and customer errors across different services.

7. Configuration Management:

  • Cause: Managing configuration across multiple services. For example, on July 20, 2021, WizCase discovered unsecured Amazon S3 buckets containing data from more than 80 US locales, predominantly in New England. The misconfigured S3 buckets included more than 1,000GB of data and more than 1.6 million files. Residents’ actual addresses, telephone numbers, IDs, and tax papers were all exposed due to the misconfiguration. On October 4, 2021, Facebook had an outage of nearly six hours due to misconfigured DNS and BGP settings. Oasis cites misconfiguration as a top root cause for security incidents and events.
  • Challenges: Mistakes in configuration management can be considered as faults, leading to errors and potentially failures in one or more services.

8. API Misuse (Customer Errors):

  • Cause: Clients using the API incorrectly, sending improper requests. For example, on October 21, 2016, Dyn experienced a massive Distributed Denial of Service (DDoS) attack, rendering a significant portion of the internet inaccessible for several hours. High-profile sites, including Twitter, Reddit, and Netflix, experienced outages. The DDoS attack was primarily driven by the Mirai botnet, which consisted of a large number of compromised Internet of Things (IoT) devices like security cameras, DVRs, and routers. These devices were vulnerable because of default or easily guessable passwords. The attackers took advantage of these compromised devices and used them to send massive amounts of traffic to Dyn’s servers, especially by abusing the devices’ APIs to make repeated and aggressive requests. The lessons learned included better IoT security, strengthening infrastructure and adding API guardrails such as built-in security and rate-limiting.
  • Challenges: Handling these errors gracefully to guide clients in correcting their requests.

9. Service Versioning:

  • Cause: Multiple versions of services running simultaneously. For example, conflicts between the old and new versions may lead to unexpected behavior in the system. Requests routed to the new version might be handled differently than those routed to the old version, causing inconsistencies.
  • Challenges: Compatibility issues between different versions can lead to system errors.

10. Diverse Technology Stack:

  • Cause: Different services might use different languages, frameworks, or technologies. For example, a diverse technology stack may lead to inconsistent monitoring and logging and to different vulnerability profiles and security patching requirements, increasing the complexity of managing, monitoring, and securing the entire system.
  • Challenges: Increases complexity in maintaining, scaling, and securing the system, which can lead to faults.

11. Human Factors:

  • Cause: Errors in development, testing, deployment, or operations. For example, Amazon Simple Storage Service (S3) had an outage on February 28, 2017, which was caused by a human error during the execution of an operational command. A typo in a command that an Amazon team member executed to take a small number of servers offline inadvertently removed more servers than intended, and many services that were tightly coupled to S3 failed as well due to limited fault isolation. The post-mortem analysis recommended implementing safeguards against both human errors and system failures.
  • Challenges: Human mistakes can introduce faults, lead to both customer and system errors, and even cause failures if not managed properly.

12. Lack of Adequate Testing:

  • Cause: Insufficient unit, integration, functional, and canary testing. For example, on August 1, 2012, Knight Capital deployed untested software to a production environment, resulting in a malfunction in their automated trading system. The flawed system started buying and selling millions of shares at incorrect prices. Within 45 minutes, the company incurred a loss of $440 million. The code that was deployed to production was not properly tested. It contained old, unused code that should have been removed, and the new code’s interaction with existing systems was not fully understood or verified. The lessons learned included ensuring that all code, especially that which controls critical functions, is thoroughly tested, implementing robust and consistent deployment procedures to ensure that changes are rolled out uniformly across all relevant systems, and having mechanisms in place to quickly detect and halt erroneous behavior, such as a “kill switch” for automated trading systems.
  • Challenges: Leads to undetected faults, resulting in both system and customer errors, and potentially, failures in production.

13. Inadequate Alarms and Health Checks:

  • Cause: Lack of proper monitoring and health check mechanisms. For example, on January 31, 2017, GitLab suffered a severe data loss incident. An engineer accidentally deleted a production database while attempting to address some performance issues. This action resulted in a loss of 300GB of user data. GitLab’s monitoring and alerting system did not properly notify the team of the underlying issues that were affecting database performance. The lack of clear alarms and health checks contributed to the confusion and missteps that led to the incident. The lessons learned included ensuring that health checks and alarms are configured to detect and alert on all critical conditions, and establishing and enforcing clear procedures and protocols for handling critical production systems, including guidelines for dealing with performance issues and other emergencies.
  • Challenges: Delays in identifying and responding to faults and errors, which can exacerbate failures.

14. Lack of Code Review and Quality Control:

  • Cause: Insufficient scrutiny during the development process. For example, on March 14, 2012, the Heartbleed bug was introduced with the release of OpenSSL version 1.0.1, but it was not discovered until April 2014. The bug allowed attackers to read sensitive data from the memory of millions of web servers, potentially exposing passwords, private keys, and other sensitive information. The bug was introduced through a single coding error, and no rigorous code review process was in place to catch such a critical mistake. The lessons learned included implementing a thorough code review process and establishing robust testing and quality control measures to ensure that all code, especially changes to security-critical areas, is rigorously verified.
  • Challenges: Increases the likelihood of introducing faults and bugs into the system, leading to potential errors and failures.

15. Lack of Proper Test Environment:

  • Cause: Absence of a representative testing environment. For example, on August 1, 2012, Knight Capital deployed new software to a production server that contained obsolete and nonfunctional code. This code accidentally got activated, leading to unintended trades flooding the market. The algorithm was buying high and selling low, the exact opposite of a profitable strategy. The company did not have a proper testing environment that accurately reflected the production environment. Therefore, the erroneous code was not caught during the testing phase. The lessons learned included ensuring a robust and realistic testing environment that accurately mimics the production system, implementing strict and well-documented deployment procedures and implementing real-time monitoring and alerting to catch unusual or erroneous system behavior.
  • Challenges: Can lead to unexpected behavior in production due to discrepancies between test and production environments.

16. Elevated Permissions:

  • Cause: Overly permissive access controls. For example, Capital One announced that on July 19, 2019, it had determined that an unauthorized individual had accessed the personal information of approximately 106 million customers and applicants. The breach occurred when a former employee of a third-party contractor exploited a misconfigured firewall, gaining access to data stored on Amazon’s cloud computing platform, AWS. The lessons learned included implementing the principle of least privilege, robust monitoring to detect and alert on suspicious activities quickly, and evaluating the security practices of third-party contractors and vendors.
  • Challenges: Increased risk of security breaches and unauthorized actions, potentially leading to system errors and failures.

17. Single Point of Failure:

  • Cause: Reliance on a single component without redundancy. For example, on January 31, 2017, GitLab experienced a severe data loss incident when an engineer, while attempting to remove a secondary database, deleted the primary production database instead. The primary production database was a single point of failure in the system. The deletion of this database instantly brought down the entire service. Approximately 300GB of data was permanently lost, including issues, merge requests, user accounts, comments, and more. The lessons learned included eliminating single points of failure, implementing safeguards against human error, and testing backups.
  • Challenges: A failure in one part can bring down the entire system, leading to cascading failures.

18. Large Blast Radius:

  • Cause: Lack of proper containment and isolation strategies. For example, on September 4, 2018, the Azure South Central U.S. datacenter experienced a significant outage affecting multiple regions. A severe weather event in the southern United States led to cooling failures in one of Azure’s data centers. Automated systems responded to the cooling failure by shifting loads to a data center in a neighboring region. This transfer was larger and faster than anticipated, leading to an overload in the secondary region. The lessons learned included deep understanding of dependencies and failure modes, limiting the blast radius, and continuous improvements in resilience.
  • Challenges: An error in one part can affect a disproportionate part of the system, magnifying the impact of failures.

19. Throttling and Limits Issues:

  • Cause: Inadequate management of request rates and quotas. For example, on February 28, 2017, AWS S3 experienced a significant disruption in the US-EAST-1 region, causing widespread effects on many dependent systems. A command to take a small number of servers offline for inspection was executed incorrectly, leading to a much larger removal of capacity than intended. Once the servers were inadvertently removed, the S3 subsystems had to be restarted. The restart process included safety checks, which required specific metadata. However, the capacity removal caused these metadata requests to be throttled. Many other systems were dependent on the throttled subsystem, and as the throttling persisted, it led to a cascading failure. The lessons learned included safeguards against human errors, dependency analysis, and testing throttling mechanisms.
  • Challenges: Can lead to service degradation or failure under heavy load.
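
One common way to manage request rates is a token bucket. The sketch below is illustrative only: the `TokenBucket` class, its parameters, and the injectable `clock` are hypothetical names chosen for this example, and it is not thread-safe.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests need not sleep
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller should reject, e.g. with HTTP 429
```

Rejecting early with a clear throttling response lets well-behaved clients back off instead of piling more load onto a struggling dependency.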

20. Rushed Releases:

  • Cause: Releasing changes without proper testing or review. For example, on January 31, 2017, GitLab experienced a severe data loss incident. A series of events that started with a rushed release led to an engineer accidentally deleting a production database, resulting in the loss of 300GB of user data. The team was working on addressing performance issues and pushed a release without properly assessing the risks and potential side effects. The lessons learned included avoiding rushed decisions, clear separation of environments, proper access controls, and a robust backup strategy.
  • Challenges: Increases the likelihood of introducing faults and errors into the system.

21. Excessive Logging:

  • Cause: Logging more information than necessary. For example, excessive logs can result in disk space exhaustion, performance degradation, service disruption or high operating cost due to additional network bandwidth and storage costs.
  • Challenges: Can lead to performance degradation and difficulty in identifying relevant information.

22. Circuit Breaker Mismanagement:

  • Cause: Incorrect implementation or tuning of circuit breakers. For example, on November 18, 2014, Microsoft Azure suffered a substantial global outage affecting multiple services. An update to Azure’s Storage Service included a change to the configuration file governing the circuit breaker settings. The flawed update led to an overly aggressive tripping of circuit breakers, which, in turn, led to a loss of access to the blob front-ends. The lessons learned included incremental rollouts, thorough testing of configuration changes, and a clear understanding of component interdependencies.
  • Challenges: Potential system errors or failure to protect the system during abnormal conditions.
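
For readers unfamiliar with the pattern, a circuit breaker can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: the class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are hypothetical, and real implementations usually add locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The two tunables are exactly the kind of setting the text warns about: a threshold that is too low trips on ordinary noise, while one that is too high fails to protect the dependency.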

23. Retry Mechanism:

  • Cause: Mismanagement of retry logic. For example, on September 20, 2015, an outage in DynamoDB led to widespread disruption across various AWS services. The root cause was traced back to issues related to the retry mechanism. A small error in the system led to a slight increase in latency. Due to an aggressive retry mechanism, the slightly increased latency led to a thundering herd problem where many clients retried their requests almost simultaneously. The absence of jitter (randomization) in the retry delays exacerbated this surge of requests because retries from different clients were synchronized. The lessons learned included proper retry logic with jitter, understanding dependencies, and enhancements to monitoring and alerting.
  • Challenges: Can exacerbate network congestion and failure conditions, particularly without proper jitter implementation.
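
Retry logic with jitter can be sketched as below. This is a minimal example using exponential backoff with "full jitter" (each delay is drawn uniformly between zero and the current backoff ceiling); the function names, defaults, and the `is_transient` predicate are assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield jittered delays: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(func, is_transient, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(); on a transient exception, wait a jittered delay and retry.

    `attempts` is the total number of calls, so at most attempts - 1 retries.
    """
    delays = backoff_delays(base, cap, attempts - 1, rng)
    while True:
        try:
            return func()
        except Exception as exc:
            delay = next(delays, None)
            # Give up when retries are exhausted or the error is not transient.
            if delay is None or not is_transient(exc):
                raise
            sleep(delay)
```

Because each client draws its own random delay, retries from many clients spread out over time instead of arriving in synchronized waves.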

24. Backward Incompatible Changes:

  • Cause: Introducing changes that are not backward compatible. For example, on August 1, 2012, Knight Capital deployed new software to a production environment. This software was intended to replace old, unused code but was instead activated, triggering old, defective functionality. The new software was not compatible with the existing system, and instead of being deactivated, the old code paths were unintentionally activated. The incorrect software operation caused Knight Capital to lose over $440 million in just 45 minutes. The lessons learned included proper testing, processes for deprecating old code, and robust monitoring and rapid response mechanisms.
  • Challenges: Can break existing clients and other services, leading to system errors.

25. Inadequate Capacity Planning:

  • Cause: Failure to plan for growth or spikes in usage. For example, on October 21, 2018, GitHub experienced a major outage that lasted for over 24 hours. During this period, various services within GitHub were unavailable or severely degraded. The incident was caused by inadequate capacity planning as GitHub’s database was operating close to its capacity. A routine maintenance task to replace a failing 100G network link set off a series of events that caused the database to failover to a secondary. This secondary didn’t have enough capacity to handle the production load, leading to cascading failures. The lessons learned included capacity planning, regular review of automated systems and building redundancy in critical components.
  • Challenges: Can lead to system degradation or failure under increased load.

26. Lack of Failover Isolation:

  • Cause: Insufficient isolation between primary and failover mechanisms. For example, on September 4, 2018, the Azure South Central U.S. datacenter experienced a significant outage. The incident was caused by a lightning strike, which resulted in a voltage swell that impacted the cooling systems, causing them to shut down. Many services that were only hosted in this particular region went down completely, showing a lack of failover isolation between regions. The lessons learned included redundancy in critical systems, cross-region failover strategies, and regular testing of failover procedures.
  • Challenges: Can lead to cascading failures if both primary and failover systems are affected simultaneously.

27. Noise in Metrics and Alarms:

  • Cause: Too many irrelevant or overly sensitive alarms and metrics. Over time, the number of metrics and alarms may grow to a point where there are thousands of alerts firing every day, many of them false positives or insignificant. The noise level in the alerting system becomes overwhelming. For example, if many alarms are set with thresholds too close to regular operating parameters, they may cause frequent false positives. The operations team becomes desensitized to alerts, treating them as “normal.” The lessons learned include focusing on the most meaningful metrics and alerts, and regularly reviewing and adjusting alarm thresholds and relevance to ensure they remain meaningful.
  • Challenges: Can lead to alert fatigue and hinder the prompt detection and response to real issues, increasing the risk of system errors and failures going unaddressed.
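
One way to reduce false positives from thresholds that sit close to normal operating values is to fire only when several recent datapoints breach the threshold. The sketch below is a hypothetical illustration; the class name and the `m`-of-`n` parameters are assumptions, not a reference to any particular monitoring product.

```python
from collections import deque


class DebouncedAlarm:
    """Fire only when at least `m` of the last `n` datapoints breach the threshold."""

    def __init__(self, threshold: float, m: int = 3, n: int = 5):
        self.threshold = threshold
        self.m = m
        self.window = deque(maxlen=n)  # keeps only the last n breach flags

    def observe(self, value: float) -> bool:
        self.window.append(value > self.threshold)
        return sum(self.window) >= self.m
```

A single spike no longer pages anyone, while a sustained breach still does; the trade-off is a slightly slower alert.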

28. Variations Across Environments:

  • Cause: Differences between development, staging, and production environments. For example, a development team might be using development, testing, staging, and production environments, allowing them to safely develop, test, and deploy their services. However, the production environment might use different versions of a database or middleware, a different network topology, or different data, causing unexpected behavior that leads to a significant outage.
  • Challenges: May lead to unexpected behavior and system errors, as code behaves differently in production compared to the test environment.

29. Inadequate Training or Documentation:

  • Cause: Lack of proper training, guidelines, or documentation for developers and operations teams. For example, if the internal team is not properly trained on the complexities of the microservices architecture, it can lead to misunderstandings of how services interact. Without proper training or documentation, the team may take a significant amount of time to identify the root causes of the issues.
  • Challenges: Can lead to human-induced faults, misconfiguration, and inadequate response to incidents, resulting in errors and failures.

30. Self-Inflicted Traffic Surge:

  • Cause: Uncontrolled or unexpected increase in internal traffic, such as excessive inter-service calls. For example, on January 31, 2017, GitLab experienced an incident that, while primarily related to data deletion, also demonstrated a form of self-inflicted traffic surge. While the team was attempting to restore from a backup, a misconfiguration in the application caused a rapid increase in requests to the database. The lessons learned included testing configurations in an environment that mimics production, robust alerting and monitoring, and a clear understanding of interactions between components.
  • Challenges: Can overload services, causing system errors, degradation, or even failure.

31. Lack of Phased Deployment:

  • Cause: Releasing changes to all instances simultaneously without gradual rollout. For example, on August 1, 2012, Knight Capital deployed new software to a production environment. The software was untested in this particular environment, and an old, incompatible module was accidentally activated. The software was deployed to all servers simultaneously instead of gradually rolling it out to observe potential issues. The incorrect software operation caused Knight Capital to accumulate a massive unintended position in the market, resulting in a loss of over $440 million and a significant impact to its reputation. The lessons learned included phased deployment, thorough testing and understanding dependencies.
  • Challenges: Increases the risk of widespread system errors or failures if a newly introduced fault is triggered.
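
A phased rollout can be sketched as deterministic percentage bucketing plus a simple health gate. Everything here is hypothetical and for illustration only: the stage percentages, the `salt` value, the error-rate limit, and the function names are assumptions.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # hypothetical rollout percentages


def in_rollout(unit_id: str, percent: int, salt: str = "release-1") -> bool:
    """Map unit_id to one of 100 stable buckets and enable the first `percent`."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def next_percent(current: int, error_rate: float, max_error_rate: float = 0.01) -> int:
    """Advance to the next stage when healthy; drop to 0 (roll back) otherwise."""
    if error_rate > max_error_rate:
        return 0
    later = [p for p in STAGES if p > current]
    return later[0] if later else current
```

Because the bucket for a given host or customer is stable, anything enabled at 10% stays enabled at 50%, so each stage only adds new exposure.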

32. Broken Rollback Mechanisms:

  • Cause: Inability to revert to a previous stable state due to faulty rollback procedures. For example, a team deploys a new version of a microservice, issues are detected after the deployment, and the decision is made to roll back. However, the rollback process fails, exacerbating the problem and leading to an extended outage.
  • Challenges: Can exacerbate system errors or failures during an incident, as recovery options are limited.

33. Inappropriate Timing:

  • Cause: Deploying new changes during critical periods such as Black Friday. For example, on Black Friday in 2014, Best Buy’s website experienced multiple outages throughout the day, which were caused by some maintenance or deployment actions that coincided with the traffic surge. Best Buy took the site down intermittently to address the issues, which, while necessary, only lengthened the outages for customers. The lessons learned included avoiding deployments on critical days, better capacity planning, and employing rollback strategies.
  • Challenges: Deploying significant changes or conducting maintenance during high-traffic or critical periods can lead to catastrophic failures.

The myriad potential challenges in microservice architecture reflect the complexity and diversity of factors that must be considered in design, development, deployment, and operation. By recognizing and addressing these causes proactively through robust practices, thorough testing, careful planning, and vigilant monitoring, teams can greatly enhance the resilience, reliability, and robustness of their microservice-based systems.

Incident Metrics

To prevent the common causes of service faults and errors, a Microservice environment can track the following metrics:

1. MTBF (Mean Time Between Failures):

  • Prevent: By analyzing MTBF, you can identify patterns in system failures and proactively address underlying issues to enhance stability.
  • Detect: Monitoring changes in MTBF may help in early detection of emerging problems or degradation in system health.
  • Resolve: Understanding MTBF can guide investments in redundancy and failover mechanisms to ensure continuous service even when individual components fail.

2. MTTR (Mean Time to Repair):

  • Prevent: Reducing MTTR often involves improving procedures and tools for diagnosing and fixing issues, which also aids in preventing failures by addressing underlying faults more efficiently.
  • Detect: A sudden increase in MTTR can signal that something has changed within the system, such as a new fault that’s harder to diagnose, triggering a deeper investigation.
  • Resolve: Lowering MTTR directly improves recovery by minimizing the time it takes to restore service after a failure. This can be done through automation, streamlined procedures, and robust rollback strategies.

3. MTTA (Mean Time to Acknowledge):

  • Prevent: While MTTA mainly focuses on response times, reducing it can foster a more responsive monitoring environment, helping to catch issues before they escalate.
  • Detect: A robust monitoring system that allows for quick acknowledgment can speed up the detection of failures or potential failures.
  • Resolve: Faster acknowledgment of issues means quicker initiation of resolution processes, which can help in restoring the service promptly.

4. MTTF (Mean Time to Failure):

  • Prevent: MTTF provides insights into the expected lifetime of a system or component. Regular maintenance, monitoring, and replacement aligned with MTTF predictions can prevent unexpected failures.
  • Detect: Changes in MTTF patterns can provide early warnings of potential failure, allowing for pre-emptive action.
  • Resolve: While MTTF doesn’t directly correlate with resolution, understanding it helps in planning failover strategies and ensuring that backups or redundancies are in place for anticipated failures.

Implementing These Metrics:

Utilizing these metrics in a microservices environment requires:

  • Comprehensive Monitoring: Continual monitoring of each microservice to gather data.
  • Alerting and Automation: Implementing automated alerts and actions based on these metrics to ensure immediate response.
  • Regular Review and Analysis: Periodic analysis to derive insights and make necessary adjustments to both the system and the process.
  • Integration with Incident Management: Linking these metrics with incident management to streamline detection and resolution.

By monitoring these metrics and integrating them into the daily operations, incident management, and continuous improvement processes, organizations can build more robust microservice architectures capable of preventing, detecting, and resolving failures efficiently.
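
As a rough illustration of how these metrics could be derived from incident records, the following Python sketch computes MTTA, MTTR and MTBF. The record layout (start, acknowledged, resolved timestamps) and the function name are assumptions made for this example. It measures MTTR from the start of the failure to restored service and MTBF as operational time divided by the number of failures, which are common conventions but not the only ones.

```python
from datetime import datetime, timedelta


def incident_metrics(incidents, window_start, window_end):
    """Compute MTTA, MTTR and MTBF for one service.

    incidents: list of (started, acknowledged, resolved) datetimes.
    The tuple layout is a hypothetical record format for this sketch.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    count = len(incidents)
    time_to_ack = sum((ack - start for start, ack, _ in incidents), timedelta())
    downtime = sum((end - start for start, _, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return {
        "MTTA": time_to_ack / count,  # mean time from failure to acknowledgment
        "MTTR": downtime / count,     # mean time from failure to restored service
        "MTBF": uptime / count,       # mean operational time per failure
    }
```

Tracking these values per service and per month makes a sudden increase in MTTR or a drop in MTBF visible as a trend rather than an anecdote.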

Development Procedures

A well-defined process is essential for managing the complexities of microservices architecture, especially when it comes to preventing, detecting, and resolving failures. This process typically covers various stages, from setting up monitoring and alerts to handling incidents, troubleshooting, escalation, recovery, communication, and continuous improvement. Here’s how such a process can be designed, including specific steps to follow when an alarm is received about the health of a service:

1. Preventing Failures:

  • Standardizing Development Practices: Creating coding standards, using automated testing, enforcing security guidelines, etc.
  • Implementing Monitoring and Alerting: Setting up monitoring for key performance indicators and establishing alert thresholds.
  • Regular Maintenance and Health Checks: Scheduling periodic maintenance, updates, and health checks to ensure smooth operation.
  • Operational Checklists: Maintaining a checklist for operational readiness such as:
    • Review requirements, API specifications, test plans and rollback plans.
    • Review logging, monitoring, alarms, throttling, feature flags, and other key configurations.
    • Document and understand components of a microservice and its dependencies.
    • Define key operational and business metrics for the microservice and setup a dashboard to monitor health metrics.
    • Review authentication, authorization and security impact for the service.
    • Review data privacy, archival and retention policies.
    • Define failure scenarios and impact to other services and customers.
    • Document capacity planning for scalability, redundancy to eliminate single point of failures and failover strategies.

2. Detecting Failures:

  • Real-time Monitoring: Constantly watching system metrics to detect anomalies.
  • Automated Alerting: Implementing automated alerts that notify relevant teams when an anomaly or failure is detected.

3. Responding to Alarms and Troubleshooting:

When an alarm is received:

  • Acknowledge the Alert: Confirm the reception of the alert and log the incident.
  • Initial Diagnosis: Quickly assess the scope, impact, and potential cause of the issue.
  • Troubleshooting: Follow a systematic approach to narrow down the root cause, using tools, logs, and predefined troubleshooting guides.
  • Escalation (if needed): If the issue cannot be resolved promptly, escalate to higher-level teams or experts, providing all necessary information.

4. Recovery and Mitigation:

  • Implement Immediate Mitigation: Apply temporary fixes to minimize customer impact.
  • Recovery Actions: Execute recovery plans, which might include restarting services, reallocating resources, etc.
  • Rollback (if needed): If a recent change caused the failure, initiate a rollback to a stable version, following predefined rollback procedures.

5. Communication:

  • Internal Communication: Keep all relevant internal stakeholders informed about the status, actions taken, and expected resolution time.
  • Communication with Customers: If the incident affects customers, communicate transparently about the issue, expected resolution time, and any necessary actions they need to take.

6. Post-Incident Activities:

  • Post-mortem Analysis: Conduct a detailed analysis of the incident, identify lessons learned, and update procedures as needed.
  • Continuous Improvement: Regularly review and update the process, including the alarm response and troubleshooting guides, based on new insights and changes in the system.

A well-defined process for microservices not only provides clear guidelines on development and preventive measures but also includes detailed steps for responding to alarms, troubleshooting, escalation, recovery, and communication. Such a process ensures that the team is prepared and aligned when issues arise, enabling rapid response, minimizing customer impact, and fostering continuous learning and improvement.

Post-Mortem Analysis

When a failure or incident occurs in a microservice, the development team should follow a post-mortem process to analyze and evaluate it. Here’s how post-mortems help enhance fault tolerance:

1. Understanding Root Causes:

A post-mortem helps identify the root cause of a failure, not just the superficial symptoms. By using techniques like the “5 Whys,” teams can delve deep into the underlying issues that led to the fault, such as coding errors, network latency, or configuration mishaps.

2. Assessing Impact and Contributing Factors:

Post-mortems enable the evaluation of the full scope of the incident, including customer impact, affected components, and contributing factors like environmental variations. This comprehensive view allows for targeted improvements.

3. Learning from Failures:

By documenting what went wrong and what went right during an incident, post-mortems facilitate organizational learning. This includes understanding the sequence of events, team response effectiveness, tools and processes used, and overall system resilience.

4. Developing Actionable Insights:

Post-mortems result in specific, actionable recommendations to enhance system reliability and fault tolerance. This could involve code refactoring, infrastructure upgrades, or adjustments to monitoring and alerting.

5. Improving Monitoring and Alerting:

Insights from post-mortems can be used to fine-tune monitoring and alerting systems, making them more responsive to specific failure patterns. This enhances early detection and allows quicker response to potential faults.

6. Fostering a Culture of Continuous Improvement:

Post-mortems encourage a blame-free culture focused on continuous improvement. By treating failures as opportunities for growth, teams become more collaborative and proactive in enhancing system resilience.

7. Enhancing Documentation and Knowledge Sharing:

The documentation produced through post-mortems is a valuable resource for the entire organization. It can be referred to when similar incidents occur, or during the onboarding of new team members, fostering a shared understanding of system behavior and best practices.

Conclusion

The complexity and interdependent nature of microservice architecture introduce specific challenges in terms of management, communication, security, and fault handling. By adopting robust measures for prevention, detection, and recovery, along with adhering to development best practices and learning from post-mortems, organizations can significantly enhance the fault tolerance and resilience of their microservices. A well-defined, comprehensive approach that integrates all these aspects ensures a more robust, flexible, and responsive system, capable of adapting and growing with evolving demands.

June 24, 2023

Systems Performance

Filed under: Computing,Technology — admin @ 3:33 pm

This blog summarizes learnings from Systems Performance by Brendan Gregg that covers concepts, tools, and performance tuning for operating systems and applications.

1. Introduction

This chapter introduces Systems Performance including all major software and hardware components with goals to improve the end-user experience by reducing latency and computing cost. The author describes challenges for performance engineering such as subjectivity without clear goals, complexity that requires holistic approach, multiple causes, and multiple performance issues. The author defines key concepts for performance such as:

  • Latency: a measurement of time spent waiting, which allows the maximum speedup to be estimated.
  • Observability: refers to understanding a system through observation and includes tools that use counters, profiling, and tracing. Counters, statistics, and metrics can be used by monitoring software to trigger alerts. Profiling performs sampling to paint a picture of the target. Tracing is event-based recording, where event data is captured and saved for later analysis using static instrumentation or dynamic instrumentation. The latest dynamic tools use BPF (Berkeley Packet Filter) to build dynamic tracing, which is also referred to as eBPF.
  • Experimentation tools: benchmark tools that test a specific component, e.g., the following example performs a TCP network throughput test:
iperf -c 192.168.1.101 -i 1 -t 10

The chapter also describes common Linux tools for analysis such as:

  • dmesg -T | tail
  • vmstat -SM 1
  • mpstat -P ALL 1
  • pidstat 1
  • iostat -sxz 1
  • free -m
  • sar -n DEV 1
  • sar -n TCP,ETCP 1

2. Methodologies

This chapter introduces common performance concepts such as IOPS, throughput, response-time, latency, utilization, saturation, bottleneck, workload, and cache. It defines the following models of system performance, System Under Test and Queuing System, shown below:

System Under Test
Queuing System

The chapter continues with defining concepts such as latency, time-scales, trade-offs, tuning efforts, load vs architecture, scalability, and metrics. It defines utilization based on time as:

U = B / T
where U = utilization, B = total-busy-time, T = observation period

and in terms of capacity, e.g.

U = % used capacity

The chapter defines saturation as the degree to which a resource has queued work it cannot service and defines caching considerations to improve performance such as hit ratio, cold-cache, warm-cache, and hot-cache.
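
The time-based utilization formula above can be sketched in a few lines of Python. The function names and the idea of sampling a cumulative busy-time counter twice are assumptions for illustration, loosely modeled on how interval tools report utilization.

```python
def utilization(busy_time, observation_period):
    """Time-based utilization U = B / T, returned as a fraction (0.0 to 1.0)."""
    if observation_period <= 0:
        raise ValueError("observation period must be positive")
    return busy_time / observation_period


def utilization_between_samples(busy_ms_before, busy_ms_after, interval_ms):
    """Utilization from two readings of a cumulative busy-time counter."""
    return utilization(busy_ms_after - busy_ms_before, interval_ms)
```

For example, a disk whose busy counter advances by 250 ms during a 1000 ms interval was 25% utilized during that interval.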

The author then describes analysis perspectives: resource analysis, which begins with the system resources and is used for investigating performance issues and capacity planning, and workload analysis, which examines the performance of applications to identify and confirm latency issues. Next, the author describes the following methodologies for analysis:

  • Streetlight Anti-Method: This is the absence of a deliberate methodology, where the user analyzes performance with whichever tools are familiar, which can be hit or miss.
  • Random Change Anti-Method: In this approach, the user randomly guesses where the problem may be and then changes things until it goes away.
  • Blame-Someone-Else Anti-Method: In this approach, the user blames someone else and redirects the issue to another team.
  • Ad Hoc Checklist Method: It’s a common methodology where a user uses an ad hoc list built from recent experience.
  • Problem Statement: It defines the problem by asking whether there is a performance issue, what was changed recently, who is being affected, etc.
  • Scientific Method: This approach is summarized as: Question -> Hypothesis -> Prediction -> Test -> Analysis.
  • Diagnostic Cycle: This is defined as hypothesis -> instrumentation -> data -> hypothesis.
  • Tools Method: This method lists available performance tools, gathers metrics from each tool, and then interprets the metrics collected.
  • Utilization, Saturation, and Errors (USE) Method: This method focuses on system resources and checks utilization, saturation and errors for each resource.
  • RED Method: This approach checks request rate, errors, and duration for every service.
  • Workload Characteristics: This approach answers questions such as who is causing the load, why the load is being called, and what the load characteristics are.
  • Drill-Down Analysis: This approach defines three-stage drill-down analysis methodology for system resource: Monitoring, Identification, Analysis.
  • Latency Analysis: This approach examines the time taken to complete an operation and then breaks it into smaller components, continuing to subdivide the components.
  • Method R: This methodology was developed for Oracle databases and focuses on finding the origin of latency.
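
To make the RED method from the list above more concrete, here is a minimal Python sketch that summarizes request rate, error ratio, and duration for one service over a time window. The Request record, the field names, and the choice of a nearest-rank 95th percentile are assumptions for this example, not something prescribed by the book.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    failed: bool        # True if the request ended in an error


def red_summary(requests, window_seconds):
    """Return request rate, error ratio and p95 duration for one service."""
    count = len(requests)
    if count == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.failed)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = (95 * count + 99) // 100
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count,
        "p95_ms": durations[rank - 1],
    }
```

The USE method can be sketched the same way per resource, replacing the three fields with utilization, saturation (queue length), and error counts.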

Modeling

The chapter then defines analytical modeling of a system using following techniques:

  • Enterprise vs Cloud
  • Visual Identification uses graphs to identify patterns for linear scalability, contention, coherence (propagation of changes), knee-point (where performance stops scaling linearly), scalability ceiling.
  • Amdahl’s Law of Scalability describes contention for the serial resource:
C(N) = N / (1 + a(N - 1))
where C(N) is relative capacity, N is the scaling dimension such as the number of CPUs, and a is the degree of seriality.
  • Universal Scalability Law is described as:
C(N) = N / (1 + a(N - 1) + bN(N - 1))
where b is the coherence parameter and when b = 0, it becomes Amdahl's Law
  • Queuing Theory describes Little’s Law as:
L = lambda * W
where L is average number of requests in the system, lambda is average arrival rate, and W is average request time.
Queuing System
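
The three formulas above can be evaluated directly. The following Python sketch uses the same symbols (N, a, b, lambda, W); the function names and the sample values in the usage note are illustrative assumptions only.

```python
def amdahl_capacity(n, a):
    """Amdahl's Law: relative capacity C(N) with degree of seriality a."""
    return n / (1 + a * (n - 1))


def usl_capacity(n, a, b):
    """Universal Scalability Law: adds the coherence parameter b."""
    return n / (1 + a * (n - 1) + b * n * (n - 1))


def littles_law(arrival_rate, avg_request_time):
    """Little's Law: average number of requests in the system, L = lambda * W."""
    return arrival_rate * avg_request_time
```

For instance, with a = 0.05 and N = 16 the relative capacity under Amdahl's Law is 16 / 1.75, or roughly 9.1, and any positive coherence parameter b lowers it further. With 100 requests per second and an average request time of 0.5 seconds, Little's Law gives 50 requests in the system on average.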

Capacity Planning

This section describes Capacity Planning for examining how a system will handle load and how it will scale as load grows. It searches for the resource that will become the bottleneck under load, including hardware and software components. It then applies factor analysis to determine which factors to change to achieve the desired performance.

Statistics

This section reviews how to use statistics for analysis such as:

  • Quantifying Performance Gains: using observation-based and experimentation-based techniques to compare performance improvements.
  • Averages: including geometric-mean (nth root of multiplied values), harmonic-mean (count of values divided by sum of their reciprocals), average over time, decayed average (recent time weighted more).
  • Standard Deviation, Percentile, Median
  • Coefficient of Variation
  • Multimodal Distribution
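
Several of the statistics listed above are available in the Python standard library (statistics.geometric_mean requires Python 3.8 or later). The sketch below adds two small helpers. The helper names and the smoothing factor alpha are assumptions for this example, and the decayed average shown is a simple exponential form, which is one of several ways to weight recent values more.

```python
from statistics import geometric_mean, harmonic_mean, mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (unitless)."""
    return pstdev(values) / mean(values)


def decayed_average(values, alpha=0.5):
    """Exponentially decayed average: each new value gets weight alpha,
    so older values fade geometrically. alpha is an illustrative choice."""
    average = values[0]
    for value in values[1:]:
        average = alpha * value + (1 - alpha) * average
    return average
```

For example, geometric_mean([1, 4, 16]) is approximately 4 and harmonic_mean([40, 60]) is 48, which is the right average for rates such as speed over equal distances.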

Monitoring

Monitoring records performance statistics over time for comparison and identification using various time-based patterns such as hourly, daily, weekly, and yearly.

Visualization

This section examines various visualization graphs such as line-chart, scatter-plots, heat-maps, timeline-charts, and surface plot.

3. Operating Systems

This chapter examines operating system and kernel for system performance analysis and defines concepts such as:

Background

Kernel

The kernel is the core of the operating system. Linux and BSD have monolithic kernels, but other kernel models include microkernels, unikernels, and hybrid kernels. In addition, new Linux versions include extended BPF for enabling secure kernel-mode applications.

Kernel and User Modes

The kernel runs in kernel mode, which allows it to access devices and execute privileged instructions. User applications run in user mode, where they can request privileged operations using system calls such as ioctl, mmap, brk, and futex.

Interrupts

An interrupt is a signal to the processor that some event has occurred that needs processing. It interrupts the current execution of the processor and runs an interrupt service routine to handle the event. Interrupts can be asynchronous, for handling interrupt service requests (IRQ) from hardware devices, or synchronous, generated by software instructions such as traps, exceptions, and faults.

Clocks and Idle

In old kernel implementations, tick latency and tick overhead caused some performance issues but modern implementations have moved much functionality out of the clock routine to on-demand interrupts to create tickless kernel for improving power efficiency.

Processes

A process is an environment for executing a user-level program and consists of memory address space, file descriptors, thread stacks, and registers. A process contains one or more threads, where each thread has a stack, registers and an instruction pointer (PC). Processes are normally created using the fork system call, which on Linux may be a wrapper around clone, followed by exec/execve to begin executing a different program.

Stack

A stack is a memory storage area for temporary data, organized as a last-in, first-out (LIFO) list. It is used to store the return address when calling a function and to pass parameters to the function; the data for one function call is referred to as a stack frame. The call path can be seen by examining saved return addresses across all the stack frames, which is called a stack trace or stack back trace. While executing a system call, a process thread has two stacks: a user-level stack and a kernel-level stack.

Virtual Memory

Virtual memory is an abstraction of main memory, providing processes and the kernel with their own private view of main memory. It supports multitasking of threads and over-subscription of main memory. The kernel manages memory using process-swapping and paging schemes.

Schedulers

The scheduler schedules processes on processors by dividing CPU time among the active processes and threads. The scheduler tracks all threads in the ready-to-run state in priority run queues, where process priority can be modified to improve performance of the workload. Workloads are categorized as either CPU-bound or I/O-bound, and the scheduler may decrease the priority of CPU-bound processes to allow I/O-bound workloads to run sooner.

File Systems

File systems are an organization of data as files and directories. The virtual file system (VFS) abstracts file system types so that multiple file systems may coexist.

Kernels

This section discusses Unix-like kernel implementations with a focus on performance such as Unix, BSD, and Solaris. In the context of Linux, it describes systemd, a commonly used service manager that replaces the original UNIX init system, and extended BPF, which can be used for networking, observability, and security. BPF programs run in kernel mode and are configured to run on events like USDT probes, kprobes, uprobes, and perf_events.

4. Observability Tools

This chapter identifies static performance and crisis tools, and describes tool types such as counters, profiling, and tracing, including their overhead.

Tools Coverage

This section describes static performance tools like sysctl, dmesg, lsblk, mdadm, ldd, tc, etc., and crisis tools like vmstat, ps, dmesg, lscpu, iostat, mpstat, pidstat, sar, and more.

Tools Type

Observability tools can be categorized as system-wide or per-process, and as counter-based or event-based, e.g. top shows a system-wide summary; ps and pmap report per-process statistics; and profilers/tracers are event-based tools. The kernel maintains various counters that are incremented when events occur such as network packets received, disk I/O, etc. Profiling collects a set of samples such as CPU usage at a fixed rate or based on untimed hardware events, e.g., perf and profile are system-wide profilers, and gprof and cachegrind are per-process profilers. Tracing instruments the occurrence of an event, and can store event-based details for later analysis. Examples of system-wide tracing tools include tcpdump, biosnoop, execsnoop, perf, ftrace and bpftrace, and examples of per-process tracing tools include strace and gdb. Monitoring records statistics continuously for later analysis, e.g. sar, prometheus, and collectd are common tools for monitoring.

Observability Sources

The main sources of system performance statistics include /proc and /sys. The /proc file system exposes kernel statistics and is created dynamically by the kernel. For example, ls -F /proc/123 lists per-process statistics for the process with PID 123 like limits, maps, sched, smaps, stat, status, cgroup and task. Similarly, ls -Fd /proc/[a-z]* lists system-wide statistics like cpuinfo, diskstats, loadavg, meminfo, schedstat, and zoneinfo. The /sys file system was originally designed for device-driver statistics but has been extended for other types of statistics, e.g. find /sys/devices/system/cpu/cpu0 -type f provides information about CPU caches. Tracepoints are a Linux kernel event source based on static instrumentation and provide insight into kernel behavior. For example, perf list tracepoint lists available tracepoints and perf trace -e block:block_rq_issue traces events. Tracing tools can also use tracepoints, which can be seen by running strace on them, e.g., strace -e openat ~/iosnoop and strace -e perf_event_open ~/biosnoop. Kprobes are a kernel event source based on dynamic instrumentation that can trace entry to functions and instructions, e.g. bpftrace -e 'kprobe:do_nanosleep { printf("sleep %s\n", comm); }'. Uprobes are the user-space event source for dynamic instrumentation, e.g., bpftrace -l 'uprobe:/bin/bash:*'. User-level statically-defined tracing (USDT) is the user-space version of tracepoints and some libraries/apps have added USDT probes, e.g., bpftrace -lv 'usdt:/openjdk/libjvm.so:*'. Hardware counters (PMCs) are used for observing activity by devices, e.g. perf stat gzip words instruments the architectural PMCs.
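
As a small illustration of reading a /proc source programmatically, the following Python sketch parses /proc/loadavg, whose five fields are the 1, 5, and 15 minute load averages, the running/total count of scheduling entities, and the most recently created PID. The function names and dictionary keys are assumptions for this example, and read_loadavg only works on Linux.

```python
from pathlib import Path


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dictionary."""
    one, five, fifteen, entities, last_pid = text.split()
    running, total = entities.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "running": int(running),  # currently runnable scheduling entities
        "total": int(total),      # scheduling entities that exist on the system
        "last_pid": int(last_pid),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file (Linux only)."""
    return parse_loadavg(Path(path).read_text())
```

Keeping the parser separate from the file read makes it possible to test the parsing on any platform with a captured sample line.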

Sar

Sar is a key monitoring tool that is provided via the sysstat package, e.g., sar -u -n TCP,ETCP reports CPU and TCP statistics.

5. Applications

This chapter describes performance tuning objectives, application basics, fundamentals for application performance, and strategies for application performance analysis.

Application Basics

This section defines performance goals including lowering latency, increasing throughput, improving resource utilization, and lowering computing costs. Some companies use a target application performance index (Apdex) as an objective and as a metric to monitor:

Apdex = (satisfactory + 0.5 x tolerable + 0 x frustrating) / total-events
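
The Apdex formula above translates directly into code. In the Python sketch below, the classification of raw latencies is an assumption based on the usual Apdex convention (satisfactory up to a target time T, tolerable up to 4T, frustrating beyond that); the formula itself does not fix these thresholds, and the function names are illustrative.

```python
def apdex(satisfactory, tolerable, frustrating):
    """Apdex score from event counts; frustrating events contribute zero."""
    total = satisfactory + tolerable + frustrating
    if total == 0:
        raise ValueError("at least one event is required")
    return (satisfactory + 0.5 * tolerable) / total


def apdex_from_latencies(latencies, target):
    """Classify latencies against a target T (assumed thresholds: T and 4T)."""
    satisfactory = sum(1 for t in latencies if t <= target)
    tolerable = sum(1 for t in latencies if target < t <= 4 * target)
    frustrating = len(latencies) - satisfactory - tolerable
    return apdex(satisfactory, tolerable, frustrating)
```

With 60 satisfactory, 30 tolerable, and 10 frustrating events, the score is (60 + 15) / 100 = 0.75.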

Application Performance Techniques

This section describes common techniques for improving application performance such as increasing I/O size to improve throughput, caching results of commonly performed operations, using a ring buffer for continuous transfer between components, and using event notifications instead of polling. This section describes concurrency, where multiple runnable programs are loaded and their execution may overlap, and recommends parallelism via multiple processes or threads to take advantage of multiprocessor systems. These multiprocess or multithreaded applications use the CPU scheduler at the cost of context-switch overhead. Alternatively, user-mode applications may implement their own scheduling mechanisms like fibers (lightweight threads), co-routines (more lightweight than fibers, e.g. Goroutine in Golang), and event-based concurrency (as in Node.js). The common models of user-mode multithreaded programming are: using a service thread-pool, a CPU thread-pool, and staged event-driven architecture (SEDA). In order to protect the integrity of shared memory when accessing it from multiple threads, applications can use mutexes, spin locks, RW locks, and semaphores. The implementation of these synchronization primitives may use fastpath (using cmpxchg to set owner), midpath (optimistic spinning), slowpath (blocks and deschedules thread), or read-copy-update (RCU) mechanisms based on the concurrency use-cases. In order to avoid the cost of creation and destruction of mutex locks, implementations may use a hash table to store a set of mutex locks instead of using a global mutex lock for all data structures or a mutex lock for every data structure. Further, non-blocking I/O allows issuing I/O operations asynchronously without blocking the thread, using the O_ASYNC flag in open, io_submit, sendfile and io_uring_enter.
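
The hash table of mutex locks mentioned above can be sketched as follows. This is a minimal Python illustration with hypothetical class names (StripedLocks, StripedCounters): a fixed array of locks is indexed by the hash of a key, so unrelated keys rarely contend while the number of locks stays bounded. The sketch assumes that individual dict operations are safe across threads, as they are in CPython.

```python
import threading


class StripedLocks:
    """A fixed-size table of locks selected by hashing the key."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]

    def lock_for(self, key):
        # The same key always maps to the same lock within a process.
        return self._locks[hash(key) % len(self._locks)]


class StripedCounters:
    """Counters protected per key by a lock chosen from the table."""

    def __init__(self, stripes=16):
        self._locks = StripedLocks(stripes)
        self._values = {}

    def increment(self, key):
        with self._locks.lock_for(key):
            # Read-modify-write is atomic with respect to other users of this key.
            self._values[key] = self._values.get(key, 0) + 1

    def get(self, key):
        with self._locks.lock_for(key):
            return self._values.get(key, 0)
```

The number of stripes is a trade-off: more stripes reduce contention between unrelated keys, while fewer stripes use less memory. Two keys that hash to the same stripe simply share a lock, which is safe but serializes their updates.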

Programming Languages

This section describes compiled languages, compiler optimization flags, interpreted languages, virtual machines, and garbage collection.

Methodology

This section describes methodologies for application analysis and tuning using:

  • CPU profiling and visualizing via CPU flame graphs.
  • off-CPU analysis using sampling, scheduler tracing, and application instrumentation, which may be difficult to interpret due to wait time in the flame graphs so zooming or kernel-filtering may be required.
  • Syscall analysis can be instrumented to study resource based performance issues where the target for syscall analysis include new process tracing, I/O profiling, and kernel time analysis.
  • USE method
  • Thread state analysis like user, kernel, runnable, swapping, disk I/O, net I/O, sleeping, lock, idle.
  • Lock analysis
  • Static performance tuning
  • Distributed tracing

Observability Tools

This section introduces application performance observability tools:

  • perf is the standard Linux profiler with many uses such as:
    • CPU Profiling: perf record -F 49 -a -g -- sleep 30 && perf script --header > out.stacks
    • CPU Flame Graphs: ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --hash > out.svg
    • Syscall Tracing: perf trace -p $(pgrep mysqld)
    • Kernel Time Analysis: perf trace -s -p $(pgrep mysqld)
  • profile is a timer-based CPU profiler from BCC, e.g.,
    • profile -F 49 10
  • offcputime and bpftrace summarize time spent by threads blocked and off-CPU, e.g.,
    • offcputime 5
  • strace is the Linux system call tracer
    • strace -ttt -T -p 123
  • execsnoop traces new process execution system-wide.
  • syscount counts system calls system-wide.
  • bpftrace is a BPF-based tracer that provides a high-level programming language, e.g.,
    • bpftrace -e 't:syscalls:sys_enter_kill { time("%H:%M:%S "); }'

6. CPU

This chapter provides basis for CPU analysis such as:

Models

This section describes CPU architecture and memory caches like CPU registers, L1, L2 and L3. It then describes the run queue, which queues software threads that are ready to run; the time spent waiting on a CPU run queue is called run-queue latency or dispatcher-queue latency.

Concepts

This section describes concepts regarding CPU performance including:

  • Clock Rate: Each CPU instruction may take one or more cycles of the clock to execute.
  • Instructions: CPUs execute instructions chosen from their instruction set.
  • Instruction Pipeline: This allows multiple instructions to execute in parallel by executing different components of different instructions at the same time. Modern processors may also use branch prediction and out-of-order execution of the pipeline.
  • Instruction Width: Superscalar CPU architecture allows more instructions to make progress with each clock cycle, based on the instruction width.
  • SMT: Simultaneous multithreading makes use of a superscalar architecture and hardware multi-threading support to improve parallelism.
  • IPC, CPI: Instructions per cycle (IPC), and its inverse cycles per instruction (CPI), describe how a CPU is spending its clock cycles.
  • Utilization: CPU utilization is measured by the time a CPU instance is busy performing work during an interval.
  • User Time/Kernel Time: The CPU time spent executing user-level software is called user time, and the time spent executing kernel-level software is kernel time.
  • Saturation: A CPU at 100% utilization is saturated, and threads will encounter scheduler latency as they wait to run on CPU.
  • Priority Inversion: It occurs when a lower-priority thread holds a resource and blocks a high-priority thread from running.
  • Multiprocess, Multithreading: Multithreading is generally considered superior.

Architecture

This section describes CPU architecture and implementation:

Hardware

CPU hardware includes the processor and its subsystems:

  • Processor: The processor components include P-cache (prefetch-cache), W-cache (write-cache), Clock, Timestamp counter, Microcode ROM, Temperature sensor, and network interfaces.
  • P-States and C-States: The advanced configuration and power interface (ACPI) defines P-states, which provide different levels of performance during execution, and C-states, which provide different idle states for when execution is halted, saving power.
  • CPU caches: These include levels of caches like level-1 instruction cache, level-1 data cache, translation lookaside buffer (TLB), level-2 cache, and level-3 cache. Multiple levels of cache are used to deliver the optimum configuration of size and latency.
  • Associativity: It describes the constraint for locating new entries: fully associative, where an entry can be placed anywhere in the cache (e.g. using LRU for eviction); direct-mapped, where each entry has only one valid location in cache; and set-associative, where a subset of the cache is identified by mapping, e.g., four-way associative maps an address to four possible locations.
  • Cache Line: Cache line size is a range of bytes that are stored and transferred as a unit.
  • Cache Coherence: Cache coherence ensures that CPUs are always accessing the correct state of memory.
  • MMU: The memory management unit (MMU) is responsible for virtual-to-physical address translation.
  • Hardware Counters (PMC): PMCs are processor registers implemented in hardware and include CPU cycles, CPU instructions, Level 1, 2, 3 cache accesses, floating-point unit, memory I/O, and resource I/O.
  • GPU: GPUs support graphical displays.
  • Software: Kernel software includes the scheduler, which performs time-sharing, preemption, and load balancing. The scheduler uses scheduling classes to manage the behavior of runnable threads like priorities and scheduling policies. The scheduling classes for Linux kernels include RT (fixed and high-priorities for real-time workloads), O(1) for reduced latency, CFS (Completely Fair Scheduler), Idle, and Deadline. Scheduler policies include RR (round-robin), FIFO, NORMAL, BATCH, IDLE, and DEADLINE.
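
To illustrate the associativity entry in the list above, the following Python sketch shows how an address could map to a cache set. The function name and the cache geometry used in the usage note (32 KiB, 64-byte lines) are hypothetical values chosen for the example; real processors differ in sizes and may hash addresses differently.

```python
def cache_location(address, cache_bytes, line_bytes, ways):
    """Map an address to (set index, tag, byte offset) for a simple cache model.

    ways=1 models a direct-mapped cache; ways equal to the number of lines
    models a fully associative cache with a single set.
    """
    num_lines = cache_bytes // line_bytes
    if ways < 1 or num_lines % ways != 0:
        raise ValueError("ways must evenly divide the number of cache lines")
    num_sets = num_lines // ways
    block = address // line_bytes  # which line-sized block the address falls in
    return {
        "sets": num_sets,
        "set": block % num_sets,   # the only set that may hold this block
        "tag": block // num_sets,  # distinguishes blocks that share a set
        "offset": address % line_bytes,
    }
```

For a 32 KiB cache with 64-byte lines, an 8-way set-associative layout has 64 sets, so address 0x12345 falls in set 13 and can occupy any of that set's 8 lines. With ways=1 the same address has exactly one valid line, and with ways=512 every line is a candidate.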

Methodology

This section describes various methodologies for CPU analysis and tuning such as:

  • Tools Method: iterate over available tools like uptime, vmstat, mpstat, perf/profile, showboost/turbostat, and dmesg.
  • USE Method: It checks for utilization, saturation, and errors for each CPU.
  • Workload Characterization like CPU load average, user-time to system-time ratio, syscall rate, voluntary context switch rate, and interrupt rate.
  • Profiling: CPU profiling can be performed by time-based sampling or function tracing.
  • Cycle Analysis: Using performance monitoring counters (PMCs) to understand CPU utilization at the cycle level.
  • Performance Monitoring: identifies active issues and patterns over time using metrics for CPU like utilization and saturation.
  • Static Performance Tuning
  • Priority Tuning
  • CPU Binding

Observability Tools

This section introduces CPU performance observability tools such as:

  • uptime
  • load average – exponentially damped moving average for load including current resource usage plus queued requests (saturation).
  • pressure stall information (PSI)
  • vmstat, e.g. vmstat 1
  • mpstat, e.g. mpstat -P ALL 1
  • sar
  • ps
  • top
  • pidstat
  • time, ptime
  • turbostat
  • showboost
  • pmcarch
  • tlbstat
  • perf, e.g., perf record -F 99 command, perf stat gzip ubuntu.iso, and perf stat -a -- sleep 10
  • profile
  • cpudist, e.g., cpudist 10 1
  • runqlat, e.g., runqlat 10 1
  • runqlen, e.g., runqlen 10 1
  • softirqs, e.g., softirqs 10 1
  • hardirqs, e.g., hardirqs 10 1
  • bpftrace, e.g., bpftrace -l 'tracepoint:sched:*'
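
The load average bullet above describes an exponentially damped moving average. The following is a rough floating-point sketch of that idea, assuming a 5-second sampling interval and a 1-minute damping period; the function name is hypothetical and the kernel's real implementation uses fixed-point arithmetic.

```python
import math


def damped_load(samples, period_s=60.0, interval_s=5.0, initial=0.0):
    """Fold samples of runnable-task counts into a damped moving average."""
    decay = math.exp(-interval_s / period_s)  # weight kept from the old value
    load = initial
    for n in samples:
        load = load * decay + n * (1.0 - decay)
    return load


if __name__ == "__main__":
    # A steady load of 2 tasks converges toward 2.0 over time.
    print(round(damped_load([2] * 12), 3))   # after one minute
    print(round(damped_load([2] * 120), 3))  # after ten minutes
```

This shows why a 1-minute load average still reflects activity from several minutes ago: old samples fade out gradually rather than dropping off a window.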

Visualization

This section introduces CPU utilization heat maps, CPU subsecond-offset heat maps, flame graphs, and FlameScope.

Tuning

Tuning may use scheduling priority, power states, and CPU binding.

7. Memory

This chapter provides a basis for memory analysis, including background information on key concepts and the architecture of hardware and software memory.

Concepts

Virtual Memory

Virtual memory is an abstraction that provides each process its own private address space. The process address space is mapped by the virtual memory subsystem to main memory and the physical swap device.

Paging

Paging is the movement of pages in and out of main memory. File system paging (good paging) is caused by the reading and writing of pages in memory-mapped files. Anonymous paging (bad paging) involves data that is private to a process: the process heap and stacks.

Demand Paging

Demand paging maps pages of virtual memory to physical memory on demand and defers the CPU overhead of creating the mappings until they are needed.

Utilization and Saturation

If the demand for memory exceeds the amount of main memory, main memory becomes saturated and the operating system may employ paging or the OOM killer to free it.

Architecture

This section introduces memory architecture:

Hardware Main Memory

The most common type of main memory is dynamic random-access memory (DRAM), and the column address strobe (CAS) latency for DDR4 is around 10-20 ns. The main memory architecture can be uniform memory access (UMA) or non-uniform memory access (NUMA). The main memory may use a shared system bus to connect one or more processors, directly attached memory, or an interconnected memory bus. The MMU (memory management unit) translates virtual addresses to physical addresses for each page and offset within a page. The MMU uses a TLB (translation lookaside buffer) as a first-level cache for addresses in the page tables.
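
As a rough illustration of the page-and-offset translation described above, the sketch below splits a virtual address into a page number and an offset, then looks the page up in a toy page table. The 4 KiB page size, the dictionary-based page table, and the function names are assumptions for illustration only; real MMUs use multi-level page tables in hardware.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def split_virtual_address(vaddr, page_size=PAGE_SIZE):
    """Return (virtual page number, offset within the page)."""
    return divmod(vaddr, page_size)


def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address using a toy {page: frame} table."""
    page, offset = split_virtual_address(vaddr, page_size)
    frame = page_table[page]  # a KeyError here stands in for a page fault
    return frame * page_size + offset


if __name__ == "__main__":
    table = {1: 7}  # virtual page 1 is backed by physical frame 7
    print(hex(translate(0x1234, table)))
```

The offset is carried over unchanged; only the page number is replaced by a frame number, which is the mapping a TLB caches.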

Software

The kernel tracks free memory in a free list of pages that are available for immediate allocation. When memory is low, the kernel may use swapping, reap any memory that can be freed, or use the OOM killer to free memory.

Process Virtual Address Space

The process virtual address space is a range of virtual pages that are mapped to physical pages, and addresses are split into segments like executable text, executable data, heap, and stack. There are a variety of user- and kernel-level allocators for memory that aim for simple APIs (malloc/free), efficient memory usage, performance, and observability.

Methodology

This section describes various methodologies for memory analysis:

  • Tools Method: involves checking page scanning, pressure stall information (PSI), swapping, vmstat, OOM killer, and perf.
  • USE Method: utilization of memory, the degree of page scanning, swapping, OOM killer, and hardware errors.
  • Characterizing usage
  • Performance monitoring
  • Leak detection
  • Static performance tuning

Observability Tools

This section includes memory observability tools including:

  • vmstat
  • PSI, e.g., cat /proc/pressure/memory
  • swapon
  • sar
  • slabtop, e.g., slabtop -sc
  • numastat
  • ps
  • pmap, e.g. pmap -x 123
  • perf
  • drsnoop
  • wss
  • bpftrace
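
The PSI entry above reads /proc/pressure/memory, which reports "some" and "full" lines with avg10, avg60, avg300, and total fields. Here is a small sketch that parses that text into a dictionary; the sample values are made up for illustration and the function name is hypothetical.

```python
def parse_psi(text):
    """Parse PSI output into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # e.g. "some", ["avg10=0.12", ...]
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


# Illustrative sample only, not captured from a real system.
SAMPLE = """some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=7890"""

if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    print(psi["some"]["avg10"], psi["full"]["total"])
```

On a Linux host with PSI enabled, the same function could be fed the contents of the file instead of the sample string.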

Tuning

This section describes tunable parameters for Linux kernels such as:

  • vm.dirty_background_bytes
  • vm.dirty_ratio
  • kernel.numa_balancing

8. File Systems

This chapter provides a basis for file system analysis:

Models

  • File System Interfaces: File system interfaces include read, write, open, and more.
  • File System Cache: The file system cache may cache reads or buffer writes.
  • Second-Level Cache: It can be any memory type like level-1, level-2, RAM, and disk.

Concepts

  • File System Latency: primary metric of file system performance for time spent in the file system and disk I/O subsystem.
  • Cache: The file system will use main memory as a cache to improve performance.
  • Random vs Sequential I/O: A series of logical file system I/O can be either random or sequential based on the file offset of each I/O.
  • Prefetch/Read-Ahead: Prefetch detects a sequential read workload and issues disk reads before the application requests them.
  • Write-Back Caching: It marks a write as completed after transferring it to main memory and writes to disk asynchronously.
  • Synchronous Writes: using O_SYNC, O_DSYNC or O_RSYNC flags.
  • Raw and Direct I/O
  • Non-Blocking I/O
  • Memory-Mapped Files
  • Metadata: includes logical and physical metadata that is read from and written to the file system.
  • Logical vs Physical I/O
  • Access Timestamps
  • Capacity
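
The random vs sequential distinction above is based on the file offset of each I/O. A simplified sketch of that classification follows, treating an I/O as sequential when it starts exactly where the previous one ended. This heuristic and the function name are my own assumptions; real tools may tolerate small gaps or track multiple streams.

```python
def classify_io(requests):
    """Label each (offset, size) request as 'sequential' or 'random'."""
    labels = []
    expected = None  # offset where a sequential follow-up would start
    for offset, size in requests:
        labels.append("sequential" if offset == expected else "random")
        expected = offset + size
    return labels


if __name__ == "__main__":
    trace = [(0, 4096), (4096, 4096), (100000, 512), (100512, 512)]
    print(classify_io(trace))
```

The first request has no predecessor, so this sketch labels it as random.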

Architecture

This section introduces generic and specific file system architecture:

  • File System I/O Stack
  • VFS (virtual file system) common interface for different file system types.
  • File System Caches
    • Buffer Cache
    • Page Cache
    • Dentry Cache
    • Inode Cache
  • File System Features
    • Block (fixed-size) vs Extent (pre-allocated contiguous space)
    • Journaling
    • Copy-On-Write
    • Scrubbing
  • File System Types
    • FFS – Berkeley fast file system
    • ext3/ext4
    • XFS
    • ZFS
  • Volume and Pools

Methodology

This section describes various methodologies for file system analysis and tuning:

  • Disk Analysis
  • Latency Analysis
    • Transaction Cost
  • Workload Characterization
    • cache hit ratio
    • cache capacity and utilization
    • distribution of I/O arrival times
    • errors
  • Performance monitoring (operation rate and latency)
  • Static Performance Tuning
  • Cache Tuning
  • Workload Separation
  • Micro-Benchmarking
    • Operation types (read/write)
    • I/O size
    • File offset pattern
    • Write type
    • Working set size
    • Concurrency
    • Memory mapping
    • Cache state
    • Tuning

Observability Tools

  • mount
  • free
  • vmstat
  • sar
  • slabtop, e.g., slabtop -a
  • strace, e.g., strace -ttT -p 123
  • fatrace
  • opensnoop, e.g., opensnoop -T
  • filetop
  • cachestat, e.g., cachestat -T 1
  • bpftrace

9. Disks

This chapter provides a basis for disk I/O analysis. The parts are as follows:

Models

  • Simple Disk: includes an on-disk queue for I/O requests
  • Caching Disk: on-disk cache
  • Controller: HBA (host-bus adapter) bridges CPU I/O transport with the storage transport and attached disk devices.

Concepts

  • Measuring Time: I/O request time = I/O wait-time + I/O service-time
    • disk service time = utilization / IOPS
  • Time Scales
  • Caching
  • Random vs Sequential I/O
  • Read/Write Ratio
  • I/O size
  • Utilization
  • Saturation
  • I/O Wait
  • Synchronous vs Asynchronous
  • Disk vs Application I/O
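
The two timing relations in the Measuring Time bullet above can be worked through with a short sketch. The function names and the example numbers (60% utilization at 300 IOPS, 3 ms of queueing) are hypothetical; utilization is expressed as a fraction between 0 and 1.

```python
def disk_service_time_ms(utilization, iops):
    """Average service time per I/O: utilization / IOPS, in milliseconds."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return utilization / iops * 1000.0


def io_request_time_ms(wait_ms, service_ms):
    """I/O request time = I/O wait time + I/O service time."""
    return wait_ms + service_ms


if __name__ == "__main__":
    service = disk_service_time_ms(0.6, 300)  # about 2 ms per I/O
    print(service, io_request_time_ms(3.0, service))
```

This is an average over the measurement interval, so it says nothing about the distribution of individual I/O latencies.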

Architecture

  • Disk Types
    • Magnetic Rotational
      • max throughput = max sectors per track x sector-size x rpm / 60 s
      • Short-Stroking
      • Sector Zoning
      • On-Disk Cache
    • Solid-State Drives
      • Flash Memory
      • Persistent Memory
  • Interfaces
    • SCSI
    • SAS
    • SATA
    • NVMe
  • Storage Type
    • Disk Devices
    • RAID
  • Operating System Disk I/O Stack
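
The maximum-throughput formula for rotational disks in the list above can be checked with a quick calculation. The drive parameters below (1,000 sectors per track, 512-byte sectors, 7,200 rpm) are hypothetical values chosen for easy arithmetic, not specifications of a real drive.

```python
def max_throughput_bytes_per_s(sectors_per_track, sector_bytes, rpm):
    """max throughput = sectors per track x sector size x rpm / 60 s."""
    revolutions_per_second = rpm / 60
    return sectors_per_track * sector_bytes * revolutions_per_second


if __name__ == "__main__":
    rate = max_throughput_bytes_per_s(1000, 512, 7200)
    print(f"{rate / 1e6:.2f} MB/s")
```

One full track passes under the head per revolution, so throughput is simply bytes per track multiplied by revolutions per second.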

Methodology

  • Tools Method
    • iostat
    • iotop/biotop
    • biolatency
    • biosnoop
  • USE Method
  • Performance Monitoring
  • Workload Characterization
    • I/O rate
    • I/O throughput
    • I/O size
    • Read/write ratio
    • Random vs sequential
  • Latency Analysis
  • Static Performance Tuning
  • Cache Tuning
  • Micro-Benchmarking
  • Scaling

Observability Tools

  • iostat
  • pressure stall information (PSI)
  • perf
  • biolatency, e.g., biolatency 10 1
  • biosnoop
  • biotop
  • ioping

10. Network

This chapter introduces network analysis. The parts are as follows:

Models

  • Network Interface
  • Controller
  • Protocol Stack
    • TCP/IP
    • OSI Model

Concepts

  • Network and Routing
  • Protocols
  • Encapsulation
  • Packet Size
  • Latency
    • Connection Latency
    • First-Byte Latency
    • Round-Trip Time
  • Buffering
  • Connection Backlog
  • Congestion Avoidance
  • Utilization
  • Local Connection

Architecture

  • Protocols
  • IP
  • TCP
    • Sliding Window
    • Congestion avoidance
    • TCP SYN cookies
    • 3-way Handshake
    • Duplicate Ack Detection
    • Congestion Controls
    • Nagle algorithm
    • Delayed Acks
  • UDP
    • QUIC and HTTP/3
  • Hardware
    • Interfaces
    • Controller
    • Switches and Routers
    • Firewalls
  • Software
    • Network Stack
  • Linux Stack
    • TCP Connection queues
    • TCP Buffering
    • Segmentation Offload: GSO and TSO
    • Network Device Drivers
    • CPU Scaling
    • Kernel Bypass

Methodology

  • Tools Method
    • netstat -s
    • ip -s link
    • ss -tiepm
    • nicstat
    • tcplife
    • tcptop
    • tcpdump
  • USE Method
  • Workload Characterization
    • Network interface throughput
    • Network interface IOPS
    • TCP connection rate
  • Latency Analysis
  • Performance Monitoring
    • Throughput
    • Connections
    • Errors
    • TCP retransmits
    • TCP out-of-order packets
  • Packet Sniffing
    • tcpdump -ni eth4
  • TCP Analysis
  • Static Performance Tuning
  • Resource Controls
  • Micro-Benchmarking

Observability Tools

  • ss, e.g., ss -tiepm
  • strace, e.g., strace -e sendmsg,recvmsg ss -t
  • ip, e.g., ip -s link
  • ifconfig
  • netstat
  • sar
  • nicstat
  • ethtool, e.g., ethtool -S eth0
  • tcplife, tcptop
  • tcpretrans
  • bpftrace
  • tcpdump
  • wireshark
  • pathchar
  • iperf
  • netperf
  • tc

Tuning

sysctl -a | grep tcp

11. Cloud Computing

This chapter introduces cloud performance analysis with following parts:

Background

  • Instance Types: m5 (general-purpose), c5 (compute optimized)
  • Scalable Architecture: horizontal scalability with load balancers, web servers, application servers, and databases.
  • Capacity Planning: Dynamic sizing (auto-scaling) using an auto-scaling group, and scalability testing.
  • Storage: File store, block store, object store
  • Multitenancy
  • Orchestration (Kubernetes)

Hardware Virtualization

  • Type 1: execute directly on the processors using native hypervisor or bare-metal hypervisor (e.g., Xen)
  • Type 2: execute within a host OS and hypervisor is scheduled by the host kernel
  • Implementation: Xen, Hyper-V, KVM, Nitro
  • Overhead:
    • CPU overhead (binary translation, paravirtualization, hardware assisted)
    • Memory Mapping
    • Memory Size
    • I/O
    • MultiTenant Contention
    • Resource Controls
  • Resource Controls
    • CPUs – Borrowed virtual time, Simple earliest deadline, Credit based schedulers
    • CPU Caches
    • Memory Capacity
    • File System Capacity
    • Device I/O
  • Observability
    • xentop
    • perf kvm stat live
    • bpftrace -lv t:kvm:kvm_exit
    • mpstat -P ALL 1

OS Virtualization

  • Implementation: Linux supports namespaces and cgroups that are used to create containers. Kubernetes builds on them with an architecture that includes Pods, Kube Proxy, and CNI.
  • Namespaces
    • lsns
  • Control Groups (cgroups) limit the usage of resources
  • Overhead – CPU, Memory Mapping, Memory Size, I/O, and Multi-Tenant Contention
  • Resource Controls – throttle access to resources so they can be shared more fairly
    • CPU
    • Shares and Bandwidth
    • CPU Cache
    • Memory Capacity
    • Swap Capacity
    • File System Capacity
    • File System Cache
    • Disk I/O
    • Network I/O
  • Observability
    • from Host
      • kubectl get pod
      • docker ps
      • kubectl top nodes
      • kubectl top pods
      • docker stats
      • cgroup stats (cpuacct.usage and cpuacct.usage_percpu)
      • systemd-cgtop
      • nsenter -t 123 -m -p top
      • Resource Controls (throttled time, non-voluntary context switches, idle CPU, busy, all other tenants idle)
    • from guest (container)
      • iostat -sxz 1
  • Lightweight Virtualization
  • Lightweight hypervisor based on process virtualization (Amazon Firecracker)
    • Implementation – Intel Clear, Kata, Google gVisor, Amazon Firecracker
  • Overhead
  • Resource Controls
  • Observability
    • From Host
    • From Guest
      • mpstat -P ALL 1

12. Benchmarking

This chapter discusses benchmarks and provides advice with methodologies. The parts of this chapter include:

Background

  • Reasons
    • System design
    • Proof of concept
    • Tuning
    • Development
    • Capacity planning
    • Troubleshooting
    • Marketing
  • Effective Benchmarking
    • Repeatable
    • Observable
    • Portable
    • Easily presented
    • Realistic
    • Runnable
    • Benchmark Analysis
  • Benchmarking Failures
    • Casual Benchmarking – you benchmark A, but measure B, and conclude you measured C, e.g. disk vs file system (buffering/caching may affect measurements).
    • Blind Faith
    • Numbers without Analysis – include description of the benchmark and analysis.
    • Complex Benchmark Tools
    • Testing the wrong thing
    • Ignoring the Environment (not tuning same as production)
    • Ignoring Errors
    • Ignoring Variance
    • Ignoring Perturbations
    • Changing Multiple Factors
    • Friendly Fire
  • Benchmarking Types
    • Micro-Benchmarking
    • Simulation – simulate customer application workload (macro-benchmarking)
    • Replay
    • Industry Standards – TPC, SPEC

Methodology

This section describes methodologies for performing benchmarking:

  • Passive Benchmarking (anti-methodology)
    • pick a BM tool
    • run with a variety of options
    • Make a slide of results and share it with management
    • Problems
      • Invalid due to software bugs
      • Limited by benchmark software (e.g., single thread)
      • Limited by a component that is unrelated to the benchmark (congested network)
      • Limited by configuration
      • Subject to perturbation
      • Benchmarking the wrong thing entirely
  • Active Benchmarking
    • Analyze performance while benchmarking is running
    • bonnie++
    • iostat -sxz 1
  • CPU Profiling
  • USE Method
  • Workload Characterization
  • Custom Benchmarks
  • Ramping Load
  • Statistical Analysis
    • Selection of benchmark tool, its configuration
    • Execution of the benchmark
    • Interpretation of the data
  • Benchmarking Checklist
    • Why not double
    • Did it break limits
    • Did it error
    • Did it reproduce
    • Does it matter
    • Did it even happen

13. perf

This chapter introduces perf tool:

  • Subcommands Overview
    • perf record -F 99 -a -- sleep 30
  • perf Events
    • perf list
  • Hardware Events (PMCs)
    • Frequency Sampling
    • perf record -vve cycles -a sleep 1
  • Software Events
    • perf record -vve context-switches -a -- sleep
  • Tracepoint Events
    • perf record -e block:block_rq_issue -a sleep 10; perf script
  • Probe Events
    • kprobes, e.g., perf probe --add do_nanosleep
    • uprobes, e.g., perf probe -x /lib.so.6 --add fopen
    • USDT
  • perf stat
    • Interval Statistics
    • Per-CPU Balance
    • Event Filters
    • Shadow Statistics
  • perf record
    • CPU Profiling, e.g., perf record -F 99 -a -g -- sleep 30
    • Stack Walking
  • perf report
    • TUI
    • STDIO
  • perf script
    • Flame graphs
  • perf trace

14. Ftrace

This chapter introduces Ftrace tool. The sections are:

  • Capabilities
  • tracefs
    • tracefs contents, e.g., ls -F /sys/kernel/debug/tracing
  • Ftrace Function Profiler
  • Ftrace Function Tracing
  • Tracepoints
    • Filter
    • Trigger
  • kprobes
    • Event Tracing
    • Argument and Return Values
    • Filters and Triggers
  • uprobes
    • Event Tracing
    • Argument and Return Values
  • Ftrace function_graph
  • Ftrace hwlat
  • Ftrace Hist Triggers
  • perf ftrace

15. BPF

This chapter introduces BPF tool. The sections are:

  • BCC – BPF Compiler Collection, e.g., biolatency.py -mF
  • bpftrace – open-source tracer built upon BPF and BCC
  • Programming

16. Case Study

This chapter describes the story of a real-world performance issue.

  • Problem Statement – Java application in AWS EC2 Cloud
  • Analysis Strategy
    • Checklist
    • USE method
  • Statistics
    • uptime
    • mpstat 10
  • Configuration
    • cat /proc/cpuinfo
  • PMCs
    • ./pmcarch -p 123 10
  • Software Events
    • perf stat -e cs -a -I 1000
  • Tracing
    • cpudist -p 123 10 1
  • Conclusion
    • No container neighbors
    • LLC size and workload difference
    • CPU difference

May 28, 2023

Patterns for API Design

Filed under: API,Computing — admin @ 10:46 pm

I recently read Olaf Zimmermann’s book “Patterns for API Design”, which reviews the theory and practice of API design patterns. These patterns build upon the earlier work of Gregor Hohpe on Enterprise Integration Patterns and Martin Fowler’s Patterns of Enterprise Application Architecture. Following is a summary of these API patterns from the book:

1. Application Programming Interface (API) Fundamentals

The first chapter covers the fundamentals of remote APIs, which define a contract for the desired behavior, communication protocol, network endpoints and policies regarding failures. The chapter also surveys the history of remote APIs such as TCP/IP based FTP, RPC based DCE/CORBA/RMI/gRPC, queue/messaging based, REST style, and data streams/pipelines. The authors then examine cloud native applications (CNA) and a set of principles described in IDEAL that include Isolated State, Distribution, Elasticity, Automation and Loose Coupling. These traits are then summarized as:

  • Fit for purpose
  • Rightsized and modular
  • Resilient and protected
  • Controllable and adaptable
  • Workload-aware and resource-efficient
  • Agile and tool-supported

The authors describe how service-oriented architecture originated and evolved into Microservices, each of which has a single responsibility within a domain-specific business capability. Microservices facilitate software reuse but also bring new challenges that include fallacies of distributed computing, data consistency and state management. These APIs should be considered as products and they may form ecosystems. API business value, visibility and lifetime help make an API successful by enabling rapid integration of systems while supporting the autonomy and independent evolution of those systems. The API design may differ based on:

  • One general vs many specialized endpoints
  • Fine vs coarse-grained endpoint and operation scope
  • Few operations carrying much data vs chatty interactions carrying little data
  • Data currentness vs correctness
  • Stable contracts vs fast changing ones

The authors review architecturally significant requirements such as understandability, information sharing vs hiding, amount of coupling, modifiability, performance & scalability, data parsimony, and security & privacy. The authors then describe developer experience as:

DX = function, stability, ease of use, and clarity

The developer experience includes quality attributes throughout the lifecycle of API such as development qualities, operational qualities and managerial qualities. The chapter ends with definition of a domain model for remote APIs that include communication participants, endpoints with contracts that describe operations, message structure, and API contract.

2. Lakeside Mutual Case Study

Chapter 2 introduces a fictitious Lakeside Mutual case study to illustrate API design. This chapter examines the user stories and quality attributes for the case study, the analysis-level domain model and an architecture overview. The architecture overview describes the system context and application architecture including service components. The chapter then describes an example of an API specification using MDSL (Microservice Domain-Specific Language).

3. API Decision Narratives

This chapter goes over the API design options and decisions where each decision includes criteria, alternative options and recommendations based on why-statement and architecture-decision-record format. The first pattern starts with Foundation API decisions and patterns that include:

3.1 API Visibility

This decision, which is primarily managerial and organizational, looks at the different visibility options such as Public API, Community API and Solution-Internal API.

3.2 API Integration

This decision looks at the API integration types, where an API can be integrated with the backend horizontally or with the frontend and backend vertically. In the former option, the backend exposes its services via a message-based remote backend integration API. In the latter option, the APIs are exposed via a message-based remote frontend integration API.

3.3 Documentation of API

The API designers need to decide whether the API should be documented and, if so, how. For example, there are multiple standards such as the OpenAPI Specification (formerly known as Swagger), Web Application Description Language (WADL), Web Services Description Language (WSDL) and Microservice Domain-Specific Language (MDSL).

3.4 API Roles and Responsibilities

The API designers have to find an appropriate business granularity for the service and handle cohesion & coupling criteria. The driver for this decision is to define the architectural role that an API endpoint should play and the responsibility of each API operation. The role of an API endpoint can be Processing Resource for processing incoming commands or Information Holder Resource for storage and retrieval of data or metadata. The information holder roles can be further divided into operational/transactional short-lived data; master long-lived data for business transactions; reference long-lived data for looking up delivery status, zip codes, etc.; link lookup resource to identify links to resources; and data transfer resource to offer a shared data exchange between clients.

3.5 Defining Operation Responsibilities

The operation responsibilities include defining read-write characteristics of each API operation and can be categorized into Computation Function, State Creation Operation, State Transition Operation and Retrieval Operation. The Computation Function computes a result solely from the client input without reading or writing server-side state. The State Creation Operation reliably creates state on a write-only API endpoint. The State Transition Operation performs one or more activities, causing a server-side state change with considerations for network efficiency and data parsimony. The Retrieval Operation represents a read-only access operation to find the data.

3.6 Selecting Message Representation Patterns

The structural representation patterns deal with designing message representation structures with considerations for finding the optimal number of message parameters and the semantic meaning and stereotypes of the representation elements. This structural representation involves four decisions: responsibility of message elements, structure of the parameter representation, exchange of context information required and meaning of stereotypes of message elements. The structure of parameter representation can be nested or flat with types: Atomic Parameter, Atomic Parameter List, Parameter Tree, and Parameter Forest. The Atomic Parameter defines a single parameter or body element. The Atomic Parameter List aggregates multiple atomic parameters as a list. The Parameter Tree defines a hierarchical structure with one or more child nodes. The Parameter Forest comprises two or more Parameter Trees. Security and privacy concerns such as data integrity and confidentiality as well as semantic proximity will determine the right option among these structure types.
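
To visualize the four structure types, the sketch below shows one possible way to express them as plain Python values, along with a rough classifier. The mapping onto dicts and lists, the sample field names, and the `classify` helper are my own illustrative assumptions, not definitions from the book.

```python
def is_atomic(value):
    """Plain data such as text, numbers, booleans, or null."""
    return value is None or isinstance(value, (str, int, float, bool))


def classify(representation):
    """Heuristically name the structure pattern of a message representation."""
    if is_atomic(representation):
        return "Atomic Parameter"
    if isinstance(representation, dict):
        if all(is_atomic(v) for v in representation.values()):
            return "Atomic Parameter List"   # flat group of atomic values
        return "Parameter Tree"              # at least one nested child
    if isinstance(representation, list):
        if all(is_atomic(v) for v in representation):
            return "Atomic Parameter List"
        if len(representation) >= 2 and all(isinstance(v, dict) for v in representation):
            return "Parameter Forest"        # two or more trees side by side
    raise ValueError("unsupported representation")


if __name__ == "__main__":
    tree = {"name": "Ada", "address": {"city": "Zurich", "zip": "8000"}}
    print(classify("Zurich"))
    print(classify({"city": "Zurich", "zip": "8000"}))
    print(classify(tree))
    print(classify([tree, {"policy": {"id": 7}}]))
```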

3.7 Element Stereotypes

The element stereotype patterns include Data Element, Metadata Element, Link Element, and ID Element. Security and data privacy concerns drive Data Elements and Metadata Elements, and messages may become larger if Metadata Elements are included. The unique ID Element is used to identify API endpoints, operations, and message representation elements. The Link Element acts as a human- and machine-readable, network-accessible pointer to other endpoints and operations.

3.8 Governing API Quality

The quality of service (QoS) attributes for an API include reliability, performance, security and scalability. The themes of the decisions for quality governance include:

3.8.1 Identification and Authentication of the API Client

Identification, authentication and authorization are important for API security but they also enable measures for ensuring many other qualities.

3.8.2 API Key

The API Key identifies the client; requests can additionally carry a signature made with a secret key, which is never transmitted. Alternatively, you may use OAuth 2.0, OpenID, Kerberos in combination with LDAP, CHAP, and EAP.

3.8.3 Pricing Plan

The pricing plan looks at the metering and charging for API consumption and its variants include Freemium Model, Flat-Rate Subscription, Usage-based Pricing and Market-based Pricing based on economic aspects.

3.8.4 Rate Limit

The Rate Limit safeguards against API clients that overuse the API.
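
One common way to enforce a Rate Limit is a token bucket. The summary above does not prescribe an algorithm, so the sketch below is only one possible technique, with a hypothetical class name and with time passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_second=1.0)
    print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])
```

In a real service, `now` would typically come from a monotonic clock and there would be one bucket per API key or client.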

3.8.5 Service Level Agreement

The Service Level Agreement defines testable service-level objectives to establish a structured, quality-oriented agreement with the API product owner.

3.8.6 Error Report

The Error Report uses error codes in the response message that indicate and classify the faults in a simple, machine-readable way.

3.8.7 Context Representation

The context representation uses Metadata Elements to carry contextual information in request and response messages. It can be used to cope with the diversity of protocols for distributed applications and transport security tokens and digital signatures.

3.8.8 Pagination

The pagination divides large response data into smaller chunks and its variants include Page-Based Pagination, Cursor-Based Pagination, Offset-Based Pagination, and Time-Based Pagination. It can optionally allow filtering capabilities and pagination structure can be defined as Atomic Parameter List, or Parameter Forest or Parameter Tree.
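
As a minimal sketch of two of these variants, the functions below page through an in-memory list by offset and by cursor. The response field names (`data`, `next_offset`, `next_cursor`) and the use of an `id` field as the cursor are assumptions for illustration, not a format defined in the book.

```python
def page_by_offset(items, offset, limit):
    """Offset-Based Pagination: the client says where to start."""
    chunk = items[offset:offset + limit]
    has_more = offset + limit < len(items)
    return {"data": chunk, "next_offset": offset + limit if has_more else None}


def page_by_cursor(items, cursor, limit, key=lambda item: item["id"]):
    """Cursor-Based Pagination: resume after the last key the client saw.

    Assumes `items` is sorted by `key` and that keys are unique.
    """
    remaining = [i for i in items if cursor is None or key(i) > cursor]
    chunk = remaining[:limit]
    has_more = len(remaining) > limit
    return {"data": chunk, "next_cursor": key(chunk[-1]) if has_more else None}


if __name__ == "__main__":
    items = [{"id": i} for i in range(1, 6)]
    print(page_by_offset(items, 0, 2))
    print(page_by_cursor(items, 2, 2))
```

Cursor-based paging tends to stay stable when items are inserted or deleted between calls, whereas offsets can skip or repeat entries.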

3.8.9 Wish List and Wish Template

A Wish List allows API clients to list the desired data elements of the requested resource in the request. When the response contains nested data, a Wish Template can be used to specify parameters in the request message that should be included in the corresponding response message.
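
A rough sketch of both ideas follows, using dictionaries as stand-ins for resources. The function names, the sample customer record, and the convention that a template marks wanted fields with `True` are hypothetical choices of mine.

```python
def apply_wish_list(resource, wish_list):
    """Keep only the top-level fields the client asked for."""
    if not wish_list:
        return dict(resource)  # no wishes: return everything
    return {k: v for k, v in resource.items() if k in wish_list}


def apply_wish_template(resource, template):
    """Keep fields whose template entry is truthy, recursing into nested data."""
    result = {}
    for key, wanted in template.items():
        if key not in resource:
            continue
        if isinstance(wanted, dict) and isinstance(resource[key], dict):
            result[key] = apply_wish_template(resource[key], wanted)
        elif wanted:
            result[key] = resource[key]
    return result


if __name__ == "__main__":
    customer = {"id": 1, "name": "Ada",
                "address": {"city": "Zurich", "zip": "8000"}}
    print(apply_wish_list(customer, ["id", "name"]))
    print(apply_wish_template(customer, {"name": True, "address": {"city": True}}))
```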

3.8.10 Conditional Request

This pattern makes a Conditional Request by adding Metadata Elements to the request message, and the API service processes the request only if the condition specified by the metadata is met.
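
One familiar realization of this pattern is the HTTP `If-None-Match` header with a 304 Not Modified response. The sketch below imitates that flow without a web framework; the fingerprint scheme and the function names are assumptions for illustration, not the book's prescribed design.

```python
import hashlib
import json


def fingerprint(resource):
    """Derive a stable tag from the resource content."""
    body = json.dumps(resource, sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def conditional_get(resource, if_none_match=None):
    """Return (status, tag, body); skip the body if the client copy is current."""
    tag = fingerprint(resource)
    if if_none_match == tag:
        return 304, tag, None   # condition not met: nothing to resend
    return 200, tag, resource


if __name__ == "__main__":
    status, tag, _ = conditional_get({"name": "Ada"})
    print(status, conditional_get({"name": "Ada"}, tag)[0])
```

The client stores the tag from the first response and sends it back as metadata, which lets the provider avoid retransmitting unchanged data.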

3.8.11 Request Bundle

Request Bundle is defined as a data container that assembles multiple requests with unique identifiers in a single request message.

3.8.12 Embedded Entity

This pattern embeds a Data Element in the request or response instead of a link or identifier.

3.8.13 Linked Information Holder

Linked Information Holder adds a Link Element to the message that points to the API endpoint representing the linked element.

3.9 API Evolution

API Evolution patterns define governing rules balancing stability and compatibility with maintainability and extensibility such as:

3.9.1 Version Identifier

Version identifier is added as a Metadata Element to the endpoint address, protocol header or the message payload to indicate possibly incompatible changes to clients.

3.9.2 Semantic Versioning

Semantic Versioning introduces a hierarchical three-number versioning schema x.y.z, which allows API providers to denote level of changes as major, minor and patch versions.
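
A short sketch of how a client or provider might reason about x.y.z versions is shown below. It handles only plain three-number versions (no pre-release or build suffixes), and the function names and the compatibility rule are simplifying assumptions of mine.

```python
def parse_version(text):
    """Split 'x.y.z' into a (major, minor, patch) tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def change_level(old, new):
    """Name the most significant part that differs between two versions."""
    o, n = parse_version(old), parse_version(new)
    for name, a, b in zip(("major", "minor", "patch"), o, n):
        if a != b:
            return name
    return "none"


def is_compatible(client_built_for, provider):
    """Same major version, and the provider is at least as new as the client expects."""
    c, p = parse_version(client_built_for), parse_version(provider)
    return p[0] == c[0] and p[1:] >= c[1:]


if __name__ == "__main__":
    print(change_level("1.2.3", "1.3.0"), is_compatible("1.2.0", "1.3.1"))
```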

3.9.3 Commissioning and Decommissioning

The variants for version introduction and decommissioning decision include Two in Production, Limited Lifetime Guarantee, and Aggressive Obsolescence.

3.9.4 Experimental Preview

Experimental Preview provides access to an API to receive early feedback from consumers without making any commitments about the functionality, stability and longevity.

4. Pattern Language Introduction

This chapter introduces a pattern language, basic scoping and structural patterns. Many of the patterns build upon Enterprise Integration Patterns and GoF Design Patterns when defining the structure of a message. This chapter categorizes patterns into Foundation patterns, Responsibility patterns, Structure patterns, Quality patterns and Evolution patterns. These patterns also follow Design Refinement Phases based on the Unified Process:

  • Inception – Foundation
  • Elaboration – Responsibility and Quality
  • Construction – Structure, Responsibility and Quality
  • Transition – Foundation, Quality and Evolution

4.1 Foundations: API Visibility and Integration Types

These patterns deal with types of systems, subsystems and components as well as where an API should be accessible. The API integration types can be Frontend Integration and Backend Integration. Frontend Integrations, also referred to as vertical integrations, are consumed by API clients in application frontends. Cloud-native applications and microservices-based systems benefit from Backend Integration, sometimes called horizontal integration, to access information or activity in other systems. The API visibility alternatives can be Public API, Community API and Solution-Internal API. A Public API specifies endpoints, operations, message representation, quality of service and lifecycle model; it can be accessed by an unlimited or unknown number of API clients and can be controlled with API keys. You may apply other patterns such as Version Identifiers, Pricing Plan, Rate Limit and Service Level Agreement with Public APIs. A Community API is only available to a community that may consist of different organizations. The Solution-Internal API is also referred to as a Platform API that may be exposed in a single cloud provider offering.

4.2 Basic Structure Patterns

The structure patterns look at the number of representation elements for request and response messages and decide how these elements should be grouped. These patterns include Atomic Parameter that describes plain data such as text and numbers; Atomic Parameter List that groups several elementary parameters; Parameter Trees that provide nested parameters; and Parameter Forest that groups multiple tree parameters.

5. Define Endpoint Types and Operations

This chapter corresponds to the Define phase of the Align-Define-Design-Refine (ADDR) process and describes high-level endpoint identification activities. The authors look at user stories, event storming or other collaboration techniques to define API roles and responsibilities. The design of API contracts also has to define developer experience in terms of function, stability, ease of use, and clarity. Other quality attributes that the API designer has to decide include: Accuracy for functional correctness including preconditions, invariants and postconditions; Distribution of control and autonomy between API client and provider; Scalability, performance and availability with Service Level Agreements for mission-critical APIs; Manageability for monitoring APIs; Consistency and atomicity for all-or-nothing semantics; Idempotence property; Auditability for risk management.

5.1 Endpoint Roles (aka Service Granularity)

The two general endpoint roles are Processing Resource and Information Holder Resource. The Processing Resource role allows remote clients to trigger an action; related design concerns include contract expressiveness and service granularity; learnability and manageability; semantic interoperability; response time; security and privacy; and compatibility and evolvability. An Information Holder Resource exposes domain data in an API and may use domain-driven design and object-oriented analysis and design to model the data. Other related concerns include quality attribute conflicts and trade-offs; security; data freshness vs consistency; and compliance with architectural design principles. Related patterns include:

  • Operational Data Holder to create, read, update and delete its data often and fast.
  • Master Data Holder to access master data that lives for a long time, rarely changes and will be referenced from many clients. The request and response messages of Master Data often take the form of Parameter Trees, and master data updates may come in the form of coarse-grained full updates or fine-grained partial updates.
  • Reference Data Holder is used to look up reference data that is long lived and immutable for clients using API endpoints. Its desired qualities include Do not repeat yourself (DRY) and the performance vs consistency trade-off for read access.
  • Link Lookup Resource allows referring to other resources so that clients remain loosely coupled if the API provider changes the destination of links. The design challenges include: coupling between clients and endpoints; dynamic endpoint references; centralization vs decentralization; message sizes, number of calls, resource use; dealing with broken links; and number of endpoints and API complexity.
  • Data Transfer Resource allows participants to exchange data without knowing each other and without being available at the same time. You may introduce a shared storage endpoint with a State Creation Operation and a Retrieval Operation. The design considerations include coupling (time and location dimensions); communication constraints; reliability; scalability; storage space efficiency; latency; and ownership management. Further pattern properties include access control; (lack of) coordination; optimistic locking; polling; and garbage collection.

5.2 Operation Responsibilities

The four operation responsibilities include:

  • State Creation Operation to allow clients to report that something has happened, e.g. to trigger instant or later processing. The design concerns include: coupling trade-offs (accuracy and expressiveness vs information parsimony); timing; consistency; and reliability. It may or may not have fire-and-forget semantics, and idempotency may be needed for the transaction boundary. A popular variant of this pattern is Event Notification Operation, notifying the endpoint about an external event.
  • Retrieval Operation to retrieve information and allow further client-side processing. The design issues include: veracity, variety, velocity and volume; workload management; network efficiency vs data parsimony.
  • State Transition Operation to allow a client to initiate a processing action that causes the provider-side application state to change. The design concerns include service granularity; consistency and auditability; dependencies on state changes being made beforehand; workload management; and network efficiency vs data parsimony. State Transition Operations are generally transactional, supporting ACID behavior, and may use ABAC for compliance and security controls.
  • Computation Function to allow a client to invoke side-effect-free remote processing on the provider side based on the input parameters. The relevant design issues include reproducibility and trust; performance; and workload management. Its examples include a Transformation service, Validation service, and Long Running Computation.
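The four responsibilities above can be sketched on a single in-memory endpoint. This is only an illustration under stated assumptions: the `OrderEndpoint` class, its method names, the idempotency key and the shipping rate are all hypothetical and do not come from the book.

```python
class OrderEndpoint:
    """Hypothetical in-memory endpoint illustrating the four operation responsibilities."""

    def __init__(self):
        self._orders = {}     # order_id -> order state
        self._seen_keys = {}  # idempotency key -> order_id

    def create_order(self, idempotency_key, item):
        """State Creation Operation: a repeated key does not create a second order."""
        if idempotency_key in self._seen_keys:
            return self._seen_keys[idempotency_key]
        order_id = f"order-{len(self._orders) + 1}"
        self._orders[order_id] = {"item": item, "status": "CREATED"}
        self._seen_keys[idempotency_key] = order_id
        return order_id

    def get_order(self, order_id):
        """Retrieval Operation: read-only, returns a copy of the stored state."""
        return dict(self._orders[order_id])

    def ship_order(self, order_id):
        """State Transition Operation: changes provider-side state if allowed."""
        order = self._orders[order_id]
        if order["status"] != "CREATED":
            raise ValueError(f"cannot ship an order in status {order['status']}")
        order["status"] = "SHIPPED"

    @staticmethod
    def shipping_cost(weight_kg, rate_per_kg=2.5):
        """Computation Function: depends only on its inputs, no state is touched."""
        return round(weight_kg * rate_per_kg, 2)
```

Retrying `create_order` with the same key returns the same order identifier, which is one simple way to obtain the idempotency mentioned for State Creation Operations.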

6. Design Request and Response Message Representations

This chapter corresponds to the Design phase of the Align-Define-Design-Refine (ADDR) process and examines structural patterns for requests and responses. The challenges when designing message representations include interoperability at the protocol and message-content level; latency; throughput and scalability; maintainability; and developer experience.

6.1 Data Element

The Data Element pattern allows exchanging application-level information between API clients and API providers without coupling them tightly. The competing forces include rich functionality vs ease of processing and performance; security and data privacy vs ease of configuration; and maintainability vs flexibility.

6.2 Metadata Element

The Metadata Element pattern allows enriching messages with additional information so that the receiver can interpret the message content correctly. The design concerns include interoperability; coupling; and ease of use vs runtime efficiency. The variants of this pattern include Control Metadata Elements such as identifiers, flags, filters, ACLs, API keys, etc.; Aggregated Metadata Elements such as Pagination counters and statistical information; and Provenance Metadata Elements such as message/request IDs, creation dates, version numbers, etc.

6.3 ID Element

ID Element pattern helps identify elements of the Published Language using UUID or surrogate key when applying domain-driven design. The identification problems include effort vs stability; reliability for machines and humans; and security.

6.4 Link Element

Link Element is used to reference API endpoints and operations in request and response message payloads so that they can be called remotely.

6.5 API Key

API Key allows an API provider to identify and authenticate clients and their requests. The design issues include establishing basic security; access control; avoiding the need to store or transmit user credentials; decoupling clients from their organizations; security vs ease of use; and performance.

6.6 Error Report

Error Report allows an API provider to inform its clients about communication and processing faults. The design concerns include expressiveness and target audience expectations; robustness and reliability; security and performance; interoperability and portability; and internationalization.

6.7 Context Representation

Context Representation allows API consumers and providers to exchange context information without relying on any particular remoting protocol. The design considerations include interoperability and modifiability; dependency on evolving protocols; developer productivity (control vs convenience); diversity of clients and their requirements; end-to-end security; and logging and auditing on the business domain level.

7. Refine Message Design for Quality

This chapter reviews API Quality patterns related to the Design and Refine phases of the Align-Define-Design-Refine (ADDR) process. The major challenges with API Quality include message sizes vs number of requests; information needs of individual clients; network bandwidth usage vs computation efforts; implementation complexity vs performance; statelessness vs performance; and ease of use vs legacy.

7.1 Message Granularity

The message granularity patterns deal with performance and scalability; modifiability and flexibility; data quality; data privacy; and data freshness vs consistency. These patterns include:

  • Embedded Entity allows placing related Data Elements in the request or response to avoid exchanging multiple messages.
  • Linked Information Holder can be used to keep the message small when an API deals with multiple information elements that reference each other.

7.2 Client-Driven Message Content (aka Response Shaping)

These patterns deal with performance, scalability, and resource use; information needs of individual clients; loose coupling and interoperability; developer experience; security and data privacy; and test and maintenance effort. These patterns include:

  • Pagination allows an API provider to deliver a large sequence of structured data without overwhelming clients. The design concerns include session awareness and isolation; and data set size and data access profile. The variants of this pattern include Page-Based Pagination, Cursor-Based Pagination, Time-Based Pagination and Offset-Based Pagination.
  • Wish List allows an API client to inform the API provider at runtime about the data it is interested in.
  • Wish Template allows an API client to inform the API provider about the nested data it is interested in. For example, Wish Templates of GraphQL are the query and mutation schemas providing declarative descriptions of the client requirements.
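As an illustration of the Cursor-Based Pagination variant, the following minimal sketch pages through an in-memory list. The `ITEMS` data set, the function names and the base64-encoded cursor format are assumptions made for this example; a real provider would page over a database query instead.

```python
import base64
import json

ITEMS = [{"id": i, "name": f"item-{i}"} for i in range(1, 8)]  # hypothetical data set


def encode_cursor(last_id):
    """Wrap the last seen id in an opaque token so clients do not depend on its format."""
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor):
    """Recover the last seen id from the opaque token."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def list_items(cursor=None, limit=3):
    """Return one page plus a cursor for the next page (None on the last page)."""
    after = decode_cursor(cursor) if cursor else 0
    page = [item for item in ITEMS if item["id"] > after][:limit]
    has_more = bool(page) and page[-1]["id"] < ITEMS[-1]["id"]
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"items": page, "next_cursor": next_cursor}


def fetch_all():
    """Client side: follow the cursors until the provider reports no further page."""
    cursor, result = None, []
    while True:
        response = list_items(cursor)
        result.extend(response["items"])
        cursor = response["next_cursor"]
        if cursor is None:
            return result
```

Because the cursor encodes a position in the data rather than a page number, items inserted before that position do not shift the following pages, which is one reason this variant is often preferred for changing data sets.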

7.3 Message Exchange Optimization (aka Conversation Efficiency)

These patterns balance competing forces such as the complexity of endpoint, client, and message payload design; and the accuracy of reporting and billing. These patterns include:

  • Conditional Request prevents unnecessary server-side processing and bandwidth usage by invoking an API operation only when a condition is true. The design concerns include size of message; client workload; provider workload; and data currentness vs correctness. Its variants include Time-Based Conditional Request (e.g. the If-Modified-Since HTTP header) and Fingerprint-Based Conditional Request (e.g. the ETag and If-None-Match HTTP headers).
  • Request Bundle works as a data container that assembles multiple independent requests in a single request message with unique request identifiers.
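The Fingerprint-Based Conditional Request variant can be sketched without a web framework. In this minimal example the `fingerprint` and `handle_get` functions and the use of a truncated SHA-256 hash as the ETag value are assumptions for illustration; only the ETag and If-None-Match header names and the 304 status code come from HTTP itself.

```python
import hashlib
import json


def fingerprint(resource):
    """Compute a fingerprint (ETag value) from the resource's canonical JSON form."""
    body = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle_get(resource, request_headers):
    """Return (status, headers, body); 304 without a body if the client's copy is current."""
    etag = fingerprint(resource)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, None
    return 200, {"ETag": etag}, resource
```

A client stores the ETag from the first response and sends it back in If-None-Match; the provider then skips the payload as long as the fingerprint is unchanged.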

8. Evolve API

This chapter reviews API Evolution patterns related to the Refine phase of the Align-Define-Design-Refine (ADDR) process. The major challenges with API Evolution include autonomy, loose coupling, extensibility, compatibility and sustainability. The patterns in this chapter include:

8.1 Versioning and Compatibility Management

These patterns include:

  • Version Identifier allows an API provider to indicate its current capabilities as well as the existence of possibly incompatible changes to clients. The design concerns include accuracy vs exact identification; no accidental breakage of compatibility; client-side impact; and traceability of API versions in use.
  • Semantic Versioning allows stakeholders to compare API versions to detect incompatible changes. The design concerns include minimal effort to detect version incompatibility; clarity of change impact; clear separation of changes with different levels of impact and compatibility; manageability of API versions and related governance effort; and clarity with regard to the evolution timeline. The common numbering scheme in Semantic Versioning includes a major version, a minor version and a patch version.
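A minimal sketch of how a client could use the major, minor and patch numbers to detect incompatible changes follows. The function names are hypothetical, pre-release and build suffixes are deliberately not handled, and the compatibility rule is the usual Semantic Versioning convention rather than something prescribed by the book.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (pre-release tags are not handled)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(client_built_for, provider_version):
    """A client keeps working if the major version matches and the provider is not older."""
    client = parse_version(client_built_for)
    provider = parse_version(provider_version)
    return client[0] == provider[0] and provider >= client
```

For example, a client built against 1.2.0 can call a provider at 1.3.1, but not one at 2.0.0 (breaking change) or 1.1.9 (features may be missing).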

8.2 Life-Cycle Management Guarantees

These patterns include:

  • Experimental Preview allows providers to make the introduction of a new API or a new API version less risky for their clients and to obtain early adopter feedback without freezing the API design prematurely. The design considerations include innovation and new features; feedback; focusing effort; early learning; and security.
  • Aggressive Obsolescence allows API providers to reduce the effort for maintaining an entire API by removing unused or deprecated features. The design concerns include minimizing the maintenance effort; reducing forced changes to clients in a given time span as a consequence of API changes; respecting/acknowledging power dynamics; and commercial goals and constraints. The API version is marked as Release, Deprecate or Decommission in the lifecycle of Aggressive Obsolescence.
  • Limited Lifetime Guarantee allows a provider to let clients know for how long they can rely on the published version of an API. The design considerations include making client-side changes caused by API changes plannable; and limiting the maintenance effort for supporting old clients.
  • Two in Production allows a provider to gradually update an API without breaking existing clients but also without maintaining a large number of API versions in production. The design concerns include allowing the provider and the client to follow different life cycles; guaranteeing that API changes do not lead to undetected backward compatibility problems between clients and provider; ensuring the ability to roll back if a new API version is designed badly; minimizing changes to the client; and minimizing the maintenance effort for supporting clients relying on old API versions.

9. Document and Communicate API Contracts

This chapter does not correspond to any phase of the Align-Define-Design-Refine (ADDR) process. The challenges for Documenting APIs include interoperability; compliance; information hiding; economic aspects; performance and reliability; meter granularity; attractiveness from a consumer point of view. The patterns in the chapter include:

  • API Description to share knowledge between an API provider and its clients; related concerns include interoperability; consumability; information hiding; and extensibility and evolvability.
  • Pricing Plan to allow an API provider to meter API service consumption and charge for it. Its variants include Subscription-based Pricing, Usage-based Pricing, and Market-based Pricing.
  • Rate Limit to allow an API provider to prevent API clients from excessive API usage, with design considerations for economic aspects; performance; reliability; impact and severity of risks of API abuse; and client awareness.
  • Service Level Agreement to allow an API client to learn about the specific quality-of-service characteristics of an API and its endpoint operations. The design concerns include business agility and vitality; attractiveness from the consumer point of view; availability; performance and scalability; security and privacy; government regulations and legal obligations; and cost-efficiency and business risks from a provider point of view.
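The Rate Limit pattern does not prescribe an algorithm. One common choice is a token bucket, sketched below as an assumption rather than as the book's method; the `TokenBucket` class, its parameters and the injectable clock are hypothetical names used for illustration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests and refill at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so that tests can control time
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Return True and consume a token if the request is within the limit."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A provider would typically keep one bucket per API key and answer with HTTP status 429 when `allow()` returns False, which also supports the client-awareness concern listed above.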

10. Real-World Pattern Stories

This chapter examines API design and evolution in real-world business domains. The first case study discusses large-scale process integration at Terravis, a Swiss mortgage business platform that had to adopt a new law for the digitization of Swiss land registry businesses. Using the context dimensions defined by Philippe Kruchten, the Terravis platform was characterized in terms of system size, system criticality, system age, team distribution, rate of change, preexistence of a stable architecture, governance and business model. Terravis applied the role and status of API patterns as well as other patterns such as Solution-Internal API, Community API, API Description, Service Level Agreement, Semantic Versioning, Error Report, Pricing Plan, Rate Limit, Context Representation, State Creation Operation, State Transition Operation, Pagination, etc. The other case study showed how an internal system at the concrete column manufacturer SACAC had to integrate different existing software such as ERP and CAD systems. The chapter again used Philippe Kruchten’s project dimensions to describe the project. The key challenges included the correctness of all calculations, and the solution applied the book’s guidelines for roles and status of API. The API used Solution-Internal API, Frontend Integration, Backend Integration, State Creation Operations, State Transition Operations, Retrieval Operations, Computation Functions, etc.

11. Conclusion

The last chapter concludes with how the pattern language in the book helps integration architects, API developers and other roles involved with API design and evolution. The authors also suggest how APIs can be refactored to the patterns described in the book and use Microservice Domain Specific Language (MDSL) Tools for refactoring. The chapter also describes advancements in API protocols and standards such as HTTP/2, HTTP/3, and gRPC. OpenAPI Specification is the dominant API description language for HTTP-based APIs and AsyncAPI is gaining adoption for message-based APIs, which can also generate MDSL bindings.

May 21, 2023

Heuristics from “Code That Fits in Your Head”

Filed under: Methodologies,Technology,Uncategorized — admin @ 5:00 pm

Code maintenance and readability are important aspects of writing software systems, and “Code That Fits in Your Head” comes with a lot of practical advice for writing maintainable code. Following are a few important heuristics from the book:

1. Art or Science

In the first chapter, the author compares software development with other fields such as civil engineering, which deals with the design, construction, and maintenance of components. Software development has these phases among others, but its design and construction phases are intimately connected and require continuous iteration. Another metaphor discussed in the book treats software development as tending a living organism like a garden, which makes more sense: just as you prune and weed a garden, you have to refactor the code base and manage technical debt. A third metaphor is software craftsmanship, in which a software developer may progress from apprentice to journeyman to master. Though these perspectives help, software doesn’t quite fit the art metaphor, so the author suggests heuristics and guidelines for programming. The author introduces software engineering as a structured framework for development activities.

2. Checklists

A lot of professions such as airplane pilots and doctors follow a checklist for accomplishing a complex task. You may use a similar checklist for setting up a new code base, such as using Git, automating the build, enabling all compiler error messages, using linters, static analysis, etc. Though software engineering is more than following a checklist, these measures help make small improvements.

3. Tackling Complexity

This chapter defines sustainability and quality as the purpose of the book, as software may exist for decades and needs to sustain its organization. Software exists to provide value, though in some cases the value may not be immediate. This means that at times worse technology succeeds because it provides a faster path to value, and companies that are too focused on perfection can go out of business. Richard Gabriel coined the aphorism that worse is better. Sustainability chooses the middle ground by moving in the right direction with checklists and balanced software development practices. The author compares the computer with the human brain; though this comparison is not entirely fair, it highlights that human working memory is much smaller and can hold only four to seven pieces of information. This number is important when writing code because you spend more time reading code than writing it, and code with a large number of variables or conditional branches is harder to understand. The author refers to the work of Daniel Kahneman, who suggested a model of thought comprising two systems: System 1 and System 2. When a programmer is in the zone or in a state of flow, System 1 is always active and tries to understand the code. This means that modular code with fewer dependencies, variables and decisions is easier to understand and maintain. The human brain has limited working memory, and if the code handles more than seven things at once it becomes too complex to follow.

4. Vertical Slice and Walking Skeleton

This chapter recommends starting with and deploying a vertical slice of the application to get to working software. A vertical slice may consist of multiple layers, but it gives early feedback and is working software. A number of software development methodologies such as test-driven development, behavior-driven development, domain-driven design, type-driven development and property-driven development help build fine-grained implementations with tests. For example, if you don’t have tests then you can use a characterization test to describe the behavior of existing software. Tests generally follow the Arrange-Act-Assert phases, where the arrange phase prepares the test, the act phase invokes the operation under test and the assert phase verifies the actual outcome. Documentation can further explain why decisions in the code were made. The walking skeleton supports the vertical slice by using acceptance-test-driven development or outside-in test-driven development. For example, you can pick the simplest feature to implement that aims for the happy path to demonstrate that the system has a specific capability. The unit tests exercise this feature by using a Fake Object, data-transfer objects (DTOs) and interfaces (e.g. RepositoryInterface), and the dependencies are injected into the tests with this fake behavior. Real objects that are difficult to test can use the Humble Object pattern, which drains the object of branching logic. Making small improvements that are continuously delivered also keeps stakeholders updated so that they know when you will be done.
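The Arrange-Act-Assert structure with an injected Fake Object might look like the following minimal sketch. The book's own examples are not reproduced here; the `ReservationService` and `FakeReservationRepository` classes are hypothetical names chosen for illustration.

```python
import unittest


class FakeReservationRepository:
    """Fake Object: an in-memory stand-in for a database-backed repository."""

    def __init__(self):
        self.saved = []

    def create(self, reservation):
        self.saved.append(reservation)


class ReservationService:
    """Code under test; the repository dependency is injected via the constructor."""

    def __init__(self, repository):
        self._repository = repository

    def reserve(self, name, quantity):
        reservation = {"name": name, "quantity": quantity}
        self._repository.create(reservation)
        return reservation


class ReservationServiceTest(unittest.TestCase):
    def test_reserve_saves_reservation(self):
        # Arrange: build the service with the fake dependency.
        repository = FakeReservationRepository()
        service = ReservationService(repository)

        # Act: invoke the operation under test.
        result = service.reserve("Alice", 2)

        # Assert: verify the returned value and the recorded side effect.
        self.assertEqual({"name": "Alice", "quantity": 2}, result)
        self.assertEqual([result], repository.saved)
```

Saved to a file, the test can be run with `python -m unittest`. Keeping the three phases visually separated makes the happy path easy to read at a glance.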

5. Encapsulation

The encapsulation hides details by defining a contract that describes the valid interactions between objects and callers. The parameterized tests can capture the desired behavior and assert the invariants. The incremental changes can be added using test-driven development that uses red-green-refactor where you first write a failing test, then make the test pass and then refactor to improve the code. When using a contract to capture the interactions, you can use Postel’s law to build resilient systems, i.e.,

Be conservative in what you send, be liberal in what you accept.

The encapsulation guarantees that an object is always valid, e.g. you can use a constructor to validate all invariants including pre-conditions and post-conditions.
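As a minimal sketch of a constructor that guards its invariants, the following hypothetical `Reservation` class rejects invalid state at creation time, so every instance that exists is valid. The class, its fields and the specific rules are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """An immutable value whose invariants are checked once, on construction."""

    name: str
    quantity: int

    def __post_init__(self):
        # Preconditions on the inputs; after this point the object is always valid.
        if not self.name.strip():
            raise ValueError("name must not be blank")
        if self.quantity < 1:
            raise ValueError("quantity must be a positive integer")
```

Because the dataclass is frozen, callers cannot later put the object into an invalid state, so code that receives a `Reservation` does not need to re-check it.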

6. Triangulation

As the working memory of humans is very small, you have to decompose and compartmentalize the code structure into smaller chunks that can be easily understood. The author describes a devil’s advocate technique for validating behavior in unit tests, where you try to pass the tests with an incomplete implementation; if you succeed, it tells you that you need more test cases. This process can be treated as a kind of triangulation:

As the tests get more specific, the code gets more generic

7. Decomposition

Code rot occurs because no one pays attention to the overall quality when making small changes. You can use metrics to track gradual decay, e.g. cyclomatic complexity should be below seven. In order to improve code readability, the author suggests using the 80/24 rule, where you limit a method to no more than 24 lines and each line to no more than 80 characters. The author also suggests the hex flower rule:

No more than seven things should be going on in a single piece of code.

The author defines abstraction to capture essence of an object, i.e.,

Abstraction is the elimination of the irrelevant and the amplification of the essential.

Another factor that influences decomposition is cohesion, so that code that works on the same data structure or all of its attributes is defined in the same module or class. The author cautions against feature envy, where a method mostly uses the data of another object; to decrease the complexity you may need to move that code to another method or class. The author refers to the technique “parse, don’t validate”, where a parser takes less-structured input and produces more-structured output instead of merely checking the input. Next, the author describes fractal architecture, where a large system is decomposed into smaller chunks and each chunk hides details but can be zoomed in to see the structure. The fractal architecture helps organize the code so that lower-level details are captured in a single abstract chunk and can easily fit in your brain.
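The “parse, don’t validate” idea can be sketched as follows. The `Email` type and `parse_email` function are hypothetical, and the check is deliberately simplistic (it is not a full email grammar); the point is only that the function returns a more-structured value rather than a boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Email:
    """More-structured output: holding an Email means the checks already passed."""

    local: str
    domain: str


def parse_email(raw: str) -> Optional[Email]:
    """Turn less-structured text into an Email, or None if it cannot be parsed."""
    local, separator, domain = raw.strip().partition("@")
    if not separator or not local or "." not in domain:
        return None
    return Email(local, domain.lower())
```

Downstream code can then accept an `Email` parameter instead of a string, so the knowledge gained by the check is preserved in the type and does not have to be re-validated.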

8. API Design

This chapter describes principles of API design such as affordance, which uses encapsulation to preserve the invariants of objects involved in the API. Affordance allows a caller to invoke an API only when its preconditions are met. The author strengthens the affordance idea with a poka-yoke (mistake-proofing) analogy, which means a good interface design should be hard to misuse. Other techniques in the chapter include: write code for the readers; favor well-named code over comments; and X out names. X out names replaces a method name with xxx and checks whether a reader can guess what the method does from its signature. For example, you can recognize command-query separation this way: a method shaped like void xxx() can be considered a command with a side effect, since it returns nothing. In order to communicate the intent of an API, the author describes a hierarchy of communication: the API’s distinct types, helpful names, good comments, automated tests, helpful commit messages and good documentation.

9. Teamwork

In this chapter, the author provides tips for teamwork and communication with teammates, such as writing good Git commit messages using the 50/72 rule: first write a summary no wider than 50 characters, followed by a blank line and then detailed text no wider than 72 characters. Other techniques include Continuous Integration, which generally uses the trunk or main branch for all commits; developers make small changes, optionally behind feature flags, that are merged frequently. Developers are encouraged to make small commits, and code ownership is collective to improve the bus factor. The author refers to pair programming and mob programming for collaboration within the team. In order to facilitate collaboration, the author suggests reducing code review latency and rejecting any large change set. Reviewers should ask whether they can maintain the code, whether the intent of the code is clear and whether it could be simplified further. You can also pull down the code and test it locally to gain further insight.
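The 50/72 rule is simple enough to check mechanically. The following minimal sketch is a hypothetical helper (not a Git feature) that could be called from a commit-msg hook; the function name and the wording of the messages are assumptions.

```python
def check_commit_message(message, summary_width=50, body_width=72):
    """Return a list of 50/72 rule violations (an empty list means the message passes)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing summary line"]
    problems = []
    if len(lines[0]) > summary_width:
        problems.append(f"summary is {len(lines[0])} characters, limit is {summary_width}")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line must be blank")
    # Body lines start at line 3 of the message.
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_width:
            problems.append(f"line {number} is {len(line)} characters, limit is {body_width}")
    return problems
```

For example, `check_commit_message("Fix rounding bug\n\nExplain why the change was needed.")` returns an empty list.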

10. Augmenting Code

This chapter focuses on refactoring existing code for adding new functionality, enhancing existing behavior and bug fixes. The author suggests using feature-flags when deploying incomplete code. The author describes the strangler pattern for refactoring with incremental changes and suggests:

For any significant change, don’t make it in-place; make it side-by-side.

The strangler pattern can be applied at the method level, where you add a new method instead of changing an existing method in place and then remove the original method once its callers have migrated. Similarly, you can use a class-level strangler to introduce a new data structure and then remove the old references. The author suggests using semantic versioning so that you can signal whether changes are backward compatible or breaking.
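A method-level strangler could look like the following minimal sketch, in which the new method lives side by side with the old one while callers migrate. The `Calendar` class and its methods are hypothetical, and emitting a deprecation warning from the old method is an extra assumption rather than part of the pattern itself.

```python
import warnings


class Calendar:
    def __init__(self):
        self._bookings = []

    def book(self, date, name):
        """Original method: kept working and delegating while callers migrate."""
        warnings.warn(
            "book() is deprecated; use book_reservation()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.book_reservation({"date": date, "name": name, "quantity": 1})

    def book_reservation(self, reservation):
        """New method added side by side; returns the number of bookings so far."""
        self._bookings.append(reservation)
        return len(self._bookings)
```

Once no caller uses `book()` any more, it can be deleted in a separate small commit, so the code base stays in a working state at every step.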

11. Editing Unit Tests

Though an automated test suite lets you refactor production code safely, there is no safety net when making changes to the test code. You can add tests, add new assertions to existing tests or change unit tests to parameterized tests without affecting existing behavior. Some programmers follow a single assertion per test and consider multiple assertions an Assertion Roulette, but the author suggests strengthening the postconditions in unit tests with additional assertions. This is somewhat similar to the Liskov Substitution Principle, which says that subtypes may weaken preconditions and strengthen postconditions. The author suggests separating the refactoring of test and production code and using the IDE’s refactoring tools such as rename, extract or move method when possible.

12. Troubleshooting

When troubleshooting, you first have to understand what’s going on. This chapter suggests using the scientific method: make a hypothesis, perform the experiment and compare the outcome to the prediction. The author also suggests simplifying and removing code to check if a problem goes away. Other ways to simplify the code include composing an object graph in code instead of using complex dependency injection; using pure functions instead of using mock objects; merging often instead of using complex diff tools; learning SQL instead of using convoluted object-relational mapping, etc. Another powerful technique for troubleshooting is rubber ducking, where you try to explain the problem and gain a new insight in the process. In order to build quality in, you should aim to reduce defects to zero. Tests also help with troubleshooting: write an automated test that reproduces a defect before fixing it so that it serves as a regression test. The author cautions against slow tests and non-deterministic defects due to race conditions. Finally, the author suggests using bisection, which uses a binary search for finding the root cause: you check whether the defect reproduces in half of the code and continue halving until you find the problem. You can also use the bisection feature of Git to find the commit that introduced the defect.
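The bisection idea can be sketched as a binary search over an ordered history. The `first_bad` function and the `is_bad` callback are hypothetical names; the sketch assumes that the last commit is bad and that the defect, once introduced, stays present, which is also what `git bisect` relies on.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, the newest one is bad, and
    every commit after the first bad one is bad as well.
    """
    low, high = 0, len(commits) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid      # defect already present: look at earlier commits
        else:
            low = mid + 1   # still good: the defect was introduced later
    return commits[low]
```

With 100 commits this needs at most seven checks instead of up to a hundred, which matters when each check means building and running the software.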

13. Separation of Concerns

The author describes Kent Beck’s aphorism:

Things that change at the same rate belong together. Things that change at different rates belong apart.

The principle of separation of concerns can be used for decomposing working software into smaller parts, which can be decomposed further with nested composition. The author suggests using the command-query separation principle to keep side effects separated from query operations. Object-oriented composition tends to focus on composing side effects together, as in the Composite design pattern, which leads to complex code. The author describes Sequential Composition, which chains methods together, and Referential Transparency, which defines a deterministic method without side effects. Next, the author describes cross-cutting concerns such as logging, performance monitoring, auditing, metering, instrumentation, caching, fault tolerance, and security. Finally, the author describes the Decorator pattern to enhance functionality, e.g., you can add logging to existing code without changing it and log actions from impure functions.
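As a minimal sketch of the Decorator pattern for a cross-cutting concern, the following wrapper adds logging around a repository without modifying it. The `InMemoryRepository` and `LoggingRepository` classes and the logger name are hypothetical and only illustrate the shape of the pattern.

```python
import logging

logger = logging.getLogger("reservations")  # hypothetical logger name


class InMemoryRepository:
    """The existing component; it knows nothing about logging."""

    def __init__(self):
        self.items = []

    def create(self, reservation):
        self.items.append(reservation)


class LoggingRepository:
    """Decorator: exposes the same interface as the wrapped repository and adds logging."""

    def __init__(self, inner):
        self._inner = inner

    def create(self, reservation):
        logger.info("create(%r)", reservation)
        self._inner.create(reservation)
```

Callers receive `LoggingRepository(InMemoryRepository())` wherever a repository is expected, so the logging concern stays out of both the business logic and the storage code.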

14. Rhythm

This chapter describes daily and recurring practices that software development teams follow, such as daily stand-ups. The personal rhythm includes time-boxing or using the Pomodoro technique; taking breaks; using time deliberately; and touch typing. The team rhythm includes updating dependencies regularly and scheduling other recurring tasks such as checking certificates. The author describes Conway’s law:

Any organization that designs a system […] will inevitably produce a design whose structure is a copy of the organization’s communication structure.

You can use this law to organize the work that impacts the code base.

15. The Usual Suspects

This chapter covers the usual suspects of software engineering: architecture, algorithms, performance, security and other approaches. For example, performance is often a critical aspect but premature optimization can be wasteful. Instead, correctness and an effort to reduce complexity and defects should be the priority. In order to implement security, the author suggests STRIDE threat modeling, which covers Spoofing, Tampering, Repudiation, Information disclosure, Denial of service and Elevation of privilege. Other techniques include property-based testing and behavioral code analysis, which can be used to extract information from Git to identify patterns and problems.

16. Tour

In this last chapter, the author shows tips on understanding unfamiliar code by navigating to the main method and finding your way around. You can check whether the application uses broader patterns such as fractal architecture or Model-View-Controller, and understand how it handles authentication, authorization, routing, etc. The author provides a few suggestions about code structure and file organization, such as putting files in one directory, though that is contestable advice. The author refers to the hex flower and fractal architecture, where you can zoom in to see more details. When using a monolithic architecture, the entire production code compiles to a single executable file, which makes it harder to reuse parts of the code in new ways. Another drawback of monolithic architecture is that dependencies can be hard to manage and abstractions can be coupled with implementations, which violates the Dependency Inversion Principle. Further, in order to prevent cyclic dependencies, you will need to detect and prevent violations of the Acyclic Dependencies Principle. Finally, you can use the test suite to learn about the system.

Summary

The book is full of practical advice on writing maintainable code such as:

  • 50/72 Rule for Git commit messages
  • 80/24 Rule for writing small blocks of code
  • Tests based on Arrange-Act-Assert and Red-Green Refactor
  • Bisection for troubleshooting
  • Checklists for a new codebase
  • Command Query Separation
  • Cyclomatic Complexity and Counting the Variables
  • Decorators for cross-cutting concerns
  • Devil’s advocate for improving assertions
  • Feature flags
  • Functional core and imperative shell
  • Hierarchy of communication
  • Parse, don’t validate
  • Postel’s law to maintain invariants
  • Regularly update dependencies
  • Reproduce defects as Tests
  • Review code
  • Semantic Versioning
  • Separate refactoring of test and production code
  • Strangler pattern
  • Threat modeling using STRIDE
  • Transformation priority premise to make small changes and keep the code in working condition
  • X-driven development by using unit-tests, static code analysis, etc.
  • X out of Names

These heuristics help make software development sustainable so that the team can make incremental changes to the code while maintaining high quality.

May 9, 2023

Applying Domain-Driven Design and Clean/Onion/Hexagonal Architecture to MicroServices

Filed under: Computing,Design — admin @ 8:41 pm

1. Abstract

In software design, modular design facilitates building large systems by decomposing functionality into independent modules, where each module defines an interface for the behavior it implements. Modular design evolved into component-based design, which emphasized separation of concerns, and into distributed systems, which gave rise to web services, service-oriented architectures and event-driven architectures. This evolution led to the Microservices architecture, in which each service defines a bounded context for the business domain of its functionality. Each service is autonomous, agile, loosely coupled, resilient, reliable, independently deployable and scalable. This architecture encourages the use of the abstraction, single-responsibility, DRY, dependency-inversion, common-closure, common-reuse, release-equivalence and persistence-ignorance principles. Software development teams often use the Inverse Conway Maneuver to define clear ownership of each service, which improves developer velocity. As cloud computing gained wider adoption over the last 15 years, microservice architecture was extended with the architecture of cloud native applications (CNA), which offer the properties of Isolated State, Distribution, Elasticity, Automation, and Loose Coupling (IDEAL). The extended benefits of CNA and Microservice architecture include:

  • Fit for purpose
  • Rightsized and modular
  • Elasticity
  • Sovereign and tolerant
  • Resilient and protected
  • Controllable and adaptable
  • Workload-aware and resource-efficient
  • Agile and tool-supported
  • Observability including metric, tracing, and logging
  • Resilience
  • Availability
  • Independent, autonomous
  • Zero-Trust Security
  • Automation
  • Decentralized governance

2. Applying Domain-Driven Design

The following sections examine the primary concepts of domain-driven design:

2.1 Layers

Domain-driven design by Eric Evans simplifies the architecture of microservices and builds upon a layered architecture with the following layers:

  • presentation layer for user-interface.
  • application layer for use-cases that define the behavior.
  • domain layer for representing business rules and domain model.
  • infrastructure layer for data access and persistence for the domain objects based on Persistence Ignorance and Infrastructure Ignorance principles.
DDD Layers

2.2 Domain Model

The software development team and domain experts define the model using workshop-based Event Storming. The domain layer defines the following types of model objects:

  • Entity is a mutable object defined not by its attributes, but rather by a thread of continuity and identity.
  • Value object is an immutable object defined by its attributes instead of an identifier.
  • Domain event notifies listeners about a data update.

Domain-driven design recommends rich behavior in entity objects in addition to the data attributes and cautions against the AnemicDomainModel, in which objects only hold data attributes.
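The distinction between entities and value objects can be sketched in a few lines of Rust. This is only an illustration: the `Address` and `Patron` types, the `MAX_HOLDS` limit and the method names are hypothetical and are not taken from the library application described later.

```rust
/// Value object: immutable and compared by its attributes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Address {
    street: String,
    city: String,
}

impl Address {
    pub fn new(street: &str, city: &str) -> Self {
        Address { street: street.to_string(), city: city.to_string() }
    }
}

/// Entity: mutable and compared only by its identity.
#[derive(Debug, Clone)]
pub struct Patron {
    id: String,
    address: Address,
    num_holds: u32,
}

impl PartialEq for Patron {
    fn eq(&self, other: &Self) -> bool {
        self.id == other.id // attributes are deliberately ignored
    }
}

impl Patron {
    /// Hypothetical limit, used only for this illustration.
    pub const MAX_HOLDS: u32 = 3;

    pub fn new(id: &str, address: Address) -> Self {
        Patron { id: id.to_string(), address, num_holds: 0 }
    }

    pub fn address(&self) -> &Address {
        &self.address
    }

    /// Changing an attribute swaps the value object; the identity stays the same.
    pub fn move_to(&mut self, address: Address) {
        self.address = address;
    }

    /// Rich behavior: the entity guards its own invariant instead of
    /// exposing a bare counter (which would make it an anemic model).
    pub fn add_hold(&mut self) -> Result<u32, String> {
        if self.num_holds >= Self::MAX_HOLDS {
            return Err(format!("patron {} reached the hold limit", self.id));
        }
        self.num_holds += 1;
        Ok(self.num_holds)
    }
}
```

Two `Address` values with the same street and city are equal, whereas a `Patron` remains the same patron after moving to a new address.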

2.3 Aggregates

The entities and values can be clustered into aggregates that become a unit for retrieving and persisting data together. One entity in the aggregate becomes the root, which controls the lifecycle of, and access to, the objects inside its boundary.
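As a rough sketch of an aggregate root, the hypothetical `PatronAccount` below owns its `Hold` objects and is the only way to change them, so the invariants of the whole cluster are checked in one place. The type names and the rules are assumptions made for illustration.

```rust
/// Object inside the aggregate boundary; it is never handed out mutably.
#[derive(Debug, Clone, PartialEq)]
pub struct Hold {
    book_id: String,
}

/// Aggregate root: all changes to the holds go through this type.
#[derive(Debug)]
pub struct PatronAccount {
    patron_id: String,
    holds: Vec<Hold>,
    max_holds: usize,
}

impl PatronAccount {
    pub fn new(patron_id: &str, max_holds: usize) -> Self {
        PatronAccount { patron_id: patron_id.to_string(), holds: Vec::new(), max_holds }
    }

    /// Adds a hold while enforcing the invariants of the aggregate.
    pub fn place_hold(&mut self, book_id: &str) -> Result<(), String> {
        if self.holds.iter().any(|h| h.book_id == book_id) {
            return Err(format!("{} already holds {}", self.patron_id, book_id));
        }
        if self.holds.len() >= self.max_holds {
            return Err(format!("{} reached the hold limit", self.patron_id));
        }
        self.holds.push(Hold { book_id: book_id.to_string() });
        Ok(())
    }

    /// Removes a hold; returns true if a hold was actually removed.
    pub fn cancel_hold(&mut self, book_id: &str) -> bool {
        let before = self.holds.len();
        self.holds.retain(|h| h.book_id != book_id);
        self.holds.len() != before
    }

    /// Read-only view for callers outside the boundary.
    pub fn held_book_ids(&self) -> Vec<&str> {
        self.holds.iter().map(|h| h.book_id.as_str()).collect()
    }
}
```

Because the `holds` field is private, a repository would load and save a `PatronAccount` as one unit rather than persisting individual holds.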

2.4 Ubiquitous Language

The domain model employs Ubiquitous Language to bring domain experts and the software development team together and to eliminate inaccuracies, contradictions and confusion from the model.

2.5 Services

The services define high-level business logic that doesn’t fit within the domain objects. The services are generally designed as stateless with clearly defined interfaces.

2.6 Repositories

The repositories implement data persistence logic for retrieving and persisting aggregate and entity objects.

2.7 Factories

The factories help create complex objects, values and aggregates.

2.8 Bounded Context

The Bounded Context defines the boundaries of the domain model, which may consist of other sub-domains. This becomes the foundation for the boundary of microservices, where each service is cohesive and loosely coupled, which avoids chatty communication between microservices.

2.9 Context Map

The context map helps define the boundaries of bounded contexts explicitly to prevent a Big Ball of Mud architecture, using the following patterns:

  • Shared Kernel shares a common domain model between teams.
  • Partnership with mutual dependency between teams.
  • Customer-Supplier defines an interface that the supplier implements and the customer consumes.
  • Open Host Service / Published Language relies on well documented or readily available information for integration.
  • Conformist where the downstream team conforms to the model of the upstream without any translation of models.
  • Anticorruption Layer isolates and abstracts the downstream’s models from external system’s models by translation.
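Of the patterns above, the Anticorruption Layer is the easiest to show in code. In the sketch below, `VendorBookRecord` stands for a model owned by a hypothetical external catalog feed, and `VendorCatalogAcl::translate` converts it into the local `Book` model so that vendor field names and encodings never leak into the domain. All names and the flag encoding are assumptions.

```rust
/// Model owned by a hypothetical upstream vendor feed.
pub struct VendorBookRecord {
    pub isbn13: String,
    pub ttl: String,
    pub avail_flag: u8,
}

#[derive(Debug, PartialEq)]
pub enum BookStatus {
    Available,
    CheckedOut,
}

/// Model owned by the local bounded context.
#[derive(Debug, PartialEq)]
pub struct Book {
    pub isbn: String,
    pub title: String,
    pub status: BookStatus,
}

/// Anticorruption layer: the only place that knows both models.
pub struct VendorCatalogAcl;

impl VendorCatalogAcl {
    pub fn translate(record: &VendorBookRecord) -> Result<Book, String> {
        // Map the vendor's numeric flag to the local status enum.
        let status = match record.avail_flag {
            0 => BookStatus::CheckedOut,
            1 => BookStatus::Available,
            other => return Err(format!("unknown availability flag {}", other)),
        };
        let title = record.ttl.trim();
        if title.is_empty() {
            return Err("title is empty".to_string());
        }
        Ok(Book {
            // The local model stores ISBNs without dashes.
            isbn: record.isbn13.replace('-', ""),
            title: title.to_string(),
            status,
        })
    }
}
```

If the vendor changes its format, only the translation function needs to change; the rest of the domain keeps working with `Book`.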

3. Applying Hexagonal Architecture

The hexagonal architecture or ports & adapters architecture by Alistair Cockburn defines ports to receive incoming requests, which are then translated into internal messages or procedure calls by an adapter. Similarly, when the application needs to connect to external systems on the driven side, it sends a message through a port to an adapter. The ports and adapters architecture decouples the driver side and the driven side from the implementation technology.

Hexagonal Architecture

The port uses a protocol or an application program interface (API) for communicating with the application, which is then translated by the adapter for internal consumption. When the application needs to connect to external systems such as a database, it goes through a similar port or interface, and the adapter translates the call into the underlying database protocol. This architecture essentially uses the Dependency Inversion and Inversion of Control principles by depending only on the ports and decoupling external and internal components from the implementation technology. The application is the core of the system and defines use-cases that can be triggered by a CLI or UI. The application layer internally contains commands, handlers and services, which receive commands or queries from ports and communicate with external systems via ports and adapters. The application layer may trigger application events as an outcome of a use-case. The domain layer defines the domain model and domain-specific services, which are used by the application layer. The driver or primary side in hexagonal architecture allows users to initiate communication with the application core, whereas on the driven or secondary side the application core initiates communication with external dependent systems.
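A minimal sketch of ports and adapters in Rust might look as follows. The `BookStore` trait plays the role of a driven port, `InMemoryBookStore` is a driven adapter, `CatalogApp` is the application core and `handle_cli` is a driver adapter. All of these names, and the tiny command syntax, are hypothetical.

```rust
use std::collections::HashMap;

/// Driven (secondary) port: what the core needs from storage.
pub trait BookStore {
    fn save(&mut self, id: &str, title: &str);
    fn title_of(&self, id: &str) -> Option<String>;
}

/// Driven adapter: an in-memory implementation of the port.
#[derive(Default)]
pub struct InMemoryBookStore {
    books: HashMap<String, String>,
}

impl BookStore for InMemoryBookStore {
    fn save(&mut self, id: &str, title: &str) {
        self.books.insert(id.to_string(), title.to_string());
    }
    fn title_of(&self, id: &str) -> Option<String> {
        self.books.get(id).cloned()
    }
}

/// Application core: depends only on the port, not on any adapter.
pub struct CatalogApp<S: BookStore> {
    store: S,
}

impl<S: BookStore> CatalogApp<S> {
    pub fn new(store: S) -> Self {
        CatalogApp { store }
    }
    pub fn add_book(&mut self, id: &str, title: &str) -> Result<(), String> {
        if title.trim().is_empty() {
            return Err("title must not be empty".to_string());
        }
        self.store.save(id, title.trim());
        Ok(())
    }
    pub fn find_title(&self, id: &str) -> Option<String> {
        self.store.title_of(id)
    }
}

/// Driver (primary) adapter: translates a CLI-style line into a use-case call.
pub fn handle_cli<S: BookStore>(app: &mut CatalogApp<S>, line: &str) -> String {
    let mut parts = line.splitn(3, ' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("add"), Some(id), Some(title)) => match app.add_book(id, title) {
            Ok(()) => format!("added {}", id),
            Err(e) => format!("error: {}", e),
        },
        (Some("get"), Some(id), None) => {
            app.find_title(id).unwrap_or_else(|| "not found".to_string())
        }
        _ => "usage: add <id> <title> | get <id>".to_string(),
    }
}
```

Swapping the in-memory adapter for a database-backed one would not require any change to `CatalogApp`, which is the point of depending only on the port.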

4. Applying Onion Architecture

The Onion Architecture defines concentric circles for layers, where all code can depend on layers more central, but code cannot depend on layers further out from the core. The Domain Model represents the state and behavior combination that models truth for the organization. The number of layers in the application core will vary, but the domain model is always at the center. The interfaces for repositories that retrieve and persist data surround the domain model and are defined in the application core. The Onion Architecture uses the Dependency Inversion principle to inject implementations for the interfaces defined in the application core.

Onion Architecture

5. Applying Clean Architecture

The Clean Architecture defines concentric circles to represent different areas of software and uses the dependency rule so that dependencies point inwards.

Clean Architecture
  • The entities encapsulate business rules with behavior and data structure.
  • The use-cases encapsulate application-specific behavior and orchestrate the flow of data to and from the entities.
  • The interface adapters convert data between the form used by the use cases and entities and the form used by external systems; presenters, views and controllers belong to this layer.
  • The frameworks and drivers layer is composed of frameworks and tools such as the database and web framework.

The Clean Architecture uses the Dependency Inversion Principle to communicate across boundaries through interfaces, so that an inner circle does not depend on an outer circle.

6. Related Design Patterns

6.1 Model-driven architecture

Model-driven engineering and Model-driven architecture facilitate domain-driven design by generating source code, documentation, tests, etc. from the domain model.

6.2 Command Query Responsibility Segregation

Command Query Responsibility Segregation (CQRS), coined by Greg Young and based on Bertrand Meyer’s Command Query Separation (CQS) principle, generalizes message-driven and event-driven architecture by segregating the behavior for querying data from the behavior for updating it.
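The segregation can be sketched with two separate models: a write model that validates commands and emits events, and a read model that is updated from those events and only answers queries. The `Command`, `Event`, `WriteModel` and `ReadModel` types below are hypothetical and kept in memory for brevity.

```rust
use std::collections::{HashMap, HashSet};

/// Commands express an intent to change state.
pub enum Command {
    AddBook { id: String, title: String },
    RemoveBook { id: String },
}

/// Events describe what actually changed.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    BookAdded { id: String, title: String },
    BookRemoved { id: String },
}

/// Write side: validates commands and keeps only what it needs to do so.
#[derive(Default)]
pub struct WriteModel {
    known_ids: HashSet<String>,
}

impl WriteModel {
    pub fn handle(&mut self, cmd: Command) -> Result<Event, String> {
        match cmd {
            Command::AddBook { id, title } => {
                if title.trim().is_empty() {
                    return Err("title must not be empty".to_string());
                }
                if !self.known_ids.insert(id.clone()) {
                    return Err(format!("book {} already exists", id));
                }
                Ok(Event::BookAdded { id, title })
            }
            Command::RemoveBook { id } => {
                if !self.known_ids.remove(&id) {
                    return Err(format!("book {} does not exist", id));
                }
                Ok(Event::BookRemoved { id })
            }
        }
    }
}

/// Read side: a projection shaped for queries, updated only from events.
#[derive(Default)]
pub struct ReadModel {
    titles: HashMap<String, String>,
}

impl ReadModel {
    pub fn apply(&mut self, event: &Event) {
        match event {
            Event::BookAdded { id, title } => {
                self.titles.insert(id.clone(), title.clone());
            }
            Event::BookRemoved { id } => {
                self.titles.remove(id);
            }
        }
    }

    pub fn title_of(&self, id: &str) -> Option<&str> {
        self.titles.get(id).map(|t| t.as_str())
    }
}
```

In a distributed setup the events would travel over a message bus, so the read model may lag slightly behind the write model.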

6.3 Event sourcing

Event sourcing tracks internal state by reading and committing events to an event store.
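As a small illustration, the current state of a book can be rebuilt by replaying its events in order. The `BookEvent`, `BookState` and `EventStore` types below are hypothetical, and the store is just an in-memory vector rather than a durable log.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum BookEvent {
    Added,
    CheckedOut { patron_id: String },
    Returned,
}

#[derive(Debug, Clone, PartialEq)]
pub enum BookState {
    Unknown,
    Available,
    CheckedOutBy(String),
}

/// Pure transition function: previous state + event = next state.
pub fn apply(state: BookState, event: &BookEvent) -> BookState {
    match (state, event) {
        (BookState::Unknown, BookEvent::Added) => BookState::Available,
        (BookState::Available, BookEvent::CheckedOut { patron_id }) => {
            BookState::CheckedOutBy(patron_id.clone())
        }
        (BookState::CheckedOutBy(_), BookEvent::Returned) => BookState::Available,
        // Events that do not apply to the current state are ignored here.
        (state, _) => state,
    }
}

/// Append-only event store for a single book (in memory for this sketch).
#[derive(Default)]
pub struct EventStore {
    events: Vec<BookEvent>,
}

impl EventStore {
    pub fn append(&mut self, event: BookEvent) {
        self.events.push(event);
    }

    /// Rebuilds the current state by folding over all committed events.
    pub fn replay(&self) -> BookState {
        self.events
            .iter()
            .fold(BookState::Unknown, |state, event| apply(state, event))
    }
}
```

The state itself is never stored; only the events are, which also gives an audit trail of every change.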

7. Putting it all together

The following sections describe how a library management system can be built with domain-driven design and hexagonal/clean architecture:

7.1 User Stories

The following primary user stories will be implemented for the library management system:

  • As a library administrator, I want to add a book to the collection so that patrons of the library may check it out.
  • As a library administrator, I want to remove a book from the collection so that it’s no longer available to borrow.
  • As a library administrator or a patron, I want to search books based on different criteria such as title, author, publisher, dates, etc. so that I may see details or check out the book later.
  • As a patron, I want to check out a book so that I can read it and return it later.
  • As a patron, I want to return a checked-out book when I am done reading it.
  • As a patron, I want to hold a book that is not currently available so that I can check it out later.
  • As a patron, I want to cancel a hold that I previously made when I am no longer interested in the book.
  • As a patron, I want to check out a book that I previously held so that I can read it.

7.1.1 Constraints and Validation Policies

In addition, the library may impose certain policies and restrictions on books and checkout/hold actions, e.g., a restricted book can be held only by a researcher patron, or the number of books that can be held or checked out at a time is limited.
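Such policies can be expressed as a small pure function that is easy to test in isolation. The sketch below uses hypothetical `Patron` and `Book` types and an assumed limit of five holds; the actual limits and patron categories would come from the library’s rules.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PatronKind {
    Regular,
    Researcher,
}

pub struct Patron {
    pub kind: PatronKind,
    pub num_holds: u32,
}

pub struct Book {
    pub restricted: bool,
}

#[derive(Debug, PartialEq)]
pub enum PolicyViolation {
    RestrictedBook,
    HoldLimitReached { limit: u32 },
}

/// Assumed limit for this sketch only.
pub const MAX_HOLDS: u32 = 5;

/// Checks the hold policies in a fixed order and reports the first violation.
pub fn check_hold_policy(patron: &Patron, book: &Book) -> Result<(), PolicyViolation> {
    // Restricted books may only be held by researcher patrons.
    if book.restricted && patron.kind != PatronKind::Researcher {
        return Err(PolicyViolation::RestrictedBook);
    }
    // A patron may not exceed the maximum number of simultaneous holds.
    if patron.num_holds >= MAX_HOLDS {
        return Err(PolicyViolation::HoldLimitReached { limit: MAX_HOLDS });
    }
    Ok(())
}
```

Returning a typed violation instead of a string lets the caller decide how to report it, e.g. as a validation error in the API response.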

7.2 Layered Architecture

The library management application is divided into multiple domains such as patrons, catalog, checkout and hold. Each domain is then divided into the following layers:

7.2.1 Application-Service and Controller Layer

This layer defines remote APIs for communicating with the microservices defined in the library management application.

7.2.2 Command and Query Layer (CQRS)

This layer defines operations for commands and queries that are invoked by the controller layer and depend on the underlying domain service layer. The command layer also defines the scope of a transaction so that all changes are persisted atomically.

Note: This layer may use SAGA pattern to handle distributed transactions when you need to invoke multiple services or databases for performing an operation.
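A saga replaces one distributed transaction with a sequence of local steps, each paired with a compensating action that undoes it. The following is a minimal, synchronous sketch of that idea; the step functions (`reserve_book`, `record_checkout`, `notify_patron`, etc.) are hypothetical stand-ins for calls to other services, and the log vector only records what happened.

```rust
/// One step of a saga: a local action plus the compensation that undoes it.
pub struct SagaStep {
    pub name: &'static str,
    pub action: fn(&mut Vec<String>) -> Result<(), String>,
    pub compensate: fn(&mut Vec<String>),
}

/// Runs the steps in order; on failure, compensates completed steps in reverse.
pub fn run_saga(steps: &[SagaStep], log: &mut Vec<String>) -> Result<(), String> {
    let mut completed: Vec<&SagaStep> = Vec::new();
    for step in steps {
        match (step.action)(log) {
            Ok(()) => completed.push(step),
            Err(reason) => {
                for done in completed.iter().rev() {
                    (done.compensate)(log);
                }
                return Err(format!("{} failed: {}", step.name, reason));
            }
        }
    }
    Ok(())
}

// Hypothetical local transactions standing in for remote service calls.
fn reserve_book(log: &mut Vec<String>) -> Result<(), String> {
    log.push("book reserved".to_string());
    Ok(())
}
fn release_book(log: &mut Vec<String>) {
    log.push("book released".to_string());
}
fn record_checkout(log: &mut Vec<String>) -> Result<(), String> {
    log.push("checkout recorded".to_string());
    Ok(())
}
fn delete_checkout(log: &mut Vec<String>) {
    log.push("checkout deleted".to_string());
}
fn notify_patron(_log: &mut Vec<String>) -> Result<(), String> {
    // Simulated failure so that the compensations are exercised.
    Err("notification service unavailable".to_string())
}
fn nothing_to_undo(_log: &mut Vec<String>) {}

/// Example saga whose last step fails.
pub fn checkout_saga() -> Vec<SagaStep> {
    vec![
        SagaStep { name: "reserve_book", action: reserve_book, compensate: release_book },
        SagaStep { name: "record_checkout", action: record_checkout, compensate: delete_checkout },
        SagaStep { name: "notify_patron", action: notify_patron, compensate: nothing_to_undo },
    ]
}
```

Calling `run_saga(&checkout_saga(), &mut log)` returns an error for the failed step and leaves a log in which the checkout is deleted before the book is released, i.e. compensation runs in reverse order. A production saga would also need to persist its progress and retry compensations, which this sketch omits.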

7.2.3 Domain Service Layer

This layer defines additional business behavior that is built upon the domain model layer and is used by the command and query layer.

7.2.4 Domain Model Layer

This layer defines the data and behavior of the domain and defines entities, value objects, aggregates, factories, and interfaces to model the domain.

7.2.5 Infrastructure Layer

This layer defines repositories and gateways to persist domain entities and connect to external services such as messaging and logging.

7.3 Domain Model

The following domain model was defined as a result of the above user stories and an event-storming exercise:

Class Diagram

7.3.1 Party Pattern

The above design uses the party pattern to model patrons, library administrators and library branches because they share many common attributes that describe people and organizations. The base Party class uses the Address class to store a physical address, so the Party class acts as an Aggregate for all related data about people and organizations.

7.3.2 Book

The book class models a library book that can be added to the library collection, queried by the administrators or patrons and then checked out or held by the patrons.

7.3.3 Checkout

The Checkout class abstracts the data when checking out a book, which can be returned later.

7.3.4 Hold

The Hold class abstracts the data when holding a book that is not currently available so that it can be checked out later.

Note: Domain-driven design considers an anemic domain model without business behavior an antipattern, so the above domain model defines invariant business rules and behavior along with the data attributes.

7.4 Components and Modules

The library management application was divided into the following modules:

Component Diagram

The core, utils and gateway modules are shared by the other modules. The parties and books modules are low-level modules, whereas the catalog, patrons, checkout and hold modules are high-level modules that also act as bounded contexts: catalog manages the book catalog, patrons manages library members, and checkout/hold define the behavior for library operations.

7.4.1 core module

The core module abstracts common domain model, domain events and interfaces for command pattern, repository and controllers.

7.4.2 parties module

The parties module defines domain model for the party class and data access methods for persisting and querying parties (people and organizations).

7.4.3 books module

The books module defines domain model and data transfer model for books as well as repository for persisting and querying books using AWS DynamoDB.

7.4.4 patrons module

The patrons module builds upon the parties module and defines a service sub-module with business logic for querying and persisting patrons. The patrons module also includes controller, command classes and a binary/main module to define microservices based on AWS Lambda.

7.4.5 checkout module

The checkout module implements services for checking out and returning books, which are then made available as microservices using controller, command and binary sub-modules.

7.4.6 hold module

The hold module implements services for holding a book or canceling/returning it later, which are then made available as microservices using controller, command and binary sub-modules.

7.4.7 gateway module

The gateway module defines interfaces to connect to external services such as AWS CloudWatch for managing metrics and AWS SNS for publishing events from the domain and user-action changes.

7.5 Domain-Driven Design Patterns

The library management system applies the following design patterns from domain-driven design and hexagonal/clean architecture:

7.5.1 Ubiquitous Language

The domain model uses the same terminology used by the stakeholders and experts from underlying problem space such as library patrons, books, checkout, hold, etc. so that software development team can model the business problem as close as possible.

7.5.2 Domain Events

The domain events capture data changes in the domain model, specifically in aggregate objects. This decouples domains in different bounded contexts, as other domains can listen to the domain events asynchronously and make a local change. The following is an example of a domain event in the library management system:

#[derive(Debug, PartialEq, Serialize, Deserialize)]
pub(crate) struct DomainEvent {
    pub event_id: String,
    pub name: String,
    pub group: String,
    pub key: String,
    pub kind: DomainEventType,
    pub metadata: HashMap<String, String>,
    pub json_data: String,
    #[serde(with = "serializer")]
    pub created_at: NaiveDateTime,
}

7.5.3 Aggregates and Event Stream

The modules and domain model communicate with each other using event streams, which are built upon the AWS SNS service underneath, e.g.,

impl EventPublisher for SESPublisher {
    async fn publish(&self, event: &DomainEvent) -> Result<(), LibraryError> {
        let topic = self.topics.get(event.name.as_str());
        if let Some(arn) = topic {
            let json = serde_json::to_string(event)?;
            self.client.publish().topic_arn(arn).message(json).send().await?;
            Ok(())
        } else {
            Err(LibraryError::runtime(format!("topic is not found {}", event.name).as_str(), None))
        }
    }
}

The following example shows publishing an event as a result of checking out a book:

    async fn checkout(&self, patron_id: &str, book_id: &str) -> LibraryResult<CheckoutDto> {
        let patron = self.patron_service.find_patron_by_id(patron_id).await?;
        let book = self.catalog_service.find_book_by_id(book_id).await?;
        if book.status() != BookStatus::Available {
            return Err(LibraryError::validation(format!("book is not available {}",
                                                        book.id()).as_str(), Some("400".to_string())));
        }
        if book.is_restricted() && patron.is_regular() {
            return Err(LibraryError::validation(format!("patron {} cannot checkout restricted books {}",
                                                        patron.id(), book.id()).as_str(), Some("400".to_string())));
        }
        let checkout = CheckoutDto::from_patron_book(self.branch_id.as_str(), &patron, &book);
        self.checkout_repository.create(&CheckoutEntity::from(&checkout)).await?;
        let _ = self.events_publisher.publish(&DomainEvent::added(
            "book_checkout", "checkout", checkout.checkout_id.as_str(), &HashMap::new(), &checkout.clone())?).await?;
        Ok(checkout)
    }

The events_publisher publishes the domain events upon checking out a book. Similar events are published for other user-actions or domain changes.

7.5.4 Domain Services

The domain services define high-level business logic and each bounded context defines a layer for domain services such as:

pub(crate) trait CatalogService: Sync + Send {
    async fn add_book(&self, book: &BookDto) -> LibraryResult<BookDto>;
    async fn remove_book(&self, id: &str) -> LibraryResult<()>;
    async fn update_book(&self, book: &BookDto) -> LibraryResult<BookDto>;
    async fn find_book_by_id(&self, id: &str) -> LibraryResult<BookDto>;
    async fn find_book_by_isbn(&self, isbn: &str) -> LibraryResult<Vec<BookDto>>;
}
pub(crate) trait CheckoutService: Sync + Send {
    async fn checkout(&self, patron_id: &str, book_id: &str) -> LibraryResult<CheckoutDto>;
    async fn returned(&self, patron_id: &str, book_id: &str) -> LibraryResult<CheckoutDto>;
    async fn query_overdue(&self, predicate: &HashMap<String, String>,
                           page: Option<&str>, page_size: usize) -> LibraryResult<PaginatedResult<CheckoutDto>>;
}
pub(crate) trait HoldService: Sync + Send {
    async fn hold(&self, patron_id: &str, book_id: &str) -> LibraryResult<HoldDto>;
    async fn cancel(&self, patron_id: &str, book_id: &str) -> LibraryResult<HoldDto>;
    async fn checkout(&self, patron_id: &str, book_id: &str) -> LibraryResult<HoldDto>;
    async fn query_expired(&self, predicate: &HashMap<String, String>,
                           page: Option<&str>, page_size: usize) -> LibraryResult<PaginatedResult<HoldDto>>;
}

7.5.5 Repositories

The library management application uses the repository pattern to persist or query data, which can be backed by any supported data store (such as DynamoDB). Also, it uses polymorphic associations for managing inheritance, e.g., the parties DynamoDB table can store patrons, administrators and library branches. The repository implementation can be pointed to a local DynamoDB or the AWS managed DynamoDB service, e.g.,

#[async_trait]
impl Repository<BookEntity> for DDBBookRepository {
    async fn create(&self, entity: &BookEntity) -> LibraryResult<usize> {
        let table_name: &str = self.table_name.as_ref();
        let val = serde_json::to_value(entity)?;
        self.client
            .put_item()
            .table_name(table_name)
            .condition_expression("attribute_not_exists(book_id)")
            .set_item(Some(parse_item(val)?))
            .send()
            .await.map(|_| 1).map_err(LibraryError::from)
    }

    async fn update(&self, entity: &BookEntity) -> LibraryResult<usize> {
        let now = Utc::now().naive_utc();
        let table_name: &str = self.table_name.as_ref();

        self.client
            .update_item()
            .table_name(table_name)
            .key("book_id", AttributeValue::S(entity.book_id.clone()))
            .update_expression("SET version = :version, title = :title, book_status = :book_status, dewey_decimal_id = :dewey_decimal_id, restricted = :restricted, updated_at = :updated_at")
            .expression_attribute_values(":old_version", AttributeValue::N(entity.version.to_string()))
            .expression_attribute_values(":version", AttributeValue::N((entity.version + 1).to_string()))
            .expression_attribute_values(":title", AttributeValue::S(entity.title.to_string()))
            .expression_attribute_values(":book_status", AttributeValue::S(entity.book_status.to_string()))
            .expression_attribute_values(":restricted", AttributeValue::Bool(entity.restricted))
            .expression_attribute_values(":dewey_decimal_id", AttributeValue::S(entity.dewey_decimal_id.to_string()))
            .expression_attribute_values(":updated_at", string_date(now))
            .condition_expression("attribute_exists(version) AND version = :old_version")
            .send()
            .await.map(|_| 1).map_err(LibraryError::from)
    }
  ...
}
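The condition expressions in the repository above implement optimistic locking: a create succeeds only if the key does not exist yet, and an update succeeds only if the stored version matches the version the caller read. The same semantics can be sketched without DynamoDB using an in-memory map. The `InMemoryBookRepository` and `RepoError` types are hypothetical, and unlike the DynamoDB condition this sketch reports a missing item separately from a version conflict.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct BookEntity {
    pub book_id: String,
    pub version: u64,
    pub title: String,
}

#[derive(Debug, PartialEq)]
pub enum RepoError {
    AlreadyExists,
    NotFound,
    VersionConflict { expected: u64, actual: u64 },
}

#[derive(Default)]
pub struct InMemoryBookRepository {
    items: HashMap<String, BookEntity>,
}

impl InMemoryBookRepository {
    /// Mirrors the condition `attribute_not_exists(book_id)`.
    pub fn create(&mut self, entity: &BookEntity) -> Result<usize, RepoError> {
        if self.items.contains_key(&entity.book_id) {
            return Err(RepoError::AlreadyExists);
        }
        self.items.insert(entity.book_id.clone(), entity.clone());
        Ok(1)
    }

    /// Mirrors the condition `version = :old_version` and bumps the version.
    pub fn update(&mut self, entity: &BookEntity) -> Result<usize, RepoError> {
        let stored = self
            .items
            .get_mut(&entity.book_id)
            .ok_or(RepoError::NotFound)?;
        if stored.version != entity.version {
            return Err(RepoError::VersionConflict {
                expected: entity.version,
                actual: stored.version,
            });
        }
        stored.title = entity.title.clone();
        stored.version = entity.version + 1;
        Ok(1)
    }

    pub fn get(&self, book_id: &str) -> Option<&BookEntity> {
        self.items.get(book_id)
    }
}
```

A second update made with a stale copy of the entity fails with `VersionConflict`, so the caller has to reload the entity and retry instead of silently overwriting a concurrent change.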

7.5.6 Factories

The library management application uses factories to create instances of repositories, event publishers and services based on different implementations, e.g.,

pub(crate) async fn create_checkout_repository(store: RepositoryStore) -> Box<dyn CheckoutRepository> {
    match store {
        RepositoryStore::DynamoDB => {
            let client = build_db_client(store).await;
            Box::new(DDBCheckoutRepository::new(client, "checkout", "checkout_ndx"))
        }
        RepositoryStore::LocalDynamoDB => {
            let client = build_db_client(store).await;
            let _ = create_table(&client, "checkout", "checkout_id", "checkout_status", "patron_id").await;
            Box::new(DDBCheckoutRepository::new(client, "checkout", "checkout_ndx"))
        }
    }
}

pub(crate) async fn create_checkout_service(
  config: &Configuration, store: RepositoryStore) -> Box<dyn CheckoutService> {
    let checkout_repo = factory::create_checkout_repository(store).await;
    let catalog_svc = create_catalog_service(config, store).await;
    let patron_svc = create_patron_service(config, store).await;
    let publisher = create_publisher(store.gateway_publisher()).await;
    Box::new(CheckoutServiceImpl::new(config, checkout_repo,
                                      patron_svc, catalog_svc, publisher))
}

7.5.7 Data Transfer Objects

The library management application uses immutable data transfer objects when invoking a business service, a command or a method on a controller, so that these objects are free of side effects and can be safely shared with other modules in a concurrent environment.

7.5.8 CQRS Pattern

The library management application uses the command-query separation principle to bridge application services with the domain services. Each command handles a unique behavior implemented by the high-level modules for managing patrons, book-catalog, and checkout/hold behavior, e.g.,

#[derive(Debug, Deserialize)]
pub(crate) struct CheckoutBookCommandRequest {
    patron_id: String,
    book_id: String,
}

#[derive(Debug, Serialize)]
pub(crate) struct CheckoutBookCommandResponse {
    checkout: CheckoutDto,
}

#[async_trait]
impl Command<CheckoutBookCommandRequest, CheckoutBookCommandResponse> for CheckoutBookCommand {
    async fn execute(&self, req: CheckoutBookCommandRequest) -> Result<CheckoutBookCommandResponse, CommandError> {
        self.checkout_service.checkout(req.patron_id.as_str(), req.book_id.as_str())
            .await.map_err(CommandError::from).map(CheckoutBookCommandResponse::new)
    }
}

7.5.9 Application Services/Controller

The application services/controller layer defines remote APIs, which are built on top of AWS Lambda APIs, e.g.,

pub(crate) async fn checkout_book(
    State(state): State<AppState>,
    json: Json<Value>) -> Result<Json<CheckoutBookCommandResponse>, ServerError> {
    let req: CheckoutBookCommandRequest = serde_json::from_value(json.0).map_err(json_to_server_error)?;
    let svc = build_service(state).await;
    let res = CheckoutBookCommand::new(svc).execute(req).await?;
    Ok(Json(res))
}

7.5.10 Bounded Context

As the Bounded Context defines the boundaries of the domain model, the library system defines bounded contexts for managing library members (patrons), managing books (catalog), and checkout and hold operations. In addition, each domain is also decomposed into subdomains that reflect business processes within the problem space.

7.5.11 Monads and Error Handling

The library management application uses the Result monad for returning results from a service, command or controller so that the caller can handle errors properly. In addition, it uses the Option monad for defining optional data properties so that the compiler can enforce type checking.
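The following sketch shows how `Result` and `Option` compose in practice. The `LibraryError` enum here is a simplified, hypothetical stand-in for the application’s error type, and `Patron` is reduced to two fields.

```rust
#[derive(Debug, PartialEq)]
pub enum LibraryError {
    NotFound(String),
    Validation(String),
}

pub struct Patron {
    pub id: String,
    /// Optional property: the compiler forces callers to handle `None`.
    pub email: Option<String>,
}

/// Returns the patron or a typed error instead of panicking or returning null.
pub fn find_patron<'a>(patrons: &'a [Patron], id: &str) -> Result<&'a Patron, LibraryError> {
    patrons
        .iter()
        .find(|p| p.id == id)
        .ok_or_else(|| LibraryError::NotFound(id.to_string()))
}

/// Chains both monads: `?` propagates the lookup error, and
/// `map`/`ok_or_else` turn the optional email into a `Result`.
pub fn contact_email(patrons: &[Patron], id: &str) -> Result<String, LibraryError> {
    let patron = find_patron(patrons, id)?;
    patron
        .email
        .as_deref()
        .map(|email| email.to_lowercase())
        .ok_or_else(|| LibraryError::Validation(format!("patron {} has no email", id)))
}
```

The caller has to match on the returned `Result`, so neither a missing patron nor a missing email can be ignored by accident.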

7.5.12 Main

The high-level modules define a main module, which instantiates the API controllers for remote invocation. AWS Lambda requires Rust-based Lambda functions to be deployed as a binary executable, which is spawned by the Lambda runtime, e.g.,

#[tokio::main]
async fn main() -> Result<(), Error> {
    setup_tracing();

    let state = if DEV_MODE {
        std::env::set_var("AWS_LAMBDA_FUNCTION_NAME", "_");
        std::env::set_var("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", "4096"); // value is in MB (4 GB)
        std::env::set_var("AWS_LAMBDA_FUNCTION_VERSION", "1");
        std::env::set_var("AWS_LAMBDA_RUNTIME_API", "http://[::]:9000/.rt");
        AppState::new("dev", RepositoryStore::LocalDynamoDB)
    } else {
        AppState::new("prod", RepositoryStore::DynamoDB)
    };

    let app = Router::new()
        .route("/catalog", post(controller::add_book))
        .route("/catalog/:id",
               get(controller::find_book_by_id).delete(controller::remove_book))
        .with_state(state);

    run(app).await
}

7.6 Code structure

The following tree structure shows the module and code structure for the library management application:

|--- books
|   |--- domain
|   |   |--- model.rs
|   |--- domain.rs
|   |--- dto.rs
|   |--- factory.rs
|   |--- repository
|   |   |--- ddb_book_repository.rs
|   |--- repository.rs
|--- books.rs
|--- catalog
|   |--- bin
|   |   |--- main.rs
|   |--- command
|   |   |--- add_book_cmd.rs
|   |   |--- get_book_cmd.rs
|   |   |--- remove_book_cmd.rs
|   |   |--- update_book_cmd.rs
|   |--- command.rs
|   |--- controller.rs
|   |--- domain
|   |   |--- service.rs
|   |--- domain.rs
|   |--- dto.rs
|   |--- factory.rs
|--- catalog.rs
|--- checkout
|   |--- bin
|   |   |--- main.rs
|   |--- command
|   |   |--- checkout_book_cmd.rs
|   |   |--- return_book_cmd.rs
|   |--- command.rs
|   |--- controller.rs
|   |--- domain
|   |   |--- model.rs
|   |   |--- service.rs
|   |--- domain.rs
|   |--- dto.rs
|   |--- factory.rs
|   |--- repository
|   |   |--- ddb_checkout_repository.rs
|   |--- repository.rs
|--- checkout.rs
|--- core
|   |--- command.rs
|   |--- controller.rs
|   |--- domain.rs
|   |--- events.rs
|   |--- library.rs
|   |--- repository.rs
|--- core.rs
|--- gateway
|   |--- ddb
|   |   |--- publisher.rs
|   |--- ddb.rs
|   |--- events.rs
|   |--- factory.rs
|   |--- logs.rs
|   |--- sns
|   |   |--- publisher.rs
|   |--- sns.rs
|--- gateway.rs
|--- hold
|   |--- bin
|   |   |--- main.rs
|   |--- command
|   |   |--- cancel_hold_book_cmd.rs
|   |   |--- checkout_hold_book_cmd.rs
|   |   |--- hold_book_cmd.rs
|   |--- command.rs
|   |--- controller.rs
|   |--- domain
|   |   |--- model.rs
|   |   |--- service.rs
|   |--- domain.rs
|   |--- dto.rs
|   |--- events.rs
|   |--- factory.rs
|   |--- repository
|   |   |--- ddb_hold_repository.rs
|   |--- repository.rs
|--- hold.rs
|--- lib.rs
|--- library.rs
|--- main.rs
|--- parties
|   |--- domain
|   |   |--- model.rs
|   |--- domain.rs
|   |--- events.rs
|   |--- factory.rs
|   |--- repository
|   |   |--- ddb_party_repository.rs
|   |--- repository.rs
|--- parties.rs
|--- patrons
|   |--- bin
|   |   |--- main.rs
|   |--- command
|   |   |--- add_patron_cmd.rs
|   |   |--- get_patron_cmd.rs
|   |   |--- remove_patron_cmd.rs
|   |   |--- update_patron_cmd.rs
|   |--- command.rs
|   |--- controller.rs
|   |--- domain
|   |   |--- service.rs
|   |--- domain.rs
|   |--- dto.rs
|   |--- factory.rs
|--- patrons.rs
|--- utils
|   |--- date.rs
|   |--- ddb.rs
|--- utils.rs

7.7 Building and Testing

7.7.1 Start locally

docker-compose -f ddb-docker-compose.yaml up

7.7.2 Start Lambda locally

cargo lambda watch

7.7.3 Build

cargo build --release
cargo lambda build --release

7.7.4 Testing catalog Lambdas

Add a book:

curl -H "Content-Type: application/json" http://localhost:9000/catalog -d '{"isbn": "123", "title": "my book"}'

which would return something like:

{
  "book": {
    "dewey_decimal_id": "749",
    "book_id": "a2b25506-2948-47bb-9c4a-cf9ad480c10b",
    "version": 0,
    "author_id": "623a01ca-8ba9-41cd-b8b6-85a5711f8453",
    "publisher_id": "f0cff296-9f6e-4b25-95e1-a783661bf91f",
    "language": "en",
    "isbn": "123",
    "title": "my book",
    "book_status": "Available",
    "restricted": false,
    "published_at": "2023-05-09T20:55:56.073008+00:00",
    "created_at": "2023-05-09T20:55:56.073027+00:00",
    "updated_at": "2023-05-09T20:55:56.073027+00:00"
  }
}

Finding the book by id:

curl -H "Content-Type: application/json" http://localhost:9000/catalog/f58ef32a-6f24-4314-8782-c7ebcad0ab59

that returns

{
  "book": {
    "dewey_decimal_id": "220",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "version": 0,
    "author_id": "4c24b180-a146-410a-b68c-9d83c57adebc",
    "publisher_id": "88b47029-cad1-443f-8d67-aaf13863e924",
    "language": "en",
    "isbn": "123",
    "title": "my book",
    "book_status": "Available",
    "restricted": false,
    "published_at": "2023-05-09T22:18:25.436359+00:00",
    "created_at": "2023-05-09T22:18:25.436366+00:00",
    "updated_at": "2023-05-09T22:18:25.436371+00:00"
  }
}

7.7.5 Testing patrons Lambdas

Add a patron:

curl -v  -H "Content-Type: application/json" http://localhost:9000/patrons -d '{"email": "test-email@xyz.com"}'

that returns:

{
  "patron": {
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "version": 0,
    "first_name": "",
    "last_name": "",
    "email": "test-email@xyz.com",
    "under_13": false,
    "group_roles": [],
    "num_holds": 0,
    "num_overdue": 0,
    "home_phone": null,
    "cell_phone": null,
    "work_phone": null,
    "street_address": null,
    "city": null,
    "zip_code": null,
    "state": null,
    "country": null,
    "created_at": "2023-05-09T22:20:28.898831",
    "updated_at": "2023-05-09T22:20:28.898833"
  }
}

Getting patron:

curl -H "Content-Type: application/json" http://localhost:9000/patrons/cf49007e-e7fa-42c3-ac56-e15b9530597e|jq '.'

that returns:

{
  "patron": {
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "version": 0,
    "first_name": "",
    "last_name": "",
    "email": "test-email@xyz.com",
    "under_13": false,
    "group_roles": [],
    "num_holds": 0,
    "num_overdue": 0,
    "home_phone": "",
    "cell_phone": "",
    "work_phone": "",
    "street_address": null,
    "city": null,
    "zip_code": null,
    "state": null,
    "country": null,
    "created_at": "2023-05-09T22:21:35.142750",
    "updated_at": "2023-05-09T22:21:35.142757"
  }
}

7.7.6 Checkout book Lambda

Checkout a book:

curl -v  -H "Content-Type: application/json" http://localhost:9000/checkout -d '{"patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e", "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59"}'

that returns:

{
  "checkout": {
    "checkout_id": "4a7ea5c5-939d-4934-8715-071c7ab5bc71",
    "version": 0,
    "branch_id": "dev",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "checkout_status": "CheckedOut",
    "checkout_at": "2023-05-09T22:36:55.162807+00:00",
    "due_at": "2023-05-24T22:36:55.162808+00:00",
    "returned_at": null,
    "created_at": "2023-05-09T22:36:55.162812+00:00",
    "updated_at": "2023-05-09T22:36:55.162812+00:00"
  }
}

Returning a book:

curl -v  -H "Content-Type: application/json" http://localhost:9000/checkout/return -d '{"patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e", "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59"}'

that returns:

{
  "checkout": {
    "checkout_id": "6b432212-8136-45a5-a8c4-953da73ee24f",
    "version": 0,
    "branch_id": "dev",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "checkout_status": "Returned",
    "checkout_at": "1970-01-01T00:00:00+00:00",
    "due_at": "2023-05-09T22:36:59.145408+00:00",
    "returned_at": "2023-05-09T22:36:59.145607",
    "created_at": "2023-05-09T22:36:59.145415+00:00",
    "updated_at": "2023-05-09T22:36:59.145421+00:00"
  }
}

7.7.7 Hold book Lambda

Hold a book:

curl -v  -H "Content-Type: application/json" http://localhost:9000/hold -d '{"patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e", "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59"}'

that returns

{
  "hold": {
    "hold_id": "b6cbff12-fe0b-4be0-9566-5e221e52c8c5",
    "version": 0,
    "branch_id": "dev",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "hold_status": "OnHold",
    "hold_at": "2023-05-09T22:38:52.905822+00:00",
    "expires_at": "2023-05-24T22:38:52.905822+00:00",
    "canceled_at": null,
    "checked_out_at": null,
    "created_at": "2023-05-09T22:38:52.905825+00:00",
    "updated_at": "2023-05-09T22:38:52.905825+00:00"
  }
}

Canceling a hold:

curl -v  -H "Content-Type: application/json" http://localhost:9000/hold/cancel -d '{"patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e", "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59"}'

that returns:

{
  "hold": {
    "hold_id": "b6cbff12-fe0b-4be0-9566-5e221e52c8c5",
    "version": 0,
    "branch_id": "dev",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "hold_status": "Canceled",
    "hold_at": "2023-05-09T22:39:51.920045+00:00",
    "expires_at": "2023-05-09T22:39:51.920052+00:00",
    "canceled_at": "2023-05-09T22:39:51.920078",
    "checked_out_at": null,
    "created_at": "2023-05-09T22:39:51.920058+00:00",
    "updated_at": "2023-05-09T22:39:51.920063+00:00"
  }
}

Checking out a held book:

curl -v  -H "Content-Type: application/json" http://localhost:9000/hold/checkout -d '{"patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e", "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59"}'

that returns:

{
  "hold": {
    "hold_id": "f5fdb835-5ea2-428d-af12-a81ffb1b3f35",
    "version": 0,
    "branch_id": "dev",
    "book_id": "f58ef32a-6f24-4314-8782-c7ebcad0ab59",
    "patron_id": "cf49007e-e7fa-42c3-ac56-e15b9530597e",
    "hold_status": "CheckedOut",
    "hold_at": "2023-05-09T22:40:54.705417+00:00",
    "expires_at": "2023-05-09T22:40:54.705424+00:00",
    "canceled_at": null,
    "checked_out_at": "2023-05-09T22:40:54.705443",
    "created_at": "2023-05-09T22:40:54.705430+00:00",
    "updated_at": "2023-05-09T22:40:54.705435+00:00"
  }
}

8. Deployment and Infrastructure as Code

To fully automate deployment, the library management system uses AWS CDK to provision DynamoDB tables, CloudWatch resources and AWS Lambda functions along with the related security policies. You can deploy the infrastructure as follows:

8.1 Install CDK

npm install -g typescript
npm install aws-cdk-lib
npm install -g aws-cdk

8.2 Deploy

cd cdk
cdk deploy

8.3 Destroy

If you need to remove all infrastructure, simply run:

cd cdk
cdk destroy

9. Summary

The sample library management system demonstrates how to apply domain-driven design and hexagonal/clean architecture to build microservices. It is implemented in Rust and uses AWS DynamoDB, AWS SNS, AWS CloudWatch and AWS Lambda. The sample application also uses AWS CDK to manage infrastructure as code so that you can deploy services consistently across all environments. You can download the sample application from https://github.com/bhatti/ddd-sample-microservice.

PS: The library management system is a sample application that showcases domain-driven design and hexagonal/clean architecture. To see these concepts applied in somewhat larger open source applications, you can read Building a Secured Family-friendly Password Manager and Building a Hybrid Authorization System for Granular Access Control; the code is available at https://github.com/bhatti/PlexPass and https://github.com/bhatti/PlexAuthZ.

March 26, 2023

Elegant Implementation Patterns

Filed under: Computing,Languages — Tags: — admin @ 7:47 pm

Patterns are typical solutions to common problems in various phases of the software development lifecycle and you may find many books and resources on various types of patterns such as:

However, you may also find low-level implementation patterns that developers often apply to common coding problems. Following are a few of these coding patterns that I have found particularly fascinating and pragmatic in my experience:

Functional Options Pattern

The options pattern is used to pass configuration options to a method, e.g. you can pass a struct that holds configuration settings. These configuration properties may have default values that are used when they are not explicitly defined. You may implement the options pattern using the builder pattern to initialize config properties, but it requires explicitly building the configuration options even when there is nothing to override. In addition, error handling with the builder pattern adds complexity when chaining methods. The functional options pattern, on the other hand, defines each configuration option as a function (referenced in Functional Options in Go and 100 Go Mistakes and How to Avoid), which can validate the configuration option and return an error for invalid data. For example:

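// options holds optional settings; a nil field means "not set, use the default".
type options struct {
  port *int
  timeout *time.Duration
}

// Option applies one setting and returns an error if the value is invalid.
type Option func(opts *options) error

func WithPort(port int) Option {
  return func(opts *options) error {
    if port < 0 {
      return errors.New("port should not be negative")
    }
    opts.port = &port
    return nil
  }
}

// WithTimeout follows the same shape as WithPort.
func WithTimeout(timeout time.Duration) Option {
  return func(opts *options) error {
    if timeout <= 0 {
      return errors.New("timeout should be positive")
    }
    opts.timeout = &timeout
    return nil
  }
}

func NewServer(addr string, opts ...Option) (*http.Server, error) {
  var options options
  for _, opt := range opts {
    if err := opt(&options); err != nil {
      return nil, err
    }
  }
  // defaultHTTPPort and randomPort are assumed to be defined elsewhere.
  var port int
  if options.port == nil {
    port = defaultHTTPPort
  } else if *options.port == 0 {
    port = randomPort()
  } else {
    port = *options.port
  }
  ...
}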

You can then pass configuration options as follows:

server, err := NewServer(
  "127.0.0.1",
  WithPort(8080),
  WithTimeout(time.Second),
)

The above solution lets you handle errors when overriding a default value fails, and it also allows passing an empty list of options. Languages where errors can be propagated implicitly to the calling code may still use the builder pattern for configuration, such as:

const DEFAULT_HTTP_PORT: i32 = 8080;

#[derive(Debug, PartialEq)]
struct Options {
    port: i32,
    timeout: i32,
}

#[derive(Debug)]
pub enum OptionsError {
    Validation(String),
}

// Placeholder server type so that the example compiles.
#[derive(Debug)]
struct Server {
    addr: String,
    port: i32,
    timeout: i32,
}

impl Server {
    fn new(addr: &str, port: i32, timeout: i32) -> Self {
        Server { addr: addr.to_string(), port, timeout }
    }
}

// Placeholder: a real implementation would ask the OS for a free port.
fn random_port() -> i32 {
    49152
}

struct OptionsBuilder {
    port: Option<i32>,
    timeout: Option<i32>,
}

impl OptionsBuilder {
    fn new() -> Self {
        OptionsBuilder {
            port: Some(DEFAULT_HTTP_PORT),
            timeout: Some(1000),
        }
    }

    pub fn with_port(mut self, port: i32) -> Result<Self, OptionsError> {
        if port < 0 {
            return Err(OptionsError::Validation("port should not be negative".to_string()));
        }
        // Port 0 requests a randomly chosen port.
        if port == 0 {
            self.port = Some(random_port());
        } else {
            self.port = Some(port);
        }
        Ok(self)
    }

    pub fn with_timeout(mut self, timeout: i32) -> Result<Self, OptionsError> {
        if timeout <= 0 {
            return Err(OptionsError::Validation("timeout should be positive".to_string()));
        }
        self.timeout = Some(timeout);
        Ok(self)
    }

    pub fn build(self) -> Options {
        // Both fields are always Some because new() sets the defaults.
        Options { port: self.port.unwrap(), timeout: self.timeout.unwrap() }
    }
}

fn new_server(addr: &str, opts: Options) -> Result<Server, OptionsError> {
    Ok(Server::new(addr, opts.port, opts.timeout))
}

fn main() -> Result<(), OptionsError> {
    let _ = new_server("127.0.0.1", OptionsBuilder::new().build());
    let _ = new_server("127.0.0.1", OptionsBuilder::new().with_port(8000)?.with_timeout(2000)?.build());
    Ok(())
}

However, the above solution still requires building the config options even when no properties are overridden.
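If you want the "pass nothing to keep the defaults" behavior of functional options in Rust as well, one possible approach is to model each option as a boxed closure. The following is only a sketch under assumptions: the names Options, ServerOption, with_port, with_timeout_ms and new_options are hypothetical, and unlike the examples above it simply rejects port 0 instead of picking a random port.

```rust
// Hypothetical sketch of functional options in Rust; all names are illustrative.
const DEFAULT_HTTP_PORT: u16 = 8080;
const DEFAULT_TIMEOUT_MS: u64 = 1000;

#[derive(Debug, PartialEq)]
pub struct Options {
    pub port: u16,
    pub timeout_ms: u64,
}

// Each option is a boxed closure that validates its input and updates the options.
pub type ServerOption = Box<dyn Fn(&mut Options) -> Result<(), String>>;

pub fn with_port(port: u16) -> ServerOption {
    Box::new(move |opts: &mut Options| {
        if port == 0 {
            return Err("port should be positive".to_string());
        }
        opts.port = port;
        Ok(())
    })
}

pub fn with_timeout_ms(timeout_ms: u64) -> ServerOption {
    Box::new(move |opts: &mut Options| {
        if timeout_ms == 0 {
            return Err("timeout should be positive".to_string());
        }
        opts.timeout_ms = timeout_ms;
        Ok(())
    })
}

// Starts from the defaults and applies each override in order,
// stopping at the first validation error.
pub fn new_options(overrides: &[ServerOption]) -> Result<Options, String> {
    let mut opts = Options {
        port: DEFAULT_HTTP_PORT,
        timeout_ms: DEFAULT_TIMEOUT_MS,
    };
    for apply in overrides {
        apply(&mut opts)?;
    }
    Ok(opts)
}
```

Calling new_options(&[]) keeps every default, while new_options(&[with_port(9000), with_timeout_ms(250)]) overrides both values. The trade-off is one heap allocation per option, which rarely matters for configuration code.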

State Pattern with Enum

The state pattern is one of the GoF design patterns and is commonly used to implement finite-state machines; it is closely related to the strategy pattern. The state pattern can be easily implemented using “sum” (alternative) algebraic data types such as union or enum constructs. For example, here is an implementation of the state pattern in Rust:

use std::{error::Error, fmt};

#[derive(Debug)]
struct JobError {
    reason: String,
}

impl Error for JobError {}

impl fmt::Display for JobError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "state error {}", self.reason)
    }
}

enum JobState {
    Pending { timeout: std::time::Duration },
    Executing { percentage_completed: f32 },
    Completed { completion_time: std::time::Duration },
    Failed { cause: JobError },
}

struct JobStateMachine {
    state: JobState,
}

impl JobStateMachine {
    fn new(timeout: std::time::Duration) -> Self {
        JobStateMachine {
            state: JobState::Pending { timeout }
        }
    }
    fn to_executing(&mut self) {
        self.state = match self.state {
            JobState::Pending { .. } => JobState::Executing { percentage_completed: 0.0 },
            _ => panic!("Invalid state transition!"),
        }
    }
    fn to_succeeded(&mut self, completion_time: std::time::Duration) {
        self.state = match self.state {
            JobState::Executing { .. } => JobState::Completed { completion_time },
            _ => panic!("Invalid state transition!"),
        }
    }
    // ...
}

fn main() {
    let mut job_state_machine = JobStateMachine::new(std::time::Duration::new(1000, 0));
    job_state_machine.to_executing();
}

However, the above implementation validates state transitions only at runtime. Alternatively, you can model each state as its own struct so that invalid transitions are rejected at compile time, e.g.,

struct Pending {
    timeout: std::time::Duration,
}

impl Pending {
    fn new(timeout: std::time::Duration) -> Self {
        Pending { timeout }
    }

    fn to_executing(self) -> Executing {
        Executing::new()
    }
}

struct Executing {
    percentage_completed: f32,
}

impl Executing {
    fn new() -> Self {
        Executing { percentage_completed: 0.0 }
    }

    fn to_succeeded(self, completion_time: std::time::Duration) -> Succeeded {
        Succeeded::new(completion_time)
    }
}

struct Succeeded {
    completion_time: std::time::Duration,
}

impl Succeeded {
    fn new(completion_time: std::time::Duration) -> Self {
        Succeeded { completion_time }
    }
}

// ...

fn main() {
    let pending = Pending::new(std::time::Duration::new(1000, 0));
    let executing = pending.to_executing();
}
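When the state has to be a single runtime value (for example because it is loaded from a database), the enum-based machine shown earlier in this section is still useful, but panicking on an invalid transition is harsh. As a sketch, the transitions can return a Result instead. The InvalidTransition type and the simplified JobState variants below are assumptions made for illustration only:

```rust
// Simplified, hypothetical variant of the enum-based state machine
// that reports invalid transitions as errors instead of panicking.
#[derive(Debug, PartialEq)]
enum JobState {
    Pending,
    Executing { percentage_completed: f32 },
    Completed,
}

#[derive(Debug, PartialEq)]
struct InvalidTransition {
    from: String,
    to: &'static str,
}

struct JobStateMachine {
    state: JobState,
}

impl JobStateMachine {
    fn new() -> Self {
        JobStateMachine { state: JobState::Pending }
    }

    // Only a Pending job may start executing.
    fn to_executing(&mut self) -> Result<(), InvalidTransition> {
        match self.state {
            JobState::Pending => {
                self.state = JobState::Executing { percentage_completed: 0.0 };
                Ok(())
            }
            _ => Err(self.invalid("Executing")),
        }
    }

    // Only an Executing job may complete.
    fn to_completed(&mut self) -> Result<(), InvalidTransition> {
        match self.state {
            JobState::Executing { .. } => {
                self.state = JobState::Completed;
                Ok(())
            }
            _ => Err(self.invalid("Completed")),
        }
    }

    // Builds the error describing the rejected transition; the state is left unchanged.
    fn invalid(&self, to: &'static str) -> InvalidTransition {
        InvalidTransition { from: format!("{:?}", self.state), to }
    }
}
```

The caller can then decide whether a rejected transition is a bug or an expected race, which the panicking version does not allow.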

Tail Recursion with Trampolines and Thunks

Recursion solves a complex problem by having a function call itself on smaller sub-problems. However, each recursive call adds a stack frame to the call stack, so many functional languages apply tail-call optimization: when the recursive call is the final action of a function, the compiler turns the recursion into iteration. In languages that don’t support tail-call optimization, you can use thunks and trampolines to get the same effect. A thunk is a no-argument function that is evaluated lazily and may in turn produce another thunk for the next function call. A trampoline evaluates thunks in a loop and uses a Computation data structure to distinguish a finished result from a pending call. For example, the following code illustrates an implementation of a trampoline in Rust:

trait FnThunk {
    type Out;
    fn call(self: Box<Self>) -> Self::Out;
}

pub struct Thunk<'a, T> {
    fun: Box<dyn FnThunk<Out=T> + 'a>,
}

impl<T, F> FnThunk for F where F: FnOnce() -> T {
    type Out = T;
    fn call(self: Box<Self>) -> T { (*self)() }
}

impl<'a, T> Thunk<'a, T> {
    pub fn new(fun: impl FnOnce() -> T + 'a) -> Self {
        Self { fun: Box::new(fun) }
    }
    pub fn compute(self) -> T {
        self.fun.call()
    }
}

pub enum Computation<'a, T> {
    Done(T),
    Call(Thunk<'a, Computation<'a, T>>),
}

pub fn compute<T>(mut res: Computation<T>) -> T {
    loop {
        match res {
            Computation::Done(x) => break x,
            Computation::Call(thunk) => res = thunk.compute(),
        }
    }
}

fn factorial(n: u128) -> u128 {
    fn fac_with_acc(n: u128, acc: u128) -> Computation<'static, u128> {
        if n > 1 {
            Computation::Call(Thunk::new(move || fac_with_acc(n-1, acc * n)))
        } else {
            Computation::Done(acc)
        }
    }
    compute(fac_with_acc(n, 1))
}

fn main() {
    println!("factorial result {}", factorial(5));
}
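The benefit becomes visible with deep recursion. The following is a reduced sketch of the same idea, using a hypothetical Step type and a boxed FnOnce closure in place of the Thunk wrapper above. It sums the numbers from 1 to n one step at a time, and the stack depth stays constant no matter how large n is:

```rust
// Hypothetical, simplified trampoline: a step is either a final value
// or a boxed closure that produces the next step.
enum Step<T> {
    Done(T),
    Call(Box<dyn FnOnce() -> Step<T>>),
}

// Runs steps in a loop, so no stack frames accumulate between steps.
fn run<T>(mut step: Step<T>) -> T {
    loop {
        match step {
            Step::Done(value) => return value,
            Step::Call(next) => step = next(),
        }
    }
}

// Sums 1..=n using an accumulator; each "recursive call" is returned
// as a closure instead of being invoked directly.
fn sum_to(n: u64) -> u64 {
    fn go(n: u64, acc: u64) -> Step<u64> {
        if n == 0 {
            Step::Done(acc)
        } else {
            Step::Call(Box::new(move || go(n - 1, acc + n)))
        }
    }
    run(go(n, 0))
}
```

For example, sum_to(1_000_000) performs a million steps without growing the stack, at the cost of one small heap allocation per step.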

Memoization

Memoization caches the result of an expensive function call so that repeated invocations return the cached result instead of recomputing it. It can be implemented using the thunk pattern described above. For example, the following Rust implementation computes the value on the first call and returns the cached value afterwards:

use std::borrow::Borrow;
use std::marker::PhantomData;
use std::ops::{Deref, DerefMut};


enum Memoized<I: 'static, O: Clone, Func: Fn(I) -> O> {
    UnInitialized(PhantomData<&'static I>, Box<Func>),
    Processed(O),
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Memoized<I, O, Func> {
    fn new(lambda: Func) -> Memoized<I, O, Func> {
        Memoized::UnInitialized(PhantomData, Box::new(lambda))
    }
    fn fetch(&mut self, data: I) -> O {
        let (flag, val) = match self {
            &mut Memoized::Processed(ref x) => (false, x.clone()),
            &mut Memoized::UnInitialized(_, ref z) => (true, z(data))
        };
        if flag {
            *self = Memoized::Processed(val.clone());
        }
        val
    }
    fn is_initialized(&self) -> bool {
        match self {
            &Memoized::Processed(_) => true,
            _ => false
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Deref for Memoized<I, O, Func> {
    type Target = O;
    fn deref(&self) -> &Self::Target {
        match self {
            &Memoized::Processed(ref x) => x,
            _ => panic!("Attempted to dereference uninitialized memoized value")
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> DerefMut for Memoized<I, O, Func> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        // Panic when no value has been computed yet, consistent with Deref;
        // creating a value with std::mem::zeroed() is undefined behavior for most types.
        match self {
            &mut Memoized::Processed(ref mut x) => x,
            _ => panic!("Attempted to dereference uninitialized memoized value")
        }
    }
}

impl<I: 'static, O: Clone, Func: Fn(I) -> O> Borrow<O> for Memoized<I, O, Func> {
    fn borrow(&self) -> &O {
        match self {
            &Memoized::Processed(ref x) => x,
            _ => panic!("Attempted to borrow uninitialized memoized value")
        }
    }
}


#[cfg(test)]
mod test {
    use super::Memoized;

    #[test]
    fn test_memoized() {
        let lambda = |x: i32| -> String {
            x.to_string()
        };
        let mut dut = Memoized::new(lambda);
        assert_eq!(dut.is_initialized(), false);
        assert_eq!(&dut.fetch(5), "5");
        assert_eq!(dut.is_initialized(), true);
        assert_eq!(&dut.fetch(2000), "5");
        let x: &str = &dut;
        assert_eq!(x, "5");
    }
}
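Note that the Memoized type above caches a single value, which is why fetch(2000) in the test still returns "5". If you need one cached result per distinct input, a common alternative is to key the cache by the input. The following sketch assumes a hypothetical KeyedMemo type backed by a HashMap; the misses counter exists only to make the caching observable:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Hypothetical per-input memoizer: results are cached by input value.
struct KeyedMemo<I, O, F>
where
    I: Eq + Hash,
    O: Clone,
    F: Fn(&I) -> O,
{
    func: F,
    cache: HashMap<I, O>,
    misses: usize, // number of times the wrapped function was actually invoked
}

impl<I, O, F> KeyedMemo<I, O, F>
where
    I: Eq + Hash,
    O: Clone,
    F: Fn(&I) -> O,
{
    fn new(func: F) -> Self {
        KeyedMemo { func, cache: HashMap::new(), misses: 0 }
    }

    // Returns the cached result for `input`, computing and storing it on a miss.
    fn fetch(&mut self, input: I) -> O {
        if let Some(hit) = self.cache.get(&input) {
            return hit.clone();
        }
        self.misses += 1;
        let value = (self.func)(&input);
        self.cache.insert(input, value.clone());
        value
    }
}
```

With KeyedMemo::new(|x: &i32| x.to_string()), fetching 5, then 2000, then 5 again invokes the closure only twice. The cache grows without bound in this sketch, so a real implementation would likely need an eviction policy.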

Type Conversion

Type conversion allows converting an object from one type to another, e.g. the following interface defined in Spring:

public interface Converter<S, T> {
	@Nullable
	T convert(S source);
	default <U> Converter<S, U> andThen(Converter<? super T, ? extends U> after) {
		Assert.notNull(after, "'after' Converter must not be null");
		return (S s) -> {
			T initialResult = convert(s);
			return (initialResult != null ? after.convert(initialResult) : null);
		};
	}
}

This kind of type conversion looks very similar to the map/reduce primitives defined in functional programming languages, e.g. Java 8 added the Function interface for such transformations. In addition, Scala supports implicit conversion from one type to another, e.g.,

object Conversions:
  given fromStringToUser: Conversion[String, User] = (name: String) => User(name)

Rust also supports From and Into traits for converting types, e.g.,

use std::convert::From;

#[derive(Debug)]
struct Number {
    value: i32,
}

impl From<i32> for Number {
    fn from(item: i32) -> Self {
        Number { value: item }
    }
}

fn main() {
    let num1: Number = Number::from(10);
    let num2: Number = 20.into();
    
    println!("{:?} {:?}", num1, num2);
}
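From and Into are meant for conversions that cannot fail. When a conversion can fail, the standard library offers the TryFrom and TryInto traits, which return a Result. As a small sketch, the hypothetical Port type below accepts only values in the range 1 to 65535:

```rust
use std::convert::TryFrom;

// Hypothetical newtype for a validated TCP port.
#[derive(Debug, PartialEq)]
struct Port(u16);

impl TryFrom<i32> for Port {
    type Error = String;

    // Rejects zero, negative values and values that do not fit in u16.
    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match u16::try_from(value) {
            Ok(0) | Err(_) => Err(format!("port out of range: {}", value)),
            Ok(port) => Ok(Port(port)),
        }
    }
}
```

Port::try_from(8080) yields Ok(Port(8080)), while Port::try_from(-1) and Port::try_from(70000) yield an error. This keeps the validation in one place, similar to the checks in the options builder earlier in this post.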

January 1, 2023

Consumer-driven and Producer-generated Contract Testing for REST APIs

Filed under: REST,Testing,Web Services — admin @ 9:43 pm

Although the REST standard for remote APIs is fairly loose, you can document the shape and structure of an API using standards such as the OpenAPI and Swagger specifications. A documented API specification ensures that both the consumer/client and the producer/server abide by the specification, which prevents unexpected behavior. The API provider may also define service-level objectives (SLOs) so that the API meets specified latency, security, availability and other service-level indicators (SLIs). The API provider can use contract tests to validate the API interactions based on the documented specifications. Contract testing involves both a consumer and a producer: the consumer makes an API request and the producer produces the result. The contract tests ensure that both consumer requests and producer responses match the request and response definitions in the API specifications. These contract tests don’t just validate the API schema; they validate the interactions between consumer and producer, so they can also detect breaking or backward-incompatible changes and let consumers continue using the APIs without any surprises.

To demonstrate contract testing, we will use the api-mock-service library to generate mock/stub client requests and server responses based on OpenAPI specifications or customized test contracts. These test contracts can be used by both consumers and producers to validate API contracts, and they can evolve as the API specifications are updated.

Sample REST API Under Test

A sample eCommerce application will be used to demonstrate contract testing. The application uses various REST APIs to implement an online shopping experience. The primary purpose of this example is to show how different request structures can be passed to the REST APIs to generate either a valid result or an error condition for contract testing. You can view the OpenAPI specifications for this sample app here.

Customer REST APIs

The customer APIs define operations to manage customers who shop online, e.g.:

Customer APIs

Product REST APIs

The product APIs define operations to manage products that can be shopped online, e.g.:

Product APIs

Payment REST APIs

The payment APIs define operations to charge credit card and pay for online shopping, e.g.:

Payment APIs

Order REST APIs

The order APIs define operations to purchase a product from the online store. They use the above APIs to validate customers, check product inventory, charge payments and then store a record of each order, e.g.:

Order APIs

Generating Stub Server Responses based on Open-API Specifications

In this example, stub server responses will be generated by api-mock-service based on the OpenAPI specifications in ecommerce-api.json. Start the mock service first as follows:

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8000 -p 9000:9000 -e HTTP_PORT=8000 -e PROXY_PORT=9000 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets plexobject/api-mock-service:latest

Then upload the OpenAPI specifications from ecommerce-api.json:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

This generates test contracts with stub/mock responses for all APIs defined in the ecommerce-api.json OpenAPI specification. For example, you can call the customers REST API:

curl http://localhost:8000/customers

to produce:

[
  {
    "address": {
      "city": "PpCJyfKUomUOdhtxr",
      "countryCode": "US",
      "id": "ede97f59-2ef2-48e5-913f-4bce0f152603",
      "streetAddress": "Se somnis cibo oculi, die flammam petimus?",
      "zipCode": "06826"
    },
    "creditCard": {
      "balance": {
        "amount": 53965,
        "currency": "CAD"
      },
      "cardNumber": "7345-4444-5461",
      "customerId": "WB97W4L2VQRRkH5L0OAZGk0MT957r7Z",
      "expiration": "25/0000",
      "id": "ae906a78-0aff-4d4e-ad80-b77877f0226c",
      "type": "VISA"
    },
    "email": "abigail.appetitum@dicant.net",
    "firstName": "sciam",
    "id": "21c82838-507a-4745-bc1b-40e6e476a1fb",
    "lastName": "inquit",
    "phone": "1-717-5555-3010"
  },
...  

The above response is randomly generated based on the types, formats, regular expressions and min/max limits of the properties defined in the OpenAPI specification. Repeated calls to this API return both the valid and the error responses defined for it, e.g. calling “curl -v http://localhost:8000/customers” again may return:

* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type:
< Vary: Origin
< X-Mock-Path: /customers
< X-Mock-Request-Count: 9
< X-Mock-Scenario: getCustomerByEmail-customers-500-8a93b6c60c492e730ea149d5d09e79d85701c01dbc017d178557ed1d2c1bad3d
< Date: Sun, 01 Jan 2023 20:41:17 GMT
< Content-Length: 67
<
* Connection #0 to host localhost left intact
{"logRef":"achieve_output_fresh","message":"buffalo_rescue_street"}

Consumer-driven Contract Testing

Upon uploading the Open-API specifications of microservices, the api-mock-service generates test contracts for each REST API and response statuses. You can then customize these test cases for consumer-driven contract testing.

For example, here is the default test contract generated for finding a customer by id with path “/customers/:id”:

method: GET
name: getCustomer-customers-200-61a298e
path: /customers/:id
description: ""
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{}'
    path_params:
        id: \w+
    query_params: {}
    headers: {}
response:
    headers: 
      Content-Type:
        - application/json
    contents: '{"address":{"city":"{{RandStringMinMax 2 60}}","countryCode":"{{EnumString `US CA`}}","id":"{{UUID}}","streetAddress":"{{RandRegex `\\w+`}}","zipCode":"{{RandRegex `\\d{5}`}}"},"creditCard":{"balance":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandRegex `(USD|CAD|EUR|AUD)`}}"},"cardNumber":"{{RandRegex `\\d{4}-\\d{4}-\\d{4}`}}","customerId":"{{RandStringMinMax 30 36}}","expiration":"{{RandRegex `\\d{2}/\\d{4}`}}","id":"{{UUID}}","type":"{{EnumString `VISA MASTERCARD AMEX`}}"},"email":"{{RandRegex `.+@.+\\..+`}}","firstName":"{{RandRegex `\\w`}}","id":"{{UUID}}","lastName":"{{RandRegex `\\w`}}","phone":"{{RandRegex `1-\\d{3}-\\d{4}-\\d{4}`}}"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

Above template demonstrates interaction between consumer and producer by defining properties such as:

  • method – of REST API such as GET/POST/PUT/DELETE
  • name – of the test case
  • path – of REST API
  • description – of test
  • predicate – defines a condition which must be true to select this test contract
  • request section defines input properties for the REST API including:
    • match_query_params – to match query input parameters for selecting the test contract
    • match_headers – to match input headers for selecting the test contract
    • match_contents – defines regex for selecting input body
    • path_params – defines path variables and regex
    • query_params and headers – defines sample input parameters and headers
  • response section defines output properties for the REST API including:
    • headers – defines response headers
    • contents – defines body of response
    • contents_file – allows loading response from a file
    • status_code – defines HTTP response status
  • wait_before_reply – defines wait time before returning response

You can then invoke test contract using:

curl http://localhost:8000/customers/1

which returns a generated response from the mock/stub server provided by the api-mock-service library, e.g.:

{
  "address": {
    "city": "PanHQyfbHZVw",
    "countryCode": "US",
    "id": "ff5d0e98-daa5-49c8-bb79-f2d7274f2fb1",
    "streetAddress": "Sumus o proferens etiamne intuerer fugasti, nuntiantibus da?",
    "zipCode": "01364"
  },
  "creditCard": {
    "balance": {
      "amount": 80704,
      "currency": "USD"
    },
    "cardNumber": "3226-6666-2214",
    "customerId": "0VNf07XNWkLiIBhfmfCnrE1weTlkhmxn",
    "expiration": "24/5555",
    "id": "f9549ef3-a5eb-4df4-a8a9-85a30a6a49c6",
    "type": "VISA"
  },
  "email": "amanda.doleat@fructu.com",
  "firstName": "quaero",
  "id": "9aeee733-932d-4244-a6f8-f21d2883fd27",
  "lastName": "habeat",
  "phone": "1-052-5555-4733"
}

You can customize the above response contents using the built-in template functions of the api-mock-service library, or create additional test contracts for each distinct input parameter. For example, the following contract defines the interaction between consumer and producer for adding a new customer:

method: POST
name: saveCustomer-customers-200-ddfceb2
path: /customers
description: ""
order: 0
group: Sample Ecommerce API
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5})","creditCard.balance.amount":"(__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10})))","creditCard.balance.currency":"(__string__(USD|CAD|EUR|AUD))","creditCard.cardNumber":"(__string__\\d{4}-\\d{4}-\\d{4})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}/\\d{4})","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w)","lastName":"(__string__\\w)","phone":"(__string__1-\\d{3}-\\d{4}-\\d{4})"}'
    path_params: {}
    query_params: {}
    headers:
        ContentsType: application/json
    contents: '{"address":{"city":"__string__\\w+","countryCode":"__string__(US|CA)","streetAddress":"__string__\\w+","zipCode":"__string__\\d{5}"},"creditCard":{"balance":{"amount":"__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10}))","currency":"__string__(USD|CAD|EUR|AUD)"},"cardNumber":"__string__\\d{4}-\\d{4}-\\d{4}","customerId":"__string__\\w+","expiration":"__string__\\d{2}/\\d{4}","type":"__string__(VISA|MASTERCARD|AMEX)"},"email":"__string__.+@.+\\..+","firstName":"__string__\\w","lastName":"__string__\\w","phone":"__string__1-\\d{3}-\\d{4}-\\d{4}"}'
    example_contents: |
        address:
            city: Ab fabrorum meminerim conterritus nota falsissime deum?
            countryCode: CA
            streetAddress: Mei nisi dum, ab amaremus antris?
            zipCode: "00128"
        creditCard:
            balance:
                amount: 3000.4861560368768
                currency: USD
            cardNumber: 7740-7777-6114
            customerId: Fudi eodem sed habitaret agam pro si?
            expiration: 85/2222
            type: AMEX
        email: larry.neglecta@audio.edu
        firstName: fatemur
        lastName: gaudeant
        phone: 1-543-8888-2641
response:
    headers: 
      Content-Type: 
        - application/json
    contents: '{"address":{"city":"{{RandStringMinMax 2 60}}","countryCode":"{{EnumString `US CA`}}","id":"{{UUID}}","streetAddress":"{{RandRegex `\\w+`}}","zipCode":"{{RandRegex `\\d{5}`}}"},"creditCard":{"balance":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandRegex `(USD|CAD|EUR|AUD)`}}"},"cardNumber":"{{RandRegex `\\d{4}-\\d{4}-\\d{4}`}}","customerId":"{{RandStringMinMax 30 36}}","expiration":"{{RandRegex `\\d{2}/\\d{4}`}}","id":"{{UUID}}","type":"{{EnumString `VISA MASTERCARD AMEX`}}"},"email":"{{RandRegex `.+@.+\\..+`}}","firstName":"{{RandRegex `\\w`}}","id":"{{UUID}}","lastName":"{{RandRegex `\\w`}}","phone":"{{RandRegex `1-\\d{3}-\\d{4}-\\d{4}`}}"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

The above template defines the interaction for adding a new customer: the request section defines the format of the request and the matching criteria using the match_contents property, and the response section includes the headers and contents that the stub/mock server generates for consumer-driven contract testing. You can then invoke the test contract using:

curl -X POST http://localhost:8000/customers -d '{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'

Which will return a response such as:

{
  "address": {
    "city": "j77oUSSoB5lJCUtc4scxtm0vhilPRdLE7Nc8KzAunBa87OrMerCZI",
    "countryCode": "CA",
    "id": "9bb21030-29d0-44be-8f5a-25855e38c164",
    "streetAddress": "Qui superbam imago cernimus, sensarum nuntii tot da?",
    "zipCode": "08020"
  },
  "creditCard": {
    "balance": {
      "amount": 75666,
      "currency": "AUD"
    },
    "cardNumber": "1383-8888-5013",
    "customerId": "nNaUd15lf6lqkAEwKoguVTvBnPMBVDhdeO",
    "expiration": "73/5555",
    "id": "554efad7-17ab-49f9-967a-3e47381a4d34",
    "type": "AMEX"
  },
  "email": "deborah.vivit@desivero.gov",
  "firstName": "contexo",
  "id": "db70b737-ee1d-48ed-83da-c5a8773c7a5f",
  "lastName": "delectat",
  "phone": "1-013-7777-0054"
}

Note: The response will not match the request body as the contract testing only tests interactions between consumer and producer without maintaining any server side state. You can use other types of testing such as integration/component/functional testing for validating state based behavior.
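The request matching described above can be sketched in a few lines of Python. This is only an illustration and not the api-mock-service implementation: it assumes that the "__string__" and "__number__" prefixes are type hints, that the rest of each value is a regular expression, and that a partial match is sufficient. The helper names strip_type_prefix and find_mismatches are hypothetical.

```python
import re

# Assumption: the "__string__" / "__number__" prefixes are type hints and the
# remainder of each value is a regular expression for that field.
TYPE_PREFIXES = ("__string__", "__number__")


def strip_type_prefix(pattern):
    """Remove a leading type hint, leaving only the regular expression."""
    for prefix in TYPE_PREFIXES:
        if pattern.startswith(prefix):
            return pattern[len(prefix):]
    return pattern


def find_mismatches(patterns, payload, path=""):
    """Return the dotted paths whose values are missing or fail their pattern."""
    mismatches = []
    for key, pattern in patterns.items():
        field = f"{path}.{key}" if path else key
        value = payload.get(key) if isinstance(payload, dict) else None
        if isinstance(pattern, dict):
            # Nested object: check its fields recursively.
            mismatches += find_mismatches(pattern, value, field)
        elif value is None or not re.search(strip_type_prefix(pattern), str(value)):
            mismatches.append(field)
    return mismatches


# A subset of the patterns from the contract above.
patterns = {
    "address": {"countryCode": r"__string__(US|CA)", "zipCode": r"__string__\d{5}"},
    "email": r"__string__.+@.+\..+",
}
good = {"address": {"countryCode": "US", "zipCode": "24121"},
        "email": "andrew.recorder@ipsas.net"}
bad = {"address": {"countryCode": "MX", "zipCode": "24121"},
       "email": "not-an-email"}

print(find_mismatches(patterns, good))
print(find_mismatches(patterns, bad))
```

A request that produces an empty list would be served by the stub; any listed path points at the field that violated the contract.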

Producer-driven Generated Tests

The process of defining contracts to generate tests for validating producer REST APIs is similar to consumer-driven contracts. For example, you can upload open-api specifications or user-defined contracts to the api-mock-service provided mock/stub server.

For example, you can upload open-API specifications for ecommerce-api.json as follows:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

Upon uploading the specifications, the mock server will generate contracts for each REST API and status. You can customize those contracts with additional validations or assertions and then invoke the server-generated tests either for a specific REST API or for multiple REST APIs belonging to a specific group. You can also define an order for executing the tests in a group and optionally pass data from one invocation to the next.

For testing purposes, we will customize the customer REST APIs for adding a new customer and fetching a customer by its id, i.e.,

A contract for adding a new customer

method: POST
name: save-customer
path: /customers
group: customers
order: 0
request:
    headers:
        Content-Type: application/json
    contents: |
        address:
            city: {{RandCity}}
            countryCode: {{EnumString `US CA`}}
            id: {{UUID}}
            streetAddress: {{RandSentence 2 3}}
            zipCode: {{RandRegex `\d{5}`}}
        creditCard:
            balance:
                amount: {{RandNumMinMax 20 500}}
                currency: {{EnumString `USD CAD`}}
            cardNumber: {{RandRegex `\d{4}-\d{4}-\d{4}`}}
            customerId: {{UUID}}
            expiration: {{RandRegex `\d{2}/\d{4}`}}
            id: {{UUID}}
            type: {{EnumString `VISA MASTERCARD`}}
        email: {{RandEmail}}
        firstName: {{RandName}}
        id: {{UUID}}
        lastName: {{RandName}}
        phone: {{RandRegex `1-\d{3}-\d{3}-\d{4}`}}
response:
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.id":"(__string__\\w+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5}.?\\d{0,4})","creditCard.balance.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__[\\d-]{10,20})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}.\\d{4})","creditCard.id":"(__string__\\w+)","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w+)","id":"(__string__\\w+)","lastName":"(__string__\\w+)","phone":"(__string__[\\-\\w\\d]{9,15})"}'
    pipe_properties:
      - id
      - email
    assertions:
      - VariableContains contents.email @
      - VariableContains contents.creditCard.type A
      - VariableContains headers.Content-Type application/json
      - VariableEQ status 200

The request section defines the contents property that builds the input request, which is sent to the producer-provided REST API. The response section defines match_contents to match each response property against a regex. In addition, the response section defines assertions that compare the response contents, headers or status against the expected output.
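To make the assertion lines concrete, here is a minimal Python sketch of how such assertions could be evaluated against a response. The semantics are assumptions inferred from the operator names (substring check, equality, greater-or-equal); the functions lookup and evaluate are hypothetical and are not part of api-mock-service.

```python
def lookup(variables, dotted_name):
    """Resolve a name such as 'contents.creditCard.type' in nested dicts."""
    value = variables
    for part in dotted_name.split("."):
        value = value[part]
    return value


def evaluate(assertion, variables):
    """Evaluate one 'Operator name expected' line (assumed semantics)."""
    operator, name, expected = assertion.split(" ", 2)
    actual = lookup(variables, name)
    if operator == "VariableContains":
        return expected in str(actual)
    if operator == "VariableEQ":
        return str(actual) == expected
    if operator == "VariableGE":
        return float(actual) >= float(expected)
    raise ValueError(f"unsupported assertion: {operator}")


# A response shaped like the one the save-customer contract expects.
response = {
    "status": 200,
    "headers": {"Content-Type": "application/json"},
    "contents": {
        "email": "andrew.recorder@ipsas.net",
        "creditCard": {"type": "VISA"},
    },
}
assertions = [
    "VariableContains contents.email @",
    "VariableContains contents.creditCard.type A",
    "VariableContains headers.Content-Type application/json",
    "VariableEQ status 200",
]
failed = [a for a in assertions if not evaluate(a, response)]
print(failed)
```

An empty list means every assertion held; a non-empty list names the assertions that a test run would report as errors.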

A contract for finding an existing customer

method: GET
name: get-customer
path: /customers/{{.id}}
description: ""
order: 1
group: customers
predicate: ""
request:
    path_params:
        id: \w+
    query_params: {}
    headers:
      Content-Type: application/json
    contents: ""
    example_contents: ""
response:
    headers: {}
    match_headers:
      Content-Type: application/json    
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__(US|CA))","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5})","creditCard.balance.amount":"(__number__[+-]?((\\d{1,10}(\\.\\d{1,5})?)|(\\.\\d{1,10})))","creditCard.balance.currency":"(__string__(USD|CAD|EUR|AUD))","creditCard.cardNumber":"(__string__\\d{4}-\\d{4}-\\d{4})","creditCard.customerId":"(__string__\\w+)","creditCard.expiration":"(__string__\\d{2}/\\d{4})","creditCard.type":"(__string__(VISA|MASTERCARD|AMEX))","email":"(__string__.+@.+\\..+)","firstName":"(__string__\\w)","lastName":"(__string__\\w)","phone":"(__string__1-\\d{3}-\\d{3}-\\d{4})"}'
    pipe_properties:
      - id
      - email
    assertions:
      - VariableContains contents.email @
      - VariableContains contents.creditCard.type A
      - VariableContains headers.Content-Type application/json
      - VariableEQ status 200

The above template defines similar properties to generate the request and defines match_contents with assertions to match the expected output headers, body and status. Based on the order of the tests, the generated test for adding a new customer is executed first, followed by the test for finding a customer by id. As we are testing against real REST APIs, the path for finding a customer is defined as “/customers/{{.id}}”, and the id is populated from the output of the first test based on its pipe_properties.
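The data flow between the two tests can be illustrated with a short Python sketch. It assumes that pipe_properties simply selects top-level response fields and that “{{.id}}” is replaced by the piped value; the functions pipe and render_path are hypothetical names used only for this illustration.

```python
import re


def pipe(response_contents, pipe_properties):
    """Keep only the response properties that should flow to the next test."""
    return {name: response_contents[name]
            for name in pipe_properties if name in response_contents}


def render_path(path_template, variables):
    """Substitute '{{.name}}' placeholders with the piped values."""
    return re.sub(r"\{\{\.(\w+)\}\}",
                  lambda match: str(variables[match.group(1)]),
                  path_template)


# Output of the first test (save-customer), trimmed to a few fields.
saved = {
    "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373",
    "email": "anna.intra@amicum.edu",
    "firstName": "anna",
}
piped = pipe(saved, ["id", "email"])
next_path = render_path("/customers/{{.id}}", piped)
print(next_path)
```

Only id and email survive the pipe, and the second test then requests the customer that the first test created.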

Uploading Contracts

Once you have the api-mock-service mock server running, you can upload contracts using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_customer.yaml \
	http://localhost:8000/_scenarios
curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_customer.yaml \
	http://localhost:8000/_scenarios

Start your service before invoking the generated tests; here we will use sample-openapi for testing purposes. You can then invoke the generated tests using:

curl -X POST http://localhost:8000/_contracts/customers -d \
	'{"base_url": "http://localhost:8080", "execution_times": 5, "verbose": true}'

The above command will execute all tests in the customers group, invoking each REST API 5 times. After executing the APIs, it will generate a result as follows:

{
  "results": {
    "get-customer_0": {
      "email": "anna.intra@amicum.edu",
      "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"
    },
    "get-customer_1": {
      "email": "aaron.sequi@laetus.gov",
      "id": "c5128ac0-865c-4d91-bb0a-23940ac8a7cb"
    },
    "get-customer_2": {
      "email": "edward.infligi@evellere.com",
      "id": "a485739f-01d4-442e-9ddc-c2656ba48c63"
    },
    "get-customer_3": {
      "email": "gary.volebant@istae.com",
      "id": "ef0eacd0-75cc-484f-b9a4-7aebfe51d199"
    },
    "get-customer_4": {
      "email": "alexis.dicant@displiceo.net",
      "id": "da65b914-c34e-453b-8ee9-7f0df598ac13"
    },
    "save-customer_0": {
      "email": "anna.intra@amicum.edu",
      "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"
    },
    "save-customer_1": {
      "email": "aaron.sequi@laetus.gov",
      "id": "c5128ac0-865c-4d91-bb0a-23940ac8a7cb"
    },
    "save-customer_2": {
      "email": "edward.infligi@evellere.com",
      "id": "a485739f-01d4-442e-9ddc-c2656ba48c63"
    },
    "save-customer_3": {
      "email": "gary.volebant@istae.com",
      "id": "ef0eacd0-75cc-484f-b9a4-7aebfe51d199"
    },
    "save-customer_4": {
      "email": "alexis.dicant@displiceo.net",
      "id": "da65b914-c34e-453b-8ee9-7f0df598ac13"
    }
  },
  "errors": {},
  "metrics": {
    "getcustomer_counts": 5,
    "getcustomer_duration_seconds": 0.006,
    "savecustomer_counts": 5,
    "savecustomer_duration_seconds": 0.006
  },
  "succeeded": 10,
  "failed": 0
}
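A build pipeline would typically inspect such a report rather than read it by eye. The following Python sketch shows one way to do that, assuming the report has the shape shown above (results, errors, succeeded, failed); check_run is a hypothetical helper, and the rule that each save-customer id must be fetched back by the matching get-customer run is an assumption drawn from the sample output.

```python
def check_run(report, first="save-customer", second="get-customer"):
    """Return a list of problems found in a contract-test report."""
    problems = [f"{name}: {message}"
                for name, message in report.get("errors", {}).items()]
    if report.get("failed", 0):
        problems.append(f"{report['failed']} executions failed")
    results = report.get("results", {})
    for name, saved in results.items():
        if not name.startswith(first + "_"):
            continue
        # "save-customer_0" pairs with "get-customer_0", and so on.
        fetched = results.get(second + name[len(first):])
        if fetched is None or fetched.get("id") != saved.get("id"):
            problems.append(f"{name}: id was not fetched back")
    return problems


report = {
    "results": {
        "get-customer_0": {"email": "anna.intra@amicum.edu",
                           "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"},
        "save-customer_0": {"email": "anna.intra@amicum.edu",
                            "id": "fa7a06cd-1bf1-442e-b761-d1d074d24373"},
    },
    "errors": {},
    "succeeded": 2,
    "failed": 0,
}
print(check_run(report))
```

An empty list indicates a clean run; a pipeline step could fail the build whenever the list is non-empty.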

Though the generated tests are executed against real services, it’s recommended that the service implementation use test doubles or mock services for any dependent services, as contract testing is not meant to replace component or end-to-end tests, which provide better support for integration testing.

Recording Consumer/Producer interactions for Generating Stub Requests and Responses

Contract testing does not always depend on API specifications such as Open API and Swagger; instead, you can record interactions between consumers and producers using the api-mock-service tool.

For example, if you have an existing REST API or a legacy service such as the sample API above, you can record an interaction as follows:

export http_proxy="http://localhost:9000"
export https_proxy="http://localhost:9000"
curl -X POST -H "Content-Type: application/json" http://localhost:8080/customers -d \
	'{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'

This will invoke the remote REST API, record contract interactions and then return server response:

{
  "id": "95d655e1-405e-4087-8a7d-56791eaf51cc",
  "firstName": "quendam",
  "lastName": "formaeque",
  "email": "andrew.recorder@ipsas.net",
  "phone": "1-345-6666-0618",
  "creditCard": {
    "id": "d966aafa-c28b-4078-9e87-f7e9d76dd848",
    "customerId": "tgzwgThaiZqc5eDwbKk23nwjZqkap7",
    "type": "VISA",
    "cardNumber": "5566-2222-8282",
    "expiration": "70/6666",
    "balance": {
      "amount": 57012,
      "currency": "USD"
    }
  },
  "address": {
    "id": "4a788c96-e532-4a97-9b8b-bcb298636bc1",
    "streetAddress": "Cura diu me, miserere me?",
    "city": "rwjJS",
    "zipCode": "24121",
    "countryCode": "US"
  }
}

The recorded contract can be used to generate the stub response, e.g. following configuration defines the recorded contract:

method: POST
name: recorded-customers-200-55240a69747cac85a881a3ab1841b09c2c66d6a9a9ae41c99665177d3e3b5bb7
path: /customers
description: recorded at 2023-01-02 03:18:11.80293 +0000 UTC for http://localhost:8080/customers
order: 0
group: customers
predicate: ""
request:
    match_query_params: {}
    match_headers:
        Content-Type: application/json
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__\\w+)","address.id":"(.+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5,5})","creditCard.balance.amount":"(__number__[+-]?\\d{1,10})","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__\\d{4,4}[-]\\d{4,4}[-]\\d{4,4})","creditCard.customerId":"(.+)","creditCard.expiration":"(.+)","creditCard.id":"(.+)","creditCard.type":"(__string__\\w+)","email":"(__string__\\w+.?\\w+@\\w+.?\\w+)","firstName":"(__string__\\w+)","id":"(.+)","lastName":"(__string__\\w+)","phone":"(__string__\\d{1,1}[-]\\d{3,3}[-]\\d{4,4}[-]\\d{4,4})"}'
    path_params: {}
    query_params: {}
    headers:
        Accept: '*/*'
        Content-Length: "522"
        Content-Type: application/json
        User-Agent: curl/7.65.2
    contents: '{"address":{"city":"rwjJS","countryCode":"US","id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","zipCode":"24121"},"creditCard":{"balance":{"amount":57012,"currency":"USD"},"cardNumber":"5566-2222-8282","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","expiration":"70/6666","id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","type":"VISA"},"email":"andrew.recorder@ipsas.net","firstName":"quendam","id":"071396bb-f8db-489d-a8f7-bbcce952ecef","lastName":"formaeque","phone":"1-345-6666-0618"}'
    example_contents: ""
response:
    headers:
        Content-Type:
            - application/json
        Date:
            - Mon, 02 Jan 2023 03:18:11 GMT
    contents: '{"id":"95d655e1-405e-4087-8a7d-56791eaf51cc","firstName":"quendam","lastName":"formaeque","email":"andrew.recorder@ipsas.net","phone":"1-345-6666-0618","creditCard":{"id":"d966aafa-c28b-4078-9e87-f7e9d76dd848","customerId":"tgzwgThaiZqc5eDwbKk23nwjZqkap7","type":"VISA","cardNumber":"5566-2222-8282","expiration":"70/6666","balance":{"amount":57012.00,"currency":"USD"}},"address":{"id":"4a788c96-e532-4a97-9b8b-bcb298636bc1","streetAddress":"Cura diu me, miserere me?","city":"rwjJS","zipCode":"24121","countryCode":"US"}}'
    contents_file: ""
    example_contents: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"address.city":"(__string__\\w+)","address.countryCode":"(__string__\\w+)","address.id":"(.+)","address.streetAddress":"(__string__\\w+)","address.zipCode":"(__string__\\d{5,5})","creditCard.balance.amount":"(__number__[+-]?\\d{1,10})","creditCard.balance.currency":"(__string__\\w+)","creditCard.cardNumber":"(__string__\\d{4,4}[-]\\d{4,4}[-]\\d{4,4})","creditCard.customerId":"(.+)","creditCard.expiration":"(.+)","creditCard.id":"(.+)","creditCard.type":"(__string__\\w+)","email":"(__string__\\w+.?\\w+@\\w+.?\\w+)","firstName":"(__string__\\w+)","id":"(.+)","lastName":"(__string__\\w+)","phone":"(__string__\\d{1,1}[-]\\d{3,3}[-]\\d{4,4}[-]\\d{4,4})"}'
    pipe_properties: []
    assertions: []
wait_before_reply: 0s

You can then invoke consumer-driven contracts to generate stub responses or invoke generated tests against the producer implementation, as described in the earlier sections. Another benefit of capturing test contracts from a recorded session is that it accurately captures all URLs, parameters and headers for both requests and responses, so that contract testing can precisely validate against existing behavior.
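The recorded match_contents above contains patterns such as \d{5,5} for a zip code and \d{4,4}[-]\d{4,4}[-]\d{4,4} for a card number, which suggests that recorded values are generalized into regular expressions. The Python sketch below shows one plausible way to do such a generalization. It is a guess for illustration only, not the algorithm used by api-mock-service, and infer_pattern is a hypothetical name.

```python
import re


def infer_pattern(value):
    """Guess a regular expression that generalizes one recorded string value."""
    if re.fullmatch(r"\w+", value):
        # A single word: digits keep their length, anything else is generic.
        if value.isdigit():
            return r"\d{%d,%d}" % (len(value), len(value))
        return r"\w+"
    parts = []
    # Split into digit runs, letter runs and single separator characters.
    for token in re.findall(r"\d+|[A-Za-z_]+|.", value):
        if token.isdigit():
            parts.append(r"\d{%d,%d}" % (len(token), len(token)))
        elif re.fullmatch(r"[A-Za-z_]+", token):
            parts.append(r"\w+")
        else:
            parts.append("[%s]" % re.escape(token))
    return "".join(parts)


# Values taken from the recorded request above.
recorded = ["24121", "5566-2222-8282", "rwjJS", "1-345-6666-0618"]
for value in recorded:
    pattern = infer_pattern(value)
    # Every inferred pattern must at least accept the value it came from.
    print(value, pattern, bool(re.fullmatch(pattern, value)))
```

Whatever inference rule is used, a useful sanity check is the one in the loop: the generated pattern should always match the value it was derived from.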

Summary

Though unit testing, component testing and end-to-end testing are common testing strategies used by most organizations, they don’t provide adequate support for validating API specifications and interactions between consumers/clients and producers/providers of the APIs. Contract testing ensures that consumers and producers do not deviate from the specifications and can be used to validate changes for backward compatibility as APIs evolve. It also decouples consumers and producers while the API is still in development, as both parties can write code against the agreed contracts and test them independently. A service owner can generate producer contracts using tools such as api-mock-service based on Open API specifications or user-defined constraints. The consumers can provide their consumer-driven contracts to the service providers to ensure that API changes don’t break any consumers. These contracts can be stored in a source code repository or in a registry service so that contract tests can easily access and execute them as part of the build and deployment pipelines. The api-mock-service tool greatly assists in adding contract testing to your software development lifecycle and is freely available from https://github.com/bhatti/api-mock-service.

December 20, 2022

Property-based and Generative testing for Microservices

Filed under: REST,Technology,Testing — Tags: — admin @ 1:26 pm

The software development cycle for microservices generally includes unit testing during development, where mock implementations of the dependent services are injected with the desired behavior to exercise various test scenarios and failure conditions. However, development teams often use real dependent services for integration testing of a microservice in a local environment. This poses a considerable challenge, as each dependent service may keep its own state, which makes it harder to reliably validate regression behavior or simulate a certain error response. Further, as the number of request parameters to the service or downstream services grows, the combinatorial explosion of test cases becomes unmanageable. This is where property-based testing offers relief, as it allows testing against automatically generated fuzz input data, which is why this form of testing is also referred to as generative testing. A generator defines a function that generates random data based on the type of input and constraints on the range of input values. The property-based test driver then iteratively calls the system under test to validate the result and assert the desired behavior, e.g.

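import random

# Illustrative assumption: each "kind" is an inclusive (low, high) integer range.
def pre_condition_test_input_param(kind, value):
  # assert pre-condition based on type of parameter and range of input values it may take
  low, high = kind
  assert low <= value <= high

def generate_test_input_param(kind):
  # generate data meeting pre-condition for the type
  value = random.randint(*kind)
  pre_condition_test_input_param(kind, value)
  return value

def generate_test_input_params(kinds):
  return [generate_test_input_param(kind) for kind in kinds]

def function_under_test(a, b, c):
  # stand-in for the system under test
  return a + b + c

max_attempts = 100
kinds = [(0, 10), (0, 100), (0, 1000)]
for i in range(max_attempts):
  a, b, c = generate_test_input_params(kinds)
  output = function_under_test(a, b, c)
  assert output >= 0                # property1: result is never negative
  assert output == c + b + a        # property2: result does not depend on argument order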

In the above example, the input parameters are randomly generated based on a precondition. The generated parameters are passed to the function under test and the test driver validates the result based on property assertions. This entire process is also referred to as fuzzing; it is repeated for a fixed number of attempts to identify any input parameters for which the property assertions fail. There are many libraries for property-based testing in various languages, such as QuickCheck, fast-check, junit-quickcheck and ScalaCheck, but we will use the api-mock-service library to demonstrate these capabilities for testing microservice APIs.

The following sections describe how the api-mock-service library can be used for testing microservices with fuzzing/property-based approaches and for mocking dependent services to produce the desired behavior:

Sample Microservices Under Test

A sample eCommerce application will be used to demonstrate property-based and generative testing. The application will use various microservices to implement online shopping experience. The primary purpose of this example is to show how different parameters can be passed to microservices, where microservice APIs will validate the input parameters, perform a simple business logic and then generate a valid result or an error condition. You can view the Open-API specifications for this sample app here.

Customer APIs

The customer APIs define operations to manage customers who shop online, e.g.:

Customer APIs

Product APIs

The product APIs define operations to manage products that can be shopped online, e.g.:

Product APIs

Payment APIs

The payment APIs define operations to charge credit card and pay for online shopping, e.g.:

Payment APIs

Order APIs

The order APIs define operations to purchase a product from the online store and it will use above APIs to validate customers, check product inventory, charge payments and then store record of orders, e.g.:

Order APIs

Defining Test Scenarios with Open-API Specifications

In this example, test scenarios will be generated by api-mock-service based on the Open-API specifications in ecommerce-api.json. Start the mock service first as follows:

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8000 -p 9000:9000 -e HTTP_PORT=8000 -e PROXY_PORT=9000 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets plexobject/api-mock-service:latest

And then uploading open-API specifications for ecommerce-api.json:

curl -H "Content-Type: application/yaml" --data-binary @ecommerce-api.json \
	http://localhost:8000/_oapi

It will generate mock APIs for all microservices, e.g. you can call the products API:

curl http://localhost:8000/products

to produce:

[
  {
    "id": "fd6a5ddb-35bc-47a9-aacb-9694ff5f8a32",
    "category": "TOYS",
    "inventory": 13,
    "name": "Se nota.",
    "price":{
      "amount":2,
      "currency": "USD"
    }
  },
  {
    "id": "47aab7d9-ecd2-4593-b1a6-c34bb5ca02bc",
    "category": "MUSIC",
    "inventory": 30,
    "name": "Proferuntur mortem.",
    "price":{
      "amount":23,
      "currency": "CAD"
    }
  },
  {
    "id": "ae649ae7-23e3-4709-b665-b1b0f436c97a",
    "category": "BOOKS",
    "inventory": 8,
    "name": "Cor.",
    "price":{
      "amount":13,
      "currency": "USD"
    }
  },
  {
    "id": "a3bd8426-e26d-4f66-8ee8-f55798440dc3",
    "category": "MUSIC",
    "inventory": 43,
    "name": "E diutius.",
    "price":{
      "amount":22,
      "currency": "USD"
    }
  },
  {
    "id": "7f328a53-1b64-4e4f-b6a6-7a69aed1b183",
    "category": "BOOKS",
    "inventory": 54,
    "name": "Dici utroque.",
    "price":{
      "amount":23,
      "currency": "USD"
    }
  }
]

The above response is randomly generated based on the properties defined in the Open-API specification, and repeated calls to this API will return both valid and error responses, e.g. calling “curl http://localhost:8000/products” again may return:

< HTTP/1.1 400 Bad Request
< Content-Type:
< Vary: Origin
< X-Mock-Path: /products
< X-Mock-Request-Count: 1
< X-Mock-Scenario: getProductByCategory-07ef44df0d38389ca9d589faaab9e458bd79e8abe7d2e1149e56c00820fac1fb
< Date: Tue, 20 Dec 2022 04:54:58 GMT
< Content-Length: 122
<
{ [122 bytes data]

* Connection #0 to host localhost left intact
{
  "errors": [
    "category_gym_bargain",
    "expand_tuna_stomach",
    "cage_enroll_between",
    "bulk_choice_category",
    "trend_agree_purse"
  ]
}

Applying Property-based/Generative Testing for Clients of Microservices

Upon uploading the Open-API specifications of microservices, the api-mock-service automatically generated templates for producing mock responses and error conditions, which can be customized for property-based and generative testing of microservice clients by defining constraints for generating input/output data and assertions for request/response validation.

Client-side Testing for Listing Products

You can find generated mock scenarios for listing products on the mock service using:

curl -v http://localhost:8000/_scenarios|jq '.'|grep "GET.getProductByCategory"

which returns:

"/_scenarios/GET/getProductByCategory-1a6d4d84e4a8a1ad706d671a26e66c419833b3a99f95cc442942f96d0d8f43f8/products": {
"/_scenarios/GET/getProductByCategory-6e522e565bb669ab3d9b09cc2e16b9d636220ec28a860a1cc30e9c5104e41f53/products": {
"/_scenarios/GET/getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1/products": {
"/_scenarios/GET/getProductByCategory-9ed14ecd11bbeb9f7bfde885d00efcbf168661354e4c48fe876c545e9a778302/products": {

and then invoking above URL paths, e.g.

curl -v http://localhost:8000/_scenarios/GET/getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1/products

which will return the generated scenario template such as:

method: GET
name: getProductByCategory-7ede8f15af851323576a0c864841e859408525632eb002a1571676a0d835a0e1
path: /products
description: ""
order: 1
group: products
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{}'
    path_params: {}
    query_params:
        category: '[\x20-\x7F]{1,128}'
    headers:
        "Content-Type": "application/json"
    contents: ""
response:
    headers: {}
    contents: '[{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 10000 10000}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandStringMinMax 0 0}}"}}]'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":".+","id":"(__string__\\w+)","inventory":".+","name":"(__string__\\w+)","price.amount":".+","price.currency":"(__string__\\w+)"}'
wait_before_reply: 0s

We can customize the above response contents using built-in template functions in the api-mock-service library to generate a fuzz response, e.g.

    headers:
        "Content-Type":
          - "application/json"
    contents: >
      [
{{- range $val := Iterate 5}}
        {
          "id": "{{UUID}}",
          "category": "{{EnumString `BOOKS MUSIC TOYS`}}",
          "inventory": {{RandNumMinMax 1 100}},
          "name": "{{RandSentence 1 3}}",
          "price":{
            "amount":{{RandNumMinMax 1 25}},
            "currency": "{{EnumString `USD CAD`}}"
          }
        }{{if lt $val 4}},{{end}}
{{ end }}
      ]
    status_code: 200

In the above example, we slightly improved the test template by generating product entries in a loop and using built-in functions to randomize the data. You can upload this scenario using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_products.yaml \
	http://localhost:8000/_scenarios
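For readers less familiar with the template syntax, the customized response template is roughly equivalent to the following Python sketch. It assumes that RandNumMinMax bounds are inclusive and that RandSentence produces one to three words; the word list and the random_products function are made up for this illustration.

```python
import json
import random
import uuid

WORDS = ["se", "nota", "cor", "dici", "utroque"]  # placeholder vocabulary


def random_products(count=5, rng=random):
    """Build `count` product entries with randomized but constrained fields."""
    return [
        {
            "id": str(uuid.uuid4()),
            "category": rng.choice(["BOOKS", "MUSIC", "TOYS"]),
            "inventory": rng.randint(1, 100),
            "name": " ".join(rng.choice(WORDS) for _ in range(rng.randint(1, 3))),
            "price": {
                "amount": rng.randint(1, 25),
                "currency": rng.choice(["USD", "CAD"]),
            },
        }
        for _ in range(count)
    ]


# json.dumps takes care of the separators that the template handles
# with the conditional comma after each entry.
body = json.dumps(random_products(), indent=2)
print(body)
```

Each call yields a different but always well-formed list of five products, which is the behavior the client code has to cope with.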

You can also generate a template for returning an error response similarly, i.e.,

method: GET
name: error-products
path: /products
description: ""
order: 2
group: products
predicate: '{{NthRequest 2}}'
request:
    headers:
        "Content-Type": "application/json"
    query_params:
        category: '[\x20-\x7F]{1,128}'
response:
    headers: {}
    contents: '{"errors":["{{RandSentence 5 10}}"]}'
    contents_file: ""
    status_code: {{EnumInt 400 415 500}}
    match_contents: '{"errors":"(__string__\\w+)"}'
wait_before_reply: 0s

Invoking curl -v http://localhost:8000/products will return one of those test scenarios on each call so that client code can test for various conditions.
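Client code under test has to handle both outcomes. A minimal Python sketch of such a client-side handler is shown below; ProductServiceError and parse_products_response are hypothetical names, and the handler works on a status code and body so that it can be exercised without a running mock server.

```python
import json


class ProductServiceError(Exception):
    """Raised when the products API answers with a non-200 status."""

    def __init__(self, status, errors):
        super().__init__(f"HTTP {status}: {errors}")
        self.status = status
        self.errors = errors


def parse_products_response(status, body):
    """Return product names on success, raise ProductServiceError otherwise."""
    payload = json.loads(body)
    if status != 200:
        raise ProductServiceError(status, payload.get("errors", []))
    return [product["name"] for product in payload]


ok_body = ('[{"id": "fd6a5ddb-35bc-47a9-aacb-9694ff5f8a32", "category": "TOYS", '
           '"inventory": 13, "name": "Se nota.", '
           '"price": {"amount": 2, "currency": "USD"}}]')
error_body = '{"errors": ["category_gym_bargain", "expand_tuna_stomach"]}'

print(parse_products_response(200, ok_body))
try:
    parse_products_response(400, error_body)
except ProductServiceError as error:
    print(error.status, error.errors)
```

Running the client repeatedly against the mock server would exercise both branches, since the server alternates between the success and error scenarios.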

Client-side Testing for Creating Products

You can find mock scenarios for creating products that were generated from above Open-API specifications using:

curl -v http://localhost:8000/_scenarios|jq '.'|grep "POST.saveProduct"

You can then customize the scenario as follows and upload it:

method: POST
name: saveProduct
path: /products
description: ""
order: 0
group: products
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    path_params: {}
    query_params: {}
    headers:
        "Content-Type": "application/json"
    contents: |
        category: MUSIC
        id: suavitas
        inventory: 5408.89695278641
        name: leporem
        price:
            amount: 7373.800941656166
            currency: cordis
response:
    headers: {}
    contents: '{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 5 500}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"{{RandStringMinMax 0 0}}"}}'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    pipe_properties:
      - id
      - name
    assertions: []
wait_before_reply: 0s

You can then upload the scenario and invoke the POST /products API using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_product.yaml http://localhost:8000/_scenarios

curl  http://localhost:8000/products -d \
  '{"category":"BOOKS","id":"123","inventory":"10","name":"toy 1","price":{"amount":12,"currency":"USD"}}'

The client code can test for product properties and other error scenarios can be added to simulate failure conditions.
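As an illustration of testing product properties on the client side, the Python sketch below checks a response against the constraints used in the scenario's response template (category enumeration, inventory between 5 and 500, name length between 2 and 50). The function product_property_violations is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
CATEGORIES = {"BOOKS", "MUSIC", "TOYS"}


def product_property_violations(product):
    """Return the names of the properties that violate the expected constraints."""
    violations = []
    if product.get("category") not in CATEGORIES:
        violations.append("category")
    try:
        # The template quotes inventory, so accept numeric strings as well.
        inventory = float(product.get("inventory"))
    except (TypeError, ValueError):
        inventory = None
    if inventory is None or not 5 <= inventory <= 500:
        violations.append("inventory")
    name = product.get("name") or ""
    if not 2 <= len(name) <= 50:
        violations.append("name")
    return violations


valid = {"category": "BOOKS", "id": "123", "inventory": "10", "name": "toy 1",
         "price": {"amount": 12, "currency": "USD"}}
invalid = {"category": "FOOD", "id": "124", "inventory": "2", "name": "x"}

print(product_property_violations(valid))
print(product_property_violations(invalid))
```

A client test can call the mock API in a loop and assert that the violations list stays empty for every generated response.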

Applying Property-based/Generative Testing for Microservices

The api-mock-service test scenarios defined above can also be used to test against the microservice implementations. Start your service (we will use sample-openapi for testing purposes) and then invoke test requests for server-side testing using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/get_products.yaml \
	http://localhost:8000/_scenarios
curl -H "Content-Type: application/yaml" --data-binary @fixtures/save_product.yaml \
	http://localhost:8000/_scenarios

curl -k -v -X POST http://localhost:8000/_contracts/products -d \
	'{"base_url": "http://localhost:8080", "execution_times": 5, "verbose": true}'

The above command will submit a request to execute all scenarios belonging to the products group five times and then return:

{
  "results": {
    "getProducts_0": {},
    "getProducts_1": {},
    "getProducts_2": {},
    "getProducts_3": {},
    "getProducts_4": {},
    "saveProduct_0": {
      "id": "895f584b-dc65-4950-982e-167680bcd133",
      "name": "Opificiis misera dei."
    },
    "saveProduct_1": {
      "id": "d89b6c16-549c-4baa-9dca-4dd9bb4b3ecf",
      "name": "Ea sumus aula teneant."
    },
    "saveProduct_2": {
      "id": "15dd54eb-fe89-4de8-9570-59fca20b9969",
      "name": "Vim odor et respondi."
    },
    "saveProduct_3": {
      "id": "e3769044-2a19-4e86-b0aa-9724378a0113",
      "name": "Me tua timeo an."
    },
    "saveProduct_4": {
      "id": "07ee20b9-df9a-487d-9ff9-cf76bef09a8f",
      "name": "Ruminando latinae omnibus."
    }
  },
  "metrics": {
    "getProducts_counts": 5,
    "getProducts_duration_seconds": 0.007,
    "saveProduct_counts": 5,
    "saveProduct_duration_seconds": 0.005
  },  
  "errors": {},
  "succeeded": 10,
  "failed": 0
}

You can also add custom assertions to validate the response in the save-product scenario:

method: POST
name: saveProduct
path: /products
description: ""
order: 0
group: products
predicate: ""
request:
    match_query_params: {}
    match_headers: {}
    match_contents: '{"category":"(BOOKS|MUSIC|TOYS)","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    path_params: {}
    query_params: {}
    headers:
        "Content-Type": "application/json"
    contents: |
        category: TOYS
        id: tempus
        inventory: 3890.9145609093966
        name: pleno
        price:
            amount: 5539.183583809511
            currency: "{{EnumString `USD CAD`}}"
response:
    headers: {}
    contents: '{"category":"{{EnumString `BOOKS MUSIC TOYS`}}","id":"{{RandStringMinMax 0 0}}","inventory":"{{RandNumMinMax 5 500}}","name":"{{RandStringMinMax 2 50}}","price":{"amount":{{RandNumMinMax 0 0}},"currency":"$"}}'
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"category":"(__string__(BOOKS|MUSIC|TOYS))","id":"(__string__\\w+)","inventory":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","name":"(__string__\\w+)","price.amount":"(__number__[+-]?(([0-9]{1,10}(\\.[0-9]{1,5})?)|(\\.[0-9]{1,10})))","price.currency":"(USD|CAD)"}'
    pipe_properties:
      - id
      - name
    assertions:
        - VariableGE contents.inventory 5
        - VariableContains contents.category S
        - VariableContains contents.category X
wait_before_reply: 0s

If you run it again, the execution fails with the following errors because none of the categories contains X:

{
  "results": {
    "getProducts_0": {},
    "getProducts_1": {},
    "getProducts_2": {},
    "getProducts_3": {},
    "getProducts_4": {}
  },
  "errors": {
    "saveProduct_0": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_1": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_2": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_3": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'",
    "saveProduct_4": "failed to assert '{{VariableContains \"contents.category\" \"X\"}}' with value 'false'"
  },
  "succeeded": 5,
  "failed": 5
}

Summary

Unit testing and other testing methodologies don’t rule out the presence of bugs, but they can greatly reduce their probability. However, with large test suites, maintaining the tests incurs a high development cost, especially if the tests are brittle and require frequent changes. Property-based/generative testing can help fill the gaps in unit testing while keeping the test suite small. The api-mock-service tool is designed to mock and test microservices using fuzzing and property-based testing techniques. It can be used to test both client and server side implementations, and to generate error conditions that are not easily reproducible. It can be a powerful tool in your toolbox when developing distributed systems with a large number of services, which can be difficult to deploy and test locally. You can read more about api-mock-service at “Mocking and Fuzz Testing Distributed Micro Services with Record/Play, Templates and OpenAPI Specifications” and download it freely from https://github.com/bhatti/api-mock-service.

December 4, 2022

Evolving Software Development Model for Building Distributed Systems

Filed under: Business,Technology — Tags: , — admin @ 5:19 pm

1. Overview

Over the last few decades, software systems have evolved from the mainframe and client-server models to distributed systems and service oriented architecture. In the early 2000s, Amazon and Netflix led the industry in adopting microservices by applying Conway’s law and a self-organizing team structure where a small 2-pizza team owns the entire lifecycle of its microservices, including operational responsibilities. The microservice architecture, with its small, cross-functional and independent teams, has improved agility because teams can develop, test and deploy microservices independently. However, software systems are becoming increasingly complex with the Cambrian explosion of microservices, and the ecosystem is reaching a boiling point: building new features, releasing enhancements and carrying the operational load of maintaining high availability, scalability, resilience, security, observability, etc. are slowing down development teams and raising artificial complexity as a result of mixing different concerns.

The following diagram shows the difference between monolithic architecture and microservice architecture:

microservices

As you can see in the above diagram, each microservice is responsible for managing a number of cross-cutting concerns such as AuthN/AuthZ, monitoring, rate-limiting, configuration, secret management, etc., which adds to the scope of work for developing each service. The following sections dive into the fundamental causes of the complexity that comes with microservices and a path forward for evolving the development methodology to build these microservices more effectively.

2. Perils in Building Distributed Systems

The following is a list of major pitfalls that development teams face when building distributed systems and services:

2.1 Coordinating Feature Development

Feature development becomes more convoluted as dependencies on downstream services increase, and any large change in a single service often requires API changes or additional capabilities from multiple dependent services. This creates numerous challenges such as prioritizing feature development among different teams, coordinating release timelines and making progress when the dependent functionality is absent from the development environment. This is often tackled with additional personnel for project management and by managing dependencies with Gantt charts or other project management tools, but it still leads to unexpected delays and miscommunication among team members about the deliverables.

2.2 Low-level Concurrency Controls

Development teams often use imperative languages and low-level abstractions to implement distributed services, where each request is served by a native thread in a web server. Because native threads carry significant overhead, such as the memory reserved for each thread’s stack, the web server is constrained in the maximum number of concurrent requests it can support. In addition, these native threads often share common state, which must be protected with a mutex, semaphore or lock to avoid data corruption. These low-level and primitive abstractions couple business logic with concurrency logic, which adds accidental complexity when safeguarding the shared state or communicating between threads. The problem worsens over time as the code grows, resulting in subtle concurrency-related heisenbugs that may produce incorrect results in the production environment.

2.3 Security

Each microservice requires implementing authentication, authorization, secure communication, key management and other aspects of security. Development teams generally have to support these security aspects for each service they own and stay on top of security patches and vulnerabilities. For example, a zero-day log4j vulnerability in December 2021 wreaked havoc in most organizations as multiple services were affected by the bug and needed a patch immediately. This resulted in a large effort by each development team to patch their services and deploy the patched services as soon as possible. Worse, the development teams had to apply patches multiple times because the initial bug fixes from the log4j team didn’t fully work, further multiplying the work for each development team. With the growth of the dependency stack or bill of materials for third party libraries in modern applications, development teams face an overwhelming operational burden and enormous security risk to support their services safely.

2.4 Web Server

In general, each microservice requires a web server, which adds computing and administration overhead for deploying and running the service stack. The web server must be running all the time whether the service is receiving requests or not, thus wasting CPU, memory and storage resources.

2.5 Colocating Multiple Services

Development teams often start with a monolithic style application that hosts multiple services on a single web server, or with segregated application servers hosting multiple services on the same web server, to lessen the development and deployment effort. The monolithic and service colocation architecture hinders speed, agility and extensibility as the code becomes complicated and harder to maintain and reason about due to the lack of isolation. In this style of deployment, computing resources can be entirely consumed by a single component, or a bug in one service can crash the entire system. As each service may have unique runtime or usage characteristics, it’s also arduous to scale a single service, to plan service capacity or to isolate service failures in a colocated runtime environment.

2.6 Cross Cutting Concerns

Building distributed systems requires managing a lot of horizontal concerns such as security, resilience, business continuity, and availability, but coupling these common concerns with the business logic results in inconsistencies and higher complexity because each microservice implements them differently. Mixing these concerns with business logic means each development team has to solve them independently, and any omission or divergence may cause a miserable user experience, faulty results, poor protection against load spikes or a security breach in the system.

2.7 Service Discovery

Although microservices use various synchronous and asynchronous protocols for communicating with other services, they often store the endpoints of other services locally in the service configuration. This adds a maintenance and operational burden for maintaining the endpoints of all dependent services in each development, test and production environment. In addition, services may not be able to apply certain containment or access control policies, such as not invoking a cross-region service in order to maintain lower latency or to sustain a service for disaster recovery.

2.8 Architectural Quality Attributes

The architecture quality attributes include performance, availability, sustainability, security, scalability, fault tolerance, resilience, recovery, usability, etc. Each development team not only has to manage these attributes for each service but often has to coordinate with other teams when scaling their services so that downstream services can handle the additional load or meet the availability/reliability guarantees. Thus, availability, fault tolerance, capacity management and other architectural concerns become tied to downstream services, as an outage in any of those services directly affects upstream services. As a result, improving availability, scalability, fault tolerance or other architecture quality attributes often requires changes in the dependent services, which adds to the scope of the development work.

2.9 Superfluous Development Work

When developing a microservice, a development team owns the end-to-end development and release process, which includes support for the full software development lifecycle such as:

  • maintaining build scripts for CI/CD pipelines
  • building automation tools for integration/functional/load/canary tests
  • defining access policies related to throttling/rate-limits
  • implementing consistent error handling, idempotency behavior, contextual information across services
  • adding alarms/metrics/monitoring/observability/notification/logs
  • providing customized personal and system dashboards
  • supporting data encryption and managing secret keys
  • defining security policies related to AuthN/AuthZ/Permissions/ACL
  • defining network policies related to VPN, firewall, load-balancer, network/gateway/routing configuration
  • adding compression, caching, and any other common pre/post processing for services

As a result, any deviations or bugs in the implementation of these processes, or misconfiguration in the underlying infrastructure, can lead to an inconsistent user experience, security gaps and outages. Some organizations maintain lengthy checklists and hefty review processes for applying best practices before a software release, but they often miss key learnings from other teams and slow down development due to the cumbersome release process.

2.10 Yak shaving when developing and testing features

Due to the enormous artificial complexity of microservices with an abysmal dependency stack, development teams have to spend an inordinate amount of time setting up a development environment when building a new feature. Feature development requires testing with unit tests that mock the behavior of dependent services, and with integration tests that use real dependent services in a local development environment. However, it’s not always possible to run the complete stack locally and fully test the changes for new features, so developers are encumbered with finding an alternative integration environment where other developers may also be testing their features. All this yak shaving makes software development awfully tedious and error prone because developers can’t test their features in isolation with high confidence. This means that development teams find bugs in later phases of the release process, which may require a rollback of feature changes and block additional releases until the bugs are fixed in the main/release branch.

3. Path to the Enlightenment

The following are a few recommendations to remedy the above pitfalls in the development of distributed systems and microservices:

3.1 Higher level of Development Abstraction

Instead of using low-level imperative languages or low-level concurrency controls, high-level abstractions can be applied to simplify the development of the microservices as follows:

3.1.1 Actor Model

The Actor model was first introduced in 1973 by Carl Hewitt and provides a high-level abstraction for concurrent computation. An actor uses a mailbox or a queue to store incoming messages and processes one message at a time using local state in a single green thread or coroutine. An actor can create other actors or send messages without any lock-based synchronization or blocking. Actors are reactive: they cannot initiate any action on their own and instead simply react to external stimuli in the form of messages. Actors also provide much better error handling, as applications can define a hierarchy of actors with parent/child relationships where a child actor may crash when encountering system errors and the parent actor, which monitors and supervises it, handles failure recovery.

actor

The actor model is supported natively in languages such as Erlang/Elixir, Scala, Swift and Pony, and it’s available as a library in many other languages. In these languages, actors generally use green threads, coroutines or a preemptive scheduler to schedule actors with non-blocking I/O operations. As actors incur much lower overhead compared to native threads, they can be used to implement microservices with much greater scalability and performance. In addition, message passing removes the need to guard shared state as each actor maintains only local state, which provides a more robust and reliable implementation of microservices. Here is an example of the actor model in Erlang:

-module(sum).
-export([init/0, add/1, get/0]).

init() ->
    Pid = spawn(fun() -> loop(0) end),
    register(sumActor, Pid).

loop(N) ->
    receive
        {add, X} -> loop(N+X);
        {Client, get} ->
            Client ! N,
            loop(N)
    end.

add(X) ->
    sumActor ! {add, X}.

get() ->
    sumActor ! {self(), get},
    receive Result -> Result end.

In the above example, an actor is spawned to run the loop function, which uses tail recursion to receive the next message from the queue and processes it based on the tag of the message, such as add or get. The client code sends messages using the name sumActor, which is registered with the local process registry. As an actor only maintains local state, microservices may use an external data store and manage a state machine using the orchestration-based SAGA pattern to trigger the next action.
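For readers more familiar with Rust, the same mailbox idea can be sketched with a standard library channel. This is only an illustration: the names Msg, spawn_sum_actor and get are made up for this sketch, and it uses one OS thread per actor rather than the green threads that actor runtimes typically provide.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Messages understood by the actor, mirroring the `add` and `get` tags.
enum Msg {
    Add(i64),
    // Carries a reply channel, playing the role of `Client` in the Erlang code.
    Get(Sender<i64>),
}

// Spawns the actor and returns a handle to its mailbox (hypothetical helper).
fn spawn_sum_actor() -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        let mut total = 0i64; // local state, owned by this actor only
        // Process one message at a time until every sender is dropped.
        for msg in rx {
            match msg {
                Msg::Add(x) => total += x,
                Msg::Get(reply) => {
                    let _ = reply.send(total);
                }
            }
        }
    });
    tx
}

// Sends a `Get` message and waits for the actor's reply.
fn get(actor: &Sender<Msg>) -> i64 {
    let (reply_tx, reply_rx) = channel();
    actor.send(Msg::Get(reply_tx)).expect("actor stopped");
    reply_rx.recv().expect("actor dropped the reply channel")
}

fn main() {
    let actor = spawn_sum_actor();
    actor.send(Msg::Add(2)).unwrap();
    actor.send(Msg::Add(3)).unwrap();
    println!("sum = {}", get(&actor));
}
```

No lock appears anywhere in the sketch because the running total is never shared; callers can only influence it by sending messages.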

3.1.2 Function as a service (FaaS) and Serverless Computing

Function as a service (FaaS) offers serverless computing to simplify managing physical resources. Cloud vendors offer services such as AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads. There is also open source support for FaaS computing, such as OpenFaaS and OpenWhisk, on top of Kubernetes or OpenShift. These functions resemble the actor model as each function is triggered by an event and is designed with single responsibility, idempotency and shared-nothing principles so that it can be executed concurrently.

FaaS/BaaS

FaaS and serverless computing can be used to develop microservices where the business logic is embedded within serverless functions, while platform services such as storage, messaging, caching and SMS are exposed via Backend as a Service (BaaS). Backend as a Service (BaaS) adds additional business logic on top of Platform as a Service (PaaS).

3.1.3 Agent based computing

Microservices architecture decouples data access from the business logic, and microservices fetch data from the data store, which incurs a higher overhead if the business logic needs to fetch a lot of data for processing or filtering before generating results. In contrast, agent-style computing allows migrating the business logic to where the data or computing resources reside, so it can process data more efficiently. In the simplest example, an agent may behave like a stored procedure where a function is passed to a data store and executed within the database. Other kinds of agents may support additional capabilities, such as gathering data from different data stores or sources and then producing the desired results after processing the data remotely.

3.2 Service and Schema Registry

The service registry allows microservices to register their endpoints so that other services can look up the endpoints for communicating with them instead of storing the endpoints locally. This allows the service registry to enforce authorization and access policies for communication based on geographic location or other constraints. A service registry may also allow registering mock services for testing in a local development environment to facilitate feature development. In addition, the registry may store schema definitions for the API models so that services can validate requests/responses easily or support multiple versions of the API contracts.
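As a rough illustration of the lookup-with-policy idea, here is a minimal in-memory sketch in Rust. The ServiceRegistry and Endpoint types, the URLs and the region names are assumptions made up for this example and do not correspond to any particular registry product.

```rust
use std::collections::HashMap;

// An endpoint registered by a service instance (illustrative fields only).
#[derive(Clone, Debug, PartialEq)]
struct Endpoint {
    url: String,
    region: String,
}

// Hypothetical in-memory registry keyed by service name.
#[derive(Default)]
struct ServiceRegistry {
    endpoints: HashMap<String, Vec<Endpoint>>,
}

impl ServiceRegistry {
    // Called by a service at startup to publish where it can be reached.
    fn register(&mut self, service: &str, url: &str, region: &str) {
        self.endpoints
            .entry(service.to_string())
            .or_default()
            .push(Endpoint {
                url: url.to_string(),
                region: region.to_string(),
            });
    }

    // Containment policy: only return endpoints in the caller's own region,
    // so a caller can never be handed a cross-region dependency.
    fn lookup(&self, service: &str, caller_region: &str) -> Option<&Endpoint> {
        self.endpoints
            .get(service)?
            .iter()
            .find(|e| e.region == caller_region)
    }
}

fn main() {
    let mut registry = ServiceRegistry::default();
    registry.register("products", "http://products.us-west.internal:8080", "us-west");
    registry.register("products", "http://products.eu-central.internal:8080", "eu-central");

    match registry.lookup("products", "eu-central") {
        Some(endpoint) => println!("resolved {}", endpoint.url),
        None => println!("no endpoint available in this region"),
    }
}
```

Because callers resolve endpoints at request time, a mock implementation can be registered under the same service name in a local environment without touching the caller's configuration.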

3.3 API Router, Reverse-Proxy or Gateway

An API router, reverse-proxy or API gateway is a common pattern with microservices for routing, monitoring, versioning, securing and throttling APIs. These patterns can also be used with FaaS architecture, where an API gateway may provide these capabilities and eliminate the need for a web server per service function. Thus, an API gateway or router can lower the computing cost of each service and reduce the complexity of maintaining non-functional capabilities or -ilities.

3.4 Virtualization

Virtualization abstracts computer hardware and uses a hypervisor to create multiple virtual computers with different operating systems and applications on top of a single physical computer.

3.4.1 Virtual Machines

The initial implementation of virtualization was based on Virtual Machines for building virtualized computing environments and emulating a physical computer. The virtual machines use a hypervisor to communicate with the physical computer.

virtual machines

3.4.2 Containers

Containers implement virtualization using the host operating system instead of a hypervisor, and thus provide more light-weight and faster provisioning of computing resources. Containers use platforms such as Docker and Kubernetes to execute applications and services, which are bundled into images based on the Open Container Initiative (OCI) standard.

containers

3.4.3 MicroVM

MicroVMs such as Firecracker and crosvm are based on the kernel-based VM (KVM) and use the host OS as a hypervisor to provide isolation and security. Because MicroVMs include only the essential features for network, storage, throttling and metadata, they are quick to start and can scale to support many VMs with minimal overhead. A number of serverless platforms such as AWS Lambda, appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube have adopted Firecracker VM, which offers low overhead for starting a new virtual machine or executing a serverless workload.

micro virtualmachine

3.4.4 WebAssembly

WebAssembly is a stack-based virtual machine that can run at the edge or in the cloud. Applications written in Go, C, Rust, AssemblyScript, etc. are compiled into a WebAssembly binary and are then executed on a WebAssembly runtime such as extism, faasm, wasmtime, wamr, wasmer and wagi. WebAssembly supports the WebAssembly System Interface (WASI) standard, which provides access to system APIs for different operating systems, similar to the POSIX standard. There is also active development of the WebAssembly Component Model with proposals such as WebIDL bindings and Interface Types. This allows you to write microservices in any supported language, compile the code into a WASM binary and then deploy it on a managed platform with full support for security, traffic management, observability, etc.

containers

A number of WebAssembly platforms such as teaclave, wasmCloud, fermyon and Lunatic have also adopted the Actor model to build a platform for writing distributed applications.

Solomon Hykes, a co-founder of Docker, described the significance of WASI as follows: “If WASM+WASI existed in 2008, we wouldn’t have needed to created Docker. That’s how important it is. Webassembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!”

3.5 Instrumentation and Service Binding

In order to reduce the service code that deals specifically with non-functional capabilities such as authentication, authorization, logging, monitoring, etc., the business service can be instrumented to provide those capabilities at compile or deployment time. This means that the development team can largely focus on the business requirements while instrumentation takes care of adding metrics, failure reporting, diagnostics, monitoring, etc. without any development work. In addition, any external platform dependencies such as a messaging service, orchestration, database, key/value store or cache can be injected into the service dynamically at runtime. The runtime can be configured to provide different implementations of the platform services, e.g. it may use a local Redis server for the key/value store in a hosted environment or the AWS/Azure implementation in a cloud environment.
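To make the binding idea concrete, here is a small Rust sketch in which the business code depends only on a trait and the implementation is chosen elsewhere. The KeyValueStore trait, InMemoryStore, ProductService and bind_store names are all hypothetical; a real platform would supply the binding through its own runtime rather than a function like this.

```rust
use std::collections::HashMap;

// The capability the business code depends on (hypothetical interface).
trait KeyValueStore {
    fn put(&mut self, key: &str, value: &str);
    fn get(&self, key: &str) -> Option<String>;
}

// A local implementation, e.g. for a developer laptop or a test run.
#[derive(Default)]
struct InMemoryStore {
    items: HashMap<String, String>,
}

impl KeyValueStore for InMemoryStore {
    fn put(&mut self, key: &str, value: &str) {
        self.items.insert(key.to_string(), value.to_string());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.items.get(key).cloned()
    }
}

// Business logic only sees the trait, never a concrete store.
struct ProductService {
    store: Box<dyn KeyValueStore>,
}

impl ProductService {
    fn save(&mut self, id: &str, name: &str) {
        self.store.put(id, name);
    }

    fn name_of(&self, id: &str) -> Option<String> {
        self.store.get(id)
    }
}

// Stand-in for the runtime's binding step. This sketch always returns the
// in-memory store; a cloud deployment would return a managed-store client here.
fn bind_store(_environment: &str) -> Box<dyn KeyValueStore> {
    Box::new(InMemoryStore::default())
}

fn main() {
    let mut service = ProductService {
        store: bind_store("local"),
    };
    service.save("p1", "pleno");
    println!("{:?}", service.name_of("p1"));
}
```

Swapping the store for another environment changes only the binding step, not ProductService.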

When deploying services with WebAssembly support, the instrumentation may use WebAssembly libraries for extending the services to support authentication, rate-limiting, observability, monitoring, state management, and other non-functional capabilities, so that the development work can be shortened as shown below:

wasm platform

3.6 Orchestration and Choreography

Orchestration and choreography allow writing microservices that can be easily composed to provide higher-level abstractions for business services. In an orchestration design, a coordinator manages synchronous communication among different services, whereas choreography uses event-driven architecture to communicate asynchronously. The actor model fits naturally with event-based architecture for communicating with other actors and external services. However, orchestration services can be used to model complex business processes where a SAGA pattern manages state transitions for different activities. Microservices may also use staged event-driven architecture (SEDA) to decompose complex event-driven services into a set of stages connected by queues, which supports better modularity and code reuse. SEDA allows enforcing admission control on each event queue and grants flexible scheduling for processing events based on adaptive workload controls and load shedding policies.
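The orchestration-based SAGA pattern can be sketched as a coordinator that runs steps in order and, when one fails, runs the compensating actions of the steps that already completed in reverse order. The sketch below is deliberately simplified and synchronous; the Step type, run_saga and the reserve/charge functions are hypothetical names invented for this example.

```rust
// Shared log used only to make the order of actions visible in this sketch.
type Log = Vec<String>;

// A saga step pairs an action with the compensation that undoes it.
struct Step {
    name: &'static str,
    action: fn(&mut Log) -> Result<(), String>,
    compensate: fn(&mut Log),
}

// The orchestrator: run steps in order and, on failure, compensate the
// steps that already completed, newest first.
fn run_saga(steps: &[Step], log: &mut Log) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(e) = (step.action)(log) {
            for done in steps[..i].iter().rev() {
                (done.compensate)(log);
            }
            return Err(format!("{} failed: {}", step.name, e));
        }
    }
    Ok(())
}

fn reserve(log: &mut Log) -> Result<(), String> {
    log.push("reserve".into());
    Ok(())
}

fn release(log: &mut Log) {
    log.push("release".into());
}

// Simulates a downstream failure so that compensation is triggered.
fn charge(log: &mut Log) -> Result<(), String> {
    log.push("charge".into());
    Err("card declined".into())
}

fn refund(log: &mut Log) {
    log.push("refund".into());
}

fn main() {
    let steps = [
        Step { name: "reserve-inventory", action: reserve, compensate: release },
        Step { name: "charge-payment", action: charge, compensate: refund },
    ];
    let mut log = Log::new();
    let outcome = run_saga(&steps, &mut log);
    println!("{:?} {:?}", outcome, log);
}
```

In a production orchestrator the saga state would be persisted between steps so that the coordinator can resume or compensate after a crash; that part is omitted here.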

3.7 Automation, Continuous Deployment and Infrastructure as a code

Automation is key to removing drudgery from the development process and consolidating common build processes such as continuous integration and deployment to improve the productivity of a development team. Development teams can employ continuous delivery to deploy small and frequent changes. Continuous deployment often uses rolling updates, blue/green deployments or canary deployments to minimize disruption to end users. The monitoring system watches error rates at each stage of the deployment and automatically rolls back the changes if a problem occurs.
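The rollback logic of a canary deployment can be sketched as a loop over traffic stages that stops as soon as the observed error rate crosses a threshold. In this Rust sketch the stage percentages, the threshold and the canary_rollout function are illustrative assumptions, and the error rate is supplied by a closure standing in for a real monitoring system.

```rust
// Share of traffic sent to the new version at each stage (illustrative values).
const STAGES: [u8; 4] = [1, 10, 50, 100];

#[derive(Debug, PartialEq)]
enum Rollout {
    Completed,
    RolledBack { at_percent: u8 },
}

// Walks through the stages and aborts at the first stage whose observed
// error rate exceeds `max_error_rate`. `error_rate_at` stands in for a
// query against the monitoring system.
fn canary_rollout(error_rate_at: impl Fn(u8) -> f64, max_error_rate: f64) -> Rollout {
    for &percent in STAGES.iter() {
        if error_rate_at(percent) > max_error_rate {
            return Rollout::RolledBack { at_percent: percent };
        }
    }
    Rollout::Completed
}

fn main() {
    // A healthy release stays below the 1% threshold at every stage.
    let healthy = canary_rollout(|_| 0.002, 0.01);
    // A faulty release only misbehaves once it receives half of the traffic.
    let faulty = canary_rollout(|percent| if percent >= 50 { 0.08 } else { 0.002 }, 0.01);
    println!("{:?} {:?}", healthy, faulty);
}
```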

Infrastructure as code (IaC) uses a declarative language to define the development, test and production environments, and these definitions are managed by source code management software. This provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Tools such as Azure Resource Manager (ARM), AWS Cloud Development Kit (CDK) and HashiCorp Terraform support IaC for deploying computing resources.

3.8 Proxy Patterns

The following sections show how to implement cross cutting concerns using proxy patterns when building microservices:

3.8.1 Sidecar Model

The sidecar model helps modularity and reusability: an application uses two containers, an application container and a sidecar container, where the sidecar container provides additional functionality such as an SSL proxy for the service, observability, or metrics collection for the application container.

sidecar pattern

The sidecar pattern generally uses another container to proxy all traffic, enforcing security, access control and throttling before forwarding the incoming requests to the microservice.
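The essence of the pattern, checks first and forwarding second, can be sketched in a few lines of Rust. A real sidecar is a separate process or container, so treat this purely as an in-process illustration; the Request and Sidecar types, the token comparison and the fixed request budget are simplifying assumptions.

```rust
// A request as seen by the proxy (illustrative fields only).
struct Request {
    token: Option<String>,
    path: String,
}

// Either the application's response or an HTTP-style rejection code.
type ProxyResult = Result<String, u16>;

struct Sidecar<H: Fn(&Request) -> String> {
    app: H,                 // stands in for the application container
    expected_token: String, // placeholder for a real authentication check
    remaining: u32,         // naive request budget in place of a real rate limiter
}

impl<H: Fn(&Request) -> String> Sidecar<H> {
    // Applies the cross cutting checks, then forwards to the application.
    fn handle(&mut self, req: &Request) -> ProxyResult {
        if req.token.as_deref() != Some(self.expected_token.as_str()) {
            return Err(401); // unauthenticated
        }
        if self.remaining == 0 {
            return Err(429); // throttled
        }
        self.remaining -= 1;
        Ok((self.app)(req))
    }
}

fn main() {
    let mut sidecar = Sidecar {
        app: |req: &Request| format!("app handled {}", req.path),
        expected_token: "demo-token".to_string(),
        remaining: 1,
    };
    let ok = Request { token: Some("demo-token".into()), path: "/products".into() };
    let anonymous = Request { token: None, path: "/products".into() };

    println!("{:?}", sidecar.handle(&ok));        // forwarded to the app
    println!("{:?}", sidecar.handle(&anonymous)); // rejected before reaching the app
    println!("{:?}", sidecar.handle(&ok));        // budget exhausted
}
```

The application closure contains no authentication or throttling code at all, which is the point of moving those concerns into the proxy.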

3.8.2 Service Mesh

The service mesh uses a mesh of sidecar proxies to enable:

  • Dynamic request routing for blue-green deployments, canaries, and A/B testing
  • Load balancing based on latency, geographic locations or health checks
  • Service discovery based on a version of the service in an environment
  • TLS/mTLS encryption
  • Authentication and Authorization
  • Keys, certificates and secrets management
  • Rate limiting and throttling
  • State management and PubSub
  • Observability, metrics, monitoring, logging
  • Distributed tracing
  • Traffic management and traffic splitting
  • Circuit breaker and retries using libraries like Finagle, Stubby, and Hystrix to isolate unhealthy instances and gradually add them back after successful health checks
  • Error handling and fault tolerance
  • Control plane to manage routing tables, service discovery, load balancer and other service configuration
Service Mesh

The service mesh pattern uses a data plane to host microservices, and all incoming and outgoing requests go through a sidecar proxy that implements cross cutting concerns such as security, routing, throttling, etc. The control plane in the mesh network allows administrators to change the behavior of the data plane proxies or the configuration for data access. Popular service mesh frameworks include Consul, Distributed Application Runtime (Dapr), Envoy, Istio, Linkerd, Kong, Koyeb and Kuma, which provide a control plane and data plane with built-in support for networking, observability, traffic management and security. Dapr also supports the Actor model using the Orleans Virtual Actor pattern, which leverages the scalability and reliability guarantees of the underlying platform.
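Among the capabilities listed above, the circuit breaker is the one whose behavior is easiest to misread, so here is a minimal Rust sketch of its state machine. It is not how any particular mesh or library implements it; the CircuitBreaker type, its thresholds and the use of Err(None) to signal a rejected call are choices made for this illustration.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Closed,   // calls flow normally
    Open,     // calls are rejected without reaching the dependency
    HalfOpen, // the reset timeout elapsed, so a trial call is allowed
}

struct CircuitBreaker {
    failure_threshold: u32,
    reset_timeout: Duration,
    failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, reset_timeout: Duration) -> Self {
        CircuitBreaker { failure_threshold, reset_timeout, failures: 0, opened_at: None }
    }

    fn state(&self) -> State {
        match self.opened_at {
            None => State::Closed,
            Some(t) if t.elapsed() >= self.reset_timeout => State::HalfOpen,
            Some(_) => State::Open,
        }
    }

    // Runs `op` unless the breaker is open. `Err(None)` means the call was
    // rejected by the breaker; `Err(Some(e))` carries the dependency's error.
    fn call<T, E>(&mut self, op: impl FnOnce() -> Result<T, E>) -> Result<T, Option<E>> {
        if self.state() == State::Open {
            return Err(None);
        }
        match op() {
            Ok(value) => {
                // A success (including a half-open trial) closes the breaker.
                self.failures = 0;
                self.opened_at = None;
                Ok(value)
            }
            Err(e) => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    // Open the breaker, or re-open it after a failed trial call.
                    self.opened_at = Some(Instant::now());
                }
                Err(Some(e))
            }
        }
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(2, Duration::from_secs(30));
    for attempt in 1..=3 {
        let result = breaker.call(|| Err::<(), &str>("connection refused"));
        println!("attempt {}: {:?}, state {:?}", attempt, result, breaker.state());
    }
}
```

In a service mesh this logic lives in the sidecar proxy, so the application code does not have to carry it.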

3.9 Cell-Based Architecture

A Cell-based Architecture (CBA) allows grouping business functions into single units called “cells”, which provides better scalability, modularity, composability, disaster recovery and governance for building microservices. It reduces the number of deployment units for microservices, thus reducing the artificial complexity of building distributed systems.

cell-based architecture

A cell is an immutable collection of components and services that is independently deployable, manageable, and observable. The components and services in a cell communicate with each other using local network protocols, and with other cells via network endpoints through a cell gateway or brokers. A cell defines an API interface as part of its definition, which provides better isolation, a smaller blast radius and agility because releases can be rolled out cell by cell. However, in order to maintain the isolation, a cell should either be self-contained for all service dependencies, or its dependent cells should be in the same region so that there is no cross-region dependency.

cell-based data-plane

3.10 12-Factor app

The 12-factor app is a set of best practices from Heroku for deploying applications and services in a virtualized environment. It recommends using declarative configuration for automation, enabling continuous deployment and other best practices. Though these recommendations are a bit dated, most of them still hold, except that you may consider storing credentials or keys in a secret store or in encrypted files instead of simply using environment variables.

4. New Software Development Workflow

The architecture patterns, virtual machines, WebAssembly and serverless computing described in the above section can be used to simplify the development of microservices and to support both the functional and non-functional requirements of the business. The following sections describe how these patterns can be integrated with the software development lifecycle:

4.1 Development

4.1.1 Adopting WebAssembly as the Lingua Franca

WebAssembly, along with its component model and the WASI standard, has been gaining support, and many popular languages can now be compiled into a wasm binary. Microservices can be developed in any supported language, and ubiquitous WebAssembly support in the production environment can reduce the development and maintenance drudgery for application development teams. In addition, teams can leverage a number of serverless platforms that support WebAssembly, such as extism, faasm, netlify, vercel, wasmcloud and wasmedge, to reduce the operational load.

4.1.2 Instrumentation and Service Binding

The compiled WASM binary for microservices can be instrumented similar to aspects so that the generated code automatically supports horizontal concerns such as circuit breaker, diagnostics, failure reporting, metrics, monitoring, retries, tracing, etc. without any additional endeavor by the application developer.

In addition, the runtime can use service mesh features to bind external platform services such as an event bus, data store, cache, service registry or dependent business services. This simplifies the development effort, as the development team can use shared services for developing microservices instead of configuring and deploying infrastructure for each service. The shared infrastructure services support multi-tenancy and distinct namespaces for each microservice so that it can manage its state independently.

4.1.3 Adopting Actor Model for Microservices

As discussed above, the actor model offers a light-weight and highly concurrent model for implementing microservice APIs. Actor-based microservices can be coded in any supported high-level language and then compiled into WebAssembly with WASI support. A number of WebAssembly serverless platforms, including Lunatic and wasmCloud, already support the Actor model, while other platforms such as fermyon use HTTP-based request handlers, which are invoked for each request in a way similar to actor-based message passing. For example, here is a sample actor in wasmCloud written in Rust, though any language with wit-bindgen is supported as well:

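// Imports assumed from the wasmCloud actor SDK crates used by its hello-world template.
use wasmbus_rpc::actor::prelude::*;
use wasmcloud_interface_httpserver::{HttpRequest, HttpResponse, HttpServer};

#[derive(Debug, Default, Actor, HealthResponder)]
#[services(Actor, HttpServer)]
struct HelloActor {}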

#[async_trait]
impl HttpServer for HelloActor {
    async fn handle_request(
        &self,
        _ctx: &Context,
        req: &HttpRequest,
    ) -> std::result::Result<HttpResponse, RpcError> {
        let text=form_urlencoded::parse(req.query_string.as_bytes())
            .find(|(n, _)| n == "name")
            .map(|(_, v)| v.to_string())
            .unwrap_or_else(|| "World".to_string());
        Ok(HttpResponse {
            body: format!("Hello {}", text).as_bytes().to_vec(),
            ..Default::default()
        })
    }
}

wasmCloud supports contract-driven design and development (CDD) using wasmCloud interfaces based on the Smithy IDL for building microservices and composable systems. There is also pending work to support OpenFaaS with wasmCloud to invoke functions on capability providers with appropriate privileges.

The following example demonstrates a similar capability with fermyon, which can be deployed to Fermyon Cloud:

use anyhow::Result;
use spin_sdk::{
    http::{Request, Response},
    http_component,
};
#[http_component]
fn hello_rust(req: Request) -> Result<Response> {
    println!("{:?}", req.headers());
    Ok(http::Response::builder()
        .status(200)
        .header("foo", "bar")
        .body(Some("Hello, Fermyon".into()))?)
}

The following example shows how Dapr and WasmEdge work together to support lightweight WebAssembly-based microservices in a cloud-native environment:

fn main() -> std::io::Result<()> {
    let port = std::env::var("PORT").unwrap_or(9005.to_string());
    println!("new connection at {}", port);
    let listener = TcpListener::bind(format!("127.0.0.1:{}", port))?;
    loop {
        let _ = handle_client(listener.accept()?.0);
    }
}

fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
  ... ...
}

fn handle_http(req: Request<Vec<u8>>) -> bytecodec::Result<Response<String>> {
  ... ...
}

WasmEdge can also be used with other serverless platforms such as Vercel, Netlify, AWS Lambda, SecondState and Tencent.

4.1.4 Service Composition with Orchestration and Choreography

As described above, actor-based microservices can be extended with orchestration patterns such as SAGA and with choreography/event-driven architecture patterns such as SEDA to build composable services. These design patterns can be used to build loosely coupled and extensible systems where additional actors and components can be added without changing existing code.
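As a rough illustration of the orchestration side, the following Go sketch runs a saga as an ordered list of steps and, when one step fails, invokes the compensations of the steps that already completed in reverse order. The `Step` type, `RunSaga` function and the step names are hypothetical and only meant to show the control flow; a production orchestrator would also persist saga state and retry compensations.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeclined is a sample failure used by the demo below.
var errDeclined = errors.New("card declined")

// Step pairs an action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func()
}

// RunSaga executes steps in order. On failure it compensates the
// completed steps in reverse order and returns the wrapped error.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].Compensate()
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	var trace []string
	steps := []Step{
		{
			Name:       "reserve-inventory",
			Action:     func() error { trace = append(trace, "reserved"); return nil },
			Compensate: func() { trace = append(trace, "released") },
		},
		{
			Name:       "charge-payment",
			Action:     func() error { return errDeclined },
			Compensate: func() { trace = append(trace, "refunded") },
		},
	}
	fmt.Println(RunSaga(steps))
	fmt.Println(trace)
}
```

New steps can be appended to the list without changing `RunSaga`, which mirrors the extensibility argument made above.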

4.2 Deployment and Runtime

4.2.1 Virtual Machines and Serverless Platform

The following diagram shows the evolution of virtualized environments for hosting applications, services, serverless functions and actors:

Evolution of hosting apps and services

In this architecture, the microservices are compiled into wasm binary, instrumented and then deployed in a micro virtual machine.

4.2.2 Sidecar Proxy and Service Mesh

Although the service code will be instrumented with additional support for error handling, metrics, alerts, monitoring, tracing, etc. before deployment, we can further enforce access policies, rate limiting, key management, etc. using the sidecar proxy or service mesh patterns.

For example, WasmEdge can be integrated with the Dapr service mesh to add building blocks for state management, event bus, orchestration, observability, traffic routing, and bindings to external services. Similarly, wasmCloud can be extended with additional non-functional capabilities by implementing a capability provider. wasmCloud also provides a lattice, a self-healing mesh network that simplifies communication between actors and capability providers.

4.2.3 Cellular Deployment

As described above, Cell-based Architecture (CBA) provides better scalability, modularity, composability and business continuity for building microservices. The following diagram shows how the above design with virtual machines, service mesh and WebAssembly can be extended to support cell-based architecture:

In the above architecture, each cell deploys a set of related microservices for an application that persists state in a replicated data store and communicates with other cells through an event bus. In this model, separate cells are used to access data-plane services and control-plane services for configuration and administration purposes.

5. Conclusion

The transition from monolithic services to a microservices architecture over the last several years has helped development teams be more agile in building modular and reusable code. However, as teams build a growing number of new microservices, they are also tasked with supporting non-functional requirements for each service, such as high availability, capacity planning, scalability, performance, recovery, resilience, security and observability. In addition, each microservice may depend on numerous other microservices, so it becomes susceptible to scaling and availability limits or security breaches in any of its downstream services. Such tight coupling of horizontal concerns escalates the complexity, development effort and operational load for each development team, resulting in a longer time to market for new features and a greater risk of outages due to divergence in how non-functional concerns are implemented. Although serverless computing, function as a service (FaaS) and event-driven compute services have emerged to solve many of these problems, they remain limited in the range of capabilities they offer and lack common standards across vendors. The advancements in micro virtual machines and containers have been a boon to serverless platforms such as appfleet, containerd, Fly.io, Kata, Koyeb, OpenNebula, Qovery, UniK, and Weave FireKube. In addition, the widespread adoption of WebAssembly, along with its component model and the WASI standard, is helping serverless platforms such as extism, faasm, netlify, vercel, wasmcloud and wasmedge build more modular and reusable components. These serverless platforms allow development teams to focus primarily on building business features and offload all non-functional concerns to the underlying platforms.
Many of these serverless platforms also support service mesh and sidecar patterns so that they can bind platform and dependent services and automatically handle concerns such as throttling, security, state management, secrets and key management. Although cell-based architecture is still relatively new and is only supported by the more mature serverless and cloud platforms, it further improves the scalability, modularity, composability, business continuity and governance of microservices. Because each cell is isolated, teams can deploy code changes to a single cell and use canary tests to validate the changes before deploying the code to all cells. Thanks to such isolated deployment, cell-based architecture reduces the blast radius if a bug is found during validation or other production issues are discovered. Finally, automating continuous deployment processes and applying Infrastructure as Code (IaC) can simplify local development and deployment so that developers use the same infrastructure setup for local testing as in the production environment. This means that services can be deployed consistently in any environment, which reduces manual configuration and subtle bugs due to misconfiguration.

In summary, development teams will greatly benefit from the architecture patterns, virtualized environments, WebAssembly and serverless platforms described above, so that application developers are not burdened with maintaining horizontal concerns and can instead focus on building core product features, which will be the differentiating factors in competitive markets. These serverless and managed platforms not only boost developer productivity but also lower the infrastructure cost, operational overhead, cruft development work and the time to market for releasing new features.

October 18, 2022

Mocking and Fuzz Testing Distributed Micro Services with Record/Play, Templates and OpenAPI Specifications

Filed under: GO,REST,Technology — admin @ 11:36 am

Building large distributed systems often requires integrating with multiple distributed microservices, which makes development particularly difficult because it’s not always easy to deploy and test all dependent services in a local environment with constrained resources. In addition, you might be working on a large system with multiple teams where you have received new API specs from another team but the API changes are not available yet. Although you can use mocking frameworks based on API specs when writing unit tests, integration or functional testing requires access to the network service. A common solution that I have used in past projects is to configure a mock service that can simulate different API operations. I wrote a JVM-based mock service many years ago with the following use-cases:

Use-Cases

  • As a service owner, I need to mock remote dependent service(s) by capturing/recording request/responses through an HTTP proxy so that I can play it back when testing the remote service(s) without connecting with them.
  • As a service owner, I need to mock remote dependent service(s) based on an OpenAPI/Swagger specification so that my service client can test all service behavior per the specification for the remote service(s) even when the remote service is not fully implemented or accessible.
  • As a service owner, I need to mock remote dependent service(s) based on a mock scenario defined in a template so that my service client can test service behavior per expected request/response in the template even when remote service is not fully implemented or accessible.
  • As a service owner, I need to inject various response behavior and faults to the output of a remote service so that I can build a robust client that prevents cascading failures and is more resilient to unexpected faults.
  • As a service owner, I need to define test cases with faulty or fuzz responses to test my own service so that I can predict how it will behave with various input data and assert the service response based on expected behavior.

Features

An API mock service for REST/HTTP-based services with the following features:

  • Record API requests/responses by working as an HTTP proxy server (native http/https or via API) between the client and the remote service.
  • Play back API responses that were previously recorded, based on request parameters.
  • Define API behavior manually by specifying request parameters and response contents using static data or dynamic data based on the Go templating language.
  • Generate API behavior from open standards such as OpenAPI/Swagger and automatically create constraints and regex based on the specification.
  • Customize API behavior using the Go template language so that users can generate dynamic contents based on input parameters or other configuration.
  • Generate large responses using the template language with dynamic loops so that you can test performance of your system.
  • Define multiple test scenarios for the API based on different input parameters or simulating various error cases that are difficult to reproduce with real services.
  • Store API request/responses locally as files so that it’s easy to port stubbed request/responses to any machine.
  • Allow users to define API request/response with various formats such as XML/JSON/YAML and upload them to the mock service.
  • Support test fixtures that can be uploaded to the mock service and can be used to generate mock responses.
  • Define a collection of helper methods to generate different kinds of random data such as UDID, dates, URI, Regex, text and numeric data.
  • Play back all test scenarios or a specific scenario and change API behavior dynamically with different input parameters.
  • Support multiple mock scenarios for the same API that can be selected using round-robin order, custom predicates based on parameters, or the scenario name.
  • Inject error conditions and artificial delays so that you can test how your system handles error conditions that are difficult to reproduce, or use them for game days/chaos testing.
  • Generate client requests for a remote API for chaos and stochastic testing, where a set of requests is sent with dynamic data generated from regex or other constraints.

I used this service in many past projects; however, I felt it needed a fresh approach to meet the above goals, so I rewrote it in Go, which has robust support for writing network services. You can download the new version from https://github.com/bhatti/api-mock-service. As it’s written in Go, you can either install the Go runtime environment or use Docker to run it locally. If you haven’t installed Docker, you can download the community version from https://docs.docker.com/engine/installation/ or find the installer for your OS at https://docs.docker.com/get-docker/.

docker build -t api-mock-service .
docker run -p 8000:8080 -p 8081:8081 -e HTTP_PORT=8080 -e PROXY_PORT=8081 \
	-e DATA_DIR=/tmp/mocks -e ASSET_DIR=/tmp/assets api-mock-service

or pull an image from Docker Hub (https://hub.docker.com/r/plexobject/api-mock-service), e.g.

docker pull plexobject/api-mock-service:latest
docker run -p 8000:8080 -p 8081:8081 -e HTTP_PORT=8080 -e PROXY_PORT=8081 -e DATA_DIR=/tmp/mocks \
	-e ASSET_DIR=/tmp/assets plexobject/api-mock-service:latest

Alternatively, you can run it locally with a Go environment, e.g.,

make && ./out/bin/api-mock-service

For the full list of command-line options, execute api-mock-service -h, which shows options such as:

./out/bin/api-mock-service -h
Starts mock service

Usage:
  api-mock-service [flags]
  api-mock-service [command]

Available Commands:
  chaos       Executes chaos client
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  version     Version will output the current build information

Flags:
      --assetDir string   asset dir to store static assets/fixtures
      --config string     config file
      --dataDir string    data dir to store mock scenarios
  -h, --help              help for api-mock-service
      --httpPort int      HTTP port to listen
      --proxyPort int     Proxy port to listen

Recording a Mock Scenario via HTTP/HTTPS Proxy

Once the API mock service is running, it listens on two ports. The first port (default 8080) is used to record and play mock scenarios, update templates and upload OpenAPI specifications. The second port (default 8081) sets up an HTTP/HTTPS proxy server that you can point your client to in order to record your scenarios, e.g.

export http_proxy="http://localhost:8081"
export https_proxy="http://localhost:8081"

curl -k -v -H "Authorization: Bearer sk_test_xxxx" \
	https://api.stripe.com/v1/customers/cus_xxx/cash_balance

The above curl command will automatically record the request and response and create a mock scenario to play it back. For example, if you call the same API again, it will return the local response instead of contacting the server. You can customize the proxy's recording behavior by adding an X-Mock-Record: true header to your request.
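The same proxy setup can be used from application code instead of curl. The following Go sketch configures a standard-library HTTP client to route requests through the proxy port and adds the X-Mock-Record header. The helper names `newRecordingClient` and `newRecordedRequest` are hypothetical, the proxy address assumes the default port mentioned above, and skipping certificate verification mirrors curl's -k flag, so it should only be used for local testing.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newRecordingClient returns a client that sends all requests through
// the mock service's proxy port (hypothetical helper).
func newRecordingClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	transport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
		// Mirrors curl -k: skip certificate checks. Local testing only.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}, nil
}

// newRecordedRequest builds a GET request that asks the proxy to record.
func newRecordedRequest(target, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("X-Mock-Record", "true")
	return req, nil
}

func main() {
	client, err := newRecordingClient("http://localhost:8081")
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	req, err := newRecordedRequest(
		"https://api.stripe.com/v1/customers/cus_xxx/cash_balance", "sk_test_xxxx")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	// Sending the request requires the mock service to be running locally.
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```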

Recording a Mock Scenario via API

Alternatively, you can invoke an internal API as a pass-through to a remote API so that you can automatically record the API behavior and play it back later, e.g.

% curl -H "X-Mock-Url: https://api.stripe.com/v1/customers/cus_**/cash_balance" \
	-H "Authorization: Bearer sk_test_***" http://localhost:8080/_proxy

In the above example, the curl command passes the URL of the real service in the X-Mock-Url HTTP header. In addition, you can pass other authorization headers as needed.
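For clients written in Go, the pass-through call can be sketched as below. The `recordViaAPI` function is a hypothetical helper that sends the request to the /_proxy endpoint with the X-Mock-Url header. To keep the example runnable without the mock service, `newStandIn` starts a local stand-in server that only echoes the target URL; it does not reproduce the real recording behavior.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// recordViaAPI asks the mock service at mockBase to call remoteURL on
// our behalf using the /_proxy endpoint and the X-Mock-Url header.
func recordViaAPI(client *http.Client, mockBase, remoteURL, token string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, mockBase+"/_proxy", nil)
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("X-Mock-Url", remoteURL)
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// newStandIn starts a local stand-in for the mock service that echoes
// the requested target URL (for demonstration only).
func newStandIn() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/_proxy" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintf(w, "would fetch %s", r.Header.Get("X-Mock-Url"))
	}))
}

func main() {
	srv := newStandIn()
	defer srv.Close()
	status, body, err := recordViaAPI(srv.Client(), srv.URL,
		"https://api.example.com/v1/items", "sk_test_xxx")
	fmt.Println(status, body, err)
}
```

Against a running mock service, `mockBase` would be http://localhost:8080 as in the curl example.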

Viewing the Recorded Mock Scenario

The API mock-service will store the request/response in a YAML file under a data directory that you can specify. For example, you may see a file under:

default_mocks_data/v1/customers/cus_***/cash_balance/GET/recorded-scenario-***.scr

Note: the sensitive authentication and customer keys are masked in the above example, but you will see the following contents in the captured data file:

method: GET
name: recorded-v1-customers-cus
path: /v1/customers/cus_**/cash_balance
description: recorded at 2022-10-29 04:26:17.24776 +0000 UTC
request:
     match_query_params: {}
     match_headers: {}
     match_content_type: ""
     match_contents: ""
     example_path_params: {}
     example_query_params: {}
     example_headers:
         Accept: '*/*'
         Authorization: Bearer sk_test_xxx
         User-Agent: curl/7.65.2
         X-Mock-Url: https://api.stripe.com/v1/customers/cus_/cash_balance
     example_contents: ""
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Access-Control-Allow-Methods:
            - GET, POST, HEAD, OPTIONS, DELETE
        Access-Control-Allow-Origin:
            - '*'
        Access-Control-Expose-Headers:
            - Request-Id, Stripe-Manage-Version, X-Stripe-External-Auth-Required, X-Stripe-Privileged-Session-Required
        Access-Control-Max-Age:
            - "300"
        Cache-Control:
            - no-cache, no-store
        Content-Length:
            - "168"
        Content-Type:
            - application/json
        Date:
            - Sat, 29 Oct 2022 04:26:17 GMT
        Request-Id:
            - req_lOP4bCsPIi5hQC
        Server:
            - nginx
        Strict-Transport-Security:
            - max-age=63072000; includeSubDomains; preload
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: |-
        {
          "object": "cash_balance",
          "available": null,
          "customer": "cus_",
          "livemode": false,
          "settings": {
            "reconciliation_mode": "automatic"
          }
        }
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

The above example defines a mock scenario for testing the /v1/customers/cus_**/cash_balance path. A test scenario includes:

Predicate

  • This is a boolean condition for enabling or disabling a test scenario based on dynamic parameters or the request count.

Group

  • This specifies the group for related test scenarios.

Request Matching Parameters:

The matching request parameters will be used to select the mock scenario to execute and you can use regular expressions to validate:

  • URL Query Parameters
  • URL Request Headers
  • Request Body

You can use these parameters so that test scenario is executed only when the parameters match, e.g.

    match_query_params:
      name: "[a-z0-9]{1,50}"
    match_headers:
      Content-Type: "application/json"

The above example will match if the Content-Type header is application/json, and it validates that the name query parameter is alphanumeric with a length of 1 to 50 characters.
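The matching rule can be pictured with a small Go sketch that compiles each configured pattern and tests it against the corresponding request value. The `matches` helper is hypothetical, and anchoring the pattern so that the whole value must match is an assumption of this sketch; the mock service may apply different matching semantics.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// matches reports whether every pattern matches the corresponding
// actual value. Patterns are anchored so the whole value must match,
// which is an assumption of this sketch.
func matches(patterns, actual map[string]string) (bool, error) {
	for key, pattern := range patterns {
		re, err := regexp.Compile("^(?:" + pattern + ")$")
		if err != nil {
			return false, fmt.Errorf("bad pattern for %q: %w", key, err)
		}
		if !re.MatchString(actual[key]) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	query, _ := url.ParseQuery("name=device42")
	params := map[string]string{"name": query.Get("name")}
	headers := map[string]string{"Content-Type": "application/json"}

	queryOK, _ := matches(map[string]string{"name": "[a-z0-9]{1,50}"}, params)
	headerOK, _ := matches(map[string]string{"Content-Type": "application/json"}, headers)
	fmt.Println("scenario selected:", queryOK && headerOK)
}
```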

Example Request Parameters:

The example request parameters show the contents captured from the record/play so that you can use and customize to define matching parameters:

  • URL Query Parameters
  • URL Request Headers
  • Request Body

Response Properties

The response properties will include:

  • Response Headers
  • Response Body, statically defined or loaded from a test fixture file
  • Status Code
  • Matching header and contents
  • Assertions

You can copy a recorded scenario to another folder, use templates to customize it, and then upload it for playback.

The matching header and contents use match_headers and match_contents, similar to the request, to validate the response in case you want to test the response from a real service for chaos testing. Similarly, assertions define a set of predicates to test against the response from a real service:

    assertions:
        - VariableContains contents.id 10
        - VariableContains contents.title illo
        - VariableContains headers.Pragma no-cache 

The above example checks the API response and verifies that the id property contains 10, the title contains illo and the response headers include a Pragma: no-cache header.
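One way to picture how such an assertion could be evaluated is the following Go sketch. It splits the assertion into a predicate name, a dotted path and an expected value, walks the path through a nested map, and checks for a substring. The `lookup` and `check` helpers and the sample response data are hypothetical; this is not the mock service's actual evaluator.

```go
package main

import (
	"fmt"
	"strings"
)

// lookup walks a dotted path such as "contents.id" through nested maps.
func lookup(data map[string]any, path string) (any, bool) {
	var cur any = data
	for _, part := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		next, found := m[part]
		if !found {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

// check evaluates one "VariableContains <path> <expected>" assertion.
func check(assertion string, data map[string]any) (bool, error) {
	fields := strings.Fields(assertion)
	if len(fields) != 3 || fields[0] != "VariableContains" {
		return false, fmt.Errorf("unsupported assertion: %q", assertion)
	}
	val, ok := lookup(data, fields[1])
	if !ok {
		return false, nil
	}
	// Compare on the printed form so numbers and strings both work.
	return strings.Contains(fmt.Sprint(val), fields[2]), nil
}

func main() {
	response := map[string]any{
		"contents": map[string]any{"id": 10, "title": "qui est esse illo"},
		"headers":  map[string]any{"Pragma": "no-cache"},
	}
	for _, a := range []string{
		"VariableContains contents.id 10",
		"VariableContains contents.title illo",
		"VariableContains headers.Pragma no-cache",
	} {
		ok, err := check(a, response)
		fmt.Println(a, "=>", ok, err)
	}
}
```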

Playback the Mock API Scenario

You can play back the recorded response from the above example as follows:

% curl http://localhost:8080/v1/customers/cus_***/cash_balance

which will return the captured response, such as:

{
  "object": "cash_balance",
  "available": null,
  "customer": "cus_***",
  "livemode": false,
  "settings": {
    "reconciliation_mode": "automatic"
  }
}

Although you can customize your template with dynamic properties or conditional logic, you can also send the X-Mock-Response-Status HTTP header to override the HTTP status to return, or the X-Mock-Wait-Before-Reply header to add artificial latency using duration syntax.
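As a sketch of how these two override headers might be interpreted, the Go function below reads them from a request's headers and falls back to the scenario's default status. The `overrides` helper is hypothetical, and it assumes that the duration syntax is the one accepted by Go's time.ParseDuration (for example 500ms, 1s or 1m30s), which seems likely for a service written in Go but is not confirmed here.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// overrides extracts the optional status and latency overrides from
// the request headers, keeping defaultStatus when none is given.
func overrides(h http.Header, defaultStatus int) (int, time.Duration, error) {
	status, wait := defaultStatus, time.Duration(0)
	if v := h.Get("X-Mock-Response-Status"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, 0, fmt.Errorf("invalid status %q: %w", v, err)
		}
		status = n
	}
	if v := h.Get("X-Mock-Wait-Before-Reply"); v != "" {
		// Assumes Go duration syntax such as "500ms" or "1s".
		d, err := time.ParseDuration(v)
		if err != nil {
			return 0, 0, fmt.Errorf("invalid wait %q: %w", v, err)
		}
		wait = d
	}
	return status, wait, nil
}

func main() {
	h := http.Header{}
	h.Set("X-Mock-Response-Status", "503")
	h.Set("X-Mock-Wait-Before-Reply", "250ms")
	status, wait, err := overrides(h, http.StatusOK)
	fmt.Println(status, wait, err)
}
```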

Debug Headers from Playback

The playback request will return mock-headers to indicate the selected mock scenario, path and request count, e.g.

X-Mock-Path: /v1/jobs/{jobId}/state
X-Mock-Request-Count: 13
X-Mock-Scenario: setDefaultState-bfb86eb288c9abf2988822938ef6d4aa3bd654a15e77158b89f17b9319d6f4e4

Upload Mock API Scenario

You can customize the above recorded scenario, e.g. you can add path variables to the above API as follows:

method: GET
name: stripe-cash-balance
path: /v1/customers/:customer/cash_balance
request:
    match_headers:
        Authorization: Bearer sk_test_[0-9a-fA-F]{10}$
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Access-Control-Allow-Methods:
            - GET, POST, HEAD, OPTIONS, DELETE
        Access-Control-Allow-Origin:
            - '*'
        Access-Control-Expose-Headers:
            - Request-Id, Stripe-Manage-Version, X-Stripe-External-Auth-Required, X-Stripe-Privileged-Session-Required
        Access-Control-Max-Age:
            - "300"
        Cache-Control:
            - no-cache, no-store
        Content-Type:
            - application/json
        Request-Id:
            - req_2
        Server:
            - nginx
        Strict-Transport-Security:
            - max-age=63072000; includeSubDomains; preload
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: |-
        {
          "object": "cash_balance",
          "available": null,
          "customer": {{.customer}}
          "livemode": false,
          "page": {{.page}}
          "pageSize": {{.pageSize}}
          "settings": {
            "reconciliation_mode": "automatic"
          }
        }
    status_code: 200
wait_before_reply: 1s

In the above example, I assigned the name stripe-cash-balance to the mock scenario and changed the API path to /v1/customers/:customer/cash_balance so that it can capture the customer-id as a path variable. I added a regular expression to ensure that the HTTP request includes an Authorization header matching Bearer sk_test_[0-9a-fA-F]{10}$ and defined dynamic properties such as {{.customer}}, {{.page}} and {{.pageSize}} so that they will be replaced at runtime.

The mock scenario uses the built-in template syntax of Go. You can then upload it as follows:
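The combination of path variables and template properties can be sketched with Go's standard text/template package. In the sketch below, `matchPath` and `render` are hypothetical helpers: the first captures :name segments from the request path and the second substitutes the captured values into a response body. The query parameters are filled in by hand to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// matchPath compares a pattern like /v1/customers/:customer/cash_balance
// with a request path and returns the captured path variables.
func matchPath(pattern, path string) (map[string]string, bool) {
	pp := strings.Split(strings.Trim(pattern, "/"), "/")
	rp := strings.Split(strings.Trim(path, "/"), "/")
	if len(pp) != len(rp) {
		return nil, false
	}
	vars := map[string]string{}
	for i, seg := range pp {
		if strings.HasPrefix(seg, ":") {
			vars[seg[1:]] = rp[i]
		} else if seg != rp[i] {
			return nil, false
		}
	}
	return vars, true
}

// render fills a response template with path and query parameters.
func render(body string, params map[string]string) (string, error) {
	tmpl, err := template.New("response").Parse(body)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	err = tmpl.Execute(&out, params)
	return out.String(), err
}

func main() {
	params, ok := matchPath("/v1/customers/:customer/cash_balance",
		"/v1/customers/123/cash_balance")
	if !ok {
		fmt.Println("no scenario matched")
		return
	}
	// In a real request these would come from the query string.
	params["page"], params["pageSize"] = "2", "55"
	out, err := render(
		`{"customer": {{.customer}}, "page": {{.page}}, "pageSize": {{.pageSize}}}`, params)
	fmt.Println(out, err)
}
```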

curl -H "Content-Type: application/yaml" --data-binary @fixtures/stripe-customer.yaml \
	http://localhost:8080/_scenarios

and then play it back as follows:

curl -v -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

and it will generate:

< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Mock-Request-Count: 1
< X-Mock-Scenario: stripe-cash-balance
< Request-Id: req_2
< Server: nginx
< Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
< Stripe-Version: 2018-09-06
< Date: Sat, 29 Oct 2022 17:29:12 GMT
< Content-Length: 179
<
{
  "object": "cash_balance",
  "available": null,
  "customer": 123
  "livemode": false,
  "page": 2
  "pageSize": 55
  "settings": {
    "reconciliation_mode": "automatic"
  }
}

As you can see, the values of customer, page and pageSize are dynamically updated, and the response headers include the name of the mock scenario along with the request count. You can upload multiple mock scenarios for the same API and the mock API service will play them back sequentially. For example, you can upload another scenario for the above API as follows:

method: GET
name: stripe-customer-failure
path: /v1/customers/:customer/cash_balance
request:
    match_headers:
        Authorization: Bearer sk_test_[0-9a-fA-F]{10}$
response:
    headers:
        Stripe-Version:
            - "2018-09-06"
    content_type: application/json
    contents: My custom error
    status_code: 500
wait_before_reply: 1s

And then play it back as before:

curl -v -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

which will return the following error response:

> GET /v1/customers/123/cash_balance?page=2&pageSize=55 HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.65.2
> Accept: */*
> Authorization: Bearer sk_test_0123456789
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< X-Mock-Request-Count: 1
< X-Mock-Scenario: stripe-customer-failure
< Stripe-Version: 2018-09-06
< Vary: Origin
< Date: Sat, 29 Oct 2022 17:29:15 GMT
< Content-Length: 15
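The sequential playback of multiple scenarios for the same API can be pictured as a simple round-robin selector. The Go sketch below is hypothetical: `Scenario` is a pared-down stand-in for a mock scenario, and the selector keeps a single counter per API, which may differ from how the mock service tracks request counts.

```go
package main

import "fmt"

// Scenario is a pared-down stand-in for a mock scenario.
type Scenario struct {
	Name       string
	StatusCode int
}

// Selector cycles through the scenarios registered for one API.
type Selector struct {
	scenarios []Scenario
	requests  int
}

// Next returns the scenario for the current request in round-robin
// order, or false when no scenario is registered.
func (s *Selector) Next() (Scenario, bool) {
	if len(s.scenarios) == 0 {
		return Scenario{}, false
	}
	sc := s.scenarios[s.requests%len(s.scenarios)]
	s.requests++
	return sc, true
}

func main() {
	sel := &Selector{scenarios: []Scenario{
		{Name: "stripe-cash-balance", StatusCode: 200},
		{Name: "stripe-customer-failure", StatusCode: 500},
	}}
	for i := 0; i < 4; i++ {
		sc, _ := sel.Next()
		fmt.Println(sc.Name, sc.StatusCode)
	}
}
```

With two scenarios registered, consecutive requests alternate between the successful response and the error response.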

Dynamic Templates with Mock API Scenarios

You can use the loops and conditional primitives of the template language, along with custom functions provided by the API mock library, to generate dynamic responses as follows:

method: GET
name: get_devices
path: /devices
description: ""
request:
    match_content_type: "application/json; charset=utf-8"
response:
    headers:
        "Server":
            - "SampleAPI"
        "Connection":
            - "keep-alive"
    content_type: application/json
    contents: >
     {
     "Devices": [
{{- range $val := Iterate .pageSize }}
      {
        "Udid": "{{SeededUdid $val}}",
        "Line": { {{SeededFileLine "lines.txt" $val}}, "Type": "Public", "IsManaged": false },
        "Amount": {{JSONFileProperty "props.yaml" "amount"}},        
        "SerialNumber": "{{Udid}}",
        "MacAddress": "{{Udid}}",
        "Imei": "{{Udid}}",
        "AssetNumber": "{{RandString 20}}",
        "LocationGroupId": {
         "Id": {
           "Value": {{RandNumMax 1000}},
         },
         "Name": "{{SeededCity $val}}",
         "Udid": "{{Udid}}"
        },
        "DeviceFriendlyName": "Device for {{SeededName $val}}",
        "LastSeen": "{{Time}}",
        "Email": "{{RandEmail}}",
        "Phone": "{{RandPhone}}",        
        "EnrollmentStatus": {{SeededBool $val}}
        "ComplianceStatus": {{RandRegex "^AC[0-9a-fA-F]{32}$"}}
        "Group": {{RandCity}},
        "Date": {{TimeFormat "3:04PM"}},
        "BatteryLevel": "{{RandNumMax 100}}%",
        "StrEnum": {{EnumString "ONE TWO THREE"}},
        "IntEnum": {{EnumInt 10 20 30}},
        "ProcessorArchitecture": {{RandNumMax 1000}},
        "TotalPhysicalMemory": {{RandNumMax 1000000}},
        "VirtualMemory": {{RandNumMax 1000000}},
        "AvailablePhysicalMemory": {{RandNumMax 1000000}},
        "CompromisedStatus": {{RandBool}},
        "Add": {{Add 2 1}},
      }{{if LastIter $val $.pageSize}}{{else}},  {{end}}
{{ end }}
     ],
     "Page": {{.page}},
     "PageSize": {{.pageSize}},
     "Total": {{.pageSize}}
     }
    {{if NthRequest 10 }}
    status_code: {{EnumInt 500 501}}
    {{else}}
    status_code: {{EnumInt 200 400}}
    {{end}}
wait_before_reply: {{.page}}s

The above example includes a number of template primitives and custom functions to generate dynamic contents such as:

Loops

Go templates support loops that can be used to generate multiple data entries in the response, e.g.

{{- range $val := Iterate .pageSize }}
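To show how such a loop behaves, here is a self-contained Go sketch using the standard text/template package. The `Iterate` and `LastIter` functions registered below are simplified stand-ins written for this example, not the mock service's own implementations: `Iterate` returns the indexes 0 to n-1 and `LastIter` reports whether an index is the final one, so that no trailing comma is emitted.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// funcs holds simplified stand-ins for the Iterate and LastIter helpers.
var funcs = template.FuncMap{
	"Iterate": func(n int) []int {
		out := make([]int, n)
		for i := range out {
			out[i] = i
		}
		return out
	},
	"LastIter": func(i, n int) bool { return i == n-1 },
}

// body emits one entry per iteration, with commas between entries only.
const body = `{"Devices": [
{{- range $val := Iterate .pageSize}}
  {"Index": {{$val}}}{{if LastIter $val $.pageSize}}{{else}},{{end}}
{{- end}}
]}
`

// renderDevices executes the template for the given page size.
func renderDevices(pageSize int) (string, error) {
	tmpl, err := template.New("devices").Funcs(funcs).Parse(body)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	err = tmpl.Execute(&out, map[string]int{"pageSize": pageSize})
	return out.String(), err
}

func main() {
	out, err := renderDevices(3)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Note that inside the range block the page size is referenced as $.pageSize, because the dot is rebound to the current element while iterating.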

Builtin functions

GO template supports custom functions that you can add to your templates. The mock service includes a number of helper functions to generate random data such as:

Add numbers

  "Num": "{{Add 1 2}}",

Date/Time

  "LastSeen": "{{Time}}",
  "Date": {{Date}},
  "DateFormatted": {{TimeFormat "3:04PM"}},

Comparison

  {{if EQ .MyVariable 10 }}
  {{if GE .MyVariable 10 }}
  {{if GT .MyVariable 10 }}
  {{if LE .MyVariable 10 }}
  {{if LT .MyVariable 10 }}
  {{if Nth .MyVariable 10 }}

Enums

  "StrEnum": {{EnumString "ONE TWO THREE"}},
  "IntEnum": {{EnumInt 10 20 30}},

Random Data

  "SerialNumber": "{{Udid}}",
  "AssetNumber": "{{RandString 20}}",
  "LastSeen": "{{Time}}",
  "Host": "{{RandHost}}",
  "Email": "{{RandEmail}}",
  "Phone": "{{RandPhone}}",
  "URL": "{{RandURL}}",
  "EnrollmentStatus": {{SeededBool $val}}
  "ComplianceStatus": {{RandRegex "^AC[0-9a-fA-F]{32}$"}}
  "City": {{RandCity}},
  "Country": {{RandCountry}},
  "CountryCode": {{RandCountryCode}},
  "Completed": {{RandBool}},
  "Date": {{TimeFormat "3:04PM"}},
  "BatteryLevel": "{{RandNumMax 100}}%",
  "Object": "{{RandDict}}",
  "IntHistory": {{RandIntArrayMinMax 1 10}},
  "StringHistory": {{RandStringArrayMinMax 1 10}},
  "FirstName": "{{SeededName 1 10}}",
  "LastName": "{{RandName}}",
  "Score": "{{RandNumMinMax 1 100}}",
  "Paragraph": "{{RandParagraph 1 10}}",
  "Word": "{{RandWord 1 1}}",
  "Sentence": "{{RandSentence 1 10}}",
  "Colony": "{{RandString}}",

Request count and Conditional Logic

{{if NthRequest 10 }}   -- for every 10th request
{{if GERequest 10 }}    -- if number of requests made to API so far are >= 10
{{if LTRequest 10 }}    -- if number of requests made to API so far are < 10

The template syntax allows you to define conditional logic such as:

    {{if NthRequest 10 }}
    status_code: {{EnumInt 500 501}}
    {{else}}
    status_code: {{EnumInt 200 400}}
    {{end}}

In the above example, the mock API will return HTTP status 500 or 501 for every 10th request and 200 or 400 for other requests. You can use the conditional syntax to simulate different error statuses or customize the response.
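The request-count condition can be sketched with a small Go program in which a stand-in `NthRequest` function closes over the current request count. The `statusFor` helper is hypothetical, and the fixed status values replace the random choice between two codes to keep the output predictable.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// statusFor renders the status template for the given request count.
// NthRequest here is a simplified stand-in for the service's helper.
func statusFor(requestCount int) (string, error) {
	funcs := template.FuncMap{
		"NthRequest": func(n int) bool { return n > 0 && requestCount%n == 0 },
	}
	tmpl, err := template.New("status").Funcs(funcs).Parse(
		"{{if NthRequest 10}}500{{else}}200{{end}}")
	if err != nil {
		return "", err
	}
	var out strings.Builder
	err = tmpl.Execute(&out, nil)
	return out.String(), err
}

func main() {
	for _, count := range []int{1, 9, 10, 11, 20} {
		status, err := statusFor(count)
		fmt.Println("request", count, "status", status, err)
	}
}
```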

Loops

  {{- range $val := Iterate 10}}

     {{if LastIter $val 10}}{{else}},{{end}}
  {{ end }}

Variables

     {{if VariableContains "contents" "blah"}}
     {{if VariableEquals "contents" "blah"}}
     {{if VariableSizeEQ "contents" "blah"}}
     {{if VariableSizeGE "contents" "blah"}}
     {{if VariableSizeLE "contents" "blah"}}

Test fixtures

The mock service allows you to upload a test fixture that you can refer to in your template, e.g.

  "Line": { {{SeededFileLine "lines.txt" $val}}, "Type": "Public", "IsManaged": false },

The above example loads a random line from the lines.txt fixture. As you may need to generate deterministic random data in some cases, you can use the Seeded functions to generate predictable data so that the service returns the same data. The following example reads a fixture file to load a property from it:

  "Amount": {{JSONFileProperty "props.yaml" "amount"}},

This template file will generate content as follows:

{ "Devices": [
 {
   "Udid": "fe49b338-4593-43c9-b1e9-67581d000000",
   "Line": { "ApplicationName": "Chase", "Version": "3.80", "ApplicationIdentifier": "com.chase.sig.android", "Type": "Public", "IsManaged": false },
   "Amount": {"currency":"$","value":100},
   "SerialNumber": "47c2d7c3-c930-4194-b560-f7b89b33bc2a",
   "MacAddress": "1e015eac-68d2-42ee-9e8f-73fb80958019",
   "Imei": "5f8cae1b-c5e3-4234-a238-1c38d296f73a",
   "AssetNumber": "9z0CZSA03ZbUNiQw2aiF",
   "LocationGroupId": {
    "Id": {
      "Value": 980
    },
    "Name": "Houston",
    "Udid": "3bde6570-c0d4-488f-8407-10f35902cd99"
   },
   "DeviceFriendlyName": "Device for Alexander",
   "LastSeen": "2022-10-29T11:25:25-07:00",
   "Email": "john.smith@abc.com",
   "Phone": "1-408-454-1507",
   "EnrollmentStatus": true,
   "ComplianceStatus": "ACa3E07B0F2cA00d0fbFe88f5c6DbC6a9e",
   "Group": "Chicago",
   "Date": "11:25AM",
   "BatteryLevel": "43%",
   "StrEnum": "ONE",
   "IntEnum": 20,
   "ProcessorArchitecture": 243,
   "TotalPhysicalMemory": 320177,
   "VirtualMemory": 768345,
   "AvailablePhysicalMemory": 596326,
   "CompromisedStatus": false,
   "Add": 3
 },
...
 ], "Page": 2, "PageSize": 55, "Total": 55 }  
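The idea behind the Seeded functions mentioned above, where the same seed always yields the same value, can be sketched in Go by creating a random source from the seed on each call. The `seededCity` function and the list of cities are hypothetical and only illustrate the technique; they are not the mock service's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
)

// cities is sample data for the sketch.
var cities = []string{"Chicago", "Houston", "Seattle", "Boston"}

// seededCity returns the same city for the same seed on every call,
// because the random source is rebuilt from the seed each time.
func seededCity(seed int64) string {
	r := rand.New(rand.NewSource(seed))
	return cities[r.Intn(len(cities))]
}

func main() {
	for _, seed := range []int64{0, 1, 2} {
		// Both calls print the same city for a given seed.
		fmt.Println(seed, seededCity(seed), seededCity(seed))
	}
}
```

Passing the loop index as the seed, as the template does with $val, gives each generated entry stable contents across requests.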

Artificial Delays

You can specify an artificial delay for the API request as follows:

wait_before_reply: {{.page}}s

The above example derives the delay from the page number, but you can use any parameter to customize this behavior.

Conditional Logic

The template syntax allows you to define conditional logic such as:

    {{if NthRequest 10 }}
    status_code: {{AnyInt 500 501}}
    {{else}}
    status_code: {{AnyInt 200 400}}
    {{end}}

In the above example, the mock API will return HTTP status 500 or 501 for every 10th request and 200 or 400 for other requests. You can use conditional syntax to simulate different error statuses or customize the response.
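
To make the behavior of the conditional template more concrete, the following Go sketch renders the same template with Go's text/template package. The NthRequest and AnyInt functions here are simplified stand-ins written for this sketch (AnyInt always returns its first argument so that the output is predictable); they are assumptions and not the mock service's implementations.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderStatus renders the conditional template for a given request count.
// NthRequest and AnyInt are hypothetical stand-ins for the service functions.
func renderStatus(requestCount int) (string, error) {
	funcs := template.FuncMap{
		// NthRequest reports whether this request is a multiple of n.
		"NthRequest": func(n int) bool { return requestCount%n == 0 },
		// AnyInt returns the first value to keep this sketch deterministic;
		// the real function presumably picks one of the values at random.
		"AnyInt": func(vals ...int) int { return vals[0] },
	}
	const src = `{{if NthRequest 10}}status_code: {{AnyInt 500 501}}{{else}}status_code: {{AnyInt 200 400}}{{end}}`
	tmpl, err := template.New("status").Funcs(funcs).Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, n := range []int{9, 10} {
		out, err := renderStatus(n)
		if err != nil {
			panic(err)
		}
		fmt.Printf("request %d -> %s\n", n, out)
	}
}
```

In this sketch the 9th request renders a success status and the 10th renders an error status, which is the pattern a client's retry logic would need to handle.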

Test fixtures

The mock service allows you to upload a test fixture that you can refer to in your template, e.g.

"Line": { {{SeededFileLine "lines.txt" $val}}, "Type": "Public", "IsManaged": false },

The above example loads a line from the lines.txt fixture. Because you may need deterministic random data in some cases, you can use the Seeded functions to generate predictable data so that the service returns the same data for the same seed. The following example reads a property from a fixture file:

"Amount": {{JSONFileProperty "props.yaml" "amount"}},

This template file will generate content as follows:

{ "Devices": [
 {
   "Udid": "fe49b338-4593-43c9-b1e9-67581d000000",
   "Line": { "ApplicationName": "Chase", "Version": "3.80", "ApplicationIdentifier": "com.chase.sig.android", "Type": "Public", "IsManaged": false },
   "Amount": {"currency":"$","value":100},
   "SerialNumber": "47c2d7c3-c930-4194-b560-f7b89b33bc2a",
   "MacAddress": "1e015eac-68d2-42ee-9e8f-73fb80958019",
   "Imei": "5f8cae1b-c5e3-4234-a238-1c38d296f73a",
   "AssetNumber": "9z0CZSA03ZbUNiQw2aiF",
   "LocationGroupId": {
    "Id": {
      "Value": 980
    },
    "Name": "Houston",
    "Udid": "3bde6570-c0d4-488f-8407-10f35902cd99"
   },
   "DeviceFriendlyName": "Device for Alexander",
   "LastSeen": "2022-10-29T11:25:25-07:00",
   "Email": "anthony.christian@abblhfgkpd.edu",
   "Phone": "1-573-993-7542",
   "EnrollmentStatus": true,
   "ComplianceStatus": "ACa3E07B0F2cA00d0fbFe88f5c6DbC6a9e",
   "Group": "Chicago",
   "Date": "11:25AM",
   "BatteryLevel": "43%",
   "StrEnum": "ONE",
   "IntEnum": 20,
   "ProcessorArchitecture": 243,
   "TotalPhysicalMemory": 320177,
   "VirtualMemory": 768345,
   "AvailablePhysicalMemory": 596326,
   "CompromisedStatus": false,
   "Add": 3,
   "Dict": {"one": 1, "three": 3, "two": 2}
 },
...
 ], "Page": 2, "PageSize": 55, "Total": 55 }   

Playing back a specific mock scenario

If you have multiple scenarios for the same API, you can pass the X-Mock-Scenario header to specify the name of the scenario, e.g.

curl -v -H "X-Mock-Scenario: stripe-cash-balance" -H "Authorization: Bearer sk_test_0123456789" \
	"http://localhost:8080/v1/customers/123/cash_balance?page=2&pageSize=55"

You can also override the response status by passing the X-Mock-Response-Status request header, and the delay before the reply by passing the X-Mock-Wait-Before-Reply request header.

Using Test Fixtures

You can define test data in your test fixtures and then upload them as follows:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/lines.txt \
	http://localhost:8080/_fixtures/GET/lines.txt/devices

curl -v -H "Content-Type: application/yaml" --data-binary @fixtures/props.yaml \
    http://localhost:8080/_fixtures/GET/props.yaml/devices

In the above example, the test fixtures lines.txt and props.yaml will be uploaded and will be available for all GET requests under the /devices URL path. You can then refer to these fixtures in your templates. You can also use this to serve binary files, e.g. you can define an image template file as follows:

method: GET
name: test-image
path: /images/mock_image
description: ""
request:
response:
    headers:
      "Last-Modified":
        - {{Time}}
      "ETag":
        - {{RandString 10}}
      "Cache-Control":
        - max-age={{RandNumMinMax 1000 5000}}
    content_type: image/png
    contents_file: mockup.png
    status_code: 200

Then upload a binary image using:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/mockup.png \
	http://localhost:8080/_fixtures/GET/mockup.png/images/mock_image

And then serve the image using:

curl -v "http://localhost:8080/images/mock_image"

Custom Functions

The API mock service defines the following custom functions that can be used to generate test data:

Numeric Random Data

The following functions can be used to generate numeric data within a range, or with a seed to always generate deterministic test data:

  • Random
  • SeededRandom
  • RandNumMinMax
  • RandIntArrayMinMax
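
The difference between the random and seeded variants can be sketched in Go with math/rand. The helpers below are hypothetical and only illustrate the idea that the same seed always produces the same value; they are not the mock service's implementations of SeededRandom or SeededFileLine.

```go
package main

import (
	"fmt"
	"math/rand"
)

// seededRandom returns a pseudo-random number in [0, max) that depends only
// on the seed, so repeated calls with the same seed return the same value.
// Hypothetical stand-in for a seeded function.
func seededRandom(seed int64, max int) int {
	return rand.New(rand.NewSource(seed)).Intn(max)
}

// seededLine deterministically picks a fixture line for a given seed.
func seededLine(lines []string, seed int64) string {
	return lines[seededRandom(seed, len(lines))]
}

func main() {
	lines := []string{"alpha", "beta", "gamma"}
	first := seededLine(lines, 42)
	second := seededLine(lines, 42)
	fmt.Println("same line for same seed:", first == second)
}
```

This is why seeded functions are useful in tests that assert on response contents: the data looks random but is repeatable across runs.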

Text Random Data

The following functions can be used to generate random text data such as strings, words, sentences, names and phone numbers:

  • RandStringMinMax
  • RandStringArrayMinMax
  • RandRegex
  • RandEmail
  • RandPhone
  • RandDict
  • RandCity
  • RandName
  • RandParagraph
  • RandSentence
  • RandString
  • RandWord

Email/Host/URL

  • RandURL
  • RandEmail
  • RandHost

Boolean

The following functions can be used to generate boolean data:

  • RandBool
  • SeededBool

UDID

The following functions can be used to generate UDIDs:

  • Udid
  • SeededUdid

String Enums

The following functions can be used to generate a string from a set of enum values:

  • EnumString

Integer Enums

The following functions can be used to generate an integer from a set of enum values:

  • EnumInt

Random Names

The following functions can be used to generate random names:

  • RandName
  • SeededName

City Names

The following functions can be used to generate random city names:

  • RandCity
  • SeededCity

Country Names or Codes

The following functions can be used to generate random country names or codes:

  • RandCountry
  • SeededCountry
  • RandCountryCode
  • SeededCountryCode

File Fixture

The following functions can be used to generate random data from a fixture file:

  • RandFileLine
  • SeededFileLine
  • FileProperty
  • JSONFileProperty
  • YAMLFileProperty

Generate Mock API Behavior from OpenAPI or Swagger Specifications

If you are using OpenAPI or Swagger for API specifications, you can simply upload a YAML-based API specification. For example, here is a sample OpenAPI specification from Twilio:

openapi: 3.0.1
paths:
  /v1/AuthTokens/Promote:
    servers:
    - url: https://accounts.twilio.com
    description: Auth Token promotion
    x-twilio:
      defaultOutputProperties:
      - account_sid
      - auth_token
      - date_created
      pathType: instance
      mountName: auth_token_promotion
    post:
      description: Promote the secondary Auth Token to primary. After promoting the
        new token, all requests to Twilio using your old primary Auth Token will result
        in an error.
      responses:
        '200':
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/accounts.v1.auth_token_promotion'
          description: OK
      security:

...


   schemas:
     accounts.v1.auth_token_promotion:
       type: object
       properties:
         account_sid:
           type: string
           minLength: 34
           maxLength: 34
           pattern: ^AC[0-9a-fA-F]{32}$
           nullable: true
           description: The SID of the Account that the secondary Auth Token was created
             for
         auth_token:
           type: string
           nullable: true
           description: The promoted Auth Token
         date_created:
           type: string
           format: date-time
           nullable: true
           description: The ISO 8601 formatted date and time in UTC when the resource
             was created
         date_updated:
           type: string
           format: date-time
           nullable: true
           description: The ISO 8601 formatted date and time in UTC when the resource
             was last updated
         url:
           type: string
           format: uri
           nullable: true
           description: The URI for this resource, relative to `https://accounts.twilio.com`
...           

You can then upload the API specification as:

curl -H "Content-Type: application/yaml" --data-binary @fixtures/oapi/twilio_accounts_v1.yaml \
		http://localhost:8080/_oapi

This will generate mock scenarios for each API based on the mime-type, status code, parameter formats, regexes and data ranges, e.g.,

name: UpdateAuthTokenPromotion-xx
path: /v1/AuthTokens/Promote
description: Promote the secondary Auth Token to primary. After promoting the new token, all requests to Twilio using your old primary Auth Token will result in an error.
request:
    match_query_params: {}
    match_headers: {}
    match_content_type: ""
    match_contents: ""
    example_path_params: {}
    example_query_params: {}
    example_headers: {}
    example_contents: ""
response:
    headers: {}
    content_type: application/json
    contents: '{"account_sid":"{{RandRegex `^AC[0-9a-fA-F]{32}$`}}",\
    "auth_token":"{{RandStringMinMax 0 0}}","date_created":"{{Time}}",\
    "date_updated":"{{Time}}","url":"https://{{RandName}}.com"}'
    contents_file: ""
    status_code: 200
wait_before_reply: 0s

In the above example, account_sid is generated from the regex pattern and url is generated from the URI format. You can then invoke the mock API as:

curl -v -X POST http://localhost:8080/v1/AuthTokens/Promote

This will generate a dynamic response as follows:

{
  "account_sid": "ACF3A7ea7f5c90f6482CEcA77BED07Fb91",
  "auth_token": "PaC7rKdGER73rXNi6rVKZMN1Jw0QYxPFeEkqyvnM7Ojw2nziOER7SMWkIV6N2hXYTKxAfDMfS9t0",
  "date_created": "2022-10-29T11:54:46-07:00",
  "date_updated": "2022-10-29T11:54:46-07:00",
  "url": "https://Billy.com"
}
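
Since the response is generated from the specification, a client-side test can check that it honors the same constraints. The following Go sketch is one way to do that; validatePromotion and the struct are hypothetical names for this example, and it assumes the timestamps are RFC 3339 strings as in the sample response above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"time"
)

// authTokenPromotion mirrors the fields of the sample response above.
type authTokenPromotion struct {
	AccountSid  string `json:"account_sid"`
	AuthToken   string `json:"auth_token"`
	DateCreated string `json:"date_created"`
	URL         string `json:"url"`
}

// accountSidPattern is the pattern from the OpenAPI schema for account_sid.
var accountSidPattern = regexp.MustCompile(`^AC[0-9a-fA-F]{32}$`)

// validatePromotion checks a response body against the schema constraints.
// Hypothetical helper written for this sketch.
func validatePromotion(body []byte) error {
	var p authTokenPromotion
	if err := json.Unmarshal(body, &p); err != nil {
		return err
	}
	if !accountSidPattern.MatchString(p.AccountSid) {
		return fmt.Errorf("account_sid %q does not match the pattern", p.AccountSid)
	}
	if _, err := time.Parse(time.RFC3339, p.DateCreated); err != nil {
		return fmt.Errorf("date_created is not a date-time: %w", err)
	}
	return nil
}

func main() {
	body := []byte(`{"account_sid":"ACF3A7ea7f5c90f6482CEcA77BED07Fb91",
		"auth_token":"token","date_created":"2022-10-29T11:54:46-07:00",
		"date_updated":"2022-10-29T11:54:46-07:00","url":"https://Billy.com"}`)
	if err := validatePromotion(body); err != nil {
		fmt.Println("invalid response:", err)
		return
	}
	fmt.Println("response matches the schema constraints")
}
```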

Listing all Mock Scenarios

You can list all available mock APIs using:

curl -v http://localhost:8080/_scenarios

This will return a summary of the APIs such as:

{
  "/_scenarios/GET/FetchCredentialAws-8b2fcf02dfb7dc190fb735a469e1bbaa3ccb5fd1a24726976d110374b13403c6/v1/Credentials/AWS/{Sid}": {
    "method": "GET",
    "name": "FetchCredentialAws-8b2fcf02dfb7dc190fb735a469e1bbaa3ccb5fd1a24726976d110374b13403c6",
    "path": "/v1/Credentials/AWS/{Sid}",
    "match_query_params": {},
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
  "/_scenarios/GET/FetchCredentialPublicKey-60a01dcea5290e6d429ce604c7acf5bd59606045fc32c0bc835e57ac2b1b8eb6/v1/Credentials/PublicKeys/{Sid}": {
    "method": "GET",
    "name": "FetchCredentialPublicKey-60a01dcea5290e6d429ce604c7acf5bd59606045fc32c0bc835e57ac2b1b8eb6",
    "path": "/v1/Credentials/PublicKeys/{Sid}",
    "match_query_params": {},
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
  "/_scenarios/GET/ListCredentialAws-28717701f05de4374a09ec002066d308043e73e30f25fec2dcd4c3d3c001d300/v1/Credentials/AWS": {
    "method": "GET",
    "name": "ListCredentialAws-28717701f05de4374a09ec002066d308043e73e30f25fec2dcd4c3d3c001d300",
    "path": "/v1/Credentials/AWS",
    "match_query_params": {
      "PageSize": "\\d+"
    },
    "match_headers": {},
    "match_content_type": "",
    "match_contents": "",
    "LastUsageTime": 0,
    "RequestCount": 0
  },
...  

Chaos Testing

In addition to serving mock APIs, you can also use the built-in chaos client for stochastic testing of remote services by generating random data based on regexes or API specifications. For example, you may capture a test scenario for a remote API using the HTTP proxy as follows:

export http_proxy="http://localhost:8081"
export https_proxy="http://localhost:8081"
curl -k https://jsonplaceholder.typicode.com/todos

This will capture a mock scenario such as:

method: GET
name: recorded-todos-ff9a8e133347f7f05273f15394f722a9bcc68bb0e734af05ba3dd98a6f2248d1
path: /todos
description: recorded at 2022-12-12 02:23:42.845176 +0000 UTC for https://jsonplaceholder.typicode.com:443/todos
group: todos
predicate: ""
request:
    match_query_params: {}
    match_headers:
        Content-Type: ""
    match_contents: '{}'
    example_path_params: {}
    example_query_params: {}
    example_headers:
        Accept: '*/*'
        User-Agent: curl/7.65.2
    example_contents: ""
response:
    headers:
        Access-Control-Allow-Credentials:
            - "true"
        Age:
            - "19075"
        Alt-Svc:
            - h3=":443"; ma=86400, h3-29=":443"; ma=86400
        Cache-Control:
            - max-age=43200
        Cf-Cache-Status:
            - HIT
        Cf-Ray:
            - 7782ffe4bd6bc62c-SEA
        Connection:
            - keep-alive
        Content-Type:
            - application/json; charset=utf-8
        Date:
            - Mon, 12 Dec 2022 02:23:42 GMT
        Etag:
            - W/"5ef7-4Ad6/n39KWY9q6Ykm/ULNQ2F5IM"
        Expires:
            - "-1"
        Nel:
            - '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}'
        Pragma:
            - no-cache
    contents: |-
      [
        {
          "userId": 1,
          "id": 1,
          "title": "delectus aut autem",
          "completed": false
        },
        {
          "userId": 1,
          "id": 2,
          "title": "quis ut nam facilis et officia qui",
          "completed": false
        },
      ...
        ]
    contents_file: ""
    status_code: 200
    match_headers: {}
    match_contents: '{"completed":"__string__.+","id":"(__number__[+-]?[0-9]{1,10})","title":"(__string__\\w+)","userId":"(__number__[+-]?[0-9]{1,10})"}'
    assertions: []

You can then customize this scenario with additional assertions, and you may remove the response contents as they won’t be used. Note that the above scenario is defined with the group todos. You can then submit a request for chaos testing as follows:

curl -k -v -X POST http://localhost:8080/_chaos/todos -d '{"base_url": "https://jsonplaceholder.typicode.com", "execution_times": 10}'

The above request will send 10 requests to the todo server with random data and return a response such as:

{"errors":null,"failed":0,"succeeded":10}
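
If you trigger the chaos run from a Go test or a CI job, you can decode this summary and fail the build when any request fails. The sketch below assumes only the three fields shown in the response above; chaosResult and checkChaosResult are hypothetical names, and the errors field is decoded loosely because its shape is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chaosResult mirrors the summary returned by the chaos endpoint.
type chaosResult struct {
	Errors    interface{} `json:"errors"` // shape not shown in the post, so decoded loosely
	Failed    int         `json:"failed"`
	Succeeded int         `json:"succeeded"`
}

// checkChaosResult returns an error unless all expected requests succeeded.
// Hypothetical helper written for this sketch.
func checkChaosResult(body []byte, expected int) error {
	var r chaosResult
	if err := json.Unmarshal(body, &r); err != nil {
		return err
	}
	if r.Failed > 0 || r.Succeeded != expected {
		return fmt.Errorf("chaos run: %d succeeded, %d failed, errors: %v",
			r.Succeeded, r.Failed, r.Errors)
	}
	return nil
}

func main() {
	body := []byte(`{"errors":null,"failed":0,"succeeded":10}`)
	if err := checkChaosResult(body, 10); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("all chaos requests succeeded")
}
```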

If you have locally captured data, you can also run the chaos client from the command line without running the mock server, e.g.:

go run main.go chaos --base_url https://jsonplaceholder.typicode.com --group todos --times 10

Static Assets

The mock service can serve static assets from a user-defined folder as follows:

cp static-file default_assets

# execute the API mock server
make && ./out/bin/api-mock-service

# access assets
curl http://localhost:8080/_assets/default_assets

API Reference

The API specification for the mock library defines details for managing mock scenarios and customizing the mocking behavior.

Summary

Building and testing distributed systems often requires deploying a deep stack of dependent services, which makes development hard in a local environment with limited resources. Ideally, you should be able to deploy and test the entire stack without using the network or requiring remote access so that you can spend more time on building features instead of configuring your local environment. The above examples show how you can use https://github.com/bhatti/api-mock-service to mock APIs for testing purposes and define test scenarios for simulating both happy and error cases, as well as inject faults or network delays in your testing processes so that you can test for fault tolerance. This mock library can be used to define the API mock behavior using record/play, the template language or API specification standards. I have found tools like this very useful when developing microservices, and hopefully you will find it useful too. Feel free to connect with your feedback or suggestions.
