Distributed systems inherently involve multiple components such as services, databases, networks, etc., which are spread across different machines or locations. These systems are prone to partial failures, where one part of the system may fail while others remain operational. A common strategy for building fault-tolerant and resilient systems is to recover from transient failures by retrying failed operations. Here are some common use cases for implementing retries to maintain reliability in such environments:
Recover from Transient Failures such as network glitches, dropped packets, or temporary unavailability of services. These failures are often short-lived, and a simple retry may succeed without any changes to the underlying system.
Recover from Network Instability, where packet loss, latency, congestion, or intermittent connectivity can disrupt communication between services.
Recover from Load Shedding or Throttling where services may experience momentary overloads and are unable to handle incoming requests.
Asynchronous Processing or Eventual Consistency models may take time to converge state across different nodes or services and operations might fail temporarily if the system is in an intermediate state.
Fault Isolation in microservices architectures, where services are loosely coupled but depend on one another. The downstream services may fail temporarily due to a service restart, deployment or scaling activities.
Service Downtime affects the availability of services, but client applications can use retries to recover from minor faults and maintain availability.
Load Balancing and Failover with redundant Zones/Regions, so that a request that fails in one zone or region can be handled by another healthy zone or region.
Partial Failures, where one part of the system fails while the rest remains functional.
Build System Resilience to allow the system to self-heal from minor disruptions.
Race Conditions or timing-related issues in concurrent systems can sometimes be resolved by retrying the operation.
Challenges with Retries
Retries help in recovering from transient or partial failures by resending requests, but they can worsen system overloads if not managed carefully. Here are some challenges associated with retries:
Retry Storms: A retry storm occurs when multiple clients or services simultaneously retry failed requests to an overloaded or recovering service. This flood of retries can exacerbate the problem and can lead to performance degradation or a self-inflicted Denial of Service (DoS) attack.
Idempotency and Data Consistency: Some operations are not idempotent and performing them multiple times can lead to inconsistent or incorrect results (e.g., processing a financial transaction multiple times).
Cascading Failures: Retrying can propagate failures upstream or to dependent services. For instance, when a service fails and clients retry excessively, the added load can overwhelm downstream services.
Latency Amplification: Retrying failed operations can increase end-to-end latency, as each retry adds a delay before successful resolution.
Amplified Resource Consumption: Retried operations consume additional CPU, memory, and bandwidth, potentially depleting resources at a faster rate. Even when services eventually succeed, the increased load from retries can harm the overall system.
Retry Loops or Infinite Retries: If a failed operation is retried continuously without ever succeeding, it can potentially lead to system crashes.
Thread and connection starvation: When a service invokes multiple operations and some fail, it may retry all operations, leading to increased overall request latency. If high timeouts are set, threads and connections remain occupied, blocking new traffic.
Unnecessary Retries on Non-Retryable Failures: Retrying certain types of failures, like authorization errors or malformed requests, is unnecessary and wastes system resources.
Timeout Mismatch Between Services: If the timeout settings for retries between services are not aligned, a downstream service may still be processing a request while the upstream service retries or times out, which can result in conflicting states.
Considerations for Retries
Here are some key considerations and best practices for implementing more effective and safer retry mechanisms in distributed systems, enhancing resilience while safeguarding system stability during periods of stress or failure:
Timeouts: Implement timeouts to prevent clients from waiting indefinitely for a response and reduce resource exhaustion (e.g., memory or threads) caused by prolonged waiting. The challenge lies in selecting the appropriate timeout value: if set too high, resources are wasted; if set too low, it can trigger excessive retries, which increases the risk of outages. It’s recommended to set timeouts that are tightly aligned with performance expectations, ideally less than 2-times your maximum response time to avoid thread starvation. Additionally, monitor for early warning signs by setting alarms when performance degrades (e.g., when P99 latency approaches 50% of the timeout value).
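As a concrete illustration, here is a minimal Go sketch of bounding an outbound call with a timeout via context; the 2-second budget and the URL are illustrative assumptions and should be derived from your measured latencies:

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// fetchWithTimeout bounds an outbound call so the caller never waits indefinitely.
func fetchWithTimeout(ctx context.Context, url string) (*http.Response, error) {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // assumed per-call budget
    defer cancel()
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}

func main() {
    resp, err := fetchWithTimeout(context.Background(), "https://example.com/health")
    if err != nil {
        fmt.Println("request failed or timed out:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}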
Timeout Budgeting: In complex distributed systems, timeout budgeting ensures that the total time taken by a request across multiple services doesn’t exceed an acceptable limit. Each downstream service gets a portion of the total timeout, so failure in one service doesn’t excessively delay the entire request chain.
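For example, a sketch of timeout budgeting in Go might derive each downstream call's deadline from the caller's remaining budget; the 50% split and the one-second fallback are assumed policies to tune per call chain:

package budget // illustrative package name

import (
    "context"
    "time"
)

// callWithBudget gives a downstream call only half of the caller's remaining
// deadline so a single slow dependency cannot consume the entire request budget.
func callWithBudget(ctx context.Context, call func(context.Context) error) error {
    deadline, ok := ctx.Deadline()
    if !ok {
        // The caller set no overall budget; fall back to a conservative default.
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, time.Second)
        defer cancel()
        return call(ctx)
    }
    subCtx, cancel := context.WithTimeout(ctx, time.Until(deadline)/2)
    defer cancel()
    return call(subCtx)
}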
Exponential Backoff: Implement exponential backoff to spread out retry attempts by gradually increasing the delay between retries, reducing the risk of overwhelming a failing component and allowing time for recovery. It’s important to cap the backoff duration and limit the total number of retries. Without these limits, the system might continue retrying unnecessarily even after the underlying issue has been resolved.
Jitter: Adding randomness (jitter) to the backoff process helps prevent synchronized retries that could lead to overload spikes. Jitter is useful for spreading out traffic spikes and periodic tasks to avoid large bursts of traffic at regular intervals for improving system stability.
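The following Go sketch combines capped exponential backoff with full jitter, where each delay is a random duration up to the current ceiling; the base delay, cap, and attempt count are illustrative assumptions to tune per workload:

package retry // illustrative package name

import (
    "context"
    "math/rand"
    "time"
)

// doWithBackoff retries op with capped exponential backoff and "full jitter".
func doWithBackoff(ctx context.Context, maxAttempts int, op func() error) error {
    const (
        baseDelay = 100 * time.Millisecond
        maxDelay  = 5 * time.Second
    )
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        // Exponentially growing ceiling, capped so waits stay bounded.
        ceiling := baseDelay << attempt
        if ceiling <= 0 || ceiling > maxDelay {
            ceiling = maxDelay
        }
        sleep := time.Duration(rand.Int63n(int64(ceiling)))
        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return ctx.Err() // stop retrying if the caller has given up
        }
    }
    return err // surface the last error after exhausting attempts
}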
Idempotency: Operations that are retried must be idempotent, meaning they can be safely repeated without causing unintended side effects (e.g., double payments or duplicated data).
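A common way to make retried writes safe is to send a client-generated idempotency key that the server uses to deduplicate repeated requests. The Go sketch below assumes a hypothetical payments endpoint and an Idempotency-Key header; the exact header name and semantics depend on the API you call:

package payments // hypothetical example package

import (
    "net/http"
    "strings"
)

// createPayment sends a charge request with a caller-supplied idempotency key so
// that a retried request is deduplicated server-side instead of charging twice.
func createPayment(client *http.Client, url, idempotencyKey, body string) (*http.Response, error) {
    req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("Idempotency-Key", idempotencyKey) // reuse the same key on every retry
    return client.Do(req)
}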
Retry Limits: Retries should be capped at a certain limit to avoid endlessly retrying a failing operation. Retries should stop beyond a certain number of attempts and the failure should be escalated or reported.
Throttling and Rate Limiting: Implement throttling or rate limiting to control the number of requests a service handles within a given time period. Rate limiting can be dynamic, adjusted based on current load or error rates, to avoid system overloads during traffic spikes. In addition, low-priority requests can be shed during high load situations.
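As an example, Go's golang.org/x/time/rate package implements a token-bucket limiter; the rate and burst values below are illustrative and should reflect measured capacity:

package main

import (
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // Token-bucket limiter: a steady 100 requests/second with bursts of up to 20.
    limiter := rate.NewLimiter(rate.Limit(100), 20)

    for i := 0; i < 5; i++ {
        if limiter.Allow() {
            fmt.Println("request", i, "accepted")
        } else {
            // Shed or queue the request instead of passing the overload downstream.
            fmt.Println("request", i, "throttled")
        }
        time.Sleep(5 * time.Millisecond)
    }
}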
Error Categorization: Not all errors should trigger retries; use an allowlist of known retryable errors and only retry those. For example, a 400 Bad Request (indicating a permanent client error) due to invalid input should not be retried, while server-side or network-related errors such as a 500 Internal Server Error (a likely transient issue) can benefit from retrying.
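A simple way to enforce this is an allowlist of retryable HTTP status codes, as in the Go sketch below; the exact list is an assumption and should match the semantics of your dependencies (overload signals such as 429 or 503 may deserve longer backoff or no retry at all, depending on your policy):

package retry // illustrative package name

import "net/http"

// retryableStatus is an allowlist of HTTP status codes that usually indicate a
// transient condition worth retrying; everything else fails fast.
var retryableStatus = map[int]bool{
    http.StatusRequestTimeout:     true, // 408
    http.StatusTooManyRequests:    true, // 429 (respect Retry-After if present)
    http.StatusBadGateway:         true, // 502
    http.StatusServiceUnavailable: true, // 503
    http.StatusGatewayTimeout:     true, // 504
}

func isRetryable(statusCode int) bool {
    return retryableStatus[statusCode]
}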
Targeting Failing Components Only: In a partial failure, not all parts of the system are down, and retries help isolate and recover from the failing components by retrying only the operations that target the failed resource. For example, if a service depends on multiple microservices for an operation and one of those services fails, the system should retry the failed request without repeating the entire operation.
Intelligent and Adaptive Retries: Design retry logic to take the system’s current state into account, such as checking service health or load conditions before retrying. For example, increase retry intervals if multiple components are detected as failing, or retry quickly for timeout errors but back off more for connection errors. This prevents retries when the system is already known to be overloaded.
Retrying at Different Levels: Retries can be implemented at various levels to handle partial failures, such as the application level, middleware/proxy (load balancer or API gateway), and transport level (network). For example, a load balancer can detect that a specific instance of a service is failing and reroute traffic to a healthy instance, triggering retries only for the requests that targeted the failing instance.
Retry Amplification: In multi-tiered architectures, if retries are implemented at each level of nested service calls, it can lead to increased latency and exponentially higher traffic. To mitigate this, implement retries only at critical points in the call chain, and ensure that each service has a clear retry policy with limits. Use short timeouts to prevent thread starvation when calls to downstream services take too long. If too many threads hang, new traffic will be blocked.
Retry Budget: Implementing a global limit on the number of retries across all operations helps prevent system overload. For example, using an algorithm like Leaky Bucket can regulate the number of retries within a specified time period. This ensures that retries are distributed evenly and don’t exceed system capacity, preventing resource exhaustion during high failure rates.
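A minimal sketch of a process-wide retry budget in Go is shown below: every retry spends a token from a shared bucket that refills at a fixed rate, so retries are capped globally rather than per call. The capacity and refill rate are assumptions to tune for your traffic:

package retry // illustrative package name

import (
    "sync"
    "time"
)

// RetryBudget is a process-wide token bucket shared by all callers: each retry
// spends one token, and tokens refill at a fixed rate. When the bucket is empty,
// retries are skipped and the original error is surfaced.
type RetryBudget struct {
    mu         sync.Mutex
    tokens     float64
    capacity   float64
    refillRate float64 // tokens added per second
    last       time.Time
}

func NewRetryBudget(capacity, refillPerSecond float64) *RetryBudget {
    return &RetryBudget{tokens: capacity, capacity: capacity, refillRate: refillPerSecond, last: time.Now()}
}

// AllowRetry reports whether a retry may proceed, consuming one token if so.
func (b *RetryBudget) AllowRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    now := time.Now()
    b.tokens += now.Sub(b.last).Seconds() * b.refillRate
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.last = now
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}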
Retries with Circuit Breakers: The circuit breaker pattern can be combined with retries to avoid overwhelming a failing component. When a service starts failing, the circuit breaker opens, temporarily halting requests to that service until it is healthy again. Retries can be configured to happen only after the circuit breaker transitions to a half-open state, which allows a limited number of retries to test if the service has recovered.
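Below is a minimal Go sketch of the circuit breaker pattern, tripping on consecutive failures and allowing trial calls again after a cooldown; the thresholds are assumptions, and production implementations (or libraries) typically add failure-rate windows and stricter half-open accounting:

package circuit // illustrative package name

import (
    "errors"
    "sync"
    "time"
)

var ErrOpen = errors.New("circuit breaker is open")

// Breaker trips after maxFailures consecutive failures and rejects calls until
// the cooldown elapses; it then lets calls through again (half-open) and closes
// only after a success.
type Breaker struct {
    mu          sync.Mutex
    failures    int
    maxFailures int
    cooldown    time.Duration
    openedAt    time.Time
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
    return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

func (b *Breaker) Call(op func() error) error {
    b.mu.Lock()
    if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
        b.mu.Unlock()
        return ErrOpen // open: fail fast without touching the dependency
    }
    b.mu.Unlock()

    err := op() // closed or half-open: let the call through
    b.mu.Lock()
    defer b.mu.Unlock()
    if err != nil {
        b.failures++
        if b.failures >= b.maxFailures {
            b.openedAt = time.Now() // trip (or re-trip) the breaker
        }
        return err
    }
    b.failures = 0 // a success closes the breaker
    return nil
}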
Retries with Failover Mechanisms: Retries can be designed with failover strategies where the system switches to a backup service, region, or replica in case of partial failure. If a service in one region fails, retries can redirect requests to a different region or zone to ensure availability.
Latency Sensitivity: Services with strict latency requirements might not tolerate long backoff periods or extended retries, so they should minimize the number of retries and cap backoff times.
Sync Calls: For synchronous calls, retry once immediately to handle temporary network issues and avoid multiple retries that could lead to thread starvation. Avoid excessive sleeping of threads between retries, which can lead to thread starvation. Also, a Circuit Breaker can be used to prevent retrying if a high percentage of calls fail.
Async Calls: Use exponential backoff with jitter for asynchronous operations and use Circuit Breakers to stop retries when failure rates are high. Asynchronous APIs can queue requests for later retries, but should incorporate health checks to ensure that retry attempts don’t add excessive load to downstream services during recovery periods.
Retrying on Overload Responses: Recognize overload indicators (e.g., HTTP 503 responses) and avoid retries when the response indicates overload.
Fail-Fast: Detect issues early and fail quickly rather than continuing to process requests or operations that are unlikely to succeed.
Graceful Degradation: Provide an alternative method of handling requests when a service fails. For example, if a primary service is down, a cached result or a simpler backup service can be used instead.
Downstream Bugs: Rather than implementing retry-based workarounds, prioritize having downstream service owners address and resolve the underlying issues.
Monitor and Analyze Retry Patterns: Implement monitoring for retry attempts and success rates, and analyze the data to gain insights into system behavior during failures. Use these insights to optimize retry strategies, such as adjusting backoff intervals and fine-tuning timeouts for improved system performance.
SLAs with Downstream Services: Establish clear service-level agreements (SLAs) with downstream services about call frequency, failure rates, and latency expectations.
Availability Over Consistency: Prioritize service availability over consistency where possible, especially during retries or failure handling. In such cases, retries might return stale data or cause inconsistency issues, so it’s crucial to align retry policies with system design.
Chaos Engineering: Chaos engineering involves intentionally injecting failures, such as server crashes or network disruptions, into a system to test its resilience under adverse conditions. By simulating real-world failures, teams can identify weaknesses and ensure that the retry policies are working as expected.
Bulkhead Pattern: The bulkhead pattern isolates different parts of a system to prevent a failure in one part from affecting the rest of the system. The bulkheads can be implemented by limiting the number of resources (threads, memory, connections) allocated to each service or subsystem so that if one service becomes overloaded or fails, it won’t exhaust resources that other services need.
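A bulkhead can be as simple as a per-dependency concurrency cap. The Go sketch below uses a buffered-channel semaphore; the capacity is an assumption sized per dependency:

package bulkhead // illustrative package name

import "context"

// Bulkhead caps the number of concurrent calls to a single dependency so a slow
// or failing service cannot exhaust threads and connections shared with other work.
type Bulkhead struct {
    slots chan struct{}
}

func New(maxConcurrent int) *Bulkhead {
    return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

func (b *Bulkhead) Do(ctx context.Context, op func() error) error {
    select {
    case b.slots <- struct{}{}: // acquire a slot
        defer func() { <-b.slots }() // release it when done
        return op()
    case <-ctx.Done(): // give up if the caller's deadline expires first
        return ctx.Err()
    }
}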
System Design: It’s essential to design APIs to minimize unnecessary communication with the server. For instance, in an event-driven architecture, if an event is missing a required attribute, the application might need to make additional requests to retrieve that data, increasing system load. To avoid this, ensure that events are fully populated with all necessary information upfront.
Summary
Retries are an essential mechanism for building fault-tolerant distributed systems that recover from transient failures such as network issues, service unavailability, and partial system outages. A well-implemented retry strategy improves system resilience by ensuring that temporary failures don’t lead to full-blown outages. Techniques such as exponential backoff with jitter, idempotency, token buckets to limit retries locally, and circuit breakers help manage retries effectively, preventing issues like retry storms, resource exhaustion, and latency amplification.
However, retries need careful management because, without proper limits, they can overwhelm services that are already struggling or exacerbate issues like cascading failures and thread starvation. Incorporating timeouts, retry limits, and adaptive retry mechanisms based on system health can prevent these negative side effects. By analyzing retry patterns and adopting error-specific handling strategies, distributed systems can strike a balance between availability and resource efficiency, ensuring robust performance even in the face of partial failures.
Building and maintaining distributed systems is challenging due to the complex intricacies of production environments, configuration differences, data and traffic scaling, dependencies on third-party services, and unpredictable usage patterns. These factors can lead to outages, security breaches, performance degradation, data inconsistencies, and other operational issues that may negatively impact customers [See Architecture Patterns and Well-Architected Framework]. These risks can be mitigated through phased rollouts with canary releases, feature flags for controlled feature activation, and comprehensive observability through monitoring, logging, and tracing. Additionally, rigorous scalability testing, including load and chaos testing, and proactive security testing are necessary to identify and address potential vulnerabilities. The use of blue/green deployments and the ability to quickly roll back changes further enhance the resilience of your system. Beyond these strategies, fostering a DevOps culture that emphasizes collaboration between development, operations, and security teams is vital. The following checklist serves as a guide to verify critical areas that may go awry when deploying code to production, helping teams navigate the inherent challenges of distributed systems.
Build Pipelines
Separate Pipelines: Create distinct CI/CD pipelines for each microservice, including infrastructure changes managed through IaC (Infrastructure as Code). Also, set up a separate pipeline for config changes such as throttling limits or access policies.
Securing and Managing Dependencies: Identify and address deprecated and vulnerable dependencies during the build process and ensure third party dependencies are vetted and hosted internally.
Build Failures: Verify build pipelines with a comprehensive suite of unit and integration tests, and promptly resolve any flaky tests caused by concurrency, networking, or other issues.
Automatic Rollback: Automatically roll back changes if sanity tests or alarm metrics fail during the build process.
Phased Deployments: Deploy new changes in phases gradually across multiple data centers using canary testing with adequate baking period to validate functional and non-functional behavior. Immediately roll back and halt further deployments if error rates exceed acceptable thresholds [See Mitigate Production Risks with Phased Deployment].
Avoid Risky Deployments: Deploy changes during regular office hours to ensure any issues can be promptly addressed. Avoid deploying code during outages, availability issues, when 20% or more of hosts are unhealthy, or during special calendar days like holidays or peak traffic periods.
IAM Best Practices: Follow IAM best practices such as using multi-factor authentication (MFA), regularly rotating credentials and encryption keys, and implementing role-based access control (RBAC).
Authentication and Authorization: Verify that authentication and authorization policies adhere to the principle of least privilege.
Defense in Depth: Implement admission controls at every layer including network, application and data.
Vulnerability & Penetration Testing: Conduct security tests targeting vulnerabilities based on the threat model for the service’s functionality.
Encryption: Implement encryption at rest and in-transit policies.
Test Plan: Ensure test plans accurately simulate real use cases, including varying data sizes and read/write operations.
Scalability Assessment: Conduct load tests to assess the scalability of both your primary service and its dependencies.
Testing Strategies: Conduct load tests using both mock dependent services and real services to identify potential bottlenecks.
Resource Monitoring: During load testing, monitor for excessive logs, events, and other resources, and assess their impact on latency and potential bottlenecks.
Autoscaling Validation: Validate on-demand autoscaling policies by testing them under increased load conditions.
Service Unavailability: Test scenarios where the dependent service is unavailable, experiences high latency, or results in a higher number of faults.
Monitoring and Alarms: Ensure that monitoring, alarms and on-call procedures for troubleshooting and recovery are functioning as intended.
Canary Testing and Continuous Validation
This strategy involves deploying a new version of a service to a limited subset of users or servers with real-time monitoring and validation before a full deployment.
Canary Test Validation: Ensure canary tests are based on real use cases and validate the functional and non-functional behavior of the service. If a canary test fails, it should automatically trigger a rollback and halt further deployments until the underlying issues are resolved.
Continuous Validation: Continuously validate API behavior and monitor performance metrics such as latency, error rates, and resource utilization.
Edge Case Testing: Canary tests should include common and edge cases such as large request size.
Resilience and Reliability
Idle Timeout Configuration: Set your API server’s idle connection timeout slightly longer than the load balancer’s idle timeout.
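For example, with Go's net/http server the idle timeout can be set explicitly; the sketch below assumes a load balancer idle timeout of 60 seconds, so the server's is set slightly longer:

package main

import (
    "net/http"
    "time"
)

func main() {
    // Keep the server's idle timeout slightly longer than the LB's (assumed 60s),
    // so the load balancer, not the server, closes idle connections.
    srv := &http.Server{
        Addr:         ":8080",
        ReadTimeout:  5 * time.Second,
        WriteTimeout: 10 * time.Second,
        IdleTimeout:  65 * time.Second, // > assumed 60s LB idle timeout
    }
    _ = srv.ListenAndServe()
}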
Load Balancer Configuration: Ensure the load balancer evenly distributes requests among servers using a round-robin method and avoids directing traffic to unhealthy hosts. Prefer this approach over least-connections method.
Backward Compatibility: Ensure API changes are backward compatible, verified through contract-based testing, and forward compatible, for example by ignoring unknown properties.
Correlation ID Injection: Inject a Correlation ID into incoming requests, allowing it to be propagated through all dependent services for logging and tracing purposes.
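A sketch of such injection as Go HTTP middleware is shown below; the X-Correlation-ID header name and the github.com/google/uuid dependency are assumptions, not a standard:

package middleware // illustrative package name

import (
    "net/http"

    "github.com/google/uuid" // assumed dependency for generating IDs
)

// CorrelationID generates a correlation ID when the caller did not supply one,
// and echoes it on the response so the same ID can be logged and propagated
// to downstream calls.
func CorrelationID(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get("X-Correlation-ID")
        if id == "" {
            id = uuid.NewString()
        }
        r.Header.Set("X-Correlation-ID", id) // visible to handlers and downstream clients
        w.Header().Set("X-Correlation-ID", id)
        next.ServeHTTP(w, r)
    })
}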
Graceful Degradation: Implement graceful degradation to operate in a limited capacity even when dependent services are down.
Idempotent APIs: Ensure APIs especially those that create resources are implemented with idempotent behavior.
Request Validation: Validate all request parameters and fail fast on requests that are malformed, improperly sized, or contain malicious data.
Single Points of Failure: Eliminate single points of failure, bottlenecks, and dependencies on shared resources to minimize the blast radius.
Cold Start Optimization: Ensure that cold service startup time is limited to just a few seconds.
Performance Optimization
Latency Reduction: Identify and optimize parts of the system with high latency, such as database queries, network calls, or computation-heavy tasks.
Pagination: Implement pagination for list operations, ensuring that pagination tokens are account-specific and invalid after the query expiration time.
Thread and Queue Management: Set up the number of threads, connections, and queuing limits. Generally, the queue size should be proportional to the number of threads and kept small.
Resource Optimization: Optimize resource usage (e.g., CPU, memory, disk) by tuning configuration settings and optimizing code paths to reduce unnecessary overhead.
Caching Strategy: Review and optimize caching strategies to reduce load on databases and services, ensuring that cached data is used effectively without becoming stale.
Database Indexing: Regularly review and update database indexing strategies to ensure queries run efficiently and data retrieval is optimized.
Web Application Firewall: Consider integrating a web application firewall (WAF) with your services’ load balancers to enhance security and traffic management and to protect against distributed denial-of-service (DDoS) attacks. Confirm WAF settings and assess performance through load and security testing.
Testing Throttling Limits: Test throttling and rate limiting policies in the test environment.
Granular Limits: Implement tenant-level rate limits at the API endpoint level to prevent the noisy neighbor problem, and ensure that tenant context is passed to downstream services to enforce similar limits.
Aggregated Limits: When setting rate limits for both tenant-level and API-levels, ensure that the tenant-level limits exceed the combined total of all API limits.
Graceful degradation: Cache throttling and rate limit data to enable graceful degradation with fail-open if datastore retrieval fails.
Unauthenticated requests: Minimize processing for unauthenticated requests and safeguard against large payloads and invalid parameters.
Dependent Services
Timeout and Retry Configuration: Configure connection and request timeouts, implement retries with backoff and circuit-breaker, and set up fallback mechanisms for API clients with circuit breakers when connecting to dependent services.
Monitoring and Logging: Monitor and log failures and latency of dependent services and infrastructure components such as load balancers, and trigger alarms when they exceed the defined SLOs.
Scalability of Dependent Service: Verify that dependent services can cope with increased traffic loads as your traffic scales.
Compliance and Privacy
Below are some best practices for ensuring compliance:
Compliance: Ensure all data complies with local regulations such as the California Consumer Privacy Act (CCPA), General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and other privacy regulations [See NIST SP 800-122].
Privacy: Identify and classify Personally Identifiable Information (PII), and ensure all data access is protected through Identity and Access Management (IAM) and compliance-based PII policies [See DHS Guidance].
Privacy by design: Incorporate privacy by design principles into every stage of development to reduce the risk of data breaches.
Audit Logs: Maintain logs for all administrative actions, access to sensitive data and changes to critical configurations for compliance audit trails.
Monitoring: Continuously monitor compliance requirements to ensure ongoing adherence to regulations.
Data Management
Data Consistency: Evaluate requirements for data consistency, such as strong versus eventual consistency. Ensure data is consistently stored across multiple data stores, and implement a reconciliation process to detect any inconsistencies or lag times, logging them for monitoring and alerting purposes.
Schema Compatibility: Ensure data schema changes are both backward and forward compatible by implementing a two-phase release process. First, deploy an intermediate version that can read the new schema format but continues to write in the old format. Once this intermediate version is fully deployed and stable, proceed to roll out the new code that writes data in the new format.
Retention Policies: Establish and verify data retention policies across all datasets.
Unique Data IDs: Ensure data IDs are unique and do not overflow especially when using 32-bit or smaller integers.
Auto-scaling Testing: Test auto-scaling policies triggered by traffic spikes, and confirm proper partitioning/sharding across scaled resources.
Data Cleanup: Clean up stale data, logs and other resources that have expired or are no longer needed.
Divergence Monitoring: Implement automated processes to identify divergence from data consistency or high lag time with data synchronization when working with multiple data stores.
Data Migration Testing: Test data migrations in isolated environments to ensure they can be performed without data loss or corruption.
Backup and Recovery: Test backup and recovery processes to confirm they meet defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets.
Data Masking: Implement data masking in non-production environments to protect sensitive information.
Stale Cache Handling: Handle stale cache data by setting appropriate time-to-live (TTL) values and ensuring cache invalidation is correctly implemented.
Cache Preloading: Pre-load cache before significant traffic spikes so that latency can be minimized.
Cache Validation: Validate the effectiveness of your cache invalidation and clearing methods.
Negative Cache: Implement caching behavior for both positive and negative use cases and monitor the cache hits and misses.
Peak Traffic Testing: Assess service performance under peak traffic conditions without caching.
Bimodal Behavior: Minimize reliance on caching to reduce the complexity of bimodal logic paths.
Disaster Recovery
Backup Validation: Regularly test backup and recovery processes to ensure they meet defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets.
Failover Testing: Test failover procedures for critical services to validate that they can seamlessly switch over to backup systems or regions without service disruption.
Chaos Engineering: Incorporate chaos engineering practices to simulate disaster scenarios and validate the resilience of your systems under failure conditions.
Configuration and Feature-Flags
Configuration Storage: Prefer storing configuration changes in a source code repository and releasing them gradually through a deployment pipeline including tests for verification.
Configuration Validation: Validate configuration changes in a test environment before applying them in production to avoid misconfigurations that could cause outages.
Feature Management: Use a centralized feature flag management system to maintain consistency across environments and easily roll back features if necessary.
Testing Feature Flags: Test every combination of feature flags comprehensively in both test and pre-production environments before the release.
Observability
Observability involves instrumenting systems to collect and analyze logs, metrics, and traces for monitoring system performance and health. Below are some best practices for monitoring, logging, tracing, and alarms [See USE and RED methodologies for Systems Performance]:
Monitoring
System Metrics: Monitor key system metrics such as CPU usage, memory usage, disk I/O, network latency, and throughput across all nodes in your distributed system.
Application Metrics: Track application-specific metrics like request latency, error rates, throughput, and the performance of critical application functions.
Server Faults and Client Errors: Monitor metrics for server-side faults (5XX) and client-side errors (4XX) including those from dependent services.
Service Level Objectives (SLOs): Define and monitor SLOs for latency, availability, and error rates. Use these to trigger alerts if the system’s performance deviates from expected levels.
Health Checks: Implement regular health checks to assess the status of services and underlying infrastructure, including database connections and external dependencies.
Dashboards: Use dashboards to display real-time and historical graphs for throughput, P9X latency, faults/errors, data size, and other service metrics, with the ability to filter by tenant ID.
Logging
Structured Logging: Ensure logs are structured and include essential information such as timestamps, correlation IDs, user IDs, and relevant request/response data.
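For example, Go's standard log/slog package emits structured JSON logs with named fields; the field names below (correlation_id, user_id) are illustrative conventions:

package main

import (
    "log/slog"
    "os"
    "time"
)

func main() {
    // Structured JSON logs with consistent, queryable fields.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    logger.Info("order created",
        slog.String("correlation_id", "3f2c0c9d"),
        slog.String("user_id", "u-123"),
        slog.Int("items", 3),
        slog.Duration("latency", 42*time.Millisecond),
    )
}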
Log API entry and exits: Log the start and completion of API invocations along with correlation IDs for tracing purposes.
Log Retention: Define and enforce log retention policies to avoid storage overuse and ensure compliance with data regulations.
Log Aggregation: Use log aggregation tools to centralize logs from different services and nodes, making it easier to search and analyze them in real-time.
Log Levels: Properly categorize logs (e.g., DEBUG, INFO, WARN, ERROR) and ensure sensitive information (such as PII) is not logged.
Tracing
Distributed Tracing: Implement distributed tracing to capture end-to-end latency and the flow of requests across multiple services. This helps in identifying bottlenecks and understanding dependencies between services.
Trace Sampling: Use trace sampling to manage the volume of tracing data, capturing detailed traces for a subset of requests to balance observability and performance.
Trace Context Propagation: Ensure that trace context (e.g., trace IDs, span IDs) is propagated across all services, allowing complete trace reconstruction.
Alarms
Threshold-Based Alarms: Set up alarms based on predefined thresholds for key metrics such as CPU/memory/disk/network usage, latency, error rates, throughput, starvation of threads and database connections, etc. Ensure that alarms are actionable and not too sensitive to avoid alert fatigue.
Anomaly Detection: Implement anomaly detection to identify unusual patterns in metrics or logs that might indicate potential issues before they lead to outages.
Metrics Isolation: Keep metrics and alarms from continuous canary tests and dependent services separate from those generated by real traffic.
On-Call Rotation: Ensure that alarms trigger appropriate notifications to on-call personnel, and maintain a rotation schedule to distribute the on-call load among team members.
Runbook Integration: Include runbooks with alarms to provide on-call engineers with guidance on how to investigate and resolve issues.
Rollback and Roll Forward
Rolling back involves redeploying a previous version to undo unwanted changes. Rolling forward involves pushing a new commit with the fix and deploying it. Here are some best practices for rollback and roll forward:
Immutable infrastructure: Implement immutable infrastructure practices so that switching back to a previous instance is simple.
Automated Rollbacks: Ensure rollbacks are automated so that they can be executed quickly and reliably without human intervention.
Rollback Testing: Test rollback changes in a test environment to ensure the code and data can be safely reverted.
Critical bugs: To prevent customer impact, avoid rolling back if the changes involve critical bug fixes or compliance and security-related updates.
Schema changes: If the new code introduced schema changes, confirm that the previous version can still read and update the modified data.
Roll Forward: Use rolling forward when rollback isn’t possible.
Avoid rushing Roll Forwards: Avoid rolling forward if other changes have been committed that are still being tested.
Testing Roll Forwards: Make sure the new changes including configuration updates are thoroughly tested before the roll forward.
Documentation and Knowledge Sharing
Operational Runbooks: Maintain comprehensive runbooks that document operational procedures, troubleshooting steps, and escalation paths for common issues.
Postmortems: Conduct postmortems after incidents to identify root causes, share lessons learned, and implement corrective actions to prevent recurrence.
Knowledge Base: Build and maintain a knowledge base with documentation on system architecture, deployment processes, testing strategies, and best practices for new team members and ongoing reference.
Training and Drills: Regularly train the team on disaster recovery procedures, runbooks, and incident management. Conduct disaster recovery drills to ensure readiness for actual incidents.
Continuous Improvement
Feedback Loops: Establish feedback loops between development, operations, and security teams to continuously improve deployment processes and system reliability.
Metrics Review: Regularly review metrics, logs, and alarms to identify trends, optimize configurations, and enhance system performance.
Automation: Automate repetitive tasks, such as deployments, monitoring setup, and incident response, to reduce human error and increase efficiency.
Conclusion
Releasing software in distributed systems presents unique challenges due to the complexity and scale of production environments, which cannot be fully replicated in testing. By adhering to the practices outlined in this checklist—such as canary releases, feature flags, comprehensive observability, rigorous scalability testing, and well-prepared rollback mechanisms—you can significantly reduce the risks associated with deploying new code. A strong DevOps culture, where development, operations, and security teams work closely together, ensures continuous improvement and adaptability to new challenges. By following this checklist and fostering a culture of collaboration, you can enhance the stability, security, and scalability of each release for your platform.
An access control system establishes a structure to manage the accessibility of resources within an organization or digital environment. It aims to prevent unauthorized individuals or entities from accessing data or taking actions outside their designated privileges. Such systems can govern physical entry points—like determining who can access a building—or digital ones—such as delineating permissions on a computer network or software platform. Key components of an access control system include:
Authentication: It is the process of verifying the identity of a user, application, or system. This process ensures that the entity requesting access is who or what it claims to be. Common methods of authentication include username and password, multi-factor authentication, biometric verification, token-based authentication, and certificate-based authentication.
Authorization: It determines the level of access, or permissions, granted to a legitimately authenticated user or system. Essentially, it answers the question: “What is this authenticated entity allowed to do or see within the system?”.
Audit and Monitoring: It refers to the systematic tracking, recording, and analysis of activities or events within a system or network. These activities often include user actions, system accesses, and operations that affect data and resources. The primary goals are to ensure compliance with established policies, detect unauthorized or abnormal activities, and facilitate the identification of vulnerabilities or weaknesses. Elements often involved in audit and monitoring can include log files, real-time monitoring, alerts and notifications, data analytics, and compliance reporting.
Policy Management: It involves the creation, maintenance, and enforcement of rules, guidelines, and standard operating procedures that govern the behavior of users and systems within an organization or environment. These policies may include access policies, security policies, operational policies, compliance policies, change management policies, and policy auditing.
In this article, we will focus on authorization, which may use the following popular mechanisms for enforcing access control:
Role-Based Access Control (RBAC): In RBAC, permissions are associated with roles, and users are assigned to these roles. For example, a “Manager” role might have the ability to add or remove employees from a system, while an “Employee” role might only be able to view information. When a user is assigned a role, they inherit all the permissions that come with it.
Attribute-Based Access Control (ABAC): ABAC is a more flexible and complex system that uses attributes as building blocks in a rule-based approach to control access. These attributes can be associated with the user (e.g., age, department, job role), action (e.g., read, write), resource (e.g., file type, location), or even environmental factors (e.g., time of day, network security level). Policies are then crafted to allow or deny actions based on these attributes.
Policy-Based Access Control (PBAC): PBAC is similar to ABAC but tends to be more dynamic, incorporating real-time information into its decision-making process. For example, a PBAC system might evaluate current network threat levels or the outcome of a risk assessment to determine whether access should be granted or denied. Policies can be complex, allowing for a high degree of flexibility and context-aware decisions.
Access Control Lists (ACLs): A list associated with a resource specifying which users or systems can or cannot perform particular actions on it.
Capabilities: In a capability-based security model, permissions are attached to tokens (capabilities) rather than to subjects (e.g., users) or objects (e.g., files). These tokens can be passed around between users and systems. Having a token allows a user to access a resource or perform an action. This model decentralizes the control of access, making it flexible but also potentially harder to manage at scale.
Permissions: This is a simple and straightforward model where each object (like a file or database record) has associated permissions that specify which users can perform which types of operations (read, write, delete, etc.). This is often seen in file systems where each file and directory has an associated set of permission flags.
Discretionary Access Control (DAC): In DAC models, the owner of the resource has the discretion to set its permissions. For example, in many operating systems, the creator of a file can decide who can read or write to that file.
Mandatory Access Control (MAC): Unlike DAC, where users have some discretion over permissions, in MAC, the system enforces policies that users cannot alter. These policies often use labels or classifications (e.g., Top Secret, Confidential) to determine who can access what.
The approaches to authorization are not mutually exclusive and can be integrated to form hybrid systems. For instance, an enterprise might rely on RBAC for broad-based access management, while also employing ABAC or PBAC to handle more nuanced or sensitive use-cases. The remainder of this article will concentrate on the design and implementation of such hybrid authorization systems.
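To make the RBAC model described above concrete, here is a minimal Go sketch in which permissions hang off roles and a user is authorized if any assigned role grants the requested permission; the role and permission names are made up for illustration:

package rbac // illustrative package name

// rolePermissions maps each role to the permissions it grants.
var rolePermissions = map[string][]string{
    "Manager":  {"employee:add", "employee:remove", "employee:view"},
    "Employee": {"employee:view"},
}

// hasPermission reports whether any of the user's roles grants the permission.
func hasPermission(userRoles []string, permission string) bool {
    for _, role := range userRoles {
        for _, p := range rolePermissions[role] {
            if p == permission {
                return true
            }
        }
    }
    return false
}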
Industry Standards
The following are popular industry standards that provide a common framework for the design, implementation, and management of security policies across different systems:
OAuth 2.0: IETF (Internet Engineering Task Force) standard to provide delegated access without sharing credentials.
OpenID Connect (OIDC): OpenID Foundation standard that layers on top of OAuth 2.0; it is primarily used for authentication but is often used in conjunction with authorization.
Security Assertion Markup Language (SAML): OASIS standard to exchange authentication and authorization information between parties.
These standards often complement each other and can be used in combination to build robust, secure, and flexible authorization mechanisms.
Popular Authorization Systems
Various open-source and commercial authorization systems are available to cater to different needs, from simple role-based systems to complex policy-driven solutions. Here are some popular open-source authorization systems:
Open Policy Agent (OPA): A general-purpose policy engine that enables fine-grained, context-aware access control across the stack.
Casbin: A powerful, efficient, and lightweight access control library that supports various access control models.
Authelia: A single sign-on (SSO) and two-factor authentication server.
Keycloak: Offers integrated SSO and IDM for browser apps and RESTful web services, along with extensive authorization capabilities.
Apache Shiro: A Java security framework that performs authentication, authorization, cryptography, and session management.
Spring Security: Provides comprehensive security features for Java applications, including robust access control capabilities.
FreeIPA: An integrated Identity and Authentication solution for Linux/UNIX networked environments.
Pomerium: An identity-aware proxy that enables secure access to internal applications.
ORY: A set of cloud-native identity infrastructure components, which include ORY Keto for access control.
PlexRBAC: An open-source RBAC implementation that I wrote back in 2010 using Java language.
PlexRBACJS: My open-source RBAC implementation using JavaScript language.
SaasRBAC: My open-source RBAC implementation using Rust language.
Following are commercial offerings for authorization systems:
AWS IAM and Cognito: Amazon Web Services offers these services for identity and access management, both within AWS and for apps using AWS backend services.
Amazon Verified Permissions: Amazon Verified Permissions is a scalable, fine-grained permissions management and authorization service for custom applications.
Okta: Provides a wide range of identity and access management solutions including strong authorization controls.
Microsoft Azure AD: Offers identity services and access management through Azure’s cloud platform.
Cyral: Focuses on data layer authorization, especially for data clouds and data warehouses.
OneLogin: Provides unified access management, making it easier to secure connections across users and devices.
Ping Identity: Provides solutions for both workforce and customer identity types.
RSA SecurID Suite: Offers highly secure and flexible access control, including role-based and policy-driven controls.
Saviom: Specializes in role-based access control for resource scheduling and project portfolio management.
SailPoint: Offers intelligent identity management solutions, including fine-grained entitlement management and policy enforcement.
ForgeRock: An identity management solution designed for consumer-facing applications, with extensive support for access management and federation.
Idaptive (now part of CyberArk): Provides end-to-end identity automation and adaptive security.
Another noteworthy authorization solution is Google’s Zanzibar. While not available as an open-source or commercial product, Google has released a whitepaper outlining its architecture and principles. Zanzibar is engineered to meet the demands of large, intricate systems and is capable of processing millions of queries per second. The aforementioned authorization systems offer various configurations and customizations to meet an organization’s particular needs. We plan to draw from the design elements of these existing systems to create a robust and versatile authorization framework.
Design Tenets
Our authorization system will use a hybrid approach combining Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), and Relationship-Based Access Control (ReBAC), and will incorporate various features inspired by the likes of OPA, AWS IAM, Google Zanzibar, and my previous implementations of similar authorization systems. The following are the primary design tenets for building such an authorization system:
Scalability: Capable of handling a large number of authorization requests per second and expand to accommodate growing numbers of users and resources.
Flexibility: Supports RBAC, ABAC, and ReBAC, allowing for the handling of various scenarios.
Fine-grained Control: Context-aware decisions based on factors such as time, location, and real-time data, as well as multiple attributes of the subject, object, and environment.
Auditing and Monitoring: Detailed logs for all access attempts and policy changes, and real-time insights into access patterns, possibly with alerting for suspicious activities.
Security: Applies least privilege, and enforces data masking and redaction.
Usability: Easy-to-use interfaces for assigning, changing, and revoking roles.
Extensibility: Comprehensive APIs for integration with other systems and services, and the ability to run custom code during the authorization process.
Reliability: Minimal downtime, with backup and recovery.
Compliance: Adhere to regulatory requirements like GDPR, HIPAA, etc., and track changes to policies for auditing purposes.
Multi-Tenancy: Support multiple services and products with a variety of authorization models under a single unified system.
Policy Versioning and Namespacing: Allow multiple versions and namespaces of policies, making it possible to manage complex policy changes.
Balance between Expressiveness and Performance: Provide a good balance between the expressive policies offered by OPA and the high performance offered by Zanzibar.
Policy Validation: Check against invalid, unsafe, or ambiguous policies and prevent users from making accidental mistakes.
Performance Optimization: Use caching, indexing, parallel processing, lazy evaluation, rule simplification, automated reasoning, decision trees, and other optimization techniques to improve the performance of the system.
Authorization Concepts
The following is a list of high-level data model concepts that are typically used in authorization systems:
Principal
The entity (which could be a user, system, or another service) that is making the request. Principals are often authenticated before they are authorized to perform an action.
Subject
Similar to a principal, the subject refers to the entity that is attempting to access a particular resource. In some contexts, a subject may represent a real person that has multiple principal identities for various systems.
Permission
An action that a principal is allowed to perform on a particular resource. For example, reading a file, updating a database record, or deleting an account.
Claim
A statement made by the principal, usually after authentication, that annotates the principal with specific attributes (e.g., username, roles, permissions). Claims are often used in token-based authentication systems like JWT to carry information about the principal.
Role
A named collection of permissions that can be assigned to a principal. Roles simplify the management of permissions by grouping them together under a single label.
Group
A collection of principals that are treated as a single unit for the purpose of granting permissions. For example, an “Admins” group might be given a role that allows them to perform administrative actions.
Access Policy
A set of rules that define the conditions under which a particular action is allowed or denied. Policies can be simple (“Admins can do anything”) or complex (“Users can edit a document only if they are the creator and the document is in ‘Draft’ status”).
Relation
Relations define how different entities are connected. For instance, a “user” can have a “memberOf” relation with a “group”, or a “document” can have an “ownedBy” relation with a “user”.
Resource
The object that the principal wants to access (e.g., a file, a database record).
Context
Additional situational information (e.g., IP address, time of day) that might influence the authorization decision.
Namespace
Each namespace could serve as a container for a set of resources, roles, and permissions.
Scope or Realm
This often refers to the level or context in which a permission is granted. For instance, in OAuth, scopes are used to specify what access a token grants the user, like “read-only” access to a particular resource.
Rule
A specific condition or criterion in a policy that dictates whether access should be granted or denied.
Dynamic Conditions
Dynamic conditions or predicates are expressions that must be evaluated at runtime to determine if access should be granted or denied. Dynamic conditions consist of attributes, operators, and values, e.g.,
if (principal.role == "employee" AND principal.status == "active") OR (time < "17:00") then ALLOW
if (principal.role == "admin") OR (document.owner == principal.id) then ALLOW
if IP_address in [allowed_IPs] then ALLOW
if time >= 09:00 AND time <= 17:00 then ALLOW
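Since the permission constraints described later in this article are expressed as GO Templates, a dynamic condition like the first rule above can be sketched as a template evaluated against runtime attributes; the condition syntax and attribute names here are illustrative and may differ from the constraint format used by an actual authorization system:

package main

import (
    "bytes"
    "fmt"
    "text/template"
)

// Evaluate a dynamic condition expressed as a Go template against runtime attributes.
func main() {
    condition := `{{if and (eq .Principal.role "employee") (eq .Principal.status "active")}}true{{end}}`

    tmpl, err := template.New("rule").Parse(condition)
    if err != nil {
        panic(err)
    }
    attrs := map[string]map[string]string{
        "Principal": {"role": "employee", "status": "active"},
    }
    var out bytes.Buffer
    if err := tmpl.Execute(&out, attrs); err != nil {
        panic(err)
    }
    fmt.Println("allowed:", out.String() == "true")
}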
Authorization Data Model
The following data model is defined in the Protocol Buffers definition language based on the above authorization concepts:
Organization
The Organization abstracts a boundary of authorization data and it can have multiple namespaces for different security realms or segments of security domains. Here is the definition of Organization:
message Organization {
// ID unique identifier assigned to this organization.
string id = 1;
// Version
int64 version = 2;
// Name of organization.
string name = 3;
// Allowed Namespaces for organization.
repeated string namespaces = 4;
// url for organization.
string url = 5;
// Optional parent ids.
repeated string parent_ids = 6;
}
Principal
The Principal abstracts the subject who is making an authorization request to perform an action on a target resource based on access rules and dynamic conditions. A Principal belongs to an organization and can be associated with groups, roles (RBAC), permissions, and relationships (ReBAC). The Principal defines the following properties:
message Principal {
// ID unique identifier assigned to this principal.
string id = 1;
// Version
int64 version = 2;
// OrganizationId of the principal user.
string organization_id = 3;
// Allowed Namespaces for principal, should be subset of namespaces in organization.
repeated string namespaces = 4;
// Username of the principal user.
string username = 5;
// Email of the principal user.
string email = 6;
// Name of the principal user.
string name = 7;
// Attributes of principal
map<string, string> attributes = 8;
// Groups that the principal belongs to.
repeated string group_ids = 9;
// Roles that the principal belongs to.
repeated string role_ids = 10;
// Permissions that the principal belongs to.
repeated string permission_ids = 11;
// Relationships that the principal belongs to.
repeated string relation_ids = 12;
}
Resource and ResourceInstance
The Resource represents the target object on which an action is performed and against which access policies are checked. A resource can also be used to represent an object with a quota that can be allocated or assigned based on access policies. Here is the definition of Resource and ResourceInstance:
message Resource {
// ID unique identifier assigned to this resource.
string id = 1;
// Version
int64 version = 2;
// Namespace for resource.
string namespace = 3;
// Name of the resource.
string name = 4;
// capacity of resource.
int32 capacity = 5;
// Attributes of resource.
map<string, string> attributes = 6;
// AllowedActions that can be performed.
repeated string allowed_actions = 7;
}
enum ResourceState {
ALLOCATED = 0;
AVAILABLE = 1;
}
message ResourceInstance {
// ID unique identifier assigned to this resource instance.
string id = 1;
// Version
int64 version = 2;
// ResourceID of the resource.
string resource_id = 3;
// Namespace for resource.
string namespace = 4;
// Principal that is using the resource.
string principal_id = 5;
// state of resource instance.
ResourceState state = 6;
// Time duration in milliseconds after which instance will expire.
google.protobuf.Duration expiry = 7;
}
Permission
The Permission defines access policies for a resource, including dynamic conditions based on GO Templates that are evaluated before granting access:
enum Effect {
PERMITTED = 0;
DENIED = 1;
}
message Permission {
// ID unique identifier assigned to this permission.
string id = 1;
// Version
int64 version = 2;
// Namespace for permission.
string namespace = 3;
// Scope for permission.
string scope = 4;
// Actions that can be performed.
repeated string actions = 5;
// Resource for the action.
string resource_id = 6;
// Effect Permitted or Denied
Effect effect = 7;
// Constraints expression with dynamic properties.
string constraints = 8;
}
Role
A Principal can be associated with one or more Roles where each Role has a name and can be optionally associated with Permissions for implementing RBAC based access control, e.g.,
message Role {
// ID unique identifier assigned to this role.
string id = 1;
// Version
int64 version = 2;
// Namespace for permission.
string namespace = 3;
// Name of the role.
string name = 4;
// PermissionIDs that can be performed.
repeated string permission_ids = 5;
// Optional parent ids
repeated string parent_ids = 6;
}
A Role can also be inherited from multiple other Roles so that common Permissions can be defined in the parent Role(s) and specific Permissions are defined in the derived Roles.
Group
A Principal can be linked to multiple Groups, and each Group can be tied to several Roles. The Principal inherits access Permissions not only directly associated with it but also from the Roles it’s part of and the Groups it’s connected to. Here is the Group definition:
message Group {
// ID unique identifier assigned to this group.
string id = 1;
// Version
int64 version = 2;
// Namespace for permission.
string namespace = 3;
// Name of the group.
string name = 4;
// RoleIDs that are associated.
repeated string role_ids = 5;
// Optional parent ids.
repeated string parent_ids = 6;
}
A Group can also have one or more parents, similar to Roles, so that access policies can check membership for groups or inherit all permissions that belong to a Group through its association with Roles.
Relationship
A Principal can define relationships with resources or target objects, and access policies can check for the existence of a relationship before permitting an action, which implements ReBAC-based policies. Although a Relationship may seem similar to a Role or a Group, it differs because a Relationship is a direct association between a Principal and a Resource, whereas a Role can be associated with multiple Principals and is only indirectly associated with a Resource through a Permission object. Here is the definition of a Relationship:
message Relationship {
// ID unique identifier assigned to this relationship.
string id = 1;
// Version
int64 version = 2;
// Namespace for permission.
string namespace = 3;
// Relation name.
string relation = 4;
// PrincipalID for relationship.
string principal_id = 5;
// ResourceID for relationship.
string resource_id = 6;
// Attributes of relationship.
map<string, string> attributes = 7;
}
API Specifications for Authorization
The Authorization APIs are grouped into control-plane APIs for managing the above data and their relationships with Principals, and data-plane (behavioral) APIs for authorization decisions. The following section defines the control-plane APIs in the Protocol Buffers definition language for managing authorization data and policies:
The following specification defines APIs for authorizing access to resources based on permissions and constraints, as well as operations to allocate and deallocate resources:
The Authorize API takes AuthRequest as a request that defines Principal-Id, Resource-Name, Action and context attributes and checks permissions for granting access:
The Check API allows evaluating dynamic conditions based on GO Templates without defining Permissions, so that you can check for membership in a group or a role, the existence of a relationship, or other dynamic properties.
The above hybrid authorization API is implemented in GO and is freely available from https://github.com/bhatti/PlexAuthZ. The following diagram illustrates the structure of the modules implementing the various parts of the Authorization system:
The following are the major components in the above diagram:
API Layer
The API layer defines service interfaces and the schema for the domain model as well as request/response objects. The interfaces are then implemented by gRPC servers and REST controllers.
Data Layer and Repositories
The Data layer defines interfaces for storing data in Redis or DynamoDB databases. The Repository layer defines interfaces for managing data for each type such as Principal, Organization and Resource.
Domain Services
The Domain services abstract over the Repository layer and implement referential integrity between data objects and validation logic before persisting authorization data.
Authorizer
The Authorizer layer defines interfaces for authorization decisions. The API layer implements this interface, based on Casbin, for communication between clients and servers. The Authorizer layer also provides a default implementation based on the Domain service layer for enforcing authorization decisions with the above APIs.
Factory and Configuration
PlexAuthZ makes extensive use of interfaces with different implementations for the Datastore, Repositories, Authorizer and AuthAdapter. The user can choose among the implementations through the Configuration, which is passed to factory methods when instantiating objects that implement those interfaces.
AuthAdapter
The AuthAdapter abstracts the Data services and Authorizer for interacting with the underlying authorization system. AuthAdapter defines a simplified DSL in Go that understands the relationships between data objects. Users can instantiate an AuthAdapter that connects to a remote gRPC server, a REST controller, or the database directly.
Usage Examples
In the above data model and APIs, Principals, Resources and Relationships can have arbitrary attributes that can be checked at runtime for enforcing attribute-based policies. In addition, the request objects for Authorize, Check and AllocateResource define runtime context properties that can be passed along with other attributes when evaluating runtime conditions based on Go templates. The following sections describe use cases for enforcing access policies based on ABAC, RBAC, ReBAC and PBAC:
GO Client Initialization
First, the Go client library is set up by selecting an implementation backed by the database, the gRPC client or the REST API client, e.g.,
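The exact initialization API is defined in the repository; the following is only a minimal sketch of what the setup might look like, assuming a hypothetical Config struct and factory function (authz.New) that selects between the database-backed, gRPC or REST implementation:
// Illustrative sketch only -- the configuration fields and factory names are assumptions.
cfg := &authz.Config{Mode: "grpc", Endpoint: "localhost:9090"} // or "rest" / "database"
authAdapter, err := authz.New(cfg)
require.NoError(t, err)
// An organization-scoped adapter is then used for the control-plane operations below.
orgAdapter, err := authAdapter.Organization("my-org")
require.NoError(t, err)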
The following example illustrates implementing attribute-based access policies, where three Principals (alice, bob, charlie) define attributes for Department and Rank:
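The setup code is available in the repository; the sketch below is consistent with the checks that follow (the specific attribute values and constraint expressions are assumptions inferred from the comments in those checks):
// alice and bob belong to the Editors department; charlie has a high rank but a different department.
alice, err := orgAdapter.Principals().WithUsername("alice").
WithAttributes("Department", "Editors", "Rank", "5").Create()
bob, err := orgAdapter.Principals().WithUsername("bob").
WithAttributes("Department", "Editors", "Rank", "6").Create()
charlie, err := orgAdapter.Principals().WithUsername("charlie").
WithAttributes("Department", "Engineering", "Rank", "6").Create()
app, err := orgAdapter.Resources(namespace).WithName("ios-app").
WithActions("read", "write", "list").Create()
// read/list is permitted for the Editors department or for rank above 5
rlPerm, err := orgAdapter.Permissions(namespace).WithResource(app.Resource).
WithConstraints(`{{or (eq .Principal.Department "Editors") (GT .Principal.Rank 5)}}`).
WithActions("read", "list").Create()
// write is permitted only for the Editors department with rank above 5
wPerm, err := orgAdapter.Permissions(namespace).WithResource(app.Resource).
WithConstraints(`{{and (eq .Principal.Department "Editors") (GT .Principal.Rank 5)}}`).
WithActions("write").Create()
alice.AddPermissions(rlPerm.Permission)
alice.AddPermissions(wPerm.Permission)
bob.AddPermissions(rlPerm.Permission)
bob.AddPermissions(wPerm.Permission)
charlie.AddPermissions(rlPerm.Permission)
charlie.AddPermissions(wPerm.Permission)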
The attribute-based access permissions are then checked as follows:
// Alice, Bob, Charlie should be able to read/list since alice/bob belong to
// Editors attribute and Charlie's rank >= 6
require.NoError(t, alice.Authorizer(namespace).WithAction("list").
WithResourceName("ios-app").Check())
require.NoError(t, bob.Authorizer(namespace).WithAction("list").
WithResourceName("ios-app").Check())
require.NoError(t, charlie.Authorizer(namespace).WithAction("list").
WithResourceName("ios-app").Check())
// Only Bob should be able to write because Alice's rank is lower than 6 and
// Charlie doesn't belong to Editors attribute.
require.Error(t, alice.Authorizer(namespace).WithAction("write").
WithResourceName("ios-app").Check())
require.NoError(t, bob.Authorizer(namespace).WithAction("write").
WithResourceName("ios-app").Check())
require.Error(t, charlie.Authorizer(namespace).WithAction("write").
WithResourceName("ios-app").Check())
Note: The Authorization adapter defines a Check method that invokes either the Authorize or the Check method of the data-plane Authorization API based on the parameters.
Runtime Attributes based on IP Addresses
Go templates allow defining custom functions, and the PlexAuthZ implementation includes a number of helper functions to validate IP addresses, geolocation, time and other environmental factors, e.g.,
rwlPerm, err := orgAdapter.Permissions(namespace).
WithResource(app.Resource).
WithConstraints(`
{{$Loopback := IsLoopback .IPAddress}}
{{$Multicast := IsMulticast .IPAddress}}
{{and (not $Loopback) (not $Multicast) (IPInRange .IPAddress "211.211.211.0/24")}}
`).WithActions("read", "write", "list").Create()
alice.AddPermissions(rwlPerm.Permission)
// The app should only be accessible if the IP address is not loopback,
// not multicast, and is within the allowed IP range
require.NoError(t, alice.Authorizer(namespace).WithAction("list").
WithContext("IPAddress", "211.211.211.5").WithResourceName("ios-app").Check())
// But not a loopback or multicast address
require.Error(t, alice.Authorizer(namespace).WithAction("list").
WithContext("IPAddress", "127.0.0.1").WithResourceName("ios-app").Check())
require.Error(t, alice.Authorizer(namespace).WithAction("list").
WithContext("IPAddress", "224.0.0.1").WithResourceName("ios-app").Check())
RBAC Scenario
The following example assigns roles and groups to Principal objects and then enforces membership before granting access:
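The repository contains the full example; the sketch below is only illustrative and assumes a few hypothetical builder methods (Roles(), AddRoles(), WithConstraints() on the Authorizer, and the HasRole helper) alongside the Groups/AddGroups methods used elsewhere in this article:
// Define a role and a group, then assign both to the principal.
teller, err := orgAdapter.Roles(namespace).WithName("Teller").Create()
engGroup, err := orgAdapter.Groups(namespace).WithName("Engineering").Create()
alice, err := orgAdapter.Principals().WithUsername("alice").Create()
alice.AddRoles(teller.Role)
alice.AddGroups(engGroup.Group)
// Membership is enforced with a constraint-only check (no Permission objects),
// which is routed to the Check API as described in the note below.
require.NoError(t, alice.Authorizer(namespace).
WithConstraints(`{{and (HasRole "Teller") (HasGroup "Engineering")}}`).Check())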
Note: The Authorizer adapter invokes the Check API in the above use case because it only uses constraints without defining permissions.
ReBAC Scenario
Though ReBAC systems generally define relationships between actors, you can consider a Principal as a subject-actor and a Resource as a target-actor for relationships. The following scenario illustrates how relationships between Principals and Resources can be used to enforce ReBAC-based access policies similar to Zanzibar:
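The full snippet is in the repository; the following sketch matches the description below (the actual example also uses geofencing helpers with the UserLatLng context, which are omitted here, and the constraint expressions are assumptions):
medicalRecords, err := orgAdapter.Resources(namespace).WithName("medical-records").
WithAttributes("Location", "Hospital").
WithActions("read", "write").Create()
smith, err := orgAdapter.Principals().WithUsername("smith").Create()
john, err := orgAdapter.Principals().WithUsername("john").Create()
// smith relates to the records as a doctor, john as a patient
_, err = smith.Relationships(namespace).WithRelation("AsDoctor").
WithResource(medicalRecords.Resource).Create()
_, err = john.Relationships(namespace).WithRelation("AsPatient").
WithResource(medicalRecords.Resource).Create()
// Doctors may read and write records when at the Hospital
doctorPerm, err := orgAdapter.Permissions(namespace).WithResource(medicalRecords.Resource).
WithConstraints(`{{and (HasRelation "AsDoctor") (eq .Resource.Location .Location)}}`).
WithActions("read", "write").Create()
// Patients may only read their own records
patientPerm, err := orgAdapter.Permissions(namespace).WithResource(medicalRecords.Resource).
WithScope("john's records").
WithConstraints(`{{and (HasRelation "AsPatient") (eq .Resource.Location .Location)}}`).
WithActions("read").Create()
smith.AddPermissions(doctorPerm.Permission)
john.AddPermissions(patientPerm.Permission)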
The above snippet defines medical-records as a resource, and Principals for smith and john, where smith is assigned an AsDoctor relationship and john is assigned an AsPatient relationship. The permissions for reading or writing medical records enforce the AsDoctor relationship, and the permissions for reading medical records enforce the AsPatient relationship. The relationships are then enforced as follows:
// Dr. Smith should have permission for reading/writing medical records based on constraints
require.NoError(t, smith.Authorizer(namespace).WithAction("write").
WithResource(medicalRecords.Resource).
WithContext("UserLatLng", "47.620422,-122.349358", "Location", "Hospital").Check())
// Patient john should have permission for reading medical records based on constraints
require.NoError(t, john.Authorizer(namespace).WithAction("read").
WithScope("john's records").WithResource(medicalRecords.Resource).
WithContext("Location", "Hospital").Check())
// But Patient john should not write medical records
require.Error(t, john.Authorizer(namespace).WithAction("write").
WithResource(medicalRecords.Resource).
WithContext("Location", "Hospital").Check())
The above snippet also makes use of other functions available in the template language for enforcing dynamic conditions based on geofencing, which permits access only when the doctor is close to the Hospital.
As the Relationships are defined between actors, we can also define a Resource to represent a Doctor and a Principal for the patient so that a patient-doctor relationship can be established, e.g.,
// Now treating Doctor as Target Resource for appointment
doctorResource, err := orgAdapter.Resources(namespace).WithName(smith.Principal.Name).
WithAttributes("Year", fmt.Sprintf("%d", time.Now().Year()),
"Location", "Hospital").WithActions("appointment", "consult").Create()
doctorPatientRelation, err := john.Relationships(namespace).WithRelation("Physician").
WithAttributes("StartTime", "8:00am", "EndTime", "4:00pm").
WithResource(doctorResource.Resource).Create()
apptPerm, err := orgAdapter.Permissions(namespace).WithResource(doctorResource.Resource).
WithConstraints(`
{{$CurrentYear := TimeNow "2006"}}
{{and (TimeInRange .AppointmentTime .Relations.Physician.StartTime .Relations.Physician.EndTime)
(HasRelation "Physician") (eq "Patient" .Principal.UserRole) (eq .Resource.Year $CurrentYear) (eq .Resource.Location .Location)}}
`).WithActions("appointment").Create()
john.AddPermissions(apptPerm.Permission)
// Patient john should be able to make an appointment within normal Hospital hours
require.NoError(t, john.Authorizer(namespace).WithAction("appointment").
WithResource(doctorResource.Resource).
WithContext("Location", "Hospital", "AppointmentTime", "10:00am").Check())
The above example shows how authorization rules can also limit access to the normal hours for appointments.
Resources with Quota
PlexAuthZ supports defining access policies for resources that have a quota, e.g., an organization may have a fixed set of IDE licenses to be used by the engineering team, or might be using utility-based computing resources with a daily budget. Here is an example scenario:
engGroup, err := orgAdapter.Groups(namespace).WithName("Engineering").Create()
alice, err := orgAdapter.Principals().WithUsername("alice").
WithAttributes("Title", "Engineer", "Tenure", "3").Create()
// Assigning groups
alice.AddGroups(engGroup.Group)
// AND with following resources
ideLicences, err := orgAdapter.Resources(namespace).WithName("IDELicence").
WithCapacity(5).WithAttributes("Location", "Chicago").
WithActions("use").Create()
require.NoError(t, ideLicences.
WithConstraints(`and (GT .Principal.Tenure 1) (HasGroup "Engineering") (eq .Resource.Location .Location)`).
WithExpiration(time.Hour).WithContext("Location", "Chicago").Allocate(alice.Principal))
...
// Deallocate after use
require.NoError(t, ideLicences.Deallocate(alice.Principal))
The above example demonstrates that the IDE license can only be allocated if the Principal is a member of the Engineering group, has a tenure of more than a year, and the Location matches the Resource Location. In addition, the resource can be allocated only for a fixed duration and is automatically deallocated if not deallocated explicitly. Both Redis and DynamoDB support TTL parameters for expiring data, so no application logic is required to expire them.
Resources with Wildcards in the Name
PlexAuthZ supports resources with wildcards in the name so that a user can match permissions for all resources that match the wildcard pattern. Here is an example:
alice, err := orgAdapter.Principals().WithUsername("alice").
WithAttributes("Department", "Sales", "Rank", "6").Create()
bob, err := orgAdapter.Principals().WithUsername("bob").
WithAttributes("Department", "Engineering", "Rank", "6").Create()
// Creating a project with wildcard
salesProject, err := orgAdapter.Resources(namespace).
WithName("urn:org-sales-*-project-1000-*").
WithAttributes("SalesYear", fmt.Sprintf("%d", time.Now().Year())).
WithActions("read", "write").Create()
rwlPerm, err := orgAdapter.Permissions(namespace).
WithResource(salesProject.Resource).
WithEffect(types.Effect_PERMITTED).
WithConstraints(`
{{$CurrentYear := TimeNow "2006"}}
{{and (GT .Principal.Rank 5) (eq .Principal.Department "Sales") (IPInRange .IPAddress "211.211.211.0/24") (eq .Resource.SalesYear $CurrentYear)}}
`).WithActions("*").Create()
require.NoError(t, err)
alice.AddPermissions(rwlPerm.Permission)
bob.AddPermissions(rwlPerm.Permission)
// Alice should be able to access from Sales Department and complete project name
require.NoError(t, alice.Authorizer(namespace).WithAction("read").
WithResourceName("urn:org-sales-abc-project-1000-xyz").
WithContext("IPAddress", "211.211.211.5").Check())
// But bob should not be able to access project because he doesn't belong to the Sales Department
require.Error(t, bob.Authorizer(namespace).WithAction("read").
WithResourceName("urn:org-sales-abc-project-1000-xyz").
WithContext("IPAddress", "211.211.211.5").Check())
Note: The project name “urn:org-sales-abc-project-1000-xyz” matches the wildcard in the resource name, and the permissions also verify attributes of the Resource and the Principal.
Permissions with Scope
PlexAuthZ allows associating permissions with a specific Scope, and the permission is only granted if the scope in the authorization request at runtime matches it, e.g.,
alice, err := orgAdapter.Principals().WithUsername("alice").
WithAttributes("Department", "Engineering", "Permanent", "true").Create()
bob, err := orgAdapter.Principals().WithUsername("bob").
WithAttributes("Department", "Sales", "Permanent", "true").Create()
project, err := orgAdapter.Resources(namespace).WithName("nextgen-app").
WithAttributes("Owner", "alice").
WithActions("list", "read", "write", "create", "delete").Create()
rwlPerm, err := orgAdapter.Permissions(namespace).WithResource(project.Resource).
WithScope("Reporting").
WithConstraints(`
{{or (eq .Principal.Username .Resource.Owner) (Not .Private)}}
`).WithActions("read", "write", "list").Create()
alice.AddPermissions(rwlPerm.Permission)
bob.AddPermissions(rwlPerm.Permission)
// Project should only be accessible by alice as the scope matches and she is the owner.
require.NoError(t, alice.Authorizer(namespace).
WithAction("list").WithScope("Reporting").
WithContext("Private", "true").
WithResourceName("nextgen-app").Check())
// But alice should not be able to access without matching scope.
require.Error(t, alice.Authorizer(namespace).
WithAction("list").WithScope("").
WithContext("Private", "true").
WithResourceName("nextgen-app").Check())
// But bob should not be able to access as project is private and he is not the owner.
require.Error(t, bob.Authorizer(namespace).
WithAction("list").WithScope("Reporting").
WithContext("Private", "true").
WithResourceName("nextgen-app").Check())
Note: The above example also demonstrates how you can enforce ownership for private resources.
Summary
PlexAuthZ demonstrates how a hybrid authorization system can support various forms of access policies based on ABAC, RBAC, ReBAC and PBAC. It's still early in development, but it's an open-source project that you can try freely from https://github.com/bhatti/PlexAuthZ.
The first chapter defines remote API fundamentals: the contract for the desired behavior, the communication protocol, network endpoints and policies regarding failures. The chapter also surveys the history of remote APIs such as TCP/IP-based FTP, RPC-based DCE/CORBA/RMI/gRPC, queue/messaging-based, REST-style, and data streams/pipelines. The authors then examine cloud-native applications (CNA) and a set of principles described as IDEAL, which include Isolated State, Distribution, Elasticity, Automation and Loose Coupling. These traits are then summarized as:
Fit for purpose
Rightsized and modular
Resilient and protected
Controllable and adaptable
Workload-aware and resource-efficient
Agile and tool-supported
The authors describe how service-oriented architecture originated and evolved into microservices, each with a single responsibility within a domain-specific business capability. Microservices facilitate software reuse but also bring new challenges that include the fallacies of distributed computing, data consistency and state management. These APIs should be considered products, and they may form ecosystems. API business value, visibility and lifetime help make an API successful by enabling rapid integration of systems while supporting the autonomy and independent evolution of those systems. The API design may differ based on:
One general vs many specialized endpoints
Fine vs coarse-grained endpoint and operation scope
Few operations carrying much data vs chatty interactions carrying little data
Data currentness vs correctness
Stable contracts vs fast changing ones
The authors review architecturally significant requirements such as understandability, information sharing vs hiding, amount of coupling, modifiability, performance & scalability, data parsimony, and security & privacy. They then describe developer experience as:
DX = function, stability, ease of use, and clarity
The developer experience includes quality attributes throughout the lifecycle of an API, such as development qualities, operational qualities and managerial qualities. The chapter ends with the definition of a domain model for remote APIs that includes communication participants, endpoints with contracts that describe operations, message structure, and the API contract.
2. Lakeside Mutual Case Study
Chapter 2 introduces a fictitious Lakeside Mutual case study to illustrate API design. This chapter examines the user stories and quality attributes for the case study, the analysis-level domain model and an architecture overview. The architecture overview describes the system context and the application architecture, including service components. The chapter then describes an example API specification using MDSL (Microservice Domain-Specific Language).
3. API Decision Narratives
This chapter goes over API design options and decisions, where each decision includes criteria, alternative options and recommendations based on a why-statement and the architecture-decision-record format. The chapter starts with foundational API decisions and patterns that include:
3.1 API Visibility
This decision, which is primarily managerial and organizational, looks at the different visibility options such as Public API, Community API and Solution-Internal API.
3.2 API Integration
This decision looks at the API integration types, where an API can be integrated with the backend horizontally or with the frontend and backend vertically. In the former option, the backend exposes its services via a message-based remote backend integration API. In the latter option, the APIs are exposed via a message-based remote frontend integration API.
API designers have to find an appropriate business granularity for the service and handle cohesion and coupling criteria. The drivers for this decision are to define the architectural role that an API endpoint should play and the responsibility of each API operation. The role of an API endpoint can be Processing Resource for processing incoming commands, or Information Holder Resource for storage and retrieval of data or metadata. The information holder roles can be further divided into operational/transactional short-lived data; master long-lived data for business transactions; reference long-lived data for looking up delivery statuses, zip codes, etc.; link lookup resources to identify links to resources; and data transfer resources to offer shared data exchange between clients.
3.5 Defining Operation Responsibilities
The operation responsibilities include defining the read-write characteristics of each API operation and can be categorized into Computation Function, State Creation Operation, State Transition Operation and Retrieval Operation. The Computation Function computes a result solely from the client input without reading or writing server-side state. The State Creation Operation reliably creates state on a write-only API endpoint. The State Transition Operation performs one or more activities, causing a server-side state change, with considerations for network efficiency and data parsimony. The Retrieval Operation represents a read-only access operation to find the data.
3.6 Selecting Message Representation Patterns
The structural representation patterns deal with designing message representation structures with considerations for finding the optimal number of message parameters and the semantic meaning and stereotypes of the representation elements. This structural representation needs four decisions: the responsibility of message elements, the structure of the parameter representation, the exchange of required context information, and the meaning and stereotypes of message elements. The structure of parameter representation can be nested or flat with the types: Atomic Parameter, Atomic Parameter List, Parameter Tree, and Parameter Forest. The Atomic Parameter defines a single parameter or body element. The Atomic Parameter List aggregates multiple atomic parameters as a list. The Parameter Tree defines a hierarchical structure with one or more child nodes. The Parameter Forest comprises two or more Parameter Trees. Security and privacy concerns such as data integrity and confidentiality, as well as semantic proximity, determine the right option among these structure types.
3.7 Element Stereotypes
The element stereotype patterns include Data Element, Metadata Element, Link Element and ID Element. Security and data privacy concerns drive Data Elements and Metadata Elements, and messages may become larger if Metadata Elements are included. The unique ID Element is used to identify API endpoints, operations, and message representation elements. The Link Element acts as a human- and machine-readable, network-accessible pointer to other endpoints and operations.
3.8 Governing API Quality
The quality of service (QoS) for an API includes reliability, performance, security and scalability. The themes of the decisions for quality governance include:
3.8.1 Identification and Authentication of the API Client
Identification, authentication and authorization are important for API security, but they also enable measures for ensuring many other qualities.
3.8.2 API Key
The API Key identifies the client, and an additional signature can be made with a secret key, which is never transmitted. You may also use OAuth 2.0, OpenID, or Kerberos in combination with LDAP, CHAP, and EAP.
3.8.3 Pricing Plan
The pricing plan looks at the metering and charging for API consumption and its variants include Freemium Model, Flat-Rate Subscription, Usage-based Pricing and Market-based Pricing based on economic aspects.
3.8.4 Rate Limit
The Rate Limit safeguards against API clients that overuse the API.
3.8.5 Service Level Agreement
The Service Level Agreement defines testable service-level objectives to establish a structured, quality-oriented agreement with the API product owner.
3.8.6 Error Report
The Error Report uses error codes in the response message that indicate and classify the faults in a simple and machine-readable way.
3.8.7 Context Representation
The context representation uses Metadata Elements to carry contextual information in request and response messages. It can be used to cope with the diversity of protocols in distributed applications and to transport security tokens and digital signatures.
3.8.8 Pagination
Pagination divides large response data into smaller chunks, and its variants include Page-Based Pagination, Cursor-Based Pagination, Offset-Based Pagination, and Time-Based Pagination. It can optionally allow filtering capabilities, and the pagination structure can be defined as an Atomic Parameter List, Parameter Forest or Parameter Tree.
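As an illustration (not an example from the book), a cursor-based pagination response might be structured as follows in Go, with the opaque cursor carried as a Metadata Element alongside the items:
type Customer struct {
    ID   string `json:"id"`
    Name string `json:"name"`
}

// CustomerPage is one chunk of a larger result set.
type CustomerPage struct {
    Items []Customer `json:"items"`
    // NextCursor is an opaque cursor for fetching the next chunk; empty on the last page.
    NextCursor string `json:"next_cursor,omitempty"`
}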
3.8.9 Wish List and Wish Template
A Wish List allows API clients to specify the desired data elements of the requested resource in the request. When the response contains nested data, a Wish Template can be used to specify parameters in the request message that should be included in the corresponding response message.
3.8.10 Conditional Request
This pattern makes a Conditional Request by adding Metadata Elements to the request message, and the API service processes the request only if the condition specified by the metadata is met.
3.8.11 Request Bundle
Request Bundle is defined as a data container that assembles multiple requests with unique identifiers in a single request message.
3.8.12 Embedded Entity
This pattern embeds a Data Element in the request or response instead of a link or identifier.
3.8.13 Linked Information Holder
Linked Information Holder adds a Link Element to the message that points to the API endpoint representing the linked element.
3.9 API Evolution
API Evolution patterns define governing rules balancing stability and compatibility with maintainability and extensibility such as:
3.9.1 Version Identifier
Version identifier is added as a Metadata Element to the endpoint address, protocol header or the message payload to indicate possibly incompatible changes to clients.
3.9.2 Semantic Versioning
Semantic Versioning introduces a hierarchical three-number versioning schema x.y.z, which allows API providers to denote level of changes as major, minor and patch versions.
3.9.3 Commissioning and Decommissioning
The variants for version introduction and decommissioning decision include Two in Production, Limited Lifetime Guarantee, and Aggressive Obsolescence.
3.9.4 Experimental Preview
Experimental Preview provides access to an API to receive early feedback from consumers without making any commitments about the functionality, stability and longevity.
4. Pattern Language Introduction
This chapter introduces a pattern language along with basic scoping and structural patterns. Many of the patterns build upon Enterprise Integration Patterns and GoF Design Patterns when defining the structure of a message. The chapter categorizes patterns into Foundation patterns, Responsibility patterns, Structure patterns, Quality patterns and Evolution patterns. These patterns also follow Design Refinement Phases based on the Unified Process:
Inception – Foundation
Elaboration – Responsibility and Quality
Construction – Structure, Responsibility and Quality
Transition – Foundation, Quality and Evolution
4.1 Foundations: API Visibility and Integration Types
These patterns deal with the types of systems, subsystems and components as well as where an API should be accessible. The API integration types can be Frontend Integration and Backend Integration. Frontend Integrations, also referred to as vertical integrations, are consumed by API clients in application frontends. Cloud-native applications and microservices-based systems benefit from Backend Integration, sometimes called horizontal integration, to access information or activity in other systems. The API visibility alternatives can be Public API, Community API and Solution-Internal API. A Public API specifies endpoints, operations, message representations, quality of service and a lifecycle model that can be accessed by an unlimited or unknown number of API clients and can be controlled with API Keys. You may apply other patterns such as Version Identifier, Pricing Plan, Rate Limit and Service Level Agreement with Public APIs. A Community API is only available to a community that may consist of different organizations. The Solution-Internal API, also referred to as a Platform API, may be exposed within a single cloud provider offering.
4.2 Basic Structure Patterns
The structure patterns look at the number of representation elements for request and response messages and decide how these elements should be grouped. These patterns include Atomic Parameter, which describes plain data such as text and numbers; Atomic Parameter List, which groups several elementary parameters; Parameter Tree, which provides nested parameters; and Parameter Forest, which groups multiple Parameter Trees.
5. Define Endpoint Types and Operations
This chapter corresponds to the Define phase of the Align-Define-Design-Refine (ADDR) process and describes high-level endpoint identification activities. The authors look at user stories, event storming and other collaboration techniques to define API roles and responsibilities. The design of API contracts also has to consider developer experience in terms of function, stability, ease of use and clarity. Other quality attributes that the API designer has to decide on include: accuracy for functional correctness, including preconditions, invariants and postconditions; distribution of control and autonomy between API client and provider; scalability, performance and availability with Service Level Agreements for mission-critical APIs; manageability for monitoring APIs; consistency and atomicity for all-or-nothing semantics; the idempotence property; and auditability for risk management.
5.1 Endpoint Roles (aka Service Granularity)
The two general endpoint roles are Processing Resource and Information Holder Resource. The Processing Resource role allows remote clients to trigger an action, and related design concerns include contract expressiveness and service granularity; learnability and manageability; semantic interoperability; response time; security and privacy; and compatibility and evolvability. Information Holder Resource exposes domain data in the API, and it may use domain-driven design and object-oriented analysis and design to model the data. Other related concerns include quality attribute conflicts and trade-offs; security; data freshness vs consistency; and compliance with architectural design principles. Related patterns include:
Operational Data Holder to create, read, update and delete its data often and fast.
Master Data Holder to access master data that lives for a long time, changes rarely and is referenced from many clients. The request and response messages of master data often take the form of Parameter Trees, and master data updates may come in the form of coarse-grained full updates or fine-grained partial updates.
Reference Data Holder is used to look up reference data that is long-lived and immutable for clients using API endpoints. Its desired qualities include Do Not Repeat Yourself (DRY) and the performance vs consistency trade-off for read access.
Link Lookup Resource allows referring to other resources so that clients remain loosely coupled if API provider changes the destination of links. The design challenges include: coupling between clients and endpoints; dynamic endpoint references; centralization vs decentralization; message sizes, number of calls, resource use; dealing with broken links; and number of endpoints and API complexity.
Data Transfer Resource allows exchanging data between participants without them knowing each other or being available at the same time. You may introduce a shared storage endpoint with a State Creation Operation and a Retrieval Operation. The design considerations include coupling (time and location dimensions); communication constraints; reliability; scalability; storage space efficiency; latency; ownership management; access control; (lack of) coordination; optimistic locking; polling; and garbage collection.
5.2 Operation Responsibilities
The four operation responsibilities include:
State Creation Operation to allow clients to report that something has happened, e.g. to trigger instant or later processing. Its design concerns include: coupling trade-offs (accuracy and expressiveness vs information parsimony); timing; consistency; and reliability. It may or may not have fire-and-forget semantics, and idempotency may be needed for the transaction boundary. A popular variant of this pattern is the Event Notification Operation, which notifies the endpoint about an external event.
Retrieval Operation to retrieve information and allow further client-side processing. The design issues include: veracity, variety, velocity and volume; workload management; network efficiency vs data parsimony.
State Transition Operation to allow a client to initiate a processing action that causes the provider-side application state to change. The design concerns include service granularity; consistency and auditability; dependencies on state changes being made beforehand; workload management; and network efficiency vs data parsimony. State Transition Operations are generally transactional, supporting ACID behavior, and may use ABAC for compliance and security controls.
Computation Function to allow a client to invoke side-effect-free remote processing on the provider side based on the input parameters. The relevant design issues include reproducibility and trust; performance; and workload management. Examples include a Transformation Service, Validation Service, and Long Running Computation.
6. Design Request and Response Message Representations
This chapter corresponds to the Design phase of the Align-Define-Design-Refine (ADDR) process and examines structural patterns for requests and responses. The challenges when designing message representations include interoperability at the protocol and message-content levels; latency; throughput and scalability; maintainability; and developer experience.
6.1 Data Element
The Data Element pattern allows exchanging application-level information between API clients and API providers without coupling them too tightly. The competing concerns include rich functionality vs ease of processing and performance; security and data privacy vs ease of configuration; and maintainability vs flexibility.
6.2 Metadata Element
The Metadata Element pattern allows enriching messages with additional information so that the receiver can interpret the message content correctly. The design concerns include interoperability and ease of use vs runtime efficiency. The variants of this pattern include Control Metadata Elements such as identifiers, flags, filters, ACLs, API Keys, etc.; Aggregated Metadata Elements such as Pagination counters and statistical information; and Provenance Metadata Elements such as message/request IDs, creation dates, version numbers, etc.
6.3 ID Element
The ID Element pattern helps identify elements of the Published Language using a UUID or surrogate key when applying domain-driven design. The identification problems include effort vs stability; reliability for machines and humans; and security.
6.4 Link Element
Link Element is used to reference API endpoints and operations in request and response message payloads so that they can be called remotely.
6.5 API Key
API Key allows an API provider to identify and authenticate clients and their requests. The design issues include establishing basic security; access control; avoiding the need to store or transmit user credentials; decoupling clients from their organizations; security vs ease of use; and performance.
6.6 Error Report
Error Report allows an API provider to inform its clients about communication and processing faults. The design concerns include expressiveness and target audience expectations; robustness and reliability; security and performance; interoperability and portability; and internationalization.
6.7 Context Representation
Context Representation allows API consumers and providers to exchange context information without relying on any particular remoting protocol. The design considerations include interoperability and modifiability; dependency on evolving protocols; developer productivity (control vs convenience); diversity of clients and their requirements; end-to-end security; and logging and auditing at the business domain level.
7. Refine Message Design for Quality
This chapter reviews API Quality patterns related to the Design and Refine phases of the Align-Define-Design-Refine (ADDR) process. The major challenges with API Quality include message sizes vs number of requests; information needs of individual clients; network bandwidth usage vs computation efforts; implementation complexity vs performance; statelessness vs performance; and ease of use vs legacy.
7.1 Message Granularity
The message granularity patterns deal with performance and scalability; modifiability and flexibility; data quality; data privacy; and data freshness vs consistency. These patterns include:
Embedded Entity allows placing a Data Element in the request or response to avoid exchanging multiple messages.
Linked Information Holder can be used to keep the message small when an API deals with multiple information elements that reference each other.
These patterns deal with performance, scalability, and resource use; information needs of individual clients; loose coupling and interoperability; developer experience; security and data privacy; and test and maintenance effort. These patterns include:
Pagination allows an API provider to deliver large sequences of structured data without overwhelming clients. The design concerns include session awareness and isolation; and data set size and data access profile. The variants of this pattern include Page-Based Pagination, Cursor-Based Pagination, Time-Based Pagination and Offset-Based Pagination.
Wish List allows an API client to inform the API provider at runtime about the data it is interested in.
Wish Template allows an API client to inform the API provider about the nested data that it is interested in. For example, GraphQL query and mutation schemas act as Wish Templates, providing declarative descriptions of the client requirements.
These patterns balance competing forces such as the complexity of endpoint, client, and message payload design; and the accuracy of reporting and billing. These patterns include:
Conditional Request prevents unnecessary server-side processing and bandwidth usage by invoking an API operation only when the condition is true. The design concerns include size of messages; client workload; provider workload; and data currentness vs correctness. Its variants include Time-Based Conditional Request (e.g. the If-Modified-Since HTTP header) and Fingerprint-Based Conditional Request (e.g. the ETag and If-None-Match HTTP headers); a client-side sketch of the latter follows this list.
Request Bundle works as a data container that assembles multiple independent requests in a single request message with unique request identifiers.
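As an illustration of the Fingerprint-Based Conditional Request variant (not an example from the book), a Go client can send the previously cached ETag and skip downloading the payload when the provider responds with 304 Not Modified; the URL below is a placeholder:
package main

import (
    "fmt"
    "io"
    "net/http"
)

// fetchIfChanged issues a conditional GET: it sends the cached ETag in the
// If-None-Match header and only downloads the body when the resource changed.
func fetchIfChanged(url, cachedETag string) (body []byte, etag string, changed bool, err error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, cachedETag, false, err
    }
    if cachedETag != "" {
        req.Header.Set("If-None-Match", cachedETag)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, cachedETag, false, err
    }
    defer resp.Body.Close()
    if resp.StatusCode == http.StatusNotModified {
        // The representation has not changed: reuse the cached copy; no payload was transferred.
        return nil, cachedETag, false, nil
    }
    body, err = io.ReadAll(resp.Body)
    return body, resp.Header.Get("ETag"), true, err
}

func main() {
    // Placeholder endpoint that supports ETag-based conditional requests.
    body, etag, changed, err := fetchIfChanged("https://api.example.com/customers/42", "")
    fmt.Println(len(body), etag, changed, err)
}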
8. Evolve API
This chapter reviews API Evolution patterns related to the Refine phase of the Align-Define-Design-Refine (ADDR) process. The major challenges with API Evolution include autonomy, loose coupling, extensibility, compatibility and sustainability. The patterns in this chapter include:
8.1 Versioning and Compatibility Management
These patterns include:
Version Identifier allows an API provider to indicate its current capabilities as well as the existence of possibly incompatible changes to clients. The design concerns include accuracy and exact identification; no accidental breakage of compatibility; client-side impact; and traceability of API versions in use.
Semantic Versioning allows stakeholders to compare API versions to detect incompatible changes. The design concerns include minimal effort to detect version incompatibility; clarity of change impact; clear separation of changes with different levels of impact and compatibility; manageability of API versions and related governance effort; and clarity with regard to the evolution timeline. The common numbering scheme in Semantic Versioning includes a major version, minor version and patch version.
8.2 Life-Cycle Management Guarantees
These patterns include:
Experimental Preview allows providers to make the introduction of a new API or new API version less risky for their clients and to obtain early adopter feedback without freezing the API design prematurely. The design considerations include innovation and new features; feedback; focus of effort; early learning; and security.
Aggressive Obsolescence allows API providers to reduce the effort of maintaining an entire API by removing unused or deprecated features. The design concerns include minimizing the maintenance effort; reducing forced changes to clients in a given time span as a consequence of API changes; acknowledging power dynamics; and commercial goals and constraints. An API feature moves through the Release, Deprecate and Decommission states in the Aggressive Obsolescence lifecycle.
Limited Lifetime Guarantee allows a provider to let clients know how long they can rely on the published version of an API. The design considerations include making client-side changes caused by API changes plannable, and limiting the maintenance effort for supporting old clients.
Two in Production allows a provider to gradually update an API without breaking existing clients and without maintaining a large number of API versions in production. The design concerns include allowing the provider and the client to follow different life cycles; guaranteeing that API changes do not lead to undetected backward compatibility problems between clients and provider; ensuring the ability to roll back if a new API version is designed badly; minimizing changes to the client; and minimizing the maintenance effort for supporting clients relying on old API versions.
9. Document and Communicate API Contracts
This chapter does not correspond to any phase of the Align-Define-Design-Refine (ADDR) process. The challenges for Documenting APIs include interoperability; compliance; information hiding; economic aspects; performance and reliability; meter granularity; attractiveness from a consumer point of view. The patterns in the chapter include:
API Description to share knowledge between the API provider and its clients; related concerns include interoperability; consumability; information hiding; and extensibility and evolvability.
Pricing Plan to allow the API provider to meter API service consumption and charge for it. Its variants include Subscription-Based Pricing, Usage-Based Pricing, and Market-Based Pricing.
Rate Limit to allow the API provider to prevent excessive API usage by clients, with design considerations for economic aspects; performance; reliability; impact and severity of risks of API abuse; and client awareness.
Service Level Agreement to allow the API client to learn about the specific quality-of-service characteristics of an API and its endpoint operations. The design concerns include business agility and vitality; attractiveness from the consumer point of view; availability; performance and scalability; security and privacy; government regulations and legal obligations; and cost-efficiency and business risks from a provider point of view.
10. Real-World Pattern Stories
This chapter examines API design and evolution in real-world business domains. The first case study discusses large-scale process integration in Terravis, a Swiss mortgage business that had to adapt to a new law for the digitization of Swiss land registry businesses. Using the context dimensions defined by Philippe Kruchten, the Terravis platform was characterized in terms of system size, system criticality, system age, team distribution, rate of change, preexistence of a stable architecture, governance and business model. Terravis applied the role and status of the API as well as other patterns such as Solution-Internal API, Community API, API Description, Service Level Agreement, Semantic Versioning, Error Report, Pricing Plan, Rate Limit, Context Representation, State Creation Operation, State Transition Operation, Pagination, etc. The other case study showed how an internal system at the concrete column manufacturer SACAC had to integrate different existing software such as ERP and CAD systems. The chapter used Philippe Kruchten's project dimensions to describe the project. The key challenges included the correctness of all calculations, and the solution applied the book's guidelines for roles and status of the API. The API used Solution-Internal API, Frontend Integration, Backend Integration, State Creation Operations, State Transition Operations, Retrieval Operations, Computation Functions, etc.
11. Conclusion
The last chapter concludes with how the pattern language in the book helps integration architects, API developers and other roles involved in API design and evolution. The authors also suggest how APIs can be refactored toward the patterns described in the book and how the Microservice Domain-Specific Language (MDSL) tools can support such refactoring. The chapter also describes advancements in API protocols and standards such as HTTP/2, HTTP/3, and gRPC. The OpenAPI Specification is the dominant API description language for HTTP-based APIs, and AsyncAPI is gaining adoption for message-based APIs; MDSL bindings can also be generated for these.