Microservices « Shahzad Bhatti

September 21, 2024

Robust Retry Strategies for Building Resilient Distributed Systems

Filed under: API,Computing,Microservices — admin @ 10:50 am

Introduction

Distributed systems inherently involve multiple components such as services, databases, networks, etc., which are spread across different machines or locations. These systems are prone to partial failures, where one part of the system may fail while others remain operational. A common strategy for building fault-tolerant and resilient systems is to recover from transient failures by retrying failed operations. Here are some common use cases for implementing retries to maintain reliability in such environments:

Recover from Transient Failures such as network glitches, dropped packets, or temporary unavailability of services. These failures are often short-lived, and a simple retry may succeed without any changes to the underlying system.
Recover from Network Instability due to packet loss, latency, congestion, or intermittent connectivity can disrupt communication between services.
Recover from Load Shedding or Throttling where services may experience momentary overloads and are unable to handle incoming requests.
Asynchronous Processing or Eventual Consistency models may take time to converge state across different nodes or services and operations might fail temporarily if the system is in an intermediate state.
Fault Isolation in microservices architectures, where services are loosely coupled but depend on one another. The downstream services may fail temporarily due to a service restart, deployment or scaling activities.
Service Downtime affects availability of services but client application can use retries to recover from minor faults and maintain availability.
Load Balancing and Failover with redundant Zones/Regions so that when a request to one zone/region fails but can be handled by another healthy region or zone.
Partial Failures where one part of the system fails while the rest remains functional (partial failures).
Build System Resilience to allow the system to self-heal from minor disruptions.
Race Conditions or timing-related issues in concurrent systems can be resolved with retries.

Challenges with Retries

Retries help in recovering from transient or partial failures by resending requests, but they can worsen system overloads if not managed carefully. Here are some challenges associated with retries:

Retry Storms: A retry storm occurs when multiple clients or services simultaneously retry failed requests to an overloaded or recovering service. This flood of retries can exacerbate the problem and can lead to performance degradation or a self-inflicted Denial of Service (DoS) attack.
Idempotency and Data Consistency: Some operations are not idempotent and performing them multiple times can lead to inconsistent or incorrect results (e.g., processing a financial transaction multiple times).
Cascading Failures: Retrying can propagate failures upstream or to dependent services. For instance, when a service fails and clients retry excessively, which can overwhelm downstream services.
Latency Amplification: Retrying failed operations can increase end-to-end latency, as each retry adds a delay before successful resolution.
Amplified Resource Consumption: Retried operations consume additional CPU, memory, and bandwidth, potentially depleting resources at a faster rate. Even when services eventually succeed, the increased load from retries can harm the overall system.
Retry Loops or Infinite Retries: If a failed operation is retried continuously without ever succeeding, it can potentially lead to system crashes.
Threads and connections starvation: When a service invokes multiple operations and some fail, it may retry all operations, leading to increased overall request latency. If high timeouts are set, threads and connections remain occupied, blocking new traffic.
Unnecessary Retries on Non-Retryable Failures: Retrying certain types of failures, like authorization errors or malformed requests is unnecessary and wastes system resources.
Timeout Mismatch Between Services: If the timeout settings for retries between services are not aligned, a downstream service may still be processing a request while the upstream service retries or times out that can result in conflicting states.

Considerations for Retries

Here are some key considerations and best practices for implementing more effective and safer retry mechanisms in distributed systems, enhancing resilience while safeguarding system stability during periods of stress or failure:

Timeouts: Implement timeouts to prevent clients from waiting indefinitely for a response and reduce resource exhaustion (e.g., memory or threads) caused by prolonged waiting. The challenge lies in selecting the appropriate timeout value: if set too high, resources are wasted; if set too low, it can trigger excessive retries, which increases the risk of outages. It’s recommended to set timeouts that are tightly aligned with performance expectations, ideally less than 2-times your maximum response time to avoid thread starvation. Additionally, monitor for early warning signs by setting alarms when performance degrades (e.g., when P99 latency approaches 50% of the timeout value).
Timeout Budgeting: In complex distributed systems, timeout budgeting ensures that the total time taken by a request across multiple services doesn’t exceed an acceptable limit. Each downstream service gets a portion of the total timeout, so failure in one service doesn’t excessively delay the entire request chain.
Exponential Backoff: Implement exponential backoff to spread out retry attempts by gradually increasing the delay between retries, reducing the risk of overwhelming a failing component and allowing time for recovery. It’s important to cap the backoff duration and limit the total number of retries. Without these limits, the system might continue retrying unnecessarily even after the underlying issue has been resolved.
Jitter: Adding randomness (jitter) to the backoff process helps prevent synchronized retries that could lead to overload spikes. Jitter is useful for spreading out traffic spikes and periodic tasks to avoid large bursts of traffic at regular intervals for improving system stability.
Idempotency: Operations that are retried must be idempotent, meaning they can be safely repeated without causing unintended side effects (e.g., double payments or duplicated data).
Retry Limits: Retries should be capped at a certain limit to avoid endlessly retrying a failing operation. Retries should stop beyond a certain number of attempts and the failure should be escalated or reported.
Throttling and Rate Limiting: Implement throttling or rate limiting and control the number of requests a service handles within a given time period. Rate limiting can be dynamic, which is adjusted based on current load or error rates, and avoid system overloads during traffic spikes. In addition, low-priority requests can be shed during high load situations.
Error Categorization: Not all errors should trigger retries and use an allowlist for known retryable errors and only retry those. For example, 400 Bad Request (indicating a permanent client error) due to invalid input should not be retried, while server-side or network-related errors with a 500 Internal Server Error (a likely transient issue) can benefit from retrying.
Targeting Failing Components Only: In a partial failure, not all parts of the system are down and retries help isolate and recover from the failing components by retrying operations specifically targeting the failed resource. For example, if a service depends on multiple microservices for an operation and one of the service fails, the system should retry the failed request without repeating the entire operation.
Intelligent and Adaptive Retries: Design retry logic to take the system’s current state into account, such as checking service health or load conditions before retrying. For example, increase retry intervals if multiple components are detected as failing or retry quickly for timeout errors but back off more for connection errors.. This prevents retries when the system is already known to be overloaded.

Retrying at Different Levels: Retries can be implemented at various levels to handle partial failures such as application level, middleware/proxy (load-balancer or API gateway), transport level (network). For example, a distributed system using a load balancer can detect if a specific instance of a service is failing and reroute traffic to a healthy instance that triggers retries only for the requests that target the failing instance.
Retry Amplification: In multi-tiered architectures, if retries are implemented at each level of nested service calls, it can lead to increased latency and exponentially higher traffic. To mitigate this, implement retries only at critical points in the call chain, and ensure that each service has a clear retry policy with limits. Use short timeouts to prevent thread starvation when calls to downstream services take too long. If too many threads hang, new traffic will be blocked.
Retry Budget: Implementing a global limit on the number of retries across all operations helps prevent system overload. For example, using an algorithm like Leaky Bucket can regulate the number of retries within a specified time period. This ensures that retries are distributed evenly and don’t exceed system capacity, preventing resource exhaustion during high failure rates.
Retries with Circuit Breakers: The circuit breaker pattern can be combined with retries to avoid overwhelming a failing component. When a service starts failing, the circuit breaker opens, temporarily halting requests to that service until it is healthy again. Retries can be configured to happen only after the circuit breaker transitions to a half-open state, which allows a limited number of retries to test if the service has recovered.
Retries with Failover Mechanisms: Retries can be designed with failover strategies where the system switches to a backup service, region, or replica in case of partial failure. If a service in one region fails then the retries can redirect requests to a different region or zone for ensuring availability.
Latency Sensitivity: Services with strict latency requirements might not tolerate long backoff periods or extended retries so they should minimize number of retries and cap backoff times.
Sync Calls: For synchronous calls, retry once immediately to handle temporary network issues and avoid multiple retries that could lead to thread starvation. Avoid excessive sleeping of threads between retries, which can lead to thread starvation. Also, a Circuit Breaker can be used to prevent retrying if a high percentage of calls fail.
Async Calls: Use exponential backoff with jitter for asynchronous operations and use Circuit Breakers to stop retries when failure rates are high. Asynchronous APIs can queue requests for later retries, but should incorporate health checks to ensure that retry attempts don’t add excessive load to downstream services during recovery periods.
Retrying on Overload Responses: Recognize overload indicators (e.g., HTTP 503 responses) and avoid retries when the response indicates overload.
Fail-Fast: Detect issues early and fails quickly rather than continuing to process failing requests or operations to avoid wasting time on requests that are unlikely to succeed.
Graceful Degradation: Provide an alternative method of handling requests when a service fails. For example, if a primary service is down, a cached result or a simpler backup service can be used instead.
Downstream Bugs: Rather than implementing retry-based workarounds, prioritize having downstream service owners address and resolve the underlying issues.
Monitor and Analyze Retry Patterns: Implement monitoring for retry attempts and success rates, and analyze the data to gain insights into system behavior during failures. Use these insights to optimize retry strategies, such as adjusting backoff intervals and fine-tuning timeouts for improved system performance.
SLAs with Downstream Services: Establish clear service-level agreements (SLAs) with downstream services about call frequency, failure rates, and latency expectations.
Availability Over Consistency: Prioritize service availability over consistency where possible, especially during retries or failure handling. In such cases, retries might return stale data or cause inconsistency issues, so it’s crucial to align retry policies with system design.
Chaos Engineering: Chaos engineering involves intentionally injecting failures, such as server crashes or network disruptions, into a system to test its resilience under adverse conditions. By simulating real-world failures, teams can identify weaknesses and ensure that the retry policies are working as expected.
Bulkhead Pattern: The bulkhead pattern isolates different parts of a system to prevent a failure in one part from affecting the rest of the system. The bulkheads can be implemented by limiting the number of resources (threads, memory, connections) allocated to each service or subsystem so that if one service becomes overloaded or fails, it won’t exhaust resources that other services need.
System Design: It’s essential to design APIs to minimize unnecessary communication with the server. For instance, in an event-driven architecture, if an event is missing a required attribute, the application might need to make additional requests to retrieve that data, increasing system load. To avoid this, ensure that events are fully populated with all necessary information upfront.

Summary

Retries are an essential mechanism for building fault-tolerant distributed systems and to recover from transient failures such as network issues, service unavailability, and partial system outages. A well-implemented retry strategy improves system resilience by ensuring that temporary failures don’t lead to full-blown outages. Techniques such as exponential backoff with jitter, idempotency, token buckets to limit retries locally, and circuit breakers help manage retries effectively, preventing issues like retry storms, resource exhaustion, and latency amplification.

However, retries need careful management because without proper limits, retries can overwhelm services that are already struggling or exacerbate issues like cascading failures and thread starvation. Incorporating timeouts, retry limits, and adaptive retry mechanisms based on system health can prevent these negative side effects. By analyzing retry patterns and adopting error-specific handling strategies, distributed systems can strike a balance between availability and resource efficiency, and ensures robust performance even in the face of partial failures.

Comments (0)

August 28, 2023

Mitigate Production Risks with Phased Deployment

Filed under: Computing,Microservices — admin @ 6:08 pm

Phased deployment is a software deployment strategy where new software features, changes, or updates are gradually released to a subset of a productâ€™s user base rather than to the entire user community at once. The goal is to limit the impact of any potential negative changes and to catch issues before they affect all users. It’s often a part of modern Agile and DevOps practices, allowing teams to validate software in stagesâ€”through testing environments, to specific user segments, and finally, to the entire user base. The phased deployment solves following issues with the production changes:

Risk Mitigation: Deploying changes all at once can be risky, especially for large and complex systems. Phase deployment helps to mitigate this risk by gradually releasing the changes and carefully monitoring their impact.
User Experience: With phased deployment, if something goes wrong, it affects only a subset of users. This protects the larger user base from potential issues and negative experiences.
Performance Bottlenecks: By deploying in phases, you can monitor how the system performs under different loads, helping to identify bottlenecks and scaling issues before they impact all users.
Immediate Feedback: Quick feedback loops with stakeholders and users are established. This immediate feedback helps in quick iterations and refinements.
Resource Utilization: Phased deployment allows for better planning and use of resources. You can allocate just the resources you need for each phase, reducing waste.

The phased deployment applies following approaches for detecting production issues early in the deployment process:

Incremental Validation: As each phase is a limited rollout, you can carefully monitor and validate that the software is working as expected. This enables early detection of issues before they become widespread.
Isolation of Issues: If an issue does arise, its impact is restricted to a smaller subset of the system or user base. This makes it easier to isolate the problem, fix it, and then proceed with the deployment.
Rollbacks: In the event of a problem, it’s often easier to rollback changes for a subset of users than for an entire user base. This allows for quick recovery with minimal impact.
Data-driven Decisions: The metrics and logs gathered during each phase can be invaluable for making informed decisions, reducing the guesswork, and thereby reducing errors.
User Feedback: By deploying to a limited user set first, you can collect user feedback that can be crucial for understanding how the changes are affecting user interaction and performance. This provides another opportunity for catching issues before full-scale deployment.
Best Practices and Automation: Phase deployment often incorporates industry best practices like blue/green deployments, canary releases, and feature flags, all of which help in minimizing errors and ensuring a smooth release.

Building CI/CD Process for Phased Deployment

Continuous Integration (CI)

Continuous Integration (CI) is a software engineering practice aimed at regularly merging all developers’ working copies of code to a shared mainline or repository, usually multiple times a day. The objective is to catch integration errors as quickly as possible and ensure that code changes by one developer are compatible with code changes made by other developers in the team. The practice defines following steps for integrating developers’ changes:

Code Commit: Developers write code in their local environment, ensuring it meets all coding guidelines and includes necessary unit tests.
Pull Request / Merge Request: When a developer believes their code is ready to be merged, they create a pull request or merge request. This action usually triggers the CI process.
Automated Build and Test: The CI server automatically picks up the new code changes that may be in a feature branch and initiates a build and runs all configured tests.
Code Review: Developers and possibly other stakeholders review the test and build reports. If errors are found, the code is sent back for modification.
Merge: If everything looks good, the changes are merged into main branch of the repository.
Automated Build: After every commit, automated build processes compile the source code, create executables, and run unit/integration/functional tests.
Automated Testing: This stage automatically runs a suite of tests that can include unit tests, integration tests, test coverage and more.
Reporting: Generate and publish reports detailing the success or failure of the build, lint/FindBugs, static analysis (Fortify), dependency analysis, and tests.
Notification: Developers are notified about the build and test status, usually via email, Slack, or through the CI systemâ€™s dashboard.
Artifact Repository: Store the build artifacts that pass all the tests for future use.

Above continuous integration process allows immediate feedback on code changes, reduces integration risk, increases confidence, encourages better collaboration and improves code quality.

Continuous Deployment (CD)

The Continuous Deployment (CD) further enhances this by automating the delivery of applications to selected infrastructure environments. Where CI deals with build, testing, and merging code, CD takes the code from CI and deploys it directly into the production environment, making changes that pass all automated tests immediately available to users. The above workflow for Continuous Integration is added with following additional steps:

Code Committed: Once code passes all tests during the CI phase, it moves onto CD.
Pre-Deployment Staging: Code may be deployed to a staging area where it undergoes additional tests that could be too time-consuming or risky to run during CI. The staging environment can be divided into multiple environments such as alpha staging for integration and sanity testing, beta staging for functional and acceptance testing, and gamma staging environment for chaos, security and performance testing.
Performance Bottlenecks: The staging environment may execute security, chaos, shadow and performance tests to identify bottlenecks and scaling issues before deploying code to the production.
Deployment to Production: If the code passes all checks, it’s automatically deployed to production.
Monitoring & Verification: After deployment, automated systems monitor application health and performance. Some systems use Canary Testing to continuously verify that deployed features are behaving as expected.
Rollback if Necessary: If an issue is detected, the CD system can automatically rollback to a previous, stable version of the application.
Feedback Loop: Metrics and logs from the deployed application can be used to inform future development cycles.

The Continuous Deployment process results in faster time-to-market, reduced risk, greater reliability, improved quality, and better efficiency and resource utilization.

Phased Deployment Workflow

Phased Deployment allows rolling out a change in increments rather than deploying it to all servers or users at once. This strategy fits naturally into a Continuous Integration/Continuous Deployment (CI/CD) pipeline and can significantly reduce the risks associated with releasing new software versions. The CI/CD workflow is enhanced as follows:

Code Commit & CI Process: Developers commit code changes, which trigger the CI pipeline for building and initial testing.
Initial Deployment to Dev Environment: After passing CI, the changes are deployed to a development environment for further testing.
Automated Tests and Manual QA: More comprehensive tests are run. This could also include security, chaos, shadow, load and performance tests.
Phase 1 Deployment (Canary Release): Deploy the changes to a small subset of the production environment or users and monitor closely. If you operate in multiple data centers, cellular architecture or geographical regions, consider initiating your deployment in the area with the fewest users to minimize the impact of potential issues. This approach helps in reducing the “blast radius” of any potential problems that may arise during deployment.
PreProd Testing: In the initial phase, you may optionally first deploy to a special pre-prod environment where you only execute canary testing simulating user requests without actually user-traffic with the production infrastructure so that you can further reduce blast radius for impacting customer experience.
Baking Period: To make informed decisions about the efficacy and reliability of your code changes, it’s crucial to have a ‘baking period’ where the new code is monitored and tested. During this time, you’ll gather essential metrics and data that help in confidently determining whether or not to proceed with broader deployments.
Monitoring and Metrics Collection: Use real-time monitoring tools to track system performance, error rates, and other KPIs.
Review and Approval: If everything looks good, approve the changes for the next phase. If issues are found, roll back and diagnose.
Subsequent Phases: Roll out the changes to larger subsets of the production environment or user base, monitoring closely at each phase. The subsequent phases may use a simple static scheme by adding X servers or user-segments at a time or geometric scheme by exponentially doubling the number of servers or user-segments after each phase. For instance, you can employ mathematical formulas like 2^N or 1.5^N, where N represents the phase number, to calculate the scope of the next deployment phase. This could pertain to the number of servers, geographic regions, or user segments that will be included.
Subsequent Baking Periods: As confidence in the code increases through successful earlier phases, the duration of subsequent ‘baking periods’ can be progressively shortened. This allows for an acceleration of the phased deployment process until the changes are rolled out to all regions or user segments.
Final Rollout: After all phases are successfully completed, deploy the changes to all servers and users.
Continuous Monitoring: Even after full deployment, keep running Canary Tests for validation and monitoring to ensure everything is working as expected.

Thus, phase deployment further mitigates risk, improves user experience, monitoring and resource utilization. If a problem is identified, it’s much easier to roll back changes for a subset of users, reducing negative impact.

Criteria for Selecting Targets for Phased Deployment

When choosing targets for phased deployment, you have multiple options, including cells within a Cellular Architecture, distinct Geographical Regions, individual Servers within a data center, or specific User Segments. Here are some key factors to consider while making your selection:

Risk Assessment: The first step in selecting cells, regions, or user-segments is to conduct a thorough risk assessment. The idea is to understand which areas are most sensitive to changes and which are relatively insulated from potential issues.
User Activity: Regions with lower user activity can be ideal candidates for the initial phases, thereby minimizing the impact if something goes wrong.
Technical Constraints: Factors such as server capacity, load balancing, and network latency may also influence the selection process.
Business Importance: Some user-segments or regions may be more business-critical than others. Starting deployment in less critical areas can serve as a safe first step.
Gradual Scale-up: Mathematical formulas like 2^N or 1.5^N where N is the phase number can be used to gradually increase the size of the deployment target in subsequent phases.
Performance Metrics: Utilize performance metrics like latency, error rates, etc., to decide on next steps after each phase.

Always start with the least risky cells, regions, or user-segments in the initial phases and then use metrics and KPIs to gain confidence in the deployed changes. After gaining confidence from initial phases, you may initiate parallel deployments cross multiple environments, perhaps even in multiple regions simultaneously. However, you should ensure that each environment has its independent monitoring to quickly identify and isolate issues. The rollback strategy should be tested ahead of time to ensure it works as expected before parallel deployment. You should keep detailed logs and documentation for each deployment phase and environment.

Cellular Architecture

Phased deployment can work particularly well with a cellular architecture, offering a systematic approach to gradually release new code changes while ensuring system reliability. In cellular architecture, your system is divided into isolated cells, each capable of operating independently. These cells could represent different services, geographic regions, user segments, or even individual instances of a microservices-based application. For example, you can identify which cells will be the first candidates for deployment, typically those with the least user traffic or those deemed least critical.

The deployment process begins by introducing the new code to an initial cell or a small cluster of cells. This initial rollout serves as a pilot phase, during which key performance indicators such as latency, error rates, and other metrics are closely monitored. If the data gathered during this ‘baking period’ indicates issues, a rollback is triggered. If all goes well, the deployment moves on to the next set of cells. Subsequent phases follow the same procedure, gradually extending the deployment to more cells. Utilizing phased deployment within a cellular architecture helps to minimize the impact area of any potential issues, thus facilitating more effective monitoring, troubleshooting, and ultimately a more reliable software release.

Blue/Green Deployment

The phased deployment can employ the Blue/Green deployment strategy where two separate environments, often referred to as “blue” and “green,” are configured. Both are identical in terms of hardware, software, and settings. The Blue environment runs the current version of the application and serves all user traffic. The Green is a clone of the Blue environment where the new version of the application is deployed. This helps phased deployment because one environment is always live, thus allow for releasing new features without downtime. If issues are detected, traffic can be quickly rerouted back to the Blue environment, thus minimizing the risk and impact of new deployments. The Blue/Green deployment includes following steps:

Preparation: Initially, both Blue and Green environments run the current version of the application.
Initial Rollout: Deploy the new application code or changes to the Green environment.
Verification: Perform tests on the Green environment to make sure the new code changes are stable and performant.
Partial Traffic Routing: In a phased manner, start rerouting a small portion of the live traffic to the Green environment. Monitor key performance indicators like latency, error rates, etc.
Monitoring and Decision: If any issues are detected during this phase, roll back the traffic to the Blue environment without affecting the entire user base. If metrics are healthy, proceed to the next phase.
Progressive Routing: Gradually increase the percentage of user traffic being served by the Green environment, closely monitoring metrics at each stage.
Final Cutover: Once confident that the Green environment is stable, you can reroute 100% of the traffic from the Blue to the Green environment.
Fallback: Keep the Blue environment operational for a period as a rollback option in case any issues are discovered post-switch.
Decommission or Sync: Eventually, decommission the Blue environment or synchronize it to the Green environmentâ€™s state for future deployments.

Automated Testing

CI/CD and phased deployment strategy relies on automated testing to validate changes to a subset of infrastructure or users. The automated testing includes a variety of testing types that should be performed at different stages of the process such as:involved:

Functional Testing: testing the new feature or change before initiating phased deployment to make sure it performs its intended function correctly.
Security Testing: testing for vulnerabilities, threats, or risks in a software application before phased deployment.
Performance Testing: testing how the system performs under heavy loads or large amounts of data before and during phased deployment.
Canary Testing: involves rolling out the feature to a small, controlled group before making it broadly available. This also includes testing via synthetic transactions by simulating user requests. This is executed early in the phased deployment process, however testing via synthetic transactions is continuously performed in background.
Shadow Testing: In this method, the new code runs alongside the existing system, processing real data requests without affecting the actual system.
Chaos Testing: This involves intentionally introducing failures to see how the system reacts. It is usually run after other types of testing have been performed successfully, but before full deployment.
Load Testing: test the system under the type of loads it will encounter in the real world before the phased deployment.
Stress Testing: attempt to break the system by overwhelming its resources. It is executed late in the phased deployment process, but before full deployment.
Penetration Testing: security testing where testers try to ‘hack’ into the system.
Usability Testing: testing from the user’s perspective to make sure the application is easy to use in early stages of phased deployment.

Monitoring

Monitoring plays a pivotal role in validating the success of phased deployments, enabling teams to ensure that new features and updates are not just functional, but also reliable, secure, and efficient. By constantly collecting and analyzing metrics, monitoring offers real-time feedback that can inform deployment decisions. Here’s how monitoring can help with the validation of phased deployments:

Real-Time Metrics and Feedback: collecting real-time metrics on system performance, user engagement, and error rates.
Baking Period Analysis: using a “baking” period where the new code is run but closely monitored for any anomalies.
Anomaly Detection: using automated monitoring tools to flag anomalies in real-time, such as a spike in error rates or a drop in user engagement.
Benchmarking: establishing performance benchmarks based on historical data.
Compliance and Security Monitoring: monitoring for unauthorized data access or other security-related incidents.
Log Analysis: using aggregated logs to show granular details about system behavior.
User Experience Monitoring: tracking metrics related to user interactions, such as page load times or click-through rates.
Load Distribution: monitoring how well the new code handles different volumes of load, especially during peak usage times.
Rollback Metrics: tracking of the metrics related to rollback procedures.
Feedback Loops: using monitoring data for continuous feedback into the development cycle.

Feature Flags

Feature flags, also known as feature toggles, are a powerful tool in the context of phased deployments. They provide developers and operations teams the ability to turn features on or off without requiring a code deployment. This capability synergizes well with phased deployments by offering even finer control over the feature release process. The benefits of feature flags include:

Gradual Rollout: Gradually releasing a new feature to a subset of your user base.
Targeted Exposure: Enable targeted exposure of features to specific user segments based on different attributes like geography, user role, etc.
Real-world Testing: With feature flags, you can perform canary releases, blue/green deployments, and A/B tests in a live environment without affecting the entire user base.
Risk Mitigation: If an issue arises during a phased deployment, a feature can be turned off immediately via its feature flag, preventing any further impact.
Easy Rollback: Since feature flags allow for features to be toggled on and off, rolling back a feature that turns out to be problematic is straightforward and doesn’t require a new deployment cycle.
Simplified Troubleshooting: Feature flags simplify the troubleshooting process since you can easily isolate problems and understand their impact.
CICD Compatibility: Feature flags are often used in conjunction with CI/CD pipelines, allowing for features to be integrated into the main codebase even if they are not yet ready for public release.
Conditional Logic: Advanced feature flags can include conditional logic, allowing you to automate the criteria under which features are exposed to users.

A/B Testing

A/B testing, also known as split testing, is an experimental approach used to compare two or more versions of a web page, feature, or other variables to determine which one performs better. In the context of software deployment, A/B testing involves rolling out different variations (A, B, etc.) of a feature or application component to different subsets of users. Metrics like user engagement, conversion rates, or performance indicators are then collected to statistically validate which version is more effective or meets the desired goals better. Phased deployment and A/B testing can complement each other in a number of ways:

Both approaches aim to reduce risk but do so in different ways.
Both methodologies are user-focused but in different respects.
A/B tests offer a more structured way to collect user-related metrics, which can be particularly valuable during phased deployments.
Feature flags, often used in both A/B testing and phased deployment, give teams the ability to toggle features on or off for specific user segments or phases.
If an A/B test shows one version to be far more resource-intensive than another, this information could be invaluable for phased deployment planning.
The feedback from A/B testing can feed into the phased deployment process to make real-time adjustments.
A/B testing can be included as a step within a phase of a phased deployment, allowing for user experience quality checks.
In a more complex scenario, you could perform A/B testing within each phase of a phased deployment.

Safe Rollback

Safe rollback is a critical aspect of a robust CI/CD pipeline, especially when implementing phased deployments. Here’s how safe rollback can be implemented:

Maintain versioned releases of your application so that you can easily identify which version to rollback to.
Always have backward-compatible database changes so that rolling back the application won’t have compatibility issues with the database.
Utilize feature flags so that you can disable problematic features without needing to rollback the entire deployment.
Implement comprehensive monitoring and logging to quickly identify issues that might necessitate a rollback.
Automate rollback procedures.
Keep the old version (Blue) running as you deploy the new version (Green). If something goes wrong, switch the load balancer back to the old version.
Use Canaray releases to roll out the new version to a subset of your infrastructure. If errors occur, halt the rollout and revert the canary servers to the old version.

Following steps should be applied on rollback:

Immediate Rollback: As soon as an issue is detected that can’t be quickly fixed, trigger the rollback procedure.
Switch Load Balancer: In a Blue/Green setup, switch the load balancer back to route traffic to the old version.
Database Rollback: If needed and possible, rollback the database changes. Be very cautious with this step, as it can be risky.
Feature Flag Disablement: If the issue is isolated to a particular feature that’s behind a feature flag, consider disabling that feature.
Validation: After rollback, validate that the system is stable. This should include health checks and possibly smoke tests.
Postmortem Analysis: Once the rollback is complete and the system is stable, conduct a thorough analysis to understand what went wrong.

One critical consideration to keep in mind is ensuring both backward and forward compatibility, especially when altering communication protocols or serialization formats. For instance, if you update the serialization format and the new code writes data in this new format, the old code may become incompatible and unable to read the data if a rollback is needed. To mitigate this risk, you can deploy an intermediate version that is capable of reading the new format without actually writing in it.

Here’s how it works:

Phase 1: Release an intermediate version of the code that can read the new serialization format like JSON, but continues to write in the old format. This ensures that even if you have to roll back after advancing further, the “old” version is still able to read the newly-formatted data.
Phase 2: Once the intermediate version is fully deployed and stable, you can then roll out the new code that writes data in the new format.

By following this two-phase approach, you create a safety net, making it possible to rollback to the previous version without encountering issues related to data format incompatibility.

Sample CI/CD Pipeline

Following is a sample GitHub Actions workflow .yml file that includes elements for build, test and deployment. You can create a new file in your repository under .github/workflows/ called ci-cd.yml:

name: CI/CD Pipeline with Phased Deployment

on:
  push:
    branches:
      - main

env:
  IMAGE_NAME: my-java-app

jobs:
  
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Run Unit Tests
        run: mvn test

  integration-test:
    needs: unit-test
    runs-on: ubuntu-latest
    steps:
      - name: Run Integration Tests
        run: mvn integration-test

  functional-test:
    needs: integration-test
    runs-on: ubuntu-latest
    steps:
      - name: Run Functional Tests
        run: ./run-functional-tests.sh  # Assuming you have a script for functional tests

  load-test:
    needs: functional-test
    runs-on: ubuntu-latest
    steps:
      - name: Run Load Tests
        run: ./run-load-tests.sh  # Assuming you have a script for load tests

  security-test:
    needs: load-test
    runs-on: ubuntu-latest
    steps:
      - name: Run Security Tests
        run: ./run-security-tests.sh  # Assuming you have a script for security tests

  build:
    needs: security-test
    runs-on: ubuntu-latest
    steps:
      - name: Build and Package
        run: |
          mvn clean package
          docker build -t ${{ env.IMAGE_NAME }} .

  phase_one:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Phase One Cells
        run: ./deploy-to-phase-one.sh # Your custom deploy script for Phase One

      - name: Canary Testing
        run: ./canary-test-phase-one.sh # Your custom canary testing script for Phase One

      - name: Monitoring
        run: ./monitor-phase-one.sh # Your custom monitoring script for Phase One

      - name: Rollback if Needed
        run: ./rollback-phase-one.sh # Your custom rollback script for Phase One
        if: failure()

  phase_two:
    needs: phase_one
    # Repeat the same steps as phase_one but for phase_two
    # ...

  phase_three:
    needs: phase_two
    # Repeat the same steps as previous phases but for phase_three
    # ...

  phase_four:
    needs: phase_three
    # Repeat the same steps as previous phases but for phase_four
    # ...

  phase_five:
    needs: phase_four
    # Repeat the same steps as previous phases but for phase_five
    # ...

run-functional-tests.sh, run-load-tests.sh, and run-security-tests.sh would contain the logic for running functional, load, and security tests, respectively. You might use tools like Selenium for functional tests, JMeter for load tests, and OWASP ZAP for security tests.

Conclusion

Phased deployment, when coupled with effective monitoring, testing, and feature flags, offers numerous benefits that enhance the reliability, security, and overall quality of software releases. Here’s a summary of the advantages:

Reduced Risk: By deploying changes in smaller increments, you minimize the impact of any single failure, thereby reducing the “blast radius” of issues.
Real-Time Validation: Continuous monitoring provides instant feedback on system performance, enabling immediate detection and resolution of issues.
Enhanced User Experience: Phased deployment allows for real-time user experience monitoring, ensuring that new features or changes meet user expectations and don’t negatively impact engagement.
Data-Driven Decision Making: Metrics collected during the “baking” period and subsequent phases allow for data-driven decisions on whether to proceed with the deployment, roll back, or make adjustments.
Security & Compliance: Monitoring for compliance and security ensures that new code doesn’t introduce vulnerabilities, keeping the system secure throughout the deployment process.
Efficient Resource Utilization: The gradual rollout allows teams to assess how the new changes affect system resources, enabling better capacity planning and resource allocation.
Flexible Rollbacks: In the event of a failure, the phased approach makes it easier to roll back changes, minimizing disruption and maintaining system stability.
Iterative Improvement: Metrics and feedback collected can be looped back into the development cycle for ongoing improvements, making future deployments more efficient and reliable.
Optimized Testing: Various forms of testing like functional, security, performance, and canary can be better focused and validated against real-world scenarios in each phase.
Strategic Rollout: Feature flags allow for even more granular control over who sees what changes, enabling targeted deployments and A/B testing.
Enhanced Troubleshooting: With fewer changes deployed at a time, identifying the root cause of any issues becomes simpler, making for faster resolution.
Streamlined Deployment Pipeline: Incorporating phased deployment into CI/CD practices ensures a smoother, more controlled transition from development to production.

By strategically implementing these approaches, phased deployment enhances the resilience and adaptability of the software development lifecycle, ensuring a more robust, secure, and user-friendly product.

Comments (0)

August 23, 2023

Failures in MicroService Architecture

Filed under: Computing,Microservices — admin @ 12:54 pm

Microservice architecture is an evolution of Monolithic and Service-Oriented Architecture (SOA), where an application is built as a collection of loosely coupled, independently deployable services. Each microservice usually corresponds to a specific business functionality and can be developed, deployed, and scaled independently. In contrast to Monolithic Architecture that lacks modularity, and Service-Oriented Architecture (SOA), which is more coarse-grained and is prone to a single point of failure, the Microservice architecture offers better support for modularity, independent deployment and distributed development that often uses Conway’s law to organize teams based on the Microservice architecture. However, Microservice architecture introduces several challenges in terms of:

Network Complexity: Microservices communicate over the network, increasing the likelihood of network-related issues (See Fallacies of distributed computing).
Distributed System Challenges: Managing a distributed system introduces complexities in terms of synchronization, data consistency, and handling partial failures.
Monitoring and Troubleshooting: Due to the distributed nature, monitoring and troubleshooting can become more complex, requiring specialized tools and practices.
Potential for Cascading Failures: Failure in one service can lead to failures in dependent services if not handled properly.

Faults, Errors and Failures

The challenges associated with microservice architecture manifest at different stages require understanding concepts of faults, errors and failures:

1. Faults:

Faults in a microservice architecture could originate from various sources, including:

Software Bugs: A defect in one service may cause incorrect behavior but remain dormant until triggered.
Network Issues: Problems in network connectivity can be considered faults, waiting to lead to errors.
Configuration Mistakes: Incorrect configuration of a service is another potential fault.
Dependency Vulnerabilities: A weakness or vulnerability in an underlying library or service that hasn’t yet caused a problem.

Following are major concerns that the Microservice architecture must address for managing faults:

Loose Coupling and Independence: With services being independent, a fault in one may not necessarily impact others, provided the system is designed with proper isolation.
Complexity: Managing and predicting faults across multiple services and their interactions can be complex.
Isolation: Properly isolating faults can prevent them from causing widespread problems. For example, a fault in one service shouldn’t impact others if isolation is well implemented.
Detecting and Managing Faults: Given the distributed nature of microservices, detecting and managing faults can be complex.

2. Error:

When a fault gets activated under certain conditions, it leads to an error. In microservices, errors can manifest as:

Communication Errors: Failure in service-to-service communication due to network problems or incompatible data formats.
Data Inconsistency: An error in one service leading to inconsistent data across different parts of the system.
Service Unavailability: A service failing to respond due to an internal error.

Microservice architecture should include diagnosing and handling errors including:

Propagation: Errors can propagate quickly across services, leading to cascading failures if not handled properly.
Transient Errors: Network-related or temporary errors might be resolved by retries, adding complexity to error handling.
Monitoring and Logging Challenges: Understanding and diagnosing errors in a distributed system can be more complex.

3. Failure:

Failure is the inability of a system to perform its required function due to unhandled errors. In microservices, this might include:

Partial Failure: Failure of one or more services leading to degradation in functionality.
Total System Failure: Cascading errors causing the entire system to become unavailable.

Further, failure handling in Microservice architecture poses additional challenges such as:

Cascading Failures: A failure in one service might lead to failures in others, particularly if dependencies are tightly interwoven and error handling is insufficient.
Complexity in Recovery: Coordinating recovery across multiple services can be challenging.

The faults and errors can be further categorized into customer related and system related:

Customer-Related: These may include improper usage of an API, incorrect input data, or any other incorrect action taken by the client. These might include incorrect input data, calling an endpoint that doesn’t exist, or attempting an action without proper authorization. Since these errors are often due to incorrect usage, simply retrying the same request without fixing the underlying issue is unlikely to resolve the error. For example, if a customer sends an invalid parameter, retrying the request with the same invalid parameter will produce the same error. In many cases, customer errors are returned with specific HTTP status codes in the 4xx range (e.g., 400 Bad Request, 403 Forbidden), indicating that the client must modify the request before retrying.
System-Related: These can stem from various aspects of the microservices, such as coding bugs, network misconfigurations, a timeout occurring, or issues with underlying hardware. These errors are typically not the fault of the client and may be transient, meaning they could resolve themselves over time or upon retrying. System errors often correlate with HTTP status codes in the 5xx range (e.g., 500 Internal Server Error, 503 Service Unavailable), indicating an issue on the server side. In many cases, these requests can be retried after a short delay, possibly succeeding if the underlying issue was temporary.

Causes Related to Faults, Errors and Failures

The challenges in microservice architecture are rooted in its distributed nature, complexity, and interdependence of services. Here are common causes of the challenges related to faults, errors, and failures, including the distinction between customer and system errors:

1. Network Complexity:

Cause: Multiple services communicating over the network where one of the service cannot communicate with other service. For example, Amazon Simple Storage Service (S3) had an outage in Feb 28, 2017 and many services that were tightly coupled failed as well due to limited fault isolation. The post-mortem analysis recommended proper fault isolation, redundancy across regions, and better understanding and managing complex inter-service dependencies.
Challenges: Leads to network-related issues, such as latency, bandwidth limitations, and network partitioning, causing both system errors and potentially triggering faults.

2. Data Consistency:

Cause: Maintaining data consistency across services that use different databases. This can occur where a microservice stores data in multiple data stores without proper anti-entropy validation or uses eventual consistency, e.g. a trading firm might be using CQRS pattern where transaction events are persisted in a write datastore, which is then replicated to a query datastore so user may not see up-to-date data when querying recently stored data.
Challenges: Ensuring transactional integrity and eventual consistency can be complex, leading to system errors if not managed properly.

3. Service Dependencies:

Cause: Tight coupling between services. For example, an online travel booking platform might deploy multiple microservices for managing hotel bookings, flight reservations, car rentals, etc. If these services are tightly coupled, then a minor update to the flight reservation service unintentionally may break the compatibility with the hotel booking service.
Challenges: Cascading failures and difficulty in isolating faults. A failure in one service can easily propagate to others if not properly isolated.

4. Scalability Issues:

Cause: Individual services may require different scaling strategies. For example, Netflix in Oct 29, 2012 suffered a major outage when due to a scaling issue, the Amazon Elastic Load Balancer (ELB) that was used for routing couldn’t route requests effectively. The lessons learned from the incident included improved scaling strategies, redundancy and failover planning, and monitoring and alerting enhancements.
Challenges: Implementing effective scaling without affecting other services or overall system stability. Mismanagement can lead to system errors or even failures.

5. Security Concerns:

Cause: Protecting the integrity and confidentiality of data as it moves between services. For example, on July 19, 2019, CapitalOne had a major security breach for its data that was stored on AWS. A former AWS employee discovered a misconfigured firewall and exploited it, accessing sensitive customer data. The incident caused significant reputational damage and legal consequences to CapitalOne, which then worked on a broader review of security practices, emphasizing the need for proper configuration, monitoring, and adherence to best practices.
Challenges: Security breaches or misconfigurations could be seen as faults, leading to potential system errors or failures.

6. Monitoring and Logging:

Cause: The need for proper monitoring and logging across various independent services to gain insights when microservices are misbehaving. For example, if a service is silently behaving erratically, causing intermittent failures for customers will lead to more significant outage and longer time to diagnose and resolve due to lack of proper monitoring and logging.
Challenges: Difficulty in tracking and diagnosing both system and customer errors across different services.

7. Configuration Management:

Cause: Managing configuration across multiple services. For example, July 20, 2021, WizCase discovered unsecured Amazon S3 buckets containing data from more than 80 US locales, predominantly in New England. The misconfigured S3 buckets included more than 1,000GB of data and more than 1.6 million files. Residentsâ€™ actual addresses, telephone numbers, IDs, and tax papers were all exposed due to the attack. On October 5, 2021, Facebook had nearly six hours due to misconfigured DNS and PGP settings. Oasis cites misconfiguration as a top root cause for security incidents and events.
Challenges: Mistakes in configuration management can be considered as faults, leading to errors and potentially failures in one or more services.

8. API Misuse (Customer Errors):

Cause: Clients using the API incorrectly, sending improper requests. For example, on October 21, 2016, Dyn experienced a massive Distributed Denial of Service (DDoS) attack, rendering a significant portion of the internet inaccessible for several hours. High-profile sites, including Twitter, Reddit, and Netflix, experienced outages. The DDoS attack was primarily driven by the Mirai botnet, which consisted of a large number of compromised Internet of Things (IoT) devices like security cameras, DVRs, and routers. These devices were vulnerable because of default or easily guessable passwords. The attackers took advantage of these compromised devices and used them to send massive amounts of traffic to Dyn’s servers, especially by abusing the devices’ APIs to make repeated and aggressive requests. The lessons learned included better IoT security, strengthening infrastructure and adding API guardrails such as built-in security and rate-limiting.
Challenges: Handling these errors gracefully to guide clients in correcting their requests.

9. Service Versioning:

Cause: Multiple versions of services running simultaneously. For example, conflicts between the old and new versions may lead to unexpected behavior in the system. Requests routed to the new version might be handled differently than those routed to the old version, causing inconsistencies.
Challenges: Compatibility issues between different versions can lead to system errors.

10. Diverse Technology Stack:

Cause: Different services might use different languages, frameworks, or technologies. For example, the diverse technology stack may cause problems with inconsistent monitoring and logging, different vulnerability profiles and security patching requirement, leading to increased complexity in managing, monitoring, and securing the entire system.
Challenges: Increases complexity in maintaining, scaling, and securing the system, which can lead to faults.

11. Human Factors:

Cause: Errors in development, testing, deployment, or operations. For example, Amazon Simple Storage Service (S3) had an outage in Feb 28, 2017, which was caused by a human error during the execution of an operational command. A typo in a command executed by an Amazon team member intended to take a small number of servers offline inadvertently removed more servers than intended.and many services that were tightly coupled failed as well due to limited fault isolation. The post-mortem analysis recommended implementing safeguards against both human errors and system failures.
Challenges: Human mistakes can introduce faults, lead to both customer and system errors, and even cause failures if not managed properly.

12. Lack of Adequate Testing:

Cause: Insufficient unit, integration, functional, and canary testing. For example, on August 1, 2012, Knight Capital deployed untested software to a production environment, resulting in a malfunction in their automated trading system. The flawed system started buying and selling millions of shares at incorrect prices. Within 45 minutes, the company incurred a loss of $440 million. The code that was deployed to production was not properly tested. It contained old, unused code that should have been removed, and the new codeâ€™s interaction with existing systems was not fully understood or verified. The lessons learned included ensuring that all code, especially that which controls critical functions, is thoroughly tested, implementing robust and consistent deployment procedures to ensure that changes are rolled out uniformly across all relevant systems, and having mechanisms in place to quickly detect and halt erroneous behavior, such as a â€œkill switchâ€ for automated trading systems.
Challenges: Leads to undetected faults, resulting in both system and customer errors, and potentially, failures in production.

13. Inadequate Alarms and Health Checks:

Cause: Lack of proper monitoring and health check mechanisms. For example, on January 31, 2017, GitLab suffered a severe data loss incident. An engineer accidentally deleted a production database while attempting to address some performance issues. This action resulted in a loss of 300GB of user data. GitLab’s monitoring and alerting system did not properly notify the team of the underlying issues that were affecting database performance. The lack of clear alarms and health checks contributed to the confusion and missteps that led to the incident. The lessons learned included ensuring that health checks and alarms are configured to detect and alert on all critical conditions, and establishing and enforcing clear procedures and protocols for handling critical production systems, including guidelines for dealing with performance issues and other emergencies.
Challenges: Delays in identifying and responding to faults and errors, which can exacerbate failures.

14. Lack of Code Review and Quality Control:

Cause: Insufficient scrutiny during the development process. For example, on March 14, 2012, the Heartbleed bug was introduced with the release of OpenSSL version 1.0.1 but it was not discovered until April 2014. The bug allowed attackers to read sensitive data from the memory of millions of web servers, potentially exposing passwords, private keys, and other sensitive information. The bug was introduced through a single coding error. There was a lack of rigorous code review process in place to catch such a critical mistake. The lessons learned included implementing a thorough code review process, establishing robust testing and quality control measures to ensure that all code, especially changes to security-critical areas, is rigorously verified.
Challenges: Increases the likelihood of introducing faults and bugs into the system, leading to potential errors and failures.

15. Lack of Proper Test Environment:

Cause: Absence of a representative testing environment. For example, on August 1, 2012, Knight Capital deployed new software to a production server that contained obsolete and nonfunctional code. This code accidentally got activated, leading to unintended trades flooding the market. The algorithm was buying high and selling low, the exact opposite of a profitable strategy. The company did not have a proper testing environment that accurately reflected the production environment. Therefore, the erroneous code was not caught during the testing phase. The lessons learned included ensuring a robust and realistic testing environment that accurately mimics the production system, implementing strict and well-documented deployment procedures and implementing real-time monitoring and alerting to catch unusual or erroneous system behavior.
Challenges: Can lead to unexpected behavior in production due to discrepancies between test and production environments.

16. Elevated Permissions:

Cause: Overly permissive access controls. For example, on July 19, 2019, CapitalOne announced that an unauthorized individual had accessed the personal information of approximately 106 million customers and applicants. The breach occurred when a former employee of a third-party contractor exploited a misconfigured firewall, gaining access to data stored on Amazonâ€™s cloud computing platform, AWS. The lessons learned included implementing the principle of least privilege, robust monitoring to detect and alert on suspicious activities quickly, and evaluating the security practices of third-party contractors and vendors.
Challenges: Increased risk of security breaches and unauthorized actions, potentially leading to system errors and failures.

17. Single Point of Failure:

Cause: Reliance on a single component without redundancy. For example, on January 31, 2017, GitLab experienced a severe data loss incident when an engineer while attempting to remove a secondary database, the primary production database was engineer deleted. The primary production database was a single point of failure in the system. The deletion of this database instantly brought down the entire service. Approximately 300GB of data was permanently lost, including issues, merge requests, user accounts, comments, and more. The lessons learned included eliminating single points of failure, implementing safeguards to protect human error, and testing backups.
Challenges: A failure in one part can bring down the entire system, leading to cascading failures.

18. Large Blast Radius:

Cause: Lack of proper containment and isolation strategies. For example, on September 4, 2018, the Azure South Central U.S. datacenter experienced a significant outage affecting multiple regions. A severe weather event in the southern United States led to cooling failures in one of Azure’s data centers. Automated systems responded to the cooling failure by shifting loads to a data center in a neighboring region. This transfer was larger and faster than anticipated, leading to an overload in the secondary region. The lessons learned included deep understanding of dependencies and failure modes, limiting the blast radius, and continuous improvements in resilience.
Challenges: An error in one part can affect a disproportionate part of the system, magnifying the impact of failures.

19. Throttling and Limits Issues:

Cause: Inadequate management of request rates and quotas. For example, on February 28, 2017, AWS S3 experienced a significant disruption in the US-EAST-1 region, causing widespread effects on many dependent systems. A command to take a small number of servers offline for inspection was executed incorrectly, leading to a much larger removal of capacity than intended. Once the servers were inadvertently removed, the S3 subsystems had to be restarted. The restart process included safety checks, which required specific metadata. However, the capacity removal caused these metadata requests to be throttled. Many other systems were dependent on the throttled subsystem, and as the throttling persisted, it led to a cascading failure. The lessons learned included safeguards against human errors, dependency analysis, and testing throttling mechanisms.
Challenges: Can lead to service degradation or failure under heavy load.

20. Rushed Releases:

Cause: Releasing changes without proper testing or review. For example, on January 31, 2017, GitLab experienced a severe data loss incident. A series of events that started with a rushed release led to an engineer accidentally deleting a production database, resulting in the loss of 300GB of user data. The team was working on addressing performance issues and pushed a release without properly assessing the risks and potential side effects. The team was working on addressing performance issues and pushed a release without properly assessing the risks and potential side effects. The lessons learned included avoiding rushed decisions, clear separation of environments, proper access controls, and robust backup strategy.
Challenges: Increases the likelihood of introducing faults and errors into the system.

21. Excessive Logging:

Cause: Logging more information than necessary. For example, excessive logs can result in disk space exhaustion, performance degradation, service disruption or high operating cost due to additional network bandwidth and storage costs.
Challenges: Can lead to performance degradation and difficulty in identifying relevant information.

22. Circuit Breaker Mismanagement:

Cause: Incorrect implementation or tuning of circuit breakers. For example, on November 18, 2014, Microsoft Azure suffered a substantial global outage affecting multiple services. An update to Azureâ€™s Storage Service included a change to the configuration file governing the circuit breaker settings. The flawed update led to an overly aggressive tripping of circuit breakers, which, in turn, led to a loss of access to the blob front-ends. The lessons learned incremental rollouts, thorough testing of configuration changes, clear understanding of component interdependencies.
Challenges: Potential system errors or failure to protect the system during abnormal conditions.

23. Retry Mechanism:

Cause: Mismanagement of retry logic. For example, on September 20, 2015, an outage in DynamoDB led to widespread disruption across various AWS services. The root cause was traced back to issues related to the retry mechanism. A small error in the system led to a slight increase in latency. Due to an aggressive retry mechanism, the slightly increased latency led to a thundering herd problem where many clients retried their requests almost simultaneously. The absence of jitter (randomization) in the retry delays exacerbated this surge of requests because retries from different clients were synchronized. The lessons learned included proper retry logic with jitter, understanding dependencies, and enhancements to monitoring and alerting.
Challenges: Can exacerbate network congestion and failure conditions, particularly without proper jitter implementation.

24. Backward Incompatible Changes:

Cause: Introducing changes that are not backward compatible. For example, on August 1, 2012, Knight Capital deployed new software to a production environment. This software was intended to replace old, unused code but was instead activated, triggering old, defective functionality. The new software was not compatible with the existing system, and instead of being deactivated, the old code paths were unintentionally activated. The incorrect software operation caused Knight Capital to loss of over $460 million in just 45 minutes. The lessons learned included proper testing, processes for deprecating old code, and robust monitoring and rapid response mechanism.
Challenges: Can break existing clients and other services, leading to system errors.

25. Inadequate Capacity Planning:

Cause: Failure to plan for growth or spikes in usage. For example, on October 21, 2018, GitHub experienced a major outage that lasted for over 24 hours. During this period, various services within GitHub were unavailable or severely degraded. The incident was caused by inadequate capacity planning as GitHubâ€™s database was operating close to its capacity. A routine maintenance task to replace a failing 100G network link set off a series of events that caused the database to failover to a secondary. This secondary didnâ€™t have enough capacity to handle the production load, leading to cascading failures. The lessons learned included capacity planning, regular review of automated systems and building redundancy in critical components.
Challenges: Can lead to system degradation or failure under increased load.

26. Lack of Failover Isolation:

Cause: Insufficient isolation between primary and failover mechanisms. For example, on September 4, 2018, the Azure South Central U.S. datacenter experienced a significant outage. The Incident was caused by a lightning, which resulted in a voltage swell that impacted the cooling systems, causing them to shut down. Many services that were only hosted in this particular region went down completely, showing a lack of failover isolation between regions. The lessons learned included redundancy in critical systems, cross-region failover strategies, and regular testing of failover procedures.
Challenges: Can lead to cascading failures if both primary and failover systems are affected simultaneously.

27. Noise in Metrics and Alarms:

Cause: Too many irrelevant or overly sensitive alarms and metrics. Over time, the number of metrics and alarms may grow to a point where there are thousands of alerts firing every day, many of them false positives or insignificant. The noise level in the alerting system becomes overwhelming. For example, if many alarms are set with thresholds too close to regular operating parameters, they may cause frequent false positives. The operations team became desensitized to alerts, treating them as “normal.” The lessons learned include focusing on the most meaningful metrics and alerts, and regular review and adjust alarm thresholds and relevance to ensure they remain meaningful.
Challenges: Can lead to alert fatigue and hinder the prompt detection and response to real issues, increasing the risk of system errors and failures going unaddressed.

28. Variations Across Environments:

Cause: Differences between development, staging, and production environments. For example, a development team might be using development, testing, staging, and production environments, allowing them to safely develop, test, and deploy their services. However, production environment might be using different versions of database or middleware, using different network topology or production data is different, causing unexpected behaviors that leads to a significant outage.
Challenges: May lead to unexpected behavior and system errors, as code behaves differently in production compared to the test environment.

29. Inadequate Training or Documentation:

Cause: Lack of proper training, guidelines, or documentation for developers and operations teams. For example, if the internal team is not properly trained on the complexities of the microservices architecture, it can lead to misunderstandings of how services interact. Without proper training or documentation, the team may take a significant amount of time to identify the root causes of the issues.
Challenges: Can lead to human-induced faults, misconfiguration, and inadequate response to incidents, resulting in errors and failures.

30. Self-Inflicted Traffic Surge:

Cause: Uncontrolled or unexpected increase in internal traffic, such as excessive inter-service calls. For example, on January 31st 2017, GitLab experienced an incident that, while primarily related to data deletion, also demonstrated a form of self-inflicted traffic surge. While attempting to restore from a backup, a misconfiguration in the application caused a rapid increase in requests to the database. The lessons learned included testing configurations in an environment that mimics production, robust alerting and monitoring, clear understanding of interactions between components.
Challenges: Can overload services, causing system errors, degradation, or even failure.

31. Lack of Phased Deployment:

Cause: Releasing changes to all instances simultaneously without gradual rollout. For example, on August 1, 2012, Knight Capital deployed new software to a production environment. The software was untested in this particular environment, and an old, incompatible module was accidentally activated. The software was deployed to all servers simultaneously instead of gradually rolling it out to observe potential issues. The incorrect software operation caused Knight Capital to accumulate a massive unintended position in the market, resulting in a loss of over $440 million and a significant impact to its reputation. The lessons learned included phased deployment, thorough testing and understanding dependencies.
Challenges: Increases the risk of widespread system errors or failures if a newly introduced fault is triggered.

32. Broken Rollback Mechanisms:

Cause: Inability to revert to a previous stable state due to faulty rollback procedures. For example, a microservice tries to deploy a new version but after the deployment, issues are detected, and the decision is made to rollback. However, the rollback process fails, exacerbating the problem and leading to an extended outage.
Challenges: Can exacerbate system errors or failures during an incident, as recovery options are limited.

33. Inappropriate Timing:

Cause: Deploying new changes during critical periods such as Black Friday. For example, on Black Friday in 2014, Best Buy’s website experienced multiple outages throughout the day, which was caused by some maintenance or deployment actions that coincided with the traffic surge. Best Buy took the site down intermittently to address the issues, which, while necessary, only exacerbated the outage durations for customers. The lessons learned included avoiding deployments on critical days, better capacity planning and employing rollback strategies.
Challenges: Deploying significant changes or conducting maintenance during high-traffic or critical periods can lead to catastrophic failures.

The myriad potential challenges in microservice architecture reflect the complexity and diversity of factors that must be considered in design, development, deployment, and operation. By recognizing and addressing these causes proactively through robust practices, thorough testing, careful planning, and vigilant monitoring, teams can greatly enhance the resilience, reliability, and robustness of their microservice-based systems.

Incident Metrics

In order to prevent common causes of service faults and errors, Microservice environment can track following metrics:

1. MTBF (Mean Time Between Failures):

Prevent: By analyzing MTBF, you can identify patterns in system failures and proactively address underlying issues to enhance stability.
Detect: Monitoring changes in MTBF may help in early detection of emerging problems or degradation in system health.
Resolve: Understanding MTBF can guide investments in redundancy and failover mechanisms to ensure continuous service even when individual components fail.

2. MTTR (Mean Time to Repair):

Prevent: Reducing MTTR often involves improving procedures and tools for diagnosing and fixing issues, which also aids in preventing failures by addressing underlying faults more efficiently.
Detect: A sudden increase in MTTR can signal that something has changed within the system, such as a new fault that’s harder to diagnose, triggering a deeper investigation.
Resolve: Lowering MTTR directly improves recovery by minimizing the time it takes to restore service after a failure. This can be done through automation, streamlined procedures, and robust rollback strategies.

3. MTTA (Mean Time to Acknowledge):

Prevent: While MTTA mainly focuses on response times, reducing it can foster a more responsive monitoring environment, helping to catch issues before they escalate.
Detect: A robust monitoring system that allows for quick acknowledgment can speed up the detection of failures or potential failures.
Resolve: Faster acknowledgment of issues means quicker initiation of resolution processes, which can help in restoring the service promptly.

4. MTTF (Mean Time to Failure):

Prevent: MTTF provides insights into the expected lifetime of a system or component. Regular maintenance, monitoring, and replacement aligned with MTTF predictions can prevent unexpected failures.
Detect: Changes in MTTF patterns can provide early warnings of potential failure, allowing for pre-emptive action.
Resolve: While MTTF doesn’t directly correlate with resolution, understanding it helps in planning failover strategies and ensuring that backups or redundancies are in place for anticipated failures.

Implementing These Metrics:

Utilizing these metrics in a microservices environment requires:

Comprehensive Monitoring: Continual monitoring of each microservice to gather data.
Alerting and Automation: Implementing automated alerts and actions based on these metrics to ensure immediate response.
Regular Review and Analysis: Periodic analysis to derive insights and make necessary adjustments to both the system and the process.
Integration with Incident Management: Linking these metrics with incident management to streamline detection and resolution.

By monitoring these metrics and integrating them into the daily operations, incident management, and continuous improvement processes, organizations can build more robust microservice architectures capable of preventing, detecting, and resolving failures efficiently.

Development Procedures

A well-defined process is essential for managing the complexities of microservices architecture, especially when it comes to preventing, detecting, and resolving failures. This process typically covers various stages, from setting up monitoring and alerts to handling incidents, troubleshooting, escalation, recovery, communication, and continuous improvement. Here’s how such a process can be designed, including specific steps to follow when an alarm is received about the health of a service:

1. Preventing Failures:

Standardizing Development Practices: Creating coding standards, using automated testing, enforcing security guidelines, etc.
Implementing Monitoring and Alerting: Setting up monitoring for key performance indicators and establishing alert thresholds.
Regular Maintenance and Health Checks: Scheduling periodic maintenance, updates, and health checks to ensure smooth operation.
Operational Checklists: Maintaining a checklist for operational readiness such as:
- Review requirements, API specifications, test plans and rollback plans.
- Review logging, monitoring, alarms, throttling, feature flags, and other key configurations.
- Document and understand components of a microservice and its dependencies.
- Define key operational and business metrics for the microservice and setup a dashboard to monitor health metrics.
- Review authentication, authorization and security impact for the service.
- Review data privacy,Â archival and retention policies.
- Define failure scenarios and impact to other services and customers.
- Document capacity planning for scalability, redundancy to eliminate single point of failures and failover strategies.

2. Detecting Failures:

Real-time Monitoring: Constantly watching system metrics to detect anomalies.
Automated Alerting: Implementing automated alerts that notify relevant teams when an anomaly or failure is detected.

3. Responding to Alarms and Troubleshooting:

When an alarm is received:

Acknowledge the Alert: Confirm the reception of the alert and log the incident.
Initial Diagnosis: Quickly assess the scope, impact, and potential cause of the issue.
Troubleshooting: Follow a systematic approach to narrow down the root cause, using tools, logs, and predefined troubleshooting guides.
Escalation (if needed): If the issue cannot be resolved promptly, escalate to higher-level teams or experts, providing all necessary information.

4. Recovery and Mitigation:

Implement Immediate Mitigation: Apply temporary fixes to minimize customer impact.
Recovery Actions: Execute recovery plans, which might include restarting services, reallocating resources, etc.
Rollback (if needed): If a recent change caused the failure, initiate a rollback to a stable version, following predefined rollback procedures.

5. Communication:

Internal Communication: Keep all relevant internal stakeholders informed about the status, actions taken, and expected resolution time.
Communication with Customers: If the incident affects customers, communicate transparently about the issue, expected resolution time, and any necessary actions they need to take.

6. Post-Incident Activities:

Post-mortem Analysis: Conduct a detailed analysis of the incident, identify lessons learned, and update procedures as needed.
Continuous Improvement: Regularly review and update the process, including the alarm response and troubleshooting guides, based on new insights and changes in the system.

A well-defined process for microservices not only provides clear guidelines on development and preventive measures but also includes detailed steps for responding to alarms, troubleshooting, escalation, recovery, and communication. Such a process ensures that the team is prepared and aligned when issues arise, enabling rapid response, minimizing customer impact, and fostering continuous learning and improvement.

Post-Mortem Analysis

When a failure or an incident occurs in a microservice, the development team will need to follow a post-mortem process for analyzing and evaluating an incident or failure. Here’s how post-mortems help enhance fault tolerance:

1. Understanding Root Causes:

A post-mortem helps identify the root cause of a failure, not just the superficial symptoms. By using techniques like the “5 Whys,” teams can delve deep into the underlying issues that led to the fault, such as coding errors, network latency, or configuration mishaps.

2. Assessing Impact and Contributing Factors:

Post-mortems enable the evaluation of the full scope of the incident, including customer impact, affected components, and contributing factors like environmental variations. This comprehensive view allows for targeted improvements.

3. Learning from Failures:

By documenting what went wrong and what went right during an incident, post-mortems facilitate organizational learning. This includes understanding the sequence of events, team response effectiveness, tools and processes used, and overall system resilience.

4. Developing Actionable Insights:

Post-mortems result in specific, actionable recommendations to enhance system reliability and fault tolerance. This could involve code refactoring, infrastructure upgrades, or adjustments to monitoring and alerting.

5. Improving Monitoring and Alerting:

Insights from post-mortems can be used to fine-tune monitoring and alerting systems, making them more responsive to specific failure patterns. This enhances early detection and allows quicker response to potential faults.

6. Fostering a Culture of Continuous Improvement:

Post-mortems encourage a blame-free culture focused on continuous improvement. By treating failures as opportunities for growth, teams become more collaborative and proactive in enhancing system resilience.

7. Enhancing Documentation and Knowledge Sharing:

The documentation produced through post-mortems is a valuable resource for the entire organization. It can be referred to when similar incidents occur, or during the onboarding of new team members, fostering a shared understanding of system behavior and best practices.

Conclusion

The complexity and interdependent nature of microservice architecture introduce specific challenges in terms of management, communication, security, and fault handling. By adopting robust measures for prevention, detection, and recovery, along with adhering to development best practices and learning from post-mortems, organizations can significantly enhance the fault tolerance and resilience of their microservices. A well-defined, comprehensive approach that integrates all these aspects ensures a more robust, flexible, and responsive system, capable of adapting and growing with evolving demands.

Comments (0)

Shahzad Bhatti Welcome to my ramblings and rants!

September 21, 2024

Robust Retry Strategies for Building Resilient Distributed Systems

Introduction

Challenges with Retries

Considerations for Retries

Summary

August 28, 2023

Mitigate Production Risks with Phased Deployment

Building CI/CD Process for Phased Deployment

Continuous Integration (CI)

Continuous Deployment (CD)

Phased Deployment Workflow

Criteria for Selecting Targets for Phased Deployment

Cellular Architecture

Blue/Green Deployment

Automated Testing

Monitoring

Feature Flags

A/B Testing

Safe Rollback

Sample CI/CD Pipeline

Conclusion

August 23, 2023

Failures in MicroService Architecture

Faults, Errors and Failures

1. Faults:

2. Error:

3. Failure:

Causes Related to Faults, Errors and Failures

1. Network Complexity:

2. Data Consistency:

3. Service Dependencies:

4. Scalability Issues:

5. Security Concerns:

6. Monitoring and Logging:

7. Configuration Management:

8. API Misuse (Customer Errors):

9. Service Versioning:

10. Diverse Technology Stack:

11. Human Factors:

12. Lack of Adequate Testing:

13. Inadequate Alarms and Health Checks:

14. Lack of Code Review and Quality Control:

15. Lack of Proper Test Environment:

16. Elevated Permissions:

17. Single Point of Failure:

18. Large Blast Radius:

19. Throttling and Limits Issues:

20. Rushed Releases:

21. Excessive Logging:

22. Circuit Breaker Mismanagement:

23. Retry Mechanism:

24. Backward Incompatible Changes:

25. Inadequate Capacity Planning:

26. Lack of Failover Isolation:

27. Noise in Metrics and Alarms:

28. Variations Across Environments:

29. Inadequate Training or Documentation:

30. Self-Inflicted Traffic Surge:

31. Lack of Phased Deployment:

32. Broken Rollback Mechanisms:

33. Inappropriate Timing:

Incident Metrics

1. MTBF (Mean Time Between Failures):

2. MTTR (Mean Time to Repair):

3. MTTA (Mean Time to Acknowledge):

4. MTTF (Mean Time to Failure):

Implementing These Metrics:

Development Procedures

1. Preventing Failures:

2. Detecting Failures:

3. Responding to Alarms and Troubleshooting:

4. Recovery and Mitigation:

5. Communication:

6. Post-Incident Activities:

Post-Mortem Analysis

1. Understanding Root Causes:

2. Assessing Impact and Contributing Factors:

3. Learning from Failures: