Production Deployment Best Practices

October 6, 2019

Filed under: Computing,Project Management,Technology — admin @ 1:37 pm

The following are a few best practices I have learned over the years for deploying to production, especially when running multiple versions of the software across multiple data centers:

Coding/Debugging Best Practices

Naming Threads

In Java, give background threads meaningful names so that log entries identify which thread produced them, and mark them as daemon threads so that they do not prevent the JVM from exiting when the application shuts down.
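As a minimal sketch of this idea, a small ThreadFactory can assign both the name and the daemon flag in one place. The class name NamedDaemonThreadFactory and the "order-sync" prefix are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: creates daemon threads named "<prefix>-<n>".
final class NamedDaemonThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(1);

    NamedDaemonThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        Thread thread = new Thread(task, prefix + "-" + counter.getAndIncrement());
        thread.setDaemon(true); // daemon threads do not keep the JVM alive on exit
        return thread;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(2, new NamedDaemonThreadFactory("order-sync"));
        // The thread name shows up in log output and thread dumps.
        pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName())).get();
        pool.shutdown();
    }
}
```

Passing the factory to an executor means every pooled thread gets a consistent name without each caller having to remember to set it.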

Resource Cleanup

To avoid resource leaks, always close resources that require explicit cleanup, such as I/O streams and database connections.
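In Java, the try-with-resources statement is one way to do this: the resource is closed automatically when the block exits, whether normally or with an exception. The sketch below uses a hypothetical countLines helper to show the pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ResourceCleanupExample {
    // Counts the lines in the source; the reader is closed even if reading fails.
    static int countLines(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines(new StringReader("first\nsecond\nthird")));
    }
}
```

The same pattern applies to any AutoCloseable, including JDBC connections and statements.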

Backward / Forward Compatibility

When working with multiple versions of the software, it’s critical that all changes to the domain model are both backward and forward compatible so that the new schema can be read by the old service and the old schema can be read by the new service. If possible, always deploy the changes that can read the new schema before deploying the changes that write it.
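One common way to get both directions of compatibility is a tolerant reader: supply defaults for fields that older records lack and ignore fields that the current code does not know about. The sketch below assumes a record that has already been parsed into key/value pairs; the Customer class and its field names are hypothetical.

```java
import java.util.Map;

// Hypothetical domain object; assume "email" was added in a later schema version.
final class Customer {
    final String id;
    final String name;
    final String email;

    Customer(String id, String name, String email) {
        this.id = id;
        this.name = name;
        this.email = email;
    }

    // Tolerant reader over an already-parsed key/value record.
    static Customer fromRecord(Map<String, String> record) {
        String id = record.get("id");
        if (id == null) {
            throw new IllegalArgumentException("id is required");
        }
        String name = record.getOrDefault("name", "");
        // Old records have no "email": fall back to a default instead of failing.
        String email = record.getOrDefault("email", "unknown");
        // Keys written by a newer schema are deliberately ignored.
        return new Customer(id, name, email);
    }

    public static void main(String[] args) {
        Customer oldRecord = fromRecord(Map.of("id", "1", "name", "Ann"));
        Customer newRecord = fromRecord(Map.of("id", "2", "name", "Bob", "loyaltyTier", "gold"));
        System.out.println(oldRecord.email + " / " + newRecord.name);
    }
}
```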

Character Encoding

Always use UTF-8 for character encoding in your code and databases.
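In Java code this mostly means never relying on the platform default charset. A small sketch, with hypothetical encode/decode helpers, that passes StandardCharsets.UTF_8 explicitly:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Example {
    // Always name the charset; the platform default can differ between hosts.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape to keep the source ASCII
        byte[] bytes = encode(original);
        System.out.println(bytes.length + " bytes, round trip ok: " + decode(bytes).equals(original));
    }
}
```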

Time Zones

Always use the UTC time zone in application code and databases.
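With the java.time API, one way to follow this is to convert any timestamp that carries an offset to an Instant before storing or logging it. The helper names below are hypothetical; this is a sketch, not the only approach.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class UtcTimeExample {
    // Normalizes a timestamp with any offset to a UTC instant for storage.
    static Instant toUtc(OffsetDateTime local) {
        return local.toInstant();
    }

    // Instant.toString() renders ISO-8601 text in UTC (trailing "Z").
    static String format(Instant instant) {
        return instant.toString();
    }

    public static void main(String[] args) {
        OffsetDateTime local = OffsetDateTime.of(2019, 10, 6, 13, 37, 0, 0, ZoneOffset.ofHours(-7));
        System.out.println(format(toUtc(local))); // prints 2019-10-06T20:37:00Z
    }
}
```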

Internationalization/Localization/Accessibility

Apply i18n, l10n and accessibility standards for your user interfaces and services.

Data Validation

NullPointerException and malformed data are common causes of production issues, so always validate your model, input parameters and results for proper format and range.
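A simple way to apply this is to validate in the constructor so that an invalid object can never exist. The TransferRequest class, the account-id format and the amount limit below are all invented for illustration.

```java
import java.util.Objects;

// Hypothetical request object that rejects bad input at construction time.
final class TransferRequest {
    static final long MAX_AMOUNT_CENTS = 1_000_000L; // assumed business limit

    final String accountId;
    final long amountCents;

    TransferRequest(String accountId, long amountCents) {
        Objects.requireNonNull(accountId, "accountId must not be null");
        // Assumed format: two upper-case letters followed by six digits.
        if (!accountId.matches("[A-Z]{2}\\d{6}")) {
            throw new IllegalArgumentException("accountId has wrong format: " + accountId);
        }
        if (amountCents <= 0 || amountCents > MAX_AMOUNT_CENTS) {
            throw new IllegalArgumentException("amountCents out of range: " + amountCents);
        }
        this.accountId = accountId;
        this.amountCents = amountCents;
    }

    public static void main(String[] args) {
        TransferRequest ok = new TransferRequest("AB123456", 2_500L);
        System.out.println(ok.accountId + " -> " + ok.amountCents);
    }
}
```

Failing fast with a descriptive message at the boundary is usually easier to debug than a NullPointerException several calls deeper.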

Failure Cases

Think through all the failure and edge cases that your code may encounter.

Operational Best Practices

Use multiple stages for testing

Test your changes in multiple stages such as QA, UAT, alpha, beta and gamma, where you can properly bake and test them. The size and variety of the data should increase with each stage so that you test as close to the production data and environment as possible. The testing should include load and performance testing to detect any impact on latency and availability. Further, you can use canary testing to release changes to a subset of users or data centers so that you can compare the impact of the changes before rolling out to all customers. If you deploy to multiple regions, you can start with a smaller or low-risk region for the initial production release.

Check Calendar before the release

Check the calendar for holidays or major infrastructure changes that might affect the release.

Automate Automate Automate

Avoid any manual changes to the build/deployment process.

Infrastructure as code

Apply best practices and tools from infrastructure as code so that all changes are applied consistently.

Error Logging

Use an appropriate log level for production logs and log all exceptions, including stack traces. You can log additional input parameters, but ensure they don’t include any sensitive data.
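As a rough sketch using java.util.logging, the exception is passed to the logger so the stack trace is recorded, and a sensitive parameter is masked before it reaches the log. The PaymentLogger class, the mask helper and the choice to keep the last four characters are assumptions made for this example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class PaymentLogger {
    private static final Logger LOG = Logger.getLogger(PaymentLogger.class.getName());

    // Hides all but the last four characters of a sensitive value.
    static String mask(String secret) {
        if (secret == null || secret.length() <= 4) {
            return "****";
        }
        return "*".repeat(secret.length() - 4) + secret.substring(secret.length() - 4);
    }

    static void logFailure(String orderId, String cardNumber, Exception error) {
        // Passing the Throwable makes the logger record the full stack trace.
        LOG.log(Level.SEVERE,
                "payment failed orderId=" + orderId + " card=" + mask(cardNumber),
                error);
    }

    public static void main(String[] args) {
        logFailure("order-42", "4111111111111111", new IllegalStateException("gateway timeout"));
    }
}
```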

Deploy domain objects and data access code together

To avoid schema mismatch errors, always deploy domain object and data access changes together. In addition, you can ignore unknown properties in the model to make these changes forward compatible. If domain and data access changes cannot be released together, release the domain changes first and then the data access changes.

Plan for Rollback

Always test rolling back your changes, and include multiple scenarios if the changes were released in a staggered mode where domain changes and data access changes were published separately. You can implement automated rollbacks when releasing changes incrementally so that a faulty change is blocked immediately, before it reaches the next stage of testing.

SLA/SLO

Monitor SLAs/SLOs at each stage of the release so that you can catch and fix any violation of these metrics.
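For an availability objective, the check at each stage can be as simple as comparing the observed success ratio with the target. The sketch below assumes request counts are already available from your metrics system; the SloCheck class and the 99.9% target are illustrative only.

```java
final class SloCheck {
    // Fraction of requests that succeeded during the measurement window.
    static double availability(long successful, long total) {
        if (total <= 0 || successful < 0 || successful > total) {
            throw new IllegalArgumentException("invalid counts: " + successful + "/" + total);
        }
        return (double) successful / total;
    }

    // True when the observed availability meets or exceeds the target.
    static boolean meetsSlo(long successful, long total, double target) {
        return availability(successful, total) >= target;
    }

    public static void main(String[] args) {
        double target = 0.999; // assumed 99.9% availability objective
        System.out.println("stage healthy: " + meetsSlo(9_995, 10_000, target));
        System.out.println("stage healthy: " + meetsSlo(9_980, 10_000, target));
    }
}
```

A release pipeline could call a check like this after each stage and stop the rollout when it returns false.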

Change Management

Apply best practices from change management to track any high-risk or major changes to the production environment.

Security Policies

Apply zero-trust security practices and the principle of least privilege in all test and production environments.

Health Checks / Observability

Collect metrics for health checks, failure rates, usage, etc., so that you are notified immediately of any suspicious or unusual activity.
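A minimal in-process version of a failure-rate metric might look like the sketch below. In practice you would probably export the counters to a monitoring system rather than alert from application code; the FailureRateMonitor class and the 5% threshold are assumptions.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counter for request outcomes.
final class FailureRateMonitor {
    private final LongAdder total = new LongAdder();
    private final LongAdder failures = new LongAdder();
    private final double alertThreshold;

    FailureRateMonitor(double alertThreshold) {
        this.alertThreshold = alertThreshold;
    }

    // Safe to call from multiple request threads.
    void record(boolean success) {
        total.increment();
        if (!success) {
            failures.increment();
        }
    }

    double failureRate() {
        long count = total.sum();
        return count == 0 ? 0.0 : (double) failures.sum() / count;
    }

    boolean shouldAlert() {
        return failureRate() > alertThreshold;
    }

    public static void main(String[] args) {
        FailureRateMonitor monitor = new FailureRateMonitor(0.05);
        for (int i = 0; i < 10; i++) {
            monitor.record(i != 0); // one failure out of ten requests
        }
        System.out.println("rate=" + monitor.failureRate() + " alert=" + monitor.shouldAlert());
    }
}
```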

Chaos Testing

Apply chaos testing and game days to simulate various failure conditions to verify the behavior of your services and your reaction to those failures.
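At the code level, one small building block for this kind of testing is a wrapper that injects failures into a call with a configurable probability. The FaultInjector class below is a hypothetical sketch and not a replacement for dedicated chaos tooling; the seed makes test runs repeatable.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical wrapper that randomly fails calls to simulate a flaky dependency.
final class FaultInjector {
    private final Random random;
    private final double failureProbability;

    FaultInjector(double failureProbability, long seed) {
        if (failureProbability < 0.0 || failureProbability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0 and 1");
        }
        this.failureProbability = failureProbability;
        this.random = new Random(seed);
    }

    <T> T call(Supplier<T> operation) {
        // nextDouble() is in [0, 1), so 0.0 never fails and 1.0 always fails.
        if (random.nextDouble() < failureProbability) {
            throw new IllegalStateException("injected failure");
        }
        return operation.get();
    }

    public static void main(String[] args) {
        FaultInjector injector = new FaultInjector(0.3, 42L);
        int failed = 0;
        for (int i = 0; i < 100; i++) {
            try {
                injector.call(() -> "ok");
            } catch (IllegalStateException e) {
                failed++; // a real caller would exercise its retry or fallback path here
            }
        }
        System.out.println("injected failures: " + failed);
    }
}
```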

Integrate/Deploy often with small changes

Continuously integrate and deploy small changes to reduce the risk that comes with large changes, so that you can test your changes in the production environment safely.

Feature Flags and Dark Launch

Use feature flags judiciously to launch features that are not yet available to all users so that you can test those changes for a subset of users.
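One way to expose a feature to a subset of users is a percentage rollout that hashes the user id into a stable bucket, so a given user always gets the same answer. The FeatureFlag class below is a hypothetical sketch; a real system would typically load the percentage from configuration or a flag service.

```java
// Hypothetical percentage-based feature flag.
final class FeatureFlag {
    private final String name;
    private final int rolloutPercent;

    FeatureFlag(String name, int rolloutPercent) {
        if (rolloutPercent < 0 || rolloutPercent > 100) {
            throw new IllegalArgumentException("rolloutPercent must be between 0 and 100");
        }
        this.name = name;
        this.rolloutPercent = rolloutPercent;
    }

    // Deterministic: the same user and flag name always map to the same bucket.
    boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod((name + ":" + userId).hashCode(), 100);
        return bucket < rolloutPercent;
    }

    public static void main(String[] args) {
        FeatureFlag newCheckout = new FeatureFlag("new-checkout", 10);
        int enabled = 0;
        for (int i = 0; i < 1000; i++) {
            if (newCheckout.isEnabledFor("user-" + i)) {
                enabled++;
            }
        }
        System.out.println("enabled for " + enabled + " of 1000 users");
    }
}
```

Including the flag name in the hash keeps different flags from always selecting the same group of users.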

Shahzad Bhatti Welcome to my ramblings and rants!