The following are a few best practices I have learned over the years for deploying to production, especially when running multiple versions of the software across multiple data centers:
Coding/Debugging Best Practices
Naming Threads
In Java, name your background threads so that logs show meaningful thread names, and mark them as daemon threads so that they do not prevent the JVM from exiting when the process shuts down.
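As a minimal sketch of the idea, the thread factory below names each thread and marks it as a daemon. The class name `NamedDaemonThreadFactory` and the `prefix-n` naming scheme are my own illustrative choices, not a standard API.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: creates daemon threads named "<prefix>-<n>".
final class NamedDaemonThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(1);

    NamedDaemonThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // The name shows up in log lines and thread dumps.
        Thread thread = new Thread(task, prefix + "-" + counter.getAndIncrement());
        // Daemon threads do not keep the JVM alive on shutdown.
        thread.setDaemon(true);
        return thread;
    }
}
```

You could pass it to an executor, for example `Executors.newFixedThreadPool(4, new NamedDaemonThreadFactory("order-sync"))`, where `order-sync` is just a placeholder prefix.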
Resource Cleanup
To avoid resource leaks, always close resources that require explicit cleanup, such as I/O streams and database connections.
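In Java, the try-with-resources statement is one way to do this, because it closes the resource even when the body throws. The sketch below uses a hypothetical `FirstLineReader` helper of my own to show the pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper that reads one line and always releases the reader.
final class FirstLineReader {
    static String firstLine(Reader source) {
        // The reader (and the wrapped source) is closed when the block exits,
        // whether readLine() returns normally or throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same pattern applies to any type that implements `AutoCloseable`, including JDBC connections and statements.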
Backward / Forward Compatibility
When working with multiple versions of the software, it's critical that all changes to the domain model are both backward and forward compatible, so that the new schema can be read by the old service and the old schema can be read by the new service. If possible, always deploy the changes that can read the new schema before deploying the changes that write the new schema.
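One way to get both directions of compatibility is a tolerant reader: default any field that an older writer did not send, and ignore any field that a newer writer added. The sketch below assumes a record that has already been decoded into a map; the `CustomerV2` class and its `name` and `nickname` fields are hypothetical.

```java
import java.util.Map;

// Hypothetical v2 domain object: "nickname" was added after v1 shipped.
final class CustomerV2 {
    final String name;
    final String nickname;

    private CustomerV2(String name, String nickname) {
        this.name = name;
        this.nickname = nickname;
    }

    static CustomerV2 fromRecord(Map<String, String> record) {
        String name = record.get("name");
        if (name == null) {
            throw new IllegalArgumentException("name is required");
        }
        // Backward compatible: old writers never send "nickname", so default it.
        String nickname = record.getOrDefault("nickname", "");
        // Forward compatible: keys this version does not know are simply not read.
        return new CustomerV2(name, nickname);
    }
}
```

If you use a serialization library, check whether it offers a setting for ignoring unknown properties rather than hand-rolling this logic.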
Character Encoding
Always use UTF-8 for character encoding in your code and databases.
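In Java code this mostly means never relying on the platform default charset. A small sketch, with a hypothetical `Utf8Codec` helper, that passes the charset explicitly in both directions:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: always names the charset instead of using the platform default.
final class Utf8Codec {
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The same rule applies to readers, writers, and database connection settings: wherever an API accepts a charset, pass UTF-8 explicitly.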
Time Zones
Always use the UTC time zone in application code and databases.
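With `java.time`, one approach is to keep timestamps as `Instant` values, serialize them as UTC ISO-8601 strings, and convert to a user's zone only for display. The `UtcTimestamps` class below is a hypothetical sketch of that split.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: UTC for storage and exchange, local zones only for display.
final class UtcTimestamps {
    // Formats an instant as an ISO-8601 string in UTC (trailing "Z").
    static String toUtcString(Instant instant) {
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }

    // Parses an ISO-8601 UTC string back into an instant.
    static Instant fromUtcString(String text) {
        return Instant.parse(text);
    }

    // Converts to the user's zone at the presentation edge only.
    static ZonedDateTime forDisplay(Instant instant, ZoneId userZone) {
        return instant.atZone(userZone);
    }
}
```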
Internationalization/Localization/Accessibility
Apply i18n, l10n and accessibility standards for your user interfaces and services.
Data Validation
NullPointerException and wrong formats are common causes of many production issues, so always validate your model, input parameters and results for proper format and ranges.
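A simple way to apply this is to validate in the constructor so that an invalid object can never exist. The `OrderRequest` class, its fields, the deliberately loose email pattern, and the 1 to 1000 quantity range below are all hypothetical choices for illustration.

```java
import java.util.Objects;
import java.util.regex.Pattern;

// Hypothetical request object that rejects nulls, bad formats, and out-of-range values.
final class OrderRequest {
    // Deliberately loose pattern: something@something.tld with no whitespace.
    private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");
    private static final int MAX_QUANTITY = 1000; // assumed business limit

    final String email;
    final int quantity;

    OrderRequest(String email, int quantity) {
        // Fail fast with a clear message instead of a later NullPointerException.
        this.email = Objects.requireNonNull(email, "email must not be null");
        if (!EMAIL.matcher(email).matches()) {
            throw new IllegalArgumentException("email has an invalid format");
        }
        if (quantity < 1 || quantity > MAX_QUANTITY) {
            throw new IllegalArgumentException("quantity must be between 1 and " + MAX_QUANTITY);
        }
        this.quantity = quantity;
    }
}
```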
Failure Cases
Think about all the failure and edge cases that your code may run into.
Operational Best Practices
Use multiple stages for testing
Test your changes in multiple stages such as QA, UAT, alpha, beta and gamma, where you can properly bake and test them. The size and variation of the data should increase with each stage so that you can test as close to the production data and environment as possible. The testing should include load and performance testing to detect any impact on latency and availability. Further, you can use canary testing to release changes to a subset of users or data centers so that you can compare the impact of the changes before rolling out to all customers. If you deploy to multiple regions, you can start with a smaller or low-risk region for the initial production release.
Check the calendar before the release
Check the calendar for any holidays or major changes to the infrastructure that might impact the release.
Automate Automate Automate
Avoid any manual changes to the build/deployment process.
Infrastructure as code
Apply the best practices and tools of infrastructure as code so that all changes are applied consistently.
Error Logging
Use an appropriate log level for production logs and log all exceptions, including stack traces. You can log additional input parameters, but ensure they don't include any sensitive data.
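As a rough sketch using `java.util.logging`, passing the exception as the last argument to `log` records the stack trace, and a small masking helper keeps the raw value out of the message. The `PaymentHandler` class, the `charge` method, and the choice to keep only the last four digits are assumptions of mine; what counts as sensitive depends on your own policies.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical handler that logs failures with stack traces and masked parameters.
final class PaymentHandler {
    private static final Logger LOG = Logger.getLogger(PaymentHandler.class.getName());

    // Keeps only the last four characters so the full number never reaches the logs.
    static String maskCard(String cardNumber) {
        if (cardNumber == null || cardNumber.length() < 4) {
            return "****";
        }
        return "****" + cardNumber.substring(cardNumber.length() - 4);
    }

    // Runs the (hypothetical) gateway call and reports whether it succeeded.
    static boolean charge(String orderId, String cardNumber, Runnable gatewayCall) {
        try {
            gatewayCall.run();
            return true;
        } catch (RuntimeException e) {
            // Passing the exception as the last argument logs its stack trace.
            LOG.log(Level.SEVERE,
                    "charge failed for order " + orderId + ", card " + maskCard(cardNumber), e);
            return false;
        }
    }
}
```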
Deploy domain objects and data access code together
To avoid schema mismatch errors, always deploy domain object and data access changes together. In addition, you can ignore unknown properties in the model to make these changes forward compatible. If domain and data access changes cannot be released together, release the domain changes first and then the data access changes.
Plan for Rollback
Always test rolling back your changes, and include multiple scenarios if the changes were released in staggered mode, where domain changes and data access changes were published separately. You can implement automated rollbacks when releasing changes incrementally so that a bad change is rolled back immediately and blocked from going to the next stage of testing.
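The control flow of an automated rollback gate can be sketched as below. This is a simplified model rather than a real deployment tool: the `StagedRollout` class is hypothetical, "deploying" is just adding the stage to a list, and the health check is whatever predicate you supply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of an incremental release with an automated rollback gate.
final class StagedRollout {
    // Returns the stages that still run the new version after the attempt.
    static List<String> rollOut(List<String> stages, Predicate<String> healthyAfterDeploy) {
        List<String> onNewVersion = new ArrayList<>();
        for (String stage : stages) {
            onNewVersion.add(stage); // stand-in for deploying to this stage
            if (!healthyAfterDeploy.test(stage)) {
                // Roll back the failing stage and block promotion to later stages.
                onNewVersion.remove(onNewVersion.size() - 1);
                break;
            }
        }
        return onNewVersion;
    }
}
```

For example, with stages alpha, beta, gamma and prod and a health check that fails in gamma, the new version stays in alpha and beta, gamma is rolled back, and prod is never touched.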
SLA/SLO
Monitor SLA/SLO at each stage of the release so that you can fix any violation in these metrics.
Change Management
Apply best practices from change management to track any high-risk or major changes to the production environment.
Security Policies
Apply zero-trust security practices and the principle of least privilege in all test and production environments.
Health Checks / Observability
Collect metrics for health checks, failure rates, usage, etc., so that you are notified immediately when you see suspicious or unusual activity.
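In practice you would normally use a metrics library and an alerting system, but the core idea fits in a few lines. The `FailureRateMonitor` class and its threshold are hypothetical; it counts outcomes and flags when the failure rate exceeds the limit.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process metric: tracks the failure rate and flags when it is too high.
final class FailureRateMonitor {
    private final LongAdder total = new LongAdder();
    private final LongAdder failures = new LongAdder();
    private final double alertThreshold; // e.g. 0.2 means alert above 20% failures

    FailureRateMonitor(double alertThreshold) {
        this.alertThreshold = alertThreshold;
    }

    // Records the outcome of one request; safe to call from many threads.
    void record(boolean success) {
        total.increment();
        if (!success) {
            failures.increment();
        }
    }

    // Fraction of recorded requests that failed, or 0.0 when nothing was recorded.
    double failureRate() {
        long count = total.sum();
        return count == 0 ? 0.0 : (double) failures.sum() / count;
    }

    boolean shouldAlert() {
        return failureRate() > alertThreshold;
    }
}
```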
Chaos Testing
Apply chaos testing and game days to simulate various failure conditions to verify the behavior of your services and your reaction to those failures.
Integrate/Deploy often with small changes
Continuously integrate and deploy small changes to reduce the risk that comes with large changes, so that you can test your changes in the production environment safely.
Feature Flags and Dark Launch
Use feature flags judiciously to launch features that are not yet available to all users so that you can test those changes for a subset of users.
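One common way to enable a feature for a subset of users is a percentage rollout, where each user is hashed into a stable bucket. The `PercentageFlag` class below is a hypothetical sketch of that technique, not the API of any particular feature-flag product.

```java
// Hypothetical percentage-based feature flag with stable per-user bucketing.
final class PercentageFlag {
    private final String name;
    private final int rolloutPercent;

    PercentageFlag(String name, int rolloutPercent) {
        if (rolloutPercent < 0 || rolloutPercent > 100) {
            throw new IllegalArgumentException("rolloutPercent must be between 0 and 100");
        }
        this.name = name;
        this.rolloutPercent = rolloutPercent;
    }

    boolean isEnabledFor(String userId) {
        // String.hashCode() is deterministic, so a user always lands in the same bucket.
        // Including the flag name gives each flag its own distribution of users.
        int bucket = Math.floorMod((name + ":" + userId).hashCode(), 100);
        return bucket < rolloutPercent;
    }
}
```

Because the bucket does not change, raising the percentage only adds users; nobody who already has the feature loses it.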