Shahzad Bhatti Welcome to my ramblings and rants!

August 26, 2023

Modern Software Development Process

Filed under: Methodologies,Project Management,Technology — admin @ 5:20 pm

The technologies, methodologies, and paradigms that shape how software is designed, built, and delivered have evolved from the Waterfall model, described in 1970 by Dr. Winston W. Royce, to the Agile model in the early 2000s. The Waterfall model was characterized by sequential and rigid phases, detailed documentation, and milestone-based progression. It suffered from inflexibility in adapting to change, longer time-to-market, high risk and uncertainty, difficulty in accurately estimating costs and time, silos between teams, and late discovery of issues. The Agile model addresses these shortcomings by adopting iterative and incremental development, continuous delivery, feedback loops, and cross-functional teams, all of which lower risk. In addition, modern software development processes over the last few years have adopted DevOps integration, automated testing, microservices and modular architecture, cloud services, infrastructure as code, and data-driven decisions to improve time to market, customer satisfaction, cost efficiency, operational resilience, collaboration, and sustainable development.

1. Goals

The goals of a modern software development process include:

  1. Speed and Agility: Accelerate the delivery of high-quality software products and updates to respond to market demands and user needs.
  2. Quality and Reliability: Ensure that the software meets specified requirements, performs reliably under all conditions, and is maintainable and scalable.
  3. Reduced Complexity: Simplify complex tasks through refactoring and microservices.
  4. Customer-Centricity: Develop features and products that solve real problems for users and meet customer expectations for usability and experience.
  5. Scalability: Design software that can grow and adapt to changing user needs and technological advancements.
  6. Cost-Efficiency: Optimize resource utilization and operational costs through automation and efficient practices.
  7. Security: Ensure the integrity, availability, and confidentiality of data and services.
  8. Innovation: Encourage a culture of innovation and learning, allowing teams to experiment, iterate, and adapt.
  9. Collaboration and Transparency: Foster a collaborative environment among cross-functional teams and maintain transparency with stakeholders.
  10. Employee Satisfaction: Focus on collaboration, autonomy, and mastery can lead to more satisfied, motivated teams.
  11. Learning and Growth: Allows teams to reflect on their wins and losses regularly, fostering a culture of continuous improvement.
  12. Transparency: Promote transparency with regular meetings like daily stand-ups and sprint reviews.
  13. Flexibility: Adapt to changes even late in the development process.
  14. Compliance and Governance: Ensure that software meets industry-specific regulations and standards.
  15. Risk Mitigation: Make it easier to course-correct, reducing the risk of late project failure.
  16. Competitive Advantage: Enables companies to adapt to market changes more swiftly, offering a competitive edge.

2. Elements of Software Development Process

Key elements of the Software Development Cycle include:

  1. Discovery & Ideation: Involves brainstorming, feasibility studies, and stakeholder buy-in. Lean startup methodologies might be applied for new products.
  2. Continuous Planning: Agile roadmaps, iterative sprint planning, and backlog refinement are continuous activities.
  3. User-Centric Design: The design process is iterative, human-centered, and closely aligned with development.
  4. Development & Testing: Emphasizes automated testing, code reviews, and incremental development. CI/CD pipelines automate much of the build and testing process.
  5. Deployment & Monitoring: DevOps practices facilitate automated, reliable deployments. Real-time monitoring tools and AIOps (Artificial Intelligence for IT Operations) can proactively manage system health.
  6. Iterative Feedback Loops: Customer feedback, analytics, and KPIs inform future development cycles.

3. Roles in Modern Processes

Critical Roles in Contemporary Software Development include:

  1. Product Owner/Product Manager: This role serves as the liaison between the business and technical teams. They are responsible for defining the product roadmap, prioritizing features, and ensuring that the project aligns with stakeholder needs and business objectives.
  2. Scrum Master/Agile Coach: These professionals are responsible for facilitating Agile methodologies within the team. They help organize and run sprint planning sessions, stand-ups, and retrospectives.
  3. Development Team: This group is made up of software engineers and developers responsible for the design, coding, and initial testing of the product. They collaborate closely with other roles, especially the Product Owner, to ensure that the delivered features meet the defined requirements.
  4. DevOps Engineers: DevOps Engineers act as the bridge between the development and operations teams. They focus on automating the Continuous Integration/Continuous Deployment (CI/CD) pipeline to ensure that code can be safely, efficiently, and reliably moved from development into production.
  5. QA Engineers: Quality Assurance Engineers play a vital role in the software development life cycle by creating automated tests that are integrated into the CI/CD pipeline.
  6. Data Scientists: These individuals use analytical techniques to draw actionable insights from data generated by the application. They may look at user behavior, application performance, or other metrics to provide valuable feedback that can influence future development cycles or business decisions.
  7. Security Engineers: Security Engineers are tasked with ensuring that the application is secure from various types of threats. They are involved from the early stages of development to ensure that security is integrated into the design (“Security by Design”).

Each of these roles plays a critical part in modern software development, contributing to more efficient processes, higher-quality output, and ultimately, greater business value.

4. Phases of Modern SDLC

In the Agile approach to the software development lifecycle, multiple phases are organized not in a rigid, sequential manner as seen in the Waterfall model, but as part of a more fluid, iterative process. This allows for continuous feedback loops and enables incremental delivery of features. Below is an overview of the various phases that constitute the modern Agile-based software development lifecycle:

4.1 Inception Phase

The Inception phase is the initial stage in a software development project, often seen in frameworks like the Rational Unified Process (RUP) or even in Agile methodologies in a less formal manner. It sets the foundation for the entire project by defining its scope, goals, constraints, and overall vision.

4.1.1 Objectives

The objectives of the inception phase include:

  1. Project Vision and Scope: Define what the project is about, its boundaries, and what it aims to achieve.
  2. Feasibility Study: Assess whether the project is feasible in terms of time, budget, and technical constraints.
  3. Stakeholder Identification: Identify all the people or organizations who have an interest in the project, such as clients, end-users, developers, and more.
  4. Risk Assessment: Evaluate potential risks, like technical challenges, resource limitations, and market risks, and consider how they can be mitigated.
  5. Resource Planning: Preliminary estimation of the resources (time, human capital, budget) required for the project.
  6. Initial Requirements Gathering: Collect high-level requirements, often expressed as epics or user stories, which will be refined later.
  7. Project Roadmap: Create a high-level project roadmap that outlines major milestones and timelines.

4.1.2 Practices

The inception phase includes the following practices:

  • Epics Creation: High-level functionalities and features are described as epics.
  • Feasibility Study: Analyze the technical and financial feasibility of the proposed project.
  • Requirements: Identify customers and document the requirements captured from them. Review the requirements with all stakeholders and get alignment on key features and deliverables. Watch for scope creep so that the software change can still be delivered incrementally.
  • Estimated Effort: A high-level estimate of the effort needed to build and deliver the features, expressed in person-months, t-shirt sizes, or story points.
  • Business Metrics: Define key business metrics that will be tracked such as adoption, utilization and success criteria for the features.
  • Customer and Dependencies Impact: Understand how the changes will impact customers as well as upstream/downstream dependencies.
  • Roadmapping: Develop a preliminary roadmap to guide the project’s progression.

4.1.3 Roles and Responsibilities:

The roles and responsibilities in the inception phase include:

  • Stakeholders: Validate initial ideas and provide necessary approvals or feedback.
  • Product Owner: Responsible for defining the vision and scope of the product, expressed through a set of epics and an initial product backlog. The Product Owner (or product manager) interacts with customers, defines the customer problems, and works with other stakeholders to find solutions. This role also defines the end-to-end customer experience and how the customer will benefit from the proposed solution.
  • Scrum Master: Facilitates the Inception meetings and ensures that all team members understand the project’s objectives and scope. The Scrum master helps write use-cases, user-stories and functional/non-functional requirements into the Sprint Backlog.
  • Development Team: Provide feedback on technical feasibility and initial estimates for the project scope.

By carefully executing the Inception phase, the team can ensure that everyone has a clear understanding of the ‘what,’ ‘why,’ and ‘how’ of the project, thereby setting the stage for a more organized and focused development process.

4.2 Planning Phase

The Planning phase is an essential stage in the Software Development Life Cycle (SDLC), especially in methodologies that adopt Agile frameworks like Scrum or Kanban. This phase aims to detail the “how” of the project, which is built upon the “what” and “why” established during the Inception phase. The Planning phase involves specifying objectives, estimating timelines, allocating resources, and setting performance measures, among other activities.

4.2.1 Objectives

The objectives of the planning phase include:

  1. Sprint Planning: Determine the scope of the next sprint, specifying what tasks will be completed.
  2. Release Planning: Define the features and components that will be developed and released in the upcoming iterations.
  3. Resource Allocation: Assign tasks and responsibilities to team members based on skills and availability.
  4. Timeline and Effort Estimation: Define the timeframe within which different tasks and sprints should be completed.
  5. Risk Management: Identify, assess, and plan for risks and contingencies.
  6. Technical Planning: Decide the architectural layout, technologies to be used, and integration points, among other technical aspects.

4.2.2 Practices:

The planning phase includes the following practices:

  • Sprint Planning: A meeting where the team decides what to accomplish in the coming sprint.
  • Release Planning: High-level planning to set goals and timelines for upcoming releases.
  • Backlog Refinement: Ongoing process to add details, estimates, and order to items in the Product Backlog.
  • Effort Estimation: Techniques like story points, t-shirt sizes, or time-based estimates to assess how much work is involved in a given task or story.

4.2.3 Roles and Responsibilities:

The roles and responsibilities in the planning phase include:

  • Product Owner: Prioritizes the Product Backlog and clarifies details of backlog items. Helps the team understand which items are most crucial to the project’s success.
  • Scrum Master: Facilitates planning meetings and ensures that the team has what they need to complete their tasks. Addresses any impediments that the team might face.
  • Development Team: Provides effort estimates for backlog items, helps break down stories into tasks, and commits to the amount of work they can accomplish in the next sprint.
  • Stakeholders: May provide input on feature priority or business requirements, though they typically don’t participate in the detailed planning activities.

4.2.4 Key Deliverables:

Key deliverables in the planning phase include:

  • Sprint Backlog: A list of tasks and user stories that the development team commits to complete in the next sprint.
  • Execution Plan: A plan outlining the execution of the software development including dependencies from other teams. The plan also identifies key reversible and irreversible decisions.
  • Deployment and Release Plan: A high-level plan outlining the features and major components to be released over a series of iterations. This plan also includes feature flags and other configuration parameters such as throttling limits.
  • Resource Allocation Chart: A detailed breakdown of who is doing what.
  • Risk Management Plan: A document outlining identified risks, their severity, and mitigation strategies.

By thoroughly engaging in the Planning phase, teams can have a clear roadmap for what needs to be achieved, how it will be done, who will do it, and when it will be completed. This sets the project on a course that maximizes efficiency, minimizes risks, and aligns closely with the stakeholder requirements and business objectives.

4.3 Design Phase

The Design phase in the Software Development Life Cycle (SDLC) is a critical step that comes after planning and before coding. During this phase, the high-level architecture and detailed design of the software system are developed. This serves as a blueprint for the construction of the system, helping the development team understand how the software components will interact, what data structures will be used, and how user interfaces will be implemented, among other things.

4.3.1 Objectives:

The objectives of the design phase include:

  1. Architectural Design: Define the software’s high-level structure and interactions between components. Understand upstream and downstream dependencies and how architecture decisions affect those dependencies. The architecture should scale to at least 2x the peak traffic and build in redundancy to avoid single points of failure. Architecture decisions should be documented in an Architecture Decision Record format and classified as reversible or irreversible.
  2. Low-Level Design: Break down the architectural components into detailed design specifications including functional and non-functional considerations. The design documents should include alternative solutions and tradeoffs when selecting a recommended approach. The design document should consider how the software can be delivered incrementally and whether multiple components can be developed in parallel. Other best practices such as not breaking backward compatibility, composable modules, consistency, idempotency, pagination, no unbounded operations, validation, and purposeful design should be applied to the software design.
  3. API Design: Specify the endpoints, request/response models, and the underlying architecture for any APIs that the system will expose.
  4. User Interface Design: Design the visual elements and user experiences.
  5. Data Model Design: Define the data structures and how data will be stored, accessed, and managed.
  6. Functional Design: Elaborate on the algorithms, interactions, and methods that will implement software features.
  7. Security Design: Implement security measures like encryption, authentication, and authorization.
  8. Performance Considerations: Design for scalability, reliability, and performance optimization.
  9. Design for Testability: Ensure elements can be easily tested.
  10. Operational Excellence: Document monitoring, logging, operational and business metrics, alarms, and dashboards to monitor health of the software components.
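A few of the low-level design practices listed above (idempotency, pagination, no unbounded operations, and validation) can be sketched in code. The following is only a minimal illustration under stated assumptions: `OrderService`, `create_order`, `list_orders`, and `MAX_PAGE_SIZE` are hypothetical names, and the in-memory lists stand in for whatever storage a real service would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_PAGE_SIZE = 100  # hypothetical upper bound so no request is unbounded


@dataclass
class OrderService:
    """In-memory stand-in for a service; all names are illustrative only."""
    orders: List[dict] = field(default_factory=list)
    _by_key: Dict[str, dict] = field(default_factory=dict)

    def create_order(self, idempotency_key: str, item: str) -> dict:
        # Idempotency: a retried request with the same key returns the
        # original result instead of creating a duplicate order.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        order = {"id": len(self.orders) + 1, "item": item}
        self.orders.append(order)
        self._by_key[idempotency_key] = order
        return order

    def list_orders(self, cursor: int = 0, limit: int = 20) -> Tuple[List[dict], Optional[int]]:
        # Validation: reject bad input early.
        if cursor < 0 or limit < 1:
            raise ValueError("cursor must be >= 0 and limit must be >= 1")
        # Pagination: clamp the page size so the operation stays bounded.
        limit = min(limit, MAX_PAGE_SIZE)
        page = self.orders[cursor:cursor + limit]
        next_cursor = cursor + limit if cursor + limit < len(self.orders) else None
        return page, next_cursor


service = OrderService()
first = service.create_order("request-1", "book")
retry = service.create_order("request-1", "book")  # same key, no duplicate
print(first is retry, len(service.orders))
```

The caller supplies the idempotency key, so a client that times out can safely retry the same request. The clamp on `limit` is one way to guarantee that a single call never scans the whole data set.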

4.3.2 Practices:

The design phase includes the following practices:

  • Architectural and Low-Level Design Reviews: Conducted to validate the robustness, scalability, and performance of the design. High-level reviews focus on the system architecture, while low-level reviews dig deep into modules, classes, and interfaces.
  • API Design Review: A specialized review to ensure that the API design adheres to RESTful principles, is consistent, and meets performance and security benchmarks.
  • Accessibility Review: Review UI for accessibility support.
  • Design Patterns: Utilize recognized solutions for common design issues.
  • Wireframing and Prototyping: For UI/UX design.
  • Sprint Zero: Sometimes used in Agile for initial setup, including design activities.
  • Backlog Refinement: Further details may be added to backlog items in the form of acceptance criteria and design specs.

4.3.3 Roles and Responsibilities:

The roles and responsibilities in the design phase include:

  • Product Owner: Defines the acceptance criteria for user stories and validates that the design meets business and functional requirements.
  • Scrum Master: Facilitates design discussions and ensures that design reviews are conducted effectively.
  • Development Team: Creates the high-level architecture, low-level detailed design, and API design. Chooses design patterns and data models.
  • UX/UI Designers: Responsible for user interface and user experience design.
  • System Architects: Involved in high-level design and may also review low-level designs and API specifications.
  • API Designers: Specialized role focusing on API design aspects such as endpoint definition, rate limiting, and request/response models.
  • QA Engineers: Participate in design reviews to ensure that the system is designed for testability.
  • Security Engineers: Involved in both high-level and low-level design reviews to ensure that security measures are adequately implemented.

4.3.4 Key Deliverables:

Key deliverables in the design phase include:

  • High-Level Architecture Document: Describes the architectural layout and high-level components.
  • Low-Level Design Document: Provides intricate details of each module, including pseudo-code, data structures, and algorithms.
  • API Design Specification: Comprehensive documentation for the API including endpoints, methods, request/response formats, and error codes.
  • Database Design: Includes entity-relationship diagrams, schema designs, etc.
  • UI Mockups: Prototypes or wireframes of the user interface.
  • Design Review Reports: Summaries of findings from high-level, low-level, and API design reviews.
  • Proof of Concept and Spikes: Build proofs of concept or run spikes to learn about high-risk design components.
  • Architecture Decision Records: Document the key decisions that were made, including tradeoffs, constraints, stakeholders, and the final decision.

The Design phase serves as the blueprint for the entire project. Well-thought-out and clear design documentation helps mitigate risks, provides a clear vision for the development team, and sets the stage for efficient and effective coding, testing, and maintenance.

4.4 Development Phase

The Development phase in the Software Development Life Cycle (SDLC) is the stage where the actual code for the software is written, based on the detailed designs and requirements specified in the preceding phases. This phase involves coding, unit testing, integration, and sometimes even preliminary system testing to validate that the implemented features meet the specifications.

4.4.1 Objectives:

The objectives of the development phase include:

  1. Coding: Translate design documentation and specifications into source code.
  2. Modular Development: Build the software in chunks or modules for easier management, testing, and debugging.
  3. Unit Testing: Verify that each unit or module functions as designed.
  4. Integration: Combine separate units and modules into a single, functioning system.
  5. Code Reviews: Validate that code adheres to standards, guidelines, and best practices.
  6. Test Plans: Review test plans for all test-cases that will validate the changes to the software.
  7. Documentation: Produce inline code comments, READMEs, and technical documentation.
  8. Early System Testing: Sometimes, a form of system testing is performed to catch issues early.

4.4.2 Practices:

The development phase includes the following practices:

  • Pair Programming: Two developers work together at one workstation, enhancing code quality.
  • Test-Driven Development (TDD): Write failing tests before writing code, then write code to pass the tests.
  • Code Reviews: Peer reviews to maintain code quality.
  • Continuous Integration: Frequently integrate code into a shared repository and run automated tests.
  • Version Control: Use of tools like Git to manage changes and history.
  • Progress Report: Use burndown charts, risk analysis, blockers, and other data to keep stakeholders up to date on the progress of the development effort.
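The Test-Driven Development practice above can be illustrated with a small sketch. It assumes Python's standard `unittest` module; `slugify` and `run_slugify_tests` are hypothetical names chosen for the example. The tests are written first and would fail until the function below them exists.

```python
import unittest


# Step 1 (red): the tests are written first and fail until slugify exists.
class SlugifyTest(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Modern Software Process"), "modern-software-process")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Agile   model "), "agile-model")


# Step 2 (green): the simplest implementation that makes the tests pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


# Step 3 (refactor) would clean up the code while the tests stay green.
def run_slugify_tests() -> bool:
    """Run the test case and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

In a Continuous Integration setup, a runner like this (or the test framework's own command line) would typically be executed on every commit.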

4.4.3 Roles and Responsibilities:

The roles and responsibilities in the development phase include:

  • Product Owner: Prioritizes features and stories in the backlog that are to be developed, approves or rejects completed work based on acceptance criteria.
  • Scrum Master: Facilitates daily stand-ups, removes impediments, and ensures the team is focused on the tasks at hand.
  • Software Developers: Write the code, perform unit tests, and integrate modules.
  • QA Engineers: Often involved early in the development phase to write automated test scripts, and to run unit tests along with developers.
  • DevOps Engineers: Manage the CI/CD pipeline to automate code integration and preliminary testing.
  • Technical Leads/Architects: Provide guidance and review code to ensure it aligns with architectural plans.

4.4.4 Key Deliverables:

Key deliverables in the development phase include:

  • Source Code: The actual code files, usually stored in a version control system.
  • Unit Test Cases and Results: Documentation showing what unit tests have been run and their results.
  • Code Review Reports: Findings from code reviews.
  • Technical Documentation: Sometimes known as codebooks or developer guides, these detail how the code works for future maintenance.

4.4.5 Best Practices for Development:

Best practices in the development phase include:

  • Clean Code: Write code that is easy to read, maintain, and extend.
  • Modularization: Build in a modular fashion for ease of testing, debugging, and scalability.
  • Commenting and Documentation: In-line comments and external documentation to aid in future maintenance and understanding.
  • Code Repositories: Use version control systems to keep a historical record of all changes and to facilitate collaboration.

The Development phase is where the design and planning turn into a tangible software product. Following best practices and guidelines during this phase can significantly impact the quality, maintainability, and scalability of the end product.

4.5 Testing Phase

The Testing phase in the Software Development Life Cycle (SDLC) is crucial for validating the system’s functionality, performance, security, and stability. This phase involves various types of testing methodologies to ensure that the software meets all the requirements and can handle expected and unexpected situations gracefully.

4.5.1 Objectives:

The objectives of the testing phase include:

  1. Validate Functionality: Ensure all features work as specified in the requirements.
  2. Ensure Quality: Confirm that the system is reliable, secure, and performs optimally.
  3. Identify Defects: Find issues that need to be addressed before the software is deployed.
  4. Risk Mitigation: Assess and mitigate potential security and performance risks.

4.5.2 Practices:

The testing phase includes the following practices:

  • Test Planning: Create a comprehensive test plan detailing types of tests, scope, schedule, and responsibilities.
  • Test Case Design: Prepare test cases based on various testing requirements.
  • Test Automation: Automate repetitive and time-consuming tests.
  • Continuous Testing: Integrate testing into the CI/CD pipeline for continuous feedback.

4.5.3 Types of Testing and Responsibilities:

The testing phase includes the following types of testing:

  1. Integration Testing:
    Responsibilities: QA Engineers, DevOps Engineers
    Objective: Verify that different modules or services work together as expected.
  2. Functional Testing:
    Responsibilities: QA Engineers
    Objective: Validate that the system performs all the functions as described in the specifications.
  3. Load/Stress Testing:
    Responsibilities: Performance Test Engineers, DevOps Engineers
    Objective: Check how the system performs under heavy loads.
  4. Security and Penetration Testing:
    Responsibilities: Security Engineers, Ethical Hackers
    Objective: Identify vulnerabilities that an attacker could exploit.
  5. Canary Testing:
    Responsibilities: DevOps Engineers, Product Owners
    Objective: Validate that new features or updates will perform well with a subset of the production environment before full deployment.
  6. Other Forms of Testing:
    Smoke Testing: Quick tests to verify basic functionality after a new build or deployment.
    Regression Testing: Ensure new changes haven’t adversely affected existing functionalities.
    User Acceptance Testing (UAT): Final validation that the system meets business needs, usually performed by stakeholders or end-users.
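As an illustration of the canary testing entry above, here is a minimal sketch of one way to pick a stable subset of traffic. It assumes the subset is chosen per user by hashing a user ID into a percentage bucket; the names `in_canary` and `choose_version` are hypothetical, and real systems often do this in a load balancer or deployment tool instead.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user falls into the canary cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hashing keeps the assignment stable: the same user always lands in
    # the same bucket, so their experience does not flip between requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < percent


def choose_version(user_id: str, percent: int) -> str:
    """Route a request to the canary build or the stable build."""
    return "canary" if in_canary(user_id, percent) else "stable"
```

Raising `percent` step by step widens the cohort without moving users who were already in it, because a bucket below the old threshold is also below the new one.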

4.5.4 Key Deliverables:

  • Test Plan: Describes the scope, approach, resources, and schedule for the testing activities.
  • Test Cases: Detailed description of what to test, how to test, and expected outcomes.
  • Test Scripts: Automated scripts for testing.
  • Test Reports: Summary of testing activities, findings, and results.

4.5.5 Best Practices for Testing:

  • Early Involvement: Involve testers from the early stages of SDLC for better understanding and planning.
  • Code Reviews: Conduct code reviews to catch potential issues before the testing phase.
  • Data-Driven Testing: Use different sets of data to evaluate how the application performs with various inputs.
  • Monitoring and Feedback Loops: Integrate monitoring tools to get real-time insights during testing and improve the test scenarios continuously.
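The data-driven testing practice above can be sketched as a table of inputs fed through the same test logic. This example assumes Python's standard `unittest` module and its `subTest` context manager; `parse_port` is a hypothetical function under test, and the case tables are illustrative.

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical function under test: parse a TCP port number."""
    port = int(value)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


# The data sets drive the tests; adding a case means adding a row.
VALID_CASES = [("1", 1), ("8080", 8080), ("65535", 65535)]
INVALID_CASES = ["0", "65536", "-1", "http", ""]


class ParsePortTest(unittest.TestCase):
    def test_valid_inputs(self):
        for raw, expected in VALID_CASES:
            with self.subTest(raw=raw):
                self.assertEqual(parse_port(raw), expected)

    def test_invalid_inputs(self):
        for raw in INVALID_CASES:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_port(raw)


def run_parse_port_tests() -> bool:
    """Run the test case and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Using `subTest` means one failing row is reported on its own rather than hiding the rows after it.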

The Testing phase is crucial for delivering a reliable, secure, and robust software product. A comprehensive testing strategy that combines several types of tests helps ensure that the software is well vetted before it reaches end users.

4.6 Deployment Phase

The Deployment phase in the Software Development Life Cycle (SDLC) involves transferring the well-tested codebase from the staging environment to the production environment, making the application accessible to the end-users. This phase aims to ensure a smooth transition of the application from development to production with minimal disruptions and optimal performance.

4.6.1 Objectives:

The objectives of the deployment phase include:

  1. Release Management: Ensure that the application is packaged and released properly.
  2. Transition: Smoothly transition the application from staging to production.
  3. Scalability: Ensure the infrastructure can handle the production load.
  4. Rollback Plans: Prepare contingencies for reverting changes in case of failure.

4.6.2 Key Components and Practices:

The key components and practices in the deployment phase include:

  1. Continuous Deployment (CD):
    Responsibilities: DevOps Engineers, QA Engineers, Developers
    Objective: Automatically deploy all code changes to the production environment that pass the automated testing phase.
    Tools: Jenkins, GitLab CI/CD, GitHub Actions, Spinnaker
  2. Infrastructure as Code (IaC):
    Responsibilities: DevOps Engineers, System Administrators
    Objective: Automate the provisioning and management of infrastructure using code.
    Tools: Terraform, AWS CloudFormation, Ansible
  3. Canary Testing:
    Responsibilities: DevOps Engineers, QA Engineers
    Objective: Gradually roll out the new features to a subset of users before a full-scale rollout to identify potential issues.
  4. Phased Deployment:
    Responsibilities: DevOps Engineers, Product Managers
    Objective: Deploy the application in phases to monitor its stability and performance, enabling easier troubleshooting.
  5. Rollback:
    Responsibilities: DevOps Engineers
    Objective: Be prepared to revert the application to its previous stable state in case of any issues.
  6. Feature Flags:
    Responsibilities: Developers, Product Managers
    Objective: Enable or disable features in real-time without deploying new code.
  7. Resilience:
    Responsibilities: DevOps Engineers, Developers
    Objective: Prevent retry storms, throttle the number of client requests, and apply reliability patterns such as Circuit Breakers and Bulkheads. Avoid bimodal behavior and keep software latency and throughput within the defined SLAs.
  8. Scalability:
    Responsibilities: DevOps Engineers, Developers
    Objective: Monitor scalability limits and implement elastic scalability. Review scaling limits periodically.
  9. Throttling and Sharding Limits (and other Configurations):
    Responsibilities: DevOps Engineers, Developers
    Objective: Review operational and performance configurations such as throttling and sharding limits.
  10. Access Policies:
    Responsibilities: DevOps Engineers, Developers
    Objective: Review access policies, permissions and roles that can access the software.
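The Circuit Breaker pattern named under Resilience above can be sketched as follows. This is a simplified, single-threaded illustration rather than a production implementation: `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout values are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of piling retries on a sick dependency.
                raise CircuitOpenError("circuit is open")
            # Half-open: the timeout elapsed, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open is what helps prevent retry storms: callers get an immediate error instead of queueing more work against a dependency that is already struggling.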

4.6.3 Roles and Responsibilities:

The deployment phase includes the following roles and responsibilities:

  • DevOps Engineers: Manage CI/CD pipelines, IaC, and automate deployments.
  • Product Managers: Approve deployments based on feature completeness and business readiness.
  • QA Engineers: Ensure the application passes all tests in the staging environment before deployment.
  • System Administrators: Ensure that the infrastructure is ready and scalable for deployment.

4.6.4 Key Deliverables:

The deployment phase includes the following key deliverables:

  • Deployment Checklist: A comprehensive list of tasks and checks before, during, and after deployment.
  • CD Pipeline Configuration: The setup details for the Continuous Deployment pipeline.
  • Automated Test Cases: Essential for validating the application before automatic deployment.
  • Operational, Security and Compliance Review Documents: These documents will define checklists about operational excellence, security and compliance support in the software.
  • Deployment Logs: Automated records of what was deployed, when, and by whom (usually by the automation system).
  • Logging: Review data that will be logged and respective log levels so that data privacy is not violated and excessive logging is avoided.
  • Deployment Schedule: A timeline for the deployment process.
  • Observability: Health dashboard to monitor operational and business metrics, and alarms to receive notifications when service-level-objectives (SLO) are violated. The operational metrics will include availability, latency, and error metrics. The health dashboard will monitor utilization of CPU, disk space, memory, and network resources.
  • Rollback Plan: A detailed plan for reverting to the previous stable version if necessary.
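The Observability deliverable above mentions alarms that fire when service-level objectives are violated. A minimal sketch of such a check is shown below, assuming each request is recorded as a (latency in milliseconds, success flag) pair; the `Slo` fields, the nearest-rank percentile method, and the threshold values are illustrative assumptions, not a prescribed design.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Slo:
    min_availability: float    # e.g. 0.999 means 99.9% of requests succeed
    max_p99_latency_ms: float  # upper bound for the 99th percentile latency


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violations(requests: List[Tuple[float, bool]], slo: Slo) -> List[str]:
    """Return one alarm message per violated objective in this window."""
    if not requests:
        return []  # no traffic in the window, nothing to evaluate
    latencies = [latency for latency, _ in requests]
    successes = sum(1 for _, ok in requests if ok)
    availability = successes / len(requests)
    p99 = percentile(latencies, 99)
    alarms = []
    if availability < slo.min_availability:
        alarms.append(f"availability {availability:.4f} below {slo.min_availability}")
    if p99 > slo.max_p99_latency_ms:
        alarms.append(f"p99 latency {p99:.0f} ms above {slo.max_p99_latency_ms} ms")
    return alarms
```

In practice a monitoring system would evaluate a check like this over a sliding time window and route the resulting alarms to the on-call team.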

4.6.5 Best Practices:

Following are a few best practices for the deployment phase:

  • Automated Deployments: Use automation tools for deployment of code and infrastructure to minimize human error.
  • Monitoring and Alerts: Use monitoring tools to get real-time insights into application performance and set up alerts for anomalies.
  • Version Control: Ensure that all deployable artifacts are versioned for traceability.
  • Comprehensive Testing: Given the automated nature of Continuous Deployment, having a comprehensive suite of automated tests is crucial.
  • Rollback Strategy: Have an automated rollback strategy in case the new changes result in system failures or critical bugs.
  • Feature Toggles: Use feature flags to control the release of new features, which can be enabled or disabled without having to redeploy.
  • Audit Trails: Maintain logs and history for compliance and to understand what was deployed and when.
  • Documentation: Keep detailed records of the deployment process, configurations, and changes.
  • Stakeholder Communication: Keep all stakeholders in the loop regarding deployment schedules, success, or issues.
  • Feedback from Early Adopters: If the software is being released to internal or beta customers, capture feedback from those early adopters, including any bug reports.
  • Marketing and external communication: The release may need to be coordinated with a marketing campaign so that customers can be notified about new features.

The Deployment phase is critical for ensuring that the software is reliably and securely accessible by the end-users. A well-planned and executed deployment strategy minimizes risks and disruptions, leading to a more dependable software product.

4.7 Maintenance Phase

The Maintenance phase in the Software Development Life Cycle (SDLC) is the ongoing process of ensuring the software’s continued effective operation and performance after its release to the production environment. The objective of this phase is to sustain the software in a reliable state, provide continuous support, and make iterative improvements or patches as needed.

4.7.1 Objectives:

The objectives of the maintenance phase include:

  1. Bug Fixes: Address any issues or defects that arise post-deployment.
  2. Updates & Patches: Release minor or major updates and patches to improve functionality or security.
  3. Optimization: Tune the performance of the application based on metrics and feedback.
  4. Scalability: Ensure that the software can handle growth in terms of users, data, and transaction volume.
  5. Documentation: Update all documents related to software changes and configurations.

4.7.2 Key Components and Practices:

The key components and practices of the maintenance phase include:

  1. Incident Management:
    Responsibilities: Support Team, DevOps Engineers
    Objective: Handle and resolve incidents affecting the production environment.
  2. Technical Debt Management:
    Responsibilities: Development Team, Product Managers
    Objective: Prioritize and resolve accumulated technical debt to maintain code quality and performance.
  3. Security Updates:
    Responsibilities: Security Engineers, DevOps Engineers
    Objective: Regularly update and patch the software to safeguard against security vulnerabilities.
  4. Monitoring & Analytics:
    Responsibilities: DevOps Engineers, Data Analysts
    Objective: Continuously monitor software performance, availability, and usage to inform maintenance tasks.
  5. Documentation and Runbooks:
    Responsibilities: Support Team, DevOps Engineers
    Objective: Define cookbooks for development processes and operational issues.

4.7.3 Roles and Responsibilities:

The maintenance phase includes the following roles and responsibilities:

  • DevOps Engineers: Monitor system health, handle deployments for updates, and coordinate with the support team for incident resolution.
  • Support Team: Provide customer support, report bugs, and assist in reproducing issues for the development team.
  • Development Team: Develop fixes, improvements, and updates based on incident reports, performance metrics, and stakeholder feedback.
  • Product Managers: Prioritize maintenance tasks based on customer needs, business objectives, and technical requirements.
  • Security Engineers: Regularly audit the software for vulnerabilities and apply necessary security patches.

4.7.4 Key Deliverables:

Key deliverables for the maintenance phase include:

  • Maintenance Plan: A detailed plan outlining maintenance activities, schedules, and responsible parties.
  • Patch Notes: Documentation describing what has been fixed or updated in each new release.
  • Performance Reports: Regular reports detailing the operational performance of the software.

4.7.5 Best Practices for Maintenance:

Following are a few best practices for the maintenance phase:

  • Automated Monitoring: Use tools like Grafana, Prometheus, or Zabbix for real-time monitoring of system health.
  • Feedback Loops: Use customer feedback and analytics to prioritize maintenance activities.
  • Version Control: Always version updates and patches for better tracking and rollback capabilities.
  • Knowledge Base: Maintain a repository of common issues and resolutions to accelerate incident management.
  • Scheduled Maintenance: Inform users about planned downtimes for updates and maintenance activities.

The Maintenance phase is crucial for ensuring that the software continues to meet user needs and operate effectively in the ever-changing technical and business landscape. Proper maintenance ensures the longevity, reliability, and continual improvement of the software product. Also, note that although maintenance is defined as a separate phase above, it includes the other phases from inception to deployment, and each change will be developed and deployed incrementally in an iterative agile process.

5. Best Practices for Ensuring Reliability, Quality, and Incremental Delivery

The following best practices help ensure incremental delivery with higher reliability and quality:

5.1 Iterative Development:

Embrace the Agile principle of delivering functional software frequently. The focus should be on breaking down the product into small, manageable pieces and improving it in regular iterations, usually two to four-week sprints.

  • Tools & Techniques: Feature decomposition, Sprint planning, Short development cycles.
  • Benefits: Faster time to market, easier bug isolation and tracking, and the ability to incorporate user feedback quickly.

5.2 Automated Testing:

Implement Test-Driven Development (TDD) or Behavior-Driven Development (BDD) to script tests before the actual coding begins. Maintain these tests to run automatically every time a change is made.

  • Tools & Techniques: JUnit, Selenium, Cucumber.
  • Benefits: Instant feedback on code quality, regression testing, enhanced code reliability.
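
As a minimal sketch of the test-first flow described above, the following example uses Python's built-in unittest module instead of the JUnit, Selenium, or Cucumber tools listed. The `apply_discount` function and its rules are hypothetical and exist only to give the tests something to exercise.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    # In TDD these tests are written first and fail until the
    # function above is implemented.
    def test_applies_percentage(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percentage(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 120)
```

Saved as a module, the tests can be run with `python -m unittest`, and a CI pipeline can run the same command on every change.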

5.3 Design Review:

  • Detailed Explanation: A formal process where architects and developers evaluate high-level and low-level design documents to ensure they meet the project requirements, are scalable, and adhere to best practices.
  • Tools & Techniques: Design diagrams, UML, Peer Reviews, Design Review Checklists.
  • Benefits: Early identification of design flaws, alignment with stakeholders, consistency across the system architecture.

5.4 Code Reviews:

Before any code gets merged into the main repository, it should be rigorously reviewed by other developers to ensure it adheres to coding standards, is optimized, and is free of bugs.

  • Tools & Techniques: Git Pull Requests, Code Review Checklists, Pair Programming.
  • Benefits: Team-wide code consistency, early detection of anti-patterns, and a secondary check for overlooked issues.

5.5 Security Review:

A comprehensive evaluation of the security aspects of the application, involving both static and dynamic analyses, is conducted to identify potential vulnerabilities.

  • Tools & Techniques: OWASP Top 10, Security Scanners like Nessus or Qualys, Code Review tools like Fortify, Penetration Testing.
  • Benefits: Proactive identification of security vulnerabilities, adherence to security best practices, and compliance with legal requirements.

5.6 Operational Review:

Before deploying any new features or services, assess the readiness of the operational environment, including infrastructure, data backup, monitoring, and support plans.

  • Tools & Techniques: Infrastructure as Code tools like Terraform, Monitoring tools like Grafana, Documentation.
  • Benefits: Ensures the system is ready for production, mitigates operational risks, confirms that deployment and rollback strategies are in place.

5.7 CI/CD (Continuous Integration and Continuous Deployment):

Integrate all development work frequently and deliver changes to the end-users reliably and rapidly using automated pipelines.

  • Tools & Techniques: Jenkins, GitLab CI/CD, Docker, Kubernetes.
  • Benefits: Quicker discovery of integration bugs, reduced lead time for feature releases, and improved deployability.

5.8 Monitoring:

Implement sophisticated monitoring tools that continuously observe system performance, user activity, and operational health.

  • Tools & Techniques: Grafana, Prometheus, New Relic.
  • Benefits: Real-time data analytics, early identification of issues, and rich metrics for performance tuning.

5.9 Retrospectives:

At the end of every sprint or project phase, the team should convene to discuss what worked well, what didn’t, and how processes can be improved.

  • Tools & Techniques: Post-its for brainstorming, Sprint Retrospective templates, Voting mechanisms.
  • Benefits: Continuous process improvement, team alignment, and reflection on the project’s success and failures.

5.10 Product Backlog Management:

A live document containing all known requirements, ranked by priority and constantly refined to reflect changes and learnings.

  • Tools & Techniques: JIRA, Asana, Scrum boards.
  • Benefits: Focused development on high-impact features, adaptability to market or user needs.

5.11 Kanban for Maintenance:

For ongoing maintenance work and technical debt management, utilize a Kanban system to visualize work, limit work-in-progress, and maximize efficiency.

  • Tools & Techniques: Kanban boards, JIRA Kanban, Trello.
  • Benefits: Dynamic prioritization, quicker task completion, and efficient resource utilization.

5.12 Feature Flags:

Feature flags allow developers to toggle the availability of certain functionalities without deploying new versions.

  • Tools & Techniques: LaunchDarkly, Config files, Custom-built feature toggles.
  • Benefits: Risk mitigation during deployments, simpler rollbacks, and fine-grained control over feature releases.
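
A custom-built feature toggle of the kind listed above could look like the following sketch. It assumes an in-memory store and a percentage-based rollout; the `FeatureFlags` class and the `new_checkout` flag name are hypothetical, and this is not the API of LaunchDarkly or any other product.

```python
import hashlib


class FeatureFlags:
    """Hypothetical in-memory flag store; real systems usually load
    flags from a config file or a flag service."""

    def __init__(self, rollout_percent: dict):
        # Maps flag name -> percentage of users (0-100) who see the feature.
        self._rollout = dict(rollout_percent)

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags default to off
        # Hash flag + user so each user lands in a stable bucket in [0, 100).
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


if __name__ == "__main__":
    flags = FeatureFlags({"new_checkout": 0})
    print(flags.is_enabled("new_checkout", "user-42"))  # off for everyone
    flags.set_rollout("new_checkout", 100)              # enable without redeploying
    print(flags.is_enabled("new_checkout", "user-42"))  # on for everyone
```

Setting the rollout back to 0 acts as a rollback of the feature without deploying a new version.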

5.13 Documentation:

Create comprehensive documentation, ranging from code comments and API docs to high-level architecture guides and FAQs for users.

  • Tools & Techniques: Wiki platforms, OpenAPI/Swagger for API documentation, Code comments.
  • Benefits: Streamlined onboarding for new developers, easier troubleshooting, and long-term code maintainability.

By diligently applying these multi-layered Agile best practices and clearly defined roles, your SDLC will be a well-oiled machine—more capable of rapid iterations, quality deliverables, and high adaptability to market or user changes.

6. Conclusion

The Agile process combined with modern software development practices offers an integrated and robust framework for building software that is adaptable, scalable, and high-quality. This approach is geared towards achieving excellence at every phase of the Software Development Life Cycle (SDLC), from inception to deployment and maintenance. The key benefits of this modern software development process include flexibility and adaptability, reduced time-to-market, enhanced quality, operational efficiency, risk mitigation, continuous feedback, transparency, collaboration, cost-effectiveness, compliance and governance, and documentation for sustainment. By leveraging these Agile and modern software development practices, organizations can produce software solutions that are not only high-quality and reliable but also flexible enough to adapt to ever-changing requirements and market conditions.

May 12, 2022

Applying Laws of Scalability to Technology and People

As businesses grow their customer base and hire more employees, they face challenges in meeting customer demand, both in scaling their systems and in maintaining rapid product development with bigger teams. Businesses aim to scale systems linearly with additional computing and human resources. However, system architectures such as a monolith or a big ball of mud make scaling systems linearly onerous. Similarly, teams become less efficient as they grow in size and turn into silos. A general solution to scaling business or technical problems is divide and conquer: partition the problem into multiple sub-problems. A number of factors affect the scalability of software architectures and organizations, such as the interactions among system components or the communication between teams. For example, coordination, communication and data/knowledge coherence among system components and teams become disproportionately expensive as size grows. Software engineering and business management have developed a number of laws and principles that can be used to evaluate the constraints and trade-offs related to these scalability challenges. Following is a list of a few laws from the technology and business domains for scaling software architectures and business organizations:

Amdahl’s Law

Amdahl’s Law, named after Gene Amdahl, is used to predict the speed up of a task’s execution time when it is scaled to run on multiple processors. It states that the maximum speed up is limited by the serial fraction of the task, which creates resource contention:

Speed up (P, N) = 1 / [ (1 - P) + P / N ]

Where P is the fraction of task that can run in parallel on N processors. When N becomes large, P / N approaches 0 so speed up is restricted to 1 / (1 – P) where the serial fraction (1 – P) becomes a source of contention due to data coherence, state synchronization, memory access, I/O or other shared resources.
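
The formula and its limit can be checked numerically with a short sketch. The function names and the sample value P = 0.95 are assumptions chosen for illustration.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed up of a task whose parallel fraction p runs on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on the speed up as n grows without bound: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # With 95% of the work parallelizable, 8 processors give about 5.93x,
    # 1024 processors give about 19.64x, and the limit is about 20x.
    for n in (8, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))
    print("limit", round(amdahl_limit(0.95), 2))
```

Even a 5% serial fraction caps the speed up at roughly 20x, no matter how many processors are added.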

Amdahl’s law can also be described in terms of throughput using:

N / [ 1 + a (N - 1) ]

Where a is the serial fraction between 0 and 1. In parallel computing, a class of problems known as embarrassingly parallel workloads has little or no dependency among the parallel tasks, so their value for a is 0 because they require no inter-task communication.

Amdahl’s law can be used to scale teams as an organization grows: teams can be organized as small, cross-functional groups to parallelize the feature work for different product lines or business domains. However, the maximum speed up will still be limited by the serial fraction of the work. The serial work can be: build and deployment pipelines; reviewing and merging changes; communication and coordination between teams; and dependencies for deliverables from other teams. Fred Brooks described in his book The Mythical Man-Month how adding people to a highly divisible task can reduce overall task duration but other tasks are not so easily divisible: while it takes one woman nine months to make one baby, “nine women can’t make a baby in one month”.

The theoretical speedup of the latency of the execution of a program according to Amdahl’s law (credit wikipedia).

Brooks’s Law

Brooks’s law, coined by Fred Brooks, states that adding manpower to a late software project makes it later because of ramp-up time. As the size of a team increases, the ramp-up time for new employees also increases due to the quadratic communication overhead among team members, e.g.

Number of communication channels = N x (N - 1) / 2

Organizations can build small teams, such as two-pizza or single-threaded teams, so that the communication channels within each team do not explode and the cross-functional nature of the teams requires less communication with, and fewer dependencies on, other teams. Brooks’s law applies equally to technology when designing distributed services or components: each service should be designed as a loosely coupled module around a business domain to minimize communication with other services, and services should communicate only through well-designed interfaces.
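
To make the quadratic growth concrete, the sketch below compares one large team with several small ones. It makes a simplifying assumption that is not in the text above: teams talk to each other through a single point of contact, so cross-team channels are counted once per pair of teams. The function names are hypothetical.

```python
def communication_channels(n: int) -> int:
    """Number of pairwise communication channels among n people: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def partitioned_channels(team_size: int, team_count: int) -> int:
    """Channels when people talk freely inside a team but each pair of
    teams shares a single channel (a simplifying assumption)."""
    within_teams = team_count * communication_channels(team_size)
    between_teams = communication_channels(team_count)
    return within_teams + between_teams


if __name__ == "__main__":
    print(communication_channels(50))   # one 50-person team: 1225
    print(partitioned_channels(10, 5))  # five 10-person teams: 235
```

Under this assumption, splitting 50 people into five teams of ten reduces the number of channels by roughly a factor of five.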

Universal Scalability Law

The Universal Scalability Law is used for capacity planning and was derived from Amdahl’s law by Dr. Neil Gunther. It describes relative capacity in terms of concurrency, contention and coherency:

C(N) = N / [1 + a(N – 1) + B.N (N – 1) ]

Where C(N) is the relative capacity, a is the serial fraction between 0 and 1 due to resource contention and B is the delay for data coherency or consistency. Because the data coherency term (B) is quadratic in N, it becomes more expensive as N grows, e.g. using a consensus algorithm such as Paxos is impractical for reaching state consistency among a large set of servers because it requires additional communication between all servers. Instead, large-scale distributed storage services generally use sharding/partitioning and a gossip protocol with a leader-based consensus algorithm to minimize peer-to-peer communication.
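
A short sketch can show how the coherency term makes capacity peak and then decline. The coefficient values a = 0.05 and B = 0.0001 are made-up illustrations, and the peak formula N* = sqrt((1 - a) / B) follows from setting the derivative of C(N) to zero.

```python
import math


def usl_capacity(n: float, a: float, b: float) -> float:
    """Relative capacity C(N) = N / [1 + a(N - 1) + b N (N - 1)]."""
    return n / (1.0 + a * (n - 1) + b * n * (n - 1))


def usl_peak_concurrency(a: float, b: float) -> float:
    """Concurrency at which capacity peaks: N* = sqrt((1 - a) / b)."""
    if b <= 0:
        raise ValueError("b must be positive for a finite peak")
    return math.sqrt((1.0 - a) / b)


if __name__ == "__main__":
    a, b = 0.05, 0.0001  # hypothetical contention and coherency coefficients
    print(round(usl_peak_concurrency(a, b)))  # about 97
    for n in (50, 97, 200, 400):
        # Capacity rises up to the peak and falls beyond it.
        print(n, round(usl_capacity(n, a, b), 2))
```

With B set to 0 the function reduces to the throughput form of Amdahl's law, so the decline beyond the peak comes entirely from the coherency term.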

The Universal Scalability Law can be applied to scale teams in the same way as Amdahl’s law, where a models the serial work or dependencies between teams and B models the communication and consistent understanding among the team members. The cost of B can be minimized by building small cross-functional teams so that teams can make progress independently. You can also apply this model to any decision-making process by keeping the number of stakeholders or decision makers small so that they can easily reach agreement without grinding to a halt.

Gossip protocols also apply to people and can be used along with a writing culture, lunch & learn sessions and osmotic communication to spread knowledge and learnings from one team to other teams.

Little’s Law

Little’s Law was developed by John Little to predict the number of items in a queue for a stable, non-preemptive system. It is part of queueing theory and is described mathematically as:

L = A W

Where L is the average number of items within the system or queue, A is the average arrival rate of items and W is the average time an item spends in the system. Little’s law and queueing theory can be used for capacity planning for computing servers and for minimizing the waiting time in the queue (W).

Little’s law can be applied to predict the task completion rate in an agile process, where L represents the work-in-progress (WIP) for a sprint; A represents the arrival and departure rate, i.e. the throughput/capacity of tasks; and W represents the lead time, i.e. the average amount of time a task spends in the system.

WIP = Throughput x Lead-Time

Lead-Time = WIP / Throughput

You can use this relationship to reduce the work in progress or lead time and improve the throughput of task completion. Little’s law observes that you can accomplish more by keeping work-in-progress or inventory small. You will be able to better respond to unpredictable delays if you keep a buffer in your capacity and avoid 100% utilization.
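
The relationship can be worked through with a small sketch. The sprint numbers (20 tasks in progress, 5 tasks completed per week) are hypothetical.

```python
def lead_time(wip: float, throughput: float) -> float:
    """Lead-Time = WIP / Throughput (Little's law rearranged)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


def work_in_progress(throughput: float, lead_time_value: float) -> float:
    """WIP = Throughput x Lead-Time."""
    return throughput * lead_time_value


if __name__ == "__main__":
    # 20 tasks in progress at 5 tasks per week take 4 weeks on average.
    print(lead_time(wip=20, throughput=5))
    # Halving the WIP at the same throughput halves the lead time.
    print(lead_time(wip=10, throughput=5))
```

The units only need to be consistent: if throughput is measured in tasks per week, the lead time comes out in weeks.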

Kingman’s formula

Kingman’s formula expands Little’s law by adding utilization and variability for predicting the wait time before requests are served:

E(Wq) ≈ [ p / (1 – p) ] x [ (ca^2 + cs^2) / 2 ] x T
(credit wikipedia)

where T is the mean service time, m = 1/T is the service rate, A is the mean arrival rate, p = A/m is the utilization, ca is the coefficient of variation for arrivals and cs is the coefficient of variation for service times. Kingman’s formula shows that the queue size grows towards infinity as you approach 100% utilization and that you will have longer queues with greater variability of work. These insights can be applied to both technical and business processes so that you can build systems with greater predictability of processing time, a smaller wait time E(Wq) and higher throughput.
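
The effect of utilization on waiting time can be seen by evaluating the approximation above at a few points. In this sketch the 10 ms service time and the coefficients of variation equal to 1 (the case of exponential arrivals and service times) are assumptions chosen for illustration.

```python
def expected_queue_wait(utilization: float, ca: float, cs: float,
                        mean_service_time: float) -> float:
    """Approximate mean wait in queue:
    (p / (1 - p)) * ((ca^2 + cs^2) / 2) * mean service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    load_factor = utilization / (1.0 - utilization)
    variability_factor = (ca ** 2 + cs ** 2) / 2.0
    return load_factor * variability_factor * mean_service_time


if __name__ == "__main__":
    # With a 10 ms service time the wait grows from about 10 ms at 50%
    # utilization to about 90 ms at 90% and about 990 ms at 99%.
    for p in (0.5, 0.9, 0.99):
        print(p, round(expected_queue_wait(p, 1.0, 1.0, 10.0), 1))
```

Lowering either coefficient of variation shrinks the wait proportionally, which is why reducing variability in the work helps as much as adding capacity.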

Note: See Erlang analysis for serving requests in a system without a queue where new requests are blocked or rejected if there is not sufficient capacity in the system.

Gustafson’s Law

Gustafson’s law improves Amdahl’s law with a keen observation that parallel computing enables solving larger problems by computations on very large data sets in a fixed amount of time. It is defined as:

S = s + p x N

S = s + (1 – s) x N

S = N + (1 – N) x s

where S is the theoretical speed up with parallelism, N is the number of processors, s is the serial fraction and p is the parallel part such that s + p = 1.

Gustafson’s law shows that limitations imposed by the sequential fraction of a program may be countered by increasing the total amount of computation. This allows solving bigger technical and business problems with greater computing and human resources.
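
The difference from Amdahl's law can be illustrated by evaluating both for the same serial fraction. This sketch uses the form S = N + (1 – N) x s; the values s = 0.05 and N = 100 are made up for the comparison.

```python
def gustafson_speedup(s: float, n: int) -> float:
    """Scaled speed up S = N + (1 - N) * s for serial fraction s."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    return n + (1 - n) * s


def amdahl_speedup(s: float, n: int) -> float:
    """Fixed-size speed up 1 / (s + (1 - s) / N), shown for comparison."""
    return 1.0 / (s + (1.0 - s) / n)


if __name__ == "__main__":
    # With a 5% serial fraction on 100 processors, the scaled speed up is
    # about 95x while the fixed-size speed up is under 17x.
    print(round(gustafson_speedup(0.05, 100), 2))
    print(round(amdahl_speedup(0.05, 100), 2))
```

The gap arises because Gustafson's law assumes the problem size grows with the number of processors, while Amdahl's law holds the problem size fixed.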

Conway’s Law

Conway’s law states that an organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure. It means that the architecture of a system is derived from the team structures of an organization; however, you can also use the architecture to derive the team structures. This allows building teams along the architecture boundaries so that each team is small, cross-functional and cohesive. A study by the Harvard Business School found that large co-located teams tended to produce more tightly coupled and monolithic codebases, whereas small distributed teams produced more modular codebases. These lessons can be applied to scaling teams and architecture so that teams and system modules are built around organizational boundaries and independent concerns to promote autonomy and reduce tight coupling.

Pareto Principle

The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes. This principle shows up in numerous technical and business problems: 20% of the code contains 80% of the errors; customers use 20% of the functionality 80% of the time; 80% of optimization improvements come from 20% of the effort, etc. It can also be used to identify hotspots or critical paths when scaling, as some microservices or teams may receive disproportionate demands. Scaling computing resources is relatively easy, but scaling a team beyond an organizational boundary is hard. You will have to apply other management tools such as prioritization, planning, metrics, automation and better communication to manage critical work.

Metcalfe’s Law

Metcalfe’s law states that if there are N users of a telecommunications network, the value of the network is proportional to N^2. It is also referred to as the network effect and applies to social networking sites.

Number of possible pair connections = N * (N – 1) / 2

Reed’s Law expanded this law and observed that the utility of large networks can scale exponentially with the size of the network.

Number of possible subgroups of a network = 2^N – N – 1

This law explains the popularity of social networking services via viral communication. These laws can be applied to model information flow between teams or message exchange between services in order to avoid peer-to-peer communication within an extremely large group of people or set of nodes. A common alternative is to use a gossip protocol or to designate a partition leader for each group that communicates with other leaders and then disseminates information to the group internally.
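
The two counts grow at very different rates, which a short sketch makes visible. The function names are hypothetical; the formulas are the pair count N * (N – 1) / 2 and the subgroup count of 2 to the power N, minus N, minus 1.

```python
def pair_connections(n: int) -> int:
    """Possible pairwise connections among n nodes: n(n-1)/2."""
    return n * (n - 1) // 2


def possible_subgroups(n: int) -> int:
    """Possible subgroups of two or more members: 2**n - n - 1."""
    return 2 ** n - n - 1


if __name__ == "__main__":
    # For 10 members there are 45 pairs but 1013 possible subgroups;
    # the gap widens rapidly as the network grows.
    for n in (5, 10, 20):
        print(n, pair_connections(n), possible_subgroups(n))
```

The subgroup count subtracts the N single-member sets and the empty set from the 2 to the power N possible subsets.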

Dunbar Number

Dunbar’s number is a suggested cognitive limit to the number of people with whom one can maintain stable social relationships. It has a commonly used value of 150 and can be used to limit direct communication connections within an organization.

Wirth’s Law and Parkinson’s Law

Wirth’s Law is named after Niklaus Wirth, who observed that software is getting slower more rapidly than hardware is becoming faster. Over the last few decades, processors have become exponentially faster, as predicted by Moore’s Law, but that gain often allows software developers to build more complex software that consumes all of the speed gains. Another factor is that it allows software developers to use languages and tools that may not generate efficient code, so the code becomes bloated. There is a similar law in software development called Parkinson’s law, which states that work expands to fill the time available for it. You also have to watch for Hofstadter’s Law, which states that “it always takes longer than you expect, even when you take into account Hofstadter’s Law”; and Brooks’s Law, which states that “adding manpower to a late software project makes it later.”

Wirth’s Law, named after Niklaus Wirth, posits that software tends to become slower at a rate that outpaces the speed at which hardware becomes faster. This observation reflects a trend where, despite significant advancements in processor speeds as predicted by Moore’s Law, software complexity increases correspondingly. Developers often leverage these hardware improvements to create more intricate and feature-rich software, which can negate the hardware gains. Additionally, the use of programming languages and tools that do not prioritize efficiency can lead to bloated code.

In the realm of software development, there are similar principles, such as Parkinson’s law, which suggests that work expands to fill the time allotted for its completion. This implies that given more time, software projects may become more complex or extended than initially necessary. Moreover, Hofstadter’s Law offers a cautionary perspective, stating, “It always takes longer than you expect, even when you take into account Hofstadter’s Law.” This highlights the often-unexpected delays in software development timelines. Brooks’s Law further adds to these insights with the adage, “Adding manpower to a late software project makes it later.” These laws collectively emphasize that the demand upon a resource tends to expand to match the supply of the resource, but adding resources later also poses challenges due to complexity in software development and project management.

Summary

The laws above show how you can partition a tightly coupled architecture and large teams into a modular architecture and small autonomous teams. For example, Amdahl’s and the Universal Scalability laws demonstrate that you have to account for the cost of serial work, coordination and communication between partitions as you parallelize the problem because they become bottlenecks as you scale. Brooks’s and Metcalfe’s laws indicate that you will need to manage the number of communication paths among modules or teams as they can explode quadratically, thus stifling your growth. Little’s law and Kingman’s formula establish that you need to reduce inventory or work in progress and avoid 100% utilization in order to provide reliable throughput. Conway’s law shows how architecture and team structures can be aligned for maximum autonomy and productivity. This allows you to accomplish more work by using small cross-functional teams that own independent product lines and build modular architecture to reduce dependency on other teams and subsystems. The Pareto principle can be used to make small changes to the architecture or teams that result in higher scalability and productivity. Wirth’s Law and Parkinson’s Law, when applied judiciously, can be instrumental in enhancing efficiency in software development: setting more stringent timelines and clear, concise objectives can counteract the tendency for work to expand to fill the available time. The Dunbar number only applies to people, but it can be used to limit dependencies on external teams because the human mind has a finite capacity to maintain external relationships. However, before applying these laws, you should have clear goals and collect proper metrics and KPIs so that you can measure the baseline and improvements from these laws. You should also be cautious about applying these laws prematurely for scalability as it may make things worse.
Finally, when solving scalability and performance related problems, it is vital to focus on global optimization to scale an entire organization or the system as opposed to a local optimization by focusing strictly only on a specific part of the system.

March 24, 2022

Architecture Patterns and Practices for Sustainable Software Delivery Pipelines

Filed under: Project Management,Technology — admin @ 10:31 pm

Abstract

Software is eating the world and today’s businesses demand shipping software features at a higher velocity to enable learning at a greater pace without compromising the quality. However, each new feature increases viscosity of existing code, which can add more complexity and technical debt so the time to market for new features becomes longer. Managing a sustainable pace for software delivery requires continuous improvements to the software development architecture and practices.

Software Architecture

The Software Architecture defines the guiding principles and structure of the software systems. It also includes quality attributes such as performance, sustainability, security, scalability, and resiliency. The software architecture is then continuously updated through an iterative software development process and a feedback cycle from actual use in the production environment. If the software architecture is ignored, it decays, which results in higher complexity and technical debt. In order to reduce technical debt, you can build a backlog of technical and architecture-related changes so that you can prioritize them along with the product development. In order to maintain a consistent architecture throughout your organization, you can document the architecture principles to define high-level guidelines for best practices, documentation templates, the review process and guidance for architecture decisions.

Quality Attributes

Following are major quality attributes of the software architecture:

  • Availability — It defines the percentage of time that the system is available, e.g. available-for-use-time / total-time. It is generally expressed as a percentage such as 99.99% (four nines), which allows a downtime of about 52 minutes in a year. It can also be calculated in terms of mean-time between failure (MTBF) and mean-time to recover (MTTR) using MTBF/(MTBF+MTTR). The availability depends not only on the service that you are providing but also on its dependent services, e.g. P-service * P-dep-service-1 * P-dep-service-2. You can improve availability with redundant services, which can be calculated as Max-availability - (100% - Service-availability) ** Redundancy-factor, e.g. two redundant instances that are each 99% available yield 100% - (1%) ** 2 = 99.99%. In order to further improve availability, you can detect faults and use redundancy and state synchronization for fault recovery. The system should also handle exceptions gracefully so that it doesn’t crash or go into a bad state.
  • Capacity — Capacity defines how the system scales by adding hardware resources.
  • Extensibility — Extensibility defines how the system meets future business requirements without significantly changing existing design and code.
  • Fault Tolerance — Fault tolerance prevents a single point of failure and allows the system to continue operating even when parts of the system fail.
  • Maintainability — Higher quality code allows building robust software with higher stability and availability. This improves software delivery due to modular and loosely coupled design.
  • Performance — It is defined in terms of the latency of an operation under normal or peak load. Performance may degrade as resources are consumed, which affects the throughput and scalability of the system. You can measure the user’s response time, throughput and utilization of computational resources by stress testing the system. A number of tactics can be used to improve performance such as prioritization, reducing overhead, rate-limiting, asynchronicity, caching, etc. Performance testing can be integrated with a continuous delivery process that uses load and stress testing to measure performance metrics and resource utilization.
  • Resilience — Resilience accepts the fact that faults and failure will occur so instead system components resist them by retrying, restarting, limiting error propagation or other measures. A failure is when a system deviates from its expected behavior as a result of accidental fault, misconfigurations, transient network issues or programming error. Two metrics related to resilience are mean-time between failure (MTBF) and mean-time to recover (MTTR), however resilient systems pay more attention to recovery or a shorter MTTR for fast recovery.
  • Recovery — Recovery looks at system recover in relation with availability and resilience. Two metrics related to recovery are recovery point objective (RPO) and recovery time objective (RTO), where RPO determines data that can be lost in case of failure and RTO defines wait time for the system recovery.
  • Reliability — Reliability looks at the probability of failure or failure rate.
  • Reproducibility — Reproducibility uses version control for code, infrastructure, configuration so that you can track and audit changes easily.
  • Reusability — It encourages code reuse to improve reliability and productivity and to save the cost of duplicated effort.
  • Scalability — It defines the ability of the system to handle an increase in workload without performance degradation. It can be expressed in terms of vertical or horizontal scalability, where horizontal scaling reduces the impact of isolated failures and improves workload availability. Cloud computing offers elastic and auto-scaling features for adding hardware when a higher request rate is detected by the load balancer.
  • Security — Security primarily looks at confidentiality, integrity, availability (CIA) and is critical in building distributed systems. Building secure systems depends on security practices such as strong identity management, defense in depth, zero trust networks, auditing, and protecting data in motion and at rest. You can adopt DevSecOps, which shifts security left to earlier in the software development lifecycle, with processes such as Security by Design (SbD), STRIDE (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege), PASTA (Process for Attack Simulation and Threat Analysis), VAST (Visual, Agile, and Simple Threat modeling), CAPEC (Common Attack Pattern Enumeration and Classification), and OCTAVE (Operationally Critical Threat, Asset, and Vulnerability Evaluation).
  • Testability — It encourages building systems in such a way that they are easier to test.
  • Usability — It defines user experience of user interface and information architecture.
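
The availability arithmetic from the Availability bullet above can be sketched in a few lines of Python. The function names are hypothetical, availabilities are expressed as fractions (0.9999 rather than 99.99), and the redundancy formula assumes that replicas fail independently, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime for an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(*availabilities: float) -> float:
    """A service that needs every dependency is up only when all of them are up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(availability: float, replicas: int) -> float:
    """With independent replicas, the system is down only if all replicas are down."""
    return 1.0 - (1.0 - availability) ** replicas
```

For example, `downtime_minutes_per_year(0.9999)` comes to about 52.6 minutes, and two independent replicas that are each 99% available give `redundant_availability(0.99, 2)` of about 99.99%.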

Architecture Patterns

Following is a list of architecture patterns that help in building high-quality software:

Asynchronicity

Synchronous services are difficult to scale and to recover from failures because they require low latency and their callers can easily overwhelm them. Messaging-based asynchronous communication, using point-to-point or publish/subscribe channels, is more suitable for handling faults or high load. This improves resilience because service components can restart in case of failure while messages remain in the queue.

Admission Control

Admission control adds an authentication, authorization or validation check in front of the event queue so that the service can handle the load and prevent overload when demand exceeds the server capacity.

Back Pressure

When a producer is generating workload faster than the server can process, it can result in long request queues. Back pressure signals clients that servers are overloaded and clients need to slow down. However, rogue clients may ignore these signals so servers often employ other tactics such as admission control, load shedding, rate limiting or throttling requests.

Big fleet in front of small fleet

You should look at all transitive dependencies when scaling a service with a large fleet of hosts so that you don’t drive a large volume of network traffic into dependent services that run on a smaller fleet. You can use load testing to find the bottlenecks and update SLAs for the dependent services so that they are aware of the network load from your APIs.

Blast Radius

A blast radius defines the impact of a failure on the overall system. In order to limit the blast radius, the system should eliminate single points of failure, roll out changes gradually using canaries, and stop cascading failures using circuit breakers, retries and timeouts.

Bulkheads

Bulkheads isolate faults from one component to another, e.g. you may use different thread pool for different workloads or use multiple regions/availability-zones to isolate failures in a specific datacenter.

Caching

Caching can be implemented at several layers to improve performance, such as database cache, application cache, proxy/edge cache, pre-compute cache and client-side cache.

Circuit Breakers

The circuit breaker is defined as a state machine with three states: normal, checking and tripped. It can be used to detect persistent failures in a dependent service and trip its state to disable invocation of the service temporarily with some default behavior. It can be later changed to the checking state for detecting success, which changes its state to normal after a successful invocation of the dependent service.
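
A minimal sketch of that state machine follows, using the state names from the paragraph above. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions made for illustration; the sketch is not thread-safe and is not modeled on any particular library.

```python
import time


class CircuitBreaker:
    """Three-state breaker: normal, tripped, checking."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to wait before probing again
        self.clock = clock                          # injectable so tests can control time
        self.state = "normal"
        self.failures = 0
        self.tripped_at = 0.0

    def call(self, operation, fallback):
        if self.state == "tripped":
            if self.clock() - self.tripped_at < self.reset_timeout:
                return fallback()      # skip the dependency and use the default behavior
            self.state = "checking"    # timeout elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self._on_failure()
            return fallback()
        self.state = "normal"          # a successful call resets the breaker
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, trips the breaker.
        if self.state == "checking" or self.failures >= self.failure_threshold:
            self.state = "tripped"
            self.tripped_at = self.clock()
```

While the breaker is tripped, `call` returns the fallback without touching the dependency, which is what keeps a struggling service from being hit with more traffic.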

CQRS / Event Sourcing

Command and Query Responsibility Segregation (CQRS) separates read and update operations in the database. It’s often implemented using event-sourcing that records changes in an append-only store for maintaining consistency and audit trails.
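
As a rough illustration of the idea, the sketch below keeps an in-memory append-only log, routes writes through command functions and answers reads by replaying the log into a view. The account domain, the event shapes and all names are hypothetical; a real system would persist the log and usually maintain the read model incrementally.

```python
class EventStore:
    """Append-only log: events are added but never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> None:
        self._events.append(dict(event))

    def events(self) -> tuple:
        return tuple(self._events)


def balances(store: EventStore) -> dict:
    """Query side: rebuild a read model by replaying every event."""
    view = {}
    for event in store.events():
        sign = 1 if event["type"] == "deposited" else -1
        view[event["account"]] = view.get(event["account"], 0) + sign * event["amount"]
    return view


def deposit(store: EventStore, account: str, amount: int) -> None:
    """Command side: validate, then record what happened."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    store.append({"type": "deposited", "account": account, "amount": amount})


def withdraw(store: EventStore, account: str, amount: int) -> None:
    """Command side: reject the command instead of recording an invalid event."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if balances(store).get(account, 0) < amount:
        raise ValueError("insufficient funds")
    store.append({"type": "withdrawn", "account": account, "amount": amount})
```

Because the log is the source of truth, the audit trail comes for free: every state the view has ever been in can be reconstructed by replaying a prefix of the events.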

Default Values

Default values provide a simple way to provide limited or degraded behavior in case of failure in dependent configuration or control service.

Disaster Recovery

Disaster recovery (DR) enables business continuity in the event of large-scale failure of data centers. Based on cost, availability and RTO/RPO constraints, you can deploy services to multiple regions for hot site; only replicate data from one region to another region while keeping servers as standby for warm site; or use backup/restore for cold site. It is essential to periodically test and verify these DR procedures and processes.

Distributed Saga

Maintaining data consistency in a distributed system where data is stored in multiple databases can be hard, and using two-phase commit may incur high complexity and performance overhead. You can use a distributed Saga for implementing long-running transactions. It maintains the state of the transaction and applies compensating transactions in case of a failure.
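
The compensation logic can be sketched as follows. The `run_saga` function and the shape of `steps` are assumptions for illustration; a production saga would also persist its progress so that it can resume after a crash, and its compensations would need to be idempotent.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

For an order workflow, the steps might be reserve inventory, charge payment and schedule shipping; if shipping fails, the payment is refunded and then the inventory is released.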

Failing Fast

You can fail fast if the workload cannot serve the request due to unavailability of resources or dependent services. In some cases, you can queue requests, however it’s best to keep those queues short so that you are not spending resources to serve stale requests.

Function as a Service

Function as a service (FaaS) offers serverless computing so that you don’t have to manage physical resources. Cloud vendors offer services such as AWS Lambda, Google Cloud Functions and Azure Functions to build serverless applications for scalable workloads. These functions can be easily scaled to handle load spikes; however, you have to be careful when scaling them so that any services that they depend on can support the workload. Each function should be designed with single responsibility, idempotency and shared-nothing principles so that it can be executed concurrently. Serverless applications generally use an event-based architecture for triggering functions, and because serverless functions are more granular, they incur more communication overhead. In addition, chaining functions within the code can result in tightly coupled applications; instead, use a state machine or a workflow to orchestrate the communication flow. There is also open source support for FaaS-based serverless computing, such as OpenFaaS and OpenWhisk on top of Kubernetes or OpenShift, which avoids locking into a specific cloud provider.

Graceful Degradation

Instead of failing a request when dependent components are unhealthy, a service may use circuit-breaker pattern to return a predefined or default response.

Health Checks

Health checks run a dummy or synthetic transaction that performs the action without affecting real data to verify the system component and its dependencies.

Idempotency

An idempotent service applies the effect of an API request only once, so resending the same request due to retries has no additional side effects. Idempotent APIs typically use a client-generated identifier or token, and the idempotent service returns the same response if a duplicate request is received.
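
A small sketch of the client-generated token approach is shown below. The `PaymentService` class, its `charge` method and the in-memory dictionary are hypothetical; a real service would keep the keys in durable storage, expire them after some time and guard against concurrent duplicates.

```python
class PaymentService:
    """Remembers the response for each idempotency key it has seen."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects, kept in memory for illustration

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A duplicate request returns the stored response without a second charge.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {
            "status": "charged",
            "account": account,
            "amount_cents": amount_cents,
            "charge_number": len(self.charges),
        }
        self._responses[idempotency_key] = response
        return response
```

A client would typically generate the key once per logical operation, for example with `str(uuid.uuid4())`, and reuse it on every retry of that operation.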

Layered Architecture

The layered architecture separates software into different concerns such as:

  • Presentation Layer
  • Business Logic Layer
  • Service Layer
  • Domain Model Layer
  • Data Access Layer

Load Balancer

A load balancer distributes traffic among a group of resources so that a single resource is not overloaded. Load balancers also monitor the health of servers, and you can set up a load balancer for each group of resources to ensure that requests are not routed to unhealthy or unavailable resources.

Load Shedding

Load shedding rejects work at the edge when the server side exceeds its capacity, e.g. a server may return an HTTP 429 error to signal clients that they can retry at a slower rate.

Loosely coupled dependencies

Queuing systems, streaming systems and workflows isolate the behavior of dependent components and increase resiliency through asynchronous communication.

MicroServices

Microservices evolved from service oriented architecture (SOA) and support both point to point protocols such as REST/gRPC and asynchronous protocols based on messaging/event bus. You can apply bounded-context of domain-driven design (DDD) to design loosely coupled services.

Model-View-Controller

It decouples user interface from the data model and application functionality so that each component can be independently tested. Other variations of this pattern include model–view–presenter (MVP) and model–view–viewmodel (MVVM).

NoSQL

NoSQL database technologies provide support for high availability and variable/write-heavy workloads that can be easily scaled with additional hardware. NoSQL databases make different CAP and PACELC tradeoffs among consistency, availability, partition tolerance and latency. A number of cloud vendors provide managed NoSQL database solutions; however, they can create latency issues if the services accessing these databases are not colocated.

No Single Point of Failure

In order to eliminate single-points of failures for providing high availability and failover, you can deploy redundant services to multiple regions and availability zones.

Ports and Adapters

Ports and Adapters (Hexagon) separates interface (ports) from implementation (adapters). The business logic is encapsulated in the Hexagon that is invoked by the implementation (adapters) when actors operate on capabilities offered by the interface (port).

Rate Limiting and Throttling

Rate-limiting defines the rate at which clients can access the services based on the license policy. Throttling can be used to restrict access as a result of an unexpected increase in demand. For example, the server can return HTTP 429 to notify clients that they should back off or retry at a slower rate.
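
One common way to implement such a limit is a token bucket, sketched below. The token bucket is a technique chosen here for illustration rather than something the text prescribes, and the `TokenBucket` class, its parameters and the `handle_request` helper are hypothetical names.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket: TokenBucket) -> int:
    """Return an HTTP status code: 200 if admitted, 429 if the client must slow down."""
    return 200 if bucket.allow() else 429
```

In practice a service would keep one bucket per client or per API key so that a single noisy client cannot use up the allowance of the others.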

Retries with Backoff and Jitter

A remote operation can be retried if it fails due to transient failure or a server overload, however each retry should use a capped exponential backoff so that retries don’t cause additional load on the server. In a layered architecture, retry should be performed at a single point to minimize multifold retries. Retries can use circuit-breakers and rate-limiting to throttle requests. In some cases, requests may timeout for clients but succeed on the server side so the APIs must be designed with idempotency so that they are safe to retry. In order to avoid retries at the same time, a small random jitter can be added with retries.
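
The retry policy described above can be sketched as follows, using the "full jitter" variant in which each delay is drawn uniformly between zero and the capped exponential ceiling. The function names, default values and the `is_transient` predicate are assumptions for illustration.

```python
import random
import time


def backoff_delay(attempt: int, base: float, cap: float, rng=random.random) -> float:
    """Capped exponential backoff with full jitter for a zero-based attempt number."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def retry(operation, is_transient, max_attempts=5, base=0.1, cap=5.0,
          sleep=time.sleep, rng=random.random):
    """Call operation, retrying transient failures up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            # Give up on non-transient errors or when attempts are exhausted.
            if not is_transient(error) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

The operation passed to `retry` must be idempotent, since a request that timed out on the client may already have succeeded on the server.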

Rollbacks

The software should be designed with rollbacks in mind so that all code, database schema and configuration changes can be easily rolled back. A production environment might be running multiple versions of the same service, so care must be taken to design APIs that are both backward and forward compatible.

Stateless and Shared nothing

Shared nothing architecture helps in building stateless and loosely coupled services that can be easily scaled horizontally for providing high availability. This architecture allows recovering from isolated failures and supports auto-scaling by shrinking or expanding resources based on the traffic patterns.

Startup dependencies

Upon start of services, they may need to connect to certain configuration or bootstrap services so care must be taken to avoid thundering herd problems that can overwhelm those dependent services in the event of a wide region outage.

Timeouts

Timeouts help in building resilient systems by bounding how long a caller waits on external services and by helping to prevent the thundering herd problem. Timeouts can also be used when retrying a failed operation after a transient failure or a server overload. A timeout can also add a small jitter to randomly spread the load on the server. Jitter can also be applied to timers of scheduled jobs or delayed work.

Watchdogs and Alerts

A watchdog monitors a system component for specific signals such as latency, traffic, errors, saturation and SLOs. It then sends an alert based on the monitoring configuration, which triggers an email, on-call paging or an escalation.

Virtualization and Containers

Virtualization allows abstracting computing resources using virtual machines or containers so that you don’t depend on the physical implementation. A virtual machine is a complete operating system running on top of a hypervisor, whereas a container is an isolated, lightweight environment for running applications. Virtualization allows building immutable infrastructure that is specifically designed to meet application requirements and can be easily deployed on a variety of hardware resources.

Architecture Practices

Following are best practices for sustainable software delivery:

Automation

Automation builds pipelines for continuous integration, continuous testing and continuous delivery to improve speed and agility of the software delivery. Any kind of operation procedures for deployment and monitoring can be stored in version control and then automatically applied with CI/CD procedures. In addition, automated procedures can be defined to track failures based on key performance indicators and trigger recovery or repair for the erroneous components.

Automated Testing

Automated testing allows building software with a suite of unit, integration, functional, load and security tests that verify the behavior and ensure that it can meet production demand. These automated tests will run as part of CI/CD pipelines and will stop deployment if any of the tests fail. In order to run end-to-end and load tests, the deployment scripts will create a new environment and set up test data. These tests may replicate synthetic transactions based on production traffic and benchmark the performance metrics.

Capacity Planning

Using load testing, monitoring production traffic patterns and demand with workload utilization help forecast the resources needed for future growth. This can be further strengthened with a capacity model that calculates unit-price of resources and growth forecast so that you can automate addition or removal of resources based on demand.

Cloud Computing

Adopting cloud computing simplifies resource provisioning and its elasticity allows organizations to grow or shrink those resources based on the demand. You can also add automation to optimize utilization of the resources and reduce costs when allocating more resources.

Continuous Delivery

Continuous delivery automates production deployment of small and frequent changes by developers. Continuous delivery relies on continuous integration that runs automated tests and automated deployment without any manual intervention. During a software development process, a developer picks a feature, works on changes and then commits changes to the source control after peer code-review. The automated build system will run a pipeline to create a container image based on the commit and then deploy it to a test or QA environment. The test environment will run automated unit, integration and regression tests using test data in the database. The code is then promoted to the main branch and the automated build system tags and builds the image on the head commit of the main branch, which is then pushed to the container registry. The pre-prod environment pulls the image, restarts the pre-prod container and runs more comprehensive tests, including performance tests, with a larger set of test data in the database. You may need multiple stages of pre-prod deployment such as alpha, beta and gamma environments, where each environment may require deployment to a unique datacenter. After successful testing, the production systems are updated with the new image using rolling updates, blue/green deployments or canary deployments to minimize disruption to end users. The monitoring system watches for error rates at each stage of the deployment and automatically rolls back changes if a problem occurs.

Deploy over Multiple Zones and Regions

In order to provide high availability, compliance and reduced latency, you can deploy to multiple availability zones and regions. Global load balancers can be used to route traffic based on geographic proximity to the closest region. This also helps in implementing business continuity as applications can easily fail over to another region with minimal data loss.

Service Mesh

In order to easily build distributed systems, a number of platforms based on service-mesh pattern have emerged to abstract a common set of problems such as network communication, security, observability, etc:

Dapr – Distributed Application Runtime

The Distributed Application Runtime (Dapr) provides a variety of communication protocols, encryption, observability and secret management for building secured and resilient distributed services.

Envoy

Envoy is a service proxy for building cloud native applications with built-in support for networking protocols and observability.

Istio service mesh

Istio is built on top of Kubernetes and Envoy to build service mesh with builtin support for networking, traffic management, observability and security. A service mesh also addresses features such as A/B testing, canary deployments, rate limiting, access control, encryption, and end-to-end authentication.

Linkerd

Linkerd is a service mesh for Kubernetes that consists of a control plane and a data plane with built-in support for networking, observability and security. The control plane allows controlling services, and the data plane acts as a sidecar container that handles network traffic and communicates with the control plane for configuration.

WebAssembly

The WebAssembly is a stack-based virtual machine that can run at the edge or in cloud. A number of WebAssembly platforms have adopted Actor model to build a platform for writing distributed applications such as wasmCloud and Lunatic.

Documentation

The architecture document defines goals and constraints of the software system and provides various perspectives such as use-cases, logical, data, processes, and physical deployment. It also includes non-functional or quality attributes such as performance, growth, scalability, etc. You can document these aspects using standards such as 4+1, C4, and ERD as well as document the broader enterprise architecture using methodologies like TOGAF, Zachman, and EA.

Incident management

Incident management defines the process of root-cause analysis and the actions that an organization can take when an incident affects the production environment. It defines best practices such as clear ownership, reducing time to detect/mitigate, blameless postmortems and prevention measures. The organization can then implement preventive measures and share lessons learned from all operational events and failures across teams. You can also use a pre-mortem to identify potential areas that can be improved or mitigated. Another way to simulate potential problems is using chaos engineering or setting up game days to test the workloads for various scenarios and outages.

Infrastructure as Code

Infrastructure as code uses a declarative language to define development, test and production environments, and these definitions are managed by the source code management software. This provisioning and configuration logic can be used by CI/CD pipelines to automatically deploy and test environments. Following is a list of frameworks for building infrastructure from code:

Azure Resource Manager

Azure offers Azure Resource Manager (ARM) templates based on the JSON format to declaratively define the infrastructure that you intend to deploy.

AWS Cloud Development Kit

The Cloud Development Kit (CDK) supports high-level programming languages to construct cloud resources on Amazon Web Services so that you can easily build cloud applications.

Hashicorp Terraform

Terraform uses HCL based configurations to describe computing resources that can be deployed to multiple cloud providers.

Monitoring

Monitoring measures key performance indicators (KPI) and service-level objectives (SLO) that are defined at the infrastructure, applications, services and end-to-end levels. These include both business and technical metrics such as number of errors, hot spots, call graphs, which are visible to the entire team for monitoring trends and reacting quickly to failures.

Multi-tenancy

If your system is consumed by different groups or tenants of users, you will need to design your system and services so that they isolate data and computing resources in a secure and reliable fashion. Each layer of the system can be designed to treat tenant context as a first-class construct, which is tied to the user identity. You can capture usage metrics per tenant to identify bottlenecks, estimate cost and analyze resource utilization for capacity planning and growth projections. The operational dashboards can also use these metrics to construct tenant-based operational views and proactively respond to unexpected load.

Security Review

In order to minimize the security risk, the development teams can adopt shift-left on security and DevSecOps practices to closely collaborate with the InfoSec team and integrate security review into every phase of the software development lifecycle.

Version Control Systems

Version control systems such as Git or Mercurial help track code changes, configurations and scripts over time. You can adopt workflows such as gitflow or trunk-based development for check-in process. Other common practices include smaller commits, testing code and running static analysis or linters/profiling tools before checkin.

Summary

Software complexity is a major reason for missed deadlines and slow/buggy software. This complexity can be essential complexity within the business domain, but it’s often accidental complexity that results from technical debt, poor architecture and poor development practices. Another source of accidental complexity is distributed computing, where you need to handle security, rate-limiting, observability, etc. consistently across the distributed systems. For example, virtualization helps in building immutable infrastructure and adopting infrastructure as code; functions as a service simplify building microservices; and distributed platforms such as Istio and Linkerd remove a lot of cruft around security, observability, traffic management and communication protocols when building distributed systems. The goal of a good architecture is to simplify building, testing, deploying and operating software. You need to continually improve the system architecture and its practices to build sustainable software delivery pipelines that can meet both current and future demands of users.

October 6, 2019

Production Deployment Best Practices

Filed under: Computing,Project Management,Technology — admin @ 1:37 pm

Following are a few best practices I have learned over the years for deployment to the production environment especially when working with multiple versions of the software on multiple data centers:

Coding/Debugging Best Practices

Naming Threads

When using Java, you can name background threads so that logs show meaningful thread names, and mark them as daemon threads so that they do not keep the Java process alive and are stopped automatically when it exits.

Resource Cleanup

In order to avoid any resource leaking, always close the resources that require explicit cleanup such as I/O streams, database connections, etc.

Backward / Forward Compatibility

When working with multiple versions of the software, it’s critical that all changes to the domain model are both backward and forward compatible so that new schema can be read by the old service and old schema can be read by new service. If possible, always deploy the changes that can read the new schema before deploying the changes that write the new schema.

Character Encoding

Always use UTF-8 for character encoding in your code and databases.

Time Zones

Always use UTC timezone for application code and databases.

Internationalization/Localization/Accessibility

Apply i18n, l10n and accessibility standards for your user interfaces and services.

Data Validation

NullPointerException and wrong formats are common causes of many production issues so always validate your model, input parameters and results for proper format and ranges.

Failure Cases

Think about all failure and edge cases that your code may run into.

Operational Best Practices

Use multiple stages for testing

Test your changes in multiple stages such as QA, UAT, alpha, beta, gamma where you can properly bake and test your changes. The size of data and variation will increase with each stage so that you can test as close to the production data and environment as possible. The testing will include load and performance testing to detect any impact to latency and availability. Further, you can use canary testing to release changes to a subset of users or data centers so that you can compare impact of changes before rolling out to all customers. If you are using multiple regions for deployment, you can start with a smaller or low-risk region for initial production release.

Check Calendar before the release

You can check calendar for any holidays or major changes to the infrastructure that might impact the release.

Automate Automate Automate

Avoid any manual changes to the build/deployment process.

Infrastructure as a code

Apply best practices and tools from infrastructure as a code so that you are applying all changes consistently.

Error Logging

Use appropriate level for production logs and log all exceptions including stack traces. You can log additional input parameters but ensure they don’t include any sensitive data.

Deploy domain objects and data access code together

In order to avoid any schema mismatch errors, always deploy both domain object and data access changes together. In addition, you can ignore unknown properties in the model to make these changes forward compatible. If domain and data access changes cannot be released together, then release domain changes first and then release data access changes.
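
Ignoring unknown properties can be sketched with a dataclass that drops any keys it does not recognize. The `Customer` model, its fields and the `from_dict` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    id: str
    name: str
    email: str = ""  # newer field with a default, so older payloads still parse

    @classmethod
    def from_dict(cls, payload: dict) -> "Customer":
        """Build the model from a payload, ignoring properties this version does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in payload.items() if key in known})
```

With this approach, an older service version can read a record written by a newer version that has added a field, and a newer version can read older records as long as every added field has a default.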

Plan for Rollback

Always test the rollback of your changes, and include multiple scenarios if changes were released in staggered mode where domain changes and data access changes were published separately. You can implement automated rollbacks when releasing changes incrementally so that faulty changes are immediately blocked before going to the next stage of testing.

SLA/SLO

Monitor SLA/SLO at each stage of the release so that you can fix any violation in these metrics.

Change Management

Apply best practices from change management to track any high-risk or major changes to the production environment.

Security Policies

Apply zero-trust security practices and the principle of least privilege in all test and production environments.

Health Checks / Observability

Collect metrics for health checks, failure rates, usage, etc so that you can be immediately notified when you see a suspicious or a peculiar activity.

Chaos Testing

Apply chaos testing and game days to simulate various failure conditions to verify the behavior of your services and your reaction to those failures.

Integrate/Deploy often with small changes

Continuously integrate and deploy small changes to reduce the risk with major changes so that you can test your changes in production environment safely.

Feature Flags and Dark Launch

Use feature flags judiciously to launch features that are not yet available to all users so that you can test those changes for a subset of users.
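
One way to expose a flag to a stable subset of users is to hash the flag name and user id into a bucket from 0 to 99 and compare it with a rollout percentage. The `is_enabled` function and its parameters are hypothetical, and this sketch leaves out what a flag service would normally add, such as targeting rules and a kill switch.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket from 0 to 99 for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and the user, the same user sees the same behavior on every request, and raising the percentage only adds users without removing anyone who already had the feature.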

July 28, 2009

Cut the scope and make your life easy

Filed under: Project Management — admin @ 10:45 am

I have been developing software for over twenty years and in every project you have to grapple with iron triangle of schedule/cost/functionality or sometime referred to as cost/quality/schedule or cost/resources/schedule. In my experience, curtailing the scope produces better results than adding more resources or extending deadline. In addition, slashing the scope also produces other side effects such as reducing the complexity of the software, easier learning curve for users, less training/support cost and better communication among team members.

You can reduce the scope by focusing on essential features using the Pareto principle (80-20 rule); companies like Apple or 37Signals produce great products that are not only more useful but also much simpler to use. However, this is not easy, as the project manager or product owner has to say NO. Too often, I see project managers say YES to anything to please upper management and users. In the end, the team is overwhelmed and under stress. Also, a big pile of features where all features have the same importance (priority) is the biggest reason for death-march projects.

Working with a small number of features reduces complexity, whether essential complexity, cyclomatic complexity or accidental complexity, because your codebase is smaller. You still have to apply good software engineering principles such as domain driven design, unit testing, refactoring, etc., but maintenance becomes easier with a smaller codebase. When you have a small codebase you have fewer bugs, as there are no bugs in zero code. Fewer bugs means less support cost when some user complains of a bug or when the system crashes in the middle of the night.

With a small set of features, the user interface becomes simpler, which in turn provides better usability to the users. Often, I have seen users get confused when they have to work with complex software that has a lot of features. This is often remedied by providing training or adding support, which adds a lot more overhead to the projects. Again, a better user interface does not come free automatically with a small set of features, but the usability problem becomes easier with fewer features.

Finally, a small number of features and a small codebase mean your team size will remain small, so communication among team members becomes easier. I like to work with teams of 5 plus/minus 2, as the number of communication links grows quadratically (n(n-1)/2 for n members) when you add more members. Also, with smaller teams that are colocated, you have better osmotic communication that Alistair Cockburn talks about. At Amazon, we have “2-Pizza” teams, i.e., teams are small enough to have team lunch with just two pizzas. Another factor when building teams is whether they are cross functional (vertical) or focus on single expertise such as systems, database, UI, etc. I prefer working with cross functional teams that focus on a single service or an application, as communication and priorities within a single team are much easier to manage than between different teams.
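
To make the team-size argument concrete, the number of pairwise communication links in a team of n people is n(n-1)/2. The helper below is only a sketch of that arithmetic; the function name is made up for illustration.

```python
def communication_links(team_size: int) -> int:
    """Number of distinct pairs in a team, i.e. n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2


# Links for teams of 3, 5, 7 and 14 people.
links_by_size = {n: communication_links(n) for n in (3, 5, 7, 14)}
```

A team of 5 has 10 links and a team of 7 has 21, while doubling the team to 14 gives 91, more than four times as many.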

In a nutshell, reducing scope not only helps you deliver the software on time and delight your users but also prepares you better to maintain and support the software. Complexity is the number one killer of software and results in buggy and bloated products. You should watch out for “Wouldn’t it be cool if it did X?” kinds of feature requests; I often see developers treat these as a challenge or an opportunity to learn or apply a new technology. However, each new feature takes a toll on your existing features, software maintenance and your team.

February 23, 2009

Software Estimation

Filed under: Project Management,Technology — admin @ 6:04 pm

Software estimation is a difficult art that I am still learning despite developing software for more than twenty years. I have worked on a number of projects that started with some broad vision, and a manager asked me how many man-months it would take. You feel like a guy who is asked how long it will take to survey a cave without going inside (see Software Estimates and the Parable of the Cave). So, based on some initial requirements, you make up some numbers. But often that number turns into a commitment and a target date. This issue has also been raised in Software Estimation by Steve McConnell, Manage It by Johanna Rothman, Lean Software Development by Mary Poppendieck, and by a number of other people. So it must be made clear that your estimate is not the target date.

A project is always constrained by the iron triangle of schedule/cost/functionality, sometimes referred to as cost/quality/schedule or cost/resources/schedule, so it is crucial to find out what’s driving the project, as Johanna Rothman also suggests in her book Manage It. I have seen a number of cases where dates were arbitrarily picked, sometimes referred to as a “happy date”. At other times, though, dates may depend on a marketing campaign, seasons, tax time, the Olympics, etc. So you can negotiate between functionality and schedule based on what’s driving the project. Following are some of the techniques that I have found useful for estimation:

  • Get the vision and requirements straight – It’s important to be clear about the charter, constraints and requirements for the project, as any misdirection here would lead to disaster. Luke Hohmann in his book Beyond Software Architecture recommends starting with a good vision and mission statement. Johanna Rothman also recommends creating a project charter before starting the project.
  • Probabilistic estimation – Even though you are often pressured to produce more precise estimates that would be inaccurate, it is better to give an estimate with some probability attached. Both Johanna Rothman and Steve McConnell cite the cone of uncertainty, where your estimate becomes more accurate as the project progresses.

  • Based on best/worst/most-likely case – the following formula from Steve McConnell’s book can be used when estimates are more accurate:
expected_case = (best_case + (4 * most_likely) + worst_case) / 6

If estimates are not accurate, then Steve McConnell recommends

expected_case = (best_case + (3 * most_likely) + (2 * worst_case)) / 6

Bob Martin also gives a similar formula in his article PERT, CPM, and Agile Project Management:

Mean     = (best_case + worst_case + (4 * most_likely) ) / 6

Variance = ((worst_case - best_case) / 6) ^ 2
  • Iterative development – No matter whether you are working on a small or large project, the only way to bring some reality and feedback to the initial estimate is to develop iteratively, starting with the highest-valued features.
  • T-shirt based estimation – I find t-shirt based estimation useful when estimating with minimal information available. For example, you may have to estimate projects that you can deliver in Q1, Q2, etc., and you can sort them into small, medium and large and compare them against their business value.
  • Spiking – can also help in areas that are new to the team; spending a little time creating a walking skeleton or tracer bullet can give you some idea of the size of the effort for the project.
  • Delphi estimation – where the PM and team members each prepare a task list, assumptions and estimates in private and then review them together.
  • Divide and conquer/Decomposition/WBS – as with any large effort, breaking a project into smaller subsystems, components, services and tasks will help you estimate better. In general, errors in the estimates of smaller tasks tend to cancel each other out.
  • Estimate fine-grained tasks – I can rarely estimate with any accuracy for tasks that are longer than a few days, so it’s important to estimate only fine-grained tasks. XP has a concept of inch-pebbles and story points that can help in this case. The idea is that each task is either done or not done.
  • Planning poker – a technique from Agile Estimating and Planning by Mike Cohn, where each member of the team picks an estimate for a story based on Fibonacci numbers but doesn’t show it until everyone has selected a number. The members then agree on an average or may ask the members with the highest or lowest estimates to explain.
  • Historical data – though I rarely see PMs track estimates, tracking them can help future projects; new projects can use LOC, man-months, function points, number of services, files, interfaces and bugs from prior projects for estimation.
  • Schedule chicken – Kent Beck often talks about schedule chicken, where you have a meeting about who is on track and you hope there is someone who is behind so that you don’t have to admit you are behind as well. Integrity is a big part of XP and agile methodologies, which encourage transparency and honesty instead of schedule chicken.
  • Better to overestimate than underestimate – programmers often underestimate, and though there is a risk of student syndrome or Parkinson’s law, it’s better to overestimate.
  • Don’t question developer’s estimate – even though developers tend to underestimate, some managers still question them, which is not a good idea.
  • In XP or Scrum, you use story points, which can be ideal hours or based on some multiplier. These numbers generally follow the Fibonacci sequence, such as 1, 2, 3, 5, 8, 13, 21.
  • Function points use the number of external inputs/outputs/queries and internal logical files/external interface files, and they can be used as a unit of measurement similar to story points.
  • Estimating quality factor (EQF), as proposed by Tom DeMarco in his paper A Defined Process For Project Postmortem Review, can be used to check how accurate estimates are.
  • Include vacation, sick days and holidays as well as non-development activities such as testing, deployment, configuration, migration, etc. in your project plan.
  • Scheduling is all about ordering work so that the highest-value features come first. I find rolling-wave scheduling based on milestones useful when planning iterations.
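
The best/worst/most-likely formulas from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TaskEstimate` and `project_estimate` are hypothetical, the task numbers are made up, and summing the per-task variances assumes the tasks are independent of each other, which real projects only approximate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEstimate:
    """Three-point estimate for one task (any unit, e.g. days)."""
    best_case: float
    most_likely: float
    worst_case: float

    def expected(self) -> float:
        # Standard weighting: (best + 4 * most_likely + worst) / 6
        return (self.best_case + 4 * self.most_likely + self.worst_case) / 6

    def expected_pessimistic(self) -> float:
        # Alternative weighting that leans toward the worst case.
        return (self.best_case + 3 * self.most_likely + 2 * self.worst_case) / 6

    def variance(self) -> float:
        # Variance = ((worst - best) / 6) ^ 2
        return ((self.worst_case - self.best_case) / 6) ** 2


def project_estimate(tasks):
    """Return (mean, standard deviation), assuming independent tasks."""
    mean = sum(t.expected() for t in tasks)
    std_dev = math.sqrt(sum(t.variance() for t in tasks))
    return mean, std_dev


# Made-up task estimates in days, purely for illustration.
tasks = [TaskEstimate(2, 4, 9), TaskEstimate(1, 2, 6), TaskEstimate(3, 5, 13)]
mean, std_dev = project_estimate(tasks)
print(f"expected: {mean:.1f} days, standard deviation: {std_dev:.1f} days")
```

Reporting the result as a range, for example the mean plus or minus one or two standard deviations, keeps the estimate probabilistic instead of turning it into a single target date.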

Summary

I often find projects turn into death-march projects due to overly optimistic estimates and a “queen of denial” manager who treats developers’ estimates as commitments and refuses to accept reality. One way to overcome bad estimation is to adopt iterative development that delivers small features ordered by their value proposition, starting with those that create the biggest value for the business. Another way is to follow the advice of the Rational Unified Process, which uses risk management to tackle the highest-risk tasks first. Some managers are keen to accept more work than the team can handle in order to aim high, but it takes courage to say NO. In the end, under-promise and over-deliver, as it can save your credibility, not to mention unnecessary overtime and stress on the team.
