# Azure VMSS Scaling with Outdated Code
## Summary
When an Azure VM Scale Set (VMSS) scales out and creates a new VM instance (VM3), that instance is expected to run the latest code from the `DEV` branch, which the Release Pipeline deploys. Instead, VM3 boots from an older base image built from an earlier commit, so it runs outdated code. The Load Balancer (LB) routes traffic to VM3 before pipeline tasks such as package installation complete, so users reach legacy functionality. This violates the principle of **“Zero Downtime Deployments”** and results in a poor customer experience and potential regressions.
## Root Cause
- **Pipeline Execution Timing**: The Release Pipeline was configured with the `--remove-vm` flag to trigger VMSS scale-out events but lacked integration with the LB health checks. New VMs like VM3 are added to the LB pool immediately post-deployment, before pipeline tasks finalize.
- **Automatic Scale-Set Behavior**: VMSS scales out by creating new instances from the **base image** specified during deployment. That image does not reflect pipeline-managed code changes unless it is rebuilt.
- **Missing Automation**: No automation existed to rebuild the base image or trigger post-deployment tasks (e.g., code deployment) for new VM instances.
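The missing automation could start with something like the sketch below, which repoints the scale set at a freshly built image so that future scale-out instances boot with current code. It assumes the pipeline has already produced a new image and knows its resource ID; the function name `update_vmss_image` and its arguments are hypothetical. `az vmss update` changes the scale set model only, so instances that already exist still need to be upgraded according to the scale set's upgrade policy.

```shell
#!/usr/bin/env bash
# Sketch only: update_vmss_image is a hypothetical helper, and the image ID
# must come from whatever image build step the pipeline uses.

# update_vmss_image RESOURCE_GROUP VMSS_NAME IMAGE_ID
update_vmss_image() {
  local rg="$1" vmss="$2" image_id="$3"
  if [ -z "$rg" ] || [ -z "$vmss" ] || [ -z "$image_id" ]; then
    echo "usage: update_vmss_image RESOURCE_GROUP VMSS_NAME IMAGE_ID" >&2
    return 2
  fi
  # Change the image reference in the scale set model; new instances created
  # by scale-out will boot from this image.
  az vmss update \
    --resource-group "$rg" \
    --name "$vmss" \
    --set "virtualMachineProfile.storageProfile.imageReference.id=$image_id"
}
```

A pipeline step could call it as `update_vmss_image "Prod" "WebAppVMSS" "$NEW_IMAGE_ID"`, where `NEW_IMAGE_ID` is an assumed variable set by the image build step.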
## Why This Happens in Real Systems
- **Asynchronous Pipeline Execution**: Scale-out events and pipeline runs are independent. Autoscale adds an instance as soon as capacity is needed and does not wait for any pipeline, so the LB can route traffic before pipeline work completes.
- **Legacy Deployment Patterns**: Teams often rely on base images to carry the code, assuming post-deployment tasks will bring each instance up to date. When these tasks fail or are skipped, new instances keep running the stale image.
- **LB Health Check Delays**: Instances are marked healthy as soon as they pass OS-level probes (for example, a port check). Such probes say nothing about application-level readiness, such as completed pipeline tasks or database connectivity.
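To illustrate the difference between an OS-level probe and an application-level one, the sketch below creates an HTTP probe on an Azure Load Balancer. The helper name `create_http_probe` and every argument value are hypothetical, and it assumes the application exposes an HTTP health path that only returns 200 once the instance is actually ready to serve.

```shell
#!/usr/bin/env bash
# Sketch only: create_http_probe is a hypothetical wrapper; the load balancer
# name, probe name, port, and path are supplied by the caller.

# create_http_probe RESOURCE_GROUP LB_NAME PROBE_NAME PORT REQUEST_PATH
create_http_probe() {
  local rg="$1" lb="$2" probe="$3" port="$4" probe_path="$5"
  # An Http probe marks the instance healthy only when the endpoint answers
  # with HTTP 200, unlike a Tcp probe that only checks that the port is open.
  az network lb probe create \
    --resource-group "$rg" \
    --lb-name "$lb" \
    --name "$probe" \
    --protocol Http \
    --port "$port" \
    --path "$probe_path"
}
```

For example, `create_http_probe "Prod" "WebAppLB" "app-health" 80 "/healthz"`, where `WebAppLB` and `app-health` are made-up names. The probe still has to be referenced by the load-balancing rule before it affects routing.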
## Real-World Impact
- **User-Facing Errors**: Users encountered “404 Not Found” errors and API timeouts due to mismatched database schemas between VM3 and VM1/VM2.
- **Reputation Risk**: The incident was logged in production monitoring tools, prompting leadership to fast-track a root-cause analysis.
- **Operational Debt**: Engineers spent 8+ hours rolling back the deployment and manually patching VM instances.
## Example or Code
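```bash
# Pipeline command as recorded for this incident. Treat it as illustrative:
# it may not correspond to a real `az vmss` subcommand, and nothing in it
# rebuilds or updates the base image.
az vmss deployment start \
  --resource-group "Prod" \
  --name "WebAppVMSS" \
  --deployment-name "UpdateInstance" \
  --tags "Environment=Prod" \
  --remove-vm  # Triggers scale-out without a base image update
```
## How Senior Engineers Fix It
- **Rebuild Base Image on Pipeline Trigger**: Modify the pipeline to rebuild and redeploy the base image (e.g., a custom VM image in Azure) whenever code changes are detected in the target branch.
- **Post-Deployment Tasks**: Add pipeline steps that execute once per VMSS instance, such as:
```yaml
- task: AzureCLI@2
  inputs:
    azureSubscription: "<service-connection>"  # placeholder, not in the original
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Run the deployment script on every instance. The script path is
      # hypothetical; "DeployCodeV3" is the name used in the original step.
      for id in $(az vmss list-instances -g "Prod" -n "WebAppVMSS" --query "[].instanceId" -o tsv); do
        az vmss run-command invoke \
          -g "Prod" \
          -n "WebAppVMSS" \
          --command-id RunShellScript \
          --instance-id "$id" \
          --scripts "/opt/deploy/DeployCodeV3.sh"
      done
```
- **LB Health Checks**: Configure the LB to validate application-level health (e.g., a `/healthz` endpoint) before routing traffic. Delay traffic to new instances until pipeline tasks and health checks pass.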
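Delaying traffic until an instance is truly current can be sketched as a small readiness gate. The version below assumes the health endpoint returns the deployed commit SHA as plain text, which the original incident does not confirm; `fetch_version`, `is_instance_ready`, and `wait_until_ready` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the health endpoint body is the deployed commit SHA.

# fetch_version URL: print the endpoint body, or nothing if the request fails.
fetch_version() {
  curl --silent --fail --max-time 5 "$1" || true
}

# is_instance_ready URL EXPECTED_SHA: succeed only when the instance reports
# exactly the commit the pipeline just released.
is_instance_ready() {
  local url="$1" expected="$2" actual
  actual="$(fetch_version "$url")"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# wait_until_ready URL EXPECTED_SHA ATTEMPTS DELAY_SECONDS: poll until the
# instance is ready, or fail after ATTEMPTS tries.
wait_until_ready() {
  local url="$1" expected="$2" attempts="$3" delay="$4" i=0
  while [ "$i" -lt "$attempts" ]; do
    if is_instance_ready "$url" "$expected"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A pipeline could call `wait_until_ready "http://<instance-ip>/healthz" "$RELEASE_SHA" 30 10` before allowing the instance to receive traffic, with `RELEASE_SHA` standing in for whatever variable holds the released commit. Comparing versions rather than just checking for a 200 response is what catches an instance that is up but still running the old image.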
## Why Juniors Miss It
Juniors often overlook these critical factors:
- **Overlooking Pipeline-as-Code**: They fail to link base image updates to pipeline triggers, assuming manual redeployments are always required.
- **Ignoring Scale-Set Nuances**: They assume all VMSS features (scaling, instance management) are handled transparently, missing post-deployment automation needs.
- **Misinterpreting LB Behavior**: They deploy health checks but rely solely on OS-level probes instead of validating application readiness.