Migrating 70+ Microservices to Azure Kubernetes Service — A Platform Engineer's Playbook

From on-premises to cloud-native at enterprise scale

Posted by Saurabh Chaubey on Saturday, September 20, 2025

Migrating a handful of microservices to Kubernetes is one thing. Migrating over seventy — with their interconnected dependencies, legacy configurations, and the weight of years of organic growth — is an entirely different challenge. This is the story of how our platform engineering team planned and executed a large-scale migration from on-premises infrastructure to Azure Kubernetes Service (AKS) at a major enterprise insurance company, and the lessons we learned along the way.

The Starting Point

Our on-premises estate was a product of evolution rather than design. Over the years, the organisation had built a microservices architecture that ran across a fleet of virtual machines managed by a combination of Ansible scripts and manual processes. Deployments involved SSH-ing into machines, running shell scripts, and hoping that the environment variables were configured correctly. Some services had automated deployments through Jenkins; many did not.

The infrastructure worked, but it was showing its age. Scaling was slow and manual. Environment parity between development, staging, and production was aspirational at best. Deploying a new service meant provisioning VMs, configuring load balancers, setting up monitoring — a process that could take two weeks or more. And troubleshooting issues often came down to “it works on my machine” because the environments were genuinely different.

The decision to move to Kubernetes wasn’t made lightly. It came after months of evaluation, proof-of-concept work, and building the business case. We chose AKS specifically because the organisation was already invested in the Azure ecosystem, and AKS offered a managed control plane that reduced our operational burden. We weren’t interested in running our own Kubernetes clusters — we wanted to focus on what ran on the clusters, not the clusters themselves.

Assessment and Planning

Before migrating a single service, we spent three months on assessment and planning. This phase was unglamorous but absolutely essential.

Service Inventory

The first task was understanding what we actually had. It sounds obvious, but in a large organisation, the answer to “how many microservices do we have?” is surprisingly hard to pin down. Services had been created by different teams over several years, documentation was inconsistent, and some services were running but effectively abandoned — still consuming resources but no longer actively maintained.

We built a comprehensive inventory that captured each service’s technology stack, resource requirements, dependencies, data persistence needs, and criticality. We categorised services into tiers: Tier 1 (business-critical, high-traffic), Tier 2 (important but not customer-facing), and Tier 3 (internal tools and batch jobs).

Containerisation Readiness

Not all services were equally ready for containerisation. Most were Mulesoft applications and APIs, and these needed particular care: Mule runtimes have specific memory and configuration requirements, and many of our APIs depended on shared domains, custom connectors, and environment-specific property files. We also had a few legacy Node.js apps with hardcoded file paths, and batch processing jobs that assumed access to network file shares.

For each service, we assessed containerisation complexity on a simple scale: low (already follows 12-factor patterns), medium (needs configuration externalisation), and high (requires code changes or architectural modifications). About 60% fell into the low category, 30% were medium, and 10% were high — with many of the Mulesoft APIs falling into the medium category due to their configuration and connector dependencies.

Migration Strategy: Lift-and-Shift vs Re-Platform

We made a pragmatic decision early on: this migration was about getting services onto Kubernetes, not about rewriting them. We adopted a “lift-and-shift with guardrails” approach. Services would be containerised as-is wherever possible, with modifications limited to what was necessary for container compatibility — externalising configuration, removing filesystem dependencies, and ensuring graceful shutdown handling.

The “guardrails” part meant that while we weren’t rewriting services, we were establishing standards. Every migrated service would have health check endpoints, structured logging, Dynatrace monitoring integration, and a standardised Helm chart. If a service didn’t have these, we’d add them as part of the migration — but we wouldn’t refactor the business logic.
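To make the guardrails concrete, the shape of a migrated Deployment looked roughly like the sketch below. The service name, image registry, probe paths, and resource figures are all illustrative, not taken from a real chart:

```yaml
# Illustrative sketch of the per-service standards ("claims-api" is a
# hypothetical service name, not one of our real workloads).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claims-api
  labels:
    app: claims-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: claims-api
  template:
    metadata:
      labels:
        app: claims-api
    spec:
      # Give in-flight requests time to drain before the pod is killed.
      terminationGracePeriodSeconds: 30
      containers:
        - name: claims-api
          image: myregistry.azurecr.io/claims-api:1.0.0
          ports:
            - containerPort: 8080
          # Externalised configuration instead of baked-in property files
          # (the referenced ConfigMap is created alongside the Deployment).
          envFrom:
            - configMapRef:
                name: claims-api-config
          # The health check endpoints every migrated service had to expose.
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
```

Because these fields lived in the base Helm chart rather than in each service's repository, a service that opted into the chart got the standards for free.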

This approach let us move fast. Trying to modernise seventy services simultaneously would have turned a migration into a multi-year rewrite programme.

Networking Challenges

Networking was, predictably, one of the hardest aspects of the migration. Our on-premises services communicated over a flat network with relatively simple firewall rules. Moving to AKS introduced a new networking model that we had to design carefully.

Cluster Networking

We chose Azure CNI Overlay networking for our AKS clusters. With CNI Overlay, pods receive IP addresses from a private CIDR overlay network rather than directly from the VNet subnet. This was a significant advantage for our scale — with 70+ services, multiple replicas per service, and several environments (dev, staging, production), CNI Overlay eliminated the IP exhaustion concerns that come with traditional Azure CNI, where every pod consumes a VNet IP address. Our subnet sizing could remain manageable while supporting significant pod density.

The Azure landing zone was connected to the organisation’s on-premises network via ExpressRoute, providing a private, high-bandwidth, low-latency connection to on-premises databases and legacy systems that weren’t migrating to the cloud. This was critical because many of our Mulesoft APIs needed to communicate with backend systems that remained on-premises.

All traffic between the cloud environment and the on-premises network was secured through Azure Firewalls, which provided network-level filtering, threat intelligence, and centralised logging of all cross-boundary traffic. The firewall rules were managed as code and reviewed as part of our change management process, ensuring that connectivity changes were auditable and controlled.

Service Mesh Considerations

We evaluated Istio and Linkerd for service mesh capabilities but ultimately decided against introducing a service mesh during the initial migration. The reasoning was simple: the migration itself was complex enough without adding another layer of infrastructure to learn and operate. We used Kubernetes-native services and ingress controllers for traffic management, with the option to introduce a service mesh later once the migration was stable.
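In practice, "Kubernetes-native traffic management" meant a plain ClusterIP Service fronted by an Ingress rule, along these lines (hostnames, the ingress class, and the service name are illustrative):

```yaml
# Hypothetical example of mesh-free traffic management: a ClusterIP
# Service for intra-cluster calls, plus an Ingress for external traffic.
apiVersion: v1
kind: Service
metadata:
  name: claims-api
spec:
  selector:
    app: claims-api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: claims-api
spec:
  ingressClassName: nginx
  rules:
    - host: claims-api.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: claims-api
                port:
                  number: 80
```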

This decision was controversial within the team, but I stand by it. You can always add complexity later; removing it is much harder.

DNS and Service Discovery

On-premises, services discovered each other through a combination of environment variables, configuration files, and in some cases, hardcoded IP addresses. In Kubernetes, we used internal DNS (service names) for intra-cluster communication and configured external-dns for services that needed to be reachable from outside the cluster.
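For the externally reachable services, external-dns watches for a hostname annotation and creates the corresponding DNS record in the configured zone. A minimal sketch, with an illustrative service name and domain:

```yaml
# external-dns reads the hostname annotation below and manages the DNS
# record automatically; the domain and service name are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: partner-gateway
  annotations:
    external-dns.alpha.kubernetes.io/hostname: partner-gateway.example.com
spec:
  type: LoadBalancer
  selector:
    app: partner-gateway
  ports:
    - port: 443
      targetPort: 8443
```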

Migrating DNS was one of those tasks that sounds trivial but consumed weeks of effort. Every service had to be updated to use the new endpoints, and we had to maintain backward compatibility during the transition period when some services were on-premises and others were in AKS.

Secrets Management

Secrets management was a critical workstream. On-premises, secrets were stored in a mix of places — environment variables baked into VM images, files on shared drives, and entries in a legacy vault solution. The migration was an opportunity to consolidate everything into Azure Key Vault.

We implemented the Azure Key Vault CSI driver for Kubernetes, which mounts secrets from Key Vault directly into pods as files or environment variables. Each service namespace had its own Key Vault access policy, following the principle of least privilege.
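Wiring a service to the CSI driver took two pieces: a SecretProviderClass describing which Key Vault objects to pull, and a CSI volume in the pod spec referencing it. A hedged sketch, with the vault name, tenant ID, and secret names as placeholders (the identity configuration used to authenticate to Key Vault is omitted for brevity):

```yaml
# Which secrets to fetch, and from which vault (placeholder values).
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: claims-api-secrets
  namespace: claims-api
spec:
  provider: azure
  parameters:
    keyvaultName: "kv-claims-api-prod"
    tenantId: "00000000-0000-0000-0000-000000000000"
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
---
# The pod mounts the secrets as files via the CSI driver.
apiVersion: v1
kind: Pod
metadata:
  name: claims-api
  namespace: claims-api
spec:
  containers:
    - name: claims-api
      image: myregistry.azurecr.io/claims-api:1.0.0
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: claims-api-secrets
```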

The migration of secrets was painstaking. For each service, we had to identify every secret it used, verify the values, create corresponding entries in Key Vault, and update the service configuration to reference the vault. We built automation to help — a Python script that scanned service configurations for secret references and generated the Key Vault entries — but it still required manual verification for each service.

One lesson we learned the hard way: audit your secrets before migrating them. We discovered several services using secrets that were years old and no longer valid, and a few cases where multiple services shared the same credential (a security anti-pattern we cleaned up during the migration).

Observability

Moving to Kubernetes gave us an opportunity to standardise our observability stack. On-premises, monitoring was fragmented — some services used Sumo Logic, others used custom logging solutions, and a few had no meaningful monitoring at all.

For the AKS environment, we adopted Dynatrace as our single-pane-of-glass observability platform. We deployed the Dynatrace Operator on our AKS clusters, which provided automatic instrumentation of pods with OneAgent, enabling deep visibility without requiring code changes in individual services.

Dynatrace gave us a unified platform covering all observability pillars:

  • Metrics: Automatic collection of infrastructure and application metrics, including Kubernetes cluster health, pod resource utilisation, and custom application metrics — all surfaced through Dynatrace dashboards
  • Logging: Centralised log ingestion from all AKS workloads, with full-text search, log analytics, and correlation with traces and metrics
  • Distributed Tracing: Automatic end-to-end trace capture across our Mulesoft APIs and supporting services, with AI-powered root cause analysis through Dynatrace’s Davis AI engine
  • Alerting: Dynatrace’s anomaly detection and alerting removed the need for manually configured thresholds, automatically baselining service behaviour and alerting on deviations

Every migrated service was automatically instrumented by the Dynatrace Operator, giving teams immediate visibility into their service’s health post-migration. Dynatrace dashboards were configured for each service, providing a comprehensive view of performance, dependencies, and error rates from day one.
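The operator itself is configured through a DynaKube custom resource. A minimal sketch is shown below; the exact fields depend on the operator version, and the environment URL and resource name are placeholders (the API tokens live in a separate Secret, not shown):

```yaml
# Hedged sketch of a DynaKube resource enabling cluster-wide
# instrumentation (field names vary by operator version).
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: aks-prod
  namespace: dynatrace
spec:
  apiUrl: https://abc12345.live.dynatrace.com/api
  oneAgent:
    # Full-stack instrumentation of every node, so new pods are picked
    # up automatically without per-service code changes.
    cloudNativeFullStack: {}
```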

CI/CD Pipeline Modernisation

The migration was also the forcing function for modernising our CI/CD pipelines. We migrated from Jenkins to Azure DevOps Pipelines, taking advantage of the tighter integration with AKS and Azure Container Registry (ACR).

Our standardised pipeline followed this flow:

  1. Build: Compile the application and run unit tests
  2. Scan: Static code analysis (SonarQube) and container image scanning (Trivy)
  3. Containerise: Build the Docker image and push to ACR
  4. Deploy to Dev: Automatic deployment via Helm
  5. Integration Tests: Run automated integration tests against the dev environment
  6. Deploy to Staging: Manual approval gate, then Helm deployment
  7. Deploy to Production: Manual approval gate with change management integration
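Those stages map onto a condensed azure-pipelines.yml roughly like the sketch below. The service name, service connection, and build commands are placeholders, and the staging and production approval gates are configured on the Azure DevOps environments rather than in the YAML itself:

```yaml
trigger:
  branches:
    include: [main]

variables:
  imageRepo: claims-api             # hypothetical service name

stages:
  - stage: BuildAndScan
    jobs:
      - job: build
        steps:
          - script: ./build.sh      # stack-specific build + unit tests
            displayName: Build and unit test
          - script: sonar-scanner   # static code analysis
            displayName: Static analysis
          - task: Docker@2
            displayName: Build image
            inputs:
              command: build
              repository: $(imageRepo)
              containerRegistry: acr-connection   # placeholder connection
          - script: trivy image --exit-code 1 --severity HIGH,CRITICAL $(imageRepo):$(Build.BuildId)
            displayName: Image scan
          - task: Docker@2
            displayName: Push to ACR
            inputs:
              command: push
              repository: $(imageRepo)
              containerRegistry: acr-connection

  - stage: DeployDev
    dependsOn: BuildAndScan
    jobs:
      - deployment: dev
        environment: dev            # no approval gate on dev
        strategy:
          runOnce:
            deploy:
              steps:
                - script: helm upgrade --install $(imageRepo) ./chart -f values-dev.yaml
  # Staging and production stages follow the same shape, gated by
  # approvals on their Azure DevOps environments.
```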

We used Helm charts for deployments, with a base chart that encoded our standards (resource limits, health checks, pod disruption budgets, network policies) and service-specific value files for customisation.
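A service's value file then only overrode what it needed, with the base chart supplying defaults for everything else. An illustrative (hypothetical) values.yaml:

```yaml
# Hypothetical per-service overrides layered on the shared base chart;
# key names and figures are illustrative, not from a real chart.
image:
  repository: myregistry.azurecr.io/claims-api
  tag: "1.4.2"

replicaCount: 3

resources:
  requests: { cpu: 250m, memory: 512Mi }
  limits: { memory: 1Gi }

healthChecks:
  liveness: /health/live
  readiness: /health/ready

podDisruptionBudget:
  minAvailable: 2          # keep capacity during node drains

networkPolicy:
  allowFrom:
    - namespace: ingress   # only the ingress controller may reach the pods
```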

The pipeline standardisation had an unexpected benefit: it became much easier to enforce security and compliance requirements. Instead of checking each service’s bespoke pipeline, we could update the base pipeline template and have the changes propagate to all services.

The Rollout Approach

We didn’t migrate all seventy services at once. We used a phased approach:

Phase 1 — Pathfinder (2 services, 4 weeks): We chose two low-risk, Tier 3 services as pathfinders. The goal was to validate our migration process, identify gaps in our tooling, and build confidence. These first two services took disproportionately long because we were building the foundation — the base Helm chart, the pipeline templates, the networking configuration, and the operational runbooks.

Phase 2 — Proof (8 services, 6 weeks): We expanded to eight services across different teams and technology stacks. This phase validated that our approach worked beyond the platform team’s own services. It also surfaced issues with team-specific configurations and undocumented dependencies.

Phase 3 — Scale (30 services, 8 weeks): With the process proven, we parallelised. Multiple teams migrated their services concurrently, with the platform team providing support and guidance. We ran migration workshops, created detailed runbooks, and established a dedicated Teams channel for migration questions.

Phase 4 — Complete (remaining services, 10 weeks): The final phase tackled the harder services — the ones with complex dependencies, legacy code, or high criticality. These required more hand-holding and in some cases, code changes to achieve container compatibility.

Throughout the rollout, we ran services in parallel — traffic flowing to both on-premises and AKS instances — before cutting over. This gave us a safety net and the ability to roll back quickly if issues arose.

Results

The migration took approximately seven months from the first planning session to the last service cutover. Here’s what we achieved:

  • Deployment frequency: Increased from an average of once per fortnight to multiple times per week
  • Environment provisioning: Reduced from 2 weeks to under 1 hour
  • Incident response time: Improved by approximately 40%, thanks to standardised observability
  • Resource utilisation: Improved by roughly 35% due to Kubernetes bin-packing and right-sizing
  • Cost: Initial cloud costs were higher than on-premises (as expected), but the operational efficiency gains and reduced manual toil more than compensated for the difference

Lessons Learned

Invest disproportionately in the first few services. The foundation you build during the pathfinder phase determines the speed of everything that follows. Don’t rush it.

Networking will take longer than you think. Every networking assumption you made on-premises will be challenged in the cloud. Start the networking workstream early and involve your network team from day one.

Don’t underestimate the people side. Migrating to Kubernetes isn’t just a technology change — it’s a skills change. We ran Kubernetes training sessions for development teams and created a certification pathway. Developers who understood the platform were significantly more effective at troubleshooting and optimising their services.

Automate the boring stuff. The migration involved a huge amount of repetitive work — creating namespaces, configuring RBAC, setting up monitoring. Every task we automated freed up time for the harder problems.

Keep the old infrastructure running longer than you’d like. We maintained on-premises environments for three months after the migration completed. It felt wasteful, but it was invaluable for debugging issues and providing a rollback path. The cost of keeping the lights on was far less than the cost of a botched migration.

Perfect is the enemy of migrated. Some services migrated with known imperfections — suboptimal resource limits, missing metrics, incomplete documentation. We tracked these as tech debt and addressed them post-migration. Trying to make everything perfect before cutover would have delayed the migration by months.

Looking Back

If I had to do this migration again, I’d change two things. First, I’d invest in a migration tracking dashboard from day one. We used spreadsheets initially, and they became unwieldy quickly. A proper dashboard with per-service status, blocking issues, and timeline tracking would have improved coordination significantly.

Second, I’d start the secrets migration earlier. It was on the critical path for almost every service, and delays in secrets provisioning repeatedly blocked other workstreams.

But overall, the migration was a success. Our services are more reliable, our deployments are faster, and our platform team can focus on building capabilities rather than babysitting infrastructure. Moving to AKS wasn’t just a lift-and-shift — it was the foundation for the next generation of our platform.