6 Lessons learned from Microservices Architecture failures
“Pick yourself up, learn from it and move on…”
I always tell myself this when I fail at something in software development.
Microservices architecture still seems like a silver bullet or a hype that every company should follow, but this approach has plenty of downsides, and I learned a few of them the hard way. This is a story that tries to explain what I bumped into during the last 4 years at my current company, Trendyol.
Data ownership
I guess one of the biggest problems to solve in microservices architecture is data ownership, and it creates all the noise between the domains and teams. Dependencies between teams based on data, deciding on data duplication versus a single source of truth, fat versus thin events, etc. are examples of the important questions that I ask during architecture design talks.
We had a few incidents due to this, actually. Once we changed the name of an attribute in one core catalog domain (something like from “color” to “colour”), it affected the other 3 domains and the mobile applications started to crash because there was a string comparison based on that specific “color” text. As each domain used that data, localized it and assigned a different meaning to it, we failed. We reverted the change on the catalog domain, applied the fix on the necessary applications and made sure no other domain assigns a specific meaning to that text.
Be careful about how you split the data, think extensively about ownership, think about data change events, work deeply on event-driven architecture, use thin or fat events based on your scenario, and think about race conditions.
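To make the thin/fat distinction concrete, here is a minimal sketch in Go. The event names and fields are hypothetical, not our actual catalog schema: a thin event only says what changed and consumers call the owning domain back for details, while a fat event carries the data itself so consumers need no extra call but are bound to a wider contract.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ThinProductEvent only says *what* changed; consumers must call the
// owning domain back to fetch the current state of the product.
type ThinProductEvent struct {
	EventType string `json:"eventType"` // e.g. "product.updated"
	ProductID string `json:"productId"`
}

// FatProductEvent carries the changed data itself, so consumers can react
// without an extra synchronous call -- at the cost of a bigger payload and
// a wider contract that is harder to change (remember "color" vs "colour").
type FatProductEvent struct {
	EventType  string            `json:"eventType"`
	ProductID  string            `json:"productId"`
	Name       string            `json:"name"`
	Attributes map[string]string `json:"attributes"`
}

func main() {
	thin, _ := json.Marshal(ThinProductEvent{EventType: "product.updated", ProductID: "42"})
	fat, _ := json.Marshal(FatProductEvent{
		EventType:  "product.updated",
		ProductID:  "42",
		Name:       "T-Shirt",
		Attributes: map[string]string{"color": "red"},
	})
	fmt.Println(string(thin))
	fmt.Println(string(fat))
}
```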
Race conditions, I call them the silent killers: something you may not be able to trace, a case where all the APIs returned 200 but the data is not correct.
While working with data in microservices, race conditions are your public enemy number one. Use versioning, check for concurrent document modifications, log data changes, anything you can think of for data validity. Our sellers send 150 million stock and price updates to our system daily (it will be 1 billion in a year, I guess); sometimes they send the same value for the same product, and sometimes they change the same product’s stock or price within milliseconds. While sellers update their inventory, other systems may also update the same data. So you should really consider race conditions while designing your systems.
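Here is a rough sketch of the versioning idea in Go. It is a generic compare-and-set loop over an in-memory map, not our Couchbase implementation (Couchbase exposes a similar concept via CAS): every write checks that the version it read is still the current one, and retries if another writer got there first.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// StockDoc is a hypothetical stock record guarded by a version number.
type StockDoc struct {
	ProductID string
	Quantity  int
	Version   int64
}

var ErrVersionConflict = errors.New("version conflict: document was modified concurrently")

// Store is an in-memory stand-in for a document store that supports
// compare-and-set semantics.
type Store struct {
	mu   sync.Mutex
	docs map[string]StockDoc
}

func (s *Store) Get(id string) (StockDoc, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	d, ok := s.docs[id]
	return d, ok
}

// Replace only writes the document if the caller read the latest version.
func (s *Store) Replace(doc StockDoc, expectedVersion int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current := s.docs[doc.ProductID]
	if current.Version != expectedVersion {
		return ErrVersionConflict // someone else updated the doc in between
	}
	doc.Version = expectedVersion + 1
	s.docs[doc.ProductID] = doc
	return nil
}

// UpdateQuantity retries the read-modify-write loop on conflicts.
func UpdateQuantity(s *Store, id string, delta int) error {
	for attempt := 0; attempt < 3; attempt++ {
		doc, _ := s.Get(id)
		doc.ProductID = id
		doc.Quantity += delta
		if err := s.Replace(doc, doc.Version); err == nil {
			return nil
		}
	}
	return ErrVersionConflict
}

func main() {
	s := &Store{docs: map[string]StockDoc{"42": {ProductID: "42", Quantity: 10}}}
	_ = UpdateQuantity(s, "42", -1)
	fmt.Println(s.docs["42"]) // {42 9 1}
}
```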
Sync vs Async communication
If there is sync communication between two APIs, they are heavily bound to each other in terms of scale, contract and so on. So any sync call should be considered twice and should be checked for failures, contract verification, timeouts, scale issues, etc. If any two APIs depend on each other in a sync way, this may become a pain point in the future. We always try to choose async communication between two systems, which helps us scale each domain independently. By the way, you should still consider data consistency if you choose sync communication.
As an incident example, our product invalidation system depended on sync calls when we first designed the architecture. When our catalog system had problems or returned 5xx or 4xx responses due to high request volume, our product detail system for buyers started flagging products as not sellable.
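A minimal sketch of guarding such a sync dependency in Go (the URL, status handling and types are illustrative, not our actual catalog client): the call gets a hard timeout, and an upstream failure is reported as “unknown” rather than being blindly translated into “not sellable”.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Sellability distinguishes "the catalog said no" from "the catalog did not answer".
type Sellability int

const (
	Sellable Sellability = iota
	NotSellable
	Unknown // upstream failed; keep the last known state instead of delisting
)

var client = &http.Client{Timeout: 2 * time.Second} // never wait forever on a sync call

func checkSellable(ctx context.Context, productID string) Sellability {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	// Hypothetical catalog endpoint, used only for illustration.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://catalog.internal/products/"+productID+"/sellable", nil)
	if err != nil {
		return Unknown
	}
	resp, err := client.Do(req)
	if err != nil {
		return Unknown // timeout or network error: do not flag the product as not sellable
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusOK:
		return Sellable
	case resp.StatusCode == http.StatusNotFound:
		return NotSellable
	default:
		return Unknown // 5xx / 4xx under load: treat as "we don't know", not as "delist"
	}
}

func main() {
	fmt.Println(checkSellable(context.Background(), "42"))
}
```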
Project planning involving multiple teams
Even if each domain team delivers its stories within its sprints, a project may take serious time if each team waits for the other teams to finish their part. For example, if there are 4 teams involved in a project and all the teams run 1-week sprints, it may take 4 weeks to see even a small project in production.
While planning projects, one option is to align teams with the business so that the whole project can be finished within a single sprint. You can have vertical full-stack teams to achieve this, which may include backend, frontend and mobile developers. You may need infrastructure like microfrontends, separated from other systems, to see the end result in production.
If you have different teams that are responsible for the same or similar domains, you can also rely on contracts, start working in parallel, and have each team deliver its own part based on those contracts. This will require a little more planning and more communication (preferably written), but it will also work if executed correctly.
We have had problems and lost a lot of time when we didn’t communicate enough or didn’t document the requirements from each team. Misunderstood contracts, wrongly named arguments and fields all caused problems, but we learned how to overcome those issues. Our last 3 or 4 big projects, including instant market, instant meal, the second-hand business and the international application, were all delivered on time, each within at most 2–3 months. But it really took some time to get used to this way of planning.
Check & monitor permissions, network and other things besides code
One of the domain APIs had been updated to a new API, and I was doing the update for that new API in our BFF. All test & stage checks were OK, everything seemed fine, so at around 00:30 I hit the deploy button, comfortable as hell, sipping my coffee and watching the logs. At 00:32 the error messages started to appear, telling me our BFF could not reach the domain’s new API in the prod environment.
I checked the new prod API from my computer; it was accessible and up. Then I checked our monitoring tool for the new deep backend API; everything seemed perfectly normal. The BFF was still logging errors and the mobile apps were not working properly. Then I logged in to the BFF pod, pinged the domain API, saw that it could not reach those servers and reverted the deployment. I figured out afterwards that it was due to network permissions I had never thought about.
So when you make async/sync calls between systems, your code may be perfectly fine, but make sure to check the network, the servers, the nodes of each cluster, the network between those nodes, the disks, etc. Your code may be working, but you should be aware of where your code is running, from the specific hardware all the way to the end user.
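A small sketch of the kind of check that would have caught this before traffic did (the hostnames are placeholders): at startup, try to open a TCP connection to every downstream dependency from inside the pod, because that is the only place where network policies and routes actually apply.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkReachable dials each dependency from where the code actually runs.
// A green check from your laptop or from a monitoring tool proves nothing
// about what this pod is allowed to reach.
func checkReachable(addrs []string) map[string]error {
	results := make(map[string]error)
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
		}
		results[addr] = err
	}
	return results
}

func main() {
	// Hypothetical downstream services of a BFF.
	deps := []string{
		"catalog-api.internal:443",
		"price-api.internal:443",
	}
	for addr, err := range checkReachable(deps) {
		if err != nil {
			fmt.Printf("CANNOT reach %s: %v\n", addr, err)
		} else {
			fmt.Printf("ok: %s\n", addr)
		}
	}
}
```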
Another incident we bumped into was around 2020. Our stock & price inventory system sits on top of Couchbase (we benefit heavily from CB; I think we may be one of its biggest users in the world). We started getting errors from the system and the response time increased during a campaign. We were unable to make reservations on products, which meant our basket service could not reserve stock for users, so buyers were unable to buy anything. We thought it was the code, the scale, the K8s, etc. Actually, it was the data that was causing the problem. There were too many reservations on one product (which was in the TV ads with a very good price, as I remember), and that made the document size climb; with tens of thousands of ops on the same document, a CB node started to misbehave. First, with some tests, we made sure the case was reproducible, then applied fixes and optimizations, and finally ran the tests again to verify it was no longer a problem. So you may never know what will happen until you bump into it. You should monitor and log as much as you can.
Another type of incident we had recently was about not monitoring API usage and the service map between applications correctly. If we change the URL of an API, for example, clients that use that API start getting errors. So in the middle of the day you start getting alarms from the notification systems and have to look for changes across the whole company to understand which change might have caused the issue. As we use istio and OpenTelemetry more and more, this will hopefully no longer be a problem. Using correct headers for correlation ids, client ids, etc. is a real life saver.
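Correlation ids are cheap to add. Here is a minimal sketch of an HTTP middleware in Go; the header name X-Correlation-Id is just a common convention (adjust it to whatever your tracing setup expects), and the handler is a toy one. The idea is that an incoming id is reused or a new one is generated, so a request can be followed across services and log lines.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

const correlationHeader = "X-Correlation-Id"

// withCorrelationID reuses the caller's correlation id if present,
// otherwise creates a new one, and echoes it back on the response so
// every hop (and every log line) can be tied to the same request.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		w.Header().Set(correlationHeader, id)
		log.Printf("correlationId=%s %s %s", id, r.Method, r.URL.Path)
		// When calling downstream APIs from this handler, copy the same
		// header onto the outgoing request so the chain stays connected.
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(mux)))
}
```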
Resilient & Robust BFFs & APIs
In 2018, I was responsible for the BFF APIs for our mobile applications. At that time they had (and still have) the highest throughput in the whole system during high-traffic campaigns. In the November 2018 campaigns, when the campaign started, the traffic began to increase. One of our deep backend APIs started to have trouble and its response time increased to 1 second.
As there was one BFF for all deep backend API interactions, and as we hadn’t separated the thread pools for each deep backend API’s requests, all threads were busy trying to get a response from that specific backend API. The BFF became unable to respond even to health check requests, so it failed and kept restarting. The whole app started crashing or popping up error messages.
We separated the pools, fixed timeouts, applied better circuit breaking, finally split the BFF into smaller ones based on domains, and switched to Golang for resource, performance and startup time efficiency.
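The core of that fix is not letting one slow backend consume every worker. Here is a rough sketch of the bulkhead idea in Go (the names, limits and timeouts are made up, not our production values): each downstream API gets its own client with its own timeout and its own concurrency budget, so a 1-second catalog API cannot starve calls to everything else.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Bulkhead wraps one downstream API with its own timeout and its own
// bounded pool of in-flight requests, isolated from other backends.
type Bulkhead struct {
	name   string
	slots  chan struct{} // capacity = max concurrent calls to this backend
	client *http.Client
}

func NewBulkhead(name string, maxConcurrent int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{
		name:   name,
		slots:  make(chan struct{}, maxConcurrent),
		client: &http.Client{Timeout: timeout},
	}
}

var ErrBulkheadFull = errors.New("bulkhead full: backend is saturated, fail fast")

// Get fails fast instead of queueing when the backend's slots are exhausted,
// so the BFF keeps answering health checks and other domains' requests.
func (b *Bulkhead) Get(ctx context.Context, url string) (*http.Response, error) {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
	default:
		return nil, fmt.Errorf("%s: %w", b.name, ErrBulkheadFull)
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return b.client.Do(req)
}

func main() {
	// Hypothetical per-domain budgets: a slow catalog API can only occupy
	// its own 50 slots and never the slots reserved for the price API.
	catalog := NewBulkhead("catalog", 50, 800*time.Millisecond)
	price := NewBulkhead("price", 100, 300*time.Millisecond)
	fmt.Println(cap(catalog.slots), cap(price.slots))
}
```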
For the BFF architecture, and when multiple teams work on a single client system like a desktop app or a mobile app, microfrontend architectures or Zuul-like API gateways are very handy. Once correctly configured in terms of CI/CD, deployment, dependencies, etc., each team can deliver business value very quickly.
Test for every single goddamn thing
I think the process of delivering business value is really important if you don’t want to have issues. In Trendyol we have over 90 domain teams, maybe 1000 APIs and tens of thousands of pods running on maybe 100 K8s clusters using 1000 data sources, I don’t know exactly. We ship something every day; maybe 50–100 production changes are made daily. So anything can change at any time, which means we should test every scenario, hopefully with automation, and do chaos testing.
The CI/CD pipeline and how a team delivers a feature are the most important indicators of the seniority and maturity of a team. Any domain team should be able to deliver easily, quickly and as often as they can without causing any problems, and revert easily in case of a problem. All teams should focus on lead time and try to minimize it.
What I should focus more on seems to be chaos testing, I guess. As everything fails exactly when you don’t think it will, you should be ready for those times and have scenarios accordingly. What happens if my data source fails? What happens if my messaging queue fails? What happens if I get high response times or error responses from the systems I depend on? Will my data be lost? Can the business survive without my system? Do we have feature toggles for those times? Can we do a reindex if something happens? You can generate more and more questions based on your domain, but I guess being ready for those times is important while working on microservices architecture.
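One concrete way to answer “can the business survive without my system?” is to decide the degradation path up front. A toy sketch in Go (the toggle, the recommendation call and the fallback list are all made up; in practice the flag would come from a config or feature-flag service):

```go
package main

import (
	"errors"
	"fmt"
)

// In practice this flag would come from a feature-flag/config service and
// be flipped by an operator during an incident; a variable keeps the sketch small.
var recommendationsEnabled = true

var errUpstreamDown = errors.New("recommendation service unavailable")

func fetchRecommendations(productID string) ([]string, error) {
	// Imagine a real call to a recommendation system here.
	return nil, errUpstreamDown
}

// recommendationsFor degrades gracefully: if the toggle is off or the
// upstream fails, fall back to a static list instead of failing the page.
func recommendationsFor(productID string) []string {
	if recommendationsEnabled {
		if recs, err := fetchRecommendations(productID); err == nil {
			return recs
		}
	}
	return []string{"bestseller-1", "bestseller-2"} // safe default
}

func main() {
	fmt.Println(recommendationsFor("42"))
}
```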
Three years ago, on a sunny Saturday, our storage system went down. We couldn’t believe that all our data was inaccessible, and we were not ready for that kind of problem. After that incident, we initiated our multi-datacenter project so that Trendyol can run on different data centers, minimizing the risk of something similar happening again. We also tried to improve our data flow processes so that data migrations or reindexes can be done more quickly in case a problem occurs. We try to keep systems loosely coupled so that there are not many dependencies between domains and tribes.
With microservices architecture, systems will depend on other systems, and those ‘contracts’ should be tested in the CI/CD pipeline. Contract testing makes sure the dependent systems are able to work together by checking whether the provider and consumer systems are obeying the contract. We have had numerous problems due to the absence of contract testing in the past (maybe still :( ), so we are asking provider/consumer systems to set up contract testing infrastructure as much as possible. We have had incidents caused by contract changes such as providing null values for non-nullable fields, changed data types, deleted endpoints, etc.
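A very small version of what a consumer-side contract check can look like in Go, hand-rolled with the standard library (tools like Pact formalize the same idea, and the fields here are hypothetical): the test fails the build if the provider response drops a field the consumer relies on, makes it null, or changes its type.

```go
package contract_test

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// productResponse is the consumer's view of the contract: only the fields it
// actually relies on, with pointers so a missing or null field is detectable.
type productResponse struct {
	ID    *string  `json:"id"`
	Name  *string  `json:"name"`
	Price *float64 `json:"price"`
}

func TestProductContract(t *testing.T) {
	// Stand-in for the provider; in a real pipeline this would point at the
	// provider's verification environment or a recorded provider response.
	provider := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"id":"42","name":"T-Shirt","price":9.99}`))
	}))
	defer provider.Close()

	resp, err := http.Get(provider.URL + "/products/42")
	if err != nil {
		t.Fatalf("provider unreachable: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}

	var got productResponse
	if err := json.NewDecoder(resp.Body).Decode(&got); err != nil {
		t.Fatalf("response no longer matches expected types: %v", err)
	}
	// Deleted fields and null values in non-nullable fields both show up as nil here.
	if got.ID == nil || got.Name == nil || got.Price == nil {
		t.Fatalf("contract broken: required field missing or null: %+v", got)
	}
}
```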
In Trendyol, our biggest challenge is scale. Every year the throughput, the load and the number of items in our data sources increase madly. We had 25M products last year; it is 250M nowadays. 150M stock & price updates occur every single fricking day. Load testing is very important and has become a habit for us; we even created our own load testing framework to run those tests. So load testing is very important for every new API and should also be done periodically.
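Our framework is internal, but the core of a load test is simple enough to sketch with goroutines (the target URL, worker count and duration below are placeholders): fire a fixed number of concurrent workers at an endpoint for a while and count successes, failures and average latency.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		target   = "http://localhost:8080/ping" // placeholder endpoint
		workers  = 50
		duration = 10 * time.Second
	)

	var ok, failed int64
	var totalLatency int64 // nanoseconds, summed across successful requests
	client := &http.Client{Timeout: 2 * time.Second}
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				start := time.Now()
				resp, err := client.Get(target)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
					atomic.AddInt64(&totalLatency, int64(time.Since(start)))
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()

	fmt.Printf("ok=%d failed=%d\n", ok, failed)
	if ok > 0 {
		fmt.Printf("avg latency=%s\n", time.Duration(totalLatency/ok))
	}
}
```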
TL;DR
My incident track record is still quite famous in Trendyol, but the list above may hopefully help you avoid bumping into the same ones. Always be ready for failures, and make sure to monitor everything, including business metrics, so that you have full control over your data, applications, systems, everything…
Cheers.