Oliver Gould About Architecting to Avoid and Recover from Failure


In this week’s podcast, Robert Blumen talks to Oliver Gould at QCon San Francisco 2016. Oliver is the CTO of Buoyant, where he leads open source development efforts. Prior to Buoyant he was a Staff Infrastructure Engineer at Twitter, where he was technical lead on the Observability, Traffic, Configuration and Co-ordination teams.

Why listen to this podcast:

- Stratification allows applications to own their logic while libraries take care of the different mechanisms, such as service discovery and load balancing
- Cascading failures can’t be tested for or protected against, so having a fast time to recovery is important
- Having developers own their services with on-call mechanisms improves the reliability of the service; it’s best to optimise automatic restarts so problems can be addressed during normal working hours
- Post mortem analysis of failures is important to improve run books or checklists and to share learning between teams
- Incremental roll out of features with feature flags or weighted routing provides agility while testing with production load, which highlights issues that aren’t seen during limited developer testing

Notes and links can be found on: http://bit.ly/2ivoz9w

- 4m:05s - Each domain has different failure and operating modes, and the layered approach to resiliency means that each layer handles its own failures automatically
- 4m:30s - Large systems may fail in unexpected ways
- 4m:35s - Twitter originally had the “Fail Whale” but this has been phased out as the system has become more stable
- 4m:50s - As Twitter grew, it needed to move quicker, with more engineers and less whale time
- 5m:10s - Automation and social tools were needed to improve the situation

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development: bit.ly/24x3IVq
