7 Months to full rebuild
15yrs Uninterrupted operation
4hr RTO maintained throughout
Zero Unplanned DR failures

The Situation

A multinational fund administrator headquartered in the Cayman Islands was operating a transfer agency system that had been designed by the CTO but built using external resources. The platform had been serviceable in its original context, but its architecture was showing significant limitations, both in the scope it left for continued development and in its disaster recovery framework. Extending it further risked accumulating technical debt that would compound over time.

The question put to the technology leadership was whether to continue extending the legacy platform or rebuild it from the ground up. The answer had to account not just for current capability requirements, but for what the business would need over the following decade — and for the resilience properties of whatever system was built.

The Decision

The decision to rebuild was not solely a technology decision. It was also a resilience decision. The legacy architecture could not be made reliably recoverable within the operational constraints of the business — and no amount of extension work would change that fundamental property. A system built on the wrong architectural foundation remains fragile regardless of how much is added to it.

The case for a clean rebuild was made to key stakeholders on precisely those grounds: that the cost of building resilience into a new system from the outset would be significantly lower, over any reasonable planning horizon, than the cost of maintaining a fragile legacy platform and managing the consequences of its eventual failure. The case was accepted, and full autonomy was granted for the rebuild.

"The most important architectural decisions are the ones made at the beginning — before the pressures of live operation create incentives to cut corners."

The Approach

Avatar — the resulting platform — was built as a proprietary transfer agency, CRM, workflow, and reporting system with disaster recovery designed in as a first-order requirement rather than a post-deployment addition. The rebuild was completed and fully operational within seven months.

Key design principles embedded from the outset included a clean separation between the primary operational environment and the DR environment, with no shared dependencies that would prevent one from operating when the other was unavailable. Recovery procedures were documented, tested, and refined before they were needed rather than after. The 4-hour recovery time objective (RTO) was a design specification, not a target derived from a post-build assessment.
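The two principles described above, no shared dependencies between environments and an RTO fixed as a specification, lend themselves to simple automated checks. The following is a minimal sketch of what such checks might look like, not a description of how Avatar implemented them. The dependency names, the function names, and the idea of keeping a dependency inventory per environment are all assumptions made for illustration; only the 4-hour RTO comes from the case itself.

```python
from __future__ import annotations

from datetime import timedelta

# The 4-hour RTO is taken from the case study; everything else is hypothetical.
RTO = timedelta(hours=4)


def shared_dependencies(primary: set[str], dr: set[str]) -> set[str]:
    """Return any dependency that appears in both environments.

    A non-empty result means one environment could be taken down
    by the failure of something the other also relies on.
    """
    return primary & dr


def drill_meets_rto(recovery_time: timedelta, rto: timedelta = RTO) -> bool:
    """Return True if a measured recovery drill finished within the RTO."""
    return recovery_time <= rto


# Hypothetical dependency inventories for each environment.
primary_deps = {"primary-db", "primary-auth", "primary-storage"}
dr_deps = {"dr-db", "dr-auth", "dr-storage"}

overlap = shared_dependencies(primary_deps, dr_deps)
if overlap:
    print(f"Shared dependencies found: {sorted(overlap)}")
else:
    print("No shared dependencies between primary and DR.")

# Hypothetical drill result: recovery completed in 3 hours 30 minutes.
print("Drill within RTO:", drill_meets_rto(timedelta(hours=3, minutes=30)))
```

Run as part of a regular recovery test, a check of this kind would turn the RTO from a stated intention into something that is measured each time the procedure is exercised.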

Resilience requirements were continuous design inputs throughout the development cycle. The business continuity (BC) practitioner, who held designated BC authority for the organisation, sat alongside the development process rather than reviewing outputs after the fact. This is the principle that the engagement illustrates most directly: when resilience is in the room during design, it does not need to be retrofitted later at greater cost and lower effectiveness.

The Outcome

The platform was maintained in live operation from March 2009 to April 2024 — fifteen years — without a single unplanned DR failure. The 4-hour RTO was sustained throughout that period, through multiple rounds of platform evolution, two acquisitions, and significant growth in the complexity and volume of the operations it supported.

At the point of acquisition, Avatar was described by the acquiring organisation as industry-leading. Its resilience properties were a material part of that assessment: a platform that had demonstrated fifteen years of reliable operation, with documented and tested recovery procedures, is a qualitatively different acquisition from one that has merely not failed yet.

"Resilience embedded at design costs a fraction of resilience retrofitted after deployment — and performs differently when it matters."

The Lesson

The case is not primarily about the technology. It is about the decision-making framework that produced the technology. The choice to rebuild rather than extend was a choice to treat resilience as a design requirement rather than a compliance consideration. The fifteen-year operational record is the outcome of that decision — made once, at the beginning, before the pressures of live operation created reasons to compromise it.

Organisations that find themselves in the legacy extension trap, spending increasing effort to maintain a fragile system that cannot be made reliably recoverable, are often aware of the problem but underestimate the cost of continuing relative to the cost of resolving it cleanly. The rebuild option is rarely as expensive as it appears once the alternative is accurately costed over time.