Context
The fund administrator had tested disaster recovery before — but always in isolation. The procedure was: take the system offline, fail over to the DR environment, process a small number of dummy transactions, then fail back. Clean, contained, and insufficient as a test of real-world operational resilience.
The live system had three core databases: the primary system database, a system document database, and a document delivery database. The first two had high-availability solutions with straightforward recovery procedures. The third — the document delivery database — held every document sent from the system to clients and investors. It was not required for internal operation and housed a substantial and growing volume of data. Critically, the document delivery database at the DR site was empty.
This asymmetry created a specific recovery challenge: documents created and sent during any period of DR operation would be written to the empty DR document delivery database. Recovering back to primary therefore meant running an additional step: inserting those DR-generated documents into the primary document delivery database. Prior experience of large-volume data migrations and ETL processes had shown that document transfer at scale could take considerable time. This had left the CTO with a well-founded reluctance to fail over to DR unless the situation absolutely demanded it.
The problem with that reluctance, as had been raised following an earthquake experienced in Cayman, was that it set the threshold for activation too high. BC and DR must be actionable as a precautionary step — deployed when conditions are deteriorating, before they become critical — not held in reserve for the moment when there is no alternative. That principle requires confidence in recovery. The exercise was designed to generate that confidence.
"BC and DR must be actionable — not held in reserve for the moment when there is no alternative. Bureaucracy in the decision-making process is itself a resilience risk."
Exercise Objectives
Nine specific objectives were defined, spanning communications, DR technical validation, data integrity, cross-continental coordination, and organisational engagement.
Communications system — primary and DR
The STORM communications module was built into the proprietary system itself, which would transition to DR mode in a serious scenario. The exercise needed to verify it functioned from the primary site before failover and seamlessly from the DR site after it.
Document delivery database recovery
Demonstrate that the additional recovery step for the document delivery database — inserting DR-generated documents into the primary database — was manageable, including in a phased approach if the organisation had been operating under DR conditions for a significant period.
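A phased merge-back of this kind can be sketched in a few lines. The example below is illustrative only and is not the administrator's actual procedure: it uses SQLite as a stand-in for both databases, and the table name `delivered_documents`, its columns, and the function `merge_back_in_phases` are all hypothetical. It copies DR-generated documents into primary one batch at a time, committing after each batch so the work can be paused and resumed, and it skips documents that are already present so a repeated run is harmless.

```python
import sqlite3

# Hypothetical schema, used here for both the DR and the primary database.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS delivered_documents ("
    "doc_id TEXT PRIMARY KEY, recipient TEXT NOT NULL, body TEXT NOT NULL)"
)


def merge_back_in_phases(dr_conn, primary_conn, batch_size=500):
    """Copy DR-generated documents into primary in batches.

    Returns the number of rows newly inserted into primary.
    """
    inserted = 0
    last_id = ""
    while True:
        # Keyset pagination: read the next batch in doc_id order.
        rows = dr_conn.execute(
            "SELECT doc_id, recipient, body FROM delivered_documents "
            "WHERE doc_id > ? ORDER BY doc_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return inserted
        before = primary_conn.total_changes
        with primary_conn:  # one committed transaction per batch
            # INSERT OR IGNORE skips documents already merged earlier.
            primary_conn.executemany(
                "INSERT OR IGNORE INTO delivered_documents "
                "(doc_id, recipient, body) VALUES (?, ?, ?)",
                rows,
            )
        inserted += primary_conn.total_changes - before
        last_id = rows[-1][0]
```

Because each batch is its own transaction, the merge can be spread across quiet periods when the organisation has been running in DR mode for a long time, which is the situation the phased approach is meant to cover.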
STORM messaging templates
Test the preset storm messages defined within STORM: templates with merge fields for storm parameters (current strength, projected strength, anticipated landfall time), content defined with the BC Committee and Crisis Management Team.
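As a rough illustration of how a preset message with merge fields behaves, the sketch below uses Python's standard `string.Template`. It does not represent how STORM is implemented; the template wording, the field names, and the `render_storm_message` helper are assumptions made for the example. The useful property to test is that a message with a missing storm parameter fails loudly rather than being sent half-filled.

```python
from string import Template

# Hypothetical preset message; the real wording was agreed with the
# BC Committee and Crisis Management Team and is not reproduced here.
STORM_WARNING = Template(
    "Storm update: current strength $current_strength, "
    "projected strength $projected_strength, "
    "anticipated landfall $landfall_time."
)


def render_storm_message(template, **params):
    """Fill the merge fields of a preset message.

    Template.substitute raises KeyError when a field is missing,
    so an incomplete message is rejected instead of being sent.
    """
    return template.substitute(**params)


message = render_storm_message(
    STORM_WARNING,
    current_strength="Category 2",
    projected_strength="Category 4",
    landfall_time="18:00 local time",
)
```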
User access across all locations
Verify that every user across every office and country could access the system from the DR site — not just the core operational team but the full organisational estate.
Live data processing in DR mode
Process a greater volume of data than the handful of dummy transactions used in earlier tests, using demonstration funds across all countries, and send documents from the system while operating from the DR site.
Non-compliance simulation
Vary user behaviour to test resilience against non-compliance with failover instructions. Some users would follow instructions; others would continue working up to the moment of failover. Verify that data saved in the final seconds before failover, and again in the final seconds before recovery, was copied successfully.
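One way to make this check concrete is to reconcile what users were told had been saved against what actually arrived on the other side of the cutover. The sketch below is a generic illustration, not the method used in the exercise: the record identifiers, the use of SHA-256 checksums, and the function names are all assumptions.

```python
import hashlib


def checksum(payload):
    """Return a SHA-256 hex digest of a record's text content."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_unreplicated(acknowledged, target):
    """List record ids that were acknowledged to a user before cutover
    but are missing, or hold different content, on the target site.

    Both arguments map a record id to the checksum of its content.
    """
    return sorted(
        record_id
        for record_id, digest in acknowledged.items()
        if target.get(record_id) != digest
    )


# Hypothetical saves made in the final seconds before failover.
saved_before_cutover = {
    "txn-101": checksum("subscription 5000"),
    "txn-102": checksum("redemption 1200"),
}
seen_at_dr_site = {
    "txn-101": checksum("subscription 5000"),
}
missing = find_unreplicated(saved_before_cutover, seen_at_dr_site)
```

An empty result is the pass condition. In this made-up data, `txn-102` would be flagged for investigation.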
Cross-timezone engagement
Engage members from all offices including senior leadership across all timezones, working with the reality of a geographically distributed organisation under a hurricane scenario that affects one location but requires coordinated response from all.
Demonstrate DR efficiency under BC load
The principals responsible for overall BC were also responsible for the DR of the core systems. Demonstrate that the DR process was mature and efficient enough — taking seconds — to leave that team free to focus on the wider BC position.
Broad organisational engagement
Engage a significant body of employees — not just those in a response capacity — in a realistic scenario, embedding resilience thinking practically and demonstrably across the organisation. Timed before the onset of hurricane season.
Outcomes
The STORM communications system operated as intended in both the primary and DR configurations. Pre-defined messages with storm parameter merge fields deployed correctly.
Failover succeeded in under five seconds. The full failback procedure, including the additional document delivery database recovery step, was demonstrated successfully, as was the phased approach for larger data volumes. This directly resolved the CTO's reluctance: the evidence now existed that the extra step was manageable, which made it possible to use failover as a precautionary measure.
All pre-defined STORM message templates performed as expected with correct merge field resolution.
All participants across all locations — Cayman, Canada, Ireland, and the United States — accessed the system successfully in DR mode.
Data was successfully entered and edited within the system in both primary and DR modes. Documents sent from the system during DR operation were correctly captured.
Data submitted in the final seconds before both failover and recovery was successfully preserved — including from users who continued working in deliberate non-compliance with instructions.
A significant number of employees volunteered to participate over a weekend to limit operational disruption, representing all offices, all departments, and all timezones the organisation operated in.
The DR process completed in seconds, leaving the BC and IT teams immediately available to support other needs — demonstrating that DR could be executed without consuming the team responsible for the broader BC position.
A substantial body of employees across the organisation participated in a realistic scenario exercise, timed to serve as a practical pre-season resilience reminder.
The Governance Outcome
The most significant outcome of the exercise was not technical. It was a change in governance. Following the exercise, sole failover authority was granted to the BC authority — replacing the requirement for a joint decision with the CTO. The evidence generated by the exercise — that failover was clean, that recovery was manageable, and that the additional document delivery step could be handled in a phased approach where required — removed the technical basis for the joint-decision requirement.
It is important to understand the principle the governance change reflects. The point is not that a specific individual was granted sole authority. The point is that BC and DR arrangements must be actionable in scenarios where the window for a precautionary response is short. A failover triggered before conditions deteriorate, at a moment chosen by the BC practitioner rather than forced by circumstances, is executed under better conditions and with more options than one triggered at the last moment. The joint-decision requirement, however well-intentioned, was a structural obstacle to that precautionary posture.
"The exercise did not find something broken. It demonstrated something working — and created the evidence base for a governance change that made the organisation faster to respond when it mattered."
The Lesson
Two related lessons emerge from this engagement. The first is about exercise design: exercises that are designed only to find failures miss an equally important function — generating the evidence that builds confidence in capability that already exists but is not being used. The inhibition to failover in this case was not irrational. It was based on genuine uncertainty about the recovery step. Resolving that uncertainty required a demonstration under realistic conditions, not a review of the procedure on paper.
The second is about governance. Resilience arrangements that require multi-party authorisation for time-sensitive decisions carry an embedded risk: the decision will be delayed, or not made at all, in precisely the scenarios where speed matters most. BC and DR governance should be designed so that a single competent, accountable individual can act decisively when the situation requires it. The exercise created the evidence base for that design change. The principle it reflects applies far beyond this specific engagement.