Approaches to replication and redundancy

Jaap Velde, modified 5 Months ago.

Keen Forecaster Posts: 1 Join Date: 3/13/15
In the past, Delft-FEWS has primarily relied on application-level replication for disaster recovery (DR) setups: a first set of servers runs the main workload, while a second MC with supporting servers (FSS, Web PI, etc.) runs off a second database and is kept in sync. Given the architecture, and considering the state held by most of the components, that seemed like the most sensible setup for a system like FEWS.

However, in the current model (FEWS 2019-02), it would seem that an equally valid strategy for DR / redundancy is to replicate the database and file storage outside the application and simply keep recent images of the servers on hand to spin up in case of a major incident. New VMs in the cloud start up within a few minutes, and database and file replication can be set up with fairly low latency and high reliability. The advantage of an ‘always on’ spare, to which users can switch within seconds, over a freshly provisioned set of servers connected to up-to-date data sources, which takes minutes, therefore seems to be quickly diminishing.

One of our clients currently runs a full DR based on live database replication and server disk image replication. Additionally, they have a fully live 'classic' FEWS DR environment, which mainly serves to increase availability of the application in case a specific FEWS server fails. The full DR in this case is an infrastructure-wide DR, with all of their mission-critical applications tied together, and switching to it is an expensive last resort, which is why the traditional DR is still considered worth having. However, the full DR has been tested, and the tests show that the solution works in principle without issue. This prompted me to consider the costs and benefits of replicating the FEWS environment only at infrastructure level instead of at application level.

I tried to find some guidance on this on the wiki, but couldn’t find any. Is there information available on this, or concrete advice you can share? Does anyone have experience with this approach, or know of reasons why it is best avoided?
André Speelmans, modified 5 Months ago.

RE: Approaches to replication and redundancy

Keen Forecaster Posts: 2 Join Date: 10/30/14
From the viewpoint of FEWS, we approach this mostly as application-level redundancy, because that is the only level we can control.
Technically, there is no reason this has to be done at the application level, and if you have an infrastructure DR in place, it is definitely a valid option to use that and avoid the complexity of the dual MC setup. The "live" data in the FEWS system is contained in the database and, for users of the Open Archive, on the filesystem of the Archive Server. All other components hold only static data, so making an image of them after installation and spinning it up again should be fine.
Jan Talsma, modified 5 Months ago.

RE: Approaches to replication and redundancy

Keen Forecaster Posts: 2 Join Date: 5/30/12
This question has been posed to Delft-FEWS support. These are our conclusions:
(Geo-)replicating databases is indeed supported in many environments (e.g. Azure, Oracle RAC, AWS, etc.). The main obstacle is that it is not possible to also replicate the Master Controller service itself, for example by running a second MC simultaneously against the replicated database.
A very limited number of clients (fewer than 3) use database replication; there, the switch that makes the secondary MC the active Master Controller is executed by scripts. These scripts depend heavily on the cloud environment, so this is impossible to support from Deltares' perspective.
If database replication needed to be supported by allowing a secondary "sleeping MC" to take over when the first instance is down, this would require a feature request. The database would need an additional mechanism to guarantee that at no point in time two MCs are active concurrently.
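
Purely as an illustration of what such a mechanism could look like, the sketch below uses a single-row lease table that an MC must hold (and keep renewing) before it acts as the active instance. This is not an existing Delft-FEWS feature: the table `mc_lease`, the functions `init_lease` and `try_acquire`, and the 30-second lease duration are all hypothetical, and SQLite only stands in for the real database. It also assumes both MCs read and write the same consistent database; with asynchronous replication, a stale lease row on the secondary could still allow two active MCs.

```python
import sqlite3


def init_lease(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) lease table; CHECK (id = 1) allows one row only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mc_lease ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " holder TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn: sqlite3.Connection, mc_id: str, now: float,
                ttl: float = 30.0) -> bool:
    """Acquire or renew the lease for mc_id; return True if mc_id now holds it.

    `now` is passed in explicitly to keep the sketch testable; a real setup
    would more likely use the database server's clock.
    """
    with conn:  # one transaction: commit on success, roll back on error
        # Seed the single row if nobody has ever held the lease.
        conn.execute(
            "INSERT OR IGNORE INTO mc_lease (id, holder, expires_at) "
            "VALUES (1, ?, ?)",
            (mc_id, now + ttl),
        )
        # Renew if we already hold the lease, or take over if it has expired.
        cur = conn.execute(
            "UPDATE mc_lease SET holder = ?, expires_at = ? "
            "WHERE id = 1 AND (holder = ? OR expires_at <= ?)",
            (mc_id, now + ttl, mc_id, now),
        )
        return cur.rowcount == 1
```

In this sketch a "sleeping MC" would call `try_acquire` periodically and only start processing once it returns True, while the active MC would stop processing as soon as a renewal fails.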

Best Regards,

Delft-FEWS Support