Dan Pritchett has brought up an interesting issue:
Software should never need to be restarted.
This is a pretty strong statement, and the question is when and how we can realize it. These questions are actually broader than they first appear.
Software development obeys the same economic rules as any other domain. Cost matters both when software is developed and later when it is operated and maintained. On top of that, there may be business requirements that constrain the overall system or service design and therefore its cost. All of this mixes together to define the software quality you end up with.
If you are developing software that is expected to run for a long time without restarts, it is probably part of a critical business process. Such processes have higher scalability, reliability and, most importantly, availability requirements. If this is your case, you can address it already in the design stage. There are several ways to do it, but if your development platform allows it, redundancy is your best bet. You are in an even better position if the service is stateless. Then, even without a supporting platform and with little or no effort, you can run several instances of the same service type in redundant mode.
The beauty of service redundancy is that the logical service remains available even when not all physical instances in the same logical service cloud are running. Problems with short times between restarts therefore become less pressing for service clients. Redundancy also means that development and testing of the redundant instances remain the same, no matter whether you restart often or not.
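To illustrate why clients stop caring about individual restarts, here is a minimal sketch. It assumes each redundant instance of a stateless service can be modeled as a plain callable; the names `call_with_failover`, `restarting` and `healthy` are hypothetical and stand in for whatever transport a real service uses.

```python
from typing import Callable, Sequence


class AllInstancesDown(Exception):
    """Raised when no instance of the logical service could answer."""


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant instance in turn and return the first answer."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except ConnectionError as exc:
            # This instance is down or restarting; move on to the next one.
            errors.append(exc)
    raise AllInstancesDown(f"{len(errors)} instance(s) failed")


def restarting(request: str) -> str:
    # Hypothetical instance that is currently being restarted.
    raise ConnectionError("instance is restarting")


def healthy(request: str) -> str:
    # Hypothetical stateless instance: the answer depends only on the request.
    return request.upper()


if __name__ == "__main__":
    print(call_with_failover([restarting, healthy], "quote"))
```

Because the instances are stateless, any of them can answer any request, so the client only sees a failure when every instance is down at the same time.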
If you don’t have the luxury of a stateless service, or you can’t add redundancy support to your system, then you have to dive into the code. This essentially means iterating between integration tests (which are usually quite long), analyzing the results and, when they indicate problems, debugging and fixing the code. You do this until the requirement for long uninterrupted running time is met. The problem with this approach is the time it consumes, which translates directly into overall development cost.
The cost involved comes mainly from time spent on testing, problem analysis and code changes. Of these three, code analysis and code changes are probably the biggest cost contributors, for the following reasons:
· Code analysis is much more costly than crash analysis because you usually have to do a detailed and tedious source code review instead of relying on help from tools or the environment (e.g. when software crashes you get a nice call stack report including the place of the crash; with “slow” degradation of service quality this is much harder. I’m not considering tools like Rational Purify here because you often can’t use them under production-like test conditions, which involve high data volumes and rapid events).
· Code fixes are costly because the analysis often points to changes in multiple parts of the service and usually some change in the data model. All these code changes then increase the potential for introducing new bugs.
I work at a company that must favor system scalability, reliability and high availability for most of its services, even when this means higher development and running costs. We have selected redundancy as our main system design and operating strategy. We have an in-house framework that minimizes the effort needed on the service side to run a service redundantly, even when the service is stateful.
In any case, if a service has a high availability requirement (one aspect of which is a long time between restarts as seen by the client), we prefer redundancy as the solution.
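To put a rough number on this preference, here is a small worked example. It rests on a simplifying assumption that is mine, not something measured in our systems: instances fail independently, and each is available a fraction `a` of the time. Then at least one of `n` instances is up with probability 1 - (1 - a)^n. The function name and the 99% figure are illustrative only.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of a logical service backed by independent redundant instances."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("at least one instance is required")
    # The logical service is down only when every instance is down at once.
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

Under these assumptions, two instances at 99% each give roughly 99.99%. Failures that are correlated, for example through a shared network or a shared external connection, make the real figure lower.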
Even though our environment makes it easy to build redundant systems, we do not apply redundancy in all circumstances. We always carefully weigh whether the requirements really call for it and whether external conditions allow us to apply it. The cost of developing, managing and running a redundant system is also important to consider.
I have also met a case similar to the one described by Dan Pritchett. We were asked to improve service startup time because the MTBF was quite short and restarts had started to interfere with business operating hours. The main problem was that the service depended heavily on an external resource (more specifically, an electronic exchange). If the service lost its connection to the external resource, whether due to internal service problems or network problems, the next reconnect forced it to do a lengthy data resynchronization. The solution was to address both sides: make the resynchronization as short as possible (though we were limited in this) and extend the service to be redundant. Once the service was redundant, we ran two instances in the cloud from datacenters in different geographic locations. If nothing else, the redundancy solution bought us time to properly shorten the startup time and extend the time between service restarts. Of course, we heavily tested the integration, with special attention to the time required for a service restart.
Finally, how should one approach this problem? From the experience I have gained, I think there is a rational way, as follows:
1. All services, regardless of the required time span between service restarts:
a. Definitely test the service under long-running test conditions to see how the software copes. It is always better to know in advance what will happen if such conditions become reality in the production environment.
b. Run long-running test cases early in service development to catch this type of error. The fix will be much cheaper because the code base is smaller and the problem is easier to spot.
2. If long-running service is not your top priority then:
a. Assign the identified problems medium priority to ensure there is at least some chance of fixing them in the long run.
b. Sort those issues by source code complexity and group them by the interrelations between them.
c. Try to fix the group of long-running issues with the lowest complexity alongside regular bug fixes or maintenance extensions. This adds the smallest cost and gives you at least some forward progress.
d. Frequently test the impact of code changes on long-running behavior and, based on the results, update the list of issues and the complexity estimates.
3. Services where long running time is one of the key criteria:
a. Design the service carefully and, if you can, go for a stateless service. This gives you the simplest and cleanest code.
b. If your service is stateful and you have a platform that supports easy redundancy development, then go for it.
c. If you can, run the service in a redundant setup (this applies to stateless services as well). This greatly decreases the critical impact of short times between restarts on overall service functionality from the client’s perspective.
d. If you can’t use a stateless or redundant approach, then make a big effort to simplify the code and keep it that way. This will reduce the cost of analysis and code fixing.
e. Make sure the iteration between code change, integration testing, result analysis and rescheduling of new development is “well oiled” to get the fastest possible round trip. This lowers your development cost and keeps you on target for your deadlines.
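For the long-running tests in step 1, one way to flag “slow” degradation automatically is to sample a resource metric at regular intervals and check its trend. The sketch below is only an illustration: the function names, the sample values and the growth threshold are all assumptions, and how the samples are collected (memory, handle count, response time) depends on your platform.

```python
from typing import Sequence


def growth_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def looks_like_slow_degradation(samples: Sequence[float], max_growth: float) -> bool:
    """True when the metric grows faster than the allowed rate per sample."""
    return growth_per_sample(samples) > max_growth


if __name__ == "__main__":
    # Hypothetical memory readings in MB, one per test interval.
    leaking = [100.0, 102.0, 104.0, 106.0]
    steady = [100.0, 101.0, 100.0, 101.0, 100.0]
    print(looks_like_slow_degradation(leaking, max_growth=0.5))
    print(looks_like_slow_degradation(steady, max_growth=0.5))
```

A trend check like this does not tell you where the problem is, but it turns a vague impression into a pass/fail signal that can run after every code change.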
-Libor