Avoid Durable Messages in Enterprise Services

By libor

Recent blog posts from Harry Pierson (here and here), Nick Malik (here) and Dottie Shaw (here) suggest that good enterprise-oriented services can’t live without asynchronous communication and durable message support. While I fully agree with the async. communication part, I somehow can’t digest the assertion about durable message support. They basically state that message delivery is inherently unreliable (which is true) and that durable messaging is the remedy. They want to use this feature mainly in the following cases:

  1. Cooperating on a transaction with an external system.
  2. Sending a message to an unavailable external system (e.g. a traveling salesman sending data/docs from a remote place via MQ to a central computer), which receives and processes it once it becomes available again.
  3. Making life easier for Operations: tracking messages in case of failure and enabling simpler recovery from a known state without losing message data.

The idea of using durable messages to solve the issues above is tempting, but it is false to think it will address them as expected. At least it is false for a truly service-oriented enterprise application, which is what the posts above refer to. Message durability goes directly against the main application values: robustness, scalability, a well-defined communication context, and system maintainability in terms of updates/upgrades. Here is where I see those problems strike.

Software robustness is another term for resilience. One might think durable messages must be in line with this principle, since a message “can’t” get lost once it is persistently saved to the message store through a transactional exchange between the sender/receiver and the transport layer. But saving message data to the message store does not mean the message can also be reliably and successfully retrieved and delivered to the recipient, because delivery runs in a different transaction context with its own failure scenarios. From this point of view the message and the message store are still a single point of failure. Even if you can ensure an “errorless” condition when sending/receiving a message inside the transport’s message store, you still have to handle error conditions inside the service itself when applying the message data to the in-memory data model.

You might say there is one “simple” way to ensure consistency, and that is a distributed transaction (i.e. DTS). But DTS is a killer in terms of execution speed and configuration complexity, so for a true enterprise solution, especially when coordinating with external sources, it is no real option. If you want to avoid DTS, you have to come up with some sort of error/failure compensation technique. You will start by writing it here and there for these little cases, and suddenly you will end up with complex code inside the service that completely wipes out the supposedly great properties of durable messages.

Moreover, the majority of our clients (we provide an electronic trading system delivered as SaaS) are getting to the point where the resilience requirement is that the application must be uninterruptible even in the face of power or network failure and/or terrorist attack (i.e. we are required to have two or more datacenters spread across different countries to minimize causes of failure…). In this case the transport layer is required to save durable message data in such a way that it is equally accessible from any geographic location, including application-transparent transaction support across all those geographically dispersed locations. Last but not least on resilience is the question of cold/hot service failover, where the location of the service can change dramatically after a failover, both in terms of physical process location and delivery/target address. If you want to handle these issues, the communication layer must be a really advanced and complex thing.

Software scalability is an equally troublesome area for durable messages as robustness. You will certainly need to handle load balancing and/or system reconfiguration to serve more users or different usage/performance patterns. If you use durable messages, which might still be undelivered because the target service was down or the network link between sender and recipient was broken, you might create more problems for the system than you think. The problem is that you cannot estimate how many “pending” messages are on their way to the service or back to the user as replies. Plus, you need to ensure that if the delivery address has changed, “pending” messages get routed to the new one.

Message communication context is again a big problem. If you send a request and it just gets saved into the message store because the target service is not available, you might not be able to query the message status to track the progress of its execution (e.g. you have to write code that deals with all kinds of situations: the message is still in the sender’s message store; the service actually retrieved and processed it, but failed again before the result was sent back and is not accessible for a direct call; etc.). Another important issue arises when message delivery takes too long and you want to invalidate the message entirely because it has become obsolete from a business point of view. You might think a message TTL is a good candidate for invalidating messages, and it might indeed help in some cases, but it is certainly not a general approach.

System maintainability goes down quickly if you use durable messages, because you need to take care of additional persistent storage management (i.e. backup, disk space management, and read/write access rights from both inside and outside the application; think of it as external access to a relational database, etc.), and that storage might not even be easily available to your operations people (i.e. the problem of a “traveling” resource if the persistent message storage is on the local OS system). A whole range of problems comes with updates/upgrades, as you have to come up with a strategy for upgrading “pending” data that has not yet been delivered.

What do I recommend instead, as something that works? I think it is a bad idea to use transactional support in a service-oriented system, as it is not scalable, essentially not reliable, and too complex to use, even before considering the huge performance impact. If you want to build an “enterprise class system” then you have to:

  • Async. communication between services must be the law.
  • Make system data/messages idempotent as much as you can. A lost message is then not a problem, as you can send the same one several times without worrying that the target service will misinterpret the repeats.
  • Live with the fact that message delivery is inherently unreliable and design service functionality around it (i.e. if a message may have been lost, query the target service for the state of the message and, if an error is indicated, take appropriate compensating steps).
  • Instead of transactions (even long-running ones!), design the service so that you can query message/request status and run compensation steps in case of error. With idempotent data you will be living with much smaller problems.
  • Use Event Driven Architecture whenever you can if you seriously plan to address scalability, robustness and modified service growth. This way you make your dev. life and system extensibility much, MUCH easier.
  • Don’t access persistent storage (i.e. in the vast majority of cases an RDBMS) directly from a business service; rather, build a persistent storage service that provides the functionality of the “real” persistent storage for you. Storage access permissions, cold/hot failover support and operation across geographically dispersed systems will then give you far fewer problems.
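
To make the idempotency, status query and compensation points above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `OrderService`, `LossyTransport`, `submit`, the request ids and the balance logic are hypothetical names, not part of any real messaging product. It also simplifies by assuming the status query itself always reaches the service; in a real system that call can fail too and would need the same retry treatment.

```python
import random
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"  # the service has never seen this request id
    DONE = "done"
    FAILED = "failed"


@dataclass
class OrderService:
    """Hypothetical target service that remembers outcomes by request id."""
    results: dict = field(default_factory=dict)
    balance: int = 100

    def handle(self, request_id: str, amount: int) -> None:
        # Idempotent: a request id that was already handled is ignored,
        # so the sender may repeat the same message safely.
        if request_id in self.results:
            return
        if amount > self.balance:
            self.results[request_id] = Status.FAILED
            return
        self.balance -= amount
        self.results[request_id] = Status.DONE

    def status(self, request_id: str) -> Status:
        return self.results.get(request_id, Status.UNKNOWN)


class LossyTransport:
    """Non-durable transport (an assumption): silently drops some messages."""

    def __init__(self, service: OrderService, drop_rate: float, seed: int = 0):
        self.service = service
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)

    def send(self, request_id: str, amount: int) -> None:
        if self.rng.random() >= self.drop_rate:
            self.service.handle(request_id, amount)
        # otherwise the message is lost and nobody is told


def submit(transport, service, request_id, amount,
           max_attempts=50, compensate=None) -> Status:
    """Send, query the status, resend if lost, compensate on failure."""
    for _ in range(max_attempts):
        transport.send(request_id, amount)
        state = service.status(request_id)
        if state is Status.DONE:
            return state
        if state is Status.FAILED:
            if compensate is not None:
                compensate(request_id)  # caller-supplied compensation step
            return state
        # Status.UNKNOWN: the message was lost, so send the same one again.
    return Status.UNKNOWN
```

The sender needs no message store here: a lost message simply shows up as an unknown status and is sent again, and because the handler ignores request ids it has already seen, a duplicate never changes the balance twice.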

 - Libor

