Protocol Buffers – Missing Usage Guide?

Monday, July 14, 2008 by libor

Last week Google released code called Protocol Buffers (PB) under open source license. Code essentially enables hierarchical data serialization base on some IDL like definition to binary form with advantage to be cross languages (presently Python, Java and C++) and cross OS capable.
 Not long after code release appeared many articles discussing code capability and (dis)advantages compare to XML. The big wave of reaction was also spurred by suggested use of binary message together with RPC method call.
Among posts on this topic were such highly respectable people like Ted Neward, Stefan Tilkov, Steve Vinosky here and here or Dare Obasanjo. Unfortunately those “big shooters” failed short with clean recommendation what part of solution (if anything) to use and under which conditions.
IMHO list shall their list look something like following:
Use it:

  1. Only for internal cross-service communication. Avoid expose it to client facing API.
  2. You need to handle high volume and/or low latency traffic and message parsing and size is high on your list of things to care.
  3. Underlying protocol is TCP, message queuing or similar (HTTP text nature definitely negates gains from binary here).
  4. Size of message or processing speed does not justify use of compression on more general text form like XML.
  5. If multiple languages are used on cross communicated services and you don’t need general and heavy format like IDL (i.e. WSDL or similar).

Not recommended:

  1. For client facing API.
  2. When you need document exchange rather than specialized business/processing data processing (i.e. document source which is transformed in some/several stages to XHTML presentation – invoice type data).
  3. When is used HTTP based communication protocol.
  4. When you use post data processing with standardized processing data tools (i.e. XSLT, etc.).

Questionable:

  1. Suggested use of RPC as offered by abstract classes in the PB. Supplied interface without reference implementation might not be good guidance for uses as example for proper implementation on different communication protocol (i.e. Danger of easy fall into RPC leaky solution).
  2. Use as storage data format. Main concern here is limited capability from DB point of view to understand message content which limits use data in WHERE part of queries.

I don’t think PB is dangerous for XML, REST, EDA, SOA or whatever else. It only shows GOOGLES pragmatic approach to the technology use. Is nice to be XML fan but who will pay bills for servers, electricity and maintenance? If you control environment from end to end then you can allow this kind of solution very easily as you can ensure consistency and management of definitions here. Moreover is not a problem generate out of “proto” file some form of XML schema and processor code to enable exchange data in this format.
 Right tool for right job is the key.

-Libor

QCon London 2008 – Part 2 (Banking track details)

Saturday, March 22, 2008 by libor

This post is solely dedicated to presentations on “Banking: Complex high volume/low latency architectures” track. For more general conference overview you can go to my previous blog post.

Banking track was very interesting for me not only because I’m working in exactly same domain field but also because challenges imposed by high volume/low latency systems demands very well balanced architecture with extremely careful selection of technology in use. Moreover in this domain is true that some of the latest/greatest stuff of emerging technologies is not always usable here (i.e. for example dynamic languages, WS-*, etc.). Track itinerary should started with Lennart Augustsson presentation about use of DSL in option pricing. Calculate option prices is considered to be the most complex task among all applied areas of finance calculus you can do therefore I was quite anxious to see how DSL model can make this task easier. Unfortunately Lennart Augustsson has got sick and therefore no goodies from this one. Instead of planned DSL presentation take spot place John Davies. His quickly hammered out presentation was actually very refreshing for me. He made great overview of what is state of requirements there, what are expected message speed and volume numbers, what are possibilities and limitations of use of varies data interchange protocols and finally what IONA can offer when dealing with large number of messages specifically in case of messages transformed from one format to another with use of JavaSpaces.
The biggest take away here:

  1. Banking sector will soon (i.e. within 3-6 years) become number 1 in world in processed data volume (i.e. it will surpass Telcos here!). 
  2. By far the most common application interchange format is CSV as simple and easy to understand form – this I have to totally endorse. XML looks for front and middle office as very blown version which adds a huge portion of latency time during parsing on execution.
  3. The back office uses SWIFT protocol. Protocol is very messy but when implemented and used in production by companies almost no errors are encountered. Why? There is heavy penalty from SWIFT calculated as considerable % of price the message carried. Because messages typically carry several millions/billions of dollars penalties are quite costly to companies. This is also way how to ensure system reliability… ;-)
  4. It seems they are using JavaSpaces extensively. There were shown examples on “matching” and “message translations”. Both examples are assuming use of “perfectly” parallelizable process and this starts troubles for me. There was shown a big gain one can get with javaspaces. I’m not 100% sure here about message translation case but order matching is definitely out of scope of this assumption. In real case one must ensure correct order of matching (first one is first served, reflecting special order modificators like complete volume, etc.)  and I believe message translation has exactly same case (i.e. send of translated message from application must be in exactly same order as they arrived). Therefore I would doubt you can gain as much as it was presented especially in case processing must be synchronized.

Next part was about “Keeping 99.95% up time at Merrill Lynch” (i.e. ML) presented by Iain Mortimer. Presentation was primarily focused on designing centralized management and monitoring system on core bank (i.e. tier 0 and 1) systems to quickly pinpoint HW/SW problems and therefore prevent failure (i.e. when there is low disc space) or identify quickly failure(s) and estimate what to do to fix it.

Major challenge ML faced was to unify management and monitoring of 344+ core systems across two geo data centers (i.e. in the USA and Singapore). I’m not sure how many HW boxes and processes they actually have to manage but according Iain’s statement their new monitoring system generates 20 000 events/sec which is a huge number (720M events a 10 hours working day!).

They elected to use slightly “customized” version of standard monitoring/management system. Iain did not mentioned that specifically but I can imagine well known source like HP Open View, IBM Tivoli, etc. Because of number of updates they collecting they’ve used with hierarchical event processing. They make 4 levels of event aggregation, analyzes and monitoring from lowest HW/SW process level up to datacenter/global level. Each monitoring level makes attempt to analyze state of incoming events and decides if events shall be propagated to higher level. This way they can greatly limit number of events on each management level including top one (i.e. datacenter) and therefore not overload system. This approach seems to me is quite smart move.

Presentation of “Real-time Java for latency critical banking applications” was probably the most important for JAVA based developers and not that much in my interested. For me it was curiosity to see what Real-Time systems based on GC can do and how they are constrained from code and memory utilization point of view. Presentation was mainly related to upcoming version from SUN. It only runs on newest Solaris 10 with latest SUN hardware which was regarded by audience as unpleasantly closed target deployment. It would be definitely good to have such chance to test real-time version of .NET in enterprise world but at the end the big question is whether worth to do it as coding and managing can gets quite tricky.

After a bit of real-time theory it was given talk “From Betting to Gaming to Tradefair”. This one was supposed to be again some practical application stuff. Matt Youill actually presented Betfair’s development path from betting site to trading exchange. I have got mixed feeling from it. Presentation has way too much “marketing” in it and second half was probably the lowest quality from all of presentations. What I have taken from it is Betfair has build in whole business logic into Oracle RAC database. Actually they utilized DB so much they are among 5 “hottest” Oracle DB instances in world. Given that they have over million registered users and revenue 200M GBP it looks to me quite risky for company to run system only on single DB instance not to speaking about missing geo scaling/diversity (it might be that they have actually 2 data centers and database sync is done via log shipping but how often they can update state on second place? Definitely not real time failover).

Tradefair application was build in different layout. They make use of distributed services with detail step journaling and persistent support. Implementation is done over abstract Actor object. This object represents user actions and play major role in handling user requests (submit orders, etc.). As the Actors progress with execution they directly save steps into plain disc files. For each master Actor instance there is backup instance which gets synchronized via those saved/updated files (i.e. log tailing).  Obviously this means they use shared SAN disc RAID array to address system reliability and recovery.

Seems to me solution will suffer when they would need to go for geo scaling as RAID setup is not really option here. Another question is how difficult is to do the user scaling if each actor generate file (I assume one specific file instance belongs to specific user) and how easily they can address error/failure inside system given they are using specialized files for each actor. Is really question how they trading adventure ends up.

Last presentation from banking track was “LiquidityHub” presented by Tony Harrop & Jeremy Vickers and I have to say very interesting one. Solution they shown was based on Spring 2 container and with use of JRocket real-time Java version. Given that they were able create solution from nothing to the production within 9 month is quite admirable task.

Solution is essentially based on 3 service configuration (gates in, calc engine, gates out) with communication realized  via Fiorano JMS among them. They claimed end to end latency is between  2-4 miliseconds within 95 percentile. On direct question whether they are using Hibernate for saving data they carefully stated NO due to performance reasons. Overall solution seems to be quite lightweight but I doubt they can really achieve claimed latency on JAVA framework even if used real time version.

And what left unanswered from all presentations? How all those companies reliably measure latency on distributed system without specialized HW? As I’m working on Windows I have two problems here. First one is relatively coarse tick of standard clocks (i.e. sensitivity 16ms here) if I’m not considering high-perf. timers. Second one is how they ensure all clocks across distributed computers are time synchronized on such precise level (i.e. essentially microseconds if one needs to measure in milliseconds). So lack of explanation from presenters makes a big dent on presented latency times.

-Libor

QCon London 2008 – Part 1 (Conference Overview)

Monday, March 17, 2008 by libor

The QCon is without doubt top ranked conference and first one which I visited thanks to my employer. I was very anxious to see what such conference is like and I have to admit it did not disappoint me at all even I have been part of it only during Wednesday.
QCon was jointly done with JAOO therefore I was expected there will be large portion of presentations dedicated to Java. What surprised me a lot there was how overwhelming majority Java has got among companies in exhibition hall and more importantly in “Banking/Financial Complex high volume/low latency architectures” track. “Banking” track I’m mentioning here because it was exact fit with my professional interest and also where I have been primarily subscribe to. Actually none of the presented application on “banking” track did even touch Windows and/or .NET based system to my surprise! According to John Davies presentation 80-85 percent applications are actually written in Java in this domain. Rest of “market” share is then predominantly occupied by C/C++ due its performance/memory capabilities and predictability of execution (i.e. all “standard” GC based environments are fighting with unpredictable time of execution on near real time application with low latency here).
As second observation which astonished me quite greatly was how many people (there was accredited 600 people according organization’s info) used MacBook machine variants (well given that JAVA was predominant on conference I should probably expected that) carried in sleek “MB” covers. Even more people (I would say majority at least according to my impression) owned iPhone. I was so shamed of my old Siemens mobile “gizmo”.
Real conference program started with keynote from Erich Gamma. His presentation was dedicated to Eclipse development. It was pretty OK from my point of view but nothing which I would be excited about. First half of speech was actually quite interesting as Erich dedicated it to project planning and management. The biggest take out was that program release shall stick with firm release date. The best date working form Eclipse point of view is June as this creates the smallest disruptions due holidays and other events. They are developing in 6 week dev. “sprints” periods (1 week planning, 4 development and 1 integration and code finalization/testing) during year to add new features for yearly release. One month before release they fully dedicate all dev. resources to final code polishing and testing.
What was also quite interesting on Erich’s presentation was describing how fast formed strong Eclipse community was. Community fully covered support of newly joined developers/users, writing manuals, etc. which actually freed original IBM dev. resources to again do what they know the best – full speed development. This is definitely very interesting model to consider for similarly typed projects.
Banking track was naturally the high interest for me given that I’m working in exactly same domain problem field. Because of many interesting thinks were presented there I decided to put it into separated post.
Day was finished with great entertaining keynote given by Martin Fowler and Jim Webber on theme ESB use in SOA based applications. In many aspects it remained me presentations from Don Box.
Even thought presentation was essentially same information value as I have already seen recorded on internet for example here never less it was big fun time. They make so much show and great marketing value to convince all attending people to prefer internet based message integration via standard HTTP protocol for SOA compare to ISV specific solution. I would say everyone must get this message clearly presented and seriously start to think how to use it in own solutions.
The end of Wednesday was finished in “Revolution Bar” of London Soho. Get there was surprisingly challenging task for our guides. From conference center has QCon organizations prepared buses which took us close to Piccadilly Circus. Bar was just several corners around as later we have discovered but our guide got lost there. It was funny how group of roughly 150 people walked “pointlessly” through narrow streets there and tried to find bar place on GPS maps in mobile phones. It took good 20 minutes till we got to right place.
In the bar I have finally take a chance to talk personally to high profile people like Steve Vinoski and Jim Webber which bolgs I read regularly already long time. I was mainly interested in getting their personal opinion on use of REST/ATOM in high performance systems as they are not usually addressing that in their write-ups/presentations.
Jim Webber was quite open and admitted that his recommendations are mainly applicable to systems where latency is generally 1second and more. This seems fair to me. Actually vast majority of enterprise applications can fit into this category where prime examples are ERP systems and like.
Steve Vinoski suggested that for such case one shall probably follow REST model conceptually if not directly via common HTTP version due performance constrains. That is certainly possible but I have quite strong doubt is practical here.  Does anyone know such successful application in this field to confirm Steve’s suggestions?
All in all I have to say conference was well worth it to attend!
 

-Libor

Tea Break #20071031

Wednesday, October 31, 2007 by libor
  • Today date gets interesting meaning from computer science perspective: 31OCT == 25DEC. Will IT Santa come in the town? ;-)
  • InformationWeek article points out on very interesting twist in the IT infrastructure of enterprise world. SaaS based applications become significant force to change traditional position of “kingdom” of enterprise IT department. As our product is available only via SaaS model we’ve got similar experience as described. But winning position against client’s in-house IT department is not easy one. To convince clients adopt sophisticated SaaS application you need to go extra mile or two to help them integrate offered solution with their internal processes and infrastructure. And you can be sure their IT will fight hard against “foreign intruders”. Definitely interesting time for IT and business sector is coming.
  • There are many ways how to develop distributed system and more importantly how to manage it. Applications complexity will continue to grow while app management becomes more and more overtaken by business staff rather than IT specialists. IT department or service provider will “just” ensure that application will perform optimally on given set of HW and within expected reliability/latency. This is clearly shown by trend of enterprises to adopt SaaS applications (see previous point). Application management functionality will therefore need to offer simple to use and “all-in-one-place” presentation/access. Additionally would system need to ensure configuration consistency and eliminate all miss configurations which might lead to system malfunctioning. I don’t see any other way to solve that than make management centralized – at least from user’s point of view (no matter if “user” means here business staff or IT specialist). I have made some comments on this theme related to original Harry Piersons article. Harry Pierson replied with position that centralized management “doesn’t make it feasible at any significant scale”. Unfortunately my experience with it is contrary to his statement. Moreover I have “living” example within our application. It would be therefore beneficial if Harry can dedicate some time and clarify what management model he has in mind and what constrains he assumes. And what about clarification what scale is significant enough to prove something is working (if one application instance is spread over 50-100 boxes and two/more geo locations is large/significant enough)? Arguments which Harry places against centralized management seems to me are at very best quite weak at least for “average” SOA based application.

-Libor

Tags: ,

Missing Pragmatic View on Management of SOA Based System

Friday, October 26, 2007 by libor

Is interesting to see how so many people are writing about SOA technical details (communication protocol, best used language, etc.) but almost nothing is written about management and operations of such system.

What are then management requirements of SOA anyway you might ask? Well it depends from many constrains but if you are expecting to do real enterprise application then you have to prepare system to be at reasonably manageable. If your application is going to be SaaS subtype then application management is even more important (i.e. management cost is directly related to your expenses which means your net profit not to mentioned maintain high level of client satisfaction). Configuration and management mistakes can have profound impact on application stability; performance and reliability which at the end of day might render whole system look to your clients as unusable.

OK, management in SOA is important part of solution but what with it? How to realize it? Shall we follow same pattern as on services and decentralize it or shall we make some “logical” central point where management application or users are able work with it is?

My preference is to make configuration and management accessible from central place as this makes operations live much, MUCH easier and application much more stable from configuration and management point of view. With this approach you simply get all data on one place which makes process much easier to handle, check consistency and address issues. Therefore I can’t really agree with Harry Pierson as he writes:

Yes, decentralized decision making can create a management nightmare. Personally, a management nightmare is much more attractive anything centralized approaches have ever delivered in the IT industry.

Probably the most pragmatic way how to realize central management is as follows:

  1. Let all services adopt same common management interface (see SmartFrog for one implementation example – via Bill de hÓra).
  2. Run some management agents on each HW node where you are intent to run any service from your solution (not strictly necessary by greatly helps with some tasks like start new service instances, reporting health state, etc. See SmartFrog for example of use).
  3. Use centralized configuration database to:
    1. Record service topology (what runs where, preferred locations, failover rules, etc.).
    2. Collect application telemetric data (this term “coined” by Dan Pritchett).
    3. Keep and manage detailed service setup to ensure consistent setup for all services regardless of their physical running location.
    4. Optionally keep and manage upgrades/updates for services from one place.
  4. Define process to ensure configuration database and access to it is highly reliable (it might be done via standard tools of DB or via custom solution if you need to address special requirements).
  5. Develop in-house (recommended if you are really serious about it) or use 3rd party management application which understands configuration DB and how to interpret them.

As you can see listed items above creates significant additional work for design and application development. Even this is additional work it definitely pays off not only in less trouble and software issues in production environment but much more likely during testing when is necessary to test varies number of configurations in controllable way.

And now probably the most important question. If you will go for central configuration and management will retain all benefits SOA approach? You might have at this point same opinion as Harry Pierson on it.

SOA’s “distributed nature” is it’s primary strength. SOA’s not primarily about standards or ease-of-connectivity – though those obviously play a role. It’s about enabling decentralized decision making. Since you can’t be both centralized and decentralized, enforcing centralized management basically negates SOA’s primary strength. This seems like the worst of both worlds to me. All the hassle of distributed decision making combined with all the hassle of centralized management.

My position is you NOT LOSE ANYTHING from SOA goods. Actually opposite is true, as you take the best from both worlds: scalable, distributed, providing intended business value while reliably configured and managed with real business processes modeled closely within IT infrastructure.

I have intentionally avoided talking about concept of ESB so far but if ESB is done properly than the centralized configuration and management capability along with possibility to run individual services in unified environment is probably the biggest contribution of it. Unfortunately is questionable if you can buy such infrastructure on the market just to be right for your application needs.

Finally I would like to stress that average enterprise SOA based application is not equal to large scaled internet application (actually only a few such large scale applications can fit to this definition internet application anyway). Therefore no same running, deployment constrains or configurations conditions can be applicable here. you shall not be scared that your SOA based application will use centralized management/configuration and those large scaled internet applications will not be able to do this (at least not in same way you would do it).

Is my view wrong? Comments appreciated.

-Libor

Use of Persistent Storage on Distributed Application Code Shall be Simple

Wednesday, October 17, 2007 by libor

Distributed system are from its nature very hard to manage and even harder to build correctly. One of its key components which must be done right is persistent storage subsystem. Main reason for careful care is storage usually is one of sub-systems which limit entire application in reliability, availability exec. speed and scalability.

As persistent sub-system gets on central stage of interest over time this triggers writing several very interesting papers describing varies approaches to this domain area. Among the most well known is Google’s BigTable or recent paper from Amazon’s Dynamo.

If you want to see quick summary of possible approaches when saving data on distributed system you can go to Dare Obasanjo page which was partially created to explain Dynamo functionality.

While Dare’s article does very nice job of explanation what are pros/cons of each selected approach including that one used in Dynamo, it does not really considers additional possible constrains imposed by business. The most important one is SaaS business model or constrains of versatility in  service SLA definition (i.e. some services in the system might required higher/lover quality characteristics like reliability, durability, etc.).

What I actually mean by specific SaaS business model constrains? Well in short this means that running software must be capable of provide each client with customized level of reliability, availability and exec. speed according client specific SLA even on persistent storage level. For example one client company might elect for simple reliability model where all client data are stored only in one datacenter while other client wants data accessible/replicated data diversification from 2 or more different geographical locations. If this is your case then is impractical make application source code a way to be explicitly aware of stored data geographical location or prepare in code all possible storage option combinations which might occurs. Therefore correct solution is to make application code as simple as possible and rather build-in complexity of resolving version conflicts and replication to DAL or persistent storage layer only.

Lets return back to Dare’s examples and show how one can improve client and server code to enable easy use of such distributed storage. Examples are built with considerations of two users updating their friend list by adding each other as reference to it. Moreover both are using data from very different geo places. System is build due scalability requirements the way of storing data to different persistent storage places usually locally close to user location (i.e. use of shards). Main message is that data for both users will eventually become consistent.

As I have mentioned before leaving application developer to deal with all those kinds of troubles and compensations in case of failed data transfer on save is not good way forward. It makes application code complex, repeats same pattern all over and result code is very sensitive on proper implementation logic. Errors can’t be spotted unless good volume or manual testing is done or maybe some thorough code inspection is executed.

Used storage framework shall therefore make very easy way for developers to deal with persistent storage without worry to do any manual or in code data synchronization. Developer shall not even know where is data physically saved and what level of resilience they got (both might be subject of SLA which typically differs client from client).

Therefore application code we are following in our solution looks ideally like following snippet (fatal errors when calling methods is solved by throwing exception to higher level):

// Saving data no matter what location or resilience level is used 

public void confirm_friend_request_D(user1, user2){  

   DAL.update_friend_list(user1, user2, status.confirmed);  

   DAL.update_friend_list(user2, user1, status.confirmed);  

}  

// Retrieved data no matter from what location.  

// Load balancing support on persistent level when data read possible  

public list get_friends_D(user1){  

   return DAL.get_friend_list(user1);  

}

Persistent storage infrastructure shall then take much more responsibility and make higher effort to be able automatically resolve conflicts between versions and propagate changes from one geo location to another once persistent storage become available again. What I mean by this is that persistent storage sub-system shall be aware of possible relations between physical data locations even between virtual storage nodes and resolve missing or outdated versioned data.

In ideal situation will storage sub-system identify data dependences including those on geo place location (i.e. as in Dare’s case is identified that data for each user are placed on different datacenter location) and run consistency checks/repair/merge process.

With storage system driven reconciliation is raised question which storage component is active in data reconciliation – i.e. one which failed and is back on running track or some “independent” component which actively polls failed services to identify they are back in service. While component with active polling is possible way our experience shows is better to leave such activity on recovered service.

Recovered service can better estimate which data are missing, from where to pull/push missing data and finally when is process recovery fully done so it can be fully operational for all business read/write operations. In typical case of our execution environment (i.e. resilience maintained on sharded virtual storage node done across two datacenters) is recovery from short outage (i.e. say outage few minutes due network failure) resolved on single member node at most in several minutes under full system load. Recovery shall be transparent to all client applications and even shall not make any significant impact on other business saving transactions. In any case you are still optionally able to run reconciliation during off-peek hours or during maintenance window in case data amount is huge (i.e. you are doing join of new node whith replication of entire data set from active node to newly joined one).

Selecting proper storage strategy approach in distributed system is really tricky part. It is highly dependent on business requirements and target functionality. I think if there will be more distributed applications with SOA backbone we will see more those implementations and definitely stronger “drive” towards high scalability, reliability and faster exec. speed. From this environment might even emerge few recommended patter/frameworks which will solve all today’s issues in uniform way.

Tags: , , , ,

Enterprise Solution for Pub/Sub Deserves Better Protocol than Unicast Based Model

Wednesday, October 10, 2007 by libor

Many very respectable people like Mike Herrick, Bill de hÓra, Tim Bray, Stefan Tilkov or Dare Obasanjo are recently writing about event driven applications and its scalability.

Today’s “Standard” Way of Pub/Sub Implementation

All those post are offering different approach of solving it via AMQP with XMPP backbone, SQL Service Broker as replacement of MSMQ or Atom Pub/Sub in Push/Pull mode.

While logical functionality is same and underlying technology or available functionality differ all have basically one common denominator. That denominator is using unicast for data delivery.

Pub/Sub Over HTTP only?

Why is unicast so popular and only option consider here? No one really sad that directly. Is their consideration limited to HTTP due need to support heterogeneous networks mostly over WEB? Or are they integrating many different system from varies providers? I assume one of this cases must be true other vice they would probably explore other implementation as well.

In any case, limiting consideration on above mentioned cases is pity and does not reflect fully reality. In our case we integrate with 3rd party systems too but not over HTTP. Using HTTP REST or WS-* model for everything is short sight way.

Unicast Pub/Sub Implementation Issues

Now its probably time to be fair and list my objections against unicast based version in Pub/Sub system. Well I will start with major one. If you want to deliver same message multiple recipients – and this is usually main reason why you want to use Pub/Sub – then you have to copy it and move it across network multiple times. In case of only 10 subscribers your network traffic will be increased 10 times without matter if you prefer Push or Pull mode! Moreover network use is increasing at minimum linearly with number of subscribers you are getting into much worst position.

If your choice is Push based unicast mode then you are entering more-less into “Sever-Client” configuration. This configuration might be easily used within datacenter but is not usually right choice for web based scenario (BTW Here I have to disagree with Bill de hÓra statement where on Web you are constraint against push approach. We certainly did that in our application and other are capable to do it as well. Take a look to all those on-line trading/betting applications – like Sportbet – which deliver “real time” updates on plain browser installation. Another nice example is Googe Apps).

Send relevant data to all “subscribed” recipients you need to take substantial design and development effort to ensure all subscribers are served without delays (i.e. you have to solve connection/delivery difficulties in case client is not available and probably keep track of user “session” state where is stored what user already got and what should be send over). Our previous trading application version had this type of delivery and believe me this was the most complex and error prone part in the system. I don’t even need to mentioned great problems with addressing scalability and resilience here.

Many people are hoping AMQP will pull ahead and if enhanced with XMPP then brings required Pub/Sub functionality in Open Source version. It may be well true in long run but from present to midterm functionality point of view is offered functionality nothing more than replacement of paid MQ version. Present functionality therefore brings with it all shortcomings you can found on other unicast based solutions when used on pub/sub event processing. We have had chance to run some test and directly talk to guys from iMatix who are developing OpenAMQ as open source version of AMQP. Their software version is running in the J.P.Morgan Chase bank with very fast price data streams but for data which might get lost without recovery in case of failure. Additionally was usage done with relatively small number of subscribers/topics (max. of 20 subscribers here which subscribe all data if I remember correctly) and almost static configuration so unicast implementation is no issue here. If you are serious about scalability, resilience, recovery time and/or possible high number different recipients than you shall do test and review first if this is middleware for you.

Latest contribution to this unicast “gang” is SQL Service Broker based on SQL 2005 version. What is essentially provided here is message queuing system implement directly inside relational database. User can access system via T-SQL commands and use some sort of calls like SEND/RECEIVE. Clear advantage of this approach compare to above mention is proven storage and transactional technology with possibility to buffer send/receive messages in case destination is not available (even for long period of time) or consumer is not fast enough to process incoming messages.

SQL Service Broker implementation even allows using logical destination naming to free application setup from complex knowledge of physical message destination. Basic problems with unicast approach are still there. Additionally you might be very badly hit by execution speed during peek load time. I have found very interesting conversation where is discussed comparison SSB and MSMQ including performance comparison and I don’t much believe in stated performance.  We have developed similar capability based on SQL 2000. I must say 500 updates/sec as mentioned in the article is really ambitious target (in our case was message size around 1kb).

Our solution was able roughly reach top speed 300 updates/sec if only one publisher was used and if combined with one receiver on same SQL instance then we maxed out 400 updates/sec (msg. distributed 230 on send site and 170 on receive side). More senders/receivers hooked to the single SQL instance reached same overall total updates rate (i.e. 400 update/sec for all SEND/RECEIVE operation on given SQL instance which means that for example 40 clients have got only 10 updates/sec on average per one client). Worst of it was unbalanced performance. It appears that in some periodic times especially during peek time become SQL very slow and we were not able address that even with code change/indexing and/or help of SQL profiler.

Therefore If SSB is your favorite option than make sure do performance prototype and measure, measure and again measure under highest num of concurrent clients and highest load you are designing solution.

Multicat as Way Forward in Dedicated Environment

Contrary to all those smart people preferences on selection of unicast based technology for Pub/Sub implementation I do believe more reasonable way of doing it is via utilization of reliable multicast protocol here.

Why reliable multicast approach is usable and in fact shall be preferable here? From my point of view many distributed enterprise applications are running in specialized and dedicated environment with end to end development control. Runtime environment is usually hosted inside specialized and very powerful datacenters with latest hardware or/and networking equipment. These environments are therefore more than capable route and consume multicast messages without any problems. Even in case of geographically distributed system multicast routing is more than possible as in our application case.

What is major advantage of multicast based Pub/Sub implementation then? There are three big pros: most importantly system scalability, simpler business code and last but not least advantage is much lower resource consumption! You can’t really compete with it on proposed unicast implementation especially if you need to address other system qualities like system reliability and resilience at the same time.

If you don’t think my arguments and experience are really valid than go and read more details about other cases where is multicast data propagation/eventing used. One nice example is on upcoming technology – grid based frameworks. Providers like Tangosol use multicast update propagation for data.

Summary

I think there is really big technology gap in Pub/Sub implementation. As anyone else I have my own private hopes on this topic.

The biggest hope is that promising potential of AMQP will not be wasted only to make new JMS like middleware. People around AMQP shall get enough courage to add at least as minimum optional implementation of multicast based Pub/Sub. If multicast option available than I believe more people will go out of present unicast thinking box to see huge potential for easy application use.

Do we really need non-stop running service?

Monday, August 27, 2007 by libor

 Dan Pritchett has brought up an interesting issue:

Software should never need to be restarted.

This is pretty strong statement and question is when and how we shall realize it. These questions are actually broader then seems to be apparent.

Software development obeys same economical rules like any other domain. This means cost measures apply when software developed or later is operated and maintained. On top of it there might be business requirements which constrain overall system/service design and therefore cost expenses. All that is mixed together to define what software quality you will end-up with.

If you are developing software with requirements of expected long running service without restarts then such software is probably part of critical business process. Such processes need to have higher scalability, reliability and most importantly availability requirements. If this is your case then you can help with it already in the design stage. There are several approaches how to do it but if your development platform allows it then redundancy is your best bet. You are even in better position if service is stateless. Then even without supporting platform and with small or zero effort you can run same service type instances in redundancy mode as well.

Service redundancy approach has beauty of providing logical service functionality even in case not all physical service instances in same logical service cloud are running. Therefore problems with shorter time between restarts become less pressing to service clients. Service redundancy approach also means that your development and testing of redundant software instances remain same – no matter if you are restarting often or not.

 If you don’t have the luxury of stateless service or you can’t add support for redundancy to your system then you have to dive into code. This essentially means to do iterations between integration tests which are usually quite long, analyzing results and in case of indication problems debugging and/or fixing code. You do this till requirement of long running service time is met. Problem with this approach is consumed time which materializes into overall dev. cost.

Cost involved is mainly related to spend time on tests, problem analyze and code change. From those three main factors are code analyzes and code change are probably the biggest cost contributors because of following:

·         Code analyze is much more costly compare to service crash analyze because you have to do usually detail and tedious source code review instead of relying on tool/environment help (i.e. When software crash then you will get nice call stack report including place of crash. In case of “slow” service quality degradation is this much harder. I’m not considering tools here like Rational Purify software because often you can’t use it under production like test circumstances which deals with high data volumes and quick events).

·         Code fix is costly because analyze often indicates change in multiple parts of service and usually some change in data model. All this code changes then increase potential for importing new bugs.

I’m working in company which must prefer system scalability, reliability and high availability on most of the services even in case more dev. and running cost is involved. As main system design and running strategy we have selected redundancy. We have in-house developed framework which allow us to minimize effort on service side to run it redundant even in case of state full service mode.

In any case if service has high availability requirement (i.e. one of the aspects of this is long time between restarts from client perspective) then we prefer redundancy as solution.

Even thought our environment is favorable for make system redundant we do not apply it for all circumstances. We always carefully weight if it is really necessary from requirements point of view to do it this way or even if external conditions allows us to apply it. Cost of development, management and running redundant system is also important to consider.

I have also met with similar case as described by Dan Pritchett. We have been asked to improve startup service time as MTBF was quite short and started to interfere with business operations hours. Main problem here was this service was heavily dependent on external resource (i.e. mores pacifically on the electronic exchange). If service lost connection with external resource – due service internal and/or network problems – then next reconnect forced service to do lengthy data resynchronization. Solution was to address both – make resync. as short as possible (but we were limited in this) and extend service to be redundant. Once was service redundant then we run two instances in cloud from different geo datacenter locations. If nothing else the redundancy solution bought us time to properly shorten startup time and extend time between service restarts. Of course we have heavily tested integration with special attention to time required for service restart.

Finally how then approach this problem? I think from experience I have gained there is rational way as follows:

1.       All services regardless of requirement on time span between service restarts:

a.       Definitely test service under long running test conditions to see how software copes with it. It is always better to know what will happen in case such conditions will become reality in the production environment.

b.      Run long running test cases early on during service development to catch this type of error. It will make code fix much cheaper due smaller code base as well as easier to spot.

2.       If long running service is not your top priority then:

a.       Assign identified problems medium priority to ensure there is at least some chance to fix it in long run.

b.      Sort those issues according source code complexity and group them according interrelations between them .

c.       Try to fix group of long running issues with lowest complexity beside regular bug fixes or maintenance extensions. This makes smallest cost increase and gives you at least some forward development on it.

d.      Often test impact of code changes on long running service result and based on it change list if issues and complexity estimates.

3.       Services where long running time is one of the key criteria:

a.       Carefully design service and if you can go for stateless service. This makes your code simplest and cleanest solution.

b.      If your service is state full and you have platform which supports easy redundancy development then go for it.

c.        If you can then run service in redundant setup (this can be applied to stateless service as well). This way you greatly decrees critical impact of shorter time between restarts on overall service functionality from client side perspective.

d.      If you can’t use stateless or redundant approach then make big effort to simplify code and keep it that way. This way you will decrease on analyze and code fixing.

e.      Make sure iteration between code change, integration testing, result analyze and rescheduling of new development is “well oiled” to get fastest possible roundtrip. This lowers you dev. cost and keeps you on deadline target.

-Libor

External Data Integration in SOA

Monday, August 20, 2007 by libor

Designing and developing any kind of application almost certainly mean need to deal with external data. Proper exchange data implementation therefore shall consider following main points:

1.       Direction of data exchange (i.e. application is either data provider or consumer).

2.       Used data format (i.e. standardized data format or custom based).

3.       Business logic on data exchange including boundary condition and error handling.

4.       Link treatment (like used connection type, possible data volume restrictions, etc.).

5.       Access and security rights.

6.       Data processing management and monitoring.

While point 1 to 5 might be predefined by external party point 6 is almost always defined by application where you are adding data exchange.

Overview of Data Exchange Types

If you need to add data exchange into your application is always good to opt for standardized data protocol and implementation. With standard you get most, if not all, previously mentioned points right with big plus of greatly increased chance interoperation with similarly capable applications around. In the case of electronic derivatives trading is such protocol FIX.

While standardized data exchange has clear advantages reality is (at least experience from industry I’m working in) the big portion (in some cased only the option available) of data exchange is based on proprietary protocol going over file or at the best set of database tables. Legacy of counterparty systems is more than strongly presented here.

With proprietary protocol (or in some cases protocols based on by third party systems which own majority share on the market) you get usually into cold water of poorly documented data types and proprietary behavior unique to each company.

Typical example of such *semi-standard* protocol in my industry is ClearVision. You will get file or database record definitions mostly with string based types. Because of string fields can accommodate almost anything they are heavily customized for each company and process company is using.

SOA Implementation Options

In the SOA based system you shall tread every data source as service and therefore external data exchange shall naturally fall into this categorization as well. Moreover you might want to have some control and monitoring capabilities over data exchange and therefore define additional rules/constrains on externally connected system. And last but not least requirement when importing from external sources might be implement additional business logic which ensures data are imported consistently and with proper transformation.

Because of all above stated requirements is clear there are two possible options here:

1)      You will make service which comply internally with application data format, monitoring and management constrains while providing externally desired protocol implementation and behavior.

2)      Let external system make service which connects to your application and is compliant with all requirements, data types and processing steps as “native” service.

In either case you will end up with new service representation here and not with directly connected external system to your application.

When you want to add standardized protocol implementation it makes only sense to implement it in-house as separate service in your application. Monitoring and management of data exchange becomes integral part of your application.

With proprietary protocols it depends if external party “cares” about your applications. I personally have met only with one case where external party was willing (actually forced) to write service into our application. In all other cases (i.e. 99%) was our job to implement connection and therefore prepare service which represent such connection link.

Even if you need to write all such services yourself it is not as bad as it looks from first point. Fortunately some of the data exchange characteristics can be generalized due is simple nature (configuration, file/DB monitoring, data from/to translation, etc.). Therefore you will usually end up with one or at most few service type instances which can be specialized via plug-in with proprietary protocol implementations.

It would be interesting to know if anyone else has different approach.

-Libor

Enterprise Application Reality

Friday, August 17, 2007 by libor

 

  • Harry Pierson thinks distinguishing application type to *enterprise* and *supporting* is not anymore needed once you write connected distributed application. I think he is quite wrong in this case. Main difference is that enterprise application provide critical functionality for entire company and therefore must comply with much higher standard on quality, availability and robustness then common of the shelf (i.e. “supporting”) type applications do.  Therefore it does not matter if application is based on messaging infrastructure, lives fully in the RDBMS (like old fashioned ERP apps.) or is tightly coupled with the mainframe and cobol.
  •  LINQ capability to enable work with SQL like resources as *native* part of .NET code is really great extension. Even idea is quite good I just feel from all presented examples so fare stored procedure (SP) is second grade citizen in the LINQ. This is quite unfortunate. I fully understand excitements of developer to embed SQL code directly into application. Embedded SQL code on *common standard*application is quite reasonable and probably only productive solution but enterprise applications is wholly different story. Addressing system upgrade and bug fixes (sometimes emergency once) on distributed enterprise system usually means minimize impact on almost continually running application. If I will find a bug in the DB code then changing SP and apply patch in the RDBMS server is much easier and more importantly safer step than change .NET source code, compile it and replace binary which must be potentially deployed on many places while is stopped effected service.