Tuesday, March 29, 2011

So long and thanks for all the subs...

Ah well, sadly SNS looks like its latency is too high, so for now, and for the purposes of cache replication ... it's out. Pity, really, but at least it means I'm no longer worrying so much about being able to test pubsub malarkey.

In its place, for now, I've returned to an old friend (of a couple of years' standing) in the form of Hazelcast.

I have to say I do like Hazelcast - whilst it does provide some 'cool stuff' and supports a decent variety of constructs, at its core it just provides a nice, lightweight (no external dependencies) set of clustered collections.
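
(By way of illustration, the basic setup really is minimal - a rough sketch, assuming more or less current Hazelcast APIs (so worth double-checking against whichever version you're on), where joining the cluster and grabbing a clustered map goes something like :-)

   import java.util.concurrent.ConcurrentMap;

   import com.hazelcast.config.Config;
   import com.hazelcast.core.Hazelcast;
   import com.hazelcast.core.HazelcastInstance;

   public class HazelcastHello
   {
     public static void main(String[] args)
     {
       // Starting a node joins (or forms) the cluster - multicast discovery by default
       HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

       // Every node asking for "shared-cache" sees the same clustered map
       ConcurrentMap<String, String> shared = hz.getMap("shared-cache");
       shared.put("hello", "world"); // visible from any node in the cluster
     }
   }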

Having implemented the Hazelcast-backed cache, though, I have been looking into avoiding a few past mistakes with it.

1. ThreadLocals
This may have changed, but in previous performance/scale testing I found that using Hazelcast instances injects some hefty ThreadLocal data onto the calling thread, and in heavily multithreaded scenarios I saw this account for a significant chunk of overall memory.
Still ... the good old straightforward ExecutorService comes to the rescue here - at the cost of some thread-based fancy footwork, I've wrapped the calls on the Hazelcast-backed Cache implementation in Callables, so I end up with calls a bit like this :-


   @Override
   public V remove(final String key)
   {
     // Marshal the Hazelcast call onto the executor's threads, so any
     // ThreadLocal baggage Hazelcast attaches stays off the calling thread
     Future<V> future = executor.submit(new Callable<V>()
     {
       @Override
       public V call() throws Exception
       {
         CacheEntry<V> existing = getMap().remove(key);
         return (existing == null ? null : existing.getValue());
       }
     });
     try
     {
       return future.get();
     }
     catch (Exception ex)
     {
       throw new RuntimeException(ex.getClass() + " thrown in executor", ex);
     }
   }

Whilst, as I say, it's a bit of indirection I'd rather not have, I don't have to look at it too often, as it's encapsulated away. I could take it a step further and perform a number of the cache calls remotely using Hazelcast's executors - particularly conditional puts, as sketched below.
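
(A rough sketch of that idea, from memory of the Hazelcast 1.x executor API - so treat DistributedTask and the static getMap()/getExecutorService() calls as approximations to check against the docs. The point is that the compare-and-set runs on the member owning the key, rather than shipping map entries back and forth :-)

   import java.io.Serializable;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ConcurrentMap;

   import com.hazelcast.core.DistributedTask;
   import com.hazelcast.core.Hazelcast;

   // Serializable, as Hazelcast ships the task over to the member owning the key
   public class ConditionalPut<V> implements Callable<Boolean>, Serializable
   {
     private final String key;
     private final V expected;
     private final V updated;

     public ConditionalPut(String key, V expected, V updated)
     {
       this.key = key;
       this.expected = expected;
       this.updated = updated;
     }

     @Override
     public Boolean call()
     {
       // Executes on the key's owner, so the compare is local to the data
       // (relies on the value type having a sensible equals())
       ConcurrentMap<String, V> map = Hazelcast.getMap("cache");
       return map.replace(key, expected, updated);
     }
   }

with usage along the lines of :-

   // Fire the conditional put at whichever member owns the key
   DistributedTask<Boolean> task =
       new DistributedTask<Boolean>(new ConditionalPut<String>("player:42", "old", "new"), "player:42");
   Hazelcast.getExecutorService().execute(task);
   boolean swapped = task.get(); // FutureTask semantics - may throw Interrupted/ExecutionException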

2. Data coherence
This one I'm currently slightly less sure about, but I've made the changes locally, and I'm trying to see how well I can live with the consequences ...
Last time I used Hazelcast, I put in quite a chunk of effort using distributed locking of Map keys and the like, to ensure different server nodes didn't attempt concurrent pulls of the same data from the DB, only to overwrite each other when pushing it into the cache.
Whilst this was great for minimising the number of hits to the DB (SimpleDB in this case), minimising costs, yadda yadda, it HUGELY increased the amount of roundtrip crap going on between nodes, and was a key cause of slowdown when performing data lookups.


(sidebar - in the previous setup, it turned out a replicated/coherent cache wasn't actually necessary in any case - don't get bogged down in interesting code; keep looking at the usage scenario and build the simplest system that can meet the requirement)

Instead, I've been looking at how people integrate with the likes of memcached, in situations where the framework (*cough* PHP *cough*) doesn't tend to go down the clustering route, and seeing how to trade things off: whilst we can't get the absolute minimum of backing store activity, we can still avoid overwriting cache data in cases where multiple nodes have updated state.

I'll leave that area as a snippet for now, but may put more in later on - ideally this interface style will be amenable to dropping multiple implementations in and out, including memcached (there's a rough sketch of the interface shape after the snippet below).


   @Override
   public void onGameResultsPosted(GameOwner owner, GameId id, Map<String, String> attributes, Date timestamp)
   {
     SimpleGameResult result = new SimpleGameResult(id, owner, attributes, timestamp);
     int attempts = 0;
     while (attempts < CACHE_MAX_RETRIES)
     {
       try
       {
         CacheEntry<List<SimpleGameResult>> entry = this.ownerGameResultsCache.getEntry(owner.getId());
         if (entry == null)
         {
           // Nothing cached for this owner, so nothing to go stale
           return;
         }
         entry.getValue().add(result);
         // Conditional put - only succeeds if no other node has bumped the CAS value since we read it
         this.ownerGameResultsCache.put(owner.getId(), entry.getValue(), entry.getCas(), entry.getCas() + 1);
         return;
       }
       catch (OutOfOrderException ex)
       {
         attempts++;
         log.info(ex.getClass() + " thrown whilst attempting to include posted results in cache", ex);
       }
     }
     log.warn("Too many OutOfOrder exceptions when attempting to update cache");
   }
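
(For what it's worth, the interface being leaned on there comes out roughly as below - the interface name is just my placeholder, and the CAS version being a long is an assumption, but the shape mirrors memcached's gets/cas pairing :-)

   public interface CasCache<V>
   {
     // Fetch the value together with the CAS (compare-and-swap) version it was
     // read at, or null if nothing is cached under the key
     CacheEntry<V> getEntry(String key);

     // Store unconditionally
     void put(String key, V value);

     // Store only if the entry's version still matches expectedCas, bumping it
     // to newCas - throws OutOfOrderException if another node got there first
     void put(String key, V value, long expectedCas, long newCas) throws OutOfOrderException;
   }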


Any feedback, I'd be interested to hear, even if it's to tell me how badly this will go booom! in my face.

Friday, March 25, 2011

First Past The Post

Just by way of a change, I sit here right now in Cambridge, pulling my hair out, which is handy as at least I have hair to pull out, unlike my brother.

Now, as a server programmer, this is no real new experience, as angry tension seems a natural state when working with distributed systems, but today I have a particularly fun problem.

For a little background: I'm looking into getting some decent horizontal scalability built into the server architecture for a game I'm working on, by making use of Amazon's Simple Notification Service for pubsub-type behaviour, in order to allow most requests to be served from in-memory caches, whilst at the same time avoiding session affinity on my front-facing REST servers (sessions being even uglier than usual with REST).

For example, when a new member is accepted into a clan, I want to be able to push the updated clan member list to all the front facing servers - this is sensible as the likelihood of that new clan member (or other active clan members) going and checking out the clan profile and other members is ... high ... to say the least, so incurring a SimpleDB roundtrip hit would be sub-optimal.

Therefore I'm looking at pushing cache invalidations - or, where possible, updates - out to whatever other nodes are alive via AWS SNS.

This is a reasonably sensibly put-together service, and it works, but it has the annoying (for dev) characteristic that, as a push service with somewhat of an affinity for http-based notifications, it needs the subscribing server to be reachable from outside to do its thing. Fine in production, trickier for a functional test running locally.

Now, I also sometimes want some durable notification shtuff going on as well - not right now, and not for caching, but later on. I'd been hoping to leave this area for later, but it now seems I need to dig in, as I can 'transform' the system from push to pull by using a 'corner' feature of SNS: the ability to subscribe an SQS queue to a topic, which can then be polled...

... Great! I can do functional tests; I have confidence in everything but the servlet handler for the direct http SNS pushes, and I've also got the durable SNS+SQS subscriptions implemented as well! ...


Not so much, as whilst things were working yesterday, it looks like today I can publish as many messages as I like to the SNS topic (and I can even create an email-based subscription to verify the messages are going in), but do I see the messages appear on the SQS queue that is subscribed to the topic? Do I heck as like.
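
(For reference, the topic-to-queue wiring comes out roughly as below with the AWS SDK for Java - ARNs, keys and names are placeholders. One classic gotcha worth checking whenever messages vanish like this: the queue's access policy has to grant the topic permission to SendMessage to it, otherwise publishes silently never arrive :-)

   import com.amazonaws.auth.BasicAWSCredentials;
   import com.amazonaws.services.sns.AmazonSNSClient;
   import com.amazonaws.services.sns.model.PublishRequest;
   import com.amazonaws.services.sns.model.SubscribeRequest;
   import com.amazonaws.services.sqs.AmazonSQSClient;
   import com.amazonaws.services.sqs.model.Message;
   import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

   public class SnsToSqsSketch
   {
     public static void main(String[] args)
     {
       BasicAWSCredentials creds = new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");
       AmazonSNSClient sns = new AmazonSNSClient(creds);
       AmazonSQSClient sqs = new AmazonSQSClient(creds);

       String topicArn = "arn:aws:sns:us-east-1:123456789012:cache-events";
       String queueArn = "arn:aws:sqs:us-east-1:123456789012:cache-events-queue";
       String queueUrl = "https://queue.amazonaws.com/123456789012/cache-events-queue";

       // Subscribe the queue to the topic - remember the queue also needs a
       // policy allowing this topic to SendMessage to it
       sns.subscribe(new SubscribeRequest(topicArn, "sqs", queueArn));

       // Publish to the topic...
       sns.publish(new PublishRequest(topicArn, "clan member list invalidated"));

       // ...then poll the queue. NB the body is SNS's JSON envelope, with the
       // actual payload sitting in its "Message" field
       for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages())
       {
         System.out.println(m.getBody());
       }
     }
   }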

So there, grrr

As a first, test, post on this blog, that does a pretty good job of summing me up - angry programmer.

:D