Thursday, July 14, 2011

A Leaderboard Storage implementation using Amazon SimpleDB

Summary


This post is an outline of a component design which uses SimpleDB as a backing store for a leaderboard system.

It is still a work-in-progress, but has already had various stress test operations performed on it.

This document was written in order to provide a little insight into one way of tackling this problem domain, as well as allowing myself to clarify the design which I had described a few times already internally, but never really written a design summary for.

The implementation of the service itself is in Java, using aws-java-sdk to communicate with SimpleDB via a wrapper library written for the project which helps provide monitoring information (via JMX) to keep tabs on a live basis on SimpleDB usage and issues, as well as to handle some of the dynamic request ramping/throttling we previously found we needed when making use of AWS services.


N.B. YMMV - This is for certain not the only way to do it, and does not mean I would endorse this as the only way to do it, or indeed a necessarily optimal way to do your leaderboards



TL;DR
Multiple leaderboards per table. Cron task update rank prediction curve per leaderboard sort column. Estimate rank from current score when retrieving results.

Domain Structure

An instance at the application level of SimpleDBLeaderboardService maps to a single SimpleDB domain. This is used for one or more individual leaderboards.

The SimpleDB limitations of domain sizes are a fundamental limiting factor on the size of the data within a SimpleDBLeaderboardService instance.

Dependant on the usage pattern there could be a small number of unique leaderboards, each with a large number of individual entries, or a larger number of leaderboards with fewer entries.

If there is a need for many leaderboards and many rows, then either the implementation would have to be expanded to allow finer grained splitting out of leaderboards, with the simplest amendment being to have one domain per leaderboard (though this would be at a cost of eating up domains rather rapidly).

As you can see from the table below, there are relatively few attributes inherent in the system, with most being those relating to the runtime determined and leaderboard specific sorted and unsorted columns.

Attribute Description Example
itemName() SimpleDB item UID - in this case a leaderboardId + ownerId Week-21_1070385706236341127
leaderboardId Unique identifier to the leaderboard within the service domain Week-21
timestamp Timestamp of when the score for this row was last updated 2011-05-23T19:30:52.709
owner Entity which posted the score (e.g. a user or clan UID) 1070385706236341127
<sorted attribute> Value of a named attribute which is a valid column for rankingss (e.g. xp, accuracy, rescues, damage, kills). Padded, fixed accuracy, and offset for natural sort order of string value to apply 0000000600
<unsorted attribute> Value of a named attribute which is not a column used for ranking (e.g. shots, damage). Not padded/offset as natural sort orderings need not apply 1600000


Posting Scores
When a client posts game results, this results in some internal processing outside of the scope of this description, but which ultimately can result in some leaderboard columns being updated.

The part relevant to the implementation of the SimpleDBLeaderboardService is the part where the application :-
  • Retrieves the current row data for the specified LeaderboardId and owner UID.
  • Accumulates data for columns which are accumulators (e.g. shots_fired)
  • Conditionally updates other columns which are either 'lowest' or 'highest' value if the score post contains a better score than the current value for that column (e.g. time_to_complete in posted score = 60, stored = 40).
  • Posts any updated sorted/unsorted attribute values
Posting always has to be done by a read for existing Leaderboard data for that user followed by a possible write if there is anything to update, as the services makes use of SimpleDB conditional writes to ensure transactionality on the update (the timestamp of the read in data is used as a condition check if there was a previous entry, and the absence of any timestamp attribute if there was not). If this fails on the write, the process has to be repeated from the read onwards.


Retrieving Scores
Retrieval of score data from the leaderboard consists of 4 main elements, and can be pulled using several different view types available from the SimpleDBLeaderboardService.

The elements are :-
  • Systemic data such as LeaderboardId, Owner Id and timestamp.
  • Unsorted stats column data.
  • Sorted stats column data.
  • Ranking information for a specified sorted column.
The SimpleDBLeaderboardService implementation are as follows for the different views :-

1. Positions for owner
Returns just the ranks for one or more sort columns for a single specified owner on a particular leaderboard.

The service does this by retrieving current score data for the specified user, then for each specified sort column gets the rank from the PositionCalculator (see Rank Predictors below) specified for the service instance for that column.

2. Result for owner
Returns just the current leaderboard result (sorted + unsorted column values) for a single specified owner (no ranking information returned).

The service does this by retrieving current score data for the specified user.

3. Results around owner
Returns a pivot of leaderboard results of the people at the ranks around that of the specified owner for a specified sort column.

The service does this by :-
  1. Retrieving the results data for the specified user.
  2. Requesting the rank from the PositionCalculator (see Rank Predictors below) specified for the service instance for that column for the specified user
  3. Executing a query which retrieves (in order of the specified sort column) the a fixed number of rows with scores higher than that of the initial user, 
  4. Executing a query which retrieves a fixed number of rows with scores lower than that of the initial user.
  5. Assembling the results in a single respose list in order (better score rows, user rows, worse score rows) using the rank from the pivot user to fill in the ranking information for the other rows.
4. Results for owners
Returns current leaderboard results for a set of specified owners, optionally including ranks for a specified sort column

The service does this by executing a series of individual requests for results for each owner, and then requesting that owner's rank from the PositionCalculator  (see Rank Predictors below) specified for the service instance for that column for the specified user.

It is likely this request would want/need optimisation at a later stage to either parallelise the requests, or instead make a single query of form 'SELECT * from domainname WHERE itemName() IN ###names###' which may perform better.

5. Results paged
Returns an ordered page of leaderboard results for a specified sort column, along with a cursor for retrieving further results.

The services does this by performing a SELECT query using the specified name in an ORDER BY clause, limiting by the requested page size, and extracting in an application 'opaque' cursor if provided, the start rank for the current page and SimpleDB nextCursor to use in the select. The response includes an updated cursor for further pagination.

As such this request never refers to the PositionCalculator  (see Rank Predictors below), instead relying on the fact that either no cursor was supplied and we are starting at rank #1, or have been provided an encoded rank within the cursor supplied with the request.


Parallelism

Retrieval requests of all forms are pretty much constant time regardless of concurrency of requests coming into the system, as the each make use of SimpleDB largely independantly (YMMV disclaimer here!).

This is mainly down to avoiding coupling in the system between requests acting in parallel, and the fact that connections are not too much of a precious commodity when communicating with SimpleDB - with the current setup each front end server instance communicating with SimpleDB has a pool of 100 http connections it can put into use for SimpleDB communications for the purpose of leaderboards, and this can be increased either directly by increasing the pool, or indirectly by splitting the pools (particularly useful if 1 set of leaderboards starts starving out everyone else!).

There is some more parallelism which could be applied to the retrieval process, but mainly on in the request which gets multiple users data in a single request, which currently performs 'n' queries equal to the number of users in the request - this is mitigated somewhat though, as it does not go through the 'select' mechanism but instead is a straightforward 'getItem' request, which in terms of billed CPU time is much cheaper than a number of  'select's.

Write throughput when posting scores is limited not so much by the amount of per-leaderboard activity, but mainly by the rate at which posts can be made to a leaderboard before concurrent updates collide and cause out-of-order updates to be detected at which point back-offs need to start being done. This is almost certainly going to mainly affect clan rather than user leaderboards, as the use-case which would generate concurrent multiple leaderboard updates for user leaderboards is far more rare than that where score posts from different users which generates clan leaderboard updates will be attempting to update a single clan's leaderboard row at the same time.

Largely though, even though a single leaderboard update in our current setup will usually result in updates to multiple leaderboards (for example overall all-time, overall this week, overall today, per-mission all-time, per-mission today ...), these are done in series rather than in parallel, as part of a single job processing operation, as there is expected to be many other jobs which are being processed separate to this, which are running in parallel 

(and setting up each leaderboard update as its own queued job in that end of the system may drive me bonkers when debugging).

Therefore the main worry is in the retrieval of rank in these, as described below.


Rank Predictors

The initial prototypical implementation to provide rank information in the responses involved making the other requests to form up the dataset for the response, followed by looping round and essentially executing a big count to find the rank

embed

end embed

Clearly, this is meh-fine when looking at low-traffic uses, or if correct ranking is more import than any semblance of performance, but it does at least provide a 'native' fallback.

When testing on larger datasets, it became rapidly obvious this was going to be painfully slow though, particularly when querying for the rank of people who were in the >100,000 rank bracket (where I usually am on most games).

As a result of this, we implemented and tested an alternative way to provide estimated ranking information.

This is based on performing samplings from the full dataset to refresh a prediction curve periodically as follows :-

  1. Determine total number of entries in leaderboard
  2. Divide total entries by required sample count to get 'skip size'
  3. Generate a query statement for 'skip' and for 'retrieve', using the same WHERE clause (ordering on sort column).
  4. execute the retrieve query, with a LIMIT of 1 to retrieve a single row's column value - store this along with the rank (1)
  5. execute the skip query using nextToken from previous query
  6. repeat execution of retrieve query LIMT 1 for single row, storing the value along with rank (1 + skipsize)
  7. repeat skip + retrieve 1 until end of leaderboard
  8. Generate score predictor to determine estimated position based on a provided score value.
This predictor can (and is) be generated periodically by any job-processing backend node, which then shares this data via a clustered caching mechanism (memcached, hazelcast for example), so that client-facing nodes only ever get their rankings from a predictor - never having to generate the predictor on the fly (this isn't fast by any means)

N.B. This still needs a slightl tweak as tests against real stats from a previous title showed the predictor was pretty accurate but only once past the initial top 'N' scores (say 500 of 250,000). As such an update may be made which either increases the number of samples taken at top-end, or instead just reverts to NativePositionCalculator if the estimated rank is near the top end.


Alternative implementations may be added in the future which could include
  • Singleton updated inverted index for rankings - downside could involve temporarily showing ranking for a user based against their old score not their current, unless we can think of a way to be able to still do lookups on this index against scores and get the 'nearest' 2 scores if no exact score match is found and interpolate (or somesuch).
  • Use a different nature of datastore such as Cassandra or MongoDB which does this sort of sorting natively (although semi-ironically Cassandra sorts but really doesn't want to provide the ranking).