Author Topic: Revised Networking Paradigm

daniel.santos

Revised Networking Paradigm
« on: 2 October 2009, 19:03:28 »
I'm visiting my gf in Philadelphia.  On the plane, I had several hours on my hands, so I decided to do a re-analysis of the networking paradigm, for several reasons really: to freshen it all in my mind (as I plan to start digging into it now), to see if I could learn anything new, and to see if anything I've learned this year can improve my original paradigm.  The exercise has proved to be quite fruitful!  While I won't be able to determine the best way to tune some of these mechanisms until it's built and I can start testing it, I think that all (or most) of the core concepts needed to make this work optimally are in place.  I'm also considering creating a build option (yet another! :) ) to output statistical information during network play that players can then (at their option) submit for analysis and further tuning.

So here we go!  First off, most of the concepts specified in the chop, chop, chop (network code) thread remain, although some have changed.  None of the OO architecture (class hierarchy, etc.) has changed.

So first, a little recap of some pertinent points from that thread:
  • The architecture is a hybrid of client/server and p2p.  The server will "host" the game and be in control of it, but the clients will also attempt to connect to each other to reduce latency (eliminating the need for the server to relay the actions of one client to the other clients, and the associated delays).
  • All hosts that are connected to each other will ping their counterparts once per second, and the class Game::Net::NetworkInfo will accumulate this data.  In the future, we may set it up so that all communications have an ack, allowing any message to be treated as a ping.  This will let us determine latency as data size increases.  But for now, the NetworkInfo class accumulates data based upon small messages being sent.
  • Clients will send a summary of this data to the server every x interval (probably 8 seconds or so)
  • The server uses the connection statistics from its own pings, as well as those received from clients, to determine the maximum latency of the entire network and generate a commandLeadTime value -- an amount of time in which we can reasonably expect a message to be transmitted and received between every pair of peers (see the sketch after this list).  This will probably be something like 75% of the maximum average latency of all connections in the game (approximately 133% of the average time messages are delivered in -- but that can't be guaranteed because upstream may be slower than downstream).
  • Whenever any participant in the game issues a command to their units, the command is actually scheduled to occur commandLeadTime milliseconds in the future.  The command is transmitted to all other players and queued locally, with the idea that it will execute at nearly the exact same time on all machines.  (Initially, I may use the "world frames" that are currently in use, but I want this changed eventually anyway, for many reasons.)
  • When somebody receives a command early, it will be queued for execution at the appropriate time.  If they receive it late, they will issue the command and fast-forward it by the amount of time it was late, to try to synchronise it as much as possible with the state of the game on everybody else's computer.
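
To make the lead-time idea a bit more concrete, here's a minimal sketch of how the server might derive commandLeadTime from the accumulated ping statistics and how a command could be scheduled against it.  The names here (PeerStats, calcCommandLeadTime, scheduleCommand, etc.) are made up for illustration and aren't the actual GAE classes:

Code:
// Sketch only: PeerStats, calcCommandLeadTime and scheduleCommand are
// hypothetical names, not the real Game::Net classes.
#include <algorithm>
#include <cstdint>
#include <vector>

struct PeerStats {
    float avgLatencyMs;  // average latency accumulated from the once-per-second pings
};

// commandLeadTime: something like 75% of the worst average latency of all
// connections in the game.
uint32_t calcCommandLeadTime(const std::vector<PeerStats> &allConnections) {
    float maxAvg = 0.f;
    for (const PeerStats &p : allConnections) {
        maxAvg = std::max(maxAvg, p.avgLatencyMs);
    }
    return static_cast<uint32_t>(maxAvg * 0.75f);
}

struct Command {
    uint32_t unitId;
    uint64_t executeAtMs;  // the command runs this far in the future on every host
};

// When a player issues a command, stamp it with a future execution time,
// queue it locally and broadcast it to all peers.
Command scheduleCommand(uint32_t unitId, uint64_t nowMs, uint32_t commandLeadTime) {
    return Command{unitId, nowMs + commandLeadTime};
}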

So this already provides us a pretty decent mechanism to keep games in sync.  But when a command does arrive late, we introduce the possibility for two games to become out of sync.  How this is handled is the subject of this revised paradigm.  I previously specified that the client would just request an update for those unit(s) whose commands arrived late.  But that still can't guarantee things will be in sync.  For one, the update will need to be fast-forwarded by half of the average latency of the connection to the server (or by the difference between the frame in which the server generated the update and the frame in which the client received it).  Further, a late command can affect units other than those that were commanded.  For example, an enemy unit may have been ordered to patrol an area.  On one machine, the player ordered his units to move forward and the patrolling enemy saw the units and attacked.  But on the machine where the command arrived late, the patrolling units may have already passed by when the late orders were received, and even fast-forwarding may have skipped right through the window in which the patrolling unit would have spied them.  In short, it is an imperfect solution.

This new paradigm introduces a number of mechanisms and protocols (in the behavioral sense) to overcome these obstacles in the least intrusive fashion.

Client Sync
As mentioned before, the client will already send latency statistics to the server every x interval (I should probably come up with a name for that value).  I will be adding a method, uint64 Game::Unit::hash(), that will generate a hash value based upon pertinent data members of a Game::Unit object.  I might pull in some external library to generate this that is less fool-proof than the Shared::Util::Checksum class (even though that class isn't horrible).  Either way, the method should have a fairly low CPU cost.  In the new paradigm, clients will periodically generate hash values for Units and send them to the server (along with the frame and/or time).  The server can then compare these relatively small values to its own generated versions to determine whether a unit is out of sync or not.  This is more of a "here is my state, let me know if I'm screwed up" type of message.
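
As a rough illustration of what that might look like, here's a sketch of a hash() built from an FNV-1a-style mix.  The member names (hp, ep, posX, etc.) are placeholders for whatever Game::Unit actually stores, not the real fields:

Code:
// Sketch only: the members mixed here are placeholders; the real
// Game::Unit::hash() would cover whatever state must agree across hosts.
#include <cstdint>

class Unit {
public:
    uint64_t hash() const {
        uint64_t h = 14695981039346656037ull;      // FNV-1a offset basis
        auto mix = [&h](uint64_t v) {
            h ^= v;
            h *= 1099511628211ull;                 // FNV-1a prime
        };
        mix(id);
        mix(hp);
        mix(ep);
        mix((static_cast<uint64_t>(posX) << 32) | posY);
        mix(currentCommandType);
        return h;
    }

private:
    uint32_t id, hp, ep, posX, posY, currentCommandType;
};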

To reduce bandwidth for transmitting numbers that are stored in 32 bits but are usually very small (like unit IDs), I'm introducing two new classes, each implementing the Shared::Platform::NetSerializable interface: Shared::Util::Int30 and Shared::Util::UInt30.  This is how they work (the letter "d" indicates a bit that is part of the data):

Network Size | Data Size | Unsigned        | Signed        | Byte 0   | Byte 1   | Byte 2   | Byte 3
1 byte       | 7 bits    | 0-127           | +-63          | 0ddddddd | -        | -        | -
2 bytes      | 14 bits   | 0-16,383        | +-8,191       | 10dddddd | dddddddd | -        | -
4 bytes      | 30 bits   | 0-1,073,741,823 | +-536,870,911 | 11dddddd | dddddddd | dddddddd | dddddddd
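
Here is a minimal sketch of the unsigned encoding, just to illustrate the byte layout above (this isn't the actual Shared::Util::UInt30 implementation, which will go through the NetSerializable interface):

Code:
// Sketch of the UInt30 byte layout described above; not the real
// Shared::Util::UInt30 class.
#include <cstddef>
#include <cstdint>

// Writes v (which must fit in 30 bits) into buf using 1, 2 or 4 bytes and
// returns the number of bytes used.
size_t encodeUInt30(uint32_t v, uint8_t *buf) {
    if (v <= 0x7f) {                              // 7 data bits: 0ddddddd
        buf[0] = static_cast<uint8_t>(v);
        return 1;
    }
    if (v <= 0x3fff) {                            // 14 data bits: 10dddddd dddddddd
        buf[0] = static_cast<uint8_t>(0x80u | (v >> 8));
        buf[1] = static_cast<uint8_t>(v);
        return 2;
    }
    // 30 data bits: 11dddddd dddddddd dddddddd dddddddd
    buf[0] = static_cast<uint8_t>(0xc0u | (v >> 24));
    buf[1] = static_cast<uint8_t>(v >> 16);
    buf[2] = static_cast<uint8_t>(v >> 8);
    buf[3] = static_cast<uint8_t>(v);
    return 4;
}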

Thus, until you have 128 units, such hash values can be transmitted in an id/value pair taking only 9 bytes on the network, after which they will only take 10 bytes.  When there are only 20 units in the game, a max of 10 bytes per unit isn't so bad; that's only 200 bytes.  But when there are 400, now we're talking about 4k, which can make for a fairly large message.  Thus, these "sync" messages can group unit hashes together by specifying a start id, an end id and a 64-bit hash made by XORing each unit's hash() value together.  If we sent them in groups of 10 units each, we're down to roughly one byte per unit, and this still gives the server the ability to verify that everything is in sync (even though the server won't know which units in a group are out of sync and will, thus, have to presume that all of them are).  Again, testing & tuning will determine the best way forward for this.  Additionally, clients don't have to send all of the data at once; it can be divided up so that different groups of units are sent at different times.
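
For instance, a grouped entry might look something like this (again just a sketch; the struct and function names are made up, and the ids would go over the wire as UInt30s):

Code:
// Sketch of the grouped unit-hash idea described above; names are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

struct UnitHashGroup {
    uint32_t startId;   // first unit id in the group (UInt30 on the wire)
    uint32_t endId;     // last unit id in the group, inclusive (UInt30 on the wire)
    uint64_t combined;  // XOR of hash() for every unit in [startId, endId]
};

// unitHashes[i] holds the hash of the unit with id = firstId + i.
std::vector<UnitHashGroup> groupHashes(uint32_t firstId,
                                       const std::vector<uint64_t> &unitHashes,
                                       size_t groupSize = 10) {
    std::vector<UnitHashGroup> groups;
    for (size_t i = 0; i < unitHashes.size(); i += groupSize) {
        size_t last = std::min(unitHashes.size(), i + groupSize) - 1;
        uint64_t combined = 0;
        for (size_t j = i; j <= last; ++j) {
            combined ^= unitHashes[j];
        }
        groups.push_back({firstId + static_cast<uint32_t>(i),
                          firstId + static_cast<uint32_t>(last), combined});
    }
    return groups;
}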

Real-Time Re-Sync
I'm not opposed to continuing to send real-time updates to clients when units are found to be out of sync, but I want to test more to find out how effective this really is (applying & fast-forwarding updates on the client).  By re-sending hash values for these units, the server can determine whether the real-time sync was successful or whether a "Stop-The-World Re-Sync" is needed.

Stop-The-World Re-Sync
When the server determines that the game is out of sync (and, if we do use Real-Time Re-Sync, that it has failed), the server issues a "pause and re-sync" command to be executed a liberal distance in the future (probably the maximum average latency time, which is double the normal expected delivery time, so we can presume that each client will receive this message on time).  When each host reaches the time to pause & re-sync, the game will pause and a "re-syncing" message will appear on the screen.  Each client will generate hash values for the entire world (think units for now, but I'll discuss other stateful game objects later) and send them to the server.  When the server reaches the pause, it will generate updates for all units that it already knows are out of sync & send them out, without waiting for the current hash values from its clients (this is designed to keep the amount of time required down).  When the server receives the hash values from the clients, it will generate any additional updates needed.  As each client receives a batch of updates and applies them, it will send an explicit message to the server to inform it that it's done (this is to overcome any false expectations due to large updates taking a long time to transmit across the network).  When the server has received acknowledgement that all updates have been applied, it will schedule and transmit a resume message to all clients, and the game will resume at that time, presumably back in sync.  Ideally, this type of thing will cause less than one second of pause and should never be needed unless messages are received late.  The need for it under any other condition would indicate a bug, design flaw or something else that was missed in design and/or implementation.
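
In message terms, the exchange might look roughly like this (the names are invented for illustration; the real message classes will live wherever the rest of the Game::Net protocol does):

Code:
// Illustrative only: a rough enumeration of the stop-the-world re-sync exchange.
enum class ReSyncMessage {
    PAUSE_AND_RESYNC,  // server -> all clients: pause at time T (scheduled well ahead)
    WORLD_HASHES,      // client -> server: hashes for the whole world at T
    UNIT_UPDATES,      // server -> client: full state for units known to be out of sync
    UPDATES_APPLIED,   // client -> server: explicit ack that a batch has been applied
    RESUME             // server -> all clients: resume at time T2, once all acks arrive
};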

Other Stateful Objects
I've identified a number of other objects that need to be checked for synchronization.  I'm not going to get too deep into details because I plan on writing up a much more detailed analysis, probably on the wikia page.
  • AttackParticleSystems: As mentioned earlier, the existence or non-existence of these on the various hosts in a game should be synchronized during a stop-the-world re-sync, as it is very important, especially for units with attacks that do a lot of damage & cost a lot of EPs.
  • Map Resources: Trees, gold, stone, etc.  Bandwidth can be saved by only transmitting the differences between the original map values & the current ones (although I don't think we currently keep the original map values in memory).
  • AttackParticleSystems: I'm going to butcher this entire class, possibly before the initial release of the network re-write!  The reason is that it's two conceptual objects that are dysfunctionally meshed into one: an attack object and a particle system (decorative object).  I'll be writing a very long piece about so-called "Incorporeal Objects" (at least, that's what I'm calling them right now).  These are any game objects that do not live in cells.  Additionally, saved games must start saving and restoring AttackParticleSystems, because dropping them can drastically alter the state of a game when restored!  Think of a group of archmages that spend their last mana to all launch their fire attack on a horde of enemies.  Save the game (without these "AttackParticleSystems" saved) and then restore it -- the battle drastically changes!  Similarly, network games can be affected if a unit is updated and, on the server, he has already fired his attack, but on the client receiving the update, the unit has suddenly lost a ton of mana with no AttackParticleSystem to show for the loss of EP.
  • Map Heights (low priority): Map heights change when you build buildings, and these can get out of sync.  They can also change when future skills to change the map are implemented.  I have no plans on implementing this one any time soon :) but it should be tracked as a need.
  • Lua Stateful Variables/Objects: I still haven't even looked very deeply at the Lua code yet!!  But I know that this is a place where inconsistencies can occur.  This is also a lower priority, but will need to be addressed eventually.
  • Map Explored Values (lowest priority): When units get out of sync, this can get out of sync.  This is mostly for note; I'm not certain we'll ever implement its synchronization unless somebody can demonstrate it affecting game play.

Refactoring AttackParticleSystem
The current AttackParticleSystem can be modified slightly to be friendly with both saved games & this improved networking paradigm.  However, a serious overhaul is needed, and I think I'll post that in a new thread since it's so off-topic. :)

Final Notes
In the end, a balance may be found between real-time re-syncs and stop-the-world re-syncs.  Perhaps the client should only send sync messages with unit hashes grouped (to keep the messages small) until an out-of-sync condition is discovered.  When the server discovers this, it can then explicitly request hash codes from the client for each individual unit in the group(s) that were found to be out of sync, so it knows exactly which units are and aren't out of sync.  Then bandwidth isn't wasted by the client sending masses of hash data every 8 seconds for units that are almost always in sync, nor by the server sending updates for large groups of units when only one of those units is out of sync.  Hopefully, when stop-the-world re-syncs are needed, they will take less than a second and not occur frequently.

EDITS: Edited it a few times to fix errors and improve clarity.  I'm done editing it now though. :)
« Last Edit: 2 October 2009, 19:27:29 by daniel.santos »