What are the essentials of distributed real-time systems?
A distributed real-time system composes two challenging sets of properties, each imposed by the problem domain, the solution domain, or both.
A distributed system links a number of independent computing entities, each with local properties, by way of a communication mechanism. As a consequence, algorithms and other design components must take the synchrony and failure models into consideration. A handy (if not entirely objective) summary of distributed computing concerns is Deutsch’s Eight Fallacies of Distributed Computing. (See this useful exposition.) All of them are worth considering in (real-time) distributed design; each false assumption is a departure point for essential design and implementation concerns:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
A real-time system is a system in which the timeliness of operation completion is part of the functional requirements and of the correctness measure of the system. (I opened an SO question here to try to clarify this.) In reality, nearly all systems might be considered “soft” real-time, in that there are usually unspoken requirements or expectations for the timeliness of operations. We reserve the term real-time, sometimes qualified by soft or hard, for systems which are incorrect when time constraints are not met. Note that many of the concerns summarized in the fallacies above intersect with timeliness. (See also the real-time tag wiki.)
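To make "timeliness is part of correctness" concrete, here is a minimal sketch in Python. The names `TimeConstraint`, `Outcome`, and `judge` are hypothetical, invented for illustration, and the three-way classification is only one possible reading of the soft/hard distinction described above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    DEGRADED = "degraded"  # soft constraint missed: result still has some value
    FAILED = "failed"      # wrong value, or hard constraint missed


@dataclass(frozen=True)
class TimeConstraint:
    deadline_s: float  # completion deadline, in seconds from activity start
    hard: bool         # True: a miss is a failure; False: a miss only degrades


def judge(value_ok: bool, elapsed_s: float, constraint: TimeConstraint) -> Outcome:
    """Classify a completed operation by both its value and its timeliness."""
    if not value_ok:
        return Outcome.FAILED
    if elapsed_s <= constraint.deadline_s:
        return Outcome.CORRECT
    # The value is right but late: whether that is a failure depends on the constraint.
    return Outcome.FAILED if constraint.hard else Outcome.DEGRADED
```

With a 5 ms hard constraint, a right answer delivered after 6 ms is classified the same way as a wrong answer; with a soft constraint it is merely degraded.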
It is useful to note that real-time (RT) and distributed real-time (DRT) systems exist on a continuum of requirements, with “deterministic” (or conventionally, hard real-time) at one extreme. However, plenty of systems have very important time constraints which are nevertheless non-deterministic. Especially in the context of DRT systems, it is also useful to separate the concept of activity urgency from activity priority. In large systems where latency and failure are real and non-trivial factors, computing and communication resources must be managed explicitly to achieve timeliness and other design requirements, and keeping these two dimensions separate matters more as a result.
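One way to picture the urgency/priority separation is the sketch below. It assumes that urgency means closeness of the deadline and that priority means relative importance; the text above does not define either term precisely, so treat both readings as assumptions. Activities are ordered by urgency, and importance is consulted only when the set is overloaded and something has to be shed. The `Activity` and `plan` names are hypothetical, and the shedding rule is deliberately naive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    deadline_s: float  # urgency: when the result is needed
    importance: int    # priority: higher means more valuable to the system
    cost_s: float      # estimated execution time


def _feasible(ordered, now_s):
    """True if running the activities in this order meets every deadline."""
    t = now_s
    for a in ordered:
        t += a.cost_s
        if t > a.deadline_s:
            return False
    return True


def plan(activities, now_s=0.0):
    """Return (schedule, shed): order by urgency, shed by importance."""
    admitted = sorted(activities, key=lambda a: a.deadline_s)
    shed = []
    while admitted and not _feasible(admitted, now_s):
        # Overloaded: drop the least important activity, not the least urgent one.
        victim = min(admitted, key=lambda a: a.importance)
        admitted.remove(victim)
        shed.append(victim)
    return admitted, shed
```

Note that the most urgent activity can be the one that is shed: urgency decides the order of execution, while importance decides who survives overload.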
Composing Distributed with Real-Time
- Explicit timeliness requirements — What are the requirements, how are they mapped to activities, are they true trans-node timeliness requirements, how will the time constraints be represented explicitly in the design and implementations, and how will failures be detected, reported, and recovered from?
- Time synchronization — What are the requirements for, and the mechanisms for achieving, clock synchrony? See the wiki on clock synchronization. Many applications require only NTP; more stringent requirements may necessitate special hardware (e.g., IRIG-B) or other approaches.
- Synchrony requirements — What synchrony assumptions constrain the design, and what are the requirements for system synchrony? This is connected to clock synchrony, but is not identical to it. See some thoughts on formal models from Doug Jensen, Wikipedia on asynchronous and synchronous systems, and the SO question on partial event ordering.
- Design patterns — What are the moving parts, and how do they relate over the transport? (In particular, how do these relationships affect timeliness?)
- Middleware — How are you going to encode the distributed aspects of the system? Examples include Real-Time CORBA (here’s a good page from OIS) or DDS.
- Time Constraints — How are you going to document, measure, and enforce time constraints in the system?
- Partial Failure — A real-time system typically has reliability requirements. One of the unique aspects of distributed systems is the potential for whole classes of failures called “partial” failures, due either to true crash/comms failures or to timeliness errors that must be treated as failures. See the SO question on failover approaches.
- RTOS — What real-time operating system(s) will be employed?
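As a concrete illustration of the time synchronization item above, the following sketch computes the clock offset and round-trip delay from the four timestamps of a single request/response exchange, which is the calculation NTP uses. The function name `estimate_offset` is hypothetical, and the estimate assumes the network delay is the same in both directions; when it is not, the offset error is bounded by half the round-trip delay.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate (offset, delay) from one request/response exchange.

    t0: request sent, read on the client clock
    t1: request received, read on the server clock
    t2: response sent, read on the server clock
    t3: response received, read on the client clock

    offset is how far the server clock is ahead of the client clock.
    delay is the round-trip network time, excluding server processing.
    """
    delay = (t3 - t0) - (t2 - t1)
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    return offset, delay
```

For example, with a server clock 5 s ahead, 10 ms of one-way delay each way, and 2 ms of server processing, the timestamps (100.000, 105.010, 105.012, 100.022) yield an offset of about 5 s and a delay of about 20 ms.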
A few references
For a fairly traditional presentation of DRT systems, take a look at Kopetz’ book. For a more dynamic view, Jensen’s work and his website are recommended. In the Java realm, I suggest reading the excellent “Introduction to Reliable Distributed Programming”. It doesn’t address the full realm of timeliness issues, but does address partial failure in a particularly clear way.
Recently, the concept of (unreliable) failure detectors has emerged as a valuable synchrony construct, enabling both theoretical reasoning and practical formulation, design, and construction techniques for DRT systems. The seminal paper on their use in real-time systems is On the impact of fast failure detectors on real-time fault-tolerant systems, by Aguilera, Le Lann, and Toueg. This paper is heavy sledding, but rewards every ounce of intellectual investment.
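To give a feel for what "unreliable" means here, the sketch below shows a generic heartbeat-and-timeout detector. It is not the algorithm from the paper cited above, and the class name, parameters, and back-off rule are all assumptions made for illustration. The detector may wrongly suspect a slow node; when a heartbeat later arrives from a suspected node, it retracts the suspicion and lengthens its timeout so that the same mistake becomes less likely.

```python
class HeartbeatDetector:
    """Timeout-based failure detector that may suspect nodes wrongly."""

    def __init__(self, initial_timeout_s=1.0, backoff_s=0.5):
        self.timeout_s = initial_timeout_s  # shared by all nodes, for simplicity
        self.backoff_s = backoff_s
        self.last_heartbeat_s = {}
        self.suspected = set()

    def heartbeat(self, node, now_s):
        """Record a heartbeat received from node at local time now_s."""
        self.last_heartbeat_s[node] = now_s
        if node in self.suspected:
            # The earlier suspicion was a mistake: retract it and be more patient.
            self.suspected.discard(node)
            self.timeout_s += self.backoff_s

    def check(self, now_s):
        """Return the set of nodes currently suspected of having failed."""
        for node, last in self.last_heartbeat_s.items():
            if now_s - last > self.timeout_s:
                self.suspected.add(node)
        return set(self.suspected)
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test. In a real-time setting the interesting design question is the trade-off this exposes: a short timeout detects crashes quickly but turns timeliness errors into false suspicions more often.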