The Game of Distributed Systems Programming. Which Level Are You?

转载 rslootma; 陈舸 [译]   2013-10-19 10:13  

Introduction

When programming distributed systems becomes part of your life, you go through a learning curve. This article tries to describe my current level of understanding of the field, and hopefully points out enough mistakes for you to be able follow the most optimal path to enlightenment: learning from the mistakes of others.

当分布式系统编程成为你生活中的一部分时,你需要经历一段学习曲线。这篇文章描述了一下我当前在这个领域大致属于哪个层次,并希望能为你指出足够多的错误,从别人的错误中学习,从而使你能以最优的路径通向成功。

For the record: I entered Level 1 in 1995, and I’m currently Level 3. Where do you see yourself?

先声明一下,我在1995年时达到第1级,我现在处于第3级。你自己属于哪一级呢?

Level 0: Clueless

Every programmer starts here. I will not comment too much here as there isn’t a lot to say. Instead, I quote some conversations I had, and offer some words of advice to developers that never battled distributed systems.

每个程序员都从这一级开始。我不会在此浪费太多口舌,因为这实在没什么太多可说的。相反,我会引用一些我曾经经历过的对话,为从未接触过分布式系统的开发者们提供一些建议。

NN1:”replication in distributed systems is easy, you just let all the machines store the item at the same time“

NN1:“在分布式系统中,复制是个很容易的操作,你只需要让所有的结点同时存储你要复制的东东就行了”。

Another conversation (from the back of my memory):

另一段对话(从我记忆深处挖出来的):

NN: “For our first person shooter, we’re going to write our own networking engine”

NN:“为了我们的第一人称射击游戏,我们得写一个自己的网络处理引擎。”

ME: “Why?”

我:“为什么?”

NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”

NN:“虽然已经有一些优秀的商业引擎了,但获取license的费用非常高昂,我们不想为此买单。”

ME: “Do you have any experience in distributed systems?”

我:“你之前对于分布式系统有什么经验吗?”

NN: “Yes, I’ve written a socket server before.”

NN:“是的,我之前写过一个套接字服务器。”

ME: “How long do you think you will take to write it?”

我:“你觉得你要花多久能完成这个网络引擎?”

NN: “I think 2 weeks. Just to be really safe we planned 4.”

NN:“我想2周吧。保险起见,我计划用4周时间。”

Sometimes it’s better to remain silent.

好吧,有时候还是保持沉默比较好。

Level 1: RPC

RMI is a very powerful technique for building large systems. The fact that the technique can be described, along with a working example, in just a few pages, speaks volumes of Java. RMI is tremendously exciting and it’s simple to use. You can call to any server you can bind to, and you can build networks of distributed objects. RMI opens the door to software systems that were formerly too complex to build.

RMI是一种非常强大的用来构建大型系统的技术。事实上,这个技术用Java来描述的话,结合一些工作的例子可以在短短几页纸内描述清楚。RMI技术非常令人振奋,而且它很容易使用。你可以调用你所能绑定到的任何服务器资源,而且你可以构建出分布式的网络对象。过去人们常常为构建复杂的软件系统犯难,现在RMI打开了这道大门。

—Peter van der Linden, Just Java (4th edition, Sun Microsystems)

Let me start by saying I’m not dissing this book. I remember disctinctly it was fun to read (especially the anecdotes between the chapters), and I used it for the Java lessons I used to give (In a different universe, a long time ago). In general, I think well of it. His attitude towards RMI however, is typical of Level 1 distributed application design. People that reside here share the vision of unified objects. In fact, Waldo et al describe it in detail in their landmark paper “a note on distributed computing” (1994), but I will summarize here:

我先声明,我并不是说这本书很烂。我清楚的记得这本书读起来很有趣(尤其是章节之间插入的轶闻),我曾经学习Java的时候就是用的这本书(太久以前了,简直不像在一个时空里似的)。一般情况下,我觉得作者说的挺好。他对RMI的态度就是典型的分布式系统设计的第1级水平。处于这个等级的人对统一的对象有共同的看法。事实上,Waldo在他们著名的论文“a note on distributed computing”(1994)上曾深入描述过,这里我做下总结:

The advocated strategy to writing distributed applications is a three phase approach. The first phase is to write the application without worrying about where objects are located and how their communication is implemented. The second phase is to tune performance by “concretizing” object locations and communication methods. The final phase is to test with “real bullets” (partitioned networks, machines going down, …).

我所倡导的写分布式应用的策略可分为3个阶段。第1阶段,写这个应用时不用担心对象存储的位置,以及它们之间的通讯如何实现。第2阶段,通过具体化对象的位置以及通讯方法来调整程序性能。第3阶段,真枪实弹的测试(网络隔离、机器宕机等各种情况)。

The idea is that whether a call is local or remote has no impact on the correctness of a program.

这里的思想就是,不管一个调用是本地的还是远程的,对程序的正确性都不会产生任何影响。

The same paper then disects this further and shows the problems with it. It has thus been known for almost 20 years that this concept is wrong. Anyway, if Java RMI achieved one thing, it’s this: Even if you remove transport protocol, naming and binding and serialization from the equation, it still doesn’t work. People old enough to rember the hell called CORBA will also remember it didn’t work, but they have an excuse: they were still battling all kinds of lower level problems. Java RMI took all of these away and made the remaining issues stick out. There are two of them. The first is a mere annoyance:

同样还是这篇论文,随后进一步挖掘了这个主题并展示了其中的问题。这个观点是错误的,而且已经错了快20年。不管如何,如果说Java RMI达成了一个目标,那就是:就算你从等式中拿掉传输协议、命名、绑定以及序列化,它还是不成立。能记得起CORBA的老程序员们同样也会记得它也是不好使的,但他们有一个借口:CORBA还在同各种底层的问题缠斗中。Java RMI将所有这些都抛开了,但使剩下的问题变得更为突出。其中有两点,第一点纯粹就是个麻烦:

Network Transparency isn’t

Let’s take a look at a simple Java RMI example (taken from the same ‘Just Java’)

让我们看看这段简单的Java RMI代码示例(同样取自Just Java一书)

1 public interface WeatherIntf extends javva.rmi.Remote{
2   public String getWeather() throws java.rmi.RemoteException;
3 }

A client that wants to use the weather service needs to do something like this:

想要使用天气服务的客户端需要这样做:

1 try{
2    Remote robj = Naming.lookup("//localhost/WeatherServer");
3    WeatherIntf weatherserver = (WeatherInf) robj;
4    String forecast = weatherserver.getWeather();
5    System.out.println("The weather will be " + forecast);
6 }catch(Exception e){
7    System.out.println(e.getMessage());
8 }

The client code needs to take RemoteExceptions into account.

客户端代码需要将RemoteExceptions考虑在内。

If you want to see what kinds of remote failure you can encounter, take a look at the more than 20 subclasses. Ok, so your code will be a tad less pretty. We can live with that.

如果你想看看你究竟会遇到什么样的异常错误,可以看看那20多个子类的定义。这样你的代码就会变得丑陋,好吧,这个我们就忍了。

Partial Failure

The real problem with RMI is that the call can fail partially. It can fail before the action on the other tier is invoked, or the invocation might succeed but the return value might not make it afterwards, for whatever reason. These failure modes are in fact the very defining property of distributed systems or otherwise stated:

RMI的真正问题在于这些调用可能会出现局部性失败的情况。比如,调用可能会在对其他层的请求操作执行前失败,又或者请求成功了,但之后的返回值又不正确。引起这类局部性失败的原因非常多。其实,这些故障模式正是分布式系统特性的明确定义:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

“分布式系统就是某一台你根本意识不到其存在的计算机,它的故障会造成你的计算机无法正常使用。”

—(Leslie Lamport)

If the method is just the retrieval of a weather forecast, you can simply retry, but if you were trying to increment a counter, retrying can have results ranging from 0 to 2 updates. The solution is supposed to come from idempotent actions, but building those isn’t always possible. Moreover, since you decided on a semantic change of your method call, you basically admit RMI is different from a local invocation. This is an admission of RMI being a fallacy.

如果这个方法只是去检索天气预报,出现问题时你可以简单的进行重试,但如果你想递增一个计数器,重试可能会导致产生0到2次的更新,结果就不确定了。这个解决方案应该来自幂等操作,但构建这样的操作并不总是可行的。此外,因为你决定改变方法调用的语义,那你基本上就承认了RMI与本地调用是不同的。而这也就承认了RMI实际上是个悖论。

In any case the paradigm is a failure as both network transparency and architectural abstraction from distribution just never materialise. It also turns out that some software methodologies are more affected than others. Some variations of scrum tend to prototype. Prototypes concentrate on the happy path and the happy path is not the problem. It basically means you will never escape Level 1. (sorry, this was a low blow. I know)

不论什么情况下,这种范式都是失败的。因为网络的透明度和分布式系统的架构抽象从来就是无法实现的。这也表明了某些软件所采用的方法比其他软件为此所受到的影响更多。Scrum的一些变种方法中倾向于做原型化。原型更集中于“好的方面”(happy path),而好的方面通常都不是问题所在之处。这基本上意味着你将永远停留在第1级的水平。(不好意思,我知道这是个小小的打击)

People who do escape Level 1 understand they need to address the problem with the respect it deserves. They abandon the idea of network transparency, and attack the handling of partial failure strategically.

那些脱离了第一级水平的人懂得对于需要解决的这个问题,我们要有足够的尊重。他们摒弃了网络透明化的思想,从战略性的角度来处理局部性失败的问题。

Level 2: Distributed Algorithms + Asynchronous messaging + Language support

<dwsarcasm>”Just What We Need: Another RPC Package” </sarcasm>

—(Steve Vinoski)

Ok, you’ve learned the fallacies of distributed computing. You decided to bite the bullet, and model the message passing explicitly to get a control of failure.

OK,你已经学习了分布式计算中的悖论是什么。你决定吞下这颗子弹,然后对消息传递机制建模,以此显式地控制出现失败的情况。

You split your application into 2 layers, the bottom being responsible for networking and message transport, while the upper layer deals with the arrival of messages, and what needs to be done when they do.

你将应用分为两个层次,底层负责网络和消息传递,而上层处理消息的到达,以及需要处理的各种请求。

The upper layer implements a distributed state machine, and if you ask the designers what it does, they will tell you something like : “It’s a multi-paxos implementation on top of TCP”.

这个上层实现了一种分布式状态机,如果你去问设计者这个状态机是用来做什么的,他们可能会这样回答你:这是建立在TCP之上的一个Multi-Paxos算法实现。

Development-wise, the strategy boils down to this: Programmers first develop the application centrally using threads to simulate the different processes. Each thread runs a part of the distributed state machine, and basically is responsible for running a message handling loop. Once the application is locally complete and correct, the threads are taken away to become real processes on remote computers. At this stage, in the absence of network problems, the distributed application is already working correctly. In a second phase fault tolerance can be straighforwardly achieved by configuring each of the distributed entities to react correctly to failures (I liberally quoted from “A Fault Tolerant Abstraction for Transparent Distributed Programming”).

明智的开发,这里用到的策略可以归结为:程序员首先在本地主要采用线程来模拟不同的进程来开发这个应用。每个线程运行分布式状态机的一个部分,基本上就是负责运行一段消息处理的循环。一旦这个应用是本地完整的且运行正确,就可以在远端的计算机上用真正的进程来取代线程。到这个阶段,除去网络中可能出现的问题外,这个分布式应用已经可以正常工作了。到容错阶段时,可以通过配置每个分布式实体来正确反映故障的方式来达成,这种方式很直接。(我引述自“A Fault Tolerant Abstraction for Transparent Distributed Programming”)

Partial failure is handled by design, because of the distributed state machine. With regards to threads, there are a lot of options, but you prefer coroutines (they are called fibers, Light weight threads, microthreads, protothreads or just theads in various programming languages, causing a Babylonic confusion) as they allow for fine grained concurrency control.

因为分布式状态机的存在,局部性故障可以通过设计来解决。对于线程,其实也有很多种选择,但协程(coroutines)更适合(在各种不同的编程语言中,协程也被称为纤程fiber,轻量级线程,微线程或者就叫线程),因为协程允许我们对并发行为有更细粒度的控制。

Combined with the insight that “C ain’t gonna make my network any faster”, you move to programming languages that support this kind of fine grained concurrency. Popular choices are (in arbitrary order)

结合“C代码并不会使网络变得更快”的论点,你可以转移到在语言级支持这种细粒度并发控制的编程语言中去。流行的选择如下(排名不分先后)注意,这些编程语言往往都是函数式的:

(Note how they tend to be functional in nature)

As an example, let’s see what such code looks like in Erlang (taken from Erlang concurrent programming)

举个例子,下面让我们看看在Erlang中这种并发控制的代码看起来是怎样的(取自Erlang concurrent programming)

 1 -module(tut15).
 2 
 3 -export([start/0, ping/2, pong/0]).
 4 
 5 ping(0, Pong_PID) ->
 6     Pong_PID ! finished,
 7     io:format("ping finished~n", []);
 8 
 9 ping(N, Pong_PID) ->
10     Pong_PID ! {ping, self()},
11     receive
12         pong ->
13             io:format("Ping received pong~n", [])
14     end,
15     ping(N - 1, Pong_PID).
16 
17 pong() ->
18     receive
19         finished ->
20             io:format("Pong finished~n", []);
21         {ping, Ping_PID} ->
22             io:format("Pong received ping~n", []),
23             Ping_PID ! pong,
24             pong()
25     end.
26 
27 start() ->
28     Pong_PID = spawn(tut15, pong, []),
29     spawn(tut15, ping, [3, Pong_PID]).

This definitely looks like a major improvement over plain old RPC. You can start reasoning over what would happen if a message doesn’t arrive.

这看起来绝对是对旧有的RPC机制的一个重大提升。现在你可以推想一下,如果有消息没有到达时会发生什么事情了。

Erlang gets bonus points for having Timeout messages and a builtin after Timeout construct that lets you model and react to timeouts in an elegant manner.

Erlang还有附加的超时消息以及一个语言内建的“超时”组件,可以使你以一种优雅的方式来处理超时。

So, you picked your strategy, your distributed algorithm, your programming language and start the work. You’re confident you will slay this monster once and for all, as you ain’t no Level 1 wuss anymore.

现在,你选择了你要采用的策略,选择了恰当的分布式算法以及合适的编程语言,然后就可以开干了。你很自信能驾驭分布式编程这头野兽了,因为你再也不是第一级的水平了。

Alas, somewhere down the road, some time after your first releases, you enter troubled waters. People tell you your distributed application has issues. The reports are all variations on a theme. They start with a frequency indicator like “sometimes” or “once”, and then describe a situation where the system is stuck in an undesirable state. If you’re lucky, you had adequate logging in place and start inspecting the logs. A little later, you discover an unfortunate sequence of events that produced the reported situation. Indeed, it was a new case. You never took this into consideration, and it never appeared during the extensive testing and simulation you did. So you change the code to take this case into account too.

哎呀,可惜的是这一路上并非风平浪静。过了一段时间,当第一个版本发布后,你将陷入泥潭之中。人们会告诉你,你的分布式应用有些问题。问题报告中的主题全都是和变化有关的。开始时会出现“有时”或者“一次”这样的表示频率的词,之后的描述变成了:系统处于不期望的状态,卡住不动了。如果够幸运,你有足够的log信息,可以开始着手检查这些日志。稍后,你发现是一系列不幸的事件序列造成了报告中所描述的情况。确实,这是个新的问题。你从来没有考虑过这些,而且在你做大量的测试和模拟时问题从未出现过。所以,你修改代码以将这种情况也纳入考虑范围。

Since you try to think ahead, you decide to build a monkey that pseudo randomly lets your distributed system do silly things. The monkey rattles its cage and quickly you discover a multitude of scenarios that all lead to undesirable situations like being stuck (never reaching consensus) or even worse: reaching an inconsistent state that should never occur.

因为你试着要超前考虑,你决定构建一个“猴子”组件,它以伪随机的方式让你的分布式系统做些愚蠢的事情。“猴子”在笼子里使劲扑腾着,很快你会发现在很多场景下都会导致出现不期望的情况,比如系统卡住了,或者甚至更糟糕的情况:系统出现不一致的状态,而这在分布式系统中是永远也不应该发生的事情。

Having a monkey was a great idea, and it certainly reduces the chance of encountering something you’ve never seen before in the field. Since you believe that a bugfix goes hand in hand with a testcase that first produced the bug, and now proves its demise, you set out to build just that test. Your problem however is reproducing the failure scenario is difficult, if not impossible. You listen to the gods as they hinted when in doubt, use brute force. So you produce a tests that runs a zillion times to compensate the small probability of the failure. This makes your bug fixing process slow and your test suites bulky. You compensate again by doing divide and conquer on your volume of testsets. Anyway, after a heavy investment of effort and time, you somehow manage to get a rather stable system and ditto process.

构建一个“猴子”是很棒的主意,而且它确实能减少遇到那些你从未在这个领域内碰到过的怪事的几率。因为你相信,修改一个bug必须和发现这个bug的测试用例联系起来,现在需要回归测试这个用例,以证明bug的消除。你现在只需要再构建一次这个测试用例就可以了。可是现在的问题在于,如果说并非不可能的话,要重现这个错误的场景起码是很困难的。你向上帝祈祷,得到的启示是:当心存疑虑时,就使用暴力法吧。因此,你构建一个测试用例,然后让它跑上无数次,以此来弥补这极小的失败概率。这会使你解决bug的过程变得缓慢,而且你的测试套件会变得笨重。通过对你的测试集做分而治之的处理,你不得不再次做一些补偿。无论如何,经过在时间和精力上的大量投入之后,你终于设法得到了一个较为稳定的系统。

You’re maxed out on Level 2. Without new insights, you’ll be stuck here forever.

你在第2级已经到顶了,如果没有新的启示,你将永远卡在这一级。

Level 3: Distributed Algorithms + Asynchronous messaging + Purity

It takes a while to realise that a combination of long running monkeys to discover evil scenarios and brute force to reproduce them ain’t making it. Using brute force just demonstrates ignorance. One of the key insights you need is that if you could only remove indeterminism from the equation, you would have perfect reproducibility of every scenario. A major side effect of Level 2 distributed programming is that your concurrency model tends to go viral on your codebase. You desired fine grained concurrency control… well you got it. It’s everywhere. So concurrency causes indeterminism and indeterminism causes trouble. So concurrency must go. You can’t abandon it: you need it. You just have to ban it from mingling with your distributed state machine. In other words, your distributed state machine has to become a pure function. No IO, No Concurrency, no nothing. Your state machine signature will look something like this

我们需要花点时间才能意识到:长时间运行“猴子”以此发现系统中的缺陷然后再结合暴力法来重现它们,这种做法并不可取。使用暴力法重现只会显示出你的无知。你需要的关键性的启示之一是,如果你可以只将等式中的不确定性拿掉的话,你就可以完美的对每一种场景做重现了。第2级分布式编程的一个重大的缺点是:你的并发模型往往会成为你代码库中的病毒。你希望有细粒度的并发控制,好吧,你得到了,代码里到处都是。因此是并发导致了不确定性,而不确定性造成了麻烦。因此必须得把并发给踢出去。可是你又不能抛弃并发,你需要它。那么,你一定要禁止把并发和你的分布式状态机结合在一起。换句话说,你的分布式状态机必须成为纯函数式的。没有IO操作,没有并发,什么都没有。你的状态机特征看起来应该是这样的:

module type SM = sig
  type state
  type action
  type msg
  val step: msg -> state -> action * state
end

You pass in a message and a state, and you get an action and a resulting state. An action is basically anything that tries to change the outside world, needs time to do so, and might fail while trying. Typical actions are

你传入一个消息和一个状态,你得到一个操作和一个结果状态。操作基本上就是任何试着改变外部世界的东西,需要一定的时间来完成,尝试的过程中可能会失败。典型的操作有:

  • send a message

    发送一个消息

  • schedule a timeout

    安排一次超时

  • store something in persistent storage

    将数据存储在持久性的存储介质内

The important thing to realise here is that you can only get to a new state via a new message. nothing else. The benefits of such a strict regime are legio. Perfect control, perfect reproducibility and perfect tracibility. The costs are there too. You’re forced to reify all your actions, which basically is an extra level of indirection to reduce your complexity. You also have to model every change of the outside world that needs your attention into a message.

这里要意识到的重要部分是:你只能通过一个新的消息来得到新的状态,再无其他。在这种严格的规定下所得到的好处是很多的。完美的控制,完美的重现能力以及完美的可追踪性。为此而得到的开销也同样存在,你将被迫使所有的操作都变得具体化。而这些基本上就是为了减少程序复杂性而附加的一层间接。你还需要将每一个你关心的外部世界变化都建模为一个消息。

Another change from Level 2 is the change in control flow. At Level 2, a client will try to force an update and set the machinery in motion. Here, the distributed state machine assumes full control, and will only consider a client’s request when it is ready and able to do something useful with it. So these must be detached.

相比第2级的分布式编程,另一个改变在于控制流。在第2级中,客户端会尝试强制更新并动态设置状态。而在这里,分布式状态机假定有完全的控制力,并且只有当它准备就绪,可以做些有用的事情时才会考虑客户端的请求。因此这些必须分离开来。

If you explain this to a Level 2 architect, (s)he will more or less accept this as an alternative. It, however, takes a sufficient amount of pain (let’s call it experience or XP) to realize it’s the only feasible alternative.

如果你把这些道理解释给一个2级的分布式系统架构师听,他可能或多或少的会把这个当成一种替代方案。然而,你需要经历足够多的痛苦之后才会意识到这是唯一可行的选择,我们姑且把这些痛苦称为经验吧。

Level 4: Solid domination of distributed systems: happiness, piece of mind and a good night’s rest

To be honest, as I’m a mere Level 3 myself, I don’t know what’s up here. I am convinced that both functional programming and asynchronous message passing are parts of the puzzle, but it’s not enough.

老实说,我现在只是第3级水平,我也不知道在这一级里有什么新鲜玩意。我深信,函数式编程和异步消息传递是分布式系统谜题的一部分,但这些还不够。

Allow me to reiterate what I’m struggling against. First, I want my distributed algorithm implementation to fully cover all possible cases.

请允许我重申我所反对的东西。首先,我希望我的分布式算法实现能够涵盖到所有的可能情况。

This is a big deal to me as I’ve lost lots of sleep being called in on issues in deployed systems (Most of these turn out to be PEBKAC but some were genuine, and cause frustration). It would be great to know your implementation is robust. Should I try theorem provers, should I do exhaustive testing ? I don’t know.

这对我而言是个大问题,我已经在系统部署的问题上牺牲掉了很多睡眠时间。大部分问题都是PEBKAC类的(Problem Exists Between Keyboard And Chair意指用户引起的错误),但有一些确是真正的问题,这给我造成了一些挫败感。知道自己实现的健壮性程度是很好的。我应该试试证明一下那些定理吗?我应该做更详尽的测试吗?我不知道。

As an aside, for an append only btreeish library called baardskeerder, we know we covered all cases by exhaustively generating insert/delete permutations and asserting their correctness. Here, it’s not that simple, and I’m a bit hesitant to Coqify the codebase.

附带提一下,github上有一个称为baardskeerder的仅用于插入操作的B-树库,我们知道可以通过详尽的生成插入/删除排列并断言它们的正确性之后,我们就可以涵盖到所有的情况。但这里,并没有那么简单,而且我对于要对整个代码库做Coqify处理(Coq是一个正式的证明管理系统,它在一种半交互式的环境下提供了一个正式的语言用来编写数学定义、可执行的算法和定理,用计算机来做检查证明,这里作者生造出了Coqify这个词)还有些犹豫。

Second, for reasons of clarity and simplicity, I decided not to touch other, orthogonal requirements like service discovery, authentication, authorization, privacy and performance.

第二,为了保持清晰和简单,我决定不去碰其它一些正交性的需求。比如,服务发现、认证、授权、私密性以及性能。

With regard to performance, we might be lucky as the asynchronuous message passing at least doesn’t seem to contradict performance considerations.

说到性能,我们也许是幸运的,至少异步消息传递似乎与性能方面并不产生矛盾。

Security however is a real bitch as it crosscuts almost everything else you do. Some people think security is a sauce that you can pour over your application to make it secure.

安全性则完全是一个XX(作者真的爆粗口了…),因为它几乎切断了所有你所做的事情。有些人把安全性看成是一种调味酱汁,你只要把它倒在你的应用程序上就可以保证安全了。

Alas, I never succeeded in this, and currently think it also needs to be addressed strategically during the very first stages of design.

哎,在这方面我从未取得过成功,而且现在我也认为这个问题需要在设计的最初阶段从宏观的角度策略性的去分析解决。

Closing words

Developing robust distributed systems is a difficult problem that is practically unsolved, or at least not solved to my satisfaction.

开发出健壮的分布式系统是个颇为棘手的问题,实际上根本没有完美的解决方案,或者说至少没有让我觉得完全满意的解决方案。

I’m sure its importance will increase significantly as latency between processors and everything else increases too. This results in an ever growing area of application for this type of application development.

我敢肯定分布式系统的重要性将随着处理器和其它一切事物之间的延迟增加而显著提高。这一结果使得这种类型的应用程序开发变得愈发繁荣。

As far as Level 4 goes, maybe I should ask Peter Van Roy. Over the years, I’ve read a lot of his papers, and they offered me a lot of insight in my own mistakes. The downside of insight is that you see others repeating your mistakes and most of the time, I fail to convince people they should do it differently.

至于分布式编程的第4级,也许我该去问问Peter Van Roy。这么些年来,我阅读了很多他写的论文,这些论文对于我自己的一些错误认识给了很多启示。关于这些启示的缺点嘛,你常常在大部分时间里看到别人在重复自己的错误,但我无法说服他们应该换种方式去做。

Probably, this is because I cannot offer the panacea they want. They want RPC and they want it to work. It’s perverse … almost religious

也许,这是因为我无法提供他们想要的那种灵丹妙药。他们就想要RPC,而且他们希望这样能搞定问题。这是固执的…就像宗教信仰一样。

http://blog.incubaid.com/2012/03/28/the-game-of-distributed-systems-programming-which-level-are-you/

http://www.yeolar.com/note/2013/10/19/the-game-of-distributed-systems-programming/

http://www.yeolar.com/note/2013/10/19/the-game-of-distributed-systems-programming/