| | My team develops distributed systems, and I own several components. One of them is Quorum: the configuration that decides whether the system should be considered up or down. More explanation below:
In a distributed system there are several nodes, and we often need to tell whether the system is up or down. In a node majority quorum, more than half of the nodes must be up for the system to be declared up. (There are other quorum types, such as node + disk majority.)
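To make the node majority rule concrete, here is a minimal Python sketch. The function name `has_node_majority` and its parameters are made up for illustration; this is not our actual Quorum code, just the "more than half" rule written out.

```python
def has_node_majority(total_nodes: int, nodes_up: int) -> bool:
    """Return True when strictly more than half of the nodes are up.

    Hypothetical helper for illustration only.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= nodes_up <= total_nodes:
        raise ValueError("nodes_up must be between 0 and total_nodes")
    # Integer comparison avoids floating-point division:
    # nodes_up > total_nodes / 2  is the same as  2 * nodes_up > total_nodes.
    return 2 * nodes_up > total_nodes
```

With 4 nodes, 3 up is a majority but 2 up is not, because exactly half does not count as "more than half".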
I did some kernel debugging last week and saw this issue:
Consider a system S with 4 nodes {N1, N2, N3, N4} using node majority quorum. N1 crashed, then N2 crashed. Less than 1 microsecond after N2 crashed, N3 also crashed; however, at that moment the system only knew that N1 was down. It still counted 3 of the 4 nodes as up, so S was declared up even though only N4 was actually running. This is a bug. Timing really is an issue in distributed systems.
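The scenario can be replayed in a few lines of Python. This is only a sketch under simple assumptions: the system's knowledge is modeled as a plain set of nodes it believes are down, and the names `declared_up`, `known_down`, and `actually_down` are invented for the example rather than taken from the real component.

```python
NODES = {"N1", "N2", "N3", "N4"}


def declared_up(all_nodes: set, down_nodes: set) -> bool:
    """Node majority check based on a given view of which nodes are down."""
    believed_up = all_nodes - down_nodes
    return 2 * len(believed_up) > len(all_nodes)


# Ground truth right after N3 crashes: three nodes are gone.
actually_down = {"N1", "N2", "N3"}

# The system's stale view: the crashes of N2 and N3 have not been noticed yet.
known_down = {"N1"}

stale_verdict = declared_up(NODES, known_down)    # decision the system makes
true_verdict = declared_up(NODES, actually_down)  # decision it should make

print("declared up (stale view):", stale_verdict)
print("should be up (real state):", true_verdict)
```

The two verdicts disagree because the quorum decision is computed from what the system has observed so far, not from the real state of the nodes.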
This type of bug is hard to debug, and you need log files to see what happened...
So one of my main daily tasks is reading logs to work out what happened. |
| | Posted 7/15/2009 4:20 AM |