Posted by: edu328

Original: 7/15/2009 4:20 AM

Wednesday, July 15, 2009

Challenges in Distributed Systems

My team develops distributed systems, and I own several components. One of them is Quorum: the configuration that decides whether the system as a whole should be considered up or down. More explanation below:

In a distributed system, there are several nodes. We often need to tell whether the system is up or down. Under a node majority quorum, more than half of the nodes must be up for the system to be declared up. (There are other quorum types, such as node + disk majority.)
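The node majority rule can be sketched in a few lines. This is only an illustration of the rule as described above; the function name `has_node_majority` is made up for this post and is not the real component's API.

```python
def has_node_majority(total_nodes: int, nodes_up: int) -> bool:
    """Return True when strictly more than half of the nodes are up."""
    # Integer comparison avoids any rounding question for even node counts.
    return 2 * nodes_up > total_nodes
```

With 4 nodes, 3 nodes up keeps the quorum, while 2 nodes up (exactly half) does not.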

I was doing some kernel debugging last week and saw this issue:

Consider a system S with 4 nodes {N1, N2, N3, N4} using a node majority quorum, so at least 3 nodes must be up. N1 crashed, then N2 crashed. Less than 1 microsecond after N2 crashed, N3 crashed too; at that moment, however, the system only knew that N1 was down. As a result, S was still declared up (3 of 4 nodes believed alive) even though only N4 was actually running. That is a bug. Timing really is an issue in distributed systems.
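Here is a toy model of that scenario. It is a sketch under simplifying assumptions, not the real kernel code: membership is just a Python set, and the names `believed_up` and `actually_up` are hypothetical. It only shows how a decision made from a stale membership view differs from the one the true state would give.

```python
NODES = {"N1", "N2", "N3", "N4"}


def has_node_majority(all_nodes: set, up_nodes: set) -> bool:
    """Node majority: strictly more than half of all nodes must be up."""
    return 2 * len(up_nodes) > len(all_nodes)


# Ground truth: N1, N2 and N3 have all crashed, only N4 is running.
actually_up = NODES - {"N1", "N2", "N3"}

# Stale view: so far only the crash of N1 has been detected.
believed_up = NODES - {"N1"}

declared_up = has_node_majority(NODES, believed_up)  # decision from the stale view
really_up = has_node_majority(NODES, actually_up)    # decision from the true state

print(declared_up, really_up)  # prints: True False
```

The quorum check itself is correct in both calls; the wrong answer comes entirely from the input being out of date.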

This type of bug is hard to debug, and you really need the log files to see what happened...

So one of my main daily jobs is looking at logs to work out what happened.