Talk:Kademlia

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Internet (Rated B-class, Low-importance)
WikiProject icon This article is within the scope of WikiProject Internet, a collaborative effort to improve the coverage of the Internet on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B-Class article B  This article has been rated as B-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
 

Implementations[edit]

There is yet another implementations though I'm not sure anymore, who's developing it or if it recently forked...CSpace

CSpace-HP and at Google Code —Preceding unsigned comment added by 81.189.24.2 (talk) 16:27, 23 September 2008 (UTC)

Overlays[edit]

Here is a list of P2P Overlays

Overlays.

Is overlay the correct word? --ShaunMacPherson 19:24, 24 Sep 2004 (UTC)

Overlay is the correct word used by academics. The link given is of DHT (or structured) P2P overlays. Other unstructured P2P overlays exist, such as Gnutella. Non-P2P overlays exist such as the Skype network, etc. See also Structured versus Unstructured. Bpringlemeir 19:37, 18 September 2007 (UTC)

How does it work?[edit]

I read the white paper and parts of it are very difficult to understand. Can some one create an in depth description of how this network works, but in plain English. (or at least put a hyper link to such a description?)

I agree. Currently this article doesn't make any sense. It's just a bunch of hand waving.

It's a rather popular design for peer-to-peer networked distributed hash tables (see also: hash table). In short, hash tables provide the functionality of efficiently resolving abstract keys into values, and the insertion of these key-value pairs. Distributed hash tables just provide this functionality over a network of nodes. For example, a key could be a search keyword and its value would be a single article/file it matched. The same key can be used in multiple key-value pairs, so a single search keyword can return several results. This "primitive operation" by itself might sound pretty trivial, but it allows one to build very complex functionality on top. So, Kademlia is just a design for implementing these kinds of distributed hash tables. Does this make sense? -- intgr 19:45, 29 September 2006 (UTC)
I have inserted a section called "System Details". This tries to explain how Kademlia works. I will use just a binary mode in the description. I think the XOR notation confuses people. At the end, in the "accelerated lookup", it can be shown that the XOR metric allow the binary tree to expand to higher degree trees. There are a lot of nit-picky details to handle pathological cases in Kademlia. I am not sure if I have too much detail now, or not enough. Also, I think the introduction is too long. Parts of the "file sharing" could be moved to the details. Bpringlemeir 15:50, 14 April 2007 (UTC)
I took the original article (a mess) and rewrote it by creating a separation between Kademlia itself and it uses (File Sharing). Did not attempt to go into much detail because it is too complicated to explain and I was afraid of destroying intelligibility. Just wanted to give readers a hint of what is going on. Now I read your (excellent) work and I doubt if this separation is to be mantained, since you went into File Sharing specifics in your description of the algorithm. I agree that its time for some further reorganization. Should the general description have comments here and there about the different implementations?
Some notes:
  • Kademlia is a distributed database that can be used to store any kind of information. Storing pointers to files is a specific implementation.
  • The concept of "distance" is basic to understand Kademlia. I find the entry "Routing Tables" very difficult to grasp without having the distance concept. A simple description of distance should be taken out of "Accelerated lookups" and placed at the beggining, as a part of the list.
  • FIND_VALUE Doesn't return a pointer, but the actual value. Or many values, depending on implementations.
  • You never lookup for a node, but for the closest nodes to some ID.
  • You mention the caching mechamism (storing values futher away from original k nodes). You do not mention the replication mechanism. In some implementations (Overnet, eMule) the lack of replication is compensated by periodical STORE from the storer node and by not ending a FIND_VALUE when the first node returns one value, but exhaustively querying all nodes close to the desired ID. These compensations are not part of the Kademlia paper. These two P2P networks do not use caching either.
  • While every node ID should be different (according to the paper), implementations do not care and create lots of duplicates. Kademlia is resilient enough to work well even then. It even may compensate a bit for the lack of caching.
  • Node-splitting is just implementation, a good idea if binary trees are used, but useless if you use a table (memory is cheap). Shouldn't this be explained in more general terms? (I know it is like this in the original papers) Again, no hope of understanding its use without understanding distance first.
  • Prefixes. Your description is absolutely correct. But the two last sentences confused me:
  • "These are maximum values and the average value will be far less due to redundancy, increasing the chance of finding a node in a k-bucket that share more than just the prefix with the target". Not due to redundancy but to sheer luck. Statistically you cannot expect to go always along the longest path, but this is not redundancy.
  • "Nodes can use mixtures of prefixes in their routing table, such as the Kad Network used by eMule. The Kademlia network could even be heterogeneous in routing table implementations. This would just complicate the analysis of lookups". Kad uses single bit buckets. Am I correct? Kad goes complicated at the Value side, but buckets are very simple. Are there any implementations with heterogeneous prefixes? If not, I think this sentence just complicates something that is not very simple already.
Now I'll try to improve the article a bit based on your words. Of course, feel free to revert anything incorrect! If you disagree with anything I said, please comment under it. [Apparently by 83.41.82.122? Bpringlemeir 04:23, 20 June 2007 (UTC)]
Sorry, I didn't see your talk entry before I edit some things. You have some good points. I think that what you wrote about RTT might be correct for some implementation. However, general Kademlia doesn't really say anything about it. You have good information, but paired with what I wrote, i think it is too much. However, I have lost objectivity. As I know more about Kadmelia, I might not think some things need explaining now. I did find the parts added to accelerated lookups was wrong (or at least highly confusing). I agree that some re-organization is still needed. I found the concept of distance confused the issue for me when I started. Note, that when you are using accelerated lookups, it is not so much counting the bits. The "XOR metric" is abstract sounding and would sound rather "alien" to many people. However, the concept of bits is easier to understand. The "XOR metric" is only really necessary to describe accelerated lookups. It partitions from a binary tree to a quad-tree, octo-tree, etc. I will have to look at the rest of your comments later. Bpringlemeir 04:32, 20 June 2007 (UTC)
I gave a cursory read at your modifications and they look good. XOR must be mentioned in the article and given the proper importance. There is no need to understand XOR to understand Kademlia, but what differentiates Kademlia from other distributed hash tables is, precisely, XOR usage. MUST be mentioned. A section about "algorith details" at the bottom might be its place.
Regarding prefixes, please look at the paper by Daniel Stutzbach at the bottom of the article. Redundant meant the minimum 'k' to provide a full map of the network. Ie, k=1. Anyways, the first k-bucket covers 1/2 the network, but you might have an entry in the k bucket that is closer to the target. It is discussed in the paper. Bpringlemeir 04:40, 20 June 2007 (UTC)
Kademlia is highly redundant in the sense that there is not one single path to find a Key, but many many possible ones. Agreed. If mentioned in this sense, this should be clear enough so the reader understands what is being ment by redundance. This is my point. "redundantly interwined" sounds good?

It does not make any sense what the routing table (the binary tree) is used for. I mean we already have 160 k-buckets, right? Please clarify. —Preceding unsigned comment added by 193.40.6.68 (talk) 13:42, 13 July 2009 (UTC)

The graphic was for a single node. I clarified this in the caption. I think the graphic is still confusing, but it is better than just words in my opinion. I don't know why you are talking about 160 k-buckets. I think you are talking about a particular network. This hypothetical network only has eight possible nodes. Feel free to make a graphic with 2^160 nodes. Bpringlemeir (talk) 02:11, 28 December 2009 (UTC)

I disagree. I found this article very well-written, and cleared up a lot of confusion I had from reading the paper.

Other uses[edit]

  • A non-p2p use found it's way into I2P (they mentioned Kademlia anyway). This is one of the scalability projects i am currently studying. If i get more information on such usage i will certainly contribute it to this article.
--Sarah (Kuro) 22:43, 27 September 2005 (UTC);
Nice, the article no longer (confusingly) describes the algorythm as a "protocal" as it had in 2005 --Kuzetsa (talk) 09:54, 18 September 2008 (UTC) (oops, strike that) --Kuzetsa (talk) 09:56, 18 September 2008 (UTC)

Protocol[edit]

In my understanding, Kademlia is not a network protocol. It is an algorithm (or "system", as the authors of the original article call it) that can be implemented as a network protocol. I have edited the article to reflect this belief. If I'm mistaken, please let me know. Strait 04:39, 9 April 2006 (UTC)

The paper refers to it as a system. There are protocol details, algorithms and data structures. Some parts are a bit sketchy, like how the unbalanced binary tree is handled and splitting k-buckets that aren't in the nodes range. Bpringlemeir 15:57, 14 April 2007 (UTC)

Name[edit]

What is the derivation of the name? Roberthoff82 15:24, 3 February 2007 (UTC)

Yeah, what on earth does it mean? Maikel 09:27, 21 May 2007 (UTC)
I believe it was named after the mountain in Bulgaria. --PacoBell (talk) 21:28, 13 May 2008 (UTC)

In Bulgarian "Kademlia" (Bulgarian: кадемлия) means "who brings good luck". "Kadem" means "good luck". Both Kadem and Kademlia are not frequently used in everyday language in recent decades. Probably the origin of the word is Turkish.
The Kademlia Peak cited above is actually "Goliam Kademlia" (Big Kademlia), which together with "Malak Kademlia" (Small Kademlia) and Pirgos peaks is forming the more popular Triglav Peak (do not confuse with the peak with the same name in Slovenia). The common between Kademlia technology and Kademlia peak is that both are supposed to bring good luck. Dobrichev (talk) 21:25, 11 December 2009 (UTC)

How Kademlia start a node[edit]

kadc need some working peers stored in a file to start a node, but what about utorrent ? —The preceding unsigned comment was added by 217.44.131.209 (talk) 23:46, 18 March 2007 (UTC).

For trackerless torrents, there are several peers included in the torrent. At least this is what some torrents downloaded from Gnutella show when dumped with "torrentinfo" in the Python Mainline code. I think you are referring to bittorrent and not utorrent. In this case, the bootstrap nodes are in the torrent file (afaik). Bpringlemeir 15:54, 14 April 2007 (UTC)

Intro[edit]

I think the intro is overly long. There is the section above by "Intgr" under "How does this work". I also have an analogy in comments in the main article under "System details". The 1st, 2nd, 3rd generation comment might also be more suitable for an intro. Ie, How does Kademlia differ from other P2p networks? The main thing is that information pointers are distributed throughout the network. Bpringlemeir 16:28, 14 April 2007 (UTC)


I have reworded the introduction to not include details like the OSI layer and some redundant wording. For example, repeating Distributed hash table. Distance information was moved to "System Details". Bpringlemeir 19:04, 28 April 2007 (UTC)
The following text is useful, but I found it confusing in the introduction. There are probably too many abstract concepts in the article already, so adding the node name "A", "B", "C", etc. doesn't seem to help a new reader understand this. I think the information is repeated again in the article. This text was removed from the intro,
The knowledge that a participant node has of the network varies with "distance". A concept of distance is defined and implemented in the network, so that a node "A" can always tell which of other two nodes "B" and "C" is closer to himself (to "A"). Any node has a very detailed knowledge of the nodes that are "very close" to himself (i.e. will know the ID's of all of the close nodes), a very sparse knowledge of very distant nodes (i.e. will know the ID's of very few of them), and varying degrees of knowledge of the parts of the network that are at intermediate distances, the shorter the distance to himself, the more detailed the knowledge.
Kademlia purpose is to store, and retrieve, some "Values", which can be any data, usually pointers to files. Values are pointed to by "Keys". Nodes ID's and Keys can have its "distance" computed just like between two node ID's. Therefore any node "A" can calculate the distance to two keys "K1" and "K2" and decide which one is closer to himself (to "A") than the other key. In the same way, a node "A" can decide which one of two keys "K1" and "K2" is closer to another known node "B" than the other key. "A" and "B" will agree in that calculation.
To store a value, the storer node will calculate the corresponding key (usually a hash calculated from the value itself), search the network in several steps, each step finding nodes that are closer to the desired key than in previous steps, until the closest nodes to that key can be determined. Then it will store the value in all of these nodes.

Cyclic group?[edit]

I can't find a reference to cyclic groups in the paper. It is pretty clear that ({0,1}, xor) is a cyclic group, but ({00, 01, 10, 11}, bitwise xor) is not cyclic because it has no generators (for all x, x xor x = 0), and similarly for all bitstrings of length longer than 1. Does anyone have a citation for the reference to cyclic groups being important in the analysis? —Preceding unsigned comment added by 141.212.111.83 (talk) 22:00, 8 April 2010 (UTC)

You are correct. It doesn't say anything about cyclic groups in the paper. I was looking for a simple term for a non-abelian algebra. It uses the fact 2^N is an algebra to prove things in the paper cited. Bpringlemeir (talk) 02:51, 25 September 2011 (UTC)
err, not algebra but group. The larger {00, 01, 10, 11} is not a cyclic group, it is two cyclic sub groups. You could change this to what ever you like. I was only trying to convey that algebra/groups are used in the proof. Implementers of the protocol/system (and probably most readers) won't care too much about this. Bpringlemeir (talk) 02:58, 25 September 2011 (UTC)

Definition of triangle inequality[edit]

There are 2 places in which the triangle inequality is defined. There are 2 different definitions on this page. I changed the definition and put it in both places. The reason I could not use the definition in the triangle inequality page, is because it is defined from an edge perspective not a vertex perspective. Let me know if everything is in order, I have not made many commits. — Preceding unsigned comment added by Lyle Stephan (talkcontribs) 15:05, 5 August 2012 (UTC)

Possible Error[edit]

Hey,

I guess there is a slight inaccuracy which makes the section extremely confusing:

"Nodes that can go in the nth list must have a differing nth bit from the node's ID" from "Routing Tables" , should be "Nodes that can go in the nth list must have n differing bits from the node's ID" as each list refers to a difference "distance" which is in this case the hamming distance of those two strings...

I'm not really familiar with the wikipedia editing process, so I post it here.. Sorry.

bye — Preceding unsigned comment added by 134.19.118.207 (talk) 10:20, 7 April 2013 (UTC)

About hops: [edit]

About this sentence:

This is very efficient: Like many other DHTs, Kademlia contacts only nodes during the search out of a total of nodes in the system.

Does this mean that in 10 hops, you can contact 10 billion nodes? --Dan Bolser (talk) 13:26, 25 May 2015 (UTC)

Actually, this is probably natural log, not base 10 log, so 20 hops would be required to reach (approximately) 10 billion nodes. --Dan Bolser (talk) 13:30, 25 May 2015 (UTC)


Possible Reference error[edit]

The Next Generation section says "Petar Maymounkov, one of the original authors of Kademlia, has proposed a way to circumvent this weakness by incorporating social trust relationships into the system design." and then references a paper by Chris Lesniewski-Laas, there is no mention of Petar Maymounkov. — Preceding unsigned comment added by 197.89.192.117 (talk) 18:48, 14 October 2015 (UTC)

Mistake in the contact accounting with the replacement cache[edit]

"When a k-bucket is full and a new node is discovered for that k-bucket, the least recently seen node in the k-bucket is PINGed. If the node is found to be still alive, the new node is placed in a secondary list, a replacement cache. The replacement cache is used only if a node in the k-bucket stops responding. In other words: new nodes are used only when older nodes disappear."

This description is inconsistent with the section "4.1 Optimized Contact Accounting" of the Kademlia paper. Specifically, the first sentence is only valid for a naive implementation of the contact accounting. When the paper presents an optimized contact accounting in the 4.1 section, which introduces the replacement cache, it mentions that PINGing the least recently seen node in the k-bucket every single time the k-bucket is full and a new node is discovered for that k-bucket results in a lot of traffic, and to reduce that traffic "Kademlia delays probing contacts until it has useful messages to send them", i.e. it doesn't do these PINGs that the first sentence mentions anymore. So it's either the naive contact accounting -- PING least recently seen node and have no replacement cache, or optimized contact accounting -- don't PING the least recently seen node and do have the replacement cache. Not both, as the article currently says. 97.102.162.195 (talk) 04:22, 31 January 2017 (UTC)