Peer-to-peer and self-organizing algorithms
Márk Jelasity, Fall Semester, 2007, Szeged, Hungary
Introduction
Complex Networks
Search: Unstructured Networks
Search: Structured Networks
Cooperative Content Distribution
Techniques for Hiding
Home Projects
Abstract
The course is about distributed algorithms and systems that are fully
decentralized, extremely scalable and fault tolerant.
We will see techniques for organizing millions of independent
components to perform useful functions, while avoiding bottlenecks or single
points of failure.
The need to study such algorithms has emerged independently in many fields.
Most recently, P2P systems, and especially file sharing protocols, have
triggered considerable research effort in this direction.
In other fields, such as artificial intelligence (through multi-agent systems)
and networking (routing protocols, Usenet, etc.), similar problems have long
been studied, not to mention seemingly unrelated fields such as biological
self-organization, sociology, and so on.
Fully decentralized systems and algorithms have recently become widely visible
and mainstream due to P2P applications. However, in various fields of research
decentralization and emergence have long been a topic of investigation.
We review the history of decentralized and self-organizing algorithms
to better understand their various motivations. Finally, we also discuss some
main issues central to our discussion, such as the role of cooperation,
dynamism, scale, and the topology of the communication network.
Large, distributed, self-managing systems inevitably involve complex
networks, either explicitly designed or unexpected (emergent)
when the system is not centrally controlled.
We review some of the basic models of complex networks and their main
properties.
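To make one such property concrete, here is a small self-contained sketch (plain Python, all names ours, independent of any particular library) that builds a ring lattice, the starting point of the Watts-Strogatz small-world model, and computes its average clustering coefficient:

```python
def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbours on both sides."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for d in range(1, k + 1):
            adj[v].add((v + d) % n)
            adj[(v + d) % n].add(v)
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for v, nbrs in adj.items():
        nbrs = list(nbrs)
        deg = len(nbrs)
        if deg < 2:
            continue
        # count links among the neighbours of v
        links = sum(1 for i in range(deg) for j in range(i + 1, deg)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (deg * (deg - 1))
    return total / len(adj)

# for a ring lattice the coefficient is 3(k-1)/(2(2k-1)), about 0.67 for k = 5,
# far higher than in a random graph of the same density
print(round(avg_clustering(ring_lattice(1000, 5)), 2))  # prints 0.67
```

The contrast with the near-zero clustering of a same-density random graph is one of the hallmarks of small-world models discussed in the readings below.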
Reading material
- Sergei N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, 2003.
- Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, January 2002.
- R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
- Mark E. J. Newman. Random graphs as models of networks. In Stefan Bornholdt and Heinz G. Schuster, editors, Handbook of Graphs and Networks: From the Genome to the Internet, chapter 2. John Wiley, New York, NY, 2002.
Lecture slides
The first real decentralized file sharing networks did not pay much attention
to overlay network topology: Gnutella, for example, used a relaxed approach
allowing for self-organizing overlay construction and flooding-based
communication.
Since no strict overlay topology is enforced, such systems are often termed
unstructured, although this terminology is somewhat misleading.
Motivated by the initial success of Gnutella (and its subsequent evolution),
we review the strengths and weaknesses of working with unstructured networks
and the techniques one can apply there, including search protocols and
different ways of adapting the system (replication, topology adaptation) that
can substantially improve search performance.
The overall conclusion is that (enhanced) unstructured systems are a viable
and simple alternative, with the main drawback that they cannot efficiently
locate rare (unpopular) items.
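To make the flooding idea concrete, here is a minimal, PeerSim-independent sketch (all names are illustrative) of a TTL-limited flood over a k-out random graph that counts the messages it generates:

```python
import random

def random_k_out_graph(n, k, rng):
    """Each node links to k random others; links are made bidirectional."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        while len(adj[v]) < k:
            u = rng.randrange(n)
            if u != v:
                adj[v].add(u)
                adj[u].add(v)
    return adj

def flood_search(adj, start, has_item, ttl):
    """TTL-limited flooding; returns (found, messages_sent).
    For simplicity the search stops as soon as a replica is found;
    a real flood cannot be stopped, so actual message counts are higher."""
    if has_item(start):
        return True, 0
    seen = {start}
    frontier = [start]
    msgs = 0
    for _ in range(ttl):
        nxt = []
        for v in frontier:
            for u in adj[v]:
                msgs += 1
                if u not in seen:
                    seen.add(u)
                    if has_item(u):
                        return True, msgs
                    nxt.append(u)
        frontier = nxt
    return False, msgs
```

Running this over different topologies and TTL values reproduces the classic trade-off: flooding finds popular items in few hops, but its message count grows rapidly with neighbourhood size.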
Reading material
- Yatin Chawathe, Sylvia Ratnasamy, Lee Breslau, Nick Lanham, and Scott Shenker. Making Gnutella-like P2P systems scalable. In Proceedings of ACM SIGCOMM 2003, pages 407–418, 2003.
- Georgos Siganos, Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. Power-laws and the AS-level Internet topology. IEEE/ACM Transactions on Networking, 11(4):514–524, 2003.
- Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th ACM International Conference on Supercomputing (ICS'02), 2002.
- Matei Ripeanu, Adriana Iamnitchi, and Ian Foster. Mapping the Gnutella network. IEEE Internet Computing, 6(1):50–57, 2002. (doi:10.1109/4236.978369)
- Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. Search in power-law networks. Physical Review E, 64:046135, 2001.
Lecture slides
Problems with search protocols in unstructured networks include their
inability to locate rare items and, in some cases, their lack of scalability.
Partly to address these issues, and partly to support applications other
than search, the distributed hash table (DHT) abstraction has been introduced,
which supports the storage of (key, value) pairs and their lookup based
on the key.
DHTs are based on self-organizing overlay networks in which the neighborhood
relations are more strictly controlled than in unstructured networks.
We will overview the DHT abstraction and the best-known DHT protocols,
and take a look at how they can be utilized for search.
Finally, we conclude the discussion of search protocols by discussing
approaches that combine the best parts of the unstructured and DHT worlds.
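The core of the (key, value) abstraction can be illustrated with a toy consistent-hashing ring, a sketch of the placement rule shared by Chord and similar DHTs (names are ours, and a real DHT of course routes to the owner without global knowledge):

```python
import hashlib
from bisect import bisect_left

SPACE = 2 ** 16  # small identifier space, for illustration only

def ident(name):
    """Map a node name or key to a point on the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % SPACE

class ToyDHT:
    """Each key is stored at its successor: the first node whose ID is
    >= the key's ID, wrapping around the ring."""
    def __init__(self, node_names):
        self.nodes = sorted((ident(n), n) for n in node_names)

    def lookup(self, key):
        kid = ident(key)
        ids = [i for i, _ in self.nodes]
        pos = bisect_left(ids, kid) % len(self.nodes)
        return self.nodes[pos][1]
```

The key consequence is locality of change: when a node joins or leaves, only keys between adjacent ring positions move, which is what makes these overlays scalable under churn.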
Reading material
- Miguel Castro, Manuel Costa, and Antony Rowstron. Peer-to-peer overlays: structured, unstructured, or both? Technical Report MSR-TR-2004-73, Microsoft Research, Cambridge, UK, 2004.
- Boon Thau Loo, Ryan Huebsch, Ion Stoica, and Joseph M. Hellerstein. The case for a hybrid P2P search infrastructure. In Proceedings of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS'04), San Diego, CA, USA, 2004.
- Patrick Reynolds and Amin Vahdat. Efficient peer-to-peer keyword searching. In Middleware 2003, volume 2672 of Lecture Notes in Computer Science, pages 21–40. Springer-Verlag, 2003.
- Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), pages 161–172, San Diego, CA, 2001. ACM Press.
- Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Rachid Guerraoui, editor, Middleware 2001, volume 2218 of Lecture Notes in Computer Science, pages 329–350. Springer-Verlag, 2001.
- Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), pages 149–160, San Diego, CA, 2001. ACM Press.
Lecture slides
Locating content through search is only one side of the story.
Content (large files, audio, video) also has to be actually delivered to the
user.
Besides search, content delivery is quickly becoming a fundamental
issue in the Internet and in P2P networks.
In particular, this is one of the applications where P2P protocols represent
mainstream technology (for example, BitTorrent). Here
decentralization and cooperation are the best way to go due to scalability
and robustness requirements, and their role is not simply to avoid legal trouble.
We overview some of the basic approaches to P2P content delivery, covering
both BitTorrent-style distribution of large files and streaming media.
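As a flavour of the mechanisms involved, here is a sketch of rarest-first chunk selection, the heuristic BitTorrent-style systems use to keep all pieces of a file equally available (the function and parameter names are ours):

```python
import random

def rarest_first(my_chunks, neighbour_chunk_sets, rng):
    """Return the chunk we lack that is rarest among our neighbours,
    breaking ties at random; None if the neighbours have nothing new."""
    counts = {}
    for chunks in neighbour_chunk_sets:
        for c in chunks:
            counts[c] = counts.get(c, 0) + 1
    wanted = [c for c in counts if c not in my_chunks]
    if not wanted:
        return None
    rarest = min(counts[c] for c in wanted)
    return rng.choice([c for c in wanted if counts[c] == rarest])
```

Downloading rare pieces first keeps the piece distribution balanced, so peers almost always have something useful to exchange with each other.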
Reading material
- Christos Gkantsidis and Pablo Rodriguez Rodriguez. Network coding for large scale content distribution. In Proceedings of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2005), pages 2235–2245, 2005.
- Michael Freedman, Eric Freudenthal, and David Mazières. Democratizing content publication with Coral. In Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI '04), 2004.
- M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. Al Hamra, and L. Garcés-Erice. Dissecting BitTorrent: Five months in a torrent's lifetime. In Passive and Active Network Measurement, volume 3015 of Lecture Notes in Computer Science, pages 1–11. Springer, 2004.
- Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: High-bandwidth multicast in cooperative environments. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03), pages 298–313, New York, NY, USA, 2003. ACM Press. (doi:10.1145/945445.945474)
- Bram Cohen. Incentives build robustness in BitTorrent. In Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems (P2PECON), Berkeley, CA, 2003.
- Dejan Kostic, Adolfo Rodriguez, Jeannie Albrecht, and Amin Vahdat. Bullet: High bandwidth data dissemination using an overlay mesh. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03), pages 282–297, New York, NY, USA, 2003. ACM Press. (doi:10.1145/945445.945473)
Lecture slides
Decentralized networks have a controversial application: hiding various activities and
information from observers. We review two branches of this theme: botnets and anonymity
networks. In the former, criminals apply P2P technology to avoid detection
of their activity and identity. In the latter, specialized
P2P networks offer anonymity for users who wish to access Internet
services that are indecent, prohibited, or otherwise problematic. In this case the
emphasis and intent is on "white-hat" applications, such as protecting free speech.
Reading material
- Patrick Gray. The hack of the year. The Sydney Morning Herald, November 2007.
- Sandeep Sarat and Andreas Terzis. Measuring the storm worm network. Technical Report 01-10-2007, Hopkins InterNetworking Research Group, Johns Hopkins University, 2007.
- The Honeynet Project and Research Alliance. Know your enemy: Fast-flux service networks, 2007.
- Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation onion router. In Proceedings of the 13th USENIX Security Symposium (SSYM'04), Berkeley, CA, USA, 2004. USENIX Association.
Lecture slides
Students are required to do one home project.
This will typically involve performing simulation experiments with the
PeerSim simulator, unless
the project is a real implementation of a particular protocol.
Some possible topics for projects are listed below, but other topics are
also acceptable (ask me if you have your own project idea).
The PeerSim simulator is discussed in class; in addition, you can find
documentation on the PeerSim homepage.
Try to solve problems on your own, but if you really get stuck, ask me.
Presentation of the projects
Graduate students will write a paper about their project in which they
should discuss related literature, applied research methodology (what was
tested and how, why exactly that way, what is the goal) and the actual
results. It is expected that the work shows signs of independent analytical
and critical thinking.
Undergraduate students will present their work verbally, with the help
of a few slides (a short presentation). They are expected to show a clear
understanding of the functioning and the importance of a selected protocol.
Some possible projects
All projects (except implementation projects)
will use PeerSim. Both the cycle-based and
event-based models are acceptable, but if the cycle-based model is used,
some arguments are needed to justify the simplification.
Truly excellent projects will compare the two models.
- Simulate the evolution of the early unstructured Gnutella network.
Implement the join protocol and the failure detection mechanism.
Test your implementation in different scenarios (network size, churn, maybe
different up-time distributions, etc).
Test the resulting network from the point of view of graph properties
(path length, clustering, degree distribution, connectivity, robustness
to massive node failure).
Do you find the same properties as researchers did in the real network?
Hint: in PeerSim you have all the components for generating the required
statistics and for simulating churn; what you need to implement is the rather
simple join protocol (probably as a NodeInitializer) and the maintenance
protocol (probably with the CDProtocol interface, even in the event-driven
model).
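A standalone warm-up for this project (outside PeerSim, and with an assumed join rule rather than the exact historical Gnutella protocol): grow a network by letting each new node link to a random contact and some of its neighbours, then inspect the degree distribution:

```python
import random

def grow_network(n, links, rng):
    """Each joining node links to a random contact and to up to links-1 of the
    contact's neighbours (an assumed rule, for illustration only)."""
    adj = {0: {1}, 1: {0}}
    for v in range(2, n):
        contact = rng.randrange(v)
        nbrs = list(adj[contact])
        targets = {contact} | set(rng.sample(nbrs, min(links - 1, len(nbrs))))
        adj[v] = set(targets)
        for u in targets:
            adj[u].add(v)   # keep links bidirectional
    return adj

def degree_histogram(adj):
    """Map degree -> number of nodes with that degree."""
    hist = {}
    for v in adj:
        hist[len(adj[v])] = hist.get(len(adj[v]), 0) + 1
    return hist
```

Plotting the histogram on a log-log scale hints at whether a preferential-attachment-like skew emerges; in PeerSim the same measurement is available through ready-made observer components.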
- Pick at least two search protocols designed for unstructured networks
(random walk (self-avoiding, degree-biased, etc.), expanding ring, etc.),
implement them, and compare them on at least two different topologies,
including a power-law topology and a k-out random graph topology.
Do the comparison as a function of the popularity of the searched item.
Test at least the mean time to find the searched item, the
required number of messages in the whole network, and load balancing
(for example, the maximal and minimal number of messages processed by
a single node).
Hint: you need to implement only search protocols, because topologies are
readily available in PeerSim.
When collecting data about the experiment, you might need to count the
number of messages processed in each node, and use standard PeerSim
components from the vector package to collect statistics.
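A minimal version of one of the candidate protocols, a self-avoiding random walk, might look like this (a sketch with illustrative names, not the PeerSim API):

```python
import random

def random_walk(adj, start, has_item, max_hops, rng, self_avoiding=True):
    """One walker; returns (hops to the first hit, or None, and messages sent)."""
    if has_item(start):
        return 0, 0
    visited = {start}
    current, msgs = start, 0
    for hop in range(1, max_hops + 1):
        if self_avoiding:
            # prefer unvisited neighbours; fall back to any neighbour if stuck
            choices = [u for u in adj[current] if u not in visited] or list(adj[current])
        else:
            choices = list(adj[current])
        current = rng.choice(choices)
        msgs += 1
        visited.add(current)
        if has_item(current):
            return hop, msgs
    return None, msgs
```

A walker sends one message per hop, which is exactly why random walks trade the low latency of flooding for a much lower message count.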
- Implement a simple random walk search algorithm, and implement at least
two replication strategies: path replication and owner replication, as
learned in class.
Start with a uniform distribution of search items, and with a power-law
distribution of queries for these items.
Test the theoretical result we have seen in class, which predicts
the distribution of the number of replicas in the system.
Do you see square-root and proportional replication emerging?
Is the average performance of square-root replication indeed better?
How about the most popular items only?
Test your system on both k-out random graphs and power-law graphs.
Do you see the same result on both graphs? Why?
Hint: you need to implement only search protocols, because topologies are
readily available in PeerSim.
Also, you might need to implement a custom Control component that iterates
through the network to collect statistics.
Use IncrementalStats or IncrementalFreq cleverly.
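The theoretical comparison itself can be checked numerically before any simulation. The sketch below (continuous replica counts, our own function names) contrasts proportional and square-root allocation under the random-probing cost model from class:

```python
import math

def allocate(query_rates, replica_budget, weight):
    """Split a replica budget in proportion to weight(q_i) (continuous sketch)."""
    w = [weight(q) for q in query_rates]
    s = sum(w)
    return [replica_budget * x / s for x in w]

def expected_cost(query_rates, replicas, n_nodes):
    """Expected random-probe search cost: about n / r_i probes for item i,
    weighted by how often each item is queried."""
    total = sum(query_rates)
    return sum((q / total) * (n_nodes / r)
               for q, r in zip(query_rates, replicas))

rates = [100.0, 25.0, 4.0, 1.0]               # skewed popularity
proportional = allocate(rates, 40, lambda q: q)
square_root = allocate(rates, 40, math.sqrt)
# square-root allocation yields the lower expected cost of the two
```

Your simulation results can then be compared against these continuous predictions to see how closely path and owner replication approximate the two allocations.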
- Implement the join protocol of a distributed hash table, such as Chord or
Pastry.
Test the resulting DHT as a function of the protocol parameters, from the
point of view of the average hop count while routing random IDs to their
destination, and also test whether routing is successful.
Hint: if you use the event-based engine, your work is much more valuable,
and you can play with the speed at which nodes are added, etc.
If you use the cycle-based model, then the joins can be modelled as
completely non-overlapping, and you can implement them entirely
within a NodeInitializer (that is, all traffic
generated by the join is finished when the initializer returns).
This way it is impossible to test what happens if nodes are added too fast,
but you can still test the resulting network.
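For a feel of what the hop-count measurement involves, here is a self-contained sketch of greedy Chord-style routing with full finger tables (global knowledge is used here to build the tables; your join protocol must of course construct them incrementally):

```python
import random
from bisect import bisect_left

M = 10                 # identifier bits; deliberately small
SIZE = 2 ** M

def dist(a, b):
    """Clockwise distance from a to b on the identifier ring."""
    return (b - a) % SIZE

def build_fingers(ids):
    """Finger k of node n points at the successor of n + 2^k."""
    ids = sorted(ids)
    tables = {}
    for n in ids:
        table = []
        for k in range(M):
            target = (n + 2 ** k) % SIZE
            i = bisect_left(ids, target) % len(ids)
            table.append(ids[i])
        tables[n] = table
    return tables

def route(start, key, tables):
    """Greedy routing; returns (node responsible for key, hop count)."""
    cur, hops = start, 0
    while True:
        succ = tables[cur][0]            # finger 0 is the immediate successor
        d = dist(cur, key)
        if d == 0:
            return cur, hops             # the key ID is exactly this node
        if d <= dist(cur, succ):
            return succ, hops            # key falls between cur and its successor
        # forward to the farthest finger that does not overshoot the key
        best = succ
        for f in tables[cur]:
            if 0 < dist(cur, f) < d and dist(cur, f) > dist(cur, best):
                best = f
        cur, hops = best, hops + 1
```

Averaging the hop count over many random keys, as a function of network size, is exactly the measurement the project asks for; with correct finger tables it grows logarithmically.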
- Compare different coding and chunk selection strategies for
BitTorrent-style distribution of large files by implementing at least two
strategies.
Pick one of the experiments in the paper by Gkantsidis and Rodriguez listed
in the reading material above and reproduce it. Most importantly, test whether
the results hold for larger network sizes (at least 10,000 nodes), larger
neighborhoods (around 30–40, as in BitTorrent), realistic traces (realistic
node capacity distributions and join/leave times), and a more faithful
implementation of BitTorrent techniques such as the endgame mode. Do you
agree with the conclusions of the paper? Do you find any contradictions?
Hint: You do not actually have to implement network coding, you only need to
track the coefficients. Use a compact representation of the chunk set and
coefficients to save memory. If you use the cycle-based engine, take care
to model chunk download time by not allowing a chunk to make more than one
hop in each cycle.
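Tracking the coefficients, as the hint suggests, amounts to maintaining a basis and testing whether each received coded chunk is innovative. A compact sketch over GF(2), with coefficient vectors packed into integer bitmasks (note that this is a simplifying assumption: Gkantsidis and Rodriguez code over a larger field, but the rank bookkeeping is the same idea):

```python
def insert_if_innovative(basis, coeff_vector):
    """basis maps pivot-bit -> reduced row (integers as GF(2) bit vectors).
    Returns True iff the coefficient vector adds a new dimension,
    i.e. the received coded chunk is innovative for this node."""
    while coeff_vector:
        pivot = coeff_vector.bit_length() - 1
        if pivot not in basis:
            basis[pivot] = coeff_vector
            return True
        coeff_vector ^= basis[pivot]   # eliminate the leading bit
    return False

def can_decode(basis, n_chunks):
    """A node can decode the file once its basis spans all original chunks."""
    return len(basis) == n_chunks
```

Each simulated node carries one such basis dictionary; counting how many received chunks are innovative over time reproduces the throughput metric of the paper without doing any actual coding.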
- Implement a client for one of the popular P2P networks, for example,
a BitTorrent client. Demonstrate that you can use your client to
participate in the respective P2P network.
Hint: do not get lost in details such as optimizations and user interface;
focus on the crucial functionality of the client.
First make sure you can join the network and participate; then, if you
have time left, work on optimizations and other details.
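A good first milestone for a BitTorrent client is the peer-wire handshake, whose layout is fixed by the protocol specification: a 19-byte protocol string preceded by its length, 8 reserved bytes, the 20-byte info hash, and a 20-byte peer ID, 68 bytes in total. A sketch:

```python
def build_handshake(info_hash: bytes, peer_id: bytes) -> bytes:
    """BitTorrent peer-wire handshake message (68 bytes)."""
    assert len(info_hash) == 20 and len(peer_id) == 20
    pstr = b"BitTorrent protocol"
    return bytes([len(pstr)]) + pstr + b"\x00" * 8 + info_hash + peer_id
```

After exchanging handshakes with a peer obtained from the tracker, both sides check that the info hashes match before any piece traffic starts; once that works, the remaining peer-wire messages can be added one by one.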
Jelasity Márk
Fri Dec 14 16:47:01 CET 2007