Commit 9b1a634

Add README; minor changes.
1 parent 29aa20d commit 9b1a634

10 files changed, with 118 additions and 13 deletions

README

+95
@@ -0,0 +1,95 @@
+The Driller lib
+---------------
+
+The set of files in this tarball implements a way for a group of
+cooperating processes to directly access memory segments of one
+another. This can form the basis of a message passing system that
+avoids costly data copies, as is demonstrated with mmpi ("mini-MPI"), a
+very simple API loosely inspired by MPI (it only has send, recv and
+barrier primitives).
+
+The name "driller" comes from the idea that we could drill holes in
+the closed container that forms the memory of a regular process.
+
+The code has been designed for Linux and tested on 2.6 only, though
+it might be portable to some Unix variants. It has small portions of
+architecture-specific code, but has been tested on x86 and x64, and
+it should not be difficult to port to other architectures.
+
+How to test it
+--------------
+
+- run "make" to build all library files and tests, or
+  "make DEBUG_FLAGS=-DDEBUG" for more verbose output during tests
+- run "./test_driller.sh" for a basic sanity test
+- run "./test_mmpi.sh" for a performance test with mmpi
+
+How it works
+------------
+
+The code uses the following tricks to achieve its goals:
+
+- under Linux, a process can examine the layout of its own memory
+  space by reading /proc/self/maps (on Linux 2.6, the stack and heap
+  are clearly identified)
+
+- an existing memory segment can be atomically replaced by a call to
+  mmap at the same address; thus, we can copy an existing segment to a
+  file, and then map this file over the original segment, which gives
+  us the same data, except it is now in a file
+
+- unix sockets can be used to pass file descriptors from one process
+  to another; for example one created with the trick above, so another
+  process can map the same segment in its own memory space
+
+- under Linux, memory is usually allocated by calling malloc (in the
+  C library) or mmap, both of which can be intercepted (using malloc
+  hooks and symbol overloading); this means that these memory segments
+  can be memory-mapped files if we want them to be
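
The second trick in the list above (copy a segment to a file, then map the file over it) can be sketched as follows. This is a minimal illustration and not the code from driller.c: the names replace_with_file and demo_replace, the temporary file path, and the use of an anonymous mapping as the "existing segment" are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from driller.c: copy the segment at
 * [addr, addr + size) to an unlinked temporary file, then map that file
 * over the segment. addr and size are assumed to be page-aligned.
 * Returns the file descriptor on success, -1 on error. */
int replace_with_file(void *addr, size_t size)
{
    char path[] = "/tmp/segment-XXXXXX";
    size_t done = 0;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path); /* the file stays alive while fd or a mapping refers to it */

    /* copy the segment to the file, handling short writes */
    while (done < size) {
        ssize_t rc = write(fd, (char *)addr + done, size - done);
        if (rc <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)rc;
    }

    /* MAP_FIXED replaces whatever was mapped at addr in a single call */
    if (mmap(addr, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Self-check: returns 0 if the data survives the replacement and a later
 * store through the old pointer is visible in the file. */
int demo_replace(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    char back[6] = "";
    int fd, ok;
    char *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (seg == MAP_FAILED)
        return -1;
    strcpy(seg, "hello");

    fd = replace_with_file(seg, size);
    if (fd < 0)
        return -1;

    ok = strcmp(seg, "hello") == 0; /* same data at the same address */
    seg[0] = 'j';                   /* store through the old pointer */
    msync(seg, size, MS_SYNC);
    ok = ok && pread(fd, back, 5, 0) == 5 && strcmp(back, "jello") == 0;

    munmap(seg, size);
    close(fd);
    return ok ? 0 : -1;
}
```

Calling demo_replace() from a small main() should return 0 on a Linux system with a writable /tmp. The sketch uses MAP_SHARED so that another process mapping the same file sees the same bytes; whether driller.c uses exactly these flags is not stated here.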
+
+These tricks make it possible for a process to simply replace most of
+its memory segments with memory-mapped files. The file descriptors for
+all replaced regions can then be passed to cooperating processes, which,
+after an mmap(), can directly access the same memory. When the first
+process modifies or destroys its mapping, the other processes are
+notified; they then drop any reference to the file and eventually
+free the associated resources.
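
The descriptor-passing step described above can be sketched with the standard SCM_RIGHTS mechanism of unix sockets. This is an illustration under stated assumptions, not the fdproxy.c protocol: the names send_fd, recv_fd and demo_share are hypothetical, a socketpair stands in for the proxy server, and the file is a plain temporary file rather than a replaced segment.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; };

/* Hypothetical helper: send fd as SCM_RIGHTS ancillary data. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { &byte, 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive a descriptor sent by send_fd(), or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { &byte, 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cm;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (!cm || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}

/* Self-check: the child maps a file it only knows through the passed
 * descriptor and answers "ping" with "pong". Returns 0 on success. */
int demo_share(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/shared-XXXXXX";
    char *mem = MAP_FAILED;
    int sv[2], fd, status = -1, ok = 0;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid = fork(); /* fork before creating the file: nothing is inherited */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *p;
        close(sv[0]);
        fd = recv_fd(sv[1]);
        if (fd < 0)
            _exit(1);
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED || strcmp(p, "ping") != 0)
            _exit(1);
        strcpy(p, "pong");
        _exit(0);
    }
    close(sv[1]);
    fd = mkstemp(path);
    if (fd >= 0) {
        unlink(path);
        if (ftruncate(fd, (off_t)size) == 0)
            mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    if (mem != MAP_FAILED) {
        strcpy(mem, "ping");
        ok = send_fd(sv[0], fd) == 0;
    }
    close(sv[0]); /* on failure this also unblocks the waiting child */
    waitpid(pid, &status, 0);
    ok = ok && WIFEXITED(status) && WEXITSTATUS(status) == 0
            && strcmp(mem, "pong") == 0;
    if (mem != MAP_FAILED)
        munmap(mem, size);
    if (fd >= 0)
        close(fd);
    return ok ? 0 : -1;
}
```

The child is forked before the file exists, so the only way it can reach the file is through the descriptor received on the socket. No message data is copied between the two processes: both simply see the same pages.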
+
+The code implementing these mechanisms is split into three parts:
+- fdproxy.c: this implements file descriptor passing; one of the
+  participants forks a server process that will forward file
+  descriptors to any client
+- driller.c: this replaces the memory segments of a process with
+  memory-mapped files, and tracks calls to mmap or brk, which modify
+  the process memory layout
+- map_cache.c: this provides a cache structure for memory segments
+  that a process can have remapped from another process
+
+Since the glibc implementation of malloc cannot be forced to use our
+overloaded version of mmap, an alternate allocator had to be used, and
+Doug Lea's malloc in dlmalloc.c is a convenient substitute.
+
+The file mmpi.c implements a very simple message passing API that uses
+the files above and requires at most one buffer copy to transfer a
+message. Of course, the cost of a few system calls cannot always be
+avoided, but in favorable cases, it can be spread over many messages.
+
+Licensing
+---------
+
+The files dlmalloc.c and dlmalloc.h are released by Doug Lea to the
+public domain.
+
+All other files were written by Jean-Marc Saffroy and are released as
+free software, distributed under the terms of the GNU General Public
+License version 2.
+
+Contact
+-------
+
+Jean-Marc Saffroy <[email protected]>
+
+Acknowledgements
+----------------
+
+Thanks to Gaël Roualland for testing and debugging this code on IA32.

TODO

+4
@@ -1 +1,5 @@
 - README
+- mmpi: per-sibling recvq for better perf on more than 2 cores
+- fdproxy: credentials?
+- use brk and stack ptr to identify heap and stack segments
+

driller.c

+2 -2
@@ -303,7 +303,7 @@ static void map_overload_stack(void) {
         perr("lseek");
 
     /* copy mapped area to file */
-    rc = write(map_stack->fd, (char*)map_stack->start, size);
+    rc = write(map_stack->fd, map_stack->start, size);
     if(rc < 0)
         perr("write");
     if(rc < size)
@@ -465,7 +465,7 @@ static void map_rebuild(struct map_rec *map, int index) {
     /* copy mapped area to file */
     if(lseek(map->fd, map->offset, SEEK_SET) < 0)
         perr("lseek");
-    rc = write(map->fd, (char*)map->start, size);
+    rc = write(map->fd, map->start, size);
     if(rc < 0)
         perr("write");
     if(rc < size)

driller.txt

+4 -6
@@ -104,22 +104,20 @@ Cons
 * stack growth costs signal + 2 syscalls (mmap + getrlimit)
   mitigated by stack growth granularity
   could save call to getrlimit
+  could save call to mmap by mapping a lot in advance and handling sigbus
+  this would mean no signal, no control of getrlimit
 * not thread safe (yet)
 * not sure if it breaks some libs (qx, ib), may be worked around (blacklist)
 * some linuxisms limit OS portability:
   credentials over unix sockets (optional)
   abstract namespace for unix sockets (easier cleanup)
   parse /proc/self/maps
-* some housekeeping needed to keep track of new segments; need to check
-  if a thread is needed to handle unix socket comms while process is busy
 * could use many file descriptors when malloc calls mmap, unless we
   use a single fd for most maps (ie. those starting at TASK_UNMAPPED_BASE)
   this will require an allocator
   see also notes above
-* brk can no longer grow the heap
-  malloc uses mmap areas instead; can be improved by overloading brk
-  brk overloaded, so it's ok
-* brk costs 2 syscalls instead of one (ftruncate + mmap)
+* brk costs 2 syscalls (ftruncate + mmap) vs. 1
+  could save call to mmap by mapping a lot in advance
 * malloc of mmap'ed area costs 3 syscalls (open + ftruncate + mmap) vs. 1
   could be mitigated by fd cache
 * free of mmap'ed area costs 3 syscalls (munmap + ftruncate + close) vs. 1

test_dlmalloc.sh

File mode changed from 100644 to 100755.

test_driller.sh

File mode changed from 100644 to 100755.

test_fdproxy.sh

File mode changed from 100644 to 100755.

test_mmpi.c

+13 -5
@@ -77,7 +77,7 @@ int main(int argc, char**argv) {
 
         printf("%d: send to %d\n", rank, 0);
         for(i = 0; i < iter; i++)
-            mmpi_send(0, (char*)&rank, sizeof(rank));
+            mmpi_send(0, &rank, sizeof(rank));
     } else {
         int i, j;
         struct timeval tv1, tv2;
@@ -93,7 +93,7 @@ int main(int argc, char**argv) {
 
             printf("%d: recv from %d\n", rank, j);
             for(i = 0; i < iter; i++) {
-                mmpi_recv(j, (char*)&r, &sz);
+                mmpi_recv(j, &r, &sz);
                 assert(sz == sizeof(r));
                 assert(r == j);
             }
@@ -109,7 +109,10 @@ int main(int argc, char**argv) {
     mmpi_barrier();
 
     /* test throughput */
+
 #if 1
+    /* increase the odds that buf is allocated with mmap
+     * (it won't be if there is enough free space in the heap) */
     mallopt(M_MMAP_THRESHOLD, THRTEST_CHUNK_SIZE);
 #endif
     buf = malloc(THRTEST_CHUNK_SIZE);
@@ -121,15 +124,20 @@ int main(int argc, char**argv) {
 #if 0
             memset(buf, (char)i, THRTEST_CHUNK_SIZE);
 #else
+            /* we don't necessarily want to benchmark memset */
             buf[0] = buf[THRTEST_CHUNK_SIZE-1] = (char)i;
 #endif
+
             mmpi_send(0, buf, THRTEST_CHUNK_SIZE);
-#if 1
+
+#if 0
+            /* if buf is allocated with mmap, this will force
+             * invalidation of the fd and its memory mapping */
             free(buf);
             buf = malloc(THRTEST_CHUNK_SIZE);
 #endif
         }
-    } else {
+    } else { /* in rank 0 */
         int i, j;
         struct timeval tv1, tv2;
         float delta;
@@ -150,7 +158,7 @@ int main(int argc, char**argv) {
             }
         }
         gettimeofday(&tv2, NULL);
-        delta = (float)(tv2.tv_usec - tv1.tv_usec) / 1000000
+        delta = (float)(tv2.tv_usec - tv1.tv_usec) / 1E6
             + (float)(tv2.tv_sec - tv1.tv_sec);
         printf("average send/recv throughput: %.2f MB/s (%.2fs)\n",
                (float)((nprocs-1) * THRTEST_VOLUME >> 20) / delta,

test_mmpi.sh

File mode changed from 100644 to 100755.

test_spinlock.sh

File mode changed from 100644 to 100755.
