Memory Leak

Unknown Author

March 3, 2014

In [1]: from IPython.core.display import Image
        Image('files/memory_lapse.jpg')
Out [1]:

Memory Lapse, courtesy of Wizards of the Coast

Part I

Foreword

All the code in this notebook is written in C. Why? Because it is easier to observe hardware behavior in C than in Python. For instance:

In [2]: r = range(1000)
        print sum(r)

499500

is equivalent to the following C code:

In [3]: %%file sum.c
        #include <stdio.h>
        #include <stdlib.h>
        int main() {
            int n = 1000;
            long *range = malloc(n*sizeof(long));
            if(!range) return 1;
            for(int i = 0; i < n; i++)
                range[i] = i;
            long sum = 0;
            for(int i = 0; i < n; i++)
                sum += range[i];
            printf("%ld\n", sum);
            free(range);
            return 0;
        }

Overwriting sum.c

In [4]: %%sh
        make sum CFLAGS=-std=c99
        ./sum

cc -std=c99    sum.c   -o sum
499500

How many allocations took place? We can count them with the LD_PRELOAD trick: the dynamic linker loads our library first, so our malloc shadows the libc one, while dlsym(RTLD_NEXT, ...) hands us back the real implementation.

In [5]: %%file memory_trace.c
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <dlfcn.h>
        static void* (*real_malloc)(size_t) = NULL;
        static void init_malloc(void) {
            real_malloc = dlsym(RTLD_NEXT, "malloc");
            if(!real_malloc)
                fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
        }
        void *malloc(size_t size) {
            if(!real_malloc) init_malloc();
            void *p = real_malloc(size);
            fprintf(stderr, "%zd\n", (size_t)p);
            return p;
        }

Overwriting memory_trace.c

We compile this snippet and use our version of malloc, which tracks allocations, to re-run the previous snippets:

In [6]: %%sh
        gcc -shared -fPIC -ldl memory_trace.c -o memory_trace.so
        LD_PRELOAD=./memory_trace.so ./sum 2> csum.trace
        LD_PRELOAD=./memory_trace.so python -c 'print sum(range(1000))' 2> pysum.trace

499500
499500

Let's plot the output:

In [7]: import numpy as np
        ctrace = np.fromfile('csum.trace', dtype=np.int, sep=' ')
        pytrace = np.fromfile('pysum.trace', dtype=np.int, sep=' ')

In [8]: %pylab inline
        plot(pytrace % 512) ; title('Python memory trace') ; show()

Populating the interactive namespace from numpy and matplotlib

In [9]: plot(ctrace % 512) ; title('C memory trace') ; show()

There is far less noise when you observe the memory behavior of a program in C!
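The same trick extends to deallocations. A minimal sketch, not in the original notebook, that interposes free in the style of memory_trace.c above:

In []: %%file free_trace.c
       /* sketch: interpose free() exactly like malloc() above */
       #define _GNU_SOURCE
       #include <stdio.h>
       #include <dlfcn.h>

       static void (*real_free)(void*) = NULL;

       void free(void *p) {
           if(!real_free)
               real_free = dlsym(RTLD_NEXT, "free");
           fprintf(stderr, "free %zd\n", (size_t)p); /* log the freed address */
           real_free(p);
       }

Pairing the malloc and free logs would let us spot leaked blocks: any address that is allocated but never freed.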

Part II

How developers see memory

In [10]: Image('files/von_neumann.png')
Out [10]:

1 Type and Memory

In [11]: %%file scalar.c
         #include <stdio.h>
         int main(void) {
             printf("_Bool: %zd\n", sizeof(_Bool));
             printf("char: %zd\n", sizeof(char));
             printf("short: %zd\n", sizeof(short));
             printf("int: %zd\n", sizeof(int));
             printf("long: %zd\n", sizeof(long));
             printf("long long: %zd\n", sizeof(long long));
             puts("--");
             printf("void*: %zd\n", sizeof(void*));
             puts("--");
             printf("float: %zd\n", sizeof(float));
             printf("double: %zd\n", sizeof(double));
             printf("long double: %zd\n", sizeof(long double));
             return 0;
         }

Overwriting scalar.c

In [12]: %%sh
         make scalar
         ./scalar

cc -g -O2    scalar.c   -o scalar
_Bool: 1
char: 1
short: 2
int: 4
long: 8
long long: 8
--
void*: 8
--
float: 4
double: 8
long double: 16

And trickier: constructed types.

In [13]: %%file struct.c
         #include <stdio.h>
         struct cc  { char a, b; };
         struct ii  { int a, b; };
         struct ccc { char a, b, c; };
         struct iii { int a, b, c; };
         struct ici { int a; char b; int c; };
         struct iic { int a; int c; char b; };
         struct flex{ int a; char b[]; }; // flexible array member
         int main(void) {
             printf("cc: %zd\n", sizeof(struct cc));
             printf("ii: %zd\n", sizeof(struct ii));
             printf("ccc: %zd\n", sizeof(struct ccc));
             printf("iii: %zd\n", sizeof(struct iii));
             printf("ici: %zd\n", sizeof(struct ici));
             printf("iic: %zd\n", sizeof(struct iic));
             printf("flex: %zd\n", sizeof(struct flex));
             return 0;
         }

Overwriting struct.c

In [14]: %%sh
         make struct
         ./struct

cc -g -O2    struct.c   -o struct
cc: 2
ii: 8
ccc: 3
iii: 12
ici: 12
iic: 12
flex: 4
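These sizes follow from alignment: each int must sit at an offset that is a multiple of 4, so the compiler pads ici to 12 bytes even though its members only need 9. A minimal sketch, added here and assuming the same LP64 platform as above, that exposes the holes with offsetof:

In []: %%file padding.c
       #include <stdio.h>
       #include <stddef.h>

       struct ici { int a; char b; int c; };

       int main(void) {
           /* b is followed by 3 padding bytes so that c stays 4-aligned */
           printf("a@%zu b@%zu c@%zu, sizeof=%zu\n",
                  offsetof(struct ici, a),   /* 0 */
                  offsetof(struct ici, b),   /* 4 */
                  offsetof(struct ici, c),   /* 8 */
                  sizeof(struct ici));       /* 12 */
           return 0;
       }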

2 The Von Neumann Bottleneck

Memory is getting slower relative to CPUs.

In [15]: Image('files/memory-cpu.png') # from 'Computer Architecture: a Quantitative Approach'
Out [15]:

2.1 An experiment to showcase memory speed vs. CPU speed

The purpose of the code below is to compare the effect of adding extra operations in a loop body, without additional memory dependencies:

In [16]: %%file memory_cpu.c
         #include <stdio.h>
         #include <stdlib.h>
         #include <sys/time.h>
         int main() {
             int n = 10000000;
             float *data = malloc(sizeof(float) * n);
             if(!data) return 1;
             for(int i = 0; i < n; i++)
                 data[i] = 1.f / (1 + i);
             struct timeval start, stop;
             gettimeofday(&start, 0);
             float sum = 0;
             for(int i = 0; i < n; i++) {
                 sum += data[i];
         #ifdef MORE_CPU
                 sum /= data[i];
         #endif
             }
             gettimeofday(&stop, 0);
             printf("%f\n", sum);
             fprintf(stderr, "%ld\n", stop.tv_usec - start.tv_usec
                                      + 1000000 * (stop.tv_sec - start.tv_sec));
             free(data);
             return 0;
         }

Overwriting memory_cpu.c

Compile & run this:

In [17]: %%sh
         make memory_cpu CFLAGS='-std=c99'
         ./memory_cpu

cc -std=c99    memory_cpu.c   -o memory_cpu
15.403683

30264

To measure execution time properly, we need many samples; the median is usually a good metric.

In [24]: %%sh
         for i in `seq 1 100`
         do
             ./memory_cpu > /dev/null
         done 2> memory_cpu.timings
         wc -l memory_cpu.timings

100 memory_cpu.timings

In [25]: memory_cpu = np.fromfile('memory_cpu.timings', dtype=np.int, sep=' ')
         plot(memory_cpu) ; show()

In [26]: np.median(memory_cpu)
Out [26]: 81469.0

And now with more instructions:

In [27]: %%sh
         rm -f memory_cpu
         make memory_cpu CFLAGS='-DMORE_CPU -std=c99'
         for i in `seq 1 100`
         do
             ./memory_cpu > /dev/null
         done 2> memory_more_cpu.timings
         wc -l memory_more_cpu.timings

cc -DMORE_CPU -std=c99    memory_cpu.c   -o memory_cpu
100 memory_more_cpu.timings

In [28]: memory_more_cpu = np.fromfile('memory_more_cpu.timings', dtype=np.int, sep=' ')
         plot(memory_more_cpu) ; show()

In [29]: print np.median(memory_more_cpu)

81410.0

The median barely moves (81410 vs. 81469 microseconds): the loop is memory-bound, so the extra division per iteration is essentially free. The CPU spends its time waiting for data, not computing.

3 Solutions to the Von Neumann Bottleneck

Introduce a complex memory hierarchy!

In [30]: Image('files/memory-models.jpg') # from 'Why Modern CPUs are Starving and What Can Be Done about It'
Out [30]:

3.1 Reading from Registers

In [31]: code = '''
         #include <stdio.h>
         #include <stdlib.h>
         #include <sys/time.h>
         int main(int argc, char *argv[]) {{
             int n = argc > 1 ? atoi(argv[1]) : 100;
             /* [the body of this template was lost in conversion: it declares
                {0} accumulator variables, updates all of them inside a timed
                loop over n iterations, and prints the elapsed time to stderr] */
         }}
         '''
         # reconstructed: write registers_<i>.c for i = 1..19, matching the
         # compile log and timing files used below
         for i in range(1, 20):
             with open('registers_%d.c' % i, 'w') as f:
                 f.write(code.format(i))

In [32]: %%sh
         for i in `seq 1 19`
         do
             make registers_$i CFLAGS='-std=c99 -O1'
             for j in `seq 1 100`
             do
                 ./registers_$i > /dev/null
             done 2> registers_$i.timings
         done
         echo ok

cc -std=c99 -O1    registers_1.c    -o registers_1
cc -std=c99 -O1    registers_2.c    -o registers_2
cc -std=c99 -O1    registers_3.c    -o registers_3
cc -std=c99 -O1    registers_4.c    -o registers_4
cc -std=c99 -O1    registers_5.c    -o registers_5
cc -std=c99 -O1    registers_6.c    -o registers_6
cc -std=c99 -O1    registers_7.c    -o registers_7
cc -std=c99 -O1    registers_8.c    -o registers_8
cc -std=c99 -O1    registers_9.c    -o registers_9
cc -std=c99 -O1    registers_10.c   -o registers_10
cc -std=c99 -O1    registers_11.c   -o registers_11
cc -std=c99 -O1    registers_12.c   -o registers_12
cc -std=c99 -O1    registers_13.c   -o registers_13
cc -std=c99 -O1    registers_14.c   -o registers_14
cc -std=c99 -O1    registers_15.c   -o registers_15
cc -std=c99 -O1    registers_16.c   -o registers_16
cc -std=c99 -O1    registers_17.c   -o registers_17
cc -std=c99 -O1    registers_18.c   -o registers_18
cc -std=c99 -O1    registers_19.c   -o registers_19
ok

In [33]: regs = []
         for i in range(1, 20):
             np_i = np.fromfile('registers_' + str(i) + '.timings', dtype=np.int, sep=' ')
             regs.append(np.median(np_i))
         lin = [regs[0] + i*(regs[1] - regs[0]) for i in range(19)]

In [34]: plot(regs, '-o') ; plot(lin, '-x')
Out [34]: [<matplotlib.lines.Line2D at ...>]

As long as the accumulators fit in registers, the cost grows roughly linearly; once the compiler has to spill them to the stack, the timings depart from the linear extrapolation.
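One way to check where spilling begins, not shown in the original notebook (x86-64 and the registers_<i>.c naming are assumed), is to look at the generated assembly: accumulators that no longer fit in registers show up as stack traffic.

In []: %%sh
       # hypothetical check: count stack-relative accesses in the assembly;
       # a jump in this count between registers_<i> versions marks the spill point
       for i in 1 10 19
       do
           gcc -std=c99 -O1 -S registers_$i.c -o - | grep -c '(%rsp)'
       done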

3.2 Reading from Memory

In [35]: %%file cache_cost.c
         #include <stdio.h>
         #include <stdlib.h>
         #include <sys/time.h>
         int main() {
             int n = 10000000;
             float *a = malloc(sizeof(float) * n),
                   *b = malloc(sizeof(float) * n);
             if(!a || !b) return 1;
             struct timeval start_a, start_b, stop_a, stop_b;
             gettimeofday(&start_a, 0);
             for(int i = 0; i < n; i++)
                 a[i] *= 2;
             gettimeofday(&stop_a, 0);
             gettimeofday(&start_b, 0);
             for(int i = 0; i < n; i += 8)
                 b[i] *= 2;
             gettimeofday(&stop_b, 0);
             fprintf(stdout, "%ld\n", stop_a.tv_usec - start_a.tv_usec
                                      + 1000000 * (stop_a.tv_sec - start_a.tv_sec));
             fprintf(stderr, "%ld\n", stop_b.tv_usec - start_b.tv_usec
                                      + 1000000 * (stop_b.tv_sec - start_b.tv_sec));
             free(a); free(b);
             return 0;
         }

Overwriting cache_cost.c

In [36]: %%sh
         make cache_cost CFLAGS='-std=c99 -O0'
         ./cache_cost

cc -std=c99 -O0    cache_cost.c   -o cache_cost
39851
17000

In [37]: %%sh
         for i in `seq 1 20`
         do
             ./cache_cost
         done 1> cache_cost_a.timings 2> cache_cost_b.timings
         echo ok

ok

In [38]: cache_cost_a = np.fromfile('cache_cost_a.timings', dtype=np.int, sep=' ')
         cache_cost_b = np.fromfile('cache_cost_b.timings', dtype=np.int, sep=' ')
         np.median(cache_cost_a) / np.median(cache_cost_b)
Out [38]: 2.3657835541114722

Touching only every eighth float is nowhere near eight times faster: memory moves in whole cache lines, so the strided loop still pulls in most of the same lines.
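A related sketch, added here (cf. the 'Gallery of Processor Cache Effects' in the references): varying the stride over a fixed array shows the same effect, since the pass time barely drops until the stride exceeds the 64-byte line. Build it like the other examples, with make stride CFLAGS='-std=c99 -O0'.

In []: %%file stride.c
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/time.h>

       int main(void) {
           int n = 1 << 24;
           float *a = malloc(sizeof(float) * n);
           if (!a) return 1;
           for (int stride = 1; stride <= 64; stride *= 2) {
               struct timeval start, stop;
               gettimeofday(&start, 0);
               for (int i = 0; i < n; i += stride)
                   a[i] *= 2;   /* one pass; fewer touches as stride grows */
               gettimeofday(&stop, 0);
               long us = stop.tv_usec - start.tv_usec
                       + 1000000 * (stop.tv_sec - start.tv_sec);
               printf("stride %2d: %ld us\n", stride, us);
           }
           free(a);
           return 0;
       }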

3.3 Reading from Cache

In []: %%file cache_size.c
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/time.h>
       int main(int argc, char *argv[]) {
           size_t m = 1 << (argc > 1 ? atoi(argv[1]) : 1);
           /* [the rest of this cell was lost in conversion: it repeatedly
              walks a working set of m elements and prints the elapsed
              time to stderr] */
       }

In []: %%sh
       make cache_size CFLAGS='-std=c99 -O0'
       for i in `seq 1 ...`
       do
           ./cache_size $i
       done 2> cache_size.timings
       echo ok

cc -std=c99 -O0    cache_size.c   -o cache_size
ok

In [43]: dat = np.fromfile('cache_size.timings', dtype=np.int, sep=' ')
         plot(dat)
Out [43]: [<matplotlib.lines.Line2D at ...>]
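Since the original benchmark is truncated, here is a minimal sketch of the usual technique (file name and constants are mine): time passes over working sets of doubling size; the cost per touched line jumps whenever the set outgrows a cache level.

In []: %%file cache_probe.c
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/time.h>

       int main(void) {
           for (size_t m = 1 << 12; m <= (size_t)1 << 26; m <<= 1) {
               char *buf = calloc(m, 1);
               if (!buf) return 1;
               struct timeval start, stop;
               gettimeofday(&start, 0);
               /* 32 passes amortize first-touch page faults and make
                  small working sets take measurable time */
               for (int pass = 0; pass < 32; pass++)
                   for (size_t i = 0; i < m; i += 64)  /* one touch per line */
                       buf[i]++;
               gettimeofday(&stop, 0);
               long us = stop.tv_usec - start.tv_usec
                       + 1000000 * (stop.tv_sec - start.tv_sec);
               printf("%8zu KB: %.2f ns/line\n",
                      m / 1024, 1000.0 * us / (32.0 * m / 64));
               free(buf);
           }
           return 0;
       }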

3.4 Reading from Disk

In [44]: %%sh
         dd if=/dev/zero of=big.dat bs=1024 count=400000 2> /dev/null
         time dd if=big.dat of=/dev/null bs=64k
         time dd if=big.dat of=/dev/null bs=64k
         rm -f big.dat

6250+0 records in
6250+0 records out
409600000 bytes (410 MB) copied, 0.0532299 s, 7.7 GB/s
0.00user 0.04system 0:00.05elapsed 96%CPU (0avgtext+0avgdata 868maxresident)k
0inputs+0outputs (0major+266minor)pagefaults 0swaps
6250+0 records in
6250+0 records out
409600000 bytes (410 MB) copied, 0.0430332 s, 9.5 GB/s
0.00user 0.04system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 872maxresident)k
0inputs+0outputs (0major+266minor)pagefaults 0swaps

Both timings are far too fast for a disk: the file is served from the page cache. Cache control through /proc/sys/vm/drop_caches!

In []: %%sh
       dd if=/dev/zero of=/tmp/big.dat bs=1024 count=400000 2> /dev/null
       time dd if=/tmp/big.dat of=/dev/null bs=64k
       time dd if=/tmp/big.dat of=/dev/null bs=64k
       rm -f /tmp/big.dat
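A sketch of how that knob is typically used, not shown in the original run (requires root): flush the page cache so the next read actually hits the disk.

In []: %%sh
       # sketch (needs root): drop the page cache so the read below is cold;
       # '3' drops the page cache plus dentries and inodes
       dd if=/dev/zero of=/tmp/big.dat bs=1024 count=400000 2> /dev/null
       sync
       echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
       time dd if=/tmp/big.dat of=/dev/null bs=64k   # served by the disk this time
       rm -f /tmp/big.dat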

And then comes swapping...

Part III

4 Memory and Parallelism

4.1 False Cache Sharing

In [45]: %%file cache_sharing.c
         #ifndef PADDING
         #define PADDING 1
         #endif
         #include <stdio.h>
         #include <stdlib.h>
         #include <omp.h>
         int main() {
             int n = 100000000;
             /* max_threads: omp_get_num_threads() is 1 outside a parallel region */
             int *local_count = calloc(omp_get_max_threads()*PADDING, sizeof(int));
             int *vector = malloc(sizeof(int)*n); /* contents don't matter here */
             int sum = 0;
             if(!local_count || !vector) return 1;
             double start = omp_get_wtime();
             #pragma omp parallel
             {
                 int tid = omp_get_thread_num()*PADDING;
                 #pragma omp for
                 for(int j = 0; j < n; j++)
                     local_count[tid] += vector[j]*2;
                 #pragma omp master
                 for(int k = 0; k < omp_get_num_threads(); k++)
                     sum += local_count[k*PADDING];
             }
             /* [the end of this cell and its benchmark runs were lost in
                conversion; presumably it prints omp_get_wtime() - start] */
         }
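The benchmark runs for this cell did not survive conversion; a sketch of how they would go, in the style of the other cells (the flags are assumptions). With PADDING=1 all per-thread counters share one cache line, so every increment invalidates the other cores' copy of that line; a padding of 16 ints (64 bytes) gives each counter its own line.

In []: %%sh
       # hypothetical runs: same code, two memory layouts
       make -B cache_sharing CFLAGS='-std=c99 -fopenmp -DPADDING=1'
       ./cache_sharing    # counters share a cache line: slow
       make -B cache_sharing CFLAGS='-std=c99 -fopenmp -DPADDING=16'
       ./cache_sharing    # one counter per 64-byte line: fast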

4.2 Data Transfer

In [49]: %%file data_transfer.c
         #include <mpi.h>
         #include <stdio.h>
         #include <stdlib.h>
         int main(int argc, char *argv[]) {
             MPI_Init(&argc, &argv);
             int rank, size, tag = 42;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);
             int n = argc > 1 ? atoi(argv[1]) : 100;
             float* dat = malloc(sizeof(float)*n);
             if(!dat || size != 2) { MPI_Finalize(); return 1; }
             if(rank == 0) {
                 for(int i = 0; i < n; ++i)
                     MPI_Send(dat+i, 1, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
             }
             else /* rank == 1 */ {
                 MPI_Status status;
                 for(int i = 0; i < n; ++i)
                     MPI_Recv(dat+i, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
             }
             MPI_Finalize();
             free(dat);
             return 0;
         }

Overwriting data_transfer.c

In [50]: %%sh
         make data_transfer CC=mpicc CFLAGS=-std=c99
         time mpirun -np 2 ./data_transfer 1000000

mpicc -std=c99    data_transfer.c   -o data_transfer

0.58user 0.02system 0:01.34elapsed 44%CPU (0avgtext+0avgdata 24876maxresident)k
9496inputs+33200outputs (82major+15009minor)pagefaults 0swaps

In [51]: %%sh
         LOG=data_transfer.timings
         rm -f $LOG
         for i in `python -c "print ' '.join(str(2**i) for i in range(1, 20))"`
         do
             /usr/bin/time -a -f '%U' -o $LOG mpirun -np 2 ./data_transfer $i
         done
         wc -l $LOG

19 data_transfer.timings

In [52]: data_transfer = np.fromfile("data_transfer.timings", dtype=np.float, sep=" ")
         x = [(2**i) for i in range(1, 20)]
         plot(x, data_transfer)
         xscale('log')
         show()
         plot(x, data_transfer)
         show()

The first timings are dominated by the per-message overhead! Let's send the whole buffer at once instead of one element at a time:

In [53]: %%file fast_data_transfer.c
         #include <mpi.h>
         #include <stdio.h>
         #include <stdlib.h>
         int main(int argc, char *argv[]) {
             MPI_Init(&argc, &argv);
             int rank, size, tag = 42;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);
             int n = argc > 1 ? atoi(argv[1]) : 100;
             float* dat = malloc(sizeof(float)*n);
             if(!dat || size != 2) { MPI_Finalize(); return 1; }
             if(rank == 0) {
                 MPI_Send(dat, n, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
             }
             else /* rank == 1 */ {
                 MPI_Status status;
                 MPI_Recv(dat, n, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
             }
             MPI_Finalize();
             free(dat);
             return 0;
         }

Overwriting fast_data_transfer.c

In [54]: %%sh
         make fast_data_transfer CC=mpicc CFLAGS=-std=c99
         time mpirun -np 2 ./fast_data_transfer 1000000

mpicc -std=c99    fast_data_transfer.c   -o fast_data_transfer

0.03user 0.00system 0:01.03elapsed 3%CPU (0avgtext+0avgdata 8600maxresident)k
0inputs+656outputs (5major+6675minor)pagefaults 0swaps

In [55]: %%sh
         LOG=fast_data_transfer.timings
         rm -f $LOG
         for i in `python -c "print ' '.join(str(2**i) for i in range(1, 20))"`
         do
             /usr/bin/time -a -f '%U' -o $LOG mpirun -np 2 ./fast_data_transfer $i
         done
         wc -l $LOG

19 fast_data_transfer.timings

In [56]: fast_data_transfer = np.fromfile("fast_data_transfer.timings", dtype=np.float, sep=" ")
         x = [(2**i) for i in range(1, 20)]
         plot(x, data_transfer, '-o')
         plot(x, fast_data_transfer, '-x')
         xscale('log')
         show()
         plot(x, data_transfer, '-o')
         plot(x, fast_data_transfer, '-x')
         show()

Aggregating the whole buffer into a single message amortizes the per-message latency: user time drops from 0.58 s to 0.03 s for a million floats.

Part IV

References

• Lecture on Computer Architecture [http://www.fb9dv.uni-duisburg.de/vs/en/education/dv3/lecture/freinatis/LectureCAall-slides.pdf]
• Gallery of Processor Cache Effects [http://igoro.com/archive/gallery-of-processor-cache-effects/]
• What Every Programmer Should Know About Memory [http://lwn.net/Articles/250967/]
• Numbers Everyone Should Know [http://highscalability.com/numbers-everyone-should-know]