==Phrack Inc.==
Volume 0x0d, Issue 0x42, Phile #0x0D of 0x11
|=----------------------------------------------------------------------=|
|=---------=[ Hacking the Cell Broadband Engine Architecture ]=---------=|
|=-------------------=[ SPE software exploitation ]=--------------------=|
|=----------------------------------------------------------------------=|
|=--------------=[ By BSDaemon ]=----------=|
|=--------------=[ <bsdaemon *noSPAM* risesecurity_org> ]=----------=|
|=----------------------------------------------------------------------=|
"There are two ways of
constructing a software design.
One way is to make it so simple
that there are obviously no
deficiencies. And the other way
is to make it so complicated that
there are no obvious deficiencies"
- C.A.R. Hoare
------[ Index
1 - Introduction
1.1 - Paper structure
2 - Cell Broadband Engine Architecture
2.1 - What is Cell
2.2 - Cell History
2.2.1 - Problems it solves
2.2.2 - Basic Design Concept
2.2.3 - Architecture Components
2.2.4 - Processor Components
2.3 - Debugging Cell
2.3.1 - Linux on Cell
2.3.2 - Extensions to Linux
2.3.2.1 - User-mode
2.3.2.2 - Kernel-mode
2.3.3 - Debugging the SPE
2.4 - Software Development for Linux on Cell
2.4.1 - PPE/SPE hello world
2.4.2 - Standard Library Calls from SPE
2.4.3 - Communication Mechanisms
2.4.4 - Memory Flow Control (MFC) Commands
2.4.5 - Direct Memory Access (DMA) Commands
2.4.5.1 - Get/Put Commands
2.4.5.2 - Resources
2.4.5.3 - SPE 2 SPE Communication
3 - Exploiting Software Vulnerabilities on Cell SPE
3.1 - Memory Overflows
3.1.1 - SPE memory layout
3.1.2 - SPE assembly basics
3.1.2.1 - Registers
3.1.2.2 - Local Storage Addressing Mode
3.1.2.3 - External Devices
3.1.2.4 - Instruction Set
3.1.3 - Exploiting software vulnerabilities in SPE
3.1.3.1 - Avoiding Null Bytes
3.1.4 - Finding software vulnerabilities on SPE
4 - Future and other uses
5 - Acknowledgements
6 - References
7 - Notes on SDK/Simulator Environment
8 - Sources
------[ 1 - Introduction
This article is all about the Cell Broadband Engine Architecture [1], new
hardware designed by a joint effort between Sony [2], Toshiba [3] and IBM
[4]. As such, many architecture details will be explained, along with the
main development differences for this platform.
The biggest difference between this article and others released on this
subject is its focus on exploiting the architecture rather than on using
the powerful processor resources to break code [5]. It also concentrates
on what sets the architecture apart, which means the SPU (synergistic
processor unit) and not the core (PPU - power processor unit) [6], since
the core is a slightly modified Power processor (which means that all
shellcodes for Linux on Power will also work on the core, with only small
differences in code allocation and things like that).
It's important to mention that everything here about Cell focuses on the
Playstation3 hardware, since it's cheap and widely deployed, but there are
also big machines built with this processor [7], including the #1 in the
list of supercomputers [8].
---[ 1.1 - Paper structure
The idea of this paper is to complete the studies about Cell by putting
together all the information needed to do security research focused on
software exploitation for this architecture.
For that, the paper has been structured in two main parts:
Chapter 2 is all about the Cell architecture and how to develop for it.
It includes many samples and explains the modifications made to Linux in
order to get the best from this architecture. It also gives the knowledge
needed to go further into software exploitation for this arch. Chapter 3
focuses on the exploitation of the SPU processor, showing the simple
memory layout it has and how to write shellcode to gain control over an
application running inside the SPU.
------[ 2 - Cell Broadband Engine Architecture
From the IBM Research [9]: "The Cell Architecture grew from a challenge
posed by Sony and Toshiba to provide power-efficient and cost-effective
high-performance processing for a wide range of applications, including
the most demanding consumer appliance: game consoles. Cell - also known as
the Cell Broadband Engine Architecture (CBEA) - is an innovative solution
whose design was based on the analysis of a broad range of workloads in
areas such as cryptography, graphics transform and lighting, physics,
fast-Fourier transforms (FFT), matrix operations, and scientific
workloads. As an example of innovation that ensures the clients' success,
a team from IBM Research joined forces with teams from IBM Systems
Technology Group, Sony and Toshiba, to lead the development of a novel
architecture that represents a breakthrough in performance for consumer
applications. IBM Research participated throughout the entire development
of the architecture, its implementation and its software enablement,
ensuring the timely and efficient application of novel ideas and
technology into a product that solves real challenges."
It's impossible not to get excited about this. Such a 'powerful' and
versatile architecture, completely different from what we usually see, is
an amazing target for software vulnerability research. Also, since it's
supposed to be widely deployed, there should be plenty of new
vulnerabilities coming in the near future. I wanted to exploit those
vulnerabilities.
---[ 2.1 - What is Cell
As must already be clear to the reader, I'm not talking about phones here.
Cell is a new architecture, which comes to solve some of the current
problems in the computer industry.
It's compatible with a well-known architecture, the Power Architecture,
keeping most of its advantages and solving most of its problems (if you
cannot wait to know which problems, go to section 2.2.1).
---[ 2.2 - Cell History
The focus of this section is just to give the reader a timeline, without
going into detail.
The architecture was born from a joint effort between IBM, Sony and
Toshiba, formed in 2000.
They opened a design center in March 2001, based in Austin, Texas (USA).
In the spring of 2004, a single Cell BE became operational. In the summer
of the same year, a 2-way SMP version was released.
The first technical disclosures came only in February 2005, with the
simulator [10] and open-source SDK [11] (more on that later) being
released in November of the same year. In the same month, Mercury started
to sell Cell (yeah, sell Cell sounds funny) machines.
Cell Blades were announced by IBM in February 2006. SDK 1.1 was released
in July of the same year, with many improvements. The latest version is
3.1.
---[ 2.2.1 - Problems it solves
Computer technology has been evolving over the years, but it has always
run into some barriers and tried to avoid them.
Those barriers cannot physically be bypassed, which is why processor
clocks stopped growing and the focus moved to multi-core architectures.
Basically we have three big walls (barriers) to speed growth:
- Power wall
It's related to the limits of CMOS technology and the hard limit on
the acceptable system power
- Memory wall
DRAM latency is high when compared to the processor frequency, and
many improvements have tried to hide it
- Frequency wall
Diminishing returns from deeper pipelines
For a new architecture to work and be widely deployed, it was also
important to preserve the existing investments in software development.
Cell accomplishes that by being compatible with the 64-bit Power
Architecture, and attacks the walls in the following ways:
- Non-homogeneous coherent multi-processor and high design
frequency at a low operating voltage with advanced power
management attack the 'power wall'.
- Streaming DMA architecture and three-level memory model (main
storage, local storage and register files) attack the 'memory
wall'.
- Non-homogeneous coherent multi-processor, highly-optimized
implementation and large shared register files with software controlled
branching to allow deeper pipelines attack the 'frequency wall'.
It has been developed to support any OS, which means it supports
real-time operating systems as well as non-real-time operating systems.
---[ 2.2.2 - Basic Design Concept
The basic concept behind Cell is its asymmetric multi-core design. That
permits a powerful design, but of course requires specifically developed
applications to get the most out of the architecture.
Knowing that, it becomes clear that understanding the new component,
which is called SPU (synergistic processor unit) or SPE (synergistic
processor element), proves to be essential - see the next section for a
better understanding of the differences between SPU and SPE.
---[ 2.2.3 - Architecture Components
In Cell what we have is a core processor, called the Power Processor
Element (PPE), which controls tasks, and synergistic processor elements
(SPEs) for data-intensive processing.
The SPE consists of the synergistic processor unit (SPU), which is a
processor in itself, and the memory flow control (MFC), responsible for
data movement and synchronization, as well as for the interface with the
high-performance element interconnect bus (EIB).
Communication with the EIB is done at 16B/cycle, which means that each
SPU is connected at that speed to the bus, which supports 96B/cycle.
Refer to the picture architecture-components.jpg in the images directory
of the attached file for a visual of the above explanation.
---[ 2.2.4 - Processor Components
As said, the Power Processor Element (PPE) is the core processor which
controls tasks (scheduling). It is a general purpose 64-bit RISC
processor (Power architecture).
It's 2-way hardware multithreaded, with 32KB L1 instruction and data
caches and a 512KB L2 cache.
It has support for real-time operations, like locking the L2 cache and
the TLB (the TLB can be managed both by hardware and by software). It has
bandwidth and resource reservation and mediated interrupts.
It's also connected to the EIB using a 16B/cycle channel (figure
processor-components.jpg).
The EIB itself supports four 16-byte data rings with simultaneous
transfers per ring (it will be clarified later).
This bus supports over 100 simultaneous transactions, achieving more than
25.6 Gbytes/sec in each direction at each bus data port.
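As a quick sanity check of the 25.6 Gbytes/sec figure, the sketch below
multiplies the 16-byte port width by the bus clock. It assumes a 3.2 GHz
core clock with the EIB running at half that rate; neither number is
stated above, and the macro and function names are made up for this
illustration.

```c
#include <stdint.h>

/* Assumption: 3.2 GHz core clock, EIB clocked at half the core speed. */
#define CORE_CLOCK_HZ   3200000000ULL
#define EIB_CLOCK_HZ    (CORE_CLOCK_HZ / 2)
#define EIB_PORT_BYTES  16ULL   /* 16B/cycle per port, per direction */

/* Peak bytes per second through one EIB data port, in one direction. */
static uint64_t eib_port_bandwidth(uint64_t bus_hz, uint64_t bytes_per_cycle)
{
    return bus_hz * bytes_per_cycle;
}
```

Under these assumptions, eib_port_bandwidth(EIB_CLOCK_HZ, EIB_PORT_BYTES)
gives 25,600,000,000 bytes/sec, which is the 25.6 Gbytes/sec quoted for
each port.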
On the other side, the synergistic processor element is a simple RISC
user-mode architecture supporting dual-issue VMX-like, graphics SP-float
and IEEE DP-float.
It's important to note that the SPE itself has dedicated resources: a
unified 128 x 128-bit register file and 256KB of local storage. Each SPE
has a dedicated DMA engine, supporting 16 requests.
The memory management on this architecture simplifies its use, with the
local storage of the SPE being aliased into the PPE system memory (figure
processor-components2.jpg).
The MFC in the SPE acts as the MMU, providing control over the SPE DMA
accesses. It's compatible with the PowerPC virtual memory layout and is
software controllable using PPE MMIO.
DMA access supports transfers of 1, 2, 4, 8 or n*16 bytes, with a maximum
of 16 KB per transfer, and with two different queues for DMA commands:
Proxy & SPU (more on this later).
The EIB is also connected to a broadband interface controller (BIC). The
purpose of this controller is to provide external connectivity for
devices. It supports two configurable interfaces (60 GB/s) with a
configurable number of bytes, coherent (BIF) and/or I/O (IOIFx) protocols,
using two virtual channels per interface, and multiple system
configurations.
The memory interface controller (MIC) is also connected to the EIB and is
a dual XDR controller (25.6 GB/s) with ECC and suspended DRAM support
(figure processor-components3.jpg).
Two more components are still missing: the internal interrupt controller
(IIC) and the I/O Bus Master Translation (IOT) (figure
processor-components4.jpg).
The IIC handles the SPE interrupts as well as the external interrupts and
interrupts coming from the coherent interconnect and from IOIF0 and IOIF1.
It is also responsible for the interrupt priority level control and for
the interrupt generation ports for IPI. Note that the IIC is duplicated
for each PPE hardware thread.
The IOT translates bus addresses to system real addresses, supporting two
levels of translation:
- I/O segments (256 MB)
- I/O pages (4K, 64K, 1M, 16M bytes)
Also interesting are the per-page I/O device identifier for LPAR use
(blades) and the IOST/IOPT caches managed by software and hardware.
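To make the two translation levels concrete, here is a minimal sketch
that splits an I/O bus address into a 256 MB segment index, a page index
inside that segment and a byte offset inside the page. Only the sizes
come from the text; the struct, the function name and the flat bit layout
are assumptions for illustration and do not model the real IOST/IOPT
entry formats.

```c
#include <stdint.h>

#define IO_SEGMENT_SHIFT 28u   /* 256 MB segments = 2^28 bytes */

/* Hypothetical decomposition of an I/O bus address. */
typedef struct {
    uint32_t segment;  /* index of the 256 MB I/O segment */
    uint32_t page;     /* page index inside that segment  */
    uint32_t offset;   /* byte offset inside the page     */
} io_addr_parts_t;

/* page_shift selects the page size: 12 (4K), 16 (64K), 20 (1M), 24 (16M). */
static io_addr_parts_t io_addr_split(uint64_t bus_addr, unsigned page_shift)
{
    io_addr_parts_t p;
    uint64_t in_segment = bus_addr & ((1ULL << IO_SEGMENT_SHIFT) - 1);

    p.segment = (uint32_t)(bus_addr >> IO_SEGMENT_SHIFT);
    p.page    = (uint32_t)(in_segment >> page_shift);
    p.offset  = (uint32_t)(bus_addr & ((1ULL << page_shift) - 1));
    return p;
}
```

For example, with 4K pages the address 0x12345678 falls in segment 1,
page 0x2345, offset 0x678.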
---[ 2.3 - Debugging Cell
As the bus is a high-speed circuit, it's really difficult to debug the
architecture and get a good view of what is going on.
For that, and also to make it easier to develop software for Cell, IBM
Research developed a Cell simulator [10] in which you may run Linux and
install the software development kit [11].
The Brazilian team of the IBM Linux Technology Center developed an Eclipse
plugin that works as an IDE for the debugger and SDK. Putting it all
together, it is possible to have the toolkit installed on a Linux machine,
running the frontends for the simulator and for the SDK. The debugging
interface is much better using these frontends. Anyway, it's important to
notice that they are just frontends for the normal and well-known Linux
tools with extended support for the Cell processor (GDB and GCC).
---[ 2.3.1 - Linux on Cell
Linux on Cell is an open-source git branch and is provided in the PowerPC
64 kernel line.
It started in 2.6.15 and is evolving to support many new features, like
the scheduling improvements for the SPUs (they can now be preempted, and
my big friend Andre Detsch, who reviewed this article, was one of the
biggest contributors to making this code stable).
Linux gained a heterogeneous lwp/thread model, with a new SPE thread
model (really similar to the pthreads library, as we will see later)
supporting user-mode direct and indirect SPE access and fully preemptive
SPE context management. For that, spe_ptrace() was created and its
support added to GDB, along with spe_schedule() for thread to physical
SPE assignment (it is no longer FIFO - run until completion).
As a note, the SPE threads share their address space with the parent PPE
process (using DMA), with demand paging for SPE accesses and a hardware
page table shared with the PPE.
An implementation detail is the PPE proxy thread allocated for each SPE
to provide a single namespace for both PPE and SPE and to assist in
SPE-initiated C99 and POSIX library services.
All the event, error and signal handling for SPEs is done by the parent
PPE thread.
The ELF objects for the SPE are wrapped into PPE objects with an extended
GLD.
---[ 2.3.2 - Extensions to Linux
Here I'll try to provide some details about Linux running on Cell
hardware. The base hardware used for this reference is a Playstation 3,
which has 8 SPUs, but one is reserved for redundancy and another one is
reserved by the hypervisor that hosts a custom OS (in this case, Linux).
All the details are valid for any Linux on Cell, and we will take a
top-down approach.
---[ 2.3.2.1 - User-mode
Cell supports both 32 and 64-bit Power applications, as well as 32 and
64-bit Cell workloads. It has different programming modes, like RPC,
device subsystems and direct/indirect access.
As already said, it has heterogeneous threads: single SPU, SPU groups and
shared memory support.
It runs on top of an SPE management runtime library, available in 32 and
64-bit versions. This library interacts with the SPUFS filesystem
(/spu/thread#/) in the following ways:
* Open, close, read, write the files:
- mem
This file provides access to the local storage
- regs
Access to the 128 registers of 128 bits each
- mbox
spe to ppe mailbox
- ibox
spe to ppe interrupt mailbox
- xbox_stat
Get the mailbox status
- signal1
Signal notification access
- signal2
Signal notification access
- signalx_type
Signal type
- npc
Read/write SPE next program counter (for debugging)
- fpcr
SPE floating point control/status register
- decr
SPE decrementer
- decr_status
SPE decrementer status
- spu_tag_mask
Access tag query mask
- event_mask
Access spe event mask
- srr0
Access spe state restore register 0
* Open, close, mmap the files:
- mem
Program State access of the Local Storage
- signal1
Direct application access to signal 1
- signal2
Direct application access to signal 2
- cntl
Direct application access to SPE controls, DMA queues and
mailboxes
The library also provides SPE task control system calls (to interact with
the SPE system calls implemented in kernel-mode), which are:
- sys_spu_create_thread
Allocates an SPE task/context and creates a directory in SPUFS
- sys_spu_run
Activates an SPU task/context on a physical SPE and
blocks in the kernel as a proxy thread to handle the events
already mentioned
Some functions provided by the library relate to the management of SPE
tasks, like spe create group, create thread, get/set affinity,
get/set context, get event, get group, get ls, get ps area, get threads,
get/set priority, get policy, set group defaults, group max, kill/wait,
open/close image, write signal, read in_mbox, write out_mbox, read mbox
status.
Besides the standard 32 and 64-bit PowerPC ELF (binary) interpreters, an
SPE object loader is provided. It is responsible for understanding the
extension to the normal objects already mentioned and for initiating the
loading of the SPE threads.
Going down, we have glibc and the other GNU libraries, both supporting 32
and 64 bits.
---[ 2.3.2.2 - Kernel-mode
The next layer is the normal system-call interface, where we have the SPU
management framework (through special files in spufs) and modifications
to the exec* interface, in a 64-bit kernel.
This modification is done through a special misc binary format, called
the SPU object loader extension.
Of course there are other kernel extensions: the SPUFS filesystem, which
provides the management interface, and the SPU allocation, scheduling and
dispatch.
Also, we have the Cell BE architecture-specific code, supporting multiple
and large pages, SPE event & fault handling, IIC and IOMMU.
Everything is controlled by a hypervisor, since Linux is what is called a
custom OS when running on Playstation3 hardware (the hypervisor is
responsible for protecting the 'secret key' of the hardware, and knowing
how to exploit SPU vulnerabilities plus some fuzzing of the hypervisor
may be the knowledge needed to break the game copy protection on this
hardware).
---[ 2.3.3 - Debugging the SPE
The SDK for Linux on Cell provides good resources for debugging and for
better understanding what is going on.
It's important to note the environment variables that control the
behaviour of the system.
So, if you set SPU_INFO, for example, the SPE runtime library will print
messages when loading an SPE ELF executable (see below).
---------- begin output ----------
# export SPU_INFO=1
# ./test
Loading SPE program: ./test
SPU LS Entry Addr : XXX
---------- end output ----------
And it will also print messages before starting up a new SPE thread, like:
---------- begin output ----------
Starting SPE thread 0x..., to attach debugger use: spu-gdb -p XXX
---------- end output ----------
When planning to use spu-gdb to debug an SPU thread, it's important to
remember the SPU_DEBUG_START environment variable, which includes
everything provided by SPU_INFO and also stops the thread until a
debugger is attached or a signal is received.
Since each SPU register can hold multiple fixed (or floating) point values
of different sizes, GDB presents it as a data structure that can be
accessed in different formats. So, by specifying the field in the data
structure, we can also update it using different sizes:
---------- begin output ----------
(gdb) ptype $r70
type = union __gdb_builtin_type_vec128 {
int128_t uint128;
float v4_float[4];
int32_t v4_int32[4];
int16_t v8_int16[8];
int8_t v16_int8[16];
}
(gdb) p $r70.uint128
$1 = 0x00018ff000018ff000018ff000018ff0
(gdb) set $r70.v4_int32[2]=0xdeadbeef
(gdb) p $r70.uint128
$2 = 0x00018ff000018ff0deadbeef00018ff0
---------- end output ----------
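The overlapping views from the session above can be reproduced on any
host with a small model of a 128-bit register. This is only a sketch: the
vec128_t type and its helpers are hypothetical names, and the bytes are
stored in big-endian order by hand so that the result matches what
spu-gdb prints regardless of the host endianness.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical host-side model of one 128-bit SPU register.
 * Bytes are kept most significant first, as on the SPU. */
typedef struct {
    uint8_t b[16];
} vec128_t;

/* Write 32-bit slot idx (0..3), like $rN.v4_int32[idx] in gdb. */
static void vec128_set_word(vec128_t *v, unsigned idx, uint32_t val)
{
    uint8_t *p = &v->b[(idx & 3u) * 4u];
    p[0] = (uint8_t)(val >> 24);
    p[1] = (uint8_t)(val >> 16);
    p[2] = (uint8_t)(val >> 8);
    p[3] = (uint8_t)val;
}

/* Read 32-bit slot idx (0..3). */
static uint32_t vec128_get_word(const vec128_t *v, unsigned idx)
{
    const uint8_t *p = &v->b[(idx & 3u) * 4u];
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read 16-bit slot idx (0..7), like $rN.v8_int16[idx] in gdb. */
static uint16_t vec128_get_half(const vec128_t *v, unsigned idx)
{
    const uint8_t *p = &v->b[(idx & 7u) * 2u];
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Format the whole register as one hex number; out holds 35 bytes. */
static void vec128_to_hex(const vec128_t *v, char out[35])
{
    int n = sprintf(out, "0x");
    for (unsigned i = 0; i < 16; i++)
        n += sprintf(out + n, "%02x", v->b[i]);
}
```

Filling all four word slots with 0x00018ff0 and then writing 0xdeadbeef
to slot 2 yields 0x00018ff000018ff0deadbeef00018ff0, the same value shown
by gdb, and the 16-bit slots 4 and 5 then read 0xdead and 0xbeef.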
To let you see exactly when the SPU code starts executing and follow it,
gdb also includes an interesting option:
---------- begin output ----------
(gdb) set spu stop-on-load
(gdb) run
...
(gdb) info registers
---------- end output ----------
Another important point when debugging your code is to understand the
internal sizes and be prepared for overlapping. Useful information can
be obtained using the following code fragment inside your SPU program
(careful: it does not free the allocated memory).
--- code ---
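#include <stdio.h>
#include <stdlib.h>   /* malloc() */
#include <unistd.h>   /* sbrk() */

/* Linker-provided symbols marking the end of each section */
extern int _etext;
extern int _edata;
extern int _end;

void meminfo(void)
{
    printf("\n&_etext: %p", (void *) &_etext);
    printf("\n&_edata: %p", (void *) &_edata);
    printf("\n&_end: %p", (void *) &_end);
    printf("\nsbrk(0): %p", sbrk(0));
    /* Intentionally not freed: we only want to see the heap grow */
    printf("\nmalloc(1024): %p", malloc(1024));
    printf("\nsbrk(0): %p", sbrk(0));
}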
--- end code ---
And of course you can also play with the GCC and LD arguments to have more
debugging info:
--- code ---
# vi Makefile
CFLAGS += -g
LDFLAGS += -Wl,-Map,map_filename.map
--- end code ---
---[ 2.4 - Software Development for Linux on Cell
In this chapter I will introduce the internals of Cell development,
giving the basic knowledge necessary to better understand the following
chapters.
---[ 2.4.1 - PPE/SPE hello world
Every program on Cell that uses the SPEs needs at least two source files:
one for the PPE and another one for the SPE.
Following is a simple program to run on the SPE (it's also in the attached
tar file):
--- code ---
#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    printf("\nHello World!\n");
    return 0;
}
--- end code ---
The Makefile for this code will look like:
--- code ---
PROGRAM_spu = hello_spu
LIBRARY_embed = hello_spu.a
IMPORTS = $(SDKLIB_spu)/libc.a
include $(TOP)/make.footer
--- end code ---
Of course it looks like any normal code. The PPE, as already explained,
is responsible for creating the new thread and allocating it on the SPE:
--- code ---
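#include <stdio.h>
#include <libspe.h>

/* Handle to the SPE program embedded from hello_spu.a */
extern spe_program_handle_t hello_spu;

int main(void)
{
    speid_t speid;
    int status;

    /* Create the SPE thread; argp and envp are NULL */
    speid = spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);
    /* Block until the SPE thread finishes */
    spe_wait(speid, &status, 1);
    return 0;
}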
--- end code ---
With the following Makefile:
--- code ---
DIRS = spu
PROGRAM_ppu = hello_ppu
IMPORTS = ../spu/hello_spu.a -lspe
include $(TOP)/make.footer
--- end code ---
The reader will notice that the speid in the PPE program has the same
value as the speid in the main of the SPE.
Also, the arguments passed to spe_create_thread() are the ones received
by the SPE program when running (argp and envp are NULL in our sample).
It's important to remember that compiling this program generates a binary
called hello_spu in the spu directory and another one called hello_ppu in
the root directory of this example, which CONTAINS the embedded
hello_spu.
---[ 2.4.2 - Standard Library Calls from SPE
When the SPE program needs to use any standard library call, like printf
or exit for example, it has to call back to the PPE main thread.
It uses a simple stop-and-signal assembly instruction with standardized
argument values (important to remember, since it's needed in shellcodes
for the SPE).
That value is returned from the ioctl call and the user thread must react
to it. This means copying the arguments from the SPE local storage,
executing the library call and then calling ioctl again.
The instruction according to the manual:
"stop u14 - Stop and signal. Execution is stopped, the current
address is written to the SPU NPC register, the value u14 is
written to the SPU status register, and an interrupt is sent to
the PPU."
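The 14-bit operand is easy to picture at the bit level. The sketch below
assumes, from my reading of the SPU instruction set, that 'stop' has an
all-zero opcode in the upper 11 bits and carries u14 in the low 14 bits
of the 32-bit instruction word. The helper names are hypothetical and the
code 0x2100 is only an example value.

```c
#include <stdint.h>

#define SPU_STOP_CODE_MASK 0x3FFFu   /* u14: 14-bit stop-and-signal type */

/* Encode "stop u14". With an all-zero opcode (assumption, see above)
 * the instruction word is just the 14-bit code. */
static uint32_t spu_encode_stop(uint32_t code)
{
    return code & SPU_STOP_CODE_MASK;
}

/* Extract the signal type that ends up in the SPU status register. */
static uint32_t spu_stop_code(uint32_t insn)
{
    return insn & SPU_STOP_CODE_MASK;
}

/* True when the upper 11 opcode bits are all zero. */
static int spu_is_stop(uint32_t insn)
{
    return (insn >> 21) == 0;
}
```

Under this assumption, spu_encode_stop(0x2100) is the word 0x00002100,
whose big-endian bytes are 00 00 21 00. A plain 'stop', as seen at the
end of a disassembled function, is the all-zero word.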
This is a disassembly output of the hello_spu program:
---------- begin output ----------
# spu-gdb ./hello_spu
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "--host=powerpc64-unknown-linux-gnu --target=spu"...
(gdb) disassemble main
Dump of assembler code for function main:
0x00000170 <main+0>: ila $3,0x340 <.rodata>
0x00000174 <main+4>: stqd $0,16($1)
0x00000178 <main+8>: nop $127
0x0000017c <main+12>: stqd $1,-32($1)
0x00000180 <main+16>: ai $1,$1,-32
0x00000184 <main+20>: brsl $0,0x1a0 <puts> # 1a0
0x00000188 <main+24>: ai $1,$1,32 # 20
0x0000018c <main+28>: fsmbi $3,0
0x00000190 <main+32>: lqd $0,16($1)
0x00000194 <main+36>: bi $0
0x00000198 <main+40>: stop
0x0000019c <main+44>: stop
End of assembler dump.
(gdb)
---------- end output ----------
---[ 2.4.3 - Communication Mechanisms
The architecture offers three main communication mechanisms:
- DMA
Used to move data and instructions between main storage and
a local storage. SPEs rely on asynchronous DMA transfers to hide
memory latency and transfer overhead by moving information in
parallel with SPU computation.
- Mailbox
Used for control communications between an SPE and the
PPE or other devices. Mailboxes hold 32-bit messages. Each
SPE has two mailboxes for sending messages and one mailbox for
receiving messages.
- Signal Notification
Used for control communications from the PPE or
other devices. Signal notification (also known as signalling)
uses 32-bit registers that can be configured for
one-sender-to-one-receiver signalling or
many-senders-to-one-receiver signalling.
All three are controlled and implemented by the SPE MFC, and their
importance is related to the way the vulnerable program will receive its
input.
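Since mailboxes are one of the ways input reaches an SPE program, it
helps to picture one as a tiny FIFO of 32-bit words. The model below is a
host-side sketch only: mbox_t and its functions are hypothetical names,
not part of libspe or the SDK, and the queue depth is a parameter because
the depth of each hardware mailbox is not given above.

```c
#include <stdint.h>

#define MBOX_MAX_DEPTH 4u   /* assumption: deepest queue this model holds */

/* Hypothetical model of a mailbox: a small FIFO of 32-bit messages. */
typedef struct {
    uint32_t slot[MBOX_MAX_DEPTH];
    unsigned depth;   /* capacity of this mailbox      */
    unsigned head;    /* index of the oldest message   */
    unsigned count;   /* messages currently queued     */
} mbox_t;

static void mbox_init(mbox_t *m, unsigned depth)
{
    m->depth = (depth == 0 || depth > MBOX_MAX_DEPTH) ? MBOX_MAX_DEPTH : depth;
    m->head = 0;
    m->count = 0;
}

/* Returns 1 on success, 0 when full (a real writer would stall or poll). */
static int mbox_write(mbox_t *m, uint32_t msg)
{
    if (m->count == m->depth)
        return 0;
    m->slot[(m->head + m->count) % m->depth] = msg;
    m->count++;
    return 1;
}

/* Returns 1 and stores the oldest message, or 0 when the mailbox is empty. */
static int mbox_read(mbox_t *m, uint32_t *msg)
{
    if (m->count == 0)
        return 0;
    *msg = m->slot[m->head];
    m->head = (m->head + 1) % m->depth;
    m->count--;
    return 1;
}
```

The point for the attacker is that each message is a fixed 32-bit value
delivered in order, so anything larger (a buffer, a length, an address)
has to be reassembled by the receiving code from several messages or
fetched through DMA.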
---[ 2.4.4 - Memory Flow Control (MFC) Commands
This is the main mechanism for the SPE to access main storage and
maintain synchronization with other processors and devices in the system.
MFC commands can be issued either by the SPE itself or by the processor
and other devices, as follows:
- Code running on the SPU issues an MFC command by executing a
series of writes and/or reads using channel instructions.
- Code running on the PPU or any other device issues an MFC
command by performing a series of stores and/or loads to
memory-mapped I/O (MMIO) registers in the MFC.
The MFC commands are then queued in one of these independent queues:
- MFC SPU Command Queue - For channel-initiated commands by the
associated SPU
- MFC Proxy Command Queue - For MMIO-initiated commands by the PPE
or other devices.
---[ 2.4.5 - Direct Memory Access (DMA) Commands
The MFC commands that transfer data are referred to as DMA commands. The
transfer direction of a DMA command is named from the SPE point of view:
- Into a SPE (from main storage to the local storage) -> get
- Out of a SPE (from local storage to the main storage) -> put
---[ 2.4.5.1 - Get/Put Commands
DMA get from the main memory to the local storage:
(void) mfc_get (volatile void *ls, uint64_t ea, uint32_t size,
uint32_t tag, uint32_t tid, uint32_t rid)
DMA put into the main memory from the local storage:
(void) mfc_put (volatile void *ls, uint64_t ea, uint32_t size,
uint32_t tag, uint32_t tid, uint32_t rid)
To guarantee the ordering of the writes to main memory, there are the
following options:
- mfc_putf: the 'f' means fenced, that is, all commands issued
before it within the same tag group must finish first, while
later ones may still execute before it
- mfc_putb: the 'b' here means barrier, that is, the barrier
command and all commands issued thereafter are NOT executed
until all previously issued commands in the same tag group have
been performed
---[ 2.4.5.2 - Resources
For DMA operations the system uses transfers of variable size (1, 2, 4, 8
and n*16 bytes, n being an integer, of course). There is a maximum of
16 KB per DMA transfer, and 128-byte alignment offers better performance.
The DMA queues are defined per SPU, with 16-element queue for
SPU-initiated requests and 8-element queue for PPU-initiated requests.
The SPU-initiated request has always a higher priority.
To differentiate DMA commands, each one receives a tag, a 5-bit
identifier. The same identifier can be applied to multiple commands, since
it's used for polling status or waiting on the completion of the DMA
commands.
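Since a tag is a 5-bit value (0 to 31), waiting on one or more tag groups
comes down to building a 32-bit mask with one bit per tag. A minimal
sketch of that arithmetic follows; tag_to_mask() and tags_to_mask() are
hypothetical names used only for illustration.

```c
#include <stdint.h>

#define MFC_TAG_COUNT 32u /* 5-bit tag: values 0..31 */

/* Hypothetical helper: mask bit for one tag group, or 0 when the tag
   does not fit in 5 bits. */
uint32_t tag_to_mask(uint32_t tag)
{
    if (tag >= MFC_TAG_COUNT)
        return 0;
    return (uint32_t)1 << tag;
}

/* Hypothetical helper: mask covering several tag groups at once. */
uint32_t tags_to_mask(const uint32_t *tags, unsigned count)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < count; i++)
        mask |= tag_to_mask(tags[i]);
    return mask;
}
```

Using an unsigned 1 for the shift matters for tag 31, where a signed
shift would overflow.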
A great feature is the DMA list, where a single DMA command can
cause the execution of a list of transfer requests (stored in local storage).
Lists implement scatter-gather functions and may contain up to 2K transfer
requests.
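To give an idea of how a big transfer maps onto a list, the sketch below
splits a region into requests of at most 16 KB each, up to the 2K limit.
The struct xfer_req layout and build_xfer_list() are hypothetical and do
NOT mirror the hardware list element format; the sketch also assumes the
total size is a multiple of 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_SIZE 16384u /* 16 KB per transfer request */
#define DMA_LIST_MAX 2048u  /* 2K transfer requests per list */

/* Hypothetical list entry, not the hardware layout. */
struct xfer_req {
    uint64_t ea;   /* effective address of this chunk */
    uint32_t size; /* bytes in this chunk */
};

/* Splits [ea, ea + total) into chunks of at most 16 KB.
   Returns the number of entries written, or 0 if 'total' is zero, is
   not a multiple of 16, or does not fit in 'cap' (or 2K) entries. */
size_t build_xfer_list(struct xfer_req *list, size_t cap,
                       uint64_t ea, uint64_t total)
{
    size_t n = 0;

    if (total == 0 || (total % 16u) != 0)
        return 0;
    if (cap > DMA_LIST_MAX)
        cap = DMA_LIST_MAX;
    while (total > 0) {
        uint32_t chunk = total > DMA_MAX_SIZE ? DMA_MAX_SIZE
                                              : (uint32_t)total;
        if (n == cap)
            return 0; /* list too small for this region */
        list[n].ea = ea;
        list[n].size = chunk;
        n++;
        ea += chunk;
        total -= chunk;
    }
    return n;
}
```

With these limits a single list can describe at most 2048 * 16 KB = 32 MB.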
---[ 2.4.5.3 - SPE 2 SPE Communication
An address in another SPE's local storage is represented as a 32-bit
effective address (global address).
An SPE issuing a DMA command needs a pointer to the other SPE's local
storage. The PPE code can obtain the effective address of an SPE's local
storage:
--- code ---
#include <libspe.h>
speid_t speid;
void *spe_ls_addr;
spe_ls_addr=spe_get_ls(speid);
--- end code ---
This permits the PPE to hand each SPE the local storage addresses of the
others and to control the communications. Vulnerabilities may arise no
matter what the communication flow is, even without involving the PPE itself.
Following is a simple DMA demo program between PPE and SPE (see the attached
file for the complete version). This program sends an address in the
PPE to the SPE through DMA:
--- PPE code ---
information_sent is[1] __attribute__ ((aligned (128)));
spe_gid_t gid;
int * pointer=(int *)malloc(128);
gid=spe_create_group(SCHED_OTHER, 0, 1);
if (spe_group_max(gid) < 1 ) {
printf("\nOps, there is no free SPE to run it...\n");
exit(EXIT_FAILURE);
}
is[0].addr = (unsigned int) pointer;
/* Create the SPE thread */
speid=spe_create_thread (gid, &hello_dma, (unsigned long long *) &is[0], NULL, -1, 0);
/* Wait for the SPE to complete */
spe_wait(speid, &status, 0);
/* Best practice: Issue a sync before ending - This is good for us ;) */
__asm__ __volatile__ ("sync" : : : "memory");
--- end code ---
--- SPE code ---
information_sent is __attribute__ ((aligned (128)));
int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
/* Where:
is -> Address in local storage to place the data
argp -> Main memory address
sizeof(is) -> Number of bytes to read
31 -> Associated tag to this DMA (from 0 to 31)
0 -> Not useful here (just when using caching)
0 -> Not useful here (just when using caching)
*/
mfc_get(&is, argp, sizeof(is), 31, 0, 0);
mfc_write_tag_mask(1u<<31); /* The mask is always 1 left-shifted by your tag value */
/* The DMA was issued by mfc_get; now wait until it completes */
mfc_read_tag_status_all();
return 0;
}
--- end code ---
And now between two SPEs (also for the complete code, please refer to the
attached sources):
--- PPE code ---
speid_t speid[2];
speid[0]=spe_create_thread (0, &dma_spe1, NULL, NULL, -1, 0);
speid[1]=spe_create_thread (0, &dma_spe2, NULL, NULL, -1, 0);
for (i=0; i<2; i++) local_store[i]=spe_get_ls(speid[i]); /* Get local storage address */
for (i=0; i<2; i++) spe_kill(speid[i], SIGKILL); /* Send SIGKILL to the SPE
threads */
--- end code ---
--- SPE code ---
/* Write something to the PPE */
spu_write_out_mbox(buffer);
/* Read something from the PPE */
pointer = spu_read_in_mbox();
/* DMA interface */
mfc_get(buffer, pointer, size, tag, 0, 0);
wait_on_mask(1<<tag);
/* DMA something to the second SPE */
mfc_put(buffer, local_store[1], size, tag, 0, 0);
wait_on_mask(1<<tag);
/* Notify the PPE */
spu_write_out_mbox(1);
--- end code ---
------[ 3 - Exploiting Software Vulnerabilities on Cell SPE
I love the architecture manuals and the engineers and the way they talk
about really dumb design choices:
"The SPU Local Store has no memory protection, and memory access wraps
from the end of the Local Store back to the beginning. An SPU program is
free to write anywhere in the Local Store including its own instruction
space. A common problem in SPU programming is the corruption of the SPU
program text when the stack area overflows into the program area. This
problem typically does not become apparent until some later point in the
program execution when the program attempts to execute code in area that
was corrupted, which typically results in illegal instruction exception.
Even with a debugger it can be difficult to track down this type of
problem because the cause and effect can occur far apart in the program
execution. Adding printf's just moves failure point around".
---[ 3.1 - Memory Overflows
From the aforementioned memory design of the SPU it is already clear that
when an attacker controls the overwrite size it's really easy to exploit an
SPU vulnerability: just replace the original program .text with the
attacker's own.
It's important to note that the SPU interrupt facility can be configured
to branch to an interrupt handler at address 0 if an external condition is
true (bisled - branch indirect and set link if external data - is the
instruction used to check if there is external data available). Since the
memory layout wraps around, it's always possible to overwrite this handler
if it's in use.
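The wrap-around can be simulated on any host to see where an overflow
ends up. The sketch below assumes the 256 KB local store shown in this
article (0x00000 to 0x3FFFF); ls_write() is a hypothetical helper that
mimics the address wrap, not real SPU code.

```c
#include <stddef.h>
#include <stdint.h>

#define LS_SIZE 0x40000u        /* 256 KB local store: 0x00000..0x3FFFF */
#define LS_MASK (LS_SIZE - 1u)

/* Hypothetical helper: copies 'len' bytes into a simulated local store
   starting at 'addr'. Addresses past 0x3FFFF wrap back to 0x00000. */
void ls_write(uint8_t *ls, uint32_t addr, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        ls[(addr + (uint32_t)i) & LS_MASK] = src[i];
}
```

Writing 32 bytes at 0x3FFF0 places the first 16 at the top of the local
store and the last 16 at addresses 0x00000 to 0x0000F, which is exactly
where a handler at address 0 would live.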
Another important note is the fact that instructions on memory MUST be
aligned on word boundaries.
There are instruction and data caches for the local storage (depending on
the implementation details), so it's important to ensure that:
- You are overflowing a large enough amount of data to avoid
caching
- You are not using a self-modifying shellcode unless you issue
the sync instruction (see [13] for references)
---[ 3.1.1 - SPE memory layout
The memory layout for the SPE looks like:
------------------------ -> 0x3FFFF
 SPU ABI Reserved Usage
------------------------   | Stack grows from the
 Runtime Stack             | higher addresses to
------------------------   | the lower addresses.
 Global Data               |
------------------------   \/
 .Text
------------------------ -> 0x00000
To inspect your application, the 'size' utility is really useful:
---------- begin output ----------
# size hello_spu
   text    data     bss     dec     hex filename
   1346     928      32    2306     902 hello_spu
---------- end output ----------
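The 'dec' column is just text + data + bss (1346 + 928 + 32 = 2306 =
0x902), and subtracting it from the 256 KB local store gives a rough idea
of how much room is left for the stack. A minimal sketch of that
arithmetic, with hypothetical helper names, ignoring the ABI reserved area
and any alignment padding:

```c
#include <stdint.h>

#define LS_SIZE 0x40000u /* 256 KB local store */

/* 'dec' column of the size output: text + data + bss. */
uint32_t image_size(uint32_t text, uint32_t data, uint32_t bss)
{
    return text + data + bss;
}

/* Rough number of local store bytes not taken by the static image. */
uint32_t ls_left(uint32_t text, uint32_t data, uint32_t bss)
{
    uint32_t used = image_size(text, data, bss);
    return used > LS_SIZE ? 0 : LS_SIZE - used;
}
```

For hello_spu this leaves 262144 - 2306 = 259838 bytes between the end of
the image and the top of the local store.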
---[ 3.1.2 - SPE assembly basics
To develop a shellcode, it's important to understand how the SPE
assembly differs from PowerPC.
The SPE uses RISC-based assembly, which means there is a small set of
instructions, and everything in the SPE runs in user-mode (there is no
kernel-mode for the SPE). That said, we need to remember there are no
system calls; instead there are PPE calls (stop instructions).
It is also a big-endian architecture (keep that in mind while reading the
following sections).
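Because the SPE is big endian and instructions are 32-bit words, a payload
assembled on a little-endian host has to be written out most significant
byte first. A minimal sketch follows; put_be32() and pack_words() are
hypothetical helper names, and the words passed in are placeholders, not
real SPU opcodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Stores a 32-bit word most significant byte first, regardless of the
   endianness of the host doing the packing. */
void put_be32(uint8_t *out, uint32_t word)
{
    out[0] = (uint8_t)(word >> 24);
    out[1] = (uint8_t)(word >> 16);
    out[2] = (uint8_t)(word >> 8);
    out[3] = (uint8_t)word;
}

/* Packs 'count' instruction words into 'out' in big-endian order.
   Returns the number of bytes written (always a multiple of 4, so the
   result stays word aligned when placed at a word-aligned address). */
size_t pack_words(uint8_t *out, const uint32_t *words, size_t count)
{
    for (size_t i = 0; i < count; i++)
        put_be32(out + 4 * i, words[i]);
    return 4 * count;
}
```

Packing the placeholder word 0x11223344 yields the bytes 11 22 33 44 in
that order.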