forked from facebook/hhvm
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbytecode.specification
4348 lines (3238 loc) · 185 KB
/
bytecode.specification
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
**********************************
* HipHop Bytecode v1 revision 18 *
**********************************
Introduction
------------
HipHop bytecode (HHBC) v1 is intended to serve as the conceptual basis for
encoding the semantic meaning of HipHop source code into a format that is
appropriate for consumption by interpreters and just-in-time compilers. By
using simpler constructs to encode more complex expressions and statements,
HHBC makes it straightforward for an interpreter or a compiler to determine
the order of execution for a program.
HHBC was designed with several competing goals in mind:
1) Run-time efficiency. The design of HHBC should be congruous to implementing
an efficient execution engine, whether it be an interpreter or a just-in-time
compiler.
2) PHP 5.5 compatibility. It should be possible to compile valid PHP 5.5 source
code into HipHop bytecode in a way that preserves the semantic meaning of the
source.
3) Simplicity. The design of HHBC should avoid features that could be removed
or simplified without compromising PHP 5.5 compatibility, run-time efficiency,
or design cleanliness.
Compilation units
-----------------
Each HipHop source file is compiled into a separate "compilation unit", or
"unit" for short. Units are composed of bytecode and metadata.
A unit's bytecode is an array of bytes encoding a sequence of HHBC
instructions, where each instruction is encoded using one or more bytes. This
specification defines an instruction set and defines the behavior of each HHBC
instruction, but the exact byte values used to encode HHBC instructions is
currently unspecified.
A unit's metadata is a set of structures that provide essential information
that is needed at run time by the execution engine. This specification will
describe a unit's metadata as a set of named tables with ordered rows, but the
exact format of the metadata is currently unspecified.
Each instruction in a unit's bytecode can be referred to using a "bytecode
offset", which is the distance in bytes from the first byte of a unit's
bytecode to the first byte of the instruction.
A unit's bytecode is partitioned into sections called "functions". The unit's
metadata uses bytecode offsets to specify which instructions belong to which
functions.
When a unit is loaded at run time, the execution engine assigns the unit's
bytecode a logical range of addresses called "bytecode addresses". An
instruction is referred to at run time using its bytecode address.
Flow of execution
-----------------
HipHop bytecode models the flow of execution using a stack of frames referred
to as the "call stack". A "frame" is a structure that logically consists of a
header, a program counter (PC), a local variable store, an iterator variable
store, an evaluation stack, and a function parameter info (FPI) stack.
The frame at the top of the call stack is referred to as the "current frame".
The current frame represents the function that is currently executing. The
program counter (PC) of the current frame is referred to as the "current PC".
At any given time, the current PC holds the bytecode address of the current
instruction to execute. When the execution engine executes an instruction, the
current PC is updated to point to the next instruction. By default, the current
PC is updated to point to the byte that sequentially follows the last byte of
the current instruction in the bytecode. Some instructions override the default
behavior and explicitly update the current PC in a specific way.
HHBC provides special instructions to allow for calling a function and
returning from a function. When a function is called, a new frame is pushed
onto the call stack, and the PC of the new frame is initialized to the
appropriate entry point (typically the instruction of the function that is
sequentially first in the bytecode). The new frame becomes the current frame,
and the PC of the new frame becomes the current PC. When a function returns,
the current frame is popped off the call stack. The previous frame becomes the
current frame, and its PC becomes the current PC. The facility provided by the
execution engine that is responsible for handling function calls and returns is
called the "dispatcher".
Typically, a frame is removed from the call stack when its corresponding
function returns. However, a frame may be removed from the call stack before
its corresponding function returns in the course of processing an exception.
The facility provided by the execution engine that is responsible for
processing exceptions is called the "unwinder".
Values
------
HHBC instructions may push and pop values on the current frame's evaluation
stack and they may read and write values to the current frame's local
variables. Values come in three flavors: cells, refs, and classrefs.
A "cell" is a structure that contains a type identifier and either data (for
non-refcounted types) or a pointer to data (for refcounted types). When a cell
containing a pointer is duplicated, the new cell will point to the same data as
the original cell. When a cell containing a pointer is duplicated or discarded,
the execution engine is responsible for honoring the data's refcount logic.
A "ref" is a structure that contains a pointer to a cell container. When a ref
is duplicated, the new ref will point to the same container as the original
ref. When a ref is duplicated or destroyed, the execution engine is responsible
for honoring the container's refcount logic. When the container is destroyed,
the cell it contains is also destroyed.
A "classref" is a structure that contains a reference to a class. When a
classref is pushed onto the stack or popped of the stack, no refcounting is
required.
Values on the evaluation stack may be any of the three flavors listed above.
Values stored in local variables may only be cells or refs.
Functions
---------
A unit's bytecode is organized into functions. Each function has its own
metadata that provides essential information about the function, such as the
name of the function, how many local variables it has, how many iterator
variables it has, how many formal parameters it has, the names of the local
variables, the names of the formal parameters, how each parameter should be
passed (pass by value vs. pass by reference), whether each parameter has a
default value, and an upper bound for the maximum depth the evaluation stack
can reach at run time.
Each local variable and iterator variable has an id, and HHBC instructions can
reference these variables using these ids. The id space for local variables is
distinct from the id space for iterator variables. Thus local id 1 refers to a
different variable than iterator id 1. Local variable ids and iterator ids are
signed 32-bit integer values. No function may have more than 2^31 - 1 local
variables, and no function may have more than 2^31 - 1 iterator variables.
Some local variables have names associated with them (called "named local
variables"), while other local variables do not have names associated with them
(called "unnamed local variables"). All local variables that reference formally
declared parameters have names associated with them. Iterator variables do not
have names associated with them. Variables that have a name associated with
them will appear in the current variable environment (if they are defined),
while variables that do not have a name associated with them will never appear
in the current variable environment.
Formally declared parameters are considered to be local variables. Given a
function with n formally declared parameters, local ids 0 through n-1 will be
used to reference the formally declared parameters. Formal parameters without
default values are called "required parameters", while formal parameters with
default values are called "optional parameters".
The bytecode for each function is partitioned into a primary function body and
0 or more fault funclets. The metadata for each function specifies how many
fault funclets the function has. For the primary function body, the metadata
specifies a set of non-overlapping ranges of bytecode that compose the primary
function body, and it specifies the main entry point and 0 or more DV entry
points (entry points are discussed in more detail in the "Entry points"
section). For each fault funclet, the metadata specifies a set of non-
overlapping ranges of bytecode that compose the fault funclet body, and it
specifies an entry point for the fault funclet. The primary function body and
the fault funclets may not overlap with each other, and the union of the
primary function body and the fault funclets must cover all of the function's
bytecode. Fault funclets are discussed in more detail in the "Exception handler
(EH) table" and "Processing exceptions" sections.
The total size of the bytecode for the primary function body and all the fault
funclets must not exceed 2^31 - 1 bytes. The bytecode for a function must be
one contiguous range of bytecode.
Each function's metadata provides a "line number table" to allow mapping
bytecode offsets back to source line numbers. Each row in the line number table
consists of a source line number and a range of bytecode. The table is sorted
by starting bytecode offset, lowest offset first. The bytecode offset of the
beginning of each instruction in the function must belong to exactly one of the
ranges of bytecode in the line number table.
Classes
-------
Functions may be grouped into metadata for classes. Class metadata objects are
used to describe several PHP-level language features including traits,
interfaces, closures, and (of course) classes.
Class metadata includes information about the properties on the class, special
functions on the class such as constructors or internal property initialization
routines (86sinit, 86pinit), class constants, list of used traits, list of
extended classes, list of implemented interfaces, etc.
Classes also include a flag indicating their "hoistability". For now this isn't
documented much here. See class.h.
Closures
--------
Closures are implemented in hhbc as subclassses of Closure, in conjunction with
the CreateCl opcode. It is legal hhbc to create other subclasses of Closure (to
represent user code that attempts to do the same), but attempting to
instantiate one will result in a fatal error. The documentation of the CreateCl
opcode below lists the requirements for a closure subclass to be usable with
it.
Generators
----------
The basic compilation strategy for generators is to create bytecode functions
consisting of two parts.
The first part, executed when the generator function is called, must consist
of a CreateCont, which is responsible for suspending execution state into a new
Generator object (includes resume offset pointing to the start of the second
part of the function) and returning it back to the caller.
The second part is where the real user-level code of the generator should be
placed. ContEnter and ContRaise opcodes used in Generator's next(), send()
and raise() methods resume execution and transfer control to the resume offset
stored in the Generator object. The user-level code yields values using
Yield and YieldK opcodes and returns using RetC opcode.
Async functions
---------------
Async functions are special type of functions representing asynchronous
execution. They can suspend while waiting for other asynchronous operations to
finish. This is achieved using Await opcode, which suspends execution into
an AsyncFunctionWaitHandle object. Once the given dependency is finished,
the scheduler resumes async function at the next opcode.
The async function body can be executed in 2 different modes. If the execution
was never suspended, we are in "eager execution" mode. The code executed after
the resume is executed in "resumed execution" mode.
The "eager execution" can end in 3 different ways. If a RetC opcode is reached,
the result is wrapped into a succeeded StaticWaitHandle and returned to the
caller. If an exception is thrown, it is wrapped into a failed StaticWaitHandle
and returned to the caller. Otherwise, if an Await opcode was reached and the
provided child WaitHandle has not finished, the current execution state is
suspended into an AsyncFunctionWaitHandle object and retured to the caller.
This mechanism allows fast execution if no blocking asynchronous operation was
reached.
The "resumed execution" mode is always entered from the scheduler. In this mode,
the async function either gets blocked on another dependency, or gets finished.
The scheduler is notified of these events using Await and RetC opcodes (or via
the unwinder if an exception was thrown) and the control is given back.
The async function implementation is still changing and the implementation may
change significantly, so this spec is staying light on details for now.
Entry points
------------
Entry points come in four varieties: the main entry point, DV entry points,
fault entry points, and catch entry points.
Every function has exactly one main entry point. When a function is called, the
dispatcher will set the PC of the new frame to point to the main entry point if
either (1) the function does not have any optional parameters or (2) the caller
provides values for all of the optional parameters.
DV entry points are normally used to handle initializing optional parameters
that the caller did not provide. Generally the DV entries contain blocks that
initialize parameters, and then fall through directly into one another, with
the last block ending with a jump to the main entry point. This is not a
requirement, however. The dispatcher selects the appropriate DV entry point
based on the number of arguments passed into the function.
The main entry point and DV entry points are used by the dispatcher when
handling a function call. Each function's metadata provides an "entry point
table". Each row in the entry point table consists of a number of arguments and
the bytecode offset of the entry point that should be used by the dispatcher
(either the main entry point or a DV entry point).
Fault entry points are used by the unwinder to enter fault funclets as
appropriate to perform necessary cleanup when a region of code exits abnormally
through an exception. For a function with N fault funclets there are exactly N
fault entry points (one fault entry point per fault funclet). The bytecode
offset of a fault entry point must be inside its corresponding fault funclet.
Catch entry points are used by the unwinder to resume normal execution once a
matching "catch" block has been found and all the necessary cleanup has been
performed.
More details about the unwinder, fault funclets, fault entry points, catch
entry points can be found in the "Exception handler (EH) table" and "Processing
exceptions" sections.
Unit metadata
-------------
Every compilation unit has a litstr table, a scalar array table, a function
table, and a class table.
The litstr table maps litstr ids to literal strings. Bytecodes that refer to
literal strings do so by litstr id. Litstr ids are signed 32-bit integer
values, which must be between 0 and 2^31 - 2 inclusive. In addition to the
per-unit litstr tables, a global table is built when generating an
"authoritative" repo (one in which all the PHP code is known at bytecode
generation time, and is guaranteed not to change). Global litstr ids can be
used in any unit, and are encoded in the range [2^30..2^31-2].
The scalar array table maps scalar array ids to a description of the contents
of a scalar array. An array is a scalar array if and only if each element of
the array is a null, boolean, integer, double, string, or a scalar array.
Furthermore, each element of a scalar array must be a cell. Finally, scalar
arrays may not recurse infinitely. Each scalar array id must be between 0 and
2^31 - 2 inclusive.
Each row in the function table contains a unique function id, a function name
specified by a litstr id, the bytecode offset for the corresponding function, a
flag that indicates if the function is unconditionally declared in the
outermost scope, and the function metadata. Note that there may be multiple
rows in the function table with same function name. However, there may not be
multiple rows that are marked as being unconditionally declared in the
outermost scope with the same function name. Each function id must be between 0
and 2^31 - 2 inclusive.
Each row in the class table contains a unique class id, a class name specified
by a litstr id, a flag that indicates if the class declaration is hoisted to
the prelude of pseudo-main, and the class metadata. Note that there may be
multiple rows in the class table with same class name. However, there may not
be multiple rows that are marked as being hoisted with the same class name.
Each class id must be between 0 and 2^31 - 2 inclusive.
Function parameter info (FPI) structures and the FPI stack
----------------------------------------------------------
Every function has a function parameter info (FPI) structure associated with it
that can be retrieved at run time. The FPI structure contains the bytecode
address of the function, the number of parameters the function has, and a
parameter table that indicates whether each parameter is pass by value or pass
by reference.
In addition to the evaluation stack, each frame also contains another stack
called the FPI stack. Each entry on the FPI stack consists of a reference to an
FPI structure and a bytecode address of entry point into the corresponding
function. The entry on the top of the FPI stack is called the "current FPI".
The FPush* instructions push a new entry onto the FPI stack, initializing the
entry with a reference to the FPI structure for a given function and the
bytecode address of the appropriate entry point. The FPass* instructions
prepare the parameters that will be passed into the callee. The FCall*
instructions look at the current FPI to get the bytecode address of the
function (the callee), transfers the parameters from the evaluation stack to
the callee, pops the current FPI off of the FPI stack, and then invokes the
dispatcher to call the function.
Calls to builtin functions may be optimized to avoid pushing an entry on the
FPI stack if it is known that the builtin function does not need access to the
call stack. In this case, the arguments to the builtin are pushed on stack as
Cells and Vars, and the builtin can be invoked with the FCallBuiltin
instruction. Unless otherwise noted, subsequent references to FCall*
instructions should be meant to refer to non-optimized FCall instructions, i.e.
all FCall instructions other than FCallBuiltin.
Calling convention
------------------
The caller may pass any number of parameters to the callee by executing FPass*
instructions zero or more times prior to executing an FCall* instruction. The
caller must pass the parameters in forward order, i.e. the first use of FPass*
passes the first parameter, the second use of FPass* passes the second
parameter, and so forth.
The FPush*/FPass*/FCall* instructions can be used to call a global function, a
method on an object, or a method from a class. The caller is responsible for
evaluating all of the parameters in forward order. When the caller executes an
FCall* instruction, the dispatcher creates a new frame and moves the parameters
prepared by the caller into the callee's variable environment. The dispatcher
then transfers control to the appropriate entry point of the callee (either the
main entry point or a DV entry point) based on the number of parameters passed.
When the callee executes the Ret* instruction, the dispatcher pushes the return
value onto the caller's evaluation stack. Then the dispatcher destroys the
callee's frame and transfers control back to the caller.
Exception handler (EH) table
----------------------------
The metadata for each function provides an "exception handler (EH) table".
Each row in the EH table (called an "EH entry") consists of a kind ("fault" or
"catch"), a non-negative integer "region depth", a set of non-overlapping
ranges of bytecode that compose the "protected region", and an offset of a
fault funclet (if it's a "fault" EH entry) or a list of [class name, catch
entry point] pairs (if it's a "catch" EH entry).
Each range of bytecode is given by a starting offset and an ending offset,
where the starting offset is the bytecode offset of the first byte of the first
instruction in the range and the ending offset is the bytecode offset after the
last byte of the last instruction in the range.
Note that two or more EH entries may refer to the same fault funclet or the
same catch entry points. Regardless of whether multiple EH entries share the
same fault funclet or the same catch entry points, each EH entry in the EH
table will be considered to declare a distinct "protected region".
The EH entries in each EH table must honor the following rules:
1) If an EH entry covers one or more bytes of the primary function body, then
it may not cover any bytes in any fault funclets. If an EH entry covers one or
more bytes of a fault funclet, then it may not cover any bytes in the primary
function body or in other fault funclets.
2) If a catch EH entry covers one or more bytes of the primary function body,
then its catch entry points must point to instructions in the primary function
body. If a catch EH entry covers one or more bytes of a fault funclet F, then
its catch entry points must point to instructions in fault funclet F.
3) For each EH entry with a region depth of D and a protected region P, for all
other protected regions Q that overlap with P, one of the following must be
true: (i) Q has a region depth that is greater than D and P is a superset of
(or equal to) Q; or (ii) Q has a region depth that is less than D and P is a
subset of (or equal to) Q.
4) For each EH entry with a region depth of D and a protected region P, for
each integer I where 0 <= I < D there must be exactly one protected region Q in
the EH table where Q's region depth equals I and P overlaps with Q.
Processing exceptions
---------------------
The unwinder maintains a stack of exception infos called the "exception stack".
An "exception info" is a record that contains an exception object, a reference
to a frame on the call stack, a program counter (PC), and a non-negative
integer "region depth". When a thread of execution first begins, the exception
stack is initially empty.
HHBC allows programs to throw exceptions via the Throw instruction. When a
Throw instruction executes, it pushes a new exception info to the top of the
exception stack that contains the thrown exception object, a reference to the
frame that threw the exception, the PC at the time the exception was thrown,
and a region depth of D where D equals the number of protected regions in the
current frame's EH table that cover PC. Then it transfers control to the
unwinder which starts processing the exception info at the top of the exception
stack by following the steps given at the end of this section starting with
step 1 until control is transferred elsewhere.
HHBC also provides an Unwind instruction to allow a fault funclet to return
control back to the unwinder when it has finished its work. When the Unwind
instruction executes, it transfers control to the unwinder and the unwinder
resumes processing the exception info at the top of the exception stack by
following the steps below starting at step 1 until control is transferred
elsewhere.
Here are the steps that the unwinder follows to process the exception info at
the top of the stack (called the "current exception info"):
Step 1) Consult the EH table of the current exception info's function. Check if
there are any EH entries that cover the current exception info's PC and
have a region depth that is less than the current exception info's
region depth. If one or more matching EH entries are found, choose the
EH entry with the greatest region depth and continue on to step 2. If
no matching EH entries are found go to step 5.
Step 2) Let E be the EH entry found in step 1, and let D be the region depth of
E. Set the current exception info's region depth to D (overwriting the
previous value). Continue on to step 3.
Step 3) If E is a fault EH entry, transfer control to E's fault funclet's entry
point; eventually control will be transferred back to the unwinder,
either via the Unwind instruction or via another exception being thrown
with the Throw instruction. Otherwise continue on to step 4.
Step 4) E is a catch EH entry. Consult E's list of [class name, catch entry
point] pairs. Check if there are any pairs in the list where the
exception's type is compatible with the pair's class name. If one or
more matching pairs are found, choose the one that occurs first in E's
list and transfer control to that pair's catch entry point; the catch
entry point begins with a Catch instruction which will take care of
popping the current exception info off of the exception stack. If no
matching pair is found, go to step 1.
Step 5) Check if the current exception info's PC is in a fault funclet. If it's
not, continue to step 6. If it is, read the exception X from the
current exception info and then pop the current exception info off of
the exception stack. Then, read the new current exception info's
exception Y, update exception X so that it's "previous" property chains
to exception Y, and then update the new current exception info to point
to exception X instead of exception Y. Then go to step 1.
Step 6) The current exception info's PC is in the primary function body. Pop
the current frame off of the call stack and then check if the call
stack is empty. If the call stack is empty, read the current exception
info's exception X, clear the exception stack, and transfer control to
the unhandled exception facility passing along exception X. If the call
stack is not empty continue to step 7.
Step 7) Update the current exception info to refer to the new current frame at
the top of the call stack, and set the current exception info's PC to
point to the FCall* instruction which immediately precedes the PC of
the current frame. Next, set the current exception info's region depth
to D, where D equals the number of protected regions in the current
frame's EH table that cover the current exception info's PC. Then go to
step 1.
Property access
---------------
As object properties are accessed during execution, the execution engine is
responsible for following certain rules to honor each property's accessibility
and visibility.
The accessibility and visibility of a property in a given class is determined
by that class's definition and the definitions of all of that class's
ancestors. When a property is declared in a class definition (a "declared
property") it may be specified as being "public", "protected", or "private".
Accessibility and visibility are two related but distinct concepts. Depending
on the current context, a property may be visible and accessible, visible but
inaccessible, or invisible and inaccessible.
If a property P is declared with the "public" qualifier in the definition of
class C, for instances of class C and descendent classes the property P will be
visible and accessible in all contexts. If C has an ancestor that declares a
public property with the same name as P, C is said to "redeclare" property P,
and the declaration of P in class C is considered to refer to the same property
as the declaration in the ancestor class.
If a property P is declared as "protected" in the definition of class C, for
instances of class C the property P will be visible in all contexts, but only
accessible in the context of class C, an ancestor class, or descendent class.
When class C is loaded at run time, a semantic check must be performed to
ensure that all ancestor classes of C do not declare a property as "public"
with the same name as P. If C has an ancestor that declares a public property
with the same name as P, the execution engine must throw a fatal error when
class C is loaded. If C has an ancestor that declares a protected property with
the same name as P, C is said to "redeclare" property P, and the declaration of
P in class C is considered to refer to the same property as the declaration in
the ancestor class. Note that there may exist a class D that is a descendent of
C and declares a property as "public" with the same name as P. In such cases
the new "public" declaration in D is considered to refer to the same property
as the original "protected" declaration in C, and the "protected" qualifier
from the original declaration is effectively overridden by the "public"
qualifier from the new declaration. Class D is said to "redeclare" property P
with the "public" qualifier. Thus, for instances of class D and descendent
classes of D, property P will be visible and accessible in all contexts.
Finally, if a class E that is descendent of C does not redeclare P as public
and does not have an ancestor class that redeclares P as public, for instances
of class E the property P will be visible in all contexts, but only accessible
in the context of class E, an ancestor class of E, or a descendent class of E.
If a property P is declared with the "private" qualifier in the definition of
class C, for instances of class C the property P will be visible in all
contexts, but only accessible in the context of class C. For instances of
descendent classes of C, the property P will be visible and accessible in the
context of the class C, and in all other contexts property P will be invisible
and inaccessible. When class C is loaded at run time, a semantic check must be
performed to ensure that all ancestor classes of C do not declare a property as
"public" or "protected" with the same as P. If C has an ancestor that declares
a public or protected property with the same name as P, the execution engine
must throw a fatal error when class C is loaded. Note that descendent classes
of C may declare another property with the same name as P. The declaration of
property P as "private" in class C is considered to define a separate property
that is distinct from all other properties of the same name declared in
ancestor classes and descendent classes of C.
An instruction that accesses a property specifies the property by a name N via
a litstr id, a local variable id, or a cell consumed from the evaluation stack.
As noted above, it is possible for a class to have multiple distinct properties
named N. In cases where there are multiple distinct properties named N, the
visibility rules are used to determine which property is retrieved. If there is
a visible private property P named N, then property P is retrieved. Otherwise,
if there is a visible non-private property Q named N, then property Q is
retrieved. If there is no visible property named N, the behavior is determined
by the specific instruction. The semantic checks and the visibility rules
ensure that for any context there cannot be more than one visible private
property, and there cannot be more than one visible non-private property.
Some instructions can create a new property at run time with a name that is
different than the names of all declared properties that are visible in the
current context. Such properties are called "non-declared properties" or
"dynamic properties". Dynamic properties are considered to be visible and
accessible in all contexts.
If a declared property is unset, and then re-accessed/re-created, then it is
treated the same way as an invisible property with the same attributes as the
original declared property. Specifically, if the property gets created again,
it must have the same access attributes as the original declared property.
Magic property access methods
-----------------------------
Instructions that access properties may in some cases invoke a magic property
access method (__get, __set, __isset, or __unset) if an object implements the
method and the method is considered eligible for invocation. A magic property
access method is considered "eligible" for a given object if there is not a
frame on the call stack that corresponds to an invocation of the same method on
the same object.
Static property access
----------------------
As a class's static properties are accessed during execution, the execution
engine is responsible for following certain rules to honor each static
property's accessibility and visibility.
The accessibility and visibility of a static property in a given class is
determined by that class's definition and the definitions of all of that
class's ancestors. When a static property is declared in a class definition it
may be specified as being "public", "protected", or "private". Depending on the
current context, a static property may be visible and accessible, visible but
inaccessible, or invisible and inaccessible.
Conceptually, each class has a "static store" associated with it at run time
that provides storage for the static properties declared in the class's
definition. Static properties are accessed at run time by name through the
scope of a class. When an instruction accesses a static property through the
scope of class C, it will search the static store of C and then the static
stores of C's ancestors (starting with C's base class and moving up the
inheritance chain) for the first static property with the given name that is
visible in the current context.
If a static property S is declared with the "public" qualifier in the
definition of class C, the static property S when accessed through the scope of
class C or a descendent of C will be visible and accessible in all contexts.
Note that descendent classes of C may declare another static property with the
same name as S. The declaration in class C is considered to define a separate
static property that is distinct from all other static properties declared in
descendent classes of C.
If a static property S is declared with the "protected" qualifier in the
definition of class C, the static property S when accessed through the scope of
class C or a descendent of C will be visible in all contexts, but only
accessible in the context of class C, an ancestor class of C, or descendent
class of C. When class C is loaded at run time, a semantic check must be
performed to ensure that all ancestor classes of C do not declare a static
property as "public" with the same name as S. If C has an ancestor that
declares a public static property with the same name as S, the execution engine
must throw a fatal error when class C is loaded. Note that descendent classes
of C may declare another static property with the same name as S. The
declaration in class C is considered to define a separate static property that
is distinct from all other static properties declared in descendent classes of
C.
If a static property S is declared with the "private" qualifier in the
definition of class C, the static property S when accessed through the scope of
class C will be visible in all contexts, but only accessible in the context of
class C. The static property S when accessed through the scope of a descendent
of C will only be visible and accessible in the context of class C. When class
C is loaded at run time, a semantic check must be performed to ensure that all
ancestor classes of C do not declare a static property as "public" or
"protected" with the same name as S. If C has an ancestor that declares a
public or protected static property with the same name as S, the execution
engine must throw a fatal error when class C is loaded. Note that descendent
classes of C may declare another static property with the same name as S. The
declaration in class C is considered to define a separate static property that
is distinct from all other static properties declared in descendent classes of
C.
Note that instructions cannot create new static properties in a class that were
not declared in the class definition.
FPI regions
-----------
An FPI region is a contiguous range of bytecode that constitutes a call site.
Each FPI region begins immediately after an FPush* instruction that pushes an
FPI structure onto the FPI stack and must end with the corresponding FCall*
instruction that pops that FPI structure off of the FPI stack. If two FPI
regions overlap, one of the FPI regions must be completely enclosed by the
other FPI region. An FPI region may not contain backward jumps, nor may it
contain forward jumps that jump past the end of the FPI region.
Each function has an "FPI region table". Each row in the FPI region table
consists of the starting offset of the FPI region (the bytecode offset
immediately following the FPush* instruction), the ending offset of the FPI
region (the bytecode offset of the FCall* instruction), and the number of
parameters being passed.
Flavor descriptors
------------------
Any given value on the stack must either be a cell, ref, or classref at run
time. However, at bytecode generation time the specific flavor of a value on
the stack is not always known. HipHop bytecode uses symbols called "flavor
descriptors" to precisely describe what is known at bytecode generation about
the state of the evaluation stack at each instruction boundary.
Each instruction description specifies the flavor descriptor produced for each
of its outputs. Each description also specifies the flavor descriptor consumed
for each of the instruction's inputs.
Here is a description of each flavor descriptor:
C - cell; specifies that the value must be a cell at run time
V - ref; specifies that the value must be a ref at run time
A - classref; specifies that the value must be a classref at run time
R - return value; specifies that the value may be a cell or a ref at run
time; this flavor descriptor is used for return values from function
calls
F - function argument; specifies that the value may be a cell or a ref at run
time; this flavor descriptor is used for parameter values that are about
to be passed into a function
U - uninit; specifies that the value must be an uninitialized null at run
time; this is only used for FCallBuiltin
Verifiability
-------------
Because instructions specify constraints on the flavor descriptor of each
input, it is important to be able to determine if a given HHBC program
satisfies these constraints. A program that satisfies the constraints on the
inputs to each instruction is said to be "flavor-safe".
HHBC provides a set of verification rules that can be mechanically applied to
verify that an HHBC program is flavor-safe. All valid HHBC programs must be
verifiably flavor-safe, and the execution engine may refuse to execute HHBC
programs that cannot be verified.
At bytecode generation time, what is known about the state of the evaluation
stack at a given instruction boundary can be precisely described using flavor
descriptors.
In addition to being flavor-safe, there are other invariants that valid HHBC
programs must uphold with respect to metadata and how certain instructions are
used.
Below is the complete list of verifiability rules. If the bytecode to be
executed does not come from a trusted source, it is the responsibility of the
bytecode execution engine to verify that these invariants hold.
1) The depth of the evaluation stack at any given point in the bytecode must be
the same for all possible control flow paths. The flavor descriptor of any
given slot on the evaluation stack at any given point in the bytecode must be
the same for all possible control flow paths.
2) No instruction may consume more values from the evaluation stack than are
available at that given point in the bytecode. Likewise, the flavor descriptor
of each slot on the evaluation stack must be compatible with the instruction's
inputs' flavor descriptors.
3) The evaluation stack must be empty at any offset listed as a catch entry
point.
4) If a given instruction is not the target of a forward branch and it follows
a Jmp, Switch, SSwitch, RetC, RetV, Unwind, Fatal, Throw, or NativeImpl
instruction, the evaluation stack before executing the given instruction
must be empty.
5) Before executing the RetC instruction, the evaluation stack must contain
exactly one value and the flavor descriptor of the value must be cell.
Likewise, before executing the RetV instruction, the evaluation stack must
contain exactly one value and the flavor descriptor of the value must be the
ref. Finally, before executing the Unwind instruction, the evaluation stack
must be empty.
6) The code for the primary function body and fault funclets must be laid out
in order in one contiguous block, starting with the primary function body and
optionally followed by one or more fault funclets. The code for primary
function body may not jump into the code for the funclets. Similarly, the code
for a funclet may not jump into the code for the primary function body or
another funclet.
7) Any bytecode instruction inside the primary function body or a fault funclet
that is immediately followed by an instruction not inside the same primary
function body or fault funclet must be one of the following instructions: Jmp,
Switch, SSwitch, RetC, RetV, Unwind, Fatal, Throw, or NativeImpl.
8) The primary function body may not contain the Unwind instruction, and fault
funclets may not contain the Ret* instructions. Also, each catch entry point
must point to a Catch instruction.
9) Each FPI region enumerated in the FPI region table must start immediately
after an FPush* instruction and it must end with an FCall* instruction. Each
use of the FPush* instruction must be the instruction immediately before
exactly one FPI region. Likewise, each use of the FCall* instruction must be
the last instruction in exactly one FPI region. Finally, FPass* instructions
may not be used outside an FPI region.
10) Each FPI region may not contain backward jumps, nor may it contain forward
jumps that jump outside the end of the FPI region. Also, there may not be jumps
anywhere in the function that transfer control from the outside of a given FPI
region to the inside of that region. An FPI region may not contain the Ret*,
Unwind, Throw, or Fatal instructions. Finally, an entry point may not point to
an instruction inside an FPI region.
11) The depth of the FPI stack at any given point in the bytecode must be the
same for all possible control flow paths. Also, for any given FPI region that
passes n parameters, all possible control flow paths from the beginning of the
region to the end must pass through exactly n FPass* instructions associated
with the region which pass the parameters in forward order.
12) Given an evaluation stack of depth n after an FPush* instruction, the
evaluation stack before the corresponding FCall* instruction must also have a
depth of n. Likewise, the evaluation stack after corresponding FPass*
instructions must have a depth of n as well. Finally, no instruction between an
FPush* and its corresponding FCall* may consume any of the values from the
evaluation stack that were pushed onto the stack before the FPush* instruction.
13) The initialization state of each iterator variable must be known at every
point in the code and must be the same for all control paths. There are four
possible states: (1) uninitialized, (2) "iter-initialized" (initialized via
IterInit*), (3) "miter-initialized" (initialized via MIterInit*), and (4)
"cuf-initialized" (initialized via DecodeCufIter). Every range of bytecode for
which an iterator variable i is initialized must be protected by a fault
funclet that unsets i by calling IterFree, MIterFree, or CIterFree.
14) The iterator variable referenced by IterInit* or MIterInit* or
DecodeCufIter must be in the uninitialized state when the instruction executes.
An iterator variable referenced by IterNext* and IterFree must be in the
"iter-initialized" state, an iterator variable referenced by MIterNext* or
MIterFree must be in the "miter-initialized" state, and an iterator variable
referenced by FPushCufIter or CIterFree must be in the citer-initialized state.
Note that IterInit* and MIterInit* conditionally initialize the iterator
variable, and IterNext* and MIterNext* conditionally free the iterator
variable.
15) Each EH table must follow all of the rules specified in the "Exception
handler (EH) table" section.
Instruction set
---------------
Each instruction description below consists of a mnemonic, followed by 0 or
more immediate operands, followed by a stack transition description of the form
"[xn,...,x2,x1] -> [ym,...,y2,y1]", where "[xn,...,x2,x1]" is a list of flavor
descriptors describing what the instruction consumes from the evaluation stack
and "[ym,...,y2,y1]" is the list of flavor descriptors describing what the
instruction pushes onto the stack. x1 and y1 represent the topmost stack
elements before and after execution, respectively.
Each element of a stack transition may also contain an optional type
annotation. Here is the list of the type annotations used in instruction
descriptions:
Null - denotes the null type
Bool - denotes the boolean type
Int - denotes the integer type
Dbl - denotes the double-precision floating-point type
Str - denotes the string type
Arr - denotes the array type
Obj - denotes the object type
Multiple type annotations may be combined together using the "|" symbol. For
example, the type annotation "Int|Dbl" means that a value is either integer or
a double.
Some instructions may contain multiple stack transition descriptions to express
the relationship between the types of the values consumed from the stack and
types of the values pushed onto the stack. Also, in some stack transition
descriptions, "<T>" is used as shorthand to represent any one specific type.
For example, a transition such as "[C:<T>] -> [C:<T>]" indicates that the type
of value that the instruction pushes onto the stack will match the type of
value that it consumed from the stack. Likewise, "<F>" is used as shorthand to
represent any one specific flavor descriptor.
$1 is used to refer to the value at the top of the evaluation stack, $2 is used
to refer to the value directly below $1 on the evaluation stack, $3 is used to
refer to the value directly below $2, and so forth. Also, %1 is used to refer
to the first immediate argument, and %2 is used to refer to the second
immediate argument.
Note that the relative offset immediate used by a Jmp*, Iter*, MIter*, Switch,
or SSwitch instruction is relative to the beginning of the instruction.
There are numerous instructions that operate on different kinds of locations.
Locations are specified using "location descriptors". The complete list of
location descriptors is given below:
L - local id; location is the local variable whose id is given by an
immediate.
N - local name; location is the local variable whose name is given by the
value of a cell.
G - global name; location is the global variable whose name is given by the
value of a cell.
S - static property; location is the static property whose class is given by
a classref and whose name is given by value of a cell.
C - cell; location is a temporary value given by a cell.
R - return value; location is a temporary value given by a cell or a ref
H - $this; location is the $this pointer in the current frame. Must only be
used in a frame that is known to have a non-null $this pointer; CheckThis
is most commonly used to ensure this.
There are several groups of similarly named instructions where the name of each
instruction ends with a different location descriptor (for example, Set*). Each
instruction in the group perform similar actions but take different kinds of
inputs to specify the location to access.
The Member instructions provide functionality to operate on elements and
properties. These instructions incorporate an immediate argument vector which
specifies a location descriptor (defined above) followed by one or more member
descriptors:
EC - consume a cell from the evaluation stack as an element
EL:<id> - consume a local given by an immediate id as an element
ET:<id> - consume a litstr given by an immediate id as an element
EI:<int> - consume a immediate integer as an element
PC - consume a cell from the evaluation stack as a property
PL:<id> - consume a local given by an immediate id as a property
PT:<id> - consume a litstr given by an immediate id as a property
W - synthesize a new element (no corresponding local variable or
evaluation stack slot)
For example, the following correspondence exists (ignoring setup bytecode):
Source code: $a[3][$b][]['hi'] = 42;
Bytecode: SetM <L:0 EI:3 EL:1 W ET:hi>
Stack: [] -> [42]
Instructions that have an immediate argument vector have different stack
transition descriptions depending on the kind of location descriptor and member
descriptors contained in the immediate argument vector. Member instructions
denote the immediate argument vector using the notation "<loc-desc/M-vector>"
and "C..C" is used to indicate that the member instructions consume a variable
number of cells from the stack. In most cases the immediate vector arguments
are ordered such that the loc-desc comes first (deepest in the stack), with the
last M-vector element last (shallowest in the stack). However, classrefs that
are part of the BaseSC and BaseSL loc-desc inputs always come last (the cell
input to BaseSC comes first though). Instructions accepting an immediate vector
containing a list of iterators and iterator types use the notation
"<iter-vec>".
In addition to describing each instruction, this instruction set documentation
also describes several operations that encapsulate fundamental, but non-trivial
processes that are shared by the Member instructions.
The instruction set is organized into the following sections:
1. Basic instructions
2. Literal and constant instructions
3. Operator instructions
4. Control flow instructions
5. Get instructions
6. Isset, Empty and type querying instructions
7. Mutator instructions
8. Call instructions
9. Member operations
10. Member instructions
11. Iterator instructions
12. Include, eval, and define instructions
13. Miscellaneous instructions
14. Generator creation and execution
15. Async functions
1. Basic instructions
---------------------
Nop [] -> []
No operation. This instruction does nothing.
PopA [A] -> []
PopC [C] -> []
PopV [V] -> []
PopR [R] -> []
Pop. Discards the value on the top of the stack.
Dup [C:<T>] -> [C:<T> C:<T>]
Duplicate. Duplicates the cell $1 and pushes it onto the stack.
Box [C:<T>] -> [V:<T>]
Box. Creates a new ref, sets the new ref to point at a copy of cell $1, and
pushes the ref onto the stack.
Unbox [V:<T>] -> [C:<T>]
Unbox. Creates a copy of the cell that ref $1 points to, and pushes the cell
onto the stack.
BoxR [R:<T>] -> [V:<T>]
Box. If $1 is a ref at run time, this instruction does nothing.
If $1 is a cell at run time, this instruction creates a new ref, sets the new
ref to point at a copy of cell $1, and pushes the ref onto the stack.
BoxRNop [R:<T>] -> [C:<T>]
Box, no op. $1 must be statically known to be boxed.
UnboxR [R:<T>] -> [C:<T>]
Unbox. If $1 is a cell at run time, this instruction does nothing.
If $1 is a ref at run time, this instruction creates a copy of the cell that
ref $1 points to, and pushes the cell onto the stack.