You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Several people have recently asked me where you should start if you want to add some new (syntactic) feature to PHP. As
I’m not aware of any existing tutorials on that matter, I’ll try to illustrate the whole process in the following. At
the same time this is a general introduction to the workings of the Zend Engine. So upfront: I apologize for this overly
long post.
This post assumes that you already have some basic knowledge of C and also know the fundamental concepts of the PHP
implementation (like zvals). If not, you should read up on them beforehand.
As an example I’ll use the addition of an in operator which you might already know from other languages like Python.
It works as follows:
$words = ['hello', 'world', 'foo', 'bar'];
var_dump('hello' in $words); // true
var_dump('foo' in $words); // true
var_dump('blub' in $words); // false
$string = 'PHP is fun!';
var_dump('PHP' in $string); // true
var_dump('Python' in $string); // false
So basically, for arrays the in operator is the same as the in_array function (but without the needle/haystack
problem) and for strings it’s like doing a false !== strpos($str2, $str1).
Prerequisites
Before we can get going, you’ll have to first check out and compile PHP. To do so, you need a few tools. Most of them
are probably already installed on your system, but you may need to install “re2c” and “bison” using the package manager
of your choice. On Ubuntu you’d do this:
// get source code
$ git clone http://git.php.net/repository/php-src.git
$ cd php-src
// create new branch for in operator
$ git checkout -b addInOperator
// build ./configure script
$ ./buildconf
// configure PHP in debug mode and with thread safety
$ ./configure --disable-all --enable-debug --enable-maintainer-zts
// compile (4 is the number of cores you have)
$ make -j4
The PHP binary should now be available in sapi/cli/php. You can try to do a few things:
Now that you have a (hopefully) working PHP compile, we’ll take a look at what PHP actually does when it runs a script.
The life of a PHP script
To run a script PHP goes through three main phases:
Tokenization
Parsing & Compilation
Execution
In the following I’ll explain what exactly is done in each phase, how it is implemented and what we need to change in
order to get the in operator working.
Tokenization
In the first phase PHP reads in the source code and breaks it down into smaller units called “tokens”. For example the
PHP code <?php echo "Hello World!"; would be broken down to the following tokens:
As you can see the raw source code was broken down into semantically meaningful tokens. The process of doing so is
referred to as tokenization, lexing or scanning and is implemented in the zend_language_scanner.l file
of the Zend/ directory.
If you open the file and scroll down a bit (to somewhere around line 1000), you’ll find a large number of token
definitions that look like this:
<ST_IN_SCRIPTING>"exit" {
return T_EXIT;
}
The meaning should be rather obvious: If exit is encountered in the source code, the lexer should tag it as T_EXIT.
The content between < and > is the state that the text should be matched in. ST_IN_SCRIPTING is the normal state
for PHP code. Some examples of other states are ST_DOUBLE_QUOTE (in double quoted string), ST_HEREDOC (in heredoc
string), etc.
Another thing that can be done in the scanning routines is specifying a “semantic” value (also called “lower value” or
“lval” for short). Here is an example:
{LABEL} matches a PHP identifier (it is defined as [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) and the code then
returns the token T_STRING. Additionally it copies the text of the token into zendlval. So if the lexer encounters
an identifier like FooBarClass it’ll set FooBarClass as the lval. The same is also done for strings, numbers,
variable names, etc.
Luckily the in operator does not require in-depth knowledge of the lexer. We just have to add this code snippet
somewhere in the file (analogous to the exit token above):
<ST_IN_SCRIPTING>"in" {
return T_IN;
}
Furthermore we have to let the engine know that we added a new token. For this open the
zend_language_parser.y and insert the following line somewhere among its peers:
Now you should compile PHP again using make -j4 (you have to run it in the top level php-src folder, not in
Zend/). This will generate a new lexer using re2c and compile it. To test that it worked, you can try running
something like this:
This should give you a nice parse error:
Parse error: syntax error, unexpected 'in' (T_IN) in Command line code on line 1
A last thing we have to do is regenerate the data used by the tokenizer extension (which exposes the
internal lexer to userland PHP code). For this you have to cd into the ext/tokenizer folder and execute
./tokenizer_data_gen.sh.
If you run git diff --stat now, you will see something like this:
The zend_language_scanner.c file is the actual lexer generated using re2c. Because it contains line
number information every change to the lexer will cause a huge diff. So don’t worry about that ;)
Parsing & Compilation
Now that the source code is broken down in meaningful tokens PHP has to recognize larger structures like “this is an
if block” or “you’re defining a function there”. This process is called parsing and is defined in the
zend_language_parser.y file. Again this is just a definition file and the actual parser is generated
using bison.
To understand how the parser definitions work, let’s look at an example:
A class statement is
a variable declaration (with access modifier)
or a class constant declaration
or a trait use statement
or a method (with method modifier, optional return-by-ref, method name, parameter list and method body)
.
To find out what exactly a “method modifier” is you’d then go to the method_modifier definition, etc. Should be
fairly straightforward.
So in order to add in support to the parser, all you have to do is add a new expr T_IN expr rule to
expr_without_variable:
expr_without_variable:
...
| expr T_IN expr
...
;
If you now run make -j4 again, bison will attempt to rebuild the parser, but it will fail with the following rather
obscure error message:
A shift/reduce conflict basically means that the parser doesn’t know what to do in some situation. The PHP grammar has
three shift/reduce conflicts by itself (which is expected due to stuff like the dangling elseif/else ambiguity). The
remaining 84 conflicts are caused by the new rule.
The reason is that we didn’t specify how in should behave around other operators. An example:
// if you write
$foo in $bar && $someOtherCond
// should PHP interpret this as
($foo in $bar) && $someOtherCond
// or as
$foo in ($bar && $someOtherCond)
The above is called “operator precedence”. A related concept is “operator associativity”, which determines what happens
when you write $foo in $bar in $baz.
In order to fix the shift/reduce conflicts all you have to do is find the following line at the start of the parser and add T_IN at the end of it:
What this means is that in has the same operator precedence as the <-style comparison operators and that the
operator is non-associative. Here are some examples of what this does:
$foo in $bar && $someOtherCond
// is interpreted as
($foo in $bar) && $someOtherCond
// because `&&` has lower precedence than `in`
$foo in ['abc', 'def'] + ['ghi', 'jkl']
// is interpreted as
$foo in (['abc', 'def'] + ['ghi', 'jkl'])
// because + has a higher precedence than in
$foo in $bar in $baz
// will throw a parse error, because in is non-associative
If you rerun make -j4 now everything should work out fine. After that you can try out running something like
sapi/cli/php -r '"foo" in "bar";'. This should do nothing, apart from printing a memory leak info:
This is expected because we haven’t told the parser yet what it should do when it matches an in operator. And this is
where the curly braces come in. What you have to do is replace the existing expr T_IN expr rule with the following:
The part in the curly braces is called a semantic action and is run whenever the parser matches a certain rule (or part
of it). The strange looking $$, $1 and $3 variables in there are nodes. For example $1 refers to the first
expr, $3 refers to the second expr ($3 because it is the third element in the rule) and $$ is the result node.
zend_do_binary_op is a compiler instruction. It tells the compiler to emit a ZEND_IN opcode that will take $1 and
$3 as operands and put the result into $$.
Compiler instructions are defined in zend_compile.c (with a header entry in
zend_compile.h). zend_do_binary_op for example looks like this:
The code should be easy to grasp and the next section should help putting it into context. One last thing to mention
here is that in most cases you will have to add your own zend_do_ function when adding some new syntax. Adding a new
binary operator is one of the few cases where you don’t have to. So if you have to add a new zend_do_ function, just
have a look at the existing ones. Most of them are quite simple.
Execution
In the previous section I already mentioned that the compiler is emitting opcodes. Let’s look closer at how those
opcodes look like (see zend_compile.h):
A short description of what the individual components mean:
opcode: This is the actual operation that should be executed. This could for example be ZEND_ADD or ZEND_SUB.
op1, op2, result: Every operation can have a maximum of two operands (it can obviously also use just one of
them, or even none at all) and a result node. The type of the nodes is given by op1_type, op2_type and
result_type. We’ll cover how nodes looks like and what types there are in a minute.
extended_value: The extended value is used to store flags or some other integer value. E.g. the variable fetching
instruction uses it to store the variable kind (like ZEND_FETCH_LOCAL or ZEND_FETCH_GLOBAL).
handler: To optimize opcode execution this stores the handler function associated with the opcode and the operand
types. This is determined automatically, so it doesn’t have to be set in the compilation code.
lineno: You know what that means…
There are five basic types that can go into the *_type properties:
IS_TMP_VAR: A temporary variable, usually the result of some expression like $foo + $bar. Temporary variables
cannot be shared, so they don’t implement refcounting. They are usually very short lived, so one instruction creates
them and the next one already frees them again. Temporary variables are usually written as ~n, so ~0 would be the
first temporary variable, ~1 the second, etc.
IS_CV: A compiled variable. To save hash table lookups PHP caches the location of simple variables like $foo in
an array (C array). Furthermore compiled variables allow PHP to optimize the hash table away altogether. CVs are
denoted using !n (n here is the offset into the compiled-variable array.)
IS_VAR: Only simple variables can be turned into CVs. All other kinds of variable accesses, like $foo['bar'] or
$foo->bar return an IS_VAR variable. It basically is just a normal zval (with refcounting and everything). Vars
are written as $n.
IS_CONST: Constants are literals in the code. For example if you write "foo" or 3.141 in your code, those will
be of type IS_CONST. Constants allow for some further optimizations, like reusing zvals for the same value, or
precalculating hashes.
IS_UNUSED: Operand not used.
Related to this is how znode_op itself looks like:
As you can see a node is a union, i.e. it can contain one of the elements above (just one!), depending on context. For
example zv is used to store IS_CONST zvals, var is used to store the variable number for IS_CV, IS_VAR and
IS_TMP_VAR variables. The rest is used in various special circumstances. E.g. jmp_addr is used with the JMP*
instructions (which are required for conditions and loops). Others again are used only during compilation, not during
execution (like constant).
So now that we know how individual ops look like, the only remaining question is where those are stored: For every
function (and file) PHP creates a zend_op_array, which stores the opcodes as well as a lot of other
information. I don’t want to go into detail what the individual components are for, you should just know that this
structure exists.
Now let’s get back to the implementation of the in operator! We already instructed the compiler to emit a ZEND_IN
opcode. Now we have to define what this opcode does.
This is done in the zend_vm_def.h file. If you look at it, you’ll find that it is full of definitions like
this:
The ZEND_IN opcode will look very similar to this, so it’s worth understanding what is going on there. I’ll go through
it line by line:
// The header defines four things:// 1. This is the opcode with ID 1// 2. This opcode is called ZEND_ADD// 3. This opcode accepts CONST, TMP, VAR and CV as the first operand// 4. This opcode accepts CONST, TMP, VAR and CV as the second operandZEND_VM_HANDLER(1,ZEND_ADD,CONST|TMP|VAR|CV,CONST|TMP|VAR|CV){// USE_OPLINE means that we want to access the zend_op as `opline`.// This is required for all opcodes accessing operands or setting a return value.USE_OPLINE// For every operand that is accessed a free_op* variable has to be defined.// It is used to figure out whether the operand needs freeing.zend_free_opfree_op1,free_op2;
<span>// SAVE_OPLINE() actually loads the zend_op into `opline`.</span>
<span>// USE_OPLINE was only the declaration</span>
<span>SAVE_OPLINE</span><span>();</span>
<span>// Call the fast add function</span>
<span>fast_add_function</span><span>(</span>
<span>// And tell it to put the result into the temporary result variable</span>
<span>// EX_T here accesses the temporary variable with ID opline->result.var.</span>
<span>&</span><span>EX_T</span><span>(</span><span>opline</span><span>-></span><span>result</span><span>.</span><span>var</span><span>).</span><span>tmp_var</span><span>,</span>
<span>// Fetch the first operand for reading (the R in BP_VAR_R)</span>
<span>GET_OP1_ZVAL_PTR</span><span>(</span><span>BP_VAR_R</span><span>),</span>
<span>// Fetch the second operand for reading</span>
<span>GET_OP2_ZVAL_PTR</span><span>(</span><span>BP_VAR_R</span><span>)</span> <span>TSRMLS_CC</span><span>);</span>
<span>// Free both operands (if necessary)</span>
<span>FREE_OP1</span><span>();</span>
<span>FREE_OP2</span><span>();</span>
<span>// Check for exceptions. Exceptions can occur virtually everywhere, so one has to check for them in nearly all</span>
<span>// opcodes. If in doubt, add the check.</span>
<span>CHECK_EXCEPTION</span><span>();</span>
<span>// Go to the next opcode</span>
<span>ZEND_VM_NEXT_OPCODE</span><span>();</span>
}
As you probably noticed there is a lot UPPERCASE_STUFF in there. The reason is that zend_vm_def.h once again is only
a definition file. The actual Zend VM is generated from it and stored in zend_vm_execute.h (biiig file). PHP
has three different virtual machine kinds, namely CALL (default), GOTO and SWITCH. Because they all have different
implementation details the definition file uses lots of pseudo-macros like USE_OPLINE that are later replaced by some
concrete implementation.
Furthermore the generated VM creates specialized implementations from all possible operand-type permutations. So in the
end there won’t be a single ZEND_ADD function, but rather different functions for ZEND_ADD_CONST_CONST,
ZEND_ADD_CONST_TMP, ZEND_ADD_CONST_VAR…
Now, in order to implement the ZEND_IN opcode, you should add a new opcode definition skeleton at the end of the
zend_vm_def.h file:
// 159 is the number of the next free opcode for me. You may need to choose a larger numberZEND_VM_HANDLER(159,ZEND_IN,CONST|TMP|VAR|CV,CONST|TMP|VAR|CV){USE_OPLINEzend_free_opfree_op1,free_op2;zval*op1,*op2;
This does nothing more than fetching the operands and discarding them again right away.
In order to generate a new VM you now have to run php zend_vm_gen.php from within the Zend/ directory. (If it gives
you lots of warnings about the /e modifier being deprecated, ignore those.) After that go into the top level directory
again and run make -j4 to recompile.
If everything worked out fine you can now run sapi/cli/php -r '"foo" in "bar";' without getting any errors (but it
still won’t do anything).
Now, finally, we can implement the actual logic. Let’s start with the string case:
The hardest part here is actually casting the needle into a string. This is done using zend_make_printable_zval. This
function may either have to create a new zval, or not. That’s why we pass op1_copy and use_copy into it. If the
function copied the value we just put it into the op1 variable (so we don’t have to deal with two different variables
everywhere). Furthermore the copy has to be freed at the end (what the last three lines do).
After that everything is simple. We can find out whether the haystack contains the needle using zend_memnstr. If the
needle is an empty string we just return true directly (because the empty string is part of every string).
Now, if you added the above code in place of the /* TODO */, reran zend_vm_gen.php and recompiled using make -j4,
you will already have a half-working in operator:
$ sapi/cli/php -r 'var_dump("foo" in "bar");'
bool(false)
$ sapi/cli/php -r 'var_dump("foo" in "foobar");'
bool(true)
$ sapi/cli/php -r 'var_dump("foo" in "hallo foo world");'
bool(true)
$ sapi/cli/php -r 'var_dump(2 in "123");'
bool(true)
$ sapi/cli/php -r 'var_dump(5 in "123");'
bool(false)
$ sapi/cli/php -r 'var_dump("" in "test");'
bool(true)
<span>/* Start under the assumption that the value isn't contained */</span>
<span>ZVAL_FALSE</span><span>(</span><span>&</span><span>EX_T</span><span>(</span><span>opline</span><span>-></span><span>result</span><span>.</span><span>var</span><span>).</span><span>tmp_var</span><span>);</span>
<span>/* Iterate through the array */</span>
<span>zend_hash_internal_pointer_reset_ex</span><span>(</span><span>Z_ARRVAL_P</span><span>(</span><span>op2</span><span>),</span> <span>&</span><span>pos</span><span>);</span>
<span>while</span> <span>(</span><span>zend_hash_get_current_data_ex</span><span>(</span><span>Z_ARRVAL_P</span><span>(</span><span>op2</span><span>),</span> <span>(</span><span>void</span> <span>**</span><span>)</span> <span>&</span><span>value</span><span>,</span> <span>&</span><span>pos</span><span>)</span> <span>==</span> <span>SUCCESS</span><span>)</span> <span>{</span>
<span>zval</span> <span>result</span><span>;</span>
<span>/* Compare values using == */</span>
<span>if</span> <span>(</span><span>is_equal_function</span><span>(</span><span>&</span><span>result</span><span>,</span> <span>op1</span><span>,</span> <span>*</span><span>value</span> <span>TSRMLS_CC</span><span>)</span> <span>==</span> <span>SUCCESS</span> <span>&&</span> <span>Z_LVAL</span><span>(</span><span>result</span><span>))</span> <span>{</span>
<span>ZVAL_TRUE</span><span>(</span><span>&</span><span>EX_T</span><span>(</span><span>opline</span><span>-></span><span>result</span><span>.</span><span>var</span><span>).</span><span>tmp_var</span><span>);</span>
<span>break</span><span>;</span>
<span>}</span>
<span>zend_hash_move_forward_ex</span><span>(</span><span>Z_ARRVAL_P</span><span>(</span><span>op2</span><span>),</span> <span>&</span><span>pos</span><span>);</span>
<span>}</span>
}
Here the haystack is simply traversed and every values is checked against the needle. We compare using ==. To compare
using === one would have to replace is_equal_function with is_identical_function.
After rerunning zend_vm_gen.php and make -j4 the in operator should be fully operational:
$ sapi/cli/php -r 'var_dump("test" in []);'
bool(false)
$ sapi/cli/php -r 'var_dump("test" in ["foo", "bar"]);'
bool(false)
$ sapi/cli/php -r 'var_dump("test" in ["foo", "test", "bar"]);'
bool(true)
$ sapi/cli/php -r 'var_dump(0 in ["foo"]);'
bool(true) // because we're comparing using ==
One last thing to consider is what should happen when the second operator is neither array nor string. I’ll just take
the easy way out for this: Throw a warning and return false:
else{zend_error(E_WARNING,"Right operand of in has to be either string or array");ZVAL_FALSE(&EX_T(opline->result.var).tmp_var);}
After rebuilding and recompiling the VM:
$ sapi/cli/php -r 'var_dump("foo" in new stdClass);'
Warning: Right operand of in has to be either string or array in Command line code on line 1
bool(false)
This might not be the best behavior. E.g. one could allow to do things like 2 in 123 or 3.14 in 3.141. But I’m too
lazy for that right now ;)
Finishing thoughts
I hope the above helped you understand how to add new features to PHP and what the Zend Engine does when it runs a PHP
script. But even though the article is quite long I covered only small parts of the whole system. So when you want to do
modifications to the ZE the largest part of the job will be reading the existing code. The cross-reference tool
helps a lot when browsing through the PHP source code. Apart from that you can always ask questions in the #php.pecl
room on efnet.
After you created an implementation for whatever feature you want, the next step is bringing it up on the internals
mailing list. People will then look at your feature and decide whether or not it should go in.
Oh, and one last thing: The in operator here was just an example. I don’t plan on proposing it for inclusion ;)
As always, if you have any further comments or questions, please leave them below.
</section></div></div></div><br>
via www.npopov.com https://www.npopov.com
February 13, 2025 at 10:11AM
The text was updated successfully, but these errors were encountered:
How to add new (syntactic) features to PHP
https://ift.tt/y1pUBgJ
Several people have recently asked me where you should start if you want to add some new (syntactic) feature to PHP. As I’m not aware of any existing tutorials on that matter, I’ll try to illustrate the whole process in the following. At the same time this is a general introduction to the workings of the Zend Engine. So upfront: I apologize for this overly long post.
This post assumes that you already have some basic knowledge of C and also know the fundamental concepts of the PHP implementation (like zvals). If not, you should read up on them beforehand.
As an example I’ll use the addition of an
in
operator which you might already know from other languages like Python. It works as follows:So basically, for arrays the
in
operator is the same as thein_array
function (but without the needle/haystack problem) and for strings it’s like doing afalse !== strpos($str2, $str1)
.Prerequisites
Before we can get going, you’ll have to first check out and compile PHP. To do so, you need a few tools. Most of them are probably already installed on your system, but you may need to install “re2c” and “bison” using the package manager of your choice. On Ubuntu you’d do this:
Next, clone php-src from git and compile it:
The PHP binary should now be available in
sapi/cli/php
. You can try to do a few things:Now that you have a (hopefully) working PHP compile, we’ll take a look at what PHP actually does when it runs a script.
The life of a PHP script
To run a script PHP goes through three main phases:
In the following I’ll explain what exactly is done in each phase, how it is implemented and what we need to change in order to get the
in
operator working.Tokenization
In the first phase PHP reads in the source code and breaks it down into smaller units called “tokens”. For example the PHP code
<?php echo "Hello World!";
would be broken down to the following tokens:As you can see the raw source code was broken down into semantically meaningful tokens. The process of doing so is referred to as tokenization, lexing or scanning and is implemented in the
zend_language_scanner.l
file of theZend/
directory.If you open the file and scroll down a bit (to somewhere around line 1000), you’ll find a large number of token definitions that look like this:
The meaning should be rather obvious: If
exit
is encountered in the source code, the lexer should tag it asT_EXIT
. The content between<
and>
is the state that the text should be matched in.ST_IN_SCRIPTING
is the normal state for PHP code. Some examples of other states areST_DOUBLE_QUOTE
(in double quoted string),ST_HEREDOC
(in heredoc string), etc.Another thing that can be done in the scanning routines is specifying a “semantic” value (also called “lower value” or “lval” for short). Here is an example:
{LABEL}
matches a PHP identifier (it is defined as[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
) and the code then returns the tokenT_STRING
. Additionally it copies the text of the token intozendlval
. So if the lexer encounters an identifier likeFooBarClass
it’ll setFooBarClass
as the lval. The same is also done for strings, numbers, variable names, etc.Luckily the
in
operator does not require in-depth knowledge of the lexer. We just have to add this code snippet somewhere in the file (analogous to theexit
token above):Furthermore we have to let the engine know that we added a new token. For this open the
zend_language_parser.y
and insert the following line somewhere among its peers:Now you should compile PHP again using
make -j4
(you have to run it in the top levelphp-src
folder, not inZend/
). This will generate a new lexer using re2c and compile it. To test that it worked, you can try running something like this:This should give you a nice parse error:
A last thing we have to do is regenerate the data used by the tokenizer extension (which exposes the internal lexer to userland PHP code). For this you have to
cd
into theext/tokenizer
folder and execute./tokenizer_data_gen.sh
.If you run
git diff --stat
now, you will see something like this:The
zend_language_scanner.c
file is the actual lexer generated using re2c. Because it contains line number information every change to the lexer will cause a huge diff. So don’t worry about that ;)Parsing & Compilation
Now that the source code is broken down in meaningful tokens PHP has to recognize larger structures like “this is an
if
block” or “you’re defining a function there”. This process is called parsing and is defined in thezend_language_parser.y
file. Again this is just a definition file and the actual parser is generated using bison.To understand how the parser definitions work, let’s look at an example:
For now, let’s leave the parts in curly braces out, so we’re left with the following:
You can read this as:
To find out what exactly a “method modifier” is you’d then go to the
method_modifier
definition, etc. Should be fairly straightforward.So in order to add
in
support to the parser, all you have to do is add a newexpr T_IN expr
rule toexpr_without_variable
:If you now run
make -j4
again, bison will attempt to rebuild the parser, but it will fail with the following rather obscure error message:A shift/reduce conflict basically means that the parser doesn’t know what to do in some situation. The PHP grammar has three shift/reduce conflicts by itself (which is expected due to stuff like the dangling elseif/else ambiguity). The remaining 84 conflicts are caused by the new rule.
The reason is that we didn’t specify how
in
should behave around other operators. An example:The above is called “operator precedence”. A related concept is “operator associativity”, which determines what happens when you write
$foo in $bar in $baz
.In order to fix the shift/reduce conflicts all you have to do is find the following line at the start of the parser and add
T_IN
at the end of it:What this means is that
in
has the same operator precedence as the<
-style comparison operators and that the operator is non-associative. Here are some examples of what this does:If you rerun
make -j4
now everything should work out fine. After that you can try out running something likesapi/cli/php -r '"foo" in "bar";'
. This should do nothing, apart from printing a memory leak info:This is expected because we haven’t told the parser yet what it should do when it matches an
in
operator. And this is where the curly braces come in. What you have to do is replace the existingexpr T_IN expr
rule with the following:The part in the curly braces is called a semantic action and is run whenever the parser matches a certain rule (or part of it). The strange looking
$$
,$1
and$3
variables in there are nodes. For example$1
refers to the firstexpr
,$3
refers to the secondexpr
($3
because it is the third element in the rule) and$$
is the result node.zend_do_binary_op
is a compiler instruction. It tells the compiler to emit aZEND_IN
opcode that will take$1
and$3
as operands and put the result into$$
.Compiler instructions are defined in
zend_compile.c
(with a header entry inzend_compile.h
).zend_do_binary_op
for example looks like this:The code should be easy to grasp and the next section should help putting it into context. One last thing to mention
here is that in most cases you will have to add your own
zend_do_
function when adding some new syntax. Adding a newbinary operator is one of the few cases where you don’t have to. So if you have to add a new
zend_do_
function, justhave a look at the existing ones. Most of them are quite simple.
Execution
In the previous section I already mentioned that the compiler is emitting opcodes. Let’s look closer at how those opcodes look like (see
zend_compile.h
):A short description of what the individual components mean:
opcode
: This is the actual operation that should be executed. This could for example beZEND_ADD
orZEND_SUB
.op1
,op2
,result
: Every operation can have a maximum of two operands (it can obviously also use just one of them, or even none at all) and a result node. The type of the nodes is given byop1_type
,op2_type
andresult_type
. We’ll cover how nodes looks like and what types there are in a minute.extended_value
: The extended value is used to store flags or some other integer value. E.g. the variable fetching instruction uses it to store the variable kind (likeZEND_FETCH_LOCAL
orZEND_FETCH_GLOBAL
).handler
: To optimize opcode execution this stores the handler function associated with the opcode and the operand types. This is determined automatically, so it doesn’t have to be set in the compilation code.lineno
: You know what that means…There are five basic types that can go into the
*_type
properties:IS_TMP_VAR
: A temporary variable, usually the result of some expression like$foo + $bar
. Temporary variables cannot be shared, so they don’t implement refcounting. They are usually very short lived, so one instruction creates them and the next one already frees them again. Temporary variables are usually written as~n
, so~0
would be the first temporary variable,~1
the second, etc.IS_CV
: A compiled variable. To save hash table lookups PHP caches the location of simple variables like$foo
in an array (C array). Furthermore compiled variables allow PHP to optimize the hash table away altogether. CVs are denoted using!n
(n
here is the offset into the compiled-variable array.)IS_VAR
: Only simple variables can be turned into CVs. All other kinds of variable accesses, like$foo['bar']
or$foo->bar
return anIS_VAR
variable. It basically is just a normal zval (with refcounting and everything). Vars are written as$n
.IS_CONST
: Constants are literals in the code. For example if you write"foo"
or3.141
in your code, those will be of typeIS_CONST
. Constants allow for some further optimizations, like reusing zvals for the same value, or precalculating hashes.IS_UNUSED
: Operand not used.Related to this is how
znode_op
itself looks like:As you can see a node is a union, i.e. it can contain one of the elements above (just one!), depending on context. For example
zv
is used to storeIS_CONST
zvals,var
is used to store the variable number forIS_CV
,IS_VAR
andIS_TMP_VAR
variables. The rest is used in various special circumstances. E.g.jmp_addr
is used with theJMP*
instructions (which are required for conditions and loops). Others again are used only during compilation, not during execution (likeconstant
).So now that we know how individual ops look like, the only remaining question is where those are stored: For every function (and file) PHP creates a
zend_op_array
, which stores the opcodes as well as a lot of other information. I don’t want to go into detail what the individual components are for, you should just know that this structure exists.Now let’s get back to the implementation of the
in
operator! We already instructed the compiler to emit aZEND_IN
opcode. Now we have to define what this opcode does.This is done in the
zend_vm_def.h
file. If you look at it, you’ll find that it is full of definitions like this:The
ZEND_IN
opcode will look very similar to this, so it’s worth understanding what is going on there. I’ll go throughit line by line:
As you probably noticed there is a lot
UPPERCASE_STUFF
in there. The reason is thatzend_vm_def.h
once again is onlya definition file. The actual Zend VM is generated from it and stored in
zend_vm_execute.h
(biiig file). PHPhas three different virtual machine kinds, namely
CALL
(default),GOTO
andSWITCH
. Because they all have differentimplementation details the definition file uses lots of pseudo-macros like
USE_OPLINE
that are later replaced by someconcrete implementation.
Furthermore the generated VM creates specialized implementations from all possible operand-type permutations. So in the end there won’t be a single
ZEND_ADD
function, but rather different functions forZEND_ADD_CONST_CONST
,ZEND_ADD_CONST_TMP
,ZEND_ADD_CONST_VAR
…Now, in order to implement the
ZEND_IN
opcode, you should add a new opcode definition skeleton at the end of thezend_vm_def.h
file:This does nothing more than fetching the operands and discarding them again right away.
In order to generate a new VM you now have to run
php zend_vm_gen.php
from within theZend/
directory. (If it gives you lots of warnings about the/e
modifier being deprecated, ignore those.) After that go into the top level directory again and runmake -j4
to recompile.If everything worked out fine you can now run
sapi/cli/php -r '"foo" in "bar";'
without getting any errors (but it still won’t do anything).Now, finally, we can implement the actual logic. Let’s start with the string case:
The hardest part here is actually casting the needle into a string. This is done using
zend_make_printable_zval
. Thisfunction may either have to create a new zval, or not. That’s why we pass
op1_copy
anduse_copy
into it. If thefunction copied the value we just put it into the
op1
variable (so we don’t have to deal with two different variableseverywhere). Furthermore the copy has to be freed at the end (what the last three lines do).
After that everything is simple. We can find out whether the haystack contains the needle using
zend_memnstr
. If the needle is an empty string we just returntrue
directly (because the empty string is part of every string).Now, if you added the above code in place of the
/* TODO */
, reranzend_vm_gen.php
and recompiled usingmake -j4
, you will already have a half-workingin
operator:Next, we have to implement the array behavior:
Here the haystack is simply traversed and every values is checked against the needle. We compare using
==
. To compareusing
===
one would have to replaceis_equal_function
withis_identical_function
.After rerunning
zend_vm_gen.php
andmake -j4
thein
operator should be fully operational:One last thing to consider is what should happen when the second operator is neither array nor string. I’ll just take the easy way out for this: Throw a warning and return false:
After rebuilding and recompiling the VM:
This might not be the best behavior. E.g. one could allow to do things like
2 in 123
or3.14 in 3.141
. But I’m too lazy for that right now ;)Finishing thoughts
I hope the above helped you understand how to add new features to PHP and what the Zend Engine does when it runs a PHP script. But even though the article is quite long I covered only small parts of the whole system. So when you want to do modifications to the ZE the largest part of the job will be reading the existing code. The cross-reference tool helps a lot when browsing through the PHP source code. Apart from that you can always ask questions in the #php.pecl room on efnet.
After you created an implementation for whatever feature you want, the next step is bringing it up on the internals mailing list. People will then look at your feature and decide whether or not it should go in.
Oh, and one last thing: The
in
operator here was just an example. I don’t plan on proposing it for inclusion ;)As always, if you have any further comments or questions, please leave them below.
via www.npopov.com https://www.npopov.com
February 13, 2025 at 10:11AM
The text was updated successfully, but these errors were encountered: