
ETH_MAC indirection erratum #453

Open: wants to merge 1 commit into master


bscottm (Contributor) commented Mar 5, 2025

Pervasive misuse of "ETH_MAC *" (a pointer to an ETH_MAC, i.e., a pointer to a 6-element unsigned char array) when a simple "ETH_MAC" is correct. The best example of this was eth_mac_fmt() in sim_ether.c, with the following prototype:

t_stat eth_mac_fmt (ETH_MAC* const mac, char* strmac)

The first parameter is a pointer to an array of 6 unsigned characters, whereas it really just wants to be a pointer to the first element of the array:

t_stat eth_mac_fmt (const ETH_MAC mac, char* strmac)

The "ETH_MAC *" indirection error also results in subtle memcpy() and memcmp() issues, e.g.:

void network_func(DEVICE *dev, ETH_MAC *mac)
{
  ETH_MAC other_mac;

  /* ...code... */

  /* memcpy() bug: */
  memcpy(other_mac, mac, sizeof(ETH_MAC));

  /* or worse: */
  memcpy(mac, other_mac, sizeof(ETH_MAC));
}

eth_copy_mac() and eth_mac_cmp() replace calls to memcpy() and memcmp() that copy or compare Ethernet MAC addresses. These are type-enforcing functions, i.e., the parameters are ETH_MAC-s, to avoid the subtle memcpy() and memcmp() bugs.

This fix solves at least one Heisenbug in _eth_close() while free()-ing write request buffers (and possibly other Heisenbugs.)


bscottm (Contributor Author) commented Mar 5, 2025

@pkoning2:

  • Really need GitHub CI/CD updates #450 so that the check build succeeds. I may have to resubmit this PR once that happens. The LTO bug is a real pain.
  • This problem has been around for a long time and is likely the source of Heisenbugs. This PR solves one that's been plaguing me in the network code for a couple of months.

pkoning2 (Member) commented Mar 5, 2025

I wonder about the correct vs. incorrect thing. The problem is that C has a broken type system, and its way of treating an array as essentially synonymous with a pointer is what lies at the bottom of this.

bscottm (Contributor Author) commented Mar 5, 2025

> I wonder about the correct vs. incorrect thing. The problem is that C has a broken type system, and its way of treating an array as essentially synonymous with a pointer is what lies at the bottom of this.

Quite true. However, passing around a pointer to an array vs. the pointer to the array's first element is the issue here. At least with a type-enforcing inline function, code is more "correct" (since the parameters are pointers to the array's first element.)

bscottm (Contributor Author) commented Mar 5, 2025

> I wonder about the correct vs. incorrect thing. The problem is that C has a broken type system, and its way of treating an array as essentially synonymous with a pointer is what lies at the bottom of this.

Quite true. However, passing around a pointer to an array vs. the pointer to the array's first element is the issue here. At least with a type-enforcing inline function, code is more "correct" (since the parameters are pointers to the array's first element.)

You'd have been surprised how many places code was memcpy-ing using the first 4 or 8 bytes of the ETH_MAC array as the source pointer (the effect of "ETH_MAC *"). Or awkward (ETH_MAC *) casts to eth_mac_fmt() (that should have signaled an obvious error to someone reviewing the code.)

bscottm (Contributor Author) commented Mar 5, 2025

> I wonder about the correct vs. incorrect thing. The problem is that C has a broken type system, and its way of treating an array as essentially synonymous with a pointer is what lies at the bottom of this.

Quite true. However, passing around a pointer to an array vs. the pointer to the array's first element is the issue here. At least with a type-enforcing inline function, code is more "correct" (since the parameters are pointers to the array's first element.)

To put it another way, ETH_MAC * is unsigned char (*)[6], a pointer to the whole array, when passed as a parameter, and an unsigned char ** cast to it goes unnoticed. ETH_MAC is effectively unsigned char *. I vigorously assert that ETH_MAC * is incorrect and not what was originally intended.

markpizz (Contributor) commented Mar 5, 2025

So let's see. You need to change 100's of lines in 10 files to change the view of an ETH_MAC, which is a 6-byte object, into an array of 6 bytes.

You claim that "you'd have been surprised how many places code was memcpy-ing using the first 4 or 8 bytes of". Given the 100's of lines of code affected in your proposed "fix", you don't specifically identify any of these 4 or 8 byte error cases.

Looking through ALL of your changes, I can't find any memcpy (or memcmp) that you "fixed" with your massive changes that actually did what you claim (4 or 8 byte).

markpizz (Contributor) commented Mar 5, 2025

You claim: "This fix solves at least one Heisenbug in _eth_close() while free()-ing write request buffers (and possibly other Heisenbugs.)", but there isn't actually a function named _eth_close(). There is a function named _eth_close_port() which makes no reference to ETH_MAC objects. Similarly there is a function named eth_close() which also makes no reference to any ETH_MAC objects.

bscottm (Contributor Author) commented Mar 5, 2025

> Looking through ALL of your changes, I can't find any memcpy (or memcmp) that you "fixed" with your massive changes that actually did what you claim (4 or 8 byte).

ETH_MAC * is a pointer to the whole array (unsigned char (*)[6]), and an unsigned char ** cast to it is accepted without complaint, so if you have code that looks like the example I wrote above, there's a subtle bug when ETH_MAC and ETH_MAC * are both coerced to void * in a memcpy. It might not be obvious to you at first glance, but it's really obvious when replaced with a type-enforcing inline.

Just to make it really plain, I'll further annotate the example:

void network_func(DEVICE *dev, ETH_MAC *mac)
{
  ETH_MAC other_mac;

  /* ...code... */

  /* memcpy() bug: other_mac decays to 'unsigned char *' (pointer to the first
   * array element); mac has type 'unsigned char (*)[6]' (pointer to the whole
   * array). Both are silently coerced to 'void *', so if a caller built mac by
   * casting the address of an 'unsigned char *' variable (an 'unsigned char **'),
   * the pointer's own bytes are copied instead of the MAC address.
   */
  memcpy(other_mac, mac, sizeof(ETH_MAC));

  /* or worse: */
  memcpy(mac, other_mac, sizeof(ETH_MAC));
}

In other words, ETH_MAC * and ETH_MAC are distinct and different types.

bscottm (Contributor Author) commented Mar 5, 2025

The reason for replacing all of the memcpy and memcmp calls involving an ETH_MAC is consistency -- instead of examining every single memcpy or memcmp to catch the &mac2 case below, it's just easier to replace every MAC address memcpy and memcmp with eth_copy_mac and eth_mac_cmp to catch all cases.

Example program where the extra indirection is an error (i.e., MAC address comes from a uint8_t array, aka a packet):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t ETH_MAC[6];

void foo(const char *label, ETH_MAC *mac)
{
    printf("%-12s: mac = %0.16p [*mac = %0.16p, **mac = %02x]\n", label, (void *) mac, (void *) *mac, **mac);
}

int main(void)
{
    ETH_MAC mac = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xab };

    puts("mac and &mac arrays:");
    foo("mac", (ETH_MAC *) mac);
    foo("&mac", &mac);

    uint8_t *mac2 = malloc(64 * sizeof(uint8_t));
    for (size_t i = 0; i < 63; ++i)
      mac2[i] = (uint8_t) (i + 127);

    puts("\nuint8_t masquerading as ETH_MAC *:");
    foo("mac2", (ETH_MAC *) mac2);
    foo("&mac2", (ETH_MAC *) &mac2);

    return 0;
}

Output:

mac and &mac arrays:
mac         : mac = 0x0000007fd02bc5a0 [*mac = 0x0000007fd02bc5a0, **mac = 01]
&mac        : mac = 0x0000007fd02bc5a0 [*mac = 0x0000007fd02bc5a0, **mac = 01]

uint8_t masquerading as ETH_MAC *:
mac2        : mac = 0x00000055afebb6b0 [*mac = 0x00000055afebb6b0, **mac = 7f]
&mac2       : mac = 0x0000007fd02bc598 [*mac = 0x0000007fd02bc598, **mac = b0]

bscottm (Contributor Author) commented Mar 5, 2025

Remove the extra ETH_MAC indirection, and you get:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t ETH_MAC[6];

void foo(const char *label, ETH_MAC mac)
{
    printf("%-12s: mac = %0.16p [*mac = %02x]\n", label, (void *) mac, *mac);
}

int main(void)
{
    ETH_MAC mac = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xab };

    puts("mac array:");
    foo("mac", mac);

    uint8_t *mac2 = malloc(64 * sizeof(uint8_t));
    for (size_t i = 0; i < 63; ++i)
      mac2[i] = (uint8_t) (i + 127);

    puts("\nuint8_t masquerading as ETH_MAC:");
    foo("mac2", mac2);

    return 0;
}

Output:

mac array:
mac         : mac = 0x0000007ff2cfa7d8 [*mac = 01]

uint8_t masquerading as ETH_MAC:
mac2        : mac = 0x000000558320f6b0 [*mac = 7f]

You get a consistent result even when the ETH_MAC is actually part of an underlying packet buffer.

bscottm (Contributor Author) commented Mar 6, 2025

> You claim: "This fix solves at least one Heisenbug in _eth_close() while free()-ing write request buffers (and possibly other Heisenbugs.)", but there isn't actually a function named _eth_close(). There is a function named _eth_close_port() which makes no reference to ETH_MAC objects. Similarly there is a function named eth_close() which also makes no reference to any ETH_MAC objects.

Oh. Gee. I had a typo. BFD.

bscottm (Contributor Author) commented Mar 6, 2025

> You claim: "This fix solves at least one Heisenbug in _eth_close() while free()-ing write request buffers (and possibly other Heisenbugs.)", but there isn't actually a function named _eth_close(). There is a function named _eth_close_port() which makes no reference to ETH_MAC objects. Similarly there is a function named eth_close() which also makes no reference to any ETH_MAC objects.

@markpizz There is an eth_close() function -- ok, so I typo-ed the name with an extra underscore. The Heisenbug pops up at line 2686 while free()-ing the write_requests list (MS's heap validator detects corruption on the pdp10-kl simulator, primarily, since that's what I'm testing and working with at the moment.) Going through all of the ETH_MAC memcpy()-s to track that down would have been very time consuming and error prone.

Much easier and more reliable to replace the memcpy()-s with a type enforcing function to simply "get it right."

markpizz (Contributor) commented Mar 7, 2025

> In other words, ETH_MAC * and ETH_MAC are distinct and different types.

I don't dispute this statement. However, there may (or may not) be one or at most a couple of places where this is mixed up. Fixing those few places would be an appropriate fix rather than the 100's of lines of code changes you've proposed. You carefully avoid pointing out any specific case (or cases).

>> You claim: "This fix solves at least one Heisenbug in _eth_close() while free()-ing write request buffers (and possibly other Heisenbugs.)", but there isn't actually a function named _eth_close(). There is a function named _eth_close_port() which makes no reference to ETH_MAC objects. Similarly there is a function named eth_close() which also makes no reference to any ETH_MAC objects.

> Oh. Gee. I had a typo. BFD.

Surely you can have a typo.

> @markpizz There is an eth_close() function -- ok, so I typo-ed the name with an extra underscore. The Heisenbug pops up at line 2686 while free()-ing the write_requests list (MS's heap validator detects corruption on the pdp10-kl simulator, primarily, since that's what I'm testing and working with at the moment.) Going through all of the ETH_MAC memcpy()-s to track that down would have been very time consuming and error prone.

> Much easier and more reliable to replace the memcpy()-s with a type enforcing function to simply "get it right."

Independent of the typo, you claim that there is a problem there which, in fact, you don't actually fix, since you didn't find ANY case where the issue you claim has happened (wrong pointer to memcpy) is actually at fault. Your changes no longer demonstrate the memory corruption problem, which occurs during cleanup of a linked list WITHOUT ANY ETH_MAC objects being referenced by eth_close(). Your fix changes SO MANY things that you've merely hidden the potential Heisenbug.

You claim: "Going through all of the ETH_MAC memcpy()-s to track that down would have been very time consuming and error prone." Actually, there are maybe 3 places in essentially any running simulator which write ETH_MAC objects. Those should be somewhat easy to find, but there are also other useful ways to track down a memory corruption problem. Since you are already using MS's heap validator, there should be an easy way to force a heap validation check with a specific call in the very small number of places that actually copy ETH_MAC objects. There could be bugs in other code (outside of sim_ether) that corrupt the heap. I recently sent a message to the author of the pdp10-kl simulator about such a bug in kx10_imp.c that could possibly write outside of the contents of a packet.

Maybe you want to provide specific details about how you reproduce the failing case.

bscottm (Contributor Author) commented Mar 7, 2025

> Actually, there are maybe 3 places in essentially any running simulator which write ETH_MAC objects.

You're making a mountain out of a molehill.

In terms of the number of lines changed, I'll agree that a lot of lines changed. But it's not the massive structural change you vigorously assert.

What are those changes?

  1. Removing a level of indirection from ETH_MAC variable references. Quite a few places where the & is removed and no longer necessary, like calls to eth_mac_fmt() and eth_check_address_conflict().
  2. Replacing memcpy(dst, src, sizeof(ETH_MAC)) with eth_copy_mac(dst, src)
  3. Replacing memcmp(dst, src, sizeof(ETH_MAC)) with eth_mac_cmp(dst, src)
  4. Consolidating all-zero and all-one Ethernet addresses as eth_mac_any and eth_mac_bcast constants. No particular reason why those constant arrays have to be replicated by individual simulators when two constants suffice.
  5. Changing the addresses argument in the eth_filter family of functions from ETH_MAC const *addresses to const ETH_MAC addresses[] for clarity that they are arrays of MAC addresses.

There are places where a packet buffer is used as an ETH_MAC and referenced as &msg[0], where msg is a uint8 * treated as an array. No reason to change those references unless the compiler whines, which it doesn't.

Perhaps I have only moved a Heisenbug's effect, but I'm more confident that the Heisenbug is resolved. That's the nature of a Heisenbug -- one cannot be completely certain.

One thing is certain: An extra level of ETH_MAC indirection is unnecessary and can lead to very subtle bugs. Code is cleaner without the extra indirection.

> Maybe you want to provide specific details about how you reproduce the failing case.

And if you're unable to reproduce it? What would that prove? All it would prove is that you cannot reproduce the problem while I still get to suffer with it. I've solved a particular issue for which I've submitted a PR and from which the open-simh could benefit.

markpizz (Contributor) commented Mar 7, 2025

>> Maybe you want to provide specific details about how you reproduce the failing case.

> And if you're unable to reproduce it? What would that prove? All it would prove is that you cannot reproduce the problem while I still get to suffer with it. I've solved a particular issue for which I've submitted a PR and from which the open-simh could benefit.

Your PR DOES NOT specifically identify an actual problem and a fix for it.

If you were solving the whole network problem from scratch, you could code the various pieces however you want, but the existing WORKING code happens to be written with a view of things that you personally wouldn't have chosen from scratch.

Your PR does not prove that the problem has been solved, only that it seems to have not been detected in the same way as before.

Interesting that you see this problem with only one simulator but your change requires changes to many simulators (dozens of simulators involving 10 files and hundreds of lines of change).

Once again, describe the problem's reproduction and I will either demonstrate it and propose a very minimal fix, or prove that it has nothing to do with memcpy of ETH_MAC objects.

bscottm (Contributor Author) commented Mar 10, 2025

> Your PR DOES NOT specifically identify an actual problem and a fix for it.

First and foremost, using ETH_MAC * is a TYPE ERROR. That's an actual problem, despite being accepted as a legitimate type by the compiler. It masks a semantic issue when an unsigned char (*)[6] is interchanged with an unsigned char ** and both are coerced to void * by the memcpy()-s that copy these ETH_MAC * entities.

Secondly, this PR fixes a heap corruption issue encountered at line 2688 of sim_ether.c when deallocating the ETH_DEV's write_requests linked list while quitting SIMH. This heap corruption is not detected earlier because heap consistency is only checked when calling malloc()-related functions, and generally only within the particular allocation arena affected. 1594-byte packet buffers live in a different arena than smaller allocations: a 1594-byte allocation lives in a 2K-sized arena, whereas a 192-byte allocation lives in a 256-byte arena.

> If you were solving the whole network problem from scratch, you could code the various pieces however you want, but the existing WORKING code happens to be written with a view of things that you personally wouldn't have chosen from scratch.

Type safety is the issue here.

> Your PR does not prove that the problem has been solved, only that it seems to have not been detected in the same way as before.

So, you're admitting there's a bug while claiming that I haven't actually found one?

> Interesting that you see this problem with only one simulator but your change requires changes to many simulators (dozens of simulators involving 10 files and hundreds of lines of change).

Not changing them all to leverage the C language's type system while fixing one simulator would be irresponsible and incomplete. Of the 256 lines changed, many of those changes remove an unnecessary level of indirection so that the code only deals with ETH_MAC entities, not ETH_MAC * entities.

> Once again, describe the problem's reproduction and I will either demonstrate it and propose a very minimal fix, or prove that it has nothing to do with memcpy of ETH_MAC objects.

  1. Create an ITS image. I've been using the pdp10-kl simulator.
  2. Change the IMP's networking to use DHCP. I use tap on Linux and SLiRP on Windows. It doesn't seem to matter, other than that heap corruption is detected less often on Linux than on Windows. Your mileage may also vary if you're using a Linux TAP adapter other than tap0.

Windows:

set imp enabled mac=e2:6c:84:1d:34:a3 host=192.168.1.100
at imp nat:dhcp,dns=192.168.2.3,gateway=192.168.2.2,network=192.168.2.0/24,tcp=2023:192.168.2.15:23,tcp=2021:192.168.2.15:21,tcp=9595:192.168.2.15:95

Linux:

set imp enabled mac=e2:6c:84:1d:34:a3 host=192.168.1.100
at imp tap:tap0
  3. Start up ITS:
<ESC>L ITS<CR>
<ESC>G<CR>
  4. Wait for ITS to become active (KL ITS 1652 IN OPERATION AT ...), then SUPDUP to ITS from another terminal session.
  5. In the SUPDUP session, shut down ITS:
:lock<CR>
5downy<^C>
  6. Wait for SHUTDOWN COMPLETE in the simulator's terminal, then type ^\ to get back to the sim> prompt and quit the simulator. The Heisenbug occurs as the simulator exits.

pkoning2 (Member)

I think you're mistaking C for a strongly typed language. It isn't.

bscottm (Contributor Author) commented Mar 10, 2025

> I think you're mistaking C for a strongly typed language. It isn't.

Haskell, it ain't for sure. My contention is that it's a good idea to use the type system, such as it is, to catch subtle errors. This is one case where the type system helps, not hurts.

markpizz (Contributor)

>> Your PR does not prove that the problem has been solved, only that it seems to have not been detected in the same way as before.

> So, you're admitting there's a bug while claiming that I haven't actually found one?

There is some sort of bug which corrupts heap. The failure on exit proves that, but you haven't actually identified the specific bug.

The failure on exit is NOT the bug. The bug happened some time before that and the failure on exit is the demonstration that the bug had been encountered during this particular run.

Meanwhile, You've provided good instructions about how to see the failure. It starts with: "1. Create an ITS image. I've been using the pdp10-kl simulator." Since you've already got an ITS image that fails, will you make that available somewhere (Google Drive, FTP, github, etc.) so that hill doesn't need to be climbed to see the failure?

Thanks.

bscottm (Contributor Author) commented Mar 16, 2025

> The failure on exit is NOT the bug. The bug happened some time before that and the failure on exit is the demonstration that the bug had been encountered during this particular run.

<sarcasm>You don't say. I'd have never guessed. You mean that can happen with malloc()?</sarcasm>

> Meanwhile, You've provided good instructions about how to see the failure. It starts with: "1. Create an ITS image. I've been using the pdp10-kl simulator." Since you've already got an ITS image that fails, will you make that available somewhere (Google Drive, FTP, github, etc.) so that hill doesn't need to be climbed to see the failure?

It's easy enough to build it, if @larsbrinkhoff or @rcornwell haven't uploaded it already somewhere. Standard, stock ITS using the generated SIMH configuration file with aforementioned tweaks to the IMP network. If you build it from the PDP-10/its repo on GH, you also get the SIMH configuration file in the out/<sim>/run file, e.g., out/pdp10-kl/run.

markpizz (Contributor)

>> Meanwhile, You've provided good instructions about how to see the failure. It starts with: "1. Create an ITS image. I've been using the pdp10-kl simulator." Since you've already got an ITS image that fails, will you make that available somewhere (Google Drive, FTP, github, etc.) so that hill doesn't need to be climbed to see the failure?

> It's easy enough to build it, if @larsbrinkhoff or @rcornwell haven't uploaded it already somewhere. Standard, stock ITS using the generated SIMH configuration file with aforementioned tweaks to the IMP network. If you build it from the PDP-10/its repo on GH, you also get the SIMH configuration file in the out/<sim>/run file, e.g., out/pdp10-kl/run.

Well, technically it is easy enough to build, BUT a build seems to take some 2 hours in my working environment AND the build doesn't attempt to build the ENVIRONMENT-related pieces until the very end. These may fail depending on whether all the dependencies are installed. One which wasn't mentioned in the repo was openssl for cbridge. There were a couple of others, so it took some 6 hours to get through everything, since I started from scratch each time. Even when I didn't start from scratch, it still built all the contents again, which seemed to be 99% of the elapsed time. I've got the results now.

bscottm (Contributor Author) commented Mar 18, 2025

@markpizz: While I appreciate your minimalist approach to rectifying places where the Ethernet MAC address is embedded in a packet buffer (an unsigned char * allocated space), you will still have residual risk because the solution is incomplete. There can still be memcpy() calls where an unsigned char * pointer has been cast to ETH_MAC * and hasn't been dereferenced (which is what you'll need to do.)

That's one reason why I developed a more comprehensive solution. Ultimately, it makes the code safer because unsigned char * and ETH_MAC are interchangeable when passed as a parameter. Passing the address of a pointer, as happens when an unsigned char * is hoisted to ETH_MAC * with &, becomes a latent bug.

Let me underscore my point with an expanded version of the example code above:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint8_t ETH_MAC[6];

void foo(const char *label, ETH_MAC *mac, ETH_MAC *compare)
{
    printf("%-12s: mac = %0.16p *mac = %0.16p\n            **mac = %02x:%02x:%02x:%02x:%02x:%02x\n",
           label, (void *) mac, (void *) *mac, (*mac)[0], (*mac)[1], (*mac)[2], (*mac)[3], (*mac)[4], (*mac)[5]);
    printf("              cmp = %02x:%02x:%02x:%02x:%02x:%02x\n",
           (*compare)[0], (*compare)[1], (*compare)[2], (*compare)[3], (*compare)[4], (*compare)[5]);
    printf("              memcmp(mac, compare, sizeof(ETH_MAC)) = %d\n",
           memcmp(*mac, *compare, sizeof(ETH_MAC)));
}

int main(void)
{
    ETH_MAC mac = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xab };

    /* mac == &mac: The address of the first element is the
     * same as the address of the array. */
    foo("mac", (ETH_MAC *) mac, &mac);
    /* &mac == &mac */
    foo("&mac", &mac, &mac);

    uint8_t *mac2 = malloc(64 * sizeof(uint8_t));
    for (size_t i = 0; i < 63; ++i)
      mac2[i] = (uint8_t) (i + 127);

    puts("\nuint8_t pointer masquerading as ETH_MAC *:");
    foo("mac2", (ETH_MAC *) mac2, (ETH_MAC *) mac2);
    /* Fails. */
    foo("mac2.1", (ETH_MAC *) mac2, (ETH_MAC *) &mac2);
    /* Fails */
    foo("&mac2", (ETH_MAC *) &mac2, (ETH_MAC *) mac2);

    return 0;
}

Here's the output:

mac         : mac = 0x0000007fca658cc0 *mac = 0x0000007fca658cc0
            **mac = 01:23:45:67:89:ab
              cmp = 01:23:45:67:89:ab
              memcmp(mac, compare, sizeof(ETH_MAC)) = 0
&mac        : mac = 0x0000007fca658cc0 *mac = 0x0000007fca658cc0
            **mac = 01:23:45:67:89:ab
              cmp = 01:23:45:67:89:ab
              memcmp(mac, compare, sizeof(ETH_MAC)) = 0

uint8_t pointer masquerading as ETH_MAC *:
mac2        : mac = 0x00000055989806b0 *mac = 0x00000055989806b0
            **mac = 7f:80:81:82:83:84
              cmp = 7f:80:81:82:83:84
              memcmp(mac, compare, sizeof(ETH_MAC)) = 0
mac2.1      : mac = 0x00000055989806b0 *mac = 0x00000055989806b0
            **mac = 7f:80:81:82:83:84
              cmp = b0:06:98:98:55:00
              memcmp(mac, compare, sizeof(ETH_MAC)) = -1
&mac2       : mac = 0x0000007fca658cb8 *mac = 0x0000007fca658cb8
            **mac = b0:06:98:98:55:00
              cmp = 7f:80:81:82:83:84
              memcmp(mac, compare, sizeof(ETH_MAC)) = 1

In the first two mac cases, the address of the array (&mac) is the same as the array itself (mac), and it also works for the mac2 case with hoisting the unsigned char */uint8_t as an ETH_MAC *. The mac2.1 and &mac2 cases illustrate the latent bug where the pointer's address is being used as the MAC address.

I'm not as confident as you that this latent bug only exists in three places. Hence the more comprehensive solution that ultimately makes the code safer, where unsigned char * can be used safely and interchangeably with ETH_MAC.

bscottm (Contributor Author) commented Mar 18, 2025

Let's ask Copilot this question: Assume I have a C typedef, "typedef unsigned char ETH_MAC[6]". Why is it a bug to pass the array as a pointer, i.e., "ETH_MAC *", to functions?

Answer:
In C, using a typedef to define an array type, such as typedef unsigned char ETH_MAC[6];, creates a specific array type with a fixed size. When you pass an array to a function, it decays to a pointer to its first element. Therefore, passing an array as a pointer to a function can lead to potential issues and bugs. Here are some reasons why it is a bug to pass ETH_MAC * to functions:

Loss of Array Size Information: When you pass an array as a pointer, the size information of the array is lost. The function receiving the pointer has no way of knowing the actual size of the array. This can lead to out-of-bounds access and buffer overflows if the function assumes a different array size.

Pointer Arithmetic: When you pass ETH_MAC * to a function, it is treated as a pointer to an array of 6 unsigned char elements. However, pointer arithmetic on this type can be confusing and error-prone, as it may not behave as expected. For example, incrementing the pointer will move it by 6 bytes, not 1 byte.

Unexpected Behavior: If the function modifies the contents of the array through the pointer, it may lead to unexpected behavior if the caller is not aware of the modifications. This can result in hard-to-debug issues, especially if the array is used elsewhere in the program.

Type Mismatch: Passing ETH_MAC * to functions expecting a different pointer type can lead to type mismatches and potential undefined behavior. This can cause compilation errors or runtime issues.

To avoid these issues, it is better to pass the array directly or use a pointer with explicit size information.

markpizz (Contributor)

As I said before, your hundreds of lines of code changes don't actually identify a problem anywhere, just something you wouldn't have done the way it was originally written. You provide examples that cover your theory, but don't actually reflect anything that was done in the original code.

Meanwhile, I can reliably reproduce the failure you observe when closing the IMP device's Ethernet connection, both with the https://github.com/simh/simh code and with the https://github.com/bscottm/open-simh code in your PR. I'll provide an appropriate PR to fix this problem after I fully test the minimal change it involves.

bscottm (Contributor Author) commented Mar 18, 2025

> As I said before, your hundreds of lines of code changes don't actually identify a problem anywhere, just something you wouldn't have done the way it was originally written. You provide examples that cover your theory, but don't actually reflect anything that was done in the original code.

On the contrary, I identified a latent bug source that doesn't result in a catastrophic termination of SIMH. Latent bugs don't have to cause catastrophic termination; they are known to cause corruption and incorrect behavior. Instead of identifying individual instances of memcpy() where corruption or incorrect behavior occurs, I took the opportunity to eliminate the latent bug source.

Coincidentally, it eliminated a tiresome heap corruption message each time I exited the PDP10-KL simulator.

This isn't a coding style issue. It's a correctness issue -- ETH_MAC * is incorrect if you're going to mix arrays, pointers, and pointers to pointers cast to ETH_MAC *, as happens in the code. You will have to examine each call site to eliminate the residual risk of missing a memcpy() or memcmp() instance that is this particular latent bug source.

bscottm (Contributor Author) commented Mar 18, 2025

Ultimately, @pkoning2 and the open-simh steering group decide whether to accept this PR.


3 participants