Why can't code be uncompiled?

Squizzy@lemmy.world · 2 years ago

Why can't code be uncompiled?

Dark Arc · 2 years ago

I actually work on a C++ compiler… I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you’re not familiar with the domain.

When you compile a program you’re taking the source, turning it into a graph that represents every aspect of the program, and then generating some kind of intermediate representation (IR) that then gets turned into machine code.

Right off the bat, you lose things like code comments, because the machine doesn’t care about them.

Then you lose local variable and function parameter names because the machine doesn’t care about those things.

Then you lose your class structure … because the machine really just cares about the total size of the thing it’s passing around. You can recover some of this information by looking at the functions, but it’s not always going to be straightforward, because not every constructor initializes everything, things like unions add further complexity … and not every memory allocation uses a constructor. You won’t get any names of any data members/fields though because … again, the machine doesn’t care.
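To make the "size and offsets" point concrete, here is a minimal sketch. The type and function names (`Point`, `Range`, `load_second_field`) are made up for illustration, and the raw-byte read is only an approximation of what a compiled field access amounts to, not what any particular compiler emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Two unrelated types (hypothetical names) that happen to share a layout.
struct Point { std::int32_t x; std::int32_t y; };
struct Range { std::int32_t lo; std::int32_t hi; };

static_assert(sizeof(Point) == sizeof(Range), "same size");
static_assert(offsetof(Point, y) == offsetof(Range, hi), "same field offset");

// Roughly what a compiled access to `p.y` or `r.hi` boils down to:
// read 4 bytes at a fixed byte offset from the start of the object.
std::int32_t load_second_field(const void* object) {
    std::int32_t value;
    std::memcpy(&value,
                static_cast<const unsigned char*>(object) + offsetof(Point, y),
                sizeof value);
    return value;
}

// Small self-check: both types are read by the same untyped code.
void demo_layout() {
    Point p{3, 7};
    Range r{10, 20};
    assert(load_second_field(&p) == 7);
    assert(load_second_field(&r) == 20);
}
```

Someone staring at the machine code for `load_second_field` has no way to tell whether the original source said `y` or `hi`, or whether the object was a `Point` or a `Range`.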

So what you’re left with is basically the mangled names of functions and what you can derive from how instructions access memory.

The mangled names normally tell you a lot: the namespace, the class (if any), and the argument count and types. Of course that’s not guaranteed either; it’s just how we come up with unique, stable names for the various things in your program. It could function with a bunch of UUIDs if you set up a table on the compiler’s side to associate everything.
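As a rough illustration of how much a mangled name carries, the sketch below decodes one by hand-picked example. It assumes GCC or Clang, which follow the Itanium C++ ABI and ship `abi::__cxa_demangle` in `<cxxabi.h>`; MSVC uses a different scheme. The symbol belongs to a made-up function, `void shapes::Circle::scale(double, int)`, and the `demangle` helper is my own wrapper, not part of any library.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // GCC/Clang only (Itanium C++ ABI)
#include <string>

// Hypothetical helper: turn a mangled symbol back into a readable signature.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // the buffer is malloc-allocated by __cxa_demangle
    return result;
}

// Namespace, class, function name, and parameter types all survive.
// Parameter names and (for ordinary functions) the return type do not.
void demo_demangle() {
    assert(demangle("_ZN6shapes6Circle5scaleEdi")
           == "shapes::Circle::scale(double, int)");
}
```

Reading the symbol left to right: `6shapes`, `6Circle`, and `5scale` are length-prefixed names, and the trailing `d` and `i` stand for `double` and `int`.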

But wait! There’s more! The optimizer can do some really wild things in the name of speed… Including combining functions. Those constructors? Gone, now they’re just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it’s the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it’s all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it’s always 27, so the logic to compute it? Gone.
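Here is a small before/after sketch of two of those transformations. All function names are hypothetical, and the `_as_optimized` versions are hand-written approximations of what an optimizer might leave behind, not actual output from any compiler.

```cpp
#include <cassert>

// What you wrote: small helpers for readability.
constexpr int side_length() { return 3; }
constexpr int cube(int n) { return n * n * n; }
int volume() { return cube(side_length()); }

// The compiler can prove the result while compiling.
static_assert(cube(side_length()) == 27, "always 27");

// Roughly what may survive: the helpers are gone, only the constant is left.
int volume_as_optimized() { return 27; }

// Recursion as written...
unsigned sum_to(unsigned n) { return n == 0 ? 0 : n + sum_to(n - 1); }

// ...and the jump-based shape it may be turned into.
unsigned sum_to_as_optimized(unsigned n) {
    unsigned total = 0;
loop:
    if (n == 0) return total;
    total += n;
    --n;
    goto loop;
}
```

Both versions of each function return the same values, which is exactly the problem for a decompiler: it only ever sees something like the second form and has to guess at the first.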

Now, not all of that happens every time; those optimizations aren’t always possible, and they aren’t always worthwhile … But you can see how incredibly difficult it is to reconstruct a program once it’s been compiled and gone through optimization. Even if you do reconstruct it, there’s a very low chance that it will look anything like what you started with.

Treczoks@lemmy.world · 2 years ago

Just wait until you see the crazy optimizers for embedded systems. They take the complete code of a system into consideration and, over a number of compile passes, reuse code snippets from the app, libraries, and OS layer to create one big tangled mess that is hard to follow even if you have the source code…

noli@programming.dev · 2 years ago

Isn’t that still the same exact process as a normal compiler except in the case of embedded systems your OS is like a couple kilobytes large and just compiled along with the rest of your code?

As in, are those “crazy optimizations” not just standard compiler techniques, except applied to the entire OS+applications?

Treczoks@lemmy.world · 2 years ago

In a way, yes. But it really creates a mess when the linker starts sharing code between your code, for which you have sources, and system code, for which you don’t, so that execution jumps into the middle of code you can’t read. And it’s a pain in the whatever to debug.

noli@programming.dev · 2 years ago

Don’t you have the code in most cases? Like with FreeRTOS, for example? That’s fully open source.

morhp@lemmynsfw.com · 2 years ago

The main difference is that when you compile a program for Windows, Linux etc., you have an operating system and kernel with their exposed functions/interfaces so even in a compiled program it’s pretty easy to find the function calls for opening a file, moving a window, etc. (as long as the developer doesn’t add specific steps hiding these calls). But in an embedded system, it’s one large mess without any interfaces apart from those directly on the hardware level.