Linker trouble: musl libc and weak aliasing

By Macoy Madson. Published on .

My in-development linker was crashing in free. It allocated with one malloc implementation, but then freed with the free belonging to a different implementation. This caused a segmentation fault because that free expected a metadata structure that the other malloc never created (at least, not at the same size).

This took a lot of sleuthing. I knew it was a problem with my linker/loader, but couldn't step-debug it because my loader doesn't create a program image the way GDB expects.[1]

This article goes over how I found the issue. It might help you if you are encountering strange things with your work-in-progress linker (ha ha!), or if you like reading about someone debugging something.[2]

Finding the problem

The problem manifested in my attempt to statically link[3] the following simple Cakelisp program:

(add-c-search-directory-module "/home/macoy/musl/include")
(c-import "stdio.h" "stdlib.h")

(defun main (&return int)
  (fprintf stderr "Hello, C runtime!\n")
  (var data (* char) (type-cast (malloc (* (sizeof (type (* char))) 10))
                                (* char)))
  (fprintf stderr "Allocated and got %p!\n" data)
  (set (at 0 data) 0)
  (fprintf stderr "Accessed %p, it's now %d\n" data (at 0 data))
  (free data)
  (fprintf stderr "Freed!\n")
  (return 0))

This program first proved that musl libc was at least partially functional by successfully printing to stderr. However, the program segfaulted in free.

I used a combination of a signal handler and rudimentary stack printing via backtrace.h to find that the malloc I was calling did not match the free I called later.

I discovered this by noticing that I could successfully set the data returned by malloc without encountering a segmentation fault, so the memory was at least valid.
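For readers who want to try the same approach, here is a minimal sketch of a SIGSEGV handler that dumps a stack trace. It is not the loader's actual code: it assumes glibc's execinfo.h (backtrace and backtrace_symbols_fd) rather than the backtrace.h mentioned above, and the names on_segfault and install_segfault_handler are made up for illustration.

```c
#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Runs when SIGSEGV is delivered: dump the call stack to stderr, then exit. */
static void on_segfault(int signal_number)
{
    static const char message[] = "Caught SIGSEGV, backtrace:\n";
    void *frames[64];
    int num_frames;
    ssize_t ignored;

    (void)signal_number;
    ignored = write(STDERR_FILENO, message, sizeof(message) - 1);
    (void)ignored;

    /* Assumption: glibc. backtrace_symbols_fd writes straight to the file
     * descriptor and does not call malloc, which matters here because the
     * allocator is the thing under suspicion. */
    num_frames = backtrace(frames, 64);
    backtrace_symbols_fd(frames, num_frames, STDERR_FILENO);
    _exit(1);
}

/* Returns 0 on success and -1 on failure, like sigaction itself. */
int install_segfault_handler(void)
{
    struct sigaction action;

    memset(&action, 0, sizeof(action));
    action.sa_handler = on_segfault;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGSEGV, &action, NULL);
}
```

Call install_segfault_handler() once at startup. With a custom loader, the printed addresses will most likely still need to be matched by hand against wherever the loader placed each section, since glibc knows nothing about code it did not load.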

I then hacked together a damn simple "interactive debugger", which gets triggered when SIGSEGV is caught by my signal handler[4]:

;; Very minimal!
(defun-local interactive-debugger ()
  (fprintf stderr "Commands:\n
\tquit\n
\tprint-symbol [symbol-name]\n")
  (var print-symbol-tag-length (const int) (strlen "print-symbol"))
  (while 1
    (fprintf stderr "> ")
    (var input ([] 256 char) (array 0))
    ;; Note: We need to request stdin before running this in a signal handler!
    (fgets input (sizeof input) stdin)
    (cond
      ((or (= 0 (strcmp "quit\n" input))
           (= 0 (strcmp "q\n" input)))
       (break))
      ((= 0 (strncmp "print-symbol " input print-symbol-tag-length))
       (var symbol-name-buffer ([] 128 char) (array 0))
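       ;; Bounded copy: input (256) is larger than symbol-name-buffer (128)
       (strncpy symbol-name-buffer (+ input print-symbol-tag-length 1)
                (- (sizeof symbol-name-buffer) 1))
       ;; Strip the trailing newline, if there is one
       (set (at (strcspn symbol-name-buffer "\n") symbol-name-buffer) 0)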
       (fprintf stderr "Searching for '%s'\n" symbol-name-buffer)
       ;; This prints where it finds it for us
       (var symbol (* void) (find-symbol-address-in-allocated-sections
                             symbol-name-buffer))))))

The print-symbol command alerted me to the fact that malloc was resolving to the lite_malloc.c implementation, but free was resolving to the mallocng implementation.
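The useful part of a command like this is seeing which object file each definition comes from, and with what binding. Below is a rough sketch of that idea in C. It is not how find-symbol-address-in-allocated-sections works (that function's internals aren't shown here); the LoadedSymbol struct and print_definitions are hypothetical, and it assumes the loader keeps each ELF symbol alongside the name of the object it came from.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bookkeeping: one entry per symbol the loader has read. */
typedef struct {
    const char *object_name; /* e.g. "lite_malloc.lo" */
    const char *symbol_name; /* e.g. "__libc_malloc_impl" */
    Elf64_Sym symbol;        /* Raw entry from the object's symbol table */
} LoadedSymbol;

static const char *binding_name(unsigned char info)
{
    switch (ELF64_ST_BIND(info)) {
    case STB_LOCAL:  return "local";
    case STB_GLOBAL: return "global";
    case STB_WEAK:   return "weak";
    default:         return "other";
    }
}

/* Prints every definition of name and returns how many were found. */
int print_definitions(FILE *out, const LoadedSymbol *symbols, size_t count,
                      const char *name)
{
    int num_found = 0;
    size_t i;

    for (i = 0; i < count; ++i) {
        /* Undefined entries are references, not definitions. */
        if (symbols[i].symbol.st_shndx == SHN_UNDEF)
            continue;
        if (strcmp(symbols[i].symbol_name, name) != 0)
            continue;
        fprintf(out, "%s: %s definition in %s\n", name,
                binding_name(symbols[i].symbol.st_info),
                symbols[i].object_name);
        ++num_found;
    }
    return num_found;
}
```

Seeing more than one definition listed for the same name is the cue to check which one the linker actually picked.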

I then started looking at lite_malloc.c and untangling the mess.

Let's walk through the issue.

musl's malloc implementation

musl has a "lite" or simple malloc that is defined as a fallback when e.g. mallocng isn't included in your musl build.

It is defined like so:

static void *__simple_malloc(size_t n)
{
    // [Implementation omitted by article author]
}

weak_alias(__simple_malloc, __libc_malloc_impl);

void *__libc_malloc(size_t n)
{
    return __libc_malloc_impl(n);
}

static void *default_malloc(size_t n)
{
    return __libc_malloc_impl(n);
}

weak_alias(default_malloc, malloc);

After reading several pages on this relatively obscure feature, I came to understand what these weak_alias macros accomplish.

If the user defines their own malloc, that creates a strong definition, thereby overriding musl libc's malloc.

If the user does not define their own malloc, the linker resolves references to malloc to the weak alias, which points at default_malloc.

As far as I can deduce, this is what it looks like in the object file's symbol table:

~/Repositories/linker-loader $ nm --defined /home/macoy/Downloads/musl-1.2.3/obj/src/malloc/lite_malloc.lo
0000000000000000 b brk.2119
0000000000000000 D __bump_lockptr
0000000000000000 b cur.2120
0000000000000000 t default_malloc
0000000000000000 b end.2121
0000000000000000 T __libc_malloc
0000000000000000 W __libc_malloc_impl
0000000000000000 b lock
0000000000000000 W malloc
0000000000000000 b mmap_step.2122
0000000000000000 t __simple_malloc

The W denotes a weak symbol. Note that you would normally be able to reference default_malloc directly, i.e. the alias isn't an override, but in this case default_malloc is marked static, so it will not be exposed to the linker by its true name.
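The two-level aliasing can be tried in isolation. The sketch below copies the shape of lite_malloc.c with toy functions. The macro is, as far as I know, the same as musl's weak_alias definition; it relies on GCC/Clang attributes and an ELF target. The names simple_answer, answer_impl, default_answer, and answer are stand-ins invented for this example.

```c
/* Declares `new` as a weak symbol that shares the address of `old`. */
#define weak_alias(old, new) \
    extern __typeof(old) new __attribute__((__weak__, __alias__(#old)))

/* Stand-in for __simple_malloc: the fallback implementation. */
static int simple_answer(void)
{
    return 1;
}

/* Stand-in for __libc_malloc_impl: used only if no other object file
 * provides a strong (global) definition with the same name. */
weak_alias(simple_answer, answer_impl);

/* Stand-in for default_malloc: forwards to whichever impl won. */
static int default_answer(void)
{
    return answer_impl();
}

/* Stand-in for malloc: the user program can override this one. */
weak_alias(default_answer, answer);
```

Compiling this to an object file and running nm on it should show answer and answer_impl marked W, with the two static functions marked t, matching the pattern in the listing above. Linking in a second object file that defines answer_impl normally is what should make a conforming linker switch implementations.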

The other weak_alias on __simple_malloc is the one that broke my loader. This alias accomplishes a different goal. In case the user has not defined their own malloc, default_malloc will be called, which references __libc_malloc_impl. The weak alias on __simple_malloc says "If __libc_malloc_impl is not defined, then use __simple_malloc instead."

I believe the intent of this alias is to let whoever builds musl choose which malloc implementation musl uses internally and by default. This is corroborated by the configure --help prompt:

Optional packages:
  --with-malloc=...       choose malloc implementation [mallocng]

The problem with my linker

I end up finding a weak definition of malloc, which resolves to calling default_malloc. This is actually the right behavior because there is no strong definition of malloc, because I never override it in the user program.

However, __libc_malloc_impl resolves to the weak __simple_malloc, when it should instead resolve to the strong __libc_malloc_impl provided by musl libc's mallocng implementation:

~/Repositories/linker-loader $ nm --defined /home/macoy/Downloads/musl-1.2.3/obj/src/malloc/mallocng/malloc.lo
0000000000000000 t alloc_slot
0000000000000000 r debruijn32.3106
0000000000000000 t enframe
0000000000000000 t get_stride
0000000000000000 T __libc_malloc_impl
0000000000000000 T __malloc_alloc_meta
0000000000000000 T __malloc_allzerop
0000000000000000 T __malloc_atfork
0000000000000000 B __malloc_context
0000000000000004 C __malloc_lock
0000000000000000 R __malloc_size_classes
0000000000000000 r med_cnt_tab
0000000000000000 t queue
0000000000000000 t rdlock
0000000000000000 t size_to_class
0000000000000000 r small_cnt_tab
0000000000000000 t step_seq
0000000000000000 t wrlock

Here are the relevant lines adjacent to each other, for comparison:

# lite_malloc.lo:
0000000000000000 W __libc_malloc_impl

# malloc.lo:
0000000000000000 T __libc_malloc_impl

W denotes a weak symbol definition while T denotes a strong public/global symbol defined in the text section of the object file.

In the ELF specification (PDF), the issue becomes quite clear (emphasis mine):

When the link editor combines several relocatable object files, it does not allow multiple definitions of STB_GLOBAL symbols with the same name. On the other hand, if a defined global symbol exists, the appearance of a weak symbol with the same name will not cause an error. The link editor honors the global definition and ignores the weak ones. Similarly, if a common symbol exists (i.e., a symbol whose st_shndx field holds SHN_COMMON), the appearance of a weak symbol with the same name will not cause an error. The link editor honors the common definition and ignores the weak ones.

I was resolving __libc_malloc_impl to the weak definition in lite_malloc.lo instead of the strong definition in malloc.lo.
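The rule from the specification can be written down as a small decision function. This is only a sketch of the precedence described in the quote, not the code from my linker: the Resolution enum and resolve_definition are hypothetical names, it assumes local symbols have already been filtered out, and it ignores details such as common symbol sizes.

```c
#include <elf.h>

typedef enum {
    KEEP_EXISTING,        /* The definition already recorded wins */
    TAKE_CANDIDATE,       /* The newly seen definition replaces it */
    DUPLICATE_DEFINITION  /* Two strong definitions: a link error */
} Resolution;

/* existing: the entry currently recorded for this name.
 * candidate: an entry with the same name from another object file. */
Resolution resolve_definition(const Elf64_Sym *existing,
                              const Elf64_Sym *candidate)
{
    unsigned char existing_binding = ELF64_ST_BIND(existing->st_info);
    unsigned char candidate_binding = ELF64_ST_BIND(candidate->st_info);

    /* An undefined entry is only a reference; it never wins. */
    if (candidate->st_shndx == SHN_UNDEF)
        return KEEP_EXISTING;
    if (existing->st_shndx == SHN_UNDEF)
        return TAKE_CANDIDATE;

    /* A weak candidate never displaces anything already defined. */
    if (candidate_binding == STB_WEAK)
        return KEEP_EXISTING;
    /* A global (or common) candidate displaces a weak definition. */
    if (existing_binding == STB_WEAK)
        return TAKE_CANDIDATE;

    /* Both are global. Common symbols are tentative, so a real
     * definition beats them and two commons can coexist. */
    if (candidate->st_shndx == SHN_COMMON)
        return KEEP_EXISTING;
    if (existing->st_shndx == SHN_COMMON)
        return TAKE_CANDIDATE;

    return DUPLICATE_DEFINITION;
}
```

With a rule like this, the order in which lite_malloc.lo and malloc.lo are loaded stops mattering: the T entry for __libc_malloc_impl wins over the W entry either way.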

Later, when I try to free, I end up referencing the free I find in malloc/free.lo, which just calls __libc_free, which is only defined in malloc/mallocng/free.c. Had there been a corresponding __simple_free, I would never have realized that I was calling __simple_malloc.

Of course, I had the TODO item to implement this all properly before I began debugging:

TODO Symbol resolution needs to be addressed, especially once I load programs that override "weak" functions

I had put it off not knowing whether it would become an issue. It isn't a straightforward implementation, which is why I didn't do it immediately.

Now, after not doing it and seeing why it is important, I have gained a better understanding of how it is supposed to work.

Takeaways

If you do not implement the specification to a T[5], you may end up debugging tricky things like this without a debugger or a good idea of what's going wrong.

The upside is that if you persist, you can learn new tools for understanding the problem and investigating the data.

If you're interested in other linker adventures, read about my "linker-loader" project, which talks about why I even bother with all this work.

You can also read the much simpler Know What Your Linker Knows article, where I explain how objdump can be useful when debugging link errors. I wrote that article around the time I started learning more about linkers, which are something you don't really have to think too hard about during regular program development.


  1. It does not meet the assumptions of GDB's add-symbol-file command, so I couldn't use GDB even if I manually added objects one by one at specified offsets in memory.

  2. There must be like, a dozen people in the world who would meet those criteria, right? Right?!

  3. I wanted to statically link to musl libc because dynamic linking to e.g. glibc seemed much more complicated, especially because I did not have any dynamic linking support yet in my linker/loader.

  4. Yes, I am aware I am calling functions which aren't safe to call in signal handlers. In my case I am not expecting to "ship" this debugger. It's only a means to an end, so it ended up being fine.

  5. Why hadn't I? Well, I find that implementing things piece by piece and testing as I go results in higher success rates than all-or-nothing pushes. Sometimes it bites you when the missing pieces are essential to the next test. Also, sometimes you're not really sure how to make things until you're halfway down the road making them, because the project is unique/experimental and/or you don't fully understand the purpose of the specification.