feat: Support -z pack-relative-relocs #1701

Draft
mati865 wants to merge 15 commits into wild-linker:main from mati865:push-ruqzpzktuvvm

Conversation

@mati865
Member

@mati865 mati865 commented Mar 17, 2026

It works now but doesn't pack RELR entries via bitmaps, so the size reduction is not that big.

Doesn't work yet:

./a.out: error while loading shared libraries: ./a.out: DT_RELR without GLIBC_ABI_DT_RELR dependency

I haven't yet figured out how to cleanly synthesise the GLIBC_ABI_DT_RELR version for the __libc_start_main symbol. I'd prefer to avoid matching that symbol by name, but there might be no other choice.

@mati865
Member Author

mati865 commented Mar 17, 2026

Ah, I had misunderstood that part. The version just has to be declared, not assigned to any symbol.

@mati865
Member Author

mati865 commented Mar 17, 2026

It works now but doesn't support the major selling point of RELR, which is compacting the addresses via bitmaps. I'm not sure how to approach that, since we need the entries sorted before they are written for it to work (use a temporary buffer?). Currently, I'm working around the problem by sorting the entries after they are already written.
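
For context, the compaction being discussed is the standard DT_RELR scheme: an even entry is a literal address, and an odd entry is a bitmap word whose bits 1..=63 cover the 63 words following the previous run. A minimal sketch of packing a sorted address list, with made-up names and nothing taken from this PR's code:

```rust
/// Pack a sorted, deduplicated list of 8-byte-aligned relocation addresses
/// into 64-bit RELR entries. An even entry is a literal address; an odd
/// entry is a bitmap whose bit i (i >= 1) marks `where + (i - 1) * 8`.
/// (Hypothetical helper, not this PR's implementation.)
fn pack_relr(sorted: &[u64]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < sorted.len() {
        // Emit a literal address entry for the first address of a run.
        let base = sorted[i];
        out.push(base);
        i += 1;
        // `where_` is the address covered by bit 1 of the next bitmap word.
        let mut where_ = base + 8;
        loop {
            let mut bitmap: u64 = 0;
            while i < sorted.len() {
                let delta = sorted[i] - where_;
                if delta % 8 != 0 || delta / 8 >= 63 {
                    break; // not representable in this bitmap word
                }
                bitmap |= 1u64 << (delta / 8 + 1);
                i += 1;
            }
            if bitmap == 0 {
                break; // next address starts a new run
            }
            out.push(bitmap | 1); // low bit tags this as a bitmap entry
            where_ += 63 * 8; // each bitmap word advances the window by 63 words
        }
    }
    out
}
```

The encoder only needs a single forward pass, but it does need the addresses already sorted, which is why the sorted-before-write requirement matters.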

Clang release builds (without debuginfo), linked without and with -z pack-relative-relocs:

❯ ls bin*
.rwxr-xr-x 248M mateusz 17 mar 17:37  bin.default-ld
.rwxr-xr-x 248M mateusz 17 mar 17:37  bin.default-wild
.rwxr-xr-x 242M mateusz 17 mar 17:38  bin.pack-ld
.rwxr-xr-x 244M mateusz 17 mar 17:45  bin.pack-wild

Even without the compaction this is a small win in size.

This is what compacted vs non-compacted entries look like:

❯ readelf -Wr bin.pack-ld | rg relr -A 5
Relocation section '.relr.dyn' at offset 0x71f100 contains 7264 entries which relocate 273137 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000c8a8940 000000000c8a8940  __frame_dummy_init_array_entry
0001:  ffffffffffffffff 000000000c8a8948  __frame_dummy_init_array_entry + 0x8
                        000000000c8a8950  __frame_dummy_init_array_entry + 0x10
                        000000000c8a8958  __frame_dummy_init_array_entry + 0x18

❯ readelf -Wr bin.pack-wild | rg relr -A 5
Relocation section '.relr.dyn' at offset 0x70f868 contains 269508 entries which relocate 269508 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000ca36288 000000000ca36288  __frame_dummy_init_array_entry
0001:  000000000ca36290 000000000ca36290  __frame_dummy_init_array_entry + 0x8
0002:  000000000ca36298 000000000ca36298  __frame_dummy_init_array_entry + 0x10
0003:  000000000ca362a0 000000000ca362a0  __frame_dummy_init_array_entry + 0x18
The performance impact is not bad considering the sort workaround:
❯ OUT=/tmp/bin powerprofilesctl launch -p performance hyperfine -w 5 './run-with ~/Projects/wild/target/release/wild' './run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs' './run-with ld.bfd' './run-with ld.bfd -z pack-relative-relocs' './run-with ~/Projects/wild/target/debug/wild' './run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs'
Benchmark 1: ./run-with ~/Projects/wild/target/release/wild
  Time (mean ± σ):      55.3 ms ±   1.8 ms    [User: 1.0 ms, System: 1.3 ms]
  Range (min … max):    52.4 ms …  59.9 ms    51 runs

Benchmark 2: ./run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs
  Time (mean ± σ):      56.3 ms ±   1.1 ms    [User: 1.3 ms, System: 1.0 ms]
  Range (min … max):    54.4 ms …  60.7 ms    52 runs

Benchmark 3: ./run-with ld.bfd
  Time (mean ± σ):      1.853 s ±  0.052 s    [User: 1.391 s, System: 0.454 s]
  Range (min … max):    1.730 s …  1.925 s    10 runs

Benchmark 4: ./run-with ld.bfd -z pack-relative-relocs
  Time (mean ± σ):      1.987 s ±  0.046 s    [User: 1.541 s, System: 0.440 s]
  Range (min … max):    1.882 s …  2.062 s    10 runs

Benchmark 5: ./run-with ~/Projects/wild/target/debug/wild
  Time (mean ± σ):     320.7 ms ±   3.2 ms    [User: 1.7 ms, System: 1.1 ms]
  Range (min … max):   314.1 ms … 326.7 ms    10 runs

Benchmark 6: ./run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs
  Time (mean ± σ):     424.5 ms ±   3.7 ms    [User: 1.3 ms, System: 1.4 ms]
  Range (min … max):   418.2 ms … 430.8 ms    10 runs

Summary
  ./run-with ~/Projects/wild/target/release/wild ran
    1.02 ± 0.04 times faster than ./run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs
    5.80 ± 0.20 times faster than ./run-with ~/Projects/wild/target/debug/wild
    7.67 ± 0.26 times faster than ./run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs
   33.48 ± 1.44 times faster than ./run-with ld.bfd
   35.90 ± 1.43 times faster than ./run-with ld.bfd -z pack-relative-relocs

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch 6 times, most recently from 5085561 to 5e53268 on March 19, 2026 22:41
Member

@davidlattimore davidlattimore left a comment


Given this is marked as a draft, I assume there's still stuff that you want to address, so I haven't reviewed thoroughly yet. I did skim over it though.

```diff
 let mut verneed_info = state.verneed_info;

-if let Some(v) = state.verneed_info.as_ref()
+if let Some(v) = &mut verneed_info
```

It looks like this mut perhaps isn't needed.

Member Author

@mati865 mati865 Mar 20, 2026


Good catch, this is a leftover from one of my previous attempts at making the glibc special-symbol logic not feel like a plaster on a broken limb. I'm fairly satisfied with this version, but as you noticed it still needs a bit of clean-up.

Also, too bad none of the linters caught it.

@mati865
Member Author

mati865 commented Mar 20, 2026

Thanks, the status from #1701 (comment) is still up to date. Not much progress so far. At least I no longer hate the way this PR implements the glibc special symbol version.

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch from 5e53268 to 00d2009 on March 20, 2026 18:49
@mati865
Member Author

mati865 commented Mar 21, 2026

Looking at readelf --got-contents and the relr addresses (without sorting enabled) from a linked Clang binary got me thinking. It looks like this:

Relocation section '.relr.dyn' at offset 0x70f868 contains 269508 entries which relocate 269508 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000cdbd4f0 000000000cdbd4f0  __dso_handle
0001:  000000000ca37900 000000000ca37900  __do_global_dtors_aux_fini_array_entry
0002:  000000000ca36288 000000000ca36288  __frame_dummy_init_array_entry
0003:  000000000ca36290 000000000ca36290  __frame_dummy_init_arra[...] + 0x8
0004:  000000000cc15a08 000000000cc15a08  _ZTVSt23_Sp_counted_ptr[...] + 0x10
0005:  000000000cc15a10 000000000cc15a10  _ZTVSt23_Sp_counted_ptr[...] + 0x18
0006:  000000000cc15a18 000000000cc15a18  _ZTVSt23_Sp_counted_ptr[...] + 0x20
0007:  000000000cc15a20 000000000cc15a20  _ZTVSt23_Sp_counted_ptr[...] + 0x28
0008:  000000000cc15a28 000000000cc15a28  _ZTVSt23_Sp_counted_ptr[...] + 0x30
0009:  000000000cc15a40 000000000cc15a40  _ZTVSt23_Sp_counted_ptr[...] + 0x10
0010:  000000000cc15a48 000000000cc15a48  _ZTVSt23_Sp_counted_ptr[...] + 0x18
0011:  000000000cc15a50 000000000cc15a50  _ZTVSt23_Sp_counted_ptr[...] + 0x20
0012:  000000000cc15a58 000000000cc15a58  _ZTVSt23_Sp_counted_ptr[...] + 0x28
0013:  000000000cc15a60 000000000cc15a60  _ZTVSt23_Sp_counted_ptr[...] + 0x30
0014:  000000000cda60b0 000000000cda60b0  _GLOBAL_OFFSET_TABLE_
0015:  000000000cc15a78 000000000cc15a78  _ZTVN12_GLOBAL__N_128AA[...] + 0x10
0016:  000000000cc15a80 000000000cc15a80  _ZTVN12_GLOBAL__N_128AA[...] + 0x18
0017:  000000000cc15a88 000000000cc15a88  _ZTVN12_GLOBAL__N_128AA[...] + 0x20
...

But if I map the addresses to their sections, we get:

0000:  000000000cdbd4f0 000000000cdbd4f0  .data
0001:  000000000ca37900 000000000ca37900  .fini_array
0002:  000000000ca36288 000000000ca36288  .init_array
0003:  000000000ca36290 000000000ca36290  .init_array
0004:  000000000cc15a08 000000000cc15a08  .data.rel.ro
0005:  000000000cc15a10 000000000cc15a10  .data.rel.ro
0006:  000000000cc15a18 000000000cc15a18  .data.rel.ro
0007:  000000000cc15a20 000000000cc15a20  .data.rel.ro
0008:  000000000cc15a28 000000000cc15a28  .data.rel.ro
0009:  000000000cc15a40 000000000cc15a40  .data.rel.ro
0010:  000000000cc15a48 000000000cc15a48  .data.rel.ro
0011:  000000000cc15a50 000000000cc15a50  .data.rel.ro
0012:  000000000cc15a58 000000000cc15a58  .data.rel.ro
0013:  000000000cc15a60 000000000cc15a60  .data.rel.ro
0014:  000000000cda60b0 000000000cda60b0  .got
0015:  000000000cc15a78 000000000cc15a78  .data.rel.ro
0016:  000000000cc15a80 000000000cc15a80  .data.rel.ro
0017:  000000000cc15a88 000000000cc15a88  .data.rel.ro

The relocations are not as unordered as I previously thought. Currently, sections are written at the first available slot, but relocations within each section are already ordered (not verified, but it seems plausible).
So, if I could apply the layout to the .relr.dyn section by offsetting the relocations, rather than writing them at the first available slot, everything would naturally land in the perfect order.
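
A sketch of that idea with hypothetical names (not this PR's code): during layout, give each section a fixed range inside .relr.dyn via a prefix sum over per-section reloc counts, so writing each relocation at its section's base plus a running index yields address-sorted output with no post-pass.

```rust
/// A section's reserved range inside the .relr.dyn buffer.
/// (Illustrative only; wild's real structures differ.)
struct SectionRelocSlot {
    offset: usize, // starting index inside .relr.dyn, assigned at layout time
    count: usize,  // how many relative relocs this section contributes
}

/// Assign slots by prefix sum. Assumes `counts` is iterated in output-address
/// order, so the resulting ranges come out sorted by address automatically.
fn assign_relr_slots(counts: &[usize]) -> Vec<SectionRelocSlot> {
    let mut slots = Vec::with_capacity(counts.len());
    let mut offset = 0;
    for &count in counts {
        slots.push(SectionRelocSlot { offset, count });
        offset += count;
    }
    slots
}
```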

The GOTs won't be a problem either (at least in Clang's case). Even though there are 2002 entries with a relative reloc:

❯ readelf -W --got-contents bin | rg RELATIVE | wc -l
2002

They are sequential:

❯ readelf -W --got-contents bin | rg -ow 'R_.*?\s' | uniq
R_X86_64_RELATIVE
R_X86_64_GLOB_DAT
R_X86_64_TPOFF64
R_X86_64_GLOB_DAT
R_X86_64_TPOFF64
R_X86_64_GLOB_DAT

Other relocations should be irrelevant here.


With some creativity this solution could be extended to avoid over-allocation as well. If we store the first and last addresses of each written relocation chunk for each section somewhere, we can increment the sizes by the chunk size instead of the relocation count.

Maybe it will be clearer this way:

```
Id   Reloc address
00: 0000

# 16 bytes apart, cannot be packed with 00
01: 0010
02: 0018
03: 0020
# 3 subsequent consecutive addresses, emit 01 as the real reloc and move 02 and 03 into bitmap

04: 00d0
```
In that example we would end up with 3 relocation entries (for 00, 01, and 04) rather than 5. Sounds almost too good to be true.
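
For reference, this is roughly the expansion loop glibc's loader runs over DT_RELR (sketched from the documented format; 64-bit entries assumed, and the names are mine):

```rust
/// Expand RELR entries back into relocation addresses, mirroring the loop
/// the dynamic loader runs over DT_RELR (64-bit entries assumed).
fn unpack_relr(entries: &[u64]) -> Vec<u64> {
    let mut addrs = Vec::new();
    let mut where_ = 0u64; // address covered by bit 1 of the next bitmap word
    for &entry in entries {
        if entry & 1 == 0 {
            // Even entry: a literal address to relocate.
            addrs.push(entry);
            where_ = entry + 8;
        } else {
            // Odd entry: bitmap; bit i (i >= 1) marks where_ + (i - 1) * 8.
            let mut bits = entry >> 1;
            let mut i = 0u64;
            while bits != 0 {
                if bits & 1 != 0 {
                    addrs.push(where_ + i * 8);
                }
                bits >>= 1;
                i += 1;
            }
            where_ += 63 * 8; // each bitmap word advances the window by 63 words
        }
    }
    addrs
}
```

One caveat, if I read the format right: bitmap bits can reach up to 63 words past the base, so addresses in a bitmap word don't have to be consecutive, and even the 0x00d0 entry above could share the first bitmap word.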

@mati865
Member Author

mati865 commented Mar 24, 2026

There is something I didn't account for: buckets and groups.
I still cannot figure out how the buffer splitting works with regard to threads and groups (input files). This is the best I could come up with so far: eac18b6

It only works for simple cases: a single thread (--threads=1) and a low number of inputs. For example, running:

❯ wild/tests/build/pack-relative-relocs.c/default-host/pack-relative-relocs.c.save/run-with cargo r --bin wild -q -- --threads=1

wild on  HEAD (eac18b6) is 📦 v0.8.0 via 🦀 v1.94.0
❯ readelf -Wr wild/tests/build/pack-relative-relocs.c/default-host/pack-relative-relocs.c.save/bin | rg -A99 relr
Relocation section '.relr.dyn' at offset 0xd60 contains 10 entries which relocate 10 locations:
Index: Entry            Address           Symbolic Address
0000:  0000000000003120 0000000000003120  __frame_dummy_init_array_entry
0001:  0000000000003128 0000000000003128  __frame_dummy_init_array_entry + 0x8
0002:  0000000000003130 0000000000003130  __frame_dummy_init_array_entry + 0x10
0003:  0000000000003138 0000000000003138  __do_global_dtors_aux_fini_array_entry
0004:  0000000000003140 0000000000003140  __do_global_dtors_aux_fini_array_entry + 0x8
0005:  0000000000003148 0000000000003148  __do_global_dtors_aux_fini_array_entry + 0x10
0006:  0000000000003150 0000000000003150  ptrs_b
0007:  0000000000003158 0000000000003158  ptrs_b + 0x8
0008:  0000000000003160 0000000000003160  ptrs_b + 0x10
0009:  0000000000004418 0000000000004418  __dso_handle

This shows the addresses are in order, but adding threads or inputs makes the offsets fall apart.

EDIT: Regarding the fields in the common structs, some of those were ELF-specific when I started, and I didn't bother cleaning up code that doesn't even work while rebasing. I'll move them if this ever starts working correctly.

@davidlattimore
Member

The closest equivalent I can think of in the linker is how dynamic symbol definitions are handled. During the GC phase, objects collect up the dynamic symbols that they need to define, storing them in CommonGroupState::dynamic_symbol_definitions. Afterwards, merge_dynamic_symbol_definitions combines them together and they're supplied to the epilogue, which is responsible for sorting them and then writing them. Perhaps you can do something similar here?
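
A rough sketch of that pattern with made-up names (CommonGroupState::dynamic_symbol_definitions and merge_dynamic_symbol_definitions are the real reference points; everything below is illustrative): each group collects the relative relocation addresses it produces, and the epilogue merges and sorts them once before writing.

```rust
/// Per-group state accumulated during the GC/resolution phase.
/// (Hypothetical; stands in for a field on wild's real group state.)
struct GroupState {
    relative_reloc_addresses: Vec<u64>,
}

/// Merge every group's addresses and sort once, so the writer (and any
/// later bitmap packing) sees monotonically increasing addresses.
fn merge_relative_relocs(groups: &[GroupState]) -> Vec<u64> {
    let mut all: Vec<u64> = groups
        .iter()
        .flat_map(|g| g.relative_reloc_addresses.iter().copied())
        .collect();
    all.sort_unstable();
    all
}
```

The sort happens in one place regardless of thread count or input ordering, which is what sidesteps the buckets-and-groups problem above.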
