7 - Caches, Optimization & Linking — Study Guide
A comprehensive guide to understanding cache memories, compiler optimization, and the linking process — everything you need for the final exam.
Introduction: Bridging the Speed Gap
Modern computers have a fundamental problem: CPUs are incredibly fast, but memory is slow. This creates a bottleneck that can dramatically affect program performance. The solution is a memory hierarchy with fast, small, expensive storage at the top and slow, large, cheap storage at the bottom.
Key insight: Programs exhibit locality — they tend to access the same data repeatedly and access nearby data. This property makes caching effective. Understanding caches helps you write faster code.
The Memory Hierarchy {#memory-hierarchy}
Why Memory Hierarchies Work
The key properties that make memory hierarchies effective:
- Fast storage costs more per byte, has less capacity, requires more power
- The gap between CPU and main memory speed is widening
- Well-written programs tend to exhibit good locality
These properties complement each other beautifully.
Levels of the Memory Hierarchy
Level Technology Typical Size Access Time
─────────────────────────────────────────────────────────────
L0: Registers SRAM < 1 KB 0 cycles
L1 Cache SRAM 32-64 KB 4 cycles
L2 Cache SRAM 256 KB-8 MB 10-20 cycles
L3 Cache SRAM 4-64 MB 30-50 cycles
Main Memory DRAM 8-64 GB 100-300 cycles
Disk Magnetic/SSD TB scale ms scale
The big idea: The memory hierarchy creates a large pool of storage that costs as much as cheap storage, but serves data at the rate of fast storage.
Example: Intel Core i7 Cache Hierarchy
Core 0 Core 3
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Regs │ │L1 i-cache│ │Regs │ │L1 i-cache│
│ │ │L1 d-cache│ │ │ │L1 d-cache│
│ │ │32KB 8-way│ │ │ │32KB 8-way│
└─────────┘ └─────────┘ └─────────┘ └─────────┘
↓ ↓
┌─────────┐ ┌─────────┐
│L2 Cache │ │L2 Cache │
│256KB │ │256KB │
│8-way │ │8-way │
└─────────┘ └─────────┘
↓
┌─────────────────┐
│L3 Cache (shared)│
│8 MB, 16-way │
│40-75 cycles │
└─────────────────┘
↓
┌─────────────────┐
│ Main Memory │
└─────────────────┘
Cache block size: 64 bytes for all cache levels
Locality of Reference {#locality}
Principle of Locality
Programs tend to use data and instructions with addresses near or equal to those they have used recently.
Two Types of Locality
Temporal Locality:
- Recently referenced items are likely to be referenced again soon
- Example: Loop counter used every iteration, function variable accessed repeatedly
Spatial Locality:
- Items with nearby addresses tend to be referenced close together in time
- Example: Sequential access through an array, instructions in a loop
Locality Example
sum = 0;
for (i = 0; i < n; i++)
sum += a[i]; // a[i] accessed sequentially (spatial)
return sum; // sum accessed every iteration (temporal)
Data references:
a[i]: stride-1 access (good spatial locality)sum: accessed repeatedly (good temporal locality)
Instruction references:
- Instructions in sequence (spatial)
- Loop repeats same instructions (temporal)
Good vs Bad Locality
Good locality (row-wise access):
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
sum += a[i][j]; // Consecutive elements
Access pattern: a[0][0], a[0][1], a[0][2]... a[1][0], a[1][1]...
- All consecutive in memory
- One cache miss per block load
Bad locality (column-wise access):
for (j = 0; j < N; j++)
for (i = 0; i < M; i++)
sum += a[i][j]; // Elements spaced N apart
Access pattern: a[0][0], a[1][0], a[2][0]... a[0][1], a[1][1]...
- Stride = N elements
- Multiple cache misses per block
Cache Basics {#cache-basics}
What is a Cache?
A cache is a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Why caches work: Because of locality, programs tend to access data at level k more often than data at level k+1.
Cache Terminology
- Hit: Requested data is in the cache
- Miss: Requested data is not in cache
- Hit Rate: Fraction of accesses that hit (1 - miss rate)
- Hit Time: Time to deliver data from cache to processor
- Miss Penalty: Additional time when a miss occurs
Cache Performance Impact
Hit time: ~4 cycles (L1)
Miss penalty: ~100-200 cycles (to main memory)
A miss can be 50x slower than a hit!
Types of Cache Misses
Cold (Compulsory) Miss:
- Cache is empty
- First access to any block causes this
Conflict Miss:
- Multiple blocks map to same cache location
- E.g., accessing blocks 0, 8, 0, 8 repeatedly misses every time
Capacity Miss:
- Working set larger than cache
- Even with good locality, cache can't hold everything
Cache Organization {#cache-organization}
Block Size (B)
Unit of transfer between cache and memory. Always a power of 2 (e.g., 64 bytes).
Block contains consecutive bytes:
Address: 0x00 0x01 0x02 ... 0x3F (64 bytes)
Block 0: 0x40 ... 0x7F
Block 1: 0x80 ... 0xBF
Block 2:
Address Breakdown
For an m-bit address with block size B = 2^b bytes:
┌────────────────────────────────────────────────┐
│ Tag (t bits) │ Index (s bits) │ Offset (b bits) │
└────────────────────────────────────────────────┘
m - s - b log2(B)
- Offset: which byte within the block (b bits)
- Index: which set in the cache (s bits)
- Tag: identifies the block among those that map to same set
Cache Parameters
- S = 2^s sets
- E lines per set (associativity)
- B = 2^b bytes per block
- Cache size C = S × E × B
Direct-Mapped Cache (E = 1)
Each memory block maps to exactly one cache location.
Memory Cache
Block 0 ──────────────┐ Set 0 ──┐
Block 1 ──────────────┼───> Set 1 ──┤
Block 2 ──────────────┤ Set 2 ──┤
Block 3 ──────────────┘ Set 3 ──┘
(block % 4 = index)
Set Associative Cache (E > 1)
Each set can hold E blocks. More associativity = fewer conflict misses.
2-way associative:
Set 0: [Block] [Block] ← Two blocks can occupy set 0
Set 1: [Block] [Block]
Set 2: [Block] [Block]
Set 3: [Block] [Block]
4-way associative:
Set 0: [B][B][B][B] ← Four blocks can occupy set 0
Set 1: [B][B][B][B]
Fully Associative Cache (E = C/B)
Any block can go anywhere. Only practical for small, special-purpose caches (like TLB).
Associativity Trade-offs
| Type | Speed | Power | Conflict Misses |
|---|---|---|---|
| Direct-mapped | Fastest | Lowest | High |
| 2-way | Moderate | Moderate | Lower |
| 4-way | Slower | Higher | Even lower |
| 8-way | Slowest | Highest | Lowest |
Higher associativity reduces conflicts but requires more hardware for tag comparison.
Cache Read Operation {#cache-read}
Step-by-Step Cache Lookup
- Extract index from address → identifies which set to check
- Check all lines in that set → compare tags
- If tag matches AND valid bit is set → HIT
- Extract data from offset within block
Address: 0xBA = 10111010₂
With B=4, C=32 bytes:
- Offset bits (b) = 2
- Index bits (s) = 3
- Tag bits (t) = 3
Breakdown: 101 | 110 | 10
Tag Index Offset
Tag = 0x5, Index = 0x6, Offset = 0x2
Cache Line Structure
Each cache line contains:
- Valid bit: Is the data meaningful?
- Tag: Identifies which memory block
- Data: The actual bytes from memory
┌──────┬─────┬────────────────────────────────┐
│ Valid│ Tag │ Data (B bytes) │
└──────┴─────┴────────────────────────────────┘
1 bit t bits B bytes
Cache Write Operations {#cache-write}
Write-Hit Policies
Write-through:
- Write to cache AND main memory immediately
- Simple, always consistent
- Slower (memory write on every store)
Write-back:
- Write to cache only
- Set "dirty" bit to indicate modification
- Write to memory only when block is evicted
- Faster but more complex
Write-Miss Policies
Write-allocate:
- Load block into cache on write miss
- Good if more writes to location follow
- Common with write-back
No-write-allocate:
- Write directly to memory, don't load into cache
- Good if writes won't be followed by reads
- Common with write-through
Typical Combinations
- Write-through + No-write-allocate: Simple, used for I/O
- Write-back + Write-allocate: Most common for CPU caches
Writing Cache-Friendly Code {#cache-friendly-code}
Make the Common Case Go Fast
Focus on the inner loops of core functions where:
- Bulk of computations occur
- Bulk of memory accesses occur
Principles
- Maximize temporal locality: Use data repeatedly once loaded
- Maximize spatial locality: Read data with stride-1 pattern
Example: Matrix Multiplication
Naive (ijk) ordering:
for (i = 0; i < n; i++)
for (j = 0; j < n; j++) {
sum = 0;
for (k = 0; k < n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
Misses per iteration: A=0.25, B=1.0, C=0.0 (total 1.25 per iteration)
Better (kij) ordering:
for (k = 0; k < n; k++)
for (i = 0; i < n; i++) {
r = a[i][k];
for (j = 0; j < n; j++)
c[i][j] += r * b[k][j];
}
Misses per iteration: A=0.0, B=0.25, C=0.25 (total 0.5 per iteration)
Blocking for Cache
Divide matrices into blocks that fit in cache:
for (i = 0; i < n; i += B)
for (j = 0; j < n; j += B)
for (k = 0; k < n; k += B)
// B×B mini matrix multiply
for (i1 = i; i1 < i+B; i1++)
for (j1 = j; j1 < j+B; j1++)
for (k1 = k; k1 < k+B; k1++)
c[i1][j1] += a[i1][k1] * b[k1][j1];
Total misses: n³/(4B) vs (9/8)n³ for naive
The Memory Mountain
A 3D plot of read throughput:
- X-axis: Array size (working set)
- Y-axis: Stride
- Z-axis: Read bandwidth (MB/s)
Ridges show temporal locality (constant stride, varying size) Slopes show spatial locality (constant size, varying stride)
Compiler Optimization {#compiler-optimization}
GCC Optimization Levels
gcc -O0 # No optimization (debugging)
gcc -Og # Like -O0 but better debug info
gcc -O2 # Enable most optimizations
gcc -O3 # More aggressive, may increase size
gcc -Os # Optimize for size
Constant Folding
Pre-calculates constants at compile time:
int seconds = 60 * 60 * 24 * n_days;
// Compiler calculates 86400 at compile time
// if n_days is a constant or becomes one
Common Subexpression Elimination
Avoids recalculating the same expression:
// Before:
int a = (param2 + 0x201);
int b = param1 * (param2 + 0x201) + a;
return a * (param2 + 0x201) + b * (param2 + 0x201);
// After:
int temp = param2 + 0x201;
int a = temp;
int b = param1 * temp + a;
return a * temp + b * temp;
Dead Code Elimination
Removes code that doesn't affect results:
// Before:
if (param1 < param2 && param1 > param2) // Always false!
printf("Never happens\n");
for (int i = 0; i < 1000; i++); // Empty loop
// After: (entire block removed)
Strength Reduction
Replaces expensive operations with cheaper ones:
// Multiply to shift/add:
int a = param2 * 32; // → param2 << 5
int b = a * 7; // → (param2 << 5) * 7
// Division to multiply (unsigned):
unsigned div19(unsigned arg) {
return arg / 19;
}
// Becomes: multiply by magic constant, then shift
Code Motion
Move invariant code outside loops:
// Before:
for (int i = 0; i < n; i++)
sum += arr[i] + foo * (bar + 3);
// After (if foo, bar are constant):
int temp = foo * (bar + 3);
for (int i = 0; i < n; i++)
sum += arr[i] + temp;
Loop Unrolling
Do multiple iterations per loop body:
// Before:
for (int i = 0; i < n; i++)
sum += arr[i];
// After (4x unrolling):
for (int i = 0; i < n - 3; i += 4) {
sum += arr[i];
sum += arr[i+1];
sum += arr[i+2];
sum += arr[i+3];
}
// Handle leftovers
Tail Recursion Optimization
Converts tail-recursive functions to iterative:
// Before (recursive):
long factorial(int n) {
if (n <= 1) return 1;
else return n * factorial(n - 1);
}
// After (iterative):
long factorial(int n) {
long result = 1;
for (int i = 2; i <= n; i++)
result *= i;
return result;
}
Limitations of Compiler Optimization
The compiler can't fix everything:
int char_sum(char *s) {
int sum = 0;
for (size_t i = 0; i < strlen(s); i++) // strlen called every iteration!
sum += s[i];
return sum;
}
The compiler can't pull strlen() out because s changes (potentially).
You know more than the compiler:
void lower1(char *s) {
for (size_t i = 0; i < strlen(s); i++) // Compiler can't optimize
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
Fix it yourself:
void lower1(char *s) {
size_t len = strlen(s); // Calculate once
for (size_t i = 0; i < len; i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
Linking {#linking}
What is Linking?
Linking is the process of combining multiple object files to create an executable. It happens at different times:
- Compile time: Static linking
- Load time: Dynamic linking
- Run time: Dynamic library loading
The Linking Process
main.c sum.c
│ │
↓ ↓
cpp, cc1, as cpp, cc1, as
↓ ↓
main.o sum.o
└────────┬─────────┘
↓
Linker (ld)
↓
prog (executable)
Static Linking:
linux> gcc -Og -o prog main.c sum.c
linux> ./prog
Why Linkers?
Modularity:
- Write programs as collections of smaller files
- Build libraries of common functions
- Easier to maintain and reuse
Efficiency:
- Time: Change one file, recompile only that file
- Space: Libraries share code across programs
Object Files {#object-files}
Three Types of Object Files
| Type | Extension | Purpose |
|---|---|---|
| Relocatable | .o | Can be combined with other .o files |
| Executable | a.out | Can be loaded and run directly |
| Shared | .so | Loaded at runtime (DLL on Windows) |
ELF Object File Format
ELF = Executable and Linkable Format (standard binary format on Linux)
┌────────────────┐
│ ELF Header │ ← Word size, file type, machine type
├────────────────┤
│ Segment Header │ ← Memory segments for execution
├────────────────┤
│ .text │ ← Machine code
├────────────────┤
│ .rodata │ ← Read-only data (jump tables, strings)
├────────────────┤
│ .data │ ← Initialized globals
├────────────────┤
│ .bss │ ← Uninitialized globals (no space)
├────────────────┤
│ .symtab │ ← Symbol table
├────────────────┤
│ .rel.text │ ← Relocation info for .text
├────────────────┤
│ .rel.data │ ← Relocation info for .data
├────────────────┤
│ .debug │ ← Debugging info
├────────────────┤
│ Section Header │ ← Offsets and sizes
└────────────────┘
Symbol Resolution {#symbol-resolution}
Types of Symbols
Global symbols:
- Defined by module m, can be referenced by other modules
- Non-static C functions and non-static global variables
External symbols:
- Global symbols referenced by module m but defined elsewhere
- E.g., calling
printf()fromstdio.h
Local symbols:
- Defined and referenced exclusively by module m
- C functions and global variables with
staticattribute
Strong vs Weak Symbols
| Type | Definition |
|---|---|
| Strong | Procedures and initialized globals |
| Weak | Uninitialized globals |
Linker's Symbol Rules
Rule 1: Multiple strong symbols are not allowed
- Each item can be defined only once
- Otherwise: Linker error!
Rule 2: Given a strong symbol and multiple weak symbols, choose strong
- Weak references resolve to the strong symbol
Rule 3: If there are multiple weak symbols, pick an arbitrary one
- Can override with
gcc -fno-common
Symbol Resolution Examples
// Example 1: Two strong symbols (ERROR)
int x;
p1() {}
int x; // ERROR: multiple strong symbols
p2() {}
// Example 2: Strong and weak (OK, weak wins)
int x = 7;
p1() {}
int x; // x refers to strong definition
p2() {}
// Example 3: Two weak (arbitrary)
int x;
p1() {}
double x; // Arbitrary choice
p2() {}
// Writes to x might overwrite other variables!
Best Practices for Global Variables
- Avoid global variables if possible
- Use
staticif you can - Initialize if you define a global variable
- Use
externif you reference external globals
Relocation {#relocation}
What is Relocation?
After symbol resolution, the linker:
- Merges separate code and data sections
- Relocates symbols from relative positions to final addresses
- Updates all references to reflect new positions
Relocation Entries
When the assembler can't determine final addresses, it creates relocation entries:
// main.c
int array[2] = {1, 2};
int main() {
int val = sum(array, 2);
return val;
}
Assembly with relocation info:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: be 02 00 00 00 mov $0x2,%esi
9: bf 00 00 00 00 mov $0x0,%edi
e: e8 00 00 00 00 callq <sum>
a: R_X86_64_32 array // Relocation entry
f: R_X86_64_PC32 sum-0x4
13: 48 83 c4 08 add $0x8,%rsp
17: c3 retq
Relocated Executable
00000000004004d0 <main>:
4004d0: 48 83 ec 08 sub $0x8,%rsp
4004d4: be 02 00 00 00 mov $0x2,%esi
4004d9: bf 18 10 60 00 mov $0x601018,%edi // array address
4004de: e8 05 00 00 00 callq 4004e8 <sum> // sum address
4004e3: 48 83 c4 08 add $0x8,%rsp
4004e7: c3 retq
Static Libraries {#static-libraries}
Creating Static Libraries
A static library (.a) concatenates related object files with an index:
atoi.o printf.o random.o
│ │ │
└───────┼────────┘
↓
Archiver (ar)
↓
libc.a
unix> ar rs libc.a atoi.o printf.o ... random.o
Using Static Libraries
unix> gcc -static -o prog main.o -lm
# Links with libm.a
Linker algorithm:
- Scan .o and .a files in command line order
- Keep list of unresolved references
- When a file resolves references, link it in
- Error if unresolved references at end
Important: Libraries should go at the END of the command line!
# Correct order:
gcc main.o -lm # main.o can use math functions
# Wrong order:
gcc -lm main.o # May fail: libm can't resolve main's symbols
Dynamic Linking {#dynamic-linking}
Why Dynamic Linking?
Static library problems:
- Duplication in stored executables
- Duplication in running executables
- Bug fixes require relinking
Shared Libraries (.so)
Object files loaded and linked at runtime:
- Load-time linking: Executable linked to library when loaded
- Run-time linking: Libraries loaded with
dlopen()
Dynamic Linking at Load Time
main2.c
↓
main2.o ─────┐
↓ ↓ Linker
prog2l (partially linked)
↓
Loader (execve)
↓
libc.so + libvector.so (runtime)
Dynamic Linking at Run Time
#include <dlfcn.h>
void *handle = dlopen("./libvector.so", RTLD_LAZY);
// Get function pointer
void (*addvec)(int*, int*, int*, int) = dlsym(handle, "addvec");
// Call it
addvec(x, y, z, 2);
// Unload when done
dlclose(handle);
Uses:
- Plugin systems
- High-performance web servers
- Runtime library interpositioning
Loading Executable Files {#loading}
Memory Layout of Running Program
Kernel virtual memory
┌────────────────────────────┐
│ User stack │ ← Created at runtime
│ (grows down) │
├────────────────────────────┤
│ Memory-mapped region │
│ (shared libraries) │
├────────────────────────────┤
│ heap │ ← Created by malloc
│ (grows up) │
├────────────────────────────┤
│ .data .bss .text │ ← From executable
└────────────────────────────┘
↓
0x400000 (text starts here)
How Execution Begins
- Shell calls
execve()to load executable - Kernel loads program into memory
- Kernel transfers control to entry point (usually
_start) _startcallsmain()- When
main()returns,exit()is called
Practice Problems {#practice-problems}
Problem 1: Cache Organization
Given:
- Cache size: 32 bytes
- Block size: 4 bytes
- Direct-mapped
Question: For 8-bit address 0x2A, what are the tag, index, and offset?
Answer:
- b = log2(4) = 2 bits offset
- s = log2(32/4) = 3 bits index
- t = 8 - 2 - 3 = 3 bits tag
0x2A = 00101010₂
- Offset:
10= 2 - Index:
010= 2 - Tag:
001= 1
Problem 2: Cache Miss Analysis
Given:
- Array:
int arr[1000] - Cache: 64-byte blocks, 4 bytes per int
Question: What is the miss rate for:
a) Sequential access: arr[0], arr[1], arr[2], ...
b) Stride-4 access: arr[0], arr[4], arr[8], ...
c) Stride-250 access: arr[0], arr[250], arr[500], ...
Answer: a) Block holds 16 ints (64/4). Sequential: 1 miss per 16 accesses → 6.25% miss rate b) Block holds 16 ints. Stride-4 accesses every 4th int: 1 miss per 4 accesses → 25% miss rate c) Stride-250: accesses 0, 250, 500... Each block holds 16 ints, so 250 % 16 = 10, 500 % 16 = 4 → no two accesses in same block → 100% miss rate
Problem 3: Matrix Multiply Optimization
Given matrices A, B, C where C = A × B.
Question: Which loop order has best cache performance?
// Option 1: ijk
for (i) for (j) for (k) c[i][j] += a[i][k] * b[k][j];
// Option 2: kij
for (k) for (i) for (j) c[i][j] += a[i][k] * b[k][j];
Answer: Option 2 (kij) is better.
- ijk:
b[k][j]is column-wise (bad spatial locality) - kij:
a[i][k]andb[k][j]are both row-wise for inner loop (good)
Problem 4: Symbol Resolution
// main.c
extern int count;
int total = 5;
void main() { ... }
// lib.c
int count = 0;
static int limit = 100;
void process() { ... }
Question: Classify each symbol.
Answer:
count(lib.c): External, referenced by maintotal(main.c): Global, defined herelimit(lib.c): Local (static), only in lib.cprocess(lib.c): Global, can be called from main
Common Exam Traps {#common-exam-traps}
Trap 1: Cache Hit/Miss Timing
int arr[1000];
for (int i = 0; i < 1000; i++)
x += arr[i];
If cache holds 16 ints (64 bytes):
- First access: miss (load block)
- Next 15 accesses: hits
- Miss rate: 1000/16 = 62.5 misses, but only ~63 misses total
Trap 2: Write-Back Dirty Bit
A dirty block (modified in cache) must be written back to memory before eviction. This adds time to the miss penalty.
Trap 3: Compiler Optimization Assumptions
The compiler assumes no pointer aliasing (two pointers to same memory) unless told otherwise. With restrict keyword:
// Compiler can assume no aliasing:
void copy(int *restrict dst, int *restrict src, int n);
// Must check for overlap:
void copy(void *dst, void *src, int n);
Trap 4: Library Order Matters
# Wrong:
gcc -lm main.o # May fail
# Correct:
gcc main.o -lm # Works
Trap 5: Weak Symbol Resolution
int x; // Weak (uninitialized)
double x; // Also weak (uninitialized)
void foo() {} // Strong
Two weak symbols: linker picks arbitrarily. Could cause subtle bugs!
Trap 6: Relocation not for Variables
Code relocation entries tell the linker how to fix up instruction addresses. But variable addresses are resolved directly:
int x = 5;
int *p = &x; // p = address of x (resolved at link time)
Trap 7: Array Parameter sizeof
void func(int arr[10]) {
// arr is actually int*
sizeof(arr); // 8, not 40!
}
Trap 8: Cache Associativity Confusion
Direct-mapped = 1 line per set n-way set associative = n lines per set
For same cache size, higher associativity means fewer sets but more lines per set.
Final Summary {#final-summary}
Caches
- Memory hierarchy: Fast, small, expensive at top; slow, large, cheap at bottom
- Locality: Programs access same data repeatedly (temporal) and nearby data (spatial)
- Cache organization: Address = tag + index + offset
- Direct-mapped: Each block maps to one location
- Set associative: Each block can go in E locations per set
- Write policies: Write-through/write-back, write-allocate/no-write-allocate
Writing Cache-Friendly Code
- Focus on inner loops
- Stride-1 access for spatial locality
- Use data repeatedly for temporal locality
- Blocking for matrix operations
Compiler Optimization
- Constant folding: Pre-calculate constants
- Common subexpression elimination: Don't recalculate
- Dead code elimination: Remove unreachable code
- Strength reduction: Replace expensive ops with cheap ones
- Code motion: Move invariant code out of loops
- Loop unrolling: Reduce loop overhead
Linking
- Symbol resolution: Match symbol references to definitions
- Strong symbols win over weak symbols
- Relocation: Update addresses after placement
- Static libraries: Archives of .o files
- Dynamic linking: Shared libraries at load or run time
- Library order matters in command line
Key Insight
Understanding how the system works — memory hierarchy, caches, compilation, linking — lets you write better code by anticipating what the hardware and software will do with your program.
Good luck with your final exam!