Understanding Java Garbage Collection Notes

(These notes are from watching “Understanding Java Garbage Collection and What You Can Do About It”)

Understanding Java Garbage Collection

This Talk’s Purpose / Goals

  • Focused on GC education
  • Not a “how to use flags to tune a collector” talk
  • Talk about how the “GC machine” works
  • Purpose: Once you understand how it works, you can use your own brain…
  • You’ll learn just enough to be dangerous

About the speaker

  • co-founder, CTO @Azul Systems
  • Created Pauseless & C4 core GC algorithms
  • A long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc…

About Azul

  • Make scalable Virtual Machines
  • Have built “whatever it takes to get the job done” since 2002
  • 3 generations of custom SMP Multi-core HW (Vega)
  • Zing: Pure software for commodity x86
  • Known for low latency, consistent execution, and large data set excellence

High level agenda

  • GC fundamentals and key mechanisms
  • GC terminology & metrics
  • Classifying current commercially available collectors
  • Why stop-the-world is a problem
  • The C4 collector: What a solution to STW looks like

Memory use

How big is your heap?

Why you should understand (at least a little) how GC works

The story of the good little architect

  • A good architect must, first and foremost, be able to impose their architectural choices on the project…
  • Early in Azul’s concurrent collector days, we encountered an application exhibiting 18 second pauses
    • Upon investigation, we found the collector was performing 10s of millions of object finalizations per GC cycle (We have since made reference processing fully concurrent)
  • Every single class written in the project had a finalizer
    • The only work the finalizers did was nulling every reference field (sketched below)
  • The right discipline for a C++ ref-counting environment
    • The wrong discipline for a precise GC environment
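
A hypothetical reconstruction of that anti-pattern (class and field names invented for illustration; the actual project code is not shown in the talk):

```java
// Hypothetical sketch of the finalizer anti-pattern described above.
class ProjectClass {
    private Object fieldA;
    private Object fieldB;

    // Every class in the project had a finalizer like this. The JVM must
    // register and later process a finalizer for every single instance,
    // which is what produced tens of millions of finalizations per GC
    // cycle. The nulling itself buys nothing: a precise GC reclaims dead
    // objects whether or not their fields were nulled first.
    @Override
    protected void finalize() throws Throwable {
        fieldA = null;
        fieldB = null;
        super.finalize();
    }
}
```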

Much of what people seem to “know” about GC is wrong

  • In many cases, it’s much better than you may think
    • GC is extremely efficient. Much more so than malloc()
    • Dead objects cost nothing to collect
    • GC will find all the dead objects (including cyclic graphs)
  • In many cases, it’s much worse than you may think
    • Yes, it really does stop for ~1 sec per live GB (unless you use Zing)
    • No, GC does not mean you can’t have memory leaks
    • No, those pauses you eliminated from your 20 minute test are not gone

Some GC terminology

A basic terminology example: What is a concurrent collector?

  • A concurrent collector performs GC work concurrently with the application’s own execution
  • A parallel collector uses multiple CPUs to perform GC
  • A stop-the-world collector performs GC while the application is completely stopped
  • An incremental collector performs a GC operation or phase as a series of smaller discrete operations with (potentially long) gaps in between
  • “Mostly” (as in “mostly concurrent”) means sometimes it isn’t (usually a different fallback mechanism exists)

Precise vs. Conservative Collection

  • A collector is conservative if it is unaware of some object references at collection time, or is unsure about whether a field is a reference or not
  • A collector is precise if it can fully identify and process all object references at the time of collection
    • A collector MUST be precise in order to move objects
    • The COMPILERS need to produce a lot of information (oopmaps)
  • All commercial server JVMs use precise collectors
    • All commercial server JVMs use some form of a moving collector

Safepoints

  • A GC Safepoint is a point or range in a thread’s execution where the collector can identify all the references in that thread’s execution stack
    • “Safepoint” and “GC Safepoint” are often used interchangeably
    • But there are other types of safepoints, including ones that require more information than a GC safepoint does
  • “Bringing a thread to a safepoint” is the act of getting a thread to reach a safepoint and not execute past it
    • Close to, but not exactly the same as “stop at a safepoint”
      • e.g. JNI: code can keep running in, but not past, the safepoint
    • Safepoint opportunities are (or should be) frequent
  • In a global safepoint all threads are at a Safepoint
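
A conceptual sketch of bringing threads to a global safepoint. Real JVMs implement the poll as JIT-emitted code (e.g. a read of a protected page), not as a volatile flag and latches; this toy version only illustrates the handshake:

```java
import java.util.concurrent.CountDownLatch;

// Conceptual sketch only: shows threads reaching a safepoint and not
// executing past it until a global operation completes.
class SafepointSketch {
    static volatile boolean safepointRequested = false;
    static CountDownLatch allStopped;   // counted down as threads arrive
    static CountDownLatch resume;       // released when GC work is done

    // Called by application threads at safepoint opportunities
    // (e.g. loop back-edges, method returns) -- these should be frequent.
    static void safepointPoll() throws InterruptedException {
        if (safepointRequested) {
            allStopped.countDown();     // "I am at a safepoint"
            resume.await();             // do not execute past it
        }
    }

    // Called by the "VM thread" to run a stop-the-world operation.
    static void runAtGlobalSafepoint(int threadCount, Runnable gcWork)
            throws InterruptedException {
        allStopped = new CountDownLatch(threadCount);
        resume = new CountDownLatch(1);
        safepointRequested = true;
        allStopped.await();             // wait until every thread arrives
        gcWork.run();                   // all stacks are now scannable
        safepointRequested = false;
        resume.countDown();             // let the application continue
    }
}
```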

What’s common to all precise GC mechanisms?

  • Identify the live objects in the memory heap
  • Reclaim resources held by dead objects
  • Periodically relocate live objects

  • Examples:

    • Mark/Sweep/Compact (common for old generations)
    • Copying collector (common for young generations)

Mark (aka “Trace”)

  • Start from “roots” (thread stacks, statics, etc…)
  • “Paint” anything you can reach as “live”
  • At the end of a mark pass:

    • all reachable objects will be marked “live”
    • all non-reachable objects will be marked “dead” (aka “non-live”)
  • Note: work is generally linear to “live set”
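
A minimal sketch of a mark pass over a toy object graph (invented types; assumes reference lists are non-null with no null entries):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy mark pass. Work done is proportional to the number of objects
// actually reached: the live set, not the heap size.
class MarkSketch {
    static class Node {
        boolean marked;
        List<Node> references;   // assumed non-null, no null entries
    }

    static void mark(List<Node> roots) {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue;       // already painted live
            n.marked = true;              // paint it live
            pending.addAll(n.references); // trace outgoing references
        }
        // Anything still unmarked is unreachable, i.e. dead.
    }
}
```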

Sweep

  • Scan through the heap, identify “dead” objects and track them somehow

    • (Usually in some form of free list)
  • Note: work is generally linear to heap size
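
A minimal sketch of a sweep pass over a toy heap layout (invented types):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sweep pass. The scan visits every chunk, live or dead, which is
// why sweep work is linear in heap size.
class SweepSketch {
    static class Chunk {
        int size;
        boolean marked;
    }

    // Returns the free list of dead chunks; clears marks for next cycle.
    static List<Chunk> sweep(List<Chunk> heap) {
        List<Chunk> freeList = new ArrayList<>();
        for (Chunk c : heap) {          // the whole heap, every cycle
            if (c.marked) {
                c.marked = false;       // live: reset for the next mark
            } else {
                freeList.add(c);        // dead: track for reuse
            }
        }
        // Real sweepers also coalesce adjacent dead chunks.
        return freeList;
    }
}
```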

Compact

  • Over time, heap will get “swiss cheesed”: contiguous dead space between objects may not be large enough to fit new objects (aka “fragmentation”)
  • Compaction moves live objects together to reclaim contiguous empty space (aka “relocate”)
  • Compaction has to correct all object references to point to new object locations (aka “remap”)
  • Remap scan must cover all references that could possibly point to relocated objects

  • Note: work is generally linear to “live set”
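
A minimal sketch of compact = relocate + remap, with an invented representation where the heap is an array and a “reference” is an index into it:

```java
// Toy compaction: relocate slides live objects together; remap then
// fixes every reference to point at the new locations.
class CompactSketch {
    static class Obj {
        boolean marked;     // set by a preceding mark pass
        int[] fields;       // outgoing references, as heap indices
    }

    static void compact(Obj[] heap) {
        // Relocate: slide live objects toward the bottom of the heap,
        // recording oldIndex -> newIndex in a forwarding table.
        int[] forwarding = new int[heap.length];
        int next = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != null && heap[i].marked) {
                forwarding[i] = next;
                heap[next++] = heap[i];
            }
        }
        for (int i = next; i < heap.length; i++) heap[i] = null; // now empty

        // Remap: every reference that could point at a moved object must
        // be updated. (Roots would need the same treatment; omitted.)
        for (int i = 0; i < next; i++) {
            int[] f = heap[i].fields;
            for (int j = 0; j < f.length; j++) f[j] = forwarding[f[j]];
        }
    }
}
```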

Copy

  • A copying collector moves all live objects from a “from” space to a “to” space & reclaims “from” space
  • At start of copy, all objects are in “from” space and all references point to “from” space
  • Start from “root” references, copy any reachable object to “to” space, correcting references as we go
  • At end of copy, all objects are in “to” space, and all references point to “to” space

  • Note: work is generally linear to “live set”
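
A minimal sketch of a copying collector in the spirit of Cheney’s algorithm (invented types; only live objects are ever touched):

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy copying collection: copy reachable objects from "from" space to
// "to" space, correcting references as we go.
class CopySketch {
    static class Node { List<Node> references = new ArrayList<>(); }

    static List<Node> copyCollect(List<Node> roots) {
        Map<Node, Node> forwarded = new IdentityHashMap<>();
        List<Node> toSpace = new ArrayList<>(); // stands in for "to" space

        List<Node> newRoots = new ArrayList<>();
        for (Node r : roots) newRoots.add(copy(r, forwarded, toSpace));

        // Scan copied objects and fix their outgoing references; the list
        // grows as the scan discovers more live objects (Cheney's scan
        // pointer does this without recursion).
        for (int scan = 0; scan < toSpace.size(); scan++) {
            Node n = toSpace.get(scan);
            for (int i = 0; i < n.references.size(); i++) {
                n.references.set(i,
                        copy(n.references.get(i), forwarded, toSpace));
            }
        }
        return newRoots;  // "from" space can now be reclaimed wholesale
    }

    private static Node copy(Node old, Map<Node, Node> forwarded,
                             List<Node> toSpace) {
        Node copy = forwarded.get(old);
        if (copy == null) {
            copy = new Node();
            copy.references = new ArrayList<>(old.references); // from-space refs, for now
            forwarded.put(old, copy);
            toSpace.add(copy);        // queued for the scan loop above
        }
        return copy;
    }
}
```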

Mark/Sweep/Compact, Copy, Mark/Compact

  • Copy requires 2x the max live set to be reliable
  • Mark/Compact typically requires 2x the max live set in order to fully recover garbage in each cycle
  • Mark/Sweep/Compact only requires 1x (plus some)
  • Copy and Mark/Compact are linear only to live set
  • Mark/Sweep/Compact linear (in sweep) to heap size
  • Mark/Sweep/(Compact) may be able to avoid some moving work
  • Copying is typically monolithic

Generational Collection

  • Weak Generational Hypothesis: “most objects die young”
  • Focus collection effort on young generation:
    • Use a moving collector: work is linear to the live set
    • The live set in the young generation is a small % of the space
    • Promote objects that live long enough to older generations
  • Only collect older generations as they fill up
    • “Generational filter” reduces rate of allocation into older generations
  • Tends to be (order of magnitude) more efficient

    • Great way to keep up with high allocation rate
    • Practical necessity for keeping up with processor throughput
  • Requires a “Remembered set”: a way to track all references into the young generation from the outside

  • Remembered set is also part of “roots” for young generation collection
  • No need for 2x the live set: Can “spill over” to old gen
  • Usually want to keep surviving objects in young generation for a while before promoting them to the old generation
    • Immediate promotion can significantly reduce gen. filter efficiency
    • Waiting too long to promote can eliminate generational benefits

How does the (generational) remembered set work?

  • Generational collectors require a “Remembered set”: a way to track all references into the young generation from the outside
  • Each store of a NewGen reference into an OldGen object needs to be intercepted and tracked
  • Common technique: “Card Marking”
    • A bit (or byte) indicating a word (or region) in OldGen is “suspect”
  • Write barrier used to track references
    • Common technique (e.g. HotSpot): blind stores on reference write
    • Variants: precise vs. imprecise card marking, conditional vs. unconditional (sketched below)
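
A minimal sketch of a card table with a blind-store write barrier. The 512-byte card size matches HotSpot’s; everything else is an invented simplification:

```java
// Toy card table: the heap is divided into 512-byte "cards", and a write
// barrier on every reference store marks the card for the updated OldGen
// location. At young-gen collection time, only dirty cards are scanned
// for old->young references, instead of the whole old generation.
class CardTableSketch {
    static final int CARD_SHIFT = 9;    // 512-byte cards
    static final byte DIRTY = 0;
    static final byte CLEAN = 1;
    final byte[] cardTable;

    CardTableSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
        java.util.Arrays.fill(cardTable, CLEAN);
    }

    // Write barrier: run after every reference store "obj.field = value".
    // A "blind" barrier stores the byte unconditionally, whether or not
    // the value is actually a young-gen reference.
    void postWriteBarrier(long fieldAddress) {
        cardTable[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }

    // Young-gen collection: only dirty cards get scanned.
    void scanDirtyCards() {
        for (int card = 0; card < cardTable.length; card++) {
            if (cardTable[card] == DIRTY) {
                cardTable[card] = CLEAN;
                // ...scan heap words in [card << CARD_SHIFT,
                //    (card + 1) << CARD_SHIFT) for NewGen references and
                //    treat any found as young-gen roots...
            }
        }
    }
}
```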

The typical combos in commercial server JVMs

  • Young generation usually uses a copying collector
  • Young generation is usually monolithic, stop-the-world

  • Old generation usually uses Mark/Sweep/Compact

  • Old generation may be STW, or Concurrent, or mostly-Concurrent, or Incremental-STW, or mostly-Incremental-STW

Other useful terms & metrics

Empty memory and CPU/throughput

Heap size vs. GC CPU %

Two intuitive limits

  • If we had exactly 1 byte of empty memory at all times, the collector would have to work “very hard”, and GC would take 100% of the CPU time
  • If we had infinite empty memory, we would never have to collect, and GC would take 0% of the CPU time
  • GC CPU % will follow a rough 1/x curve between these two limit points, dropping as the amount of memory increases

Empty memory needs

  • The amount of empty memory in the heap is the dominant factor controlling the amount of GC work
  • For both Copy and Mark/Compact collectors, the amount of work per cycle is linear to live set
  • The amount of memory recovered per cycle is equal to the amount of unused memory (heap size - live set)
  • The collector has to perform a GC cycle when the empty memory runs out
  • A Copy or Mark/Compact collector’s efficiency doubles with every doubling of empty memory
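
A back-of-the-envelope version of that last claim, writing L for the live set and H for the heap size:

```latex
% Work per collection cycle is roughly proportional to the live set L.
% Memory recovered per cycle is the empty memory E = H - L.
% So the GC work done per byte of allocation scales as:
\frac{\text{GC work}}{\text{bytes allocated}} \;\propto\; \frac{L}{H - L}
% Doubling the empty memory H - L (with L fixed) halves the work per
% allocated byte, i.e. collector efficiency doubles.
```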

What empty memory controls

  • Efficiency (amount of collector work needed per amount of application work performed)
  • Frequency of pauses (if the collector performs any STW operations)
  • Does not control pause times (only their frequency)
  • In Mark/Sweep/Compact collectors that pause for sweeping, more empty memory means less frequent but LARGER pauses

Some non-monolithic-STW stuff

Concurrent Marking

  • Mark all reachable objects as “live”, but the object graph is “mutating” under us
  • Classic concurrent marking race: the mutator may move a reference that has not yet been seen by the marker into an object that has already been visited
    • If not intercepted or prevented in some way, will corrupt the heap
  • Example technique: track mutations, multi-pass marking

    • Track reference mutation during mark (e.g. in card table)
    • Re-visit all mutated references (and track new mutations)
    • When the set is “small enough”, do an STW catch-up (mostly concurrent)
  • Note: work grows with mutation rate, may fail to finish
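
A minimal sketch of mostly-concurrent, multi-pass marking (invented types; real collectors typically track mutations in card tables rather than a set):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy mostly-concurrent marking: the mutator's write barrier logs every
// reference it stores while the marker runs; the marker re-traces from
// logged references until the log is "small enough", then a final STW
// pass (not shown) catches the remainder.
class ConcurrentMarkSketch {
    static class Node {
        volatile boolean marked;
        List<Node> references = new ArrayList<>();
    }

    // Filled concurrently by the mutator's write barrier.
    final Set<Node> mutated = ConcurrentHashMap.newKeySet();

    // Run by mutator threads on every reference store "obj.field = target".
    void writeBarrier(Node target) {
        if (target != null) mutated.add(target);
    }

    void mark(List<Node> roots, int smallEnough) {
        traceFrom(roots);                          // pass 1, concurrent
        while (mutated.size() > smallEnough) {     // re-visit mutations
            List<Node> pass = new ArrayList<>(mutated);
            mutated.clear();                       // new mutations keep landing
            traceFrom(pass);
        }
        // stopTheWorld(); traceFrom(new ArrayList<>(mutated));
        // ^ the "mostly concurrent" STW catch-up. With a high enough
        //   mutation rate the loop above never converges: marking can
        //   fail to finish, which is the robustness problem noted later.
    }

    private void traceFrom(List<Node> start) {
        Deque<Node> pending = new ArrayDeque<>(start);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue;
            n.marked = true;                       // paint live
            for (Node ref : n.references)
                if (ref != null) pending.add(ref);
        }
    }
}
```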

Incremental Compaction

  • Track cross-region remembered sets (which region points to which)
  • To compact a single region, only need to scan regions that point into it to remap all potential references
  • Identify region sets that fit in limited time

    • Each such set of regions is a STW increment
    • Safe to run application between (but not within) increments
  • Note: work can grow with the square of the heap size
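
A minimal sketch of the cross-region remembered-set bookkeeping behind this (invented representation; regions are just ints):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy cross-region remembered sets: each region tracks which regions hold
// references INTO it. To compact region R in one STW increment, only the
// regions in R's set need scanning to remap references.
class RegionRememberedSets {
    // region -> set of regions containing references into it
    final Map<Integer, Set<Integer>> incomingRefs = new HashMap<>();

    // Write-barrier bookkeeping: a reference stored in fromRegion now
    // points into toRegion.
    void recordReference(int fromRegion, int toRegion) {
        if (fromRegion != toRegion) {
            incomingRefs.computeIfAbsent(toRegion, r -> new HashSet<>())
                        .add(fromRegion);
        }
    }

    // The cost of compacting one region in a single STW increment is the
    // number of regions that must be scanned for remapping. A "popular"
    // region is pointed at from a large fraction of the heap, making its
    // increment too big to fit the time budget -- this is why the total
    // work can grow with the square of the heap size.
    int remapScanCost(int region) {
        return incomingRefs.getOrDefault(region, Set.of()).size();
    }
}
```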

Delaying the inevitable

  • Some form of copying/compaction is inevitable in practice
    • And compacting anything requires scanning/fixing all references to it
  • Delay tactics focus on getting “easy empty space” first
    • This is the focus for the vast majority of GC tuning
  • Most objects die young (generational)
    • So collect young objects only, as much as possible, hope for short STW
    • But eventually, some old dead objects must be reclaimed
  • Most old dead space can be reclaimed without moving it
    • (e.g. CMS) track dead space in lists, and reuse it in place
    • But eventually, space gets fragmented, and needs to be moved
  • Much of the heap is not “popular” (e.g. G1, “Balanced”)
    • A non-popular region will only be pointed to from a small % of the heap
    • So compact non-popular regions in short STW pauses
    • But eventually, popular objects and regions need to be compacted

HotSpot ParallelGC

  • Monolithic STW copying NewGen
  • Monolithic STW Mark/Sweep/Compact OldGen

ConcMarkSweepGC (aka CMS)

  • Monolithic STW copying NewGen (ParNew)
  • Mostly Concurrent, non-compacting OldGen (CMS)
    • Mostly Concurrent marking
      • Mark concurrently while mutator is running
      • Track mutations in card marks
      • Revisit mutated cards (repeated as needed)
      • STW to catch up on mutations, ref processing, etc
    • Concurrent Sweeping
    • Does not Compact (maintains free list, does not move objects)
  • Fallback to Full Collection (Monolithic STW)
    • Used for Compaction, etc

HotSpot G1GC (aka “Garbage First”)

  • Monolithic STW copying NewGen
  • Mostly Concurrent, OldGen marker
    • Mostly Concurrent marking
      • STW to catch up on mutations, ref processing, etc
    • Tracks inter-region relationships in remembered sets
  • STW mostly incremental compacting old gen
    • Objective: “Avoid, as much as possible, having a full GC…”
    • Compact sets of regions that can be scanned in a limited time
    • Delay compaction of popular objects, popular regions
  • Fallback to Full Collection (Monolithic STW)
    • Used for compacting popular objects, popular regions, etc

The “Application Memory Wall” (or: why is STW a problem?)

A simple observation:

  • Application instances appear to be unable to make effective use of modern server memory capacities

How much memory do applications need?

  • “640KB ought to be enough for anybody”

    • DOH!
  • There’s no right number

  • Target moves at 50x-100x per decade

“Tiny” application history

Application memory wall

What is causing the Application Memory Wall?

  • GC is a clear and dominant cause
  • There seem to be practical heap size limits for applications with responsiveness requirements
  • (Virtually) All current commercial JVMs will exhibit a multi-second pause on a normally utilized 2-6GB heap
    • It’s a question of “When” and “How often”, not “If”
    • GC tuning only moves the “when” and the “how often” around
  • Root cause: The link between scale and responsiveness

What quality of GC is responsible for the Application Memory Wall

  • NOT about overhead or efficiency
    • CPU utilization, bottlenecks, memory consumption and utilization
  • NOT about speed
  • NOT about minor GC events
  • NOT about the frequency of very large pauses
  • ALL about the worst observable pause behavior
    • People avoid building/deploying visibly broken systems

We need to solve the right problems

  • Scale is artificially limited by responsiveness
  • Responsiveness must be unlinked from scale:
    • Heap size, Live Set size, Allocation rate, Mutation rate
    • Transaction Rate, Concurrent users, Data set size, etc.
    • Responsiveness must be continually sustainable
    • Can’t ignore “rare” events
  • Eliminate all STW fallbacks
    • At modern server scales, any STW fallback is a failure

The problems that need solving

  • Robust Concurrent Marking
    • In the presence of high mutation and allocation rates
    • Cover modern runtime semantics (e.g. weak refs, lock deflation)
  • Compaction that is not monolithic-STW
    • e.g. stay responsive while compacting 256 GB heaps
    • Must be robust: not just a tactic to delay STW compaction
    • (current “incremental STW” attempts fall short on robustness)
  • Young-Gen that is not monolithic-STW
    • Stay responsive while promoting multi-GB data spikes
    • Concurrent or “incremental STW” may both be OK
    • Surprisingly little work done in this specific area

C4 (Continuously Concurrent Compacting Collector) Collector

  • Concurrent guaranteed-single-pass marker
  • Concurrent Compactor
  • Concurrent, compacting old generation
  • Concurrent, compacting new generation
  • No STW fallback

Benefits

ELIMINATES GC as a concern for enterprise applications

Just use java -Xmx40g