Building Software Systems At Google and Lessons Learned Notes

(These notes are from watching “Building Software Systems at Google and Lessons Learned” by Jeff Dean.)

Software Systems at Google

Plan

  • Evolution of various systems at Google
    • computing hardware
    • core search systems
    • infrastructure software
  • Techniques for building large-scale systems
    • decomposition into services
    • design patterns for performance & reliability

Google Web Search 1999 vs. 2010

  • Number of docs: 1000x
  • queries processed/day: 1000x
  • per doc info in index: 3x
  • update latency: ~50000x
  • avg. query latency: <1s to <0.2s (~5x)

  • More machines * faster machines: ~1000x

Continuous evolution

  • 7 significant revisions in last 11 years
  • often rolled out without users realizing we’ve made major changes

Research Project, circa 1997

Basic Principles

  • Index Servers (sketched below):
    • Given (query), return a sorted list of (docid, score) pairs
    • partitioned (“sharded”) by docid
    • index shards are replicated for capacity
    • cost is O(# queries * # docs in index)
  • Doc Servers:
    • Given (docid, query), generate (title, snippet)
      • snippet is query-dependent
    • map from docid to full text of docs
    • also partitioned by docid
    • cost is O(# queries)
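
A minimal sketch of the fan-out this implies, assuming a toy IndexShard class with an in-memory postings map (illustrative names, not Google's actual interfaces): each shard returns its local top-k (docid, score) pairs and the caller merges them.

```python
from heapq import nlargest

class IndexShard:
    """Illustrative index shard: postings for the slice of docids it owns."""
    def __init__(self, postings):
        self.postings = postings   # {term: {docid: score}}

    def search(self, query, k):
        hits = self.postings.get(query, {})
        # Return this shard's top-k (docid, score) pairs for the query.
        return nlargest(k, hits.items(), key=lambda pair: pair[1])

def search_all_shards(shards, query, k=10):
    """Fan the query out to every docid-sharded index server, then merge the per-shard top-k lists."""
    partial = []
    for shard in shards:               # a real system issues these as parallel RPCs
        partial.extend(shard.search(query, k))
    return nlargest(k, partial, key=lambda pair: pair[1])
```

Replicating each shard adds query capacity without changing this structure; the work still grows roughly with (# queries * # docs in index), as noted above.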

Built servers ourselves

“Corkboards” (1999)

Serving System, circa 1999

Caching in Web Search Systems

  • Cache servers:
    • cache both index results and doc snippets
    • hit rates typically 30-60%
      • depends on frequency of index updates, mix of query traffic, level of personalization, etc.
  • Main benefits:
    • performance! a few machines do work of 100s or 1000s
    • much lower query latency on hits
      • queries that hit in the cache tend to be both popular and expensive
  • Beware: big latency spike/capacity drop when the index is updated or the cache is flushed (a minimal cache sketch follows)
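
A minimal sketch of such a result cache, assuming an LRU eviction policy and a flush-on-index-update hook (both assumptions for illustration):

```python
from collections import OrderedDict

class ResultCache:
    """Tiny LRU cache for query results; flushed when the index is updated."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        key = query.strip().lower()          # crude query normalization
        if key in self.entries:
            self.entries.move_to_end(key)    # mark as recently used
            return self.entries[key]         # cache hit
        return None                          # cache miss: go to index/doc servers

    def put(self, query, results):
        key = query.strip().lower()
        self.entries[key] = results
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False) # evict the least recently used entry

    def flush(self):
        # Called on index updates; expect the latency spike described above right after this.
        self.entries.clear()
```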

Indexing (circa 1998-1999)

  • Simple batch indexing systems
    • no real checkpointing, so machine failures painful
    • no checksumming of raw data, so hardware bit errors caused problems
      • exacerbated by early machines having no ECC, no parity
      • sort 1 TB data without parity: ends up “mostly sorted”
      • sort it again: “mostly sorted” another way
  • “Programming with adversarial memory”
    • Developed file abstraction that stores checksums of small records and can skip and resynchronize after corrupted records
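
A rough sketch of that kind of file abstraction, under assumed details: each record carries a sync marker, a length, and a CRC32, and a reader that hits a bad checksum skips forward to the next marker. The exact layout here is illustrative, not the actual Google format.

```python
import struct, zlib

SYNC = b"\xde\xad\xbe\xef"  # marker used to resynchronize after corruption (illustrative)

def write_record(f, payload: bytes):
    # Record layout: [SYNC][4-byte length][4-byte CRC32][payload]
    f.write(SYNC + struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)

def read_records(f):
    data = f.read()
    pos = 0
    while True:
        pos = data.find(SYNC, pos)
        if pos < 0:
            return                            # no more records
        header = data[pos + 4: pos + 12]
        if len(header) < 8:
            return                            # truncated tail
        length, crc = struct.unpack("<II", header)
        payload = data[pos + 12: pos + 12 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            yield payload                     # good record
            pos += 12 + length
        else:
            pos += 4                          # corrupted: skip ahead to the next sync marker
```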

Google Data Centers (2000-2001)

Evolved

Increasing Index Size and Query Capacity

  • Huge increases in index size in ’99, ’00, ’01
    • from ~50M pages to more than 1000M pages
  • At the same time as huge traffic increases
    • ~20% growth per month in 1999, 2000
    • plus major new partners
  • Performance of index servers was paramount
    • deploying more machines continuously, but…
    • need ~10-30% software-based improvement every month

Dealing with Growth

In-Memory Index

  • Many positives:
    • big increase in throughput
    • big decrease in query latency
      • especially at the tail: expensive queries that previously needed GBs of disk I/O became much faster and cheaper
        • e.g. [“circle of life”]
  • Some issues
    • Variance: query touches 1000s of machines, not dozens
      • e.g. randomized cron jobs caused us trouble for a while
    • Availability: 1 or few replicas of each doc’s index data
      • availability of index data when machine failed: replicate important docs
      • queries of death that kill all the backends at once: very bad

Canary Requests

  • Problem: requests sometimes cause a server process to crash
    • testing can help, but can’t eliminate this
  • If sending the same or similar request to 1000s of machines:
    • they all might crash
    • recovery time for 1000s of processes is pretty slow
  • Solution: send a canary request first to one machine (sketched below)
    • if the RPC finishes successfully, go ahead and send to all the rest
    • if the RPC fails unexpectedly, try another machine
    • if it fails K times, reject the request
  • Crash only a few servers, not 1000s
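
A minimal sketch of that canary logic; machines, send_rpc, and the attempt limit are placeholders, not a real RPC framework:

```python
import random

def fan_out_with_canary(machines, request, send_rpc, max_canary_attempts=3):
    """Try the request on a few machines first; only fan out if a canary succeeds."""
    candidates = random.sample(machines, min(max_canary_attempts, len(machines)))
    for canary in candidates:
        try:
            send_rpc(canary, request)          # canary RPC
            break                              # it survived: safe to send everywhere
        except Exception:
            continue                           # possibly a query of death; try another machine
    else:
        raise RuntimeError("request rejected: canary failed on %d machines"
                           % len(candidates))
    # Fan out to everyone else (the canary machine already handled it).
    return [send_rpc(m, request) for m in machines if m is not canary]
```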

Query Serving System, 2004

  • Clean abstractions:
    • repository
    • document
    • attachments
    • scoring functions
  • Easy experimentation
    • Attach new doc and index data without full reindexing
  • Higher performance: designed from ground up to assume data is in memory

New Index Format

  • Old disk and in-memory index used two-level scheme:
    • Each hit was encoded as a (docid, word position in doc) pair
    • docid deltas encoded with Rice encoding
    • very good compression (originally designed for disk-based indices), but slow/CPU-intensive to decode
  • New format: single flat position space
    • data structures on side keep track of doc boundaries
    • posting lists are just lists of delta-encoded positions
    • need to be compact (can’t afford 32 bit value per occurrence)
    • … but need to be very fast to decode

Byte-Aligned Variable-length Encodings

Group Varint Encoding

  • Idea: encode groups of 4 32-bit values in 5-17 bytes
    • pull out the four 2-bit binary lengths into a single byte prefix (see the sketch below)
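
A sketch of the idea: one tag byte holds four 2-bit (length - 1) fields, followed by the four values in 1-4 little-endian bytes each, for 5-17 bytes per group. Illustrative code, not Google's production encoder:

```python
def _byte_len(v):
    # Number of bytes (1-4) needed to hold a 32-bit value.
    if v < (1 << 8):  return 1
    if v < (1 << 16): return 2
    if v < (1 << 24): return 3
    return 4

def group_varint_encode(values):
    """Encode exactly 4 uint32 values: 1 tag byte with four 2-bit (length-1)
    fields, then each value in 1-4 little-endian bytes."""
    assert len(values) == 4
    tag = 0
    body = bytearray()
    for i, v in enumerate(values):
        n = _byte_len(v)
        tag |= (n - 1) << (2 * i)
        body += v.to_bytes(4, "little")[:n]
    return bytes([tag]) + bytes(body)

def group_varint_decode(buf, offset=0):
    """Decode one group of 4 values; returns (values, new_offset)."""
    tag = buf[offset]
    offset += 1
    values = []
    for i in range(4):
        n = ((tag >> (2 * i)) & 0x3) + 1
        values.append(int.from_bytes(buf[offset:offset + n], "little"))
        offset += n
    return values, offset

encoded = group_varint_encode([7, 300, 70000, 5_000_000])
print(group_varint_decode(encoded))   # ([7, 300, 70000, 5000000], 10)
```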

  • Much faster than alternatives
    • 7-bit-per-byte varint: decode ~180M numbers/second
    • 30-bit varint w/ 2-bit length: decode ~240M numbers/second
    • Group varint: decode ~400M numbers/second

Query Serving Systems

  • Search all corpora in parallel
  • Performance: most of the corpora weren’t designed to deal with the high QPS level of web search
  • Mixing: Which corpora are relevant to query
    • changes over time
  • UI: How to organize results from different corpora?
    • interleaved?
    • separate sections for different types of documents?

System Software Evolution

Hardware

  • Machine + Racks
  • Clusters

Lots of failures when playing with REAL hardware.

Reliability/Availability must come from software!

Low-Level System Software Desires

  • If you have lots of machines, you want to:
    • store data persistently
      • w/ high availability
      • high r/w bandwidth
    • run large-scale computations reliably
      • without having to deal with machine failures

GFS Design

  • Master manages metadata
  • Data transfers are directly between clients/chunkservers
  • File broken into chunks (typically 64MB)
  • Chunks replicated across multiple machines (usually 3)
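
A toy sketch of the read path these bullets imply: the client asks the master only for chunk locations, then fetches bytes directly from a chunkserver replica. Class and method names are illustrative, not the real GFS API.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks

class Chunkserver:
    """Illustrative chunkserver: holds raw chunk data keyed by (filename, chunk_index)."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read_chunk(self, filename, chunk_index, offset_in_chunk, length):
        data = self.chunks[(filename, chunk_index)]
        return data[offset_in_chunk:offset_in_chunk + length]

class Master:
    """Holds only metadata: which chunkservers store each chunk (usually 3 replicas)."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations  # {(filename, chunk_index): [Chunkserver, ...]}

    def locate(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

def gfs_read(master, filename, offset, length):
    """Client read path: metadata from the master, bytes straight from a chunkserver."""
    chunk_index = offset // CHUNK_SIZE
    replicas = master.locate(filename, chunk_index)
    replica = replicas[0]   # a real client would pick a nearby/healthy replica
    return replica.read_chunk(filename, chunk_index, offset % CHUNK_SIZE, length)
```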

Google Cluster Software Environment

  • Cluster is 5K-20K machines, typically one of a handful of HW configurations
  • File system (GFS or Colossus) + cluster scheduling system are core services
  • Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s)
    • mix of batch and low-latency, user-facing production jobs

MapReduce

Problem: lots of data

  • Example: 20+ billion web pages x 20 KB = 400+ TB
  • One computer can read 50 MB/sec from disk
    • ~3 months to read the web
  • ~1000 hard drives just to store the web
  • Even more to do something with the data
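
The “~3 months” and “< 3 hours” figures follow directly from these numbers; a quick back-of-the-envelope check:

```python
web_size_bytes = 20e9 * 20e3        # 20+ billion pages x 20 KB/page = 400+ TB
disk_read_rate = 50e6               # one machine: ~50 MB/s sequential from disk

seconds = web_size_bytes / disk_read_rate     # 8,000,000 s
print(seconds / 86400)                        # ~93 days: roughly 3 months on one machine
print(seconds / 86400 / 1000)                 # ~0.09 days (< 3 hours) across 1000 machines
```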

Solution

  • Good news: same problem with 1000 machines, < 3 hours
  • Bad news: programming work
    • communication and coordination
    • recovering from machine failures
    • status reporting
    • debugging
    • optimization
    • locality
  • Bad news 2: repeat for every problem you want to solve

MapReduce History

  • 2003: Working on rewriting indexing system
    • start with raw page contents on disk
    • many phases
      • duplicate elimination
      • anchor text extraction
      • language identification
      • index shard generation
    • end with data structures for index and doc serving
  • Each phase was a hand-written parallel computation:
    • hand parallelized
    • hand-written checkpointing code for fault-tolerance

MapReduce

  • A simple programming model that applies to many large-scale computing problems
    • allowed us to express all phases of our indexing system
    • since used across broad range of CS areas, plus other scientific fields
    • Hadoop open-source implementation seeing significant usage
  • Hide messy details in MapReduce runtime library:
    • automatic parallelization
    • load balancing
    • network and disk transfer optimizations
    • handling of machine failures
    • robustness
    • improvements to core library benefit all users of library!

Typical problem

  • Read a lot of data
  • Map: extract something you care about from each record
  • Shuffle and Sort
  • Reduce: aggregate, summarize, filter, or transform
  • Write the results

Outline stays the same, user writes map and reduce functions to fit the problem
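
The canonical example is word counting: the user writes just the map and reduce functions below, and the runtime library does the rest. The tiny sequential driver here only stands in for that library so the example runs on its own.

```python
from collections import defaultdict

def map_fn(_key, text):
    # Map: emit (word, 1) for every word in the record.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum the partial counts for one word.
    yield word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the MapReduce runtime: map, shuffle/sort, reduce."""
    shuffled = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            shuffled[out_key].append(out_value)           # shuffle: group values by key
    results = []
    for key in sorted(shuffled):                          # sort keys
        results.extend(reduce_fn(key, shuffled[key]))     # reduce each group
    return results

print(mapreduce([("doc1", "the quick brown fox"), ("doc2", "the lazy dog")],
                map_fn, reduce_fn))
```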

Example: Rendering Map Tiles:

MapReduce: Scheduling

  • One master, many workers
    • Input data split into M map tasks
    • Reduce phase partitioned into R reduce tasks
    • Tasks are assigned to workers dynamically
  • Master assigns each map task to a free worker
    • Consider locality of data to worker when assigning task
    • Worker reads task input (often from local disk)
    • Worker produces R local files containing intermediate k/v pairs
  • Master assigns each reduce task to a free worker
    • Worker reads intermediate k/v pairs from map workers
    • Worker sorts & applies user’s Reduce op to produce the output

Parallel MapReduce

Task Granularity and Pipelining

  • Fine granularity tasks: many more map tasks than machines
    • Minimizes time for fault recovery
    • Can pipeline shuffling with map execution
    • Better dynamic load balancing

Scheduling

Fault Tolerance: Handled via re-execution

On worker failure:

  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in-progress reduce tasks
  • Task completion committed through master

On master failure:

  • State is checkpointed to GFS: new master recovers & continues

Very robust: lost 1600 of 1800 machines once, but finished fine

Refinement: Backup Tasks

  • Slow workers significantly lengthen completion time
    • Other jobs consuming resources on machine
    • Bad disks with soft errors transfer data very slowly
    • Weird things: processor caches disabled
  • Solution: Near end of phase, spawn backup copies of tasks
    • Whichever one finishes first “wins”
  • Effect: Dramatically shortens job completion time

Refinement: Locality Optimization

Master scheduling policy:

  • Asks GFS for locations of replicas of input file blocks
  • Map tasks typically split into 64MB (== GFS block size)
  • Map tasks scheduled so GFS input block replicas are on the same machine or same rack

Effect: Thousands of machines read input at local disk speed

  • Without this, rack switches limit read rate

Spanner

  • Storage & computation system that runs across many datacenters
    • single global namespace
      • names are independent of location(s) of data
      • fine-grained replication configurations
    • support mix of strong and weak consistency across datacenters
      • strong consistency implemented with Paxos across tablet replicas
      • full support for distributed transactions across directories/machines
    • much more automated operation
      • automatically changes replication based on constraints and usage patterns
      • automated allocation of resources across entire fleet of machines

Spanner Design Goals

System Building Experiences and Patterns

  • Experiences from building a variety of systems
    • A collection of patterns that have emerged
    • Not all encompassing, obviously, but good rules of thumb

Many Internal Services

Break large complex systems down into many services!

  • Simpler from a software engineering standpoint

    • few dependencies, clearly specified
    • easy to test and deploy new versions of individual services
    • ability to run lots of experiments
    • easy to reimplement service without affecting clients
  • Development cycles largely decoupled

    • lots of benefits: small teams can work independently
    • easier to have many engineering offices around the world
  • e.g. google.com search touches 200+ services

    • ads, web search, books, news, …

Designing Efficient Systems

Given a basic problem definition, how do you choose the “best” solution?

  • Best might be simplest, highest performance, easiest to extend, etc.

Important skill: ability to estimate performance of a system design

  • without actually having to build it!

Numbers Everyone Should Know
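
The slide behind this heading is a table of rough latency figures; the constants below are the commonly quoted, approximate 2010-era values from this talk (orders of magnitude are what matter, and they shift over time), plus a small estimate in the back-of-the-envelope spirit of the previous section.

```python
# Approximate, commonly quoted 2010-era figures (nanoseconds); treat as orders of magnitude.
L1_CACHE_REF         = 0.5
BRANCH_MISPREDICT    = 5
L2_CACHE_REF         = 7
MUTEX_LOCK_UNLOCK    = 25
MAIN_MEMORY_REF      = 100
COMPRESS_1K_ZIPPY    = 3_000
SEND_2K_1GBPS        = 20_000
READ_1MB_FROM_MEMORY = 250_000
ROUND_TRIP_SAME_DC   = 500_000
DISK_SEEK            = 10_000_000
READ_1MB_FROM_DISK   = 20_000_000
PACKET_CA_NL_CA      = 150_000_000

def ms(ns):
    return ns / 1e6

# Back-of-the-envelope: read 30 scattered 1 MB chunks from one disk, serially...
print(ms(30 * (DISK_SEEK + READ_1MB_FROM_DISK)))  # ~900 ms, dominated by seeks + disk transfer
# ...versus reading the same data already resident in memory:
print(ms(30 * READ_1MB_FROM_MEMORY))              # ~7.5 ms
```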

Know Your Basic Building Blocks

  • Core language libraries, basic data structures, protocol buffers, GFS, BigTable, indexing systems, MapReduce…
  • Not just their interfaces, but understand their implementations (at least at a high level)
  • If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!

Designing & Building Infrastructure

Identify common problems, and build software systems to address them in a general way

  • Important to not try to be all things to all people
    • Clients might be demanding 8 different things
    • Doing 6 of them is easy
    • Handling 7 of them requires real thought
    • Dealing with all 8 usually results in a worse system
      • more complex, compromises other clients in trying to satisfy everyone

Don’t build infrastructure just for its own sake

  • Identify common needs and address them
  • Don’t imagine unlikely potential needs that aren’t really there

Best approach: use your own infrastructure (especially at first)

  • (much more rapid feedback about what works, what doesn’t)

Design for Growth

Try to anticipate how requirements will evolve

  • keep likely features in mind as you design base system

Don’t design to scale infinitely:

  • 5x - 50x growth good to consider
  • 100x probably requires rethink and rewrite

Pattern: Single Master, 1000s of Workers

  • Master orchestrates global operation of system
    • load balancing, assignment of work, reassignment when machines fail, etc
    • … but client interaction with master is fairly minimal

Master Worker Pattern

  • Examples:

    • GFS, BigTable, MapReduce, cluster scheduling system, …
  • Often: hot standby of master waiting to take over

  • Always: bulk of data transfer directly between clients and workers

  • Pro:

    • simpler to reason about state of system with centralized master
  • Caveats:

    • careful design required to keep master out of common case ops
    • scale to 1,000s of workers, but not 100,000s of workers

Pattern: Tree Distribution of Requests

  • Problem: Single machine sending 1000s of RPCs overloads NIC on machine when handling replies

    • wide fan in causes TCP drops/retransmits, significant latency
    • CPU becomes bottleneck on single machine
  • Solution: Use tree distribution of requests/responses

    • fan in at root is smaller
    • cost of processing leaf responses spread across many parents
  • Most effective when parent processing can trim/combine leaf data (see the sketch below)
    • can also co-locate parents on same rack as leaves
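
A sketch of the tree: the root talks to a handful of parents, each parent fans out to its own leaves and trims/combines their responses before replying upward. The Leaf class and combine function are illustrative stand-ins for real servers and real scoring.

```python
from heapq import nlargest

class Leaf:
    """Illustrative leaf server: returns its local (docid, score) hits for a request."""
    def __init__(self, hits):
        self.hits = hits   # {request: [(docid, score), ...]}

    def handle(self, request):
        return self.hits.get(request, [])

def combine(result_lists, k=10):
    """Trim/combine step: merge per-child (docid, score) lists, keep only the k best."""
    merged = [hit for results in result_lists for hit in results]
    return nlargest(k, merged, key=lambda hit: hit[1])

def query_parent(leaves, request, k=10):
    """Intermediate node: fans out to its own leaves, combines before replying upward."""
    responses = [leaf.handle(request) for leaf in leaves]   # parallel RPCs in practice
    return combine(responses, k)                            # only trimmed data goes up the tree

def query_root(parents, request, k=10):
    """Root fans out to a handful of parents instead of thousands of leaves directly."""
    partials = [query_parent(leaves, request, k) for leaves in parents]
    return combine(partials, k)                             # small fan-in at the root
```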

Pattern: Backup Requests to Minimize Latency

  • Problem: variance high when requests go to 1000s of machines

    • last few machines to respond stretch out latency tail substantially
  • Often, multiple replicas can handle same kind of request

  • When few tasks remain, send backup requests to other replicas (sketched after this list)
  • Whichever duplicate request finishes first wins

    • useful when variance is unrelated to the specifics of the request
    • increases overall load by a tiny percentage
    • decreases latency tail significantly
  • Examples

    • MapReduce backup tasks
    • various query serving systems
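
A minimal sketch of a backup (hedged) request using a thread pool: send to one replica, and if it hasn’t answered within a short hedging delay, duplicate the request to a second replica and take whichever finishes first. The delay value and the send_rpc signature are assumptions.

```python
import concurrent.futures

def backup_request(replicas, request, send_rpc, hedge_after_s=0.01):
    """Send to one replica; if it is slow, duplicate to a second and take the first reply."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(send_rpc, replicas[0], request)]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after_s)
        if not done and len(replicas) > 1:
            # Primary replica is slow: fire a backup request at another replica.
            futures.append(pool.submit(send_rpc, replicas[1], request))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        # Whichever duplicate finished first wins; the loser is simply ignored here
        # (a production system would cancel it rather than wait for it at shutdown).
        return next(iter(done)).result()
```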

Pattern: Multiple Smaller Units per Machine

  • Problems:
    • want to minimize recovery time when machine crashes
    • want to do fine-grained load balancing
  • Having each machine manage 1 unit of work is inflexible

    • slow recovery: new replica must recover data that is O(machine state) in size
    • load balancing much harder
  • Solution: Have each machine manage many smaller units of work/data

    • typical: ~10-100 units/machine
    • allows fine grained load balancing (shed or add one unit)
    • fast recovery from failure (N machines each pick up 1 unit)
  • Examples

    • map and reduce tasks, GFS chunks, Bigtable tablets, query serving system index shards (a small rebalancing sketch follows)
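
A small sketch of why many units per machine speeds up recovery: when a machine dies, its units are scattered across all survivors rather than reloaded onto a single replacement. Names and the balancing policy are illustrative.

```python
def recover(assignment, failed_machine):
    """assignment: {machine: [unit, ...]}. Scatter the failed machine's units across
    the survivors so each one only has to load a small fraction of the lost state."""
    orphans = assignment.pop(failed_machine, [])
    survivors = sorted(assignment, key=lambda m: len(assignment[m]))  # least loaded first
    for i, unit in enumerate(orphans):
        assignment[survivors[i % len(survivors)]].append(unit)
    return assignment

# e.g. 4 machines x 3 units each; when "m0" fails, its 3 units land on 3 different machines.
machines = {f"m{i}": [f"shard-{i}-{j}" for j in range(3)] for i in range(4)}
print(recover(machines, "m0"))
```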

Pattern: Elastic Systems

  • Problems:

    • overcapacity: wasted resources
    • undercapacity: meltdown
  • Design system to adapt:

    • automatically shrink capacity during idle period
    • automatically grow capacity as load grows
  • Make system resilient to overload

    • do something reasonable even up to 2x planned capacity
      • e.g. shrink size of index searched, back off to less CPU intensive algorithms, drop spelling correction tips, etc…
      • more aggressive load balancing when imbalance is more severe

Pattern: Combine Multiple Implementations

  • Example: Google web search system wants all of these:

    • freshness
    • massive capacity
    • high quality retrieval
    • massive size
  • very difficult to accomplish in single implementation

  • partition the problem into several subproblems with different engineering tradeoffs, e.g.

    • realtime system: few docs, ok to pay lots of $$$/doc
    • base system: high # of docs, optimized for low $/doc
    • realtime+base system: high # of docs, fresh, low $/doc

Final Thoughts

Today: exciting collection of trends

  • large-scale data centers
  • increasing scale and diversity of available data sets
  • proliferation of more powerful client devices

Many interesting opportunities

  • planetary scale distributed systems
  • development of new CPU and data intensive services
  • new tools and techniques for constructing such systems