Building Software Systems At Google and Lessons Learned Notes

(These notes are from watching “Building Software Systems at Google and Lessons Learned” by Jeff Dean.)

Software Systems at Google


  • Evolution of various systems at Google
    • computing hardware
    • core search systems
    • infrastructure software
  • Techniques for building large-scale systems
    • decomposition into services
    • design patterns for performance & reliability

Google Web Search 1999 vs. 2010

  • Number of docs: 1000x
  • queries processed/day: 1000x
  • per doc info in index: 3x
  • update latency: ~50000x
  • avg. query latency: <1s to <0.2s (~5x)

  • More machines * faster machines: ~1000x

Continuous evolution

  • 7 significant revisions in last 11 years
  • often rolled out without users realizing we’ve made major changes

Research Project, circa 1997


Basic Principles

  • Index Servers:
    • Given query, return sorted list of (docid, score) pairs
    • partitioned (“sharded”) by docid
    • index shards are replicated for capacity
    • cost is O(# queries * # docs in index)
  • Doc Servers
    • Given {docid, query} generate {title, snippet}
      • snippet is query-dependent
    • map from docid to full text of docs
    • also partitioned by docid
    • cost is O(# queries)
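The index/doc server split above can be sketched as follows. The shard count, scoring function, and data are invented for illustration and are not Google's actual design:

```python
NUM_SHARDS = 4

# Each shard maps term -> list of (docid, score) pairs for the docids it owns.
shards = [{} for _ in range(NUM_SHARDS)]

def add_doc(docid, terms):
    shard = shards[docid % NUM_SHARDS]            # partition ("shard") by docid
    for term in set(terms):
        # Toy score: term frequency within the document.
        shard.setdefault(term, []).append((docid, terms.count(term)))

def search(query_term):
    # Fan the query out to every shard, then merge by descending score.
    hits = []
    for shard in shards:
        hits.extend(shard.get(query_term, []))
    return sorted(hits, key=lambda pair: -pair[1])

add_doc(1, ["cheap", "flights"])
add_doc(2, ["cheap", "cheap", "hotels"])
print(search("cheap"))                            # doc 2 ranks first (higher tf)
```

Because every query touches every index shard, cost scales with both query volume and index size, which is why replicas of each shard are added for capacity.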

Built servers ourselves

“Corkboards” (1999)

Serving System, circa 1999


Caching in Web Search Systems

  • Cache servers:
    • cache both index results and doc snippets
    • hit rates typically 30-60%
      • depends on frequency of index updates, mix of query traffic, level of personalization, etc.
  • Main benefits:
    • performance! a few machines do work of 100s or 1000s
    • much lower query latency on hits
      • queries that hit in cache tend to be both popular and expensive
  • Beware: big latency spike/capacity drop when index updated or cache flushed
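A minimal sketch of such a result cache, assuming a simple LRU policy and query normalization (both invented for illustration):

```python
from collections import OrderedDict

class ResultCache:
    """LRU cache over normalized query strings."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get(self, query):
        self.lookups += 1
        key = query.strip().lower()                # normalize to raise hit rate
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)          # mark as recently used
            return self.entries[key]
        return None

    def put(self, query, results):
        self.entries[query.strip().lower()] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used

cache = ResultCache(capacity=2)
cache.put("weather", ["doc7"])
print(cache.get("Weather "))                       # ['doc7'] -- a hit
print(cache.get("news"))                           # None -- a miss
```

The caveat above follows directly: an index update invalidates cached results, so the hit rate collapses until the cache refills, and the backends must absorb that spike.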

Indexing (circa 1998-1999)

  • Simple batch indexing systems
    • no real checkpointing, so machine failures painful
    • no checksumming of raw data, so hardware bit errors caused problems
      • exacerbated by early machines having no ECC, no parity
      • sort 1 TB data without parity: ends up “mostly sorted”
      • sort it again: “mostly sorted” another way
  • “Programming with adversarial memory”
    • Developed file abstraction that stores checksums of small records and can skip and resynchronize after corrupted records
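A toy version of such a file abstraction might look like this. The sync marker, header layout, and choice of CRC32 are assumptions for illustration, not the actual format:

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"                         # resynchronization marker

def write_record(buf, payload):
    # Record = sync marker, payload length, CRC32 of payload, payload.
    header = SYNC + struct.pack("<II", len(payload), zlib.crc32(payload))
    return buf + header + payload

def read_records(buf):
    """Yield intact payloads, skipping past corrupted records."""
    pos = 0
    while True:
        pos = buf.find(SYNC, pos)
        if pos < 0:
            return                                 # no more records
        header = buf[pos + 4:pos + 12]
        if len(header) < 8:
            return                                 # truncated tail
        length, crc = struct.unpack("<II", header)
        payload = buf[pos + 12:pos + 12 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            yield payload
            pos += 12 + length
        else:
            pos += 4                               # corrupt: scan for next marker

data = write_record(b"", b"good record 1")
data = write_record(data, b"good record 2")
corrupted = data[:20] + b"X" + data[21:]           # simulate a flipped bit/byte
print(list(read_records(corrupted)))               # only record 2 survives
```

The key property is that a corrupted record costs only that record: the reader scans forward to the next sync marker rather than aborting or emitting garbage.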

Google Data center (2000 - 2001)


Increasing Index Size and Query Capacity

  • Huge increases in index size in 99, 00, 01
    • From ~50m pages to more than 1000m pages
  • At same time as huge traffic increases
    • ~20% growth per month in 1999, 2000
    • plus major new partners
  • Performance of index servers was paramount
    • Deploying more machines continuously, but…
    • Need ~10-30% software-based improvement every month

Dealing with Growth

In-Memory Index

  • Many positives:
    • big increase in throughput
    • big decrease in query latency
      • especially at the tail: expensive queries that previously needed GBs of disk I/O became much faster and cheaper
        • e.g. [“circle of life”]
  • Some issues
    • Variance: query touches 1000s of machines, not dozens
      • e.g. randomized cron jobs caused us trouble for a while
    • Availability: 1 or few replicas of each doc’s index data
      • availability of index data when machine failed: replicate important docs
      • queries of death that kill all the backends at once: very bad

Canary Requests

  • Problem: requests sometimes cause server process to crash
    • testing can help, but can’t eliminate
  • If sending same or similar request to 1000s of machines:
    • they all might crash
    • recovery time for 1000s of processes pretty slow
  • solution: send canary request first to one machine
    • if RPC finishes successfully, go ahead and send to all the rest
    • if RPC fails unexpectedly, try another machine
    • if fails K times, reject request
  • Crash only a few servers, not 1000s
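The canary logic above can be sketched as follows; the function names and in-process “backends” are invented for illustration:

```python
def broadcast_with_canary(backends, request, max_canary_tries=3):
    """Send request to one backend first; fan out only if it survives."""
    canary_idx = None
    for i, backend in enumerate(backends[:max_canary_tries]):
        try:
            canary_response = backend(request)     # the canary RPC
            canary_idx = i
            break
        except Exception:
            continue                               # backend crashed: try another
    if canary_idx is None:
        # Failed K times: likely a "query of death" -- reject it rather
        # than crashing 1000s of backends at once.
        raise RuntimeError("request rejected after %d canary failures"
                           % max_canary_tries)
    responses = [canary_response]
    for i, backend in enumerate(backends):
        if i != canary_idx:
            responses.append(backend(request))     # now safe to send to the rest
    return responses

def make_backend(name):
    def backend(request):
        if request == "poison":                    # simulated query of death
            raise RuntimeError(name + " crashed")
        return name + ":ok"
    return backend

backends = [make_backend("b%d" % i) for i in range(4)]
print(broadcast_with_canary(backends, "query"))    # all four backends respond
```

A poison request in this sketch crashes at most `max_canary_tries` backends before being rejected, instead of all of them.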

Query Serving System, 2004


  • Clean abstractions:
    • repository
    • document
    • attachments
    • scoring functions
  • Easy experimentation
    • Attach new doc and index data without full reindexing
  • Higher performance: designed from ground up to assume data is in memory

New Index Format

  • Old disk and in-memory index used two-level scheme:
    • Each hit was encoded as (docid, word position in doc) pair
    • docid deltas encoded with Rice encoding
    • very good compression (originally designed for disk-based indices), but slow/CPU-intensive to decode
  • New format: single flat position space
    • data structures on side keep track of doc boundaries
    • posting lists are just lists of delta-encoded positions
    • need to be compact (can’t afford 32 bit value per occurrence)
    • … but need to be very fast to decode

Byte-Aligned Variable-length Encodings


Group Varint Encoding

  • Idea: encode groups of 4 32-bit values in 5-17 bytes
    • pull out four 2-bit binary lengths into single byte prefix


  • Much faster than alternatives
    • 7 bit-per-byte varint: decode ~180M numbers/second
    • 30-bit varint w/2-bit length: decode ~240M numbers/second
    • Group varint: decode ~400M numbers/second
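A minimal sketch of Group Varint encoding and decoding as described above; the little-endian byte order and prefix bit layout are assumptions for illustration:

```python
def byte_len(v):
    # Smallest number of bytes (1-4) needed to hold a 32-bit value.
    return max(1, (v.bit_length() + 7) // 8)

def encode_group(values):
    # Encode exactly four 32-bit values: one prefix byte holding four
    # 2-bit (length - 1) fields, followed by 4-16 data bytes.
    assert len(values) == 4
    prefix = 0
    data = bytearray()
    for v in values:
        n = byte_len(v)
        prefix = (prefix << 2) | (n - 1)
        data += v.to_bytes(n, "little")
    return bytes([prefix]) + bytes(data)

def decode_group(buf, pos=0):
    prefix = buf[pos]
    pos += 1
    values = []
    for shift in (6, 4, 2, 0):                     # same order as encoding
        n = ((prefix >> shift) & 3) + 1
        values.append(int.from_bytes(buf[pos:pos + n], "little"))
        pos += n
    return values, pos

group = [1, 300, 70000, 4000000000]
encoded = encode_group(group)
print(len(encoded))                                # 11 bytes: 1 prefix + 1+2+3+4
assert decode_group(encoded)[0] == group
```

Decoding reads one prefix byte and then does four fixed-width copies with no per-bit branching, which is why it beats 7-bit-per-byte varints despite slightly worse compression.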

Universal Search

  • Search all corpora in parallel
  • Performance: most of the corpora weren’t designed to deal with the high QPS level of web search
  • Mixing: Which corpora are relevant to query
    • changes over time
  • UI: How to organize results from different corpora?
    • interleaved?
    • separate sections for different types of documents?

System Software Evolution


  • Machine + Racks
  • Clusters

Lots of failures when playing with REAL hardware

Reliability/Availability must come from software!

Low-Level System Software Desires

  • If you have lots of machines, you want to:
  • Store data persistently
    • w/ high availability
    • high r/w bandwidth
  • Run large-scale computation reliably
    • without having to deal with machine failures

GFS Design


  • Master manages metadata
  • Data transfers are directly between clients/chunkservers
  • File broken into chunks (typically 64MB)
  • Chunks replicated across multiple machines (usually 3)

Google Cluster Software Environment

  • Cluster is 5k-20k machines, typically one of a handful of hw configurations
  • File system (GFS or Colossus) + cluster scheduling system are core services
  • Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s)
    • mix of batch and low-latency, user-facing production jobs



Problem: lots of data

  • Example: 20+ billion web pages × 20KB = 400+ TB
  • One computer can read 50 MB/sec from disk
    • ~3 months to read the web
  • ~1000 hard drives just to store the web
  • Even more to do something with the data
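The arithmetic behind these bullets, spelled out as a sanity check:

```python
# Back-of-the-envelope: how long does it take to read the web?
web_pages = 20e9                                   # 20+ billion pages
bytes_per_page = 20e3                              # 20KB each
total_bytes = web_pages * bytes_per_page           # 4e14 bytes = 400 TB

disk_bytes_per_sec = 50e6                          # one disk: 50 MB/sec
seconds_one_machine = total_bytes / disk_bytes_per_sec

print("%.0f TB total" % (total_bytes / 1e12))              # 400 TB
print("%.1f months on one machine"
      % (seconds_one_machine / (30 * 86400)))              # ~3.1 months
print("%.1f hours on 1000 machines"
      % (seconds_one_machine / 1000 / 3600))               # ~2.2 hours
```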


  • Good news: same problem with 1000 machines, < 3 hours
  • Bad news: programming work
    • communication and coordination
    • recovering from machine failures
    • status reporting
    • debugging
    • optimization
    • locality
  • Bad news 2: repeat for every problem you want to solve

MapReduce History

  • 2003: Working on rewriting indexing system
    • start with raw page contents on disk
    • many phases
      • duplicate elimination
      • anchor text extraction
      • language identification
      • index shard generation
    • end with data structures for index and doc serving
  • Each phase was hand written parallel computation:
    • hand parallelized
    • hand-written checkpointing code for fault-tolerance


  • A simple programming model that applies to many large-scale computing problems
    • allowed us to express all phases of our indexing system
    • since used across broad range of CS areas, plus other scientific fields
    • Hadoop open-source implementation seeing significant usage
  • Hide messy details in MapReduce runtime library:
    • automatic parallelization
    • load balancing
    • network and disk transfer optimizations
    • handling of machine failures
    • robustness
    • improvements to core library benefit all users of library!

Typical problem

  • Read a lot of data
  • Map: extract something you care about from each record
  • Shuffle and Sort
  • Reduce: aggregate, summarize, filter, or transform
  • Write the results

Outline stays the same, user writes map and reduce functions to fit the problem
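That outline can be sketched as a toy, single-process MapReduce; here the user-supplied map and reduce functions implement word count:

```python
from itertools import groupby

def word_count_map(record):
    # Map: extract something you care about from each record.
    for word in record.split():
        yield (word, 1)

def word_count_reduce(key, values):
    # Reduce: aggregate.
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    pairs = [kv for record in records for kv in map_fn(record)]   # map phase
    pairs.sort(key=lambda kv: kv[0])                              # shuffle & sort
    return [reduce_fn(key, [v for _, v in group])                 # reduce phase
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

docs = ["the quick fox", "the lazy dog", "the fox"]
print(map_reduce(docs, word_count_map, word_count_reduce))
```

Only `word_count_map` and `word_count_reduce` change from problem to problem; the real runtime replaces the list comprehension and sort with distributed workers, partitioned intermediate files, and fault tolerance.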

Example: Rendering Map Tiles:


MapReduce: Scheduling

  • One master, many workers
    • Input data split into M map tasks
    • Reduce phase partitioned into R reduce tasks
    • Tasks are assigned to workers dynamically
  • Master assigns each map task to a free worker
    • Consider locality of data to worker when assigning task
    • Worker reads task input (often from local disk)
    • Worker produces R local files containing intermediate k/v pairs
  • Master assigns each reduce task to a free worker
    • Worker reads intermediate k/v pairs from map workers
    • Worker sorts & applies user’s Reduce op to produce the output

Parallel MapReduce

Task Granularity and Pipelining

  • Fine granularity tasks: many more map tasks than machines
    • Minimizes time for fault recovery
    • Can pipeline shuffling with map execution
    • Better dynamic load balancing


Fault Tolerance: Handled via re-execution

On worker failure:

  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in progress reduce tasks
  • Task completion committed through master

On master failure:

  • State is checkpointed to GFS: new master recovers & continues

Very robust: lost 1600 of 1800 machines once, but finished fine

Refinement: Backup Tasks

  • Slow workers significantly lengthen completion time
    • Other jobs consuming resources on machine
    • Bad disks with soft errors transfer data very slowly
    • Weird things: processor caches disabled
  • Solution: Near end of phase, spawn backup copies of tasks
    • Whichever one finishes first “wins”
  • Effect: Dramatically shortens job completion time

Refinement: Locality Optimization

Master scheduling policy:

  • Asks GFS for locations of replicas of input file blocks
  • Map task inputs are typically 64MB (== GFS block size)
  • Map tasks scheduled so GFS input block replicas are on same machine or same rack

Effect: Thousands of machines read input at local disk speed

  • Without this, rack switches limit read rate


Spanner

  • Storage & computation system that runs across many datacenters
    • single global namespace
      • names are independent of location(s) of data
      • fine-grained replication configurations
    • support mix of strong and weak consistency across datacenters
      • strong consistency implemented with Paxos across tablet replicas
      • full support for distributed transactions across directories/machines
    • much more automated operation
      • automatically changes replication based on constraints and usage patterns
      • automated allocation of resources across entire fleet of machines

Design Goals for Spanner

System Building Experiences and Patterns

  • Experiences from building a variety of systems
    • A collection of patterns that have emerged
    • Not all encompassing, obviously, but good rules of thumb

Many Internal Services

Break large complex systems down into many services!

  • Simpler from a software engineering standpoint

    • few dependencies, clearly specified
    • easy to test and deploy new versions of individual services
    • ability to run lots of experiments
    • easy to reimplement service without affecting clients
  • Development cycles largely decoupled

    • lots of benefits: small teams can work independently
    • easier to have many engineering offices around the world
  • e.g. search touches 200+ services

    • ads, web search, books, news, …

Designing Efficient Systems

Given a basic problem definition, how do you choose the “best” solution?

  • Best might be simplest, highest performance, easiest to extend, etc.

Important skill: ability to estimate performance of a system design

  • without actually having to build it!

Numbers Everyone Should Know

Know Your Basic Building Blocks

  • Core language libraries, basic data structures, protocol buffers, GFS, BigTable, indexing systems, MapReduce…
  • Not just their interfaces, but understand their implementations (at least at a high level)
  • If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!

Designing & Building Infrastructure

Identify common problems, and build software systems to address them in a general way

  • Important to not try to be all things to all people
    • Clients might be demanding 8 different things
    • Doing 6 of them is easy
    • Handling 7 of them requires real thought
    • Dealing with all 8 usually results in a worse system
      • more complex, compromises other clients in trying to satisfy everyone

Don’t build infrastructure just for its own sake

  • Identify common needs and address them
  • Don’t imagine unlikely potential needs that aren’t really there

Best approach: use your own infrastructure (especially at first)

  • (much more rapid feedback about what works, what doesn’t)

Design for Growth

Try to anticipate how requirements will evolve

  • keep likely features in mind as you design base system

Don’t design to scale infinitely:

  • 5x - 50x growth good to consider
  • 100x probably requires rethink and rewrite

Pattern: Single Master, 1000s of Workers

  • Master orchestrates global operation of system
    • load balancing, assignment of work, reassignment when machines fail, etc
    • … but client interaction with master is fairly minimal

Master Worker Pattern

  • Examples:

    • GFS, BigTable, MapReduce, cluster scheduling system, …
  • Often: hot standby of master waiting to take over

  • Always: bulk of data transfer directly between clients and workers

  • Pro:

    • simpler to reason about state of system with centralized master
  • Caveats:

    • careful design required to keep master out of common case ops
    • scale to 1,000s of workers, but not 100,000s of workers

Pattern: Tree Distribution of Requests

  • Problem: Single machine sending 1000s of RPCs overloads NIC on machine when handling replies

    • wide fan in causes TCP drops/retransmits, significant latency
    • CPU becomes bottleneck on single machine
  • Solution: Use tree distribution of requests/responses

    • fan in at root is smaller
    • cost of processing leaf responses spread across many parents
  • Most effective when parent processing can trim/combine leaf data
    • can also co-locate parents on same rack as leaves
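A sketch of the tree-distribution idea, using a top-k merge as the parent’s trim/combine step; the two-level tree, scoring, and data are invented for illustration:

```python
import heapq

def leaf_search(leaf_docs, query, k):
    # Each leaf returns only its local top-k (doc, score) hits.
    hits = [(doc, score) for doc, score in leaf_docs if query in doc]
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])

def parent_search(leaves, query, k):
    # Parent absorbs the fan-in from its leaves and trims to k,
    # so the root never sees more than k results per parent.
    merged = [hit for leaf in leaves for hit in leaf_search(leaf, query, k)]
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])

def root_search(parents, query, k):
    merged = [hit for leaves in parents for hit in parent_search(leaves, query, k)]
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])

# 2 parents x 2 leaves, each leaf holding a few (doc, score) pairs.
parents = [
    [[("cat facts", 9), ("dog facts", 5)], [("cat pics", 7)]],
    [[("cat video", 8)], [("weather", 3)]],
]
print(root_search(parents, "cat", k=2))    # [('cat facts', 9), ('cat video', 8)]
```

With a fan-out of F per level, the root handles F replies instead of F², and the combine work is spread across the parents.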

Pattern: Backup Requests to Minimize Latency

  • Problem: variance high when requests go to 1000s of machines

    • last few machines to respond stretch out latency tail substantially
  • Often, multiple replicas can handle same kind of request

  • When few tasks remaining, send backup requests to other replicas
  • Whichever duplicate request finishes first wins

    • useful when variance is unrelated to specifics of request
    • increases overall load by a tiny percentage
    • decreases latency tail significantly
  • Examples

    • MapReduce backup tasks
    • various query serving systems
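A minimal sketch of backup requests using Python threads; the replica latencies are simulated, and a real system would also cancel the losing RPC:

```python
import concurrent.futures
import time

def make_replica(name, latency):
    def handle(request):
        time.sleep(latency)                        # simulated service time
        return (name, request + ":done")
    return handle

def fetch_with_backup(primary, backup, request):
    # Send the same request to two replicas; whichever answers first wins.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(primary, request), pool.submit(backup, request)]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

straggler = make_replica("straggler", latency=0.2)
healthy = make_replica("healthy", latency=0.01)
print(fetch_with_backup(straggler, healthy, "q"))  # the healthy replica wins
```

Issuing the duplicate only for the last few outstanding requests keeps the extra load to a tiny percentage while cutting the latency tail.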

Pattern: Multiple Smaller Units per Machine

  • Problems:
    • want to minimize recovery time when machine crashes
    • want to do fine-grained load balancing
  • Having each machine manage 1 unit of work is inflexible

    • slow recovery: new replica must recover data that is O(machine state) in size
    • load balancing much harder
  • Solution: Have each machine manage many smaller units of work/data

    • typical: ~10-100 units/machine
    • allows fine grained load balancing (shed or add one unit)
    • fast recovery from failure (N machines each pick up 1 unit)
  • Examples

    • map and reduce tasks, GFS chunks, Bigtable tablets, query serving system index shards
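A sketch of why many small units speed up recovery: when one machine fails, its units scatter across all survivors, so each picks up only a sliver of extra work (unit counts and the assignment policy are invented for illustration):

```python
def assign_units(num_units, machines):
    # Round-robin ~num_units/len(machines) small units onto each machine.
    assignment = {m: [] for m in machines}
    for unit in range(num_units):
        assignment[machines[unit % len(machines)]].append(unit)
    return assignment

def recover(assignment, failed):
    # Scatter the failed machine's units over the least-loaded survivors,
    # so N machines each pick up ~1 unit instead of one replacement
    # machine re-reading everything.
    orphans = assignment.pop(failed)
    survivors = sorted(assignment, key=lambda m: len(assignment[m]))
    for i, unit in enumerate(orphans):
        assignment[survivors[i % len(survivors)]].append(unit)
    return assignment

machines = ["m%d" % i for i in range(10)]
assignment = assign_units(100, machines)           # ~10 units per machine
recover(assignment, failed="m3")
print(max(len(units) for units in assignment.values()))  # survivors gain 1-2 units
```

The same mechanism gives fine-grained load balancing: shedding or adding one unit moves ~1% of a machine’s load rather than all of it.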

Pattern: Elastic Systems

  • Problems:

    • overcapacity: wasted resources
    • undercapacity: meltdown
  • Design system to adapt:

    • automatically shrink capacity during idle period
    • automatically grow capacity as load grows
  • Make system resilient to overload

    • do something reasonable even up to 2x planned capacity
      • e.g. shrink size of index searched, back off to less CPU-intensive algorithms, drop spelling correction tips, etc.
      • more aggressive load balancing when imbalance more severe
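A sketch of graceful degradation under overload, in the spirit of the bullets above; the thresholds and knobs are invented for illustration:

```python
def plan_response(load_fraction):
    """Pick serving parameters given load_fraction = current QPS / planned capacity."""
    plan = {"index_fraction": 1.0, "spelling_tips": True}
    if load_fraction > 1.0:
        plan["spelling_tips"] = False              # shed cheap extras first
    if load_fraction > 1.5:
        plan["index_fraction"] = 0.5               # search half the index
    if load_fraction > 2.0:
        plan["index_fraction"] = 0.25              # still answer something
    return plan

print(plan_response(0.8))   # normal serving
print(plan_response(1.7))   # degraded but responsive, even past planned capacity
```

The point is that every request still gets a reasonable answer up to ~2x planned capacity, instead of the system melting down at 1.01x.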

Pattern: Combine Multiple Implementations

  • Example: Google web search system wants all of these:

    • freshness
    • massive capacity
    • high quality retrieval
    • massive size
  • very difficult to accomplish in single implementation

  • partition problem into several subproblems with different eng tradeoffs, e.g.

    • realtime system: few docs, ok to pay lots of $$$/doc
    • base system: high # of docs, optimized for low $/doc
    • realtime+base system: high # of docs, fresh, low $/doc

Final Thoughts

Today: exciting collection of trends

  • large-scale data centers
  • increasing scale and diversity of available data sets
  • proliferation of more powerful client devices

Many interesting opportunities

  • planetary scale distributed systems
  • development of new CPU and data intensive services
  • new tools and techniques for constructing such systems