
Large-Scale Service Architecture Design. 刘文博 (Liu Wenbo), 百纳信息科技有限公司 (Baina Information Technology), wbliu@bainainfo.com, January 30, 2013


Page 1

Large-Scale Service Architecture Design

刘文博 (Liu Wenbo), 百纳信息科技有限公司 (Baina Information Technology), wbliu@bainainfo.com

January 30, 2013

Page 2

Agenda

Principles and Patterns of Large-Scale Service Architecture

Best Practices in Large-Scale Service Development

Page 3

Principles and Patterns of Large-Scale Service Architecture

Page 4

Challenges at Internet Scale

• Online services manage …
– Millions of active users worldwide
– Millions of URL requests per day

• … worldwide
– In 39 countries
– 24x7x365

• … millions or billions of read/write operations per day

Page 5

Architectural Forces: What do we think about?

• Scalability
– Resource usage should increase linearly (or better!) with load
– Design for 10x growth in data, traffic, users, etc.

• Availability
– Resilience to failure
– Graceful degradation
– Recoverability from failure

• Latency
– User experience latency
– Data latency

• Manageability
– Simplicity
– Maintainability
– Diagnostics

• Cost
– Development effort and complexity
– Operational cost (TCO)

Page 6

Principles for Internet Scale

• 1. Partition Everything

• 2. Asynchrony Everywhere

• 3. Automate Everything

• 4. Remember Everything Fails

• 5. Embrace Inconsistency

Page 7

Principle 1: Partition Everything

• Split every problem into manageable chunks
– Split by data, load, and/or usage pattern
– “If you can’t split it, you can’t scale it”

• Motivations
– Scalability: scale horizontally and independently
– Availability: isolate failures
– Manageability: decouple different segments and functional areas
– Cost: choose partition size that maximizes price-performance

• Partitioning Patterns
– Functional Segmentation
– Horizontal Split

Page 8

Partition Everything: Functional Segmentation

• Segment processing into pools, services, and stages; the following is how eBay does it

• Segment data by entity and usage pattern

[Figure: eBay’s functional segmentation. Data entities: Item, Transaction, Product, User, Account, Feedback. Processing pools: Search, View Item, Selling, Bidding, MyEbay, Checkout, Feedback]

Page 9

Partition Everything: Horizontal Split

• Load-balance processing
– All servers in a pool are created equal

• Split (or “shard”) data along the primary access path
– Many split strategies: modulo of a key, range, lookup table, etc. (a minimal routing sketch follows the figure below)
– Aggregation / routing in the Data Access Layer (DAL)
– Abstract developers away from the split logic and the logical-physical mapping

[Figure: the same entities and pools as on the previous slide, horizontally split across multiple hosts]
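
To make the modulo-of-a-key strategy concrete, here is a minimal sketch of shard routing inside a DAL, assuming a hypothetical ShardRouter class and a fixed list of shards (the names are illustrative, not eBay’s actual code):

    // Hypothetical DAL helper: routes a key to one of N physical shards by
    // modulo. Application code never sees the logical-physical mapping.
    import java.util.List;

    public class ShardRouter {
        private final List<String> shardJdbcUrls; // one JDBC URL per shard

        public ShardRouter(List<String> shardJdbcUrls) {
            this.shardJdbcUrls = shardJdbcUrls;
        }

        public String shardFor(long userId) {
            // floorMod keeps the index non-negative even for negative keys
            int index = (int) Math.floorMod(userId, (long) shardJdbcUrls.size());
            return shardJdbcUrls.get(index);
        }
    }

A range or lookup-table strategy would change only shardFor; callers never see the mapping, which is the point of hiding the split logic in the DAL.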

Page 10

Corollary: No Session State

• User session flows through multiple pools

• Absolutely no session state or business-object caching in the application tier
– No HttpSession, no EJBs

• Transient session state maintained in (or referenced by) the cookie or URL

• Extended session state in MemCache or a database (a minimal sketch follows the figure below)

[Figure: a user session flowing across the Search, View Item, Selling, Bidding, MyEbay, Checkout, and Feedback pools]
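
A minimal sketch of the idea, assuming a hypothetical SessionStore where a ConcurrentHashMap stands in for MemCache:

    // Stateless application tier: the cookie carries only a session token;
    // extended state lives in a shared cache any server in the pool can reach.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SessionStore {
        // Stand-in for a MemCache client; illustrative only.
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        public void put(String sessionToken, String state) {
            cache.put("session:" + sessionToken, state);
        }

        // Any server can answer: no HttpSession, no server affinity required.
        public String get(String sessionToken) {
            return cache.get("session:" + sessionToken);
        }
    }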

Page 11

Principle 2: Async Everywhere

• Prefer asynchronous processing
– Move most processing to asynchronous flows
– Integrate components asynchronously

• Motivations
– Scalability: scale components independently
– Availability: decouple availability state, retry operations
– Latency: trade execution latency for response latency
– Cost: spread peak load over time

• Asynchrony Patterns
– Event Queue
– Message Multicast
– Batch

Page 12

Async Everywhere
Pattern: Event Queue

– The primary use case writes data and queues an event
• The event is created transactionally with the insert/update of the primary table

– Consumers subscribe to the event
• At-least-once delivery, rather than exactly-once
• No guaranteed order, rather than in-order
– Guarantee order in the application protocol layer, not in the transport layer
• Idempotency: processing an event N times should give the same result as processing it once
• Readback: the consumer typically reads back from the primary database for the latest data (a consumer sketch follows below)
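
A minimal sketch of an idempotent consumer under at-least-once delivery; the dedup set and the commented DAO call are hypothetical stand-ins for durable stores:

    // Processing an event N times must equal processing it once, because the
    // queue may redeliver. A durable store should back processedIds in practice.
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class ItemEventConsumer {
        private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

        public void onEvent(String eventId, long itemId) {
            if (!processedIds.add(eventId)) {
                return; // duplicate delivery: already handled
            }
            // Readback: fetch the latest row from the primary database instead
            // of trusting a possibly stale or out-of-order event payload.
            // Item item = itemDao.load(itemId); // hypothetical DAO call
        }
    }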

Page 13

Async Everywhere: Search
Pattern: Message Multicast

– Feeder reads item updates from the primary database
– Feeder publishes updates via reliable multicast
• Persist messages in an intermediate data store for recovery
• Publish sequenced updates via multicast to the search grid
• Resend recovery messages when messages are missed

– Search nodes listen for updates
• Listen to an assigned subset of messages
• Update the in-memory index in real time

Page 14

Async Everywhere: Batch
Pattern: Periodic Batch

– Scheduled offline batch process

– Most appropriate for
• Infrequent, periodic, or scheduled processing (once per day, week, month)
• Non-incremental computation (a.k.a. “Full Table Scan”)

– Examples
• Import third-party data (catalogs, currency, etc.)
• Generate recommendations (items, products, searches, etc.)
• Process items at end of auction

– Often drives further downstream processing through the Event Queue

Page 15

Principle 3: Automate Everything

• Prefer adaptive / automated systems to manual systems
– Design systems which adapt to their environment

• Motivations
– Scalability: scale with machines, not humans
– Availability / Latency: adapt to a changing environment more rapidly
– Cost: machines are less expensive than humans

• Automation Patterns
– Adaptive Configuration

Page 16

Automate Everything: Configuration
Pattern: Adaptive Configuration

– Do not manually configure event consumers
• Polling size and frequency
• Number of threads

– Define a service-level agreement (SLA) for a given logical event consumer
• E.g., 99% of events processed within 15 seconds
• Each consumer dynamically adjusts to meet the defined SLA with minimal resources (a sketch follows below)

– Consumers automatically adapt to changes in
• Load (queue length)
• Event processing time
• Number of consumer instances
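
A minimal sketch of such an adaptation loop, with invented thresholds (the deck does not describe eBay’s actual algorithm):

    // Adjusts its own polling interval toward a 15-second processing SLA:
    // speed up when events wait too long, back off when comfortably ahead.
    public class AdaptivePoller {
        private static final long SLA_MILLIS = 15_000;
        private long pollIntervalMillis = 1_000;

        // Called after each poll with the age of the oldest unprocessed event.
        public void adapt(long oldestEventAgeMillis) {
            if (oldestEventAgeMillis > SLA_MILLIS / 2) {
                pollIntervalMillis = Math.max(100, pollIntervalMillis / 2);
            } else {
                pollIntervalMillis = Math.min(10_000, pollIntervalMillis + 100);
            }
        }

        public long pollIntervalMillis() {
            return pollIntervalMillis;
        }
    }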

Page 17

Principle 4: Remember Everything Fails

• Build all systems to be tolerant of failure
– Assume every operation will fail and every resource will be unavailable
– Detect failure as rapidly as possible
– Recover from failure as rapidly as possible
– Do as much as possible during failure

• Motivation
– Availability

• Failure Patterns
– Failure Detection
– Rollback
– Graceful Degradation

Page 18

Everything Fails: eBay’s Central Application Logging (CAL)

Pattern: Failure Detection
– Application servers log all requests on a multicast bus
• Log all application activity, database, and service calls on the bus
• Log the request, application-generated information, and exceptions

– Listeners automate failure detection and notification
• Real-time application state monitoring: exceptions and operational alerts
• Historical reports by application server pool, URL, database, etc.

Page 19

Everything Fails: Feature Wire-on / Wire-off

Pattern: Rollback
– Every feature has an on / off state driven by central configuration
– Can immediately turn a feature off for operational or business reasons
– Can deploy “wired-off” to unroll dependencies
• Decouples code deployment from feature deployment

– For an application, feature “availability” is just like resource availability (a minimal flag sketch follows below)
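
A minimal wire-on / wire-off sketch; the flag store and refresh hook are illustrative assumptions:

    // Every feature checks a centrally refreshed flag at runtime, so turning a
    // feature off requires no redeployment. Defaults to off ("wired-off").
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class FeatureFlags {
        private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

        public boolean isOn(String feature) {
            return flags.getOrDefault(feature, Boolean.FALSE);
        }

        // Invoked periodically with the latest central configuration.
        public void refresh(Map<String, Boolean> fromCentralConfig) {
            flags.putAll(fromCentralConfig);
        }
    }

Application code then guards each feature with flags.isOn("newCheckout") (a made-up flag name), so shipping code and enabling the feature become separate decisions.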

Page 20

Everything Fails: Resource Markdown

Pattern: Failure Detection
– The application detects when a database or other backend resource is unavailable or distressed
• “Resource slow” is often far more challenging than “resource down” (!)

– The application “marks down” the resource
• Stops making calls to it and sends an alert

Pattern: Graceful Degradation
– Remove or ignore non-critical functionality
– Retry or defer critical functionality
• Fail over to an alternate resource
• Defer processing the event

– Explicit “markup”
• Allows the resource to be restored and brought online in a controlled way (see the sketch below)
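
A minimal markdown / markup sketch; the failure threshold is invented for illustration:

    // After enough consecutive failures the resource is marked down: callers
    // stop dialing it and degrade gracefully. Markup is an explicit act.
    import java.util.concurrent.atomic.AtomicInteger;

    public class ResourceHealth {
        private static final int FAILURE_THRESHOLD = 5;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private volatile boolean markedDown = false;

        public boolean isAvailable() { return !markedDown; }

        public void recordSuccess() { consecutiveFailures.set(0); }

        public void recordFailure() {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                markedDown = true; // stop calling; send an alert here
            }
        }

        // Explicit markup brings the resource back in a controlled way.
        public void markUp() {
            consecutiveFailures.set(0);
            markedDown = false;
        }
    }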

Page 21

Principle 5: Embrace Inconsistency

• Brewer’s CAP Theorem
– Any shared-data system can have at most two of the following properties:

• Consistency: All clients see the same data, even in the presence of updates

• Availability: All clients will get a response, even in the presence of failures

• Partition-tolerance: The system properties hold even when the network is partitioned

• This trade-off is fundamental to all distributed systems

Page 22

Principle 5: Embrace Inconsistency
Choose Appropriate Consistency Guarantees

• Typically we trade off immediate consistency for availability and partition-tolerance
– Choose A and P; give up C

• Most real-world systems do not require immediate consistency (even financial systems!)
– “Settlement”: financial parties reconcile and resolve conflicts at the end of the trading day

• Consistency is a spectrum
– Immediate consistency: bids, purchases
– Eventual consistency: search engine, billing system, etc.
– No consistency: preferences

Page 23

Principle 5: Embrace Inconsistency
Avoid Distributed Transactions

• Never do distributed transactions
– No JDBC client transactions, etc.

• Minimize inconsistency through state machines and careful ordering of database operations

• Reach eventual consistency through asynchronous event or reconciliation batch

Page 24

Best Practices in Large-Scale Service Development

Page 25

Implementation

• Services typically run as RPC or web services
– REST service: uses the URL to identify resources and the HTTP method to access them (a minimal sketch follows below)
– SOAP service: good for interop
– RPC-style service (XML is not really an efficient encoding)

• Services can use other services to implement their functionality.

• Differences between a service and a library component
– Latency, a different model of memory access, concurrency and failure control; coarse-grained
– The isolation boundary (minimum interface) makes development, deployment, maintenance, and troubleshooting easier
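
A minimal REST sketch using the JDK’s built-in com.sun.net.httpserver: the URL identifies the resource and the HTTP method selects the operation. The /items path and JSON shape are illustrative:

    // GET /items/{id} returns the item; other methods are rejected.
    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class ItemService {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/items/", exchange -> {
                if (!"GET".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed
                    return;
                }
                String id = exchange.getRequestURI().getPath()
                        .substring("/items/".length());
                byte[] body = ("{\"id\":\"" + id + "\"}")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }
    }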

Page 26

Release Cycle

• Short release cycles
– Typically on the order of weeks

• Fork a release branch at a golden CL number (a continuous integration system is mandatory)

• Only fix show-stoppers on the release branch

• What are the benefits of short cycles?
– Quick feedback-response turnaround
– Never work on an “old” version
– Easy to know what has changed
– Great for experimental features
• Big features can be checked in partially

Page 27

Release Cycle

• Production quality at every check-in

• Dev unit tests
– Devs write more test code than product code
– Aim at a high coverage number like 70% (the coverage number is not as important as scenario coverage)
– Unit tests also document how the class is used
– What about dependencies? (program to interfaces)
• Use dependency injection and a mock framework to implement tests (a sketch follows below)

• Thorough code reviews
• Share common code as much as possible
• Very strict process for storage schema changes
• Have a testing environment almost identical to the production environment; a common platform makes this possible
• Automated tests for all P1 scenarios
• Mechanism to disable/enable features in production
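
A minimal sketch of programming to an interface so a test can inject a fake; a hand-rolled stub stands in here for what a mocking framework would generate:

    // The service depends on an interface, so the unit test can substitute a
    // deterministic stub and never touch the real backend.
    public class PriceServiceTest {
        interface ExchangeRates {
            double usdRate(String currency);
        }

        static class PriceService {
            private final ExchangeRates rates;
            PriceService(ExchangeRates rates) { this.rates = rates; }
            double toUsd(double amount, String currency) {
                return amount * rates.usdRate(currency);
            }
        }

        public static void main(String[] args) {
            ExchangeRates stub = currency -> 0.5; // injected fake dependency
            PriceService service = new PriceService(stub);
            if (service.toUsd(10, "CNY") != 5.0) {
                throw new AssertionError("conversion should use the injected rate");
            }
            System.out.println("ok");
        }
    }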

Page 28

Deployment

• Service continuity during upgrade?
– Upgrade every DC in turn
• Needs replicas in different DCs
• Stale data due to replication latency
• Less than optimal performance for users during the upgrade

– Rolling update of servers in the same DC, in turn
• Code compatibility: can old code work with new code?
• Data compatibility: can old code work with new data? Can new code work with old data?

• Disasters after going live?
– Roll back to the last release; very similar to an upgrade
– It is easy to roll back code, but not easy to roll back data.

Page 29

Deployment

• Service binaries are uploaded to a repository.
• Each service has a configuration file. Common tools are used to start/update/stop the service based on the configuration file.
• It’s a good idea to use tools such as ZooKeeper or Chubby to manage configurations.
• Job description language:
– Logical service name
– Binary name, version
– Binary dependencies
– Runtime service dependencies
– Resource requirements
– Number of instances
• Job scheduler: runs the job somewhere and communicates the server location to the naming service
• Clients access a service by its service name
• A service typically only accesses resources/services in the same DC: performance, manageability
• A service can run in multiple DCs using replicated data

Page 30

How To Scale

• Offline data processing: MapReduce
• Online processing
– Run multiple replicas
– Stateless servers are ideal: any server can handle a new request; easy to fail over
– Use sticky sessions to improve cache effectiveness if the user has state
– Use sharding to partition data when there is too much data
– Multi-level load balancing (DNS, hardware load balancer, software load balancer)
– Newer distributed storage systems are much more scalable than a traditional database or file system, and make it easy to grow a table’s resources dynamically: MongoDB, HBase, Redis, …
– Try to reduce synchronization between replicas
– Split functionality into multiple services if necessary
• Do not do everything in a single service; it is not easy to scale.

Page 31

Failures

• Failures are common rather than exceptional. Services/machines can die at any time. Design for failures.

• Monitoring
– Services/machines need to publish status information (performance counters, etc.)
– Use heartbeats/custom probers to detect failures. Probers are registered with global monitoring services.
– Alerts can be fired. Custom actions can be taken.

• Logging
– The application log should contain enough information for analysis. It is impossible to debug a production server most of the time.
– Create a unique id for each request and save it in log entries (a sketch follows below).
– A centralized logging service and tooling provide the complete log for a request across machines/services.
• Scribe: open-sourced by Facebook
• Chukwa: an open-source data collection system for monitoring large distributed systems
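
A minimal sketch of per-request ids; a real system would propagate the id in an RPC or HTTP header rather than as a parameter:

    // One id is minted at the entry point and stamped on every log line, so a
    // centralized logging service can reassemble the request across machines.
    import java.util.UUID;

    public class RequestLog {
        public static void main(String[] args) {
            String requestId = UUID.randomUUID().toString();
            log(requestId, "request received");
            log(requestId, "calling item service"); // pass the id downstream
            log(requestId, "response sent");
        }

        static void log(String requestId, String message) {
            System.out.println("[" + requestId + "] " + message);
        }
    }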

Page 32

Failures

• How to handle failures:
– Make it possible for another (or any) machine to resume processing of a request.
– The operation log might need to be saved to shared storage or replicated to another machine to continue.
– A failed machine should not corrupt data.
– Idempotent operations are good.
– Optimistic locking can be used to detect conflicts (a sketch follows below).
– Queues can be used on the server side or the client side to help retry.
– Sometimes it is ok to tell the user an operation failed.
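
A minimal optimistic-locking sketch; the bids table and its version column are invented for illustration:

    // The UPDATE only succeeds if the row still carries the version the writer
    // read; zero updated rows means a conflict to retry or report.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OptimisticUpdate {
        public boolean updateAmount(Connection db, long itemId,
                                    long expectedVersion, long newAmount)
                throws SQLException {
            String sql = "UPDATE bids SET amount = ?, version = version + 1 "
                       + "WHERE item_id = ? AND version = ?";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setLong(1, newAmount);
                ps.setLong(2, itemId);
                ps.setLong(3, expectedVersion);
                return ps.executeUpdate() == 1; // 0 rows: someone else won
            }
        }
    }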

Page 33

Failures

• Exception Handling
– Fail-fast is preferred in general.
– Is the exception retryable? What should the retry strategy be? (A sketch follows below.)
– Poison detection: servers should not crash repeatedly due to the same request (a DoS attack).
– What exceptions should be handled? Can I catch an unknown exception?
– Transactional semantics are very useful.
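
A minimal sketch of a bounded retry policy, treating IOException as the retryable class purely for illustration:

    // Retry only failures known to be transient, and give up after a few
    // attempts so a poison request cannot keep a server crashing (fail fast).
    import java.io.IOException;
    import java.util.concurrent.Callable;

    public class Retry {
        private static final int MAX_ATTEMPTS = 3;

        public static <T> T withRetry(Callable<T> op) throws Exception {
            for (int attempt = 1; ; attempt++) {
                try {
                    return op.call();
                } catch (IOException transientFailure) { // retryable class
                    if (attempt >= MAX_ATTEMPTS) {
                        throw transientFailure; // poison or outage: surface it
                    }
                }
                // any other exception propagates immediately (fail-fast)
            }
        }
    }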

Page 34

Latency

• Latency (what the user perceives) matters
– Google: +500 ms led to -20% traffic
– Yahoo: +400 ms led to -5-9% full-page traffic
– Amazon: +100 ms led to -1% sales

• What can you do to reduce latency?
– Know what things cost (try things out, profile)
– Work from the high level to the low level
– Be lazy: caching, batching, asynchronous patterns
– Know the platform well (e.g., the number of concurrent connections a browser allows)
– Use more than one machine to handle the request
– Reduce I/O: load data into memory
– Large-scale computing makes it possible to handle interactive requests that were not possible before

Page 35

Thank You!

Questions?