Service Fabric Internals, p1

Tokyo Azure Meetup #12: Service Fabric Internals





Service Fabric

Cloud Considerations
- You don't own the hardware
- Failures are part of the game
- Scale is unpredictable
- Managing services is harder than building them
- Advanced telemetry is required for visibility
- No downtime for upgrades
- Do you control your costs? How about density?
- Dedicate attention to security

Why Microservices

Evolve continuously

Faster delivery

Build and operate at scale

Application Design: Traditional
Pros:
- Compile-time contract validation
- Local operations
- Easier to understand

Cons:
- Expensive to scale the application
- Hard to scale data access
- Upgrades are difficult

Application Design: Service Oriented
Pros:
- Cheaper to scale the application
- Easier to scale data access
- Upgrade continuously

Cons:
- Runtime contract validation
- Network operations
- Harder to understand

State

Azure Service Fabric

Service Fabric Cluster

Cloud Service vs Service Fabric

Service Fabric Programming Models

Stateless Service Pattern

Stateful Service Pattern

Types of Microservices
Stateless microservice:
- State is stored externally
- We can have N instances
- Examples: web frontends, protocol gateways, Azure Cloud Services

Stateful microservice:
- Maintains hard, authoritative state
- N consistent copies achieved through replication and local persistence
- Examples: databases, documents, workflows, user profiles, shopping carts

Demo Stateless Service

Dashboard: deactivate a node, show diagnostics, re-enable the node


Demo Stateful Service

Migrating a traditional application
- Decide on the problems you are solving: scale, agility, resilience
- Decide on a well-defined area to re-architect
- You can have a mixture of traditional and microservice designs

Stages of migrating a traditional application:
1) Traditional app
2) Traditional app hosted as a guest executable or container in Service Fabric
3) Traditional app with new microservices
4) Breaking the traditional app into microservices
5) Transformed into microservices

Migrating a traditional application
- Traditional app
- Hosted as a guest executable or container in Service Fabric
- With new microservices added alongside
- Breaking into microservices
- Transformed into microservices

You can stop at any stage.

Common design patterns using gateways:
- Web Gateway (REST/WebSockets)
- API Management
- IoT Hub
- Event Hub
- Load Balancer

Using a gateway to integrate a traditional app with Service Fabric
[Diagram: clients connect through a gateway, which routes traffic to the Service Fabric cluster and the traditional app]
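As a toy illustration of what the gateway does: it maps external route prefixes to internal service endpoints. The route table, paths, and addresses below are hypothetical; a real Service Fabric gateway would resolve endpoints dynamically through the naming service.

```python
# Hypothetical routing table: external path prefix -> internal backend address.
ROUTES = {
    "/api/orders": "fabric:/Shop/OrderService",
    "/api/users": "fabric:/Shop/UserService",
    "/legacy": "http://traditional-app.internal:8080",  # the traditional app
}

def resolve(path: str):
    """Return the backend for the longest matching route prefix, else None."""
    matches = [p for p in ROUTES if path.startswith(p)]
    return ROUTES[max(matches, key=len)] if matches else None

print(resolve("/api/orders/42"))  # fabric:/Shop/OrderService
print(resolve("/unknown"))        # None
```

The longest-prefix match lets a specific route shadow a broader one, which is the usual behavior of HTTP gateways.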

Problem: cluster management nightmares
- I always worry about running out of capacity.
- I am not sure whether all the VM resources are utilized.
- I am worried sick about my cluster being compromised.
- I have no control over when a new Service Fabric version is rolled out to my cluster.
- I am not sure which disasters my cluster can survive.

Best Practice: Service Fabric cluster management nightmare mitigation
Let us divide the problem space into three buckets:
- Plan out your cluster capacity
- Optimize and secure your cluster
- Manage your cluster version

Best Practice: Service Fabric cluster planning
- Capacity planning is not an easy exercise.
- Capacity planning is not a one-time exercise.
- Do not assume that you can add capacity on demand instantly.
- Do not assume that you can take downtime to change capacity later.

Here are some of the questions you will need answers to before you can truly plan out your cluster.

[Diagram: CI/CD pipeline from Source Control through Build and Test, with an inner dev loop and Ops managing the DEV, PPL, and PROD environments]

Service Fabric Cluster planning
What is this cluster to be used for?
- Is this to be used for test?
- Is this part of the CI/CD pipeline?
- Is this for production use?

Where do you want this cluster hosted?
- On Azure?
- On-premise, in your data center?
- On some other cloud provider?

Are there unique compliance and security requirements?
- End-to-end RBAC and auditing? Are certificates OK? Is Active Directory OK?
- Compliance expectations for the infrastructure?
- Compliance goals for the application?


Service Fabric Cluster planning
What kinds of workloads are planned to be deployed to the cluster? For each application:
- Total state
- Number of instances
- Replica set size
- Port requirements per service
- IOPS needed
- External state vs. state in the Service Fabric cluster
- Growth rate
- How many node types (what kinds of apps are to be deployed)?
- Are there non-SF services to be run as well?

Service Fabric Cluster planning
Once you know what each application needs, focus on the characteristics of each node type:
- CPU
- RAM
- Disk (total state of the replicas you want to host)
- State durability (Gold vs. Silver)
- Reliability (applies only to the primary node type)
- Fault tolerance: the number of FDs and the number of UDs
  - Choosing the number of FDs determines the headroom needed in case of unplanned failures.
  - Choosing the number of UDs determines the headroom needed in case of planned failures.

Choosing the number of Fault Domains you need
- The number of Fault Domains determines the headroom needed in case of unplanned failures. Examples: a PDU failing, or ToR switch maintenance, either of which will typically take out all machines in a rack.
- In terms of capacity, you need to leave enough headroom to accommodate the failure of at least one FD.
- This will result in Service Fabric moving/creating new replicas on the available machines in the other FDs.
[Diagram: replicas spread across FD1 to FD5; a PDU burn-out takes out one FD]

Choosing the number of Upgrade Domains you need
- The number of Upgrade Domains determines the headroom needed in case of planned failures/downtime. Example: a Service Fabric upgrade is in progress and a UD is down; you have to have room to place additional replicas if need be.
[Diagram: replicas spread across UD1 to UD10 within FD1 to FD5; one UD is down for an SF upgrade]

Best practice: capacity headroom
You should plan your capacity in such a way that your service can survive at least:
- the loss of one FD
- a UD being down because an upgrade is going on
- a random node/VM failing additionally
[Diagram: a grid of UD1 to UD10 across FD1 to FD5]
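This headroom rule can be turned into back-of-the-envelope arithmetic. A sketch, with hypothetical cluster sizes, that estimates how many nodes remain available in the worst combined planned-plus-unplanned case (it pessimistically ignores overlap between the lost FD and the down UD):

```python
import math

def usable_nodes(total_nodes: int, fault_domains: int, upgrade_domains: int) -> int:
    """Nodes still available if one FD is lost, one UD is down for an upgrade,
    and one extra node fails, assuming nodes are spread evenly across domains."""
    per_fd = math.ceil(total_nodes / fault_domains)    # worst-case nodes in one FD
    per_ud = math.ceil(total_nodes / upgrade_domains)  # worst-case nodes in one UD
    # One FD lost + one UD down + one random node failing; overlap ignored.
    return max(0, total_nodes - per_fd - per_ud - 1)

# Example: a 25-node cluster spread over 5 FDs and 10 UDs.
print(usable_nodes(25, 5, 10))  # 25 - 5 - 3 - 1 = 16
```

Your services' total resource needs should fit on the resulting node count, not on the full cluster.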

Best practices: cluster setup
- Use an ARM template to customize your cluster.
- Spread VMs across multiple storage accounts: this fans out the IO and protects against a widespread outage.
- Use ARM templates to drive changes to your resource group: easy configuration management and auditing.
- Avoid using implicit commands to tweak your resources.
- Be very pedantic about the configurations you deploy to your production environment.

Best practices: cluster setup
- Use a separate node type to host the system services for large clusters.
[Diagram: a grid of UD1 to UD10 across FD1 to FD5. Legend: 10 NT1 nodes, 20 NT2 nodes, 7 SF (system service) nodes]

Best practices: cluster security
- Always use a secure cluster to deploy anything you care about.

Additionally, consider the following:
- Create DMZs using NSGs.
- Use jump boxes to manage your cluster.

© 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Best practices: cluster security
[Diagram: a secured Service Fabric cluster inside a VNET, protected with Key Vault and AAD; three load balancers (LB#1 to LB#3) in front of three VM scale sets, each fronted by NSGs (NSG#1, NSG#2); Azure Storage accounts for diagnostics, SF logs, and VHDs; a jump box for management]


NSG ports that need to be opened:
- ClientConnectionEndpoint (TCP): 19000
- HttpGatewayEndpoint (HTTP/TCP): 19080
- SMB: 445 and 135
- ClusterConnectionEndpointPort (TCP): 9025
- LeaseDriverEndpointPort (TCP): 9026
- Ephemeral port range: at least 256 ports
- Application ports as needed
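A minimal sketch that checks whether a set of allowed NSG port ranges covers the required ports from the list above. The `(start, end)` tuple format for rules is an assumption for illustration, not the ARM NSG schema:

```python
# Required inbound ports for a Service Fabric cluster (from the list above).
REQUIRED_PORTS = {19000, 19080, 445, 135, 9025, 9026}

def missing_ports(allowed_ranges: list) -> set:
    """Return required ports not covered by any allowed (start, end) range."""
    return {
        p for p in REQUIRED_PORTS
        if not any(start <= p <= end for start, end in allowed_ranges)
    }

# Example: rules that open the SF endpoints but forget SMB.
rules = [(19000, 19000), (19080, 19080), (9025, 9026)]
print(sorted(missing_ports(rules)))  # [135, 445]
```

Running such a check against your ARM template output before deployment catches the most common connectivity mistakes.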

Manage your cluster version
- You can select any supported Fabric version.
- Set the upgrade mode to Automatic or Manual, via APIs or the portal.
- You can switch between Automatic and Manual.
- You have 60 days to adopt a new version.
- A warning is generated 14 days prior to your cluster going out of support.
- New versions are announced on the team blog.


Cluster Fabric upgrade
Factors to consider when choosing the upgrade mode:
- Availability of your service
- Need for predictability of performance
- Freedom to choose the velocity
- Support considerations
- Recommended upgrade mode for dev, test, PPL, and prod


Debugging in production
- Don't debug in production.
- It is difficult to catch an issue directly.
- Security and compliance concerns: should debug tools be installed on all production nodes?

Instrument your code
- Instrumenting your code is critical for debugging based on logs.
- You should be able to trace the execution path through your code.
- Always log in UTC.
- Flow an activity id.
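The last two points can be sketched as follows. This is a minimal stdlib illustration; the `log` helper and the `activity_id` context variable are assumptions for the example, not a Service Fabric API:

```python
import contextvars
import datetime
import uuid

# The activity id flows implicitly through the call chain via a context variable.
activity_id = contextvars.ContextVar("activity_id", default="-")

def log(message: str) -> str:
    """Emit a log line with a UTC timestamp and the current activity id."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    line = f"{ts} [{activity_id.get()}] {message}"
    print(line)
    return line

def handle_request() -> None:
    # Assign one activity id per incoming request, then flow it everywhere,
    # so all log lines for that request can be correlated later.
    activity_id.set(str(uuid.uuid4()))
    log("request received")
    log("calling downstream service")
```

When the id is also passed to downstream services (for example in a header), the full execution path of one request can be traced across machines.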

2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.2/26/201741

Data loss
- Recovery Point Objective (RPO): how much data, in minutes, can the business afford to lose?
- The business should set the RPO; a smaller RPO is more expensive.
- Each service must expect and plan for data loss.
- Soft deletes (tombstoning) are best practice: hard-delete later, when you know the data is not needed.

Data corruption
- Frequently caused by a software bug (or a hacker).
- Detecting corruption is a hard, domain-specific problem.
- If needed, deal with corruption using journaling and snapshots/backups.
- Make sure you test restoring from corruption.
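The soft-delete practice can be sketched like this. The in-memory store, its class name, and the retention window are all hypothetical, chosen only to show the tombstone-then-purge flow:

```python
import time

class TombstoneStore:
    """A store where deletes only mark records; a later purge hard-deletes."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.records = {}

    def delete(self, key):
        # Soft delete: keep the record, just mark when it was tombstoned.
        self.records[key]["deleted_at"] = time.time()

    def get(self, key):
        rec = self.records.get(key)
        # Tombstoned records are invisible to readers but still recoverable.
        return None if rec is None or "deleted_at" in rec else rec

    def purge(self, now=None):
        # Hard delete only once the retention window has passed.
        now = time.time() if now is None else now
        self.records = {
            k: r for k, r in self.records.items()
            if "deleted_at" not in r or now - r["deleted_at"] < self.retention
        }
```

Until `purge` runs, a mistaken delete can be undone by removing the tombstone, which is exactly the safety margin the slide recommends.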

Availability and reliability: Active/Passive
[Diagram: Azure Traffic Manager in front of Cluster A (primary) and Cluster B (secondary), five nodes each; replication traffic flows from A to B]
- Two similar clusters.
- Only Cluster A takes traffic.
- The primary must handle spikes.
- Data is replicated to Cluster B in the background.

Availability and reliability: Active/Passive failover flow
- Customers experience an issue; DevOps decides to fail over.
- Data inconsistency/loss: the RPO equals the replication delay.
- Failover takes minutes.
- Development is simple, but the failover path is infrequently tested and capacity is wasted.

Availability and reliability: Active/Active
[Diagram: Azure Traffic Manager in front of Cluster A and Cluster B, five nodes each; replication traffic flows between them]
- Two similar clusters.
- Both clusters take traffic.
- Both clusters handle spikes; less expensive.
- Data is replicated to the other cluster in the background.

2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.2/26/201745

Availability and reliability: Active/Active trade-offs
- Failover is fast and free.
- Development is harder: data inconsistency, or dual reads.
- The failover path is continuously tested.
- Less wasted capacity.

2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.2/26/201746

Availability and reliability: a real example
[Diagram: Azure Traffic Manager in front of Cluster A (West US) and Cluster B (East US); each cluster writes to its local Azure Storage account (RA-GRS), which geo-replicates to a read-only copy in the other region]
- Two regionally separated data centers.
- You can read from or write to either storage account (RA-GRS), but the default is the local DC.
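The read path this enables can be sketched as: try the local (primary) storage endpoint first, and fall back to the geo-replicated read-only secondary on failure. The fetcher callables below are stand-ins, not an Azure SDK API:

```python
def read_with_fallback(key, primary_read, secondary_read):
    """Prefer the local RA-GRS primary; fall back to the read-only secondary.

    `primary_read` / `secondary_read` are caller-supplied callables that raise
    on failure (placeholders for real storage-client calls).
    """
    try:
        return primary_read(key)
    except Exception:
        # The secondary may lag the primary, so callers must tolerate
        # slightly stale data on this path.
        return secondary_read(key)

# Example with stub fetchers: the primary is down, the secondary serves.
def broken_primary(key):
    raise ConnectionError("primary unavailable")

print(read_with_fallback("blob1", broken_primary, lambda k: k + "@secondary"))
# blob1@secondary
```

Writes, by contrast, should only go to the primary; the secondary endpoint of RA-GRS is read-only.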

Cascading failures
- One simple failure can lead to system-wide failure. Plan for failure and understand the impact of failure on the system and its SLA.
- When a service fails, clients retry continuously, causing a traffic storm. This can occur across regions; active-active cross-region deployments are not immune.
- Look at using circuit-breaker patterns.
- Retry using exponential back-off with a maximum interval; once the connection is re-established, reset the back-off interval.
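The retry policy described above can be sketched as follows; the base delay and cap are illustrative values, not prescribed by Service Fabric:

```python
class ExponentialBackoff:
    """Exponential back-off capped at a maximum interval, reset on success."""

    def __init__(self, base=0.5, cap=30.0):
        self.base = base   # first retry delay in seconds (illustrative)
        self.cap = cap     # maximum interval (illustrative)
        self.attempt = 0

    def next_delay(self):
        # Double the delay each attempt, but never exceed the cap,
        # so retry storms cannot grow unbounded.
        delay = min(self.cap, self.base * (2 ** self.attempt))
        self.attempt += 1
        return delay

    def reset(self):
        # Once the connection is re-established, start from the base again.
        self.attempt = 0

b = ExponentialBackoff()
print([b.next_delay() for _ in range(8)])
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

In practice you would also add random jitter to each delay so that many clients do not retry in lockstep, which is the traffic-storm scenario the slide warns about.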


Humans cause most problems
- Human error causes 60% to 80% of service outages.
- Treat operational procedures like code: automate as much as feasible; manual procedures must be one-off processes.
- Humans are slower than automation. If you can document a manual procedure, why can't it be automated?
- Validate and test automation.
- Automate certificate and key rotation: always have two certificates/keys, ensure that one is always valid, and rotate regularly. (An expired certificate caused an Azure outage in 2013.)
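The "always have two certificates, one always valid" rule amounts to checking that the validity windows overlap with no gap. A minimal sketch, where certificates are represented as hypothetical `(not_before, not_after)` date pairs rather than a real certificate API:

```python
import datetime

def always_one_valid(certs, start, end, step=datetime.timedelta(days=1)):
    """Check that at every `step` between start and end, at least one
    certificate's (not_before, not_after) window covers that moment."""
    t = start
    while t <= end:
        if not any(nb <= t <= na for nb, na in certs):
            return False  # a coverage gap: rotation happened too late
        t += step
    return True

d = datetime.date
certs = [(d(2024, 1, 1), d(2024, 7, 1)),    # primary
         (d(2024, 6, 1), d(2024, 12, 31))]  # secondary, rotated in early
print(always_one_valid(certs, d(2024, 1, 1), d(2024, 12, 31)))  # True
```

Running a check like this in automation, well before the earlier expiry date, is exactly the kind of procedure the slide argues should never be manual.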

Thank you!